in this version, literal compression is always disabled for ZSTD_fast strategy.
Performance parity between ZSTD_compress_advanced() and ZSTD_compress_generic()
result of ZSTD_compress_advanced()
is different from ZSTD_compress_generic()
when using negative compression levels
because the disabling of huffman compression is not passed in parameters.
The (pretty old) code inside ZSTD_compress()
was making some pretty bold assumptions
on what's inside a CCtx and how to init it.
This is pretty fragile by design.
CCtx content evolve.
Knowledge of how to handle that should be concentrate in one place.
A side effect of this strategy
is that ZSTD_compress() wouldn't check for BMI2 capability,
and is therefore missing out some potential speed opportunity.
This patch makes ZSTD_compress() use
the same initialization and release functions
as the normal creator / destructor ones.
Measured on my laptop, with a custom version of bench
manually modified to use ZSTD_compress() (instead of the advanced API) :
This patch :
1#silesia.tar : 211984896 -> 73651053 (2.878), 312.2 MB/s , 723.8 MB/s
2#silesia.tar : 211984896 -> 70163650 (3.021), 226.2 MB/s , 649.8 MB/s
3#silesia.tar : 211984896 -> 66996749 (3.164), 169.4 MB/s , 636.7 MB/s
4#silesia.tar : 211984896 -> 65998319 (3.212), 136.7 MB/s , 619.2 MB/s
dev branch :
1#silesia.tar : 211984896 -> 73651053 (2.878), 291.7 MB/s , 727.5 MB/s
2#silesia.tar : 211984896 -> 70163650 (3.021), 216.2 MB/s , 655.7 MB/s
3#silesia.tar : 211984896 -> 66996749 (3.164), 162.2 MB/s , 633.1 MB/s
4#silesia.tar : 211984896 -> 65998319 (3.212), 130.6 MB/s , 618.6 MB/s
when parameters are "equivalent",
the context is re-used in continue mode,
hence needed workspace size is not recalculated.
This incidentally also evades the size-down check and action.
This patch intercepts the "continue mode"
so that the size-down check and action is actually triggered.
recently introduce into the new dictionary mode.
The bug could be reproduced with this command :
./zstreamtest -v --opaqueapi --no-big-tests -s4092 -t639
error was in function ZSTD_count_2segments() :
the beginning of the 2nd segment corresponds to prefixStart
and not the beginning of the current block (istart == src).
This would result in comparing the wrong byte.
removed "cached" structure.
prices are now saved in the optimal table.
Primarily done for simplification.
Might improve speed by a little.
But actually, and surprisingly, also improves ratio in some circumstances.
recent experienced showed that
default distribution table for offset
can get it wrong pretty quickly with the nb of symbols,
while it remains a reasonable choice much longer for lengths symbols.
Changed the formula,
so that dynamic threshold is now 32 symbols for offsets.
It remains at 64 symbols for lengths.
Detection based on defaultNormLog
zstd rejects blocks which do not compress by at least a certain amount.
In which case, such block is simply emitted uncompressed (even if a little bit of compression could be achieved).
This is better for decompression speed, hence for energy.
The logic is controlled by ZSTD_minGain().
The rule is applied uniformly, at all compression levels.
This change makes btultra accepts blocks with poor compression ratios.
We presume that users of btultra mode prefers compression ratio over some decompress speed gains.
The threshold for minimum gain is lowered for btultra
from s>>6 (~1.5% minimum gain)
to s>>7 (~0.8% minimum gain).
This is a prudent change.
Not sure if it's large enough.
ensure that, when frequency[symbol]==0,
result is (tableLog + 1) bits
with both upper-bit and fractional-bit estimates.
Also : enable BIT_DEBUG in /tests
Work around bug in zstd decoder
Pull request #1144 exercised a new path in the zstd decoder that proved to
be buggy. Avoid the extremely rare bug by emitting an uncompressed block.
Estimate the cost for using FSE modes `set_basic`, `set_compressed`, and
`set_repeat`, and select the one with the lowest cost.
* The cost of `set_basic` is computed using the cross-entropy cost
function `ZSTD_crossEntropyCost()`, using the normalized default count
and the count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
`ZSTD_entropyCost()`, together with the cost of writing the normalized
count `ZSTD_NCountCost()`.
this patch makes btultra do 2 passes on the first block,
the first one being dedicated to collecting statistics
so that the 2nd pass is more accurate.
It translates into a very small compression ratio gain :
enwik7, level 20:
blocks 4K : 2.142 -> 2.153
blocks 16K : 2.447 -> 2.457
blocks 64K : 2.716 -> 2.726
On the other hand, the cpu cost is doubled.
The trade off looks bad.
Though, that's ultimately a price to pay to reach better compression ratio.
So it's only enabled when setting btultra.
this improves compression ratio by a *tiny* amount.
It also reduces speed by a small amount.
Consequently, bit-fractional evaluation is only turned on for btultra.
for FSE symbols.
While it seems to work, the gains are negligible compared to rough maxNbBits evaluation.
There are even a few losses sometimes, that still need to be explained.
Furthermode, there are still cases where btlazy2 does a better job than btopt,
which seems rather strange too.
for proper estimation of symbol's weights
when using dictionary compression.
Note : using only huffman costs is not good enough,
presumably because sequence symbol costs are incorrect.
reported by @let-def.
It's actually a bug in ZSTD_compressBegin_usingCDict()
which would pass a wrong pledgedSrcSize value (0 instead of ZSTD_CONTENTSIZE_UNKNOWN)
resulting in wrong window size, resulting in downsized seqStore,
resulting in segfault when writing into the seqStore later in the process.
Added a test in fuzzer to cover this use case (fails before the patch).
The new advanced API basically set `requestedParams = appliedParams` when
using a dictionary. This halted all parameter adjustment, which can hurt
compression ratio if, for example, the window log is small for the first
call, but the rest of the files are large.
This patch fixes the bug, and checks that the `requestedParams` don't change
in the new advanced API when using a dictionary, and generally in the fuzzer.
Zstdmt uses prefixes to load the overlap between segments. Loading extra
positions makes compression non-deterministic, depending on the previous
job the context was used for. Since loading extra position takes extra
time as well, only do it when creating a `ZSTD_CDict`.
Fixes#1077.
The `avgJobSize` must not be lower than 256 KB for single-pass mode.
In `zstd.h` we say the minimum value for `ZSTD_p_jobSize` is 1 MB,
so ensure that we always pick a size >= 1 MB.
Found by libFuzzer fuzzer tests with large input limits.
this makes it possible to specify extremely large negative compression levels,
achieving the side effect as "no compression".
It will also be possible to define larger targetlength for ultra compression mode.
There is no adverse side effect due to removing this limit.
Integrate ldm into zstdmt by running it in serial and in order in the first
step of each job, in the same place as the hash gets updated. The input
buffer is sized to fit the whole LDM window and 2 full buffers of slack.
Input buffers cannot be reused until the LDM step is done with them.
After the LDM step is finished, the jobs don't actually have access to the
full window, only the overlap.
Tested on a few different multi-GB files with and without sanitizers,
and with different numbers of threads.
* Computes the XXH hash in the worker threads.
* Workers get a sequence number and wait until ther number shows up. On
error, ensures that its sequence is finished, so future threads don't
get blocked.
* Sets up for ldm integration, which will go in the same spot.
Setting `loadedDictEnd` was accidently removed from `ZSTD_loadDictionaryContent()`,
which means that dictionary compression will only be able to reference the parts of
the dictionary within the window. The spec allows us to reference the entire
dictionary so long as even one byte is in the window.
`ZSTD_enforceMaxDist()` incorrectly always allowed offsets up to `loadedDictEnd`
beyond the window, even once the dictionary was out of range.
When overflow protection kicked in, the check `current > loadedDictEnd + maxDist`
is incorrect if `loadedDictEnd` isn't reset back to zero. `current` could be reset
below the value, which would incorrectly allow references beyond the window. This
bug is present in `master`, but is very hard to trigger, since it requires both
dictionaries and data which triggers overflow correction.
Summary:
Allocate a single input buffer large enough to house each job, as well as
enough space for the IO thread to write 2 extra buffers. One goes in the
`POOL` queue, and one to fill, and then block on a full `POOL` queue.
Since we can't overlap with the prefix, we allocate space for 3 extra
input buffers.
Test Plan:
* CI
* With and without ASAN/UBSAN run zstdmt with different number of threads
on two large binaries, and verify that their checksums match.
* Test on the tip of the zstdmt ldm integration.
Reviewers: cyan
Differential Revision: https://phabricator.intern.facebook.com/D7284007
Tasks: T25664120
Summary:
* Expose the reference external sequences API for zstdmt.
Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
Expose reference external sequence API
* Expose the reference external sequences API for zstdmt.
* Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
Test Plan:
* CI
* Test the zstdmt ldm integration stacked on top of this diff
Reviewers: cyan
Differential Revision: https://phabricator.intern.facebook.com/D7283968
Tasks: T25664120
* Expose the reference external sequences API for zstdmt.
Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
The overflow protection is broken when the window log is `> (3U << 29)`, so 31.
It doesn't work when `current` isn't around `1U << windowLog` ahead of `lowLimit`,
and the the assertion `current > newCurrent` fails. This happens when the same
context is used many times over, but with a large window log, like in zstdmt.
Fix it by triggering correction based on `nextSrc - base` instead of `lowLimit`.
The added test fails before the patch, and passes after.
access negative compression levels from command line
for both compression and benchmark modes.
also : ensure proper propagation of parameters
through ZSTD_compress_generic() interface.
added relevant cli tests.
negative compression level trade compression ratio for more compression speed.
They turn off huffman compression of literals,
and use row 0 as baseline with a stepSize = -cLevel.
added associated test in fuzzer
also added : new advanced parameter ZSTD_p_literalCompression
Update code documentation, and properly names a few "magic constants".
Also, HUF_compress_internal() gets a cleaner way
to determine size of tables inside workspace.
* `ZSTD_ldm_generateSequences()` generates the LDM sequences and
stores them in a table. It should work with any chunk size, but
is currently only called one block at a time.
* `ZSTD_ldm_blockCompress()` emits the pre-defined sequences, and
instead of encoding the literals directly, it passes them to a
secondary block compressor. The code to handle chunk sizes greater
than the block size is currently commented out, since it is unused.
The next PR will uncomment exercise this code.
* During optimal parsing, ensure LDM `minMatchLength` is at least
`targetLength`. Also don't emit repcode matches in the LDM block
compressor. Enabling the LDM with the optimal parser now actually improves
the compression ratio.
* The compression ratio is very similar to before. It is very slightly
different, because the repcode handling is slightly different. If I remove
immediate repcode checking in both branches the compressed size is exactly
the same.
* The speed looks to be the same or better than before.
Up Next (in a separate PR)
--------------------------
Allow sequence generation to happen prior to compression, and produce more
than a block worth of sequences. Expose some API for zstdmt to consume.
This will test out some currently untested code in
`ZSTD_ldm_blockCompress()`.
as it's faster, due to one memory scan instead of two
(confirmed by microbenchmark).
Note : as ZSTD_reduceIndex() is rarely invoked,
it does not translate into a visible gain.
Consider it an exercise in auto-vectorization and micro-benchmarking.
This makes it easier to explain that nbWorkers=0 --> single-threaded mode,
while nbWorkers=1 --> asynchronous mode (one mode thread on top of the "main" caller thread).
No need for an additional asynchronous mode flag.
nbWorkers>=2 works the same as nbThreads>=2 previously.
to avoid confusion with blocks.
also:
- jobs are cut into chunks of 512KB now, to reduce nb of mutex calls.
- fix function declaration ZSTD_getBlockSizeMax()
- fix outdated comment
Other job members are accessed directly.
This avoids a full job copy, which would access everything,
including a few members that are supposed to be used by worker only,
uselessly requiring additional locks to avoid race conditions.
writeLastEmptyBlock() must release srcBuffer
as mtctx assumes it's done by job worker.
minor : changed 2 job member names (src->srcBuffer, srcStart->prefixStart) for clarity
replaced by equivalent signal job->consumer == job->srcSize.
created additional functions
ZSTD_writeLastEmptyBlock()
and
ZSTDMT_writeLastEmptyBlock()
required when it's necessary to finish a frame with a last empty job, to create an "end of frame" marker.
It avoids creating a job with srcSize==0.
When ZSTD_e_end directive is provided,
the question is not only "are internal buffers completely flushed",
it is also "is current frame completed".
In some rare cases,
it was possible for internal buffers to be completely flushed,
triggering a @return == 0,
but frame was not completed as it needed a last null-size block to mark the end,
resulting in an unfinished frame.
no real consequence, but pollute tsan tests :
job->dstBuff is being modified inside worker,
while main thread might read it accidentally
because it copies whole job.
But since it doesn't used dstBuff, there is no real consequence.
Other potential solution : only copy useful data, instead of whole job
When the dictionary is <= 8 bytes, no data is loaded from the dictionary.
In this case the repcodes weren't set, because they were inserted after the
size check. Fix this problem in general by first setting the cdict state to
a clean state of an empty dictionary, then filling the state from there.
Produces 3 statistics for ongoing frame compression :
- ingested
- consumed (effectively compressed)
- produced
Ingested can be larger than consumed due to buffering effect.
For the time being, this patch mostly fixes the % ratio issue,
since it computes consumed / produced,
instead of ingested / produced.
That being said, update is not "smooth",
because on a slow enough setting,
fileio spends most of its time waiting for a worker to complete its job.
This could be improved thanks to more granular flushing
i.e. start flushing before ongoing job is fully completed.
ZSTD_create?Dict() is required to produce a ?Dict* return type
because `free()` does not accept a `const type*` argument.
If it wasn't for this restriction, I would have preferred to create a `const ?Dict*` object
to emphasize the fact that, once created, a dictionary never changes
(hence can be shared concurrently until the end of its lifetime).
There is no such limitation with initStatic?Dict() :
as stated in the doc, there is no corresponding free() function,
since `workspace` is provided, hence allocated, externally,
it can only be free() externally.
Which means, ZSTD_initStatic?Dict() can return a `const ZSTD_?Dict*` pointer.
Tested with `make all`, to catch initStatic's users,
which, incidentally, also updated zstd.h documentation.
would create too large buffers,
since default job size == window size * 4.
This would crash on 32-bit systems.
Also : jobSize being a 32-bit unsigned, it cannot be >= 4 GB,
so the formula was failing for large window sizes >= 1 GB.
Fixed now : max job Size is 2 GB, whatever the window size.
this happened on 32-bits build when requiring a too large input buffer,
typically on wlog=29, creating jobs of 2 GB size.
also : zstd32 now compiles with multithread support enabled by default
(can be disabled with HAVE_THREAD=0)
Shaves 492,076 B off of the `ZSTD_CDict`.
The size of a `ZSTD_CDict` created from a 112,640 B dictionary is:
| Level | Before (B) | After (B) |
|-------|------------|-----------|
| 1 | 648,448 | 156,412 |
| 3 | 1,140,008 | 647,932 |
This new parameter makes it possible to call
streaming ZSTDMT with a single thread set
which is non blocking.
It makes it possible for the main thread to do other tasks in parallel
while the worker thread does compression.
Typically, for zstd cli, it means it can do I/O stuff.
Applied within fileio.c, this patch provides non-negligible gains during compression.
Tested on my laptop, with enwik9 (1000000000 bytes) : time zstd -f enwik9
With traditional single-thread blocking mode :
real 0m9.557s
user 0m8.861s
sys 0m0.538s
With new single-worker non blocking mode :
real 0m7.938s
user 0m8.049s
sys 0m0.514s
=> 20% faster
it still fallbacks to single-thread blocking invocation
when input is small (<1job)
or when invoking ZSTDMT_compress(), which is blocking.
Also : fixed a bug in new block-granular compression routine.
Pathological samples may result in literal section being incompressible.
This case is now detected,
and literal distribution is replaced by one that can be written into the dictionary.
It used to stop on reaching extDict, for simplification.
As a consequence, there was a small loss of performance each time the round buffer would restart from beginning.
It's not a large difference though, just several hundreds of bytes on silesia.
This patch fixes it.
now selected for levels 13, 14 and 15.
Also : dropped the requirement for monotonic memory budget increase of compression levels,,
which was required for ZSTD_estimateCCtxSize()
in order to ensure that a memory budget for level L is large enough for any level <= L.
This condition is now ensured at run time inside ZSTD_estimateCCtxSize().
we want the dictionary table to be fully sorted,
not just lazily filled.
Dictionary loading is a bit more intensive,
but it saves cpu cycles for match search during compression.
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
params1 was swapped with params2.
This used to be a non-issue when testing for strict equality,
but now that some tests look for "sufficient size" `<=`, order matters.
The deep fuzzer tests caught a subtle bug that was probably there for a long time.
The impact of the bug is not a crash, or any other clear error signal,
rather, it reduces performance, by cutting data into smaller blocks.
Eventually, the following test would fail because it produces too many 1-byte blocks,
requiring more space than buffer can provide :
`./zstreamtest_asan --mt -s3514 -t1678312 -i1678314`
The root scenario is as follows :
- Create context, initialize it using explicit parameters or a `cdict` to pin them down, set `pledgedSrcSize=1`
- The compression parameters will not be adapted, but `windowSize` and `blockSize` will be automatically set to `1`.
`windowSize` and `blockSize` are dynamic values, set within `ZSTD_resetCCtx_internal()`.
The automatic adaptation makes it possible to generate smaller contexts for smaller input sizes.
- Complete compression
- New compression with same context, using same parameters, but `pledgedSrcSize=ZSTD_CONTENTSIZE_UNKNOWN`
trigger "continue mode"
- Continue mode doesn't modify blockSize, because it used to depend on `windowLog` only,
but in fact, it also depends on `pledgedSrcSize`.
- The "old" blocksize (1) is still there,
next compression will use this value to cut input into blocks,
resulting in more blocks and worse performance than necessary performance.
Given the scenario, and its possible variants, I'm surprised it did not show up before.
But I suspect it did show up, it's just that it never triggered an error, because "worse performance" is not a trigger.
The above test is a special corner case, where performance is so impacted that it reaches an error case.
The fix works, but I'm not completely pleased.
I think the current code relies too much on implied relations between variables.
This will likely break again in the future when some related part of the code change.
Unfortunately, no time to make larger changes if we want to keep the release target for zstd v1.3.3.
So a longer term fix will have to be considered after the release.
To do : create a reliable test case which triggers this scenario for CI tests.
`zstreamtest --newapi` (and `--opaqueapi`) create and destroy way too many threads
resulting in failure of tsan tests,
and potentially connected to the qemu flaky tests.
This is because, at each test, the nb of threads can be changed (random).
The `--no-big-tests` directive reduce this choice to 1/2 threads,
in order to limit memory usage, especially for qemu and 32-bits builds.
Unfortunately, swapping between 1 and 2 threads is enough to constantly create/destroy new mtctx.
This patch takes advantage of the following property :
via compress_generic, no internal mtctx is needed for nbThreads < 2.
As a consequence, when nbThreads == 2, the currently active mtctx is necessarily good.
This dramatically reduces the nb of thread creations when invoking `zstreamtest --newapi --no-big-tests`
(only when parent cctx itself is created, which is randomized to 1/256 tests).
Expected outcome :
- at a minimum : tsan tests shall now work continuously without exploding the thread counter
- at best : flaky qemu tests on `zstreamtest --newapi --no-big-tests` may stop being flaky, due to less stress from constant thread creation/destruction
Real world impact :
minimal, I don't expect users to constantly change `nbThreads` between each invocation.
If `nbThreads` remains stable, existing implementation re-uses existing mtctx.
Also : `zstreamtest --newapi` but without `--no-big-tests` doesn't benefit as much,
since this test can select a random `nbThreads` value between 1 and 4.
The current patch only reduces opportunity to free/create mtctx (for example : 2->1->2 doesn't need a new mtctx)
but doesn't completely eliminate it, since `nbThreads` can still change between 2/3/4.
A more complete solution could be to only use 2 out of 4 allocated threads, thus keeping the pool at a constant size.
This would require a larger change to `POOL_*` api though.
taking advantage of `btopt` improved speed to tune parameters.
Levels 16-19 are stronger than previous release, making the graph more favorable.
In theory, I should also update small-size tables,
but I got lazy on that one ...
zstd streaming API was adding a null-block at end of frame for small input.
Reason is : on small input, a single block is enough.
ZSTD_CStream would size its input buffer to expect a single block of this size,
automatically triggering a flush on reaching this size.
Unfortunately, that last byte was generally received before the "end" directive (at least in `fileio`).
The later "end" directive would force the creation of a 3-bytes last block to indicate end of frame.
The solution is to not flush automatically, which is btw the expected behavior.
It happens in this case because blocksize is defined with exactly the same size as input.
Just adding one-byte is enough to stop triggering the automatic flush.
I initially looked at another solution, solving the problem directly in the compression context.
But it felt awkward.
Now, the underlying compression API `ZSTD_compressContinue()` would take the decision the close a frame
on reaching its expected end (`pledgedSrcSize`).
This feels awkward, a responsability over-reach, beyond the definition of this API.
ZSTD_compressContinue() is clearly documented as a guaranteed flush,
with ZSTD_compressEnd() generating a guaranteed end.
I faced similar issue when trying to port a similar mechanism at the higher streaming layer.
Having ZSTD_CStream end a frame automatically on reaching `pledgedSrcSize` can surprise the caller,
since it did not explicitly requested an end of frame.
The only sensible action remaining after that is to end the frame with no additional input.
This adds additional logic in the ZSTD_CStream state to check this condition.
Plus some potential confusion on the meaning of ZSTD_endStream() with no additional input (ending confirmation ? new 0-size frame ?)
In the end, just enlarging input buffer by 1 byte feels the least intrusive change.
It's also a contract remaining inside the streaming layer, so the logic is contained in this part of the code.
The patch also introduces a new test checking that size of small frame is as expected, without additional 3-bytes null block.
This patch restores capability for each file to receive adapted compression parameters depending on its size.
The bug breaking this feature was relatively silly :
setting a parameter with a value "0" is supposed to be a no-op.
Unfortunately, it would pin down compression parameters as if they were manually set,
preventing later automatic adaptation.
Unfortunately, I'm currently short of a test case that could check this situation and trigger an error.
Compression parameters selection between tableID 0,1,2,3 is largely internal,
leaving no trace to outside world, not even in frame header.
windowLog is now enforced from provided compression parameters,
instead of being copied blindly from `cdict`
where it could be smaller.
also :
- fix a minor bug in zstreamtest --mt : advanced parameters must be set before init
- changed advanced parameter name to ZSTDMT_jobSize
While the final result is still, technically, a frame,
the resulting frame expands initial data instead of compressing it.
This is because the streaming API creates a tiny 1-byte buffer for input,
because it believes input is empty (0-bytes),
because in the past, 0 used to mean "unknown" instead.
This patch fixes the issue.
Todo : add a test which traps the issue.
last such side-effect was modifying cctx->loadedDictEnd on setting forceWindow.
It is no a useless operation, so it's removed.
No side-effect left when setting a compression parameter.
Any ZSTD_CCtx_setParameter() shall just write the requested parameter, without further action.
Any action shall be taken at parameter application only (during init).
It makes it possible to just copy CCtxParams from external container to internal state,
and get rid of the more complex code which was trying to compensate for missing actions.
There was a flaw in the formula
which compared literal cost with match cost :
at a given position,
a non-null literal suite is going to be part of next sequence,
while if position ends a previous match, to immediately start another match,
next sequence will have a litlength of zero.
A litlength of zero has a non-null cost.
It follows that literals cost should be compared to match cost + litlength==0.
Not doing so gave a structural advantage to matches, which would be selected more often.
I believe that's what led to the creation of the strange heuristic which added a complex cost to matches.
The heuristic was actually compensating.
It was probably created through multiple trials, settling for best outcome on a given scenario (I suspect silesia.tar).
The problem with this heuristic is that it's hard to understand,
and unfortunately, any future change in the parser would impact the way it should be calculated and its effects.
The "proper" formula makes it possible to remove this heuristic.
Now, the problem is : in a head to head comparison, it's sometimes better, sometimes worse.
Note that all differences are small (< 0.01 ratio).
In general, the newer formula is better for smaller files (for example, calgary.tar and enwik7).
I suspect that's because starting statistics are pretty poor (another area of improvement).
However, for silesia.tar specifically, it's worse at level 22 (while being better at level 17, so even compression level has an impact ...).
It's a pity that zstd -22 gets worse on silesia.tar.
That being said, I like that the new code gets rid of strange variables,
which were introducing complexity for any future evolution (faster variants being in mind).
Therefore, in spite of this detrimental side effect, I tend to be in favor of it.
optState was used both to evaluate price
and to cache cost of previously calculated literals.
This created a strong dependency, forcing parser to request cost in a strict order.
This limitation is forbids future parser with skipping capabilities.
After this patch, caching literals price still exists,
but is now explicit, in a stack structure.
merging of repcode search into btsearch introduced a small compression ratio regressio at max level :
1.3.2 : 52728769
after repMerge patch : 52760789 (+32020)
A few minor changes have produced this difference.
They can be hard to spot.
This patch buys back about half of the difference,
by no longer inserting position at hc3 when a long match is found there.
It feels strangely counter-intuitive, but works :
after this patch : 52742555 (-18234)
Fixed : multithreading to compress some small data with dictionary
Fixed : ZSTD_initCStream_usingCDict()
Improved streaming memory usage when pledgedSrcSize is known.
ZSTD_updateTree() expected to be followed by a Bt match finder, which would update zc->nextToUpdate.
With the new optimal match finder, it's not necessarily the case : a match might be found during repcode or hash3, and stops there because it reaches sufficient_len, without even entering the binary tree.
Previous policy was to nonetheless update zc->nextToUpdate, but the current position would not be inserted, creating "holes" in the btree, aka positions that will no longer be searched.
Now, when current position is not inserted, zc->nextToUpdate is not update, expecting ZSTD_updateTree() to fill the tree later on.
Solution selected is that ZSTD_updateTree() takes care of properly setting zc->nextToUpdate,
so that it no longer depends on a future function to do this job.
It took time to get there, as the issue started with a memory sanitizer error.
The pb would have been easier to spot with a proper `assert()`.
So this patch add a few of them.
Additionnally, I discovered that `make test` does not enable `assert()` during CLI tests.
This patch enables them.
Unfortunately, these `assert()` triggered other (unrelated) bugs during CLI tests, mostly within zstdmt.
So this patch also fixes them.
- Changed packed structure for gcc memory access : memory sanitizer would complain that a read "might" reach out-of-bound position on the ground that the `union` is larger than the type accessed.
Now, to avoid this issue, each type is independent.
- ZSTD_CCtxParams_setParameter() : @return provides the value of parameter, clamped/fixed appropriately.
- ZSTDMT : changed constant name to ZSTDMT_JOBSIZE_MIN
- ZSTDMT : multithreading is automatically disabled when srcSize <= ZSTDMT_JOBSIZE_MIN, since only one thread will be used in this case (saves memory and runtime).
- ZSTDMT : nbThreads is automatically clamped on setting the value.
this version has same speed as branch `opt`
which is itself 5-10% slower than branch `dev`
(no identified reason)
It does not compress exactly the same as `opt` or `dev`,
maybe because it doesn't stop search after repcodes,
leading to sometimes better compression, sometimes worse
(by a small margin).
warning : _extDict path does not work for the time being
This means that benchmark module works,
but file module will fail with large files (and high compression level).
Objective is to fuse _extDict path into current one,
in order to have a single parser to maintain.
ZSTD_getPrice() and ZSTD_updatePrice() accept normal matchlength as argument
instead of matchlength-MINMATCH,
which makes them easier / more logical to use and read.
Conversion is simply done internally.
added some traces and assert
related to hunting a potential ubsan error in 32-bits more
(it ends up being a compiler-side issue : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82802).
Modified one pointer arithmetic expression for a more conformant way.
as per documentation, on ZSTD_setPledgedSrcSize() :
> If all data is provided and consumed in a single round,
> this value (pledgedSrcSize) is overriden by srcSize instead.
This wasn't applied before compression level is transformed into compression parameters.
As a consequence, small input missed compression parameters adaptation.
It seems to work fine now : compression was compared with ZSTD_compress_advanced(),
results were the same.
ZSTD_compress() and friends would treat an empty input as an unknown size
when selecting parameters. Thus, they would drastically overallocate the
context. Tell ZSTD_getParams() that the source size is 1 when it is empty.
It was multiple reasons stacked :
- Visual use a different code path, because ZSTD_NEWAPI is not defined
- fileio.c sends `0` as `pledgedSrcSize` to mean `ZSTD_CONTENTSIZE_UNKNOWN` (fixed)
- ZSTDMT_resetCCtx() interpreted `0` as "empty" instead of "unknown" (fixed)
It isn't useful in any case to repeat default tables.
Saves a few bytes on Silesia, since we don't trigger the dictionary
heuristic.
Before: 211988480 => 73651998 bytes
After: 211988480 => 73651721 bytes
when determining compression parameters
to compress one file only.
For multiple files, it still "bets" that files are going to be small.
There was also a bug recently added in ZSTD_CCtx_loadDictionary_advanced()
making it incapable to use pledgedSrcSize to determine compression parameters.
In `ZSTD_compressBegin_advanced()`, `ZSTD_parameters` are used to set the
compression parameters, but the level didn't get set to `CLEVEL_CUSTOM`, so
`ZSTD_compressBlock()` used the wrong parameters when checking the source
size.
ZSTD_compressBound() works fine, but is only useful for dynamic allocation.
For static allocation, only a macro can provide the amount during compilation time.
It's not good to mix old and new API
ZSTD_resetCStream() doesn't just set pledgedSrcSize :
it also sets the CCtx for a single thread compression.
Problem is, when 2+ threads are defined in cctx->requestedParams,
ZSTD_compress_generic() will want to start MT compression,
since initialization is supposed to have already happened (thanks to ZSTD_resetCStream())
except that the underlying ZSTDMT_CCtx* object is not created,
resulting in a segfault.
This is an invalid construction
(correct one is to use ZSTD_CCtx_setPledgedSrcSize()).
I haven't found a nice way to mitigate this impact if someone makes the same mistake.
At some point, removing the old API to keep only the new API within fileio.c will limit these risks.
srcSize is read and provided at each file, not at resource creation.
This used to be useful with older API, because it could not re-adapt parameters between sessions.
At some point, it will be better to remove the old code, and only keep the new_api.
It works fine by now.
In some complex scenario,
the buffer would be freed because it's too large,
another buffer would be allocated, but fail,
trigger an error,
and the general buffer pool would then be freed,
where the definition of the already freed buffer would be found
(beyond total index, but still), and freed again, resulting in double-free error.
It would previously exit when srcSize is unknown.
But in the case of custom parameters,
hLog and cLog can still be too large in comparison with windowLog.
Reduces maximum memory allocated during zstreamtest --newapi
* Maximum window size in 32-bit mode is 1GB, since allocations for 2GB fail
on my Mac.
* Maximum window size in 64-bit mode is 2GB, since that is the largest
power of 2 that works with the overflow prevention.
* Allow `--long=windowLog` to set the window log, along with
`--zstd=wlog=#`. These options also set the window size during
decompression, but don't override `--memory=#` if it is set.
* Present a helpful error message when the window size is too large during
decompression.
* The long range matcher defaults to a hash log 7 less than the window log,
which keeps it at 20 for window log 27.
* Keep the default long range matcher window size and the default maximum
window size at 27 for the API and CLI.
* Add tests that use the maximum window size and hash size for compression
and decompression.
resulting in undefined symbol error.
Push the requirement to GCC 4 for now.
Another solution, proposed by @NWilson, is to use __LONG_MAX__ instead.
__LONG_MAX__ is a GCC-specific constant, which value is supposed to depend on underlying target hardware (32/64 bits)
Might be better, but seems also more complex, hence more prone to side effects.
Keeping the simple solution for now (just rely on __GNUC__)