Previously, each job would reserve a CCtx right before being posted.
The CCtx would be "part of the job description",
and only released when the job is completed (aka flushed).
For ZSTDMT_compress(), which creates all jobs first and only join at the end,
that meant one CCtx per job.
The nb of jobs used to be == nb of threads,
but since latest modification,
which reduces the size of jobs in order to spread the load of difficult areas,
it also increases the nb of jobs for large sources / small compression level.
This resulted in many more CCtx being created.
In this new version, CCtx are reserved within the worker thread.
It guaranteea there cannot be more CCtx reserved than workers (<= nb threads).
To do that, it required to make the CCtx Pool multi-threading-safe :
it can now be called from multiple threads in parallel.
Doesn't speed optimize this buffer-to-buffer scenario yet.
Still internally defers to streaming implementation.
Also : fixed a long standing bug in ZSTDMT streaming API.
It now only uses compressionParameters as argument.
It produces many changes throughout user code,
though hopefully they tend to be simple :
just provide the cParams part from existing ZSTD_parameters.
Some programs might depend on ZSTD_createCDict_advanced() to pass frame parameters.
This change will force them to revisit this strategy and fix it,
since frame parameters are effectively silently ignored in current version.
does no longer allocate temporary buffers
when there is enough room in dstBuffer to decompress directly there.
(previous method would skip that for 1st chunk only).
Also : fix ZSTD_compressBound() for small srcSize
XXH_STATIC_LINKING_ONLY protection macro is intended to be triggered just before the include.
The main idea is to keep this setting local :
user module shall explicitly understand and accept the static linking restriction
which becomes transparent when triggering the macro at project level.
Global definition also triggers redefinition warnings for user modules which do locally define the macro.
This new version compiles lib and cli without warning when the macro is set globally.
That's not a scenario to be recommended, since it trades a local effect for a global one,
but it was easy enough to provide from zstd side.
There used to be a (very small) chance that
loading prefix from previous segment
would be confused with a real zstd dictionary.
For that to happen, the prefix needs to start
with the same value as dictionary magic.
That's 1 chance in 4 billions if all values have equal probability.
But in fact, since some values are more common (0x00000000 for example)
others are less common, and dictionary magic was selected to be one of them,
so probabilities are likely even lower.
Anyway, this risk is no down to zero
by adding a new CCtx parameter : ZSTD_p_forceRawDict
Current parameter policy : the parameter "stick" to its CCtx,
so any dictionary loading after ZSTD_p_forceRawDict is set
will be loaded in "raw" ("content only") mode,
even if CCtx is re-used multiple times with multiple different dictionary.
It's up to the user to reset this value differently if it needs so.
the minimum size condition size is applied transparently (no warning, no error)
like previous minimum section size condition (1 KB) which still applies.
Previous version was requiring a fairly large initial amount of input data
before starting to create compression jobs.
This new version starts the process much sooner.
fileio.c was continually pushing more content without giving a chance to flush compressed one.
It would block the job queue when input data was accumulated too fast (requiring to define many threads).
Fixed : fileio flushes whatever it can after each input attempt.
Sections 2+ read a bit of data from previous section
in order to improve compression ratio.
This also costs some CPU, to reference read data.
Read data is currently fixed to window>>3 size
By default, section sizes are 4x window size.
This new setting allow manual selection of section sizes.
The larger they are, the (slightly) better the compression ratio,
but also the higher the memory allocation cost,
and eventually the lesser the nb of possible threads,
since each section is compressed by a single thread.
It also introduces a prototype to set generic parameters,
ZSTDMT_setMTCtxParameter()
The idea is that it's possible to add enums
to extend the list of parameters that can be set this way.
This is more long-term oriented than a fixed-size struct.
Consider it as a test.
In some (rare) cases, job list could be blocked by a first job still being processed,
while all following ones are completed, waiting to be flushed.
In such case, the current job-table implementation is unable to accept new job.
As a consequence, a call to ZSTDMT_compressStream() can be useless (nothing read, nothing flushed),
with the risk to trigger a busy-wait on the caller side
(needlessly loop over ZSTDMT_compressStream() ).
In such a case, ZSTDMT_compressStream() will block until the first job is completed and ready to flush.
It ensures some forward progress by guaranteeing it will flush at least a part of the completed job.
Energy-wasting busy-wait is avoided.
Like ZSTD_initCStream_usingDict(),
ZSTDMT_initCStream_usingDict() now keep a copy of dict internally.
This way, dict can be released :
it does not longer have to outlive all future compression sessions.
Correctly compress with custom params and dictionary
Added relevant fuzzer test in zstreamtest
Also :
new macro ZSTDMT_SECTION_LOGSIZE_MIN, which sets a minimum size for a full job
(note : a flush() command can still generate a partial job anytime)
Also : fixed corner case, where nb of jobs completed becomes > jobQueueSize
which is possible when many flushes are issued
while there is not enough dst buffer to flush completed ones.
MT compression generates a single frame.
Multi-threading operates by breaking the frames into independent sections.
But from a decoder perspective, there is no difference :
it's just a suite of blocks.
Problem is, decoder preserves repCodes from previous block to start decoding next block.
This is also valid between sections, since they are no different than changing block.
Previous version would incorrectly initialize repcodes to their default value at the beginning of each section.
When using them, there was a mismatch between encoder (default values) and decoder (values from previous block).
This change ensures that repcodes won't be used at the beginning of a new section.
It works by setting them to 0.
This only works with regular (single segment) variants : extDict variants will fail !
Fortunately, sections beyond the 1st one belong to this category.
To be checked : btopt strategy.
This change was only validated from fast to btlazy2 strategies.
In some complex scenarios (free() without finishing compression),
it is possible that some resources are still into jobs
and not collected back into pools.
In which case, previous version of free() would miss them.
This would be equivalent to a leak.
New version ensures that it even foes after such resource.
It requires job consumers to properly mark resources as released,
by replacing entries by NULL after releasing back to the pool.
Obviously, it's not recommended to free() zstdmt context mid-term,
still that's now a supported scenario.
The same methodology is also used to ensure proper resource collection
after an error is detected.
Still to do :
- detect compression errors (not just allocation ones)
- properly manage resource when init() is called without finishing previous compression.
The main issue was to avoid a caller to continually loop on {flush,end}Stream()
when there was nothing ready to be flushed but still some compression work ongoing in a worker thread.
The continuous loop would have resulted in wasted energy.
The new version makes call to {flush,end}Stream blocking when there is nothing ready to be flushed.
Of course, if all worker threads have exhausted job, it will return zero (all flush completed).
Note : There are still some remaining issues to report error codes
and properly collect back resources into pools when an error is triggered.
In previous version, main function would return early when detecting a job error.
Late threads resources were therefore not collected back into pools.
New version just register the error, but continue the collecting process.
All buffers and context should be released back to pool before leaving main function.
Result from getBuffer and getCCtx could be NULL when allocation fails.
Now correctly checks : job creation stop and last job reports an allocation error.
releaseBuffer and releaseCCtx are now also compatible with NULL input.
Identified a new potential issue :
when early job fails, later jobs are not collected for resource retrieval.
Since the result of mt compression is a single frame,
changed naming, which implied the concatenation of multiple frames.
minor : ensures that content size is written in header
The new strategy involves cutting frame at block level.
The result is a single frame, preserving ZSTD_getDecompressedSize()
As a consequence, bench can now make a full round-trip,
since the result is compatible with ZSTD_decompress().
This strategy will not make it possible to decode the frame with multiple threads
since the exact cut between independent blocks is not known.
MT decoding needs further discussions.
use ZSTD_freeCCtxPool() to release the partially created pool.
avoids to duplicate logic.
Also : identified a new difficult corner case :
when freeing the Pool, all CCtx should be previously released back to the pool.
Otherwise, it means some CCtx are still in use.
There is currently no clear policy on what to do in such a case.
Note : it's supposed to never happen.
Since pool creation/usage is static, it has no external user,
which limits risks.