On my laptop:
Before:
./zstd32 -b --zstd=wlog=27 silesia.tar enwik8 -S
3#silesia.tar : 211984896 -> 66683478 (3.179), 97.6 MB/s , 400.7 MB/s
3#enwik8 : 100000000 -> 35643153 (2.806), 76.5 MB/s , 303.2 MB/s
After:
./zstd32 -b --zstd=wlog=27 silesia.tar enwik8 -S
3#silesia.tar : 211984896 -> 66683478 (3.179), 97.4 MB/s , 435.0 MB/s
3#enwik8 : 100000000 -> 35643153 (2.806), 76.2 MB/s , 338.1 MB/s
Mileage vary, depending on file, and cpu type.
But a generic rule is : x86 benefits less from "long-offset mode" than x64,
maybe due to register pressure.
On "entropy", long-mode is _never_ a win for x86.
On my laptop though, it may, depending on file and compression level
(enwik8 benefits more from "long-mode" than silesia).
ZSTD_create?Dict() is required to produce a ?Dict* return type
because `free()` does not accept a `const type*` argument.
If it wasn't for this restriction, I would have preferred to create a `const ?Dict*` object
to emphasize the fact that, once created, a dictionary never changes
(hence can be shared concurrently until the end of its lifetime).
There is no such limitation with initStatic?Dict() :
as stated in the doc, there is no corresponding free() function,
since `workspace` is provided, hence allocated, externally,
it can only be free() externally.
Which means, ZSTD_initStatic?Dict() can return a `const ZSTD_?Dict*` pointer.
Tested with `make all`, to catch initStatic's users,
which, incidentally, also updated zstd.h documentation.
it still fallbacks to single-thread blocking invocation
when input is small (<1job)
or when invoking ZSTDMT_compress(), which is blocking.
Also : fixed a bug in new block-granular compression routine.
It used to stop on reaching extDict, for simplification.
As a consequence, there was a small loss of performance each time the round buffer would restart from beginning.
It's not a large difference though, just several hundreds of bytes on silesia.
This patch fixes it.
It does not feel "right" from a dependency perspective.
ZSTD_initDCtx_internal() is triggered once, on DCtx creation,
while ZSTD_decompressBegin() is invoked at the beginning of each new frame,
and is also a user-facing prototype.
Downside : a DCtx must be init before first usage !
This was always the intention by the way, and is documented as such.
This stage is automatically done within ZSTD_decompress() and variants,
and also within ZSTD_decompressStream().
Only ZSTD_decompressContinue() is impacted,
it must be preceded by a ZSTD_decompressBegin(), as detailed in doc.
A test has been fixed, to no longer rely on undocumented assumption that ZSTD_decompressBegin() is invoked during init.