338 Commits

Author SHA1 Message Date
Yann Collet
14312d833e zstdmt : fix : loading prefix from previous segments
There used to be a (very small) chance that
loading a prefix from a previous segment
would be confused with a real zstd dictionary.
For that to happen, the prefix needs to start
with the same value as the dictionary magic number.
That's 1 chance in 4 billion if all values have equal probability.
In fact, some values are more common (0x00000000 for example)
and others less common, and the dictionary magic was selected to be one of the less common ones,
so the probability is likely even lower.

Anyway, this risk is now down to zero
by adding a new CCtx parameter : ZSTD_p_forceRawDict

Current parameter policy : the parameter "sticks" to its CCtx,
so any dictionary loaded after ZSTD_p_forceRawDict is set
will be loaded in "raw" ("content only") mode,
even if the CCtx is re-used multiple times with multiple different dictionaries.
It's up to the user to reset this value when needed.
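
A minimal usage sketch, assuming the new parameter is exposed through the experimental ZSTD_setCCtxParameter() entry point (as ZSTD_p_forceWindow already is); the wrapper function and compression level below are illustrative:

```
#define ZSTD_STATIC_LINKING_ONLY   /* expose the experimental ZSTD_setCCtxParameter() */
#include <zstd.h>

/* Force every subsequent dictionary/prefix load on this CCtx to be treated
 * as raw content, even if it happens to start with the dictionary magic. */
static size_t compressWithRawPrefix(ZSTD_CCtx* cctx,
                                    void* dst, size_t dstCapacity,
                                    const void* src, size_t srcSize,
                                    const void* prefix, size_t prefixSize)
{
    /* The setting sticks to the CCtx : it stays in effect for all future
     * dictionary loads until the user resets it. */
    ZSTD_setCCtxParameter(cctx, ZSTD_p_forceRawDict, 1);
    return ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                   src, srcSize,
                                   prefix, prefixSize, 3 /* level */);
}
```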
2017-02-23 23:42:12 -08:00
Przemyslaw Skibinski
d8114e5802 zstd_compress.c: fix memory leaks 2017-02-21 18:59:56 +01:00
Anders Oleson
517577bf53 spelling fixes in comments
e.g. occurred, labeled, Huffman
2017-02-20 12:08:59 -08:00
Nick Terrell
ecf90ca24b [zstdmt] Fix MSAN failure with ZSTD_p_forceWindow
Reproduction steps:

```
make zstreamtest CC=clang CFLAGS="-O3 -g -fsanitize=memory -fsanitize-memory-track-origins"
./zstreamtest -vv -t4178 -i4178 -s4531
```

How to get to the error in gdb (there may be a more efficient way):

* 2 breaks at zstd_compress.c:2418  -- in ZSTD_compressContinue_internal()
* 2 breaks at zstd_compress.c:2276  -- in ZSTD_compressBlock_internal()
* 1 break at zstd_compress.c:1547

Why the error occurred:

When `zc->forceWindow == 1`, after calling `ZSTD_loadDictionaryContent()` we
have `zc->loadedDictEnd == zc->nextToUpdate == 0`. But, we've really loaded up
to `iend` into the dictionary. Then in `ZSTD_compressBlock_internal()` we see
that `current > zc->nextToUpdate + 384`, so we load the last 192 bytes a second
time. In this case the bytes we are loading are a block of all 0s, starting in
the previous block. So when we are loading the last 192 bytes, we find a `match`
in the future, 183 bytes beyond `ip`. Since the block is all 0s, the match
extends to the end of the block. But in `ZSTD_count()` we only check that
`pIn < pInLoopLimit`, but since `pMatch > pIn`, `pMatch` eventually points past
the end of the buffer, causing the MSAN failure.
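
A reduced sketch of the failure mode (illustrative, not the actual ZSTD_count() code): when only the input pointer is bounded and the candidate match lies ahead of it, the match pointer can walk past the end of the buffer:

```
#include <stddef.h>

/* Counts matching bytes but bounds only pIn. With a "match in the future"
 * (pMatch > pIn) inside a run of identical bytes, pMatch reaches the end of
 * the buffer before pIn does, and the reads past it are what MSAN reports. */
static size_t count_forward(const unsigned char* pIn,
                            const unsigned char* pMatch,
                            const unsigned char* pInLoopLimit)
{
    const unsigned char* const pStart = pIn;
    while ((pIn < pInLoopLimit) && (*pIn == *pMatch)) { pIn++; pMatch++; }
    return (size_t)(pIn - pStart);
}
```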

The fix:

The changed line sets `zc->nextToUpdate` to the end of the dictionary.
This is the behavior that existed before `ZSTD_p_forceWindow` was introduced.
This fixes the test case that exposed the bug. Since the code doesn't fail without
`zc->forceWindow`, it makes sense that this works. I've run the command
`./zstreamtest -T2mn` 64 times without failures. CI should also verify nothing
obvious broke.
2017-02-13 19:11:22 -08:00
Sean Purcell
2db7249265 Make pledgedSrcSize meaning clear for other functions
- Added tests
- Moved new size functions to static link only
2017-02-09 11:49:58 -08:00
Sean Purcell
0f5c95af44 Disambiguate pledgedSrcSize == 0
- Modify ZSTD CLI to only set contentSizeFlag if it _knows_ the size
- Change pzstd to stop setting contentSizeFlag without accurate pledgedSrcSize
2017-02-08 15:12:46 -08:00
Yann Collet
06e7697f96 added test of new parameter ZSTD_p_forceWindow 2017-01-25 16:39:03 -08:00
Yann Collet
bb0027405a fixed zstdmt corruption issue when enabling overlapped sections
see Asana board for detailed explanation on why and how to fix it
2017-01-25 16:25:38 -08:00
Yann Collet
c593348722 ZSTDMT_initCStream_usingDict() can outlive dict
Like ZSTD_initCStream_usingDict(),
ZSTDMT_initCStream_usingDict() now keeps a copy of dict internally.
This way, dict can be released :
it no longer has to outlive all future compression sessions.
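
A lifetime sketch, assuming ZSTDMT_initCStream_usingDict() takes the same arguments as its single-threaded counterpart (the buffer handling below is illustrative):

```
#include <stdlib.h>
#include <string.h>
#include <zstd.h>              /* ZSTD_isError */
#include "zstdmt_compress.h"   /* ZSTDMT_CCtx, ZSTDMT_initCStream_usingDict */

/* The dictionary is copied at init time, so the buffer can be freed
 * immediately : it no longer has to outlive the compression sessions. */
static int initThenReleaseDict(ZSTDMT_CCtx* mtctx,
                               const void* dictSrc, size_t dictSize)
{
    void* const dict = malloc(dictSize);
    if (dict == NULL) return -1;
    memcpy(dict, dictSrc, dictSize);
    {   size_t const initError = ZSTDMT_initCStream_usingDict(mtctx, dict, dictSize, 3);
        free(dict);   /* safe : mtctx keeps its own internal copy */
        return ZSTD_isError(initError) ? -1 : 0;
    }
}
```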
2017-01-22 16:44:15 -08:00
Yann Collet
d7e3cb58c5 Resolved merge conflict dev+zstdmt 2017-01-20 16:44:50 -08:00
Yann Collet
b459aad5b4 renamed savedRep into repToConfirm 2017-01-19 17:33:37 -08:00
Yann Collet
32dfae6f98 fixed Multi-threaded compression
MT compression generates a single frame.
Multi-threading operates by breaking the frame into independent sections.
But from a decoder perspective, there is no difference :
it's just a suite of blocks.

Problem is, the decoder preserves repCodes from the previous block to start decoding the next block.
This also applies between sections, since a section boundary is no different from a block boundary.

The previous version incorrectly initialized repcodes to their default values at the beginning of each section.
When they were used, there was a mismatch between encoder (default values) and decoder (values from the previous block).

This change ensures that repcodes won't be used at the beginning of a new section.
It works by setting them to 0.
This only works with regular (single segment) variants : extDict variants will fail !
Fortunately, sections beyond the 1st one belong to this category.

To be checked : btopt strategy.
This change was only validated from fast to btlazy2 strategies.
2017-01-19 10:32:55 -08:00
Sean Purcell
57d423c5df Don't create dict in streaming apis if dictSize == 0 2017-01-17 14:31:35 -08:00
Gregory Szorc
7d6f478d15 Set dictionary ID in ZSTD_initCStream_usingCDict()
When porting python-zstandard to use ZSTD_initCStream_usingCDict()
so compression dictionaries could be reused, an automated test
failed due to compressed content changing.

I tracked this down to ZSTD_initCStream_usingCDict() not
setting the dictID field of the ZSTD_CCtx attached to the
ZSTD_CStream instance.

I'm not 100% convinced this is the correct or full solution,
as I'm still seeing one automated test failing with this change.
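
For reference, the reuse pattern that exposed the issue looks roughly like this (a sketch assuming the experimental ZSTD_initCStream_usingCDict() prototype; the level and helper names are illustrative):

```
#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_initCStream_usingCDict() is experimental here */
#include <zstd.h>

/* Digest the dictionary once, then reuse it across many streams.
 * With the fix, each stream's frame header carries the dictionary ID again. */
static ZSTD_CDict* makeSharedCDict(const void* dictBuffer, size_t dictSize, int level)
{
    return ZSTD_createCDict(dictBuffer, dictSize, level);
}

static size_t startStreamWithSharedDict(ZSTD_CStream* zcs, const ZSTD_CDict* cdict)
{
    return ZSTD_initCStream_usingCDict(zcs, cdict);   /* no re-digestion of the dictionary */
}
```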
2017-01-14 17:44:54 -08:00
Yann Collet
b05c4828ea zstdmt : correctly check for cctx and buffer allocation
Results from getBuffer and getCCtx could be NULL when allocation fails.
These are now correctly checked : job creation stops, and the last job reports an allocation error.
releaseBuffer and releaseCCtx are now also compatible with NULL input.

Identified a new potential issue :
when an early job fails, later jobs are not collected for resource retrieval.
2017-01-12 02:01:28 +01:00
Yann Collet
5eb749e734 ZSTDMT_compress() creates a single frame
The new strategy involves cutting the frame at block level.
The result is a single frame, preserving ZSTD_getDecompressedSize()

As a consequence, bench can now make a full round-trip,
since the result is compatible with ZSTD_decompress().

This strategy will not make it possible to decode the frame with multiple threads
since the exact cut between independent blocks is not known.
MT decoding needs further discussions.
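
A round-trip sketch of what the single-frame output enables, assuming the ZSTDMT_compress() / ZSTDMT_createCCtx() prototypes from zstdmt_compress.h (thread count and level are illustrative):

```
#include <zstd.h>
#include "zstdmt_compress.h"   /* ZSTDMT_CCtx, ZSTDMT_compress (experimental) */

/* Compress with several worker threads, then decompress with the regular
 * single-threaded decoder : the output is one ordinary zstd frame, so
 * ZSTD_getDecompressedSize() and ZSTD_decompress() both work on it. */
static size_t mtRoundTrip(void* dst, size_t dstCapacity,
                          void* rt, size_t rtCapacity,
                          const void* src, size_t srcSize)
{
    ZSTDMT_CCtx* const mtctx = ZSTDMT_createCCtx(4 /* nbThreads */);
    if (mtctx == NULL) return (size_t)-1;
    {   size_t const cSize = ZSTDMT_compress(mtctx, dst, dstCapacity, src, srcSize, 3);
        ZSTDMT_freeCCtx(mtctx);
        if (ZSTD_isError(cSize)) return cSize;
        return ZSTD_decompress(rt, rtCapacity, dst, cSize);
    }
}
```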
2017-01-11 18:21:25 +01:00
Yann Collet
aca113f4f5 fixed ZSTD_sizeof_?Dict() 2016-12-23 22:25:03 +01:00
Yann Collet
4e5eea61a8 added ZSTD_createDDict_byReference() 2016-12-21 16:44:35 +01:00
Yann Collet
1f57c2ed32 added : ZSTD_createCDict_byReference() 2016-12-21 16:20:11 +01:00
Nick Terrell
8157a4c3cc Fix dictionary loading bug causing an MSAN failure
Offset rep codes must be in the range `[1, dictSize)`.
Fix dictionary loading to reject `0` as an offset rep code.
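
A reduced sketch of the intended check (names are illustrative, not the actual dictionary-loading code):

```
#include <stddef.h>

/* A repeat offset read from a dictionary must fall in [1, dictContentSize) :
 * 0 is not a legal offset and has to be rejected as a corrupted dictionary. */
static int repCodeIsValid(unsigned rep, size_t dictContentSize)
{
    return (rep != 0) && ((size_t)rep < dictContentSize);
}
```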
2016-12-20 10:47:52 -08:00
Yann Collet
d564faa3c6 fix : ZSTD_initCStream_srcSize() correctly set srcSize in frame header 2016-12-18 21:39:15 +01:00
Yann Collet
e795c8a5f6 Added ZSTD_initCStream_srcSize().
Added relevant test cases in zstreamtest
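
A usage sketch, assuming the experimental prototype ZSTD_initCStream_srcSize(zcs, compressionLevel, pledgedSrcSize); the level is illustrative:

```
#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_initCStream_srcSize() is in the experimental section */
#include <zstd.h>

/* Announce the exact source size up front so it gets written into the
 * frame header and is later recoverable via ZSTD_getDecompressedSize(). */
static size_t startSizedStream(ZSTD_CStream* zcs, unsigned long long knownSrcSize)
{
    return ZSTD_initCStream_srcSize(zcs, 3 /* level */, knownSrcSize);
}
```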
2016-12-13 17:00:14 +01:00
Yann Collet
c3a5c4bef8 introduced cycleLog 2016-12-12 00:47:30 +01:00
Yann Collet
c261f71f6a minor variation of rescale fix 2016-12-12 00:25:07 +01:00
Nick Terrell
3826207a70 Simplify segfault fix
Take advantage of the fact that `chainLog <= windowLog`.
2016-12-10 18:46:55 -08:00
Nick Terrell
0012332ce0 Fix compression segfault
When the overflow protection kicks in, it makes sure that ip - ctx->base
isn't too large.  However, it didn't ensure that saved offsets are
still valid.  This change ensures that any valid offsets (<= windowLog)
are still representable after the update.

The bug would show up on line 1056, when `offset_1 > current + 1`, which
causes an underflow. This, in turn, would cause a segfault on line 1063.

The input must necessarily be longer than 1 GB for this issue to occur.
Even then, it only occurs if one of the last 3 matches is larger than
the chain size and block size.
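
A conceptual sketch of the added guarantee (field names are illustrative, not the actual zstd_compress.c structures): once the indices are rebased, a saved repeat offset that is no longer representable must not survive, otherwise `offset_1 > current + 1` underflows later.

```
/* Illustrative only : after the overflow-protection rebase, drop any saved
 * offset that can no longer be reached from the new 'current' position. */
static void clampSavedOffsets(unsigned rep[3], unsigned maxValidOffset)
{
    for (int i = 0; i < 3; i++)
        if (rep[i] > maxValidOffset) rep[i] = 0;
}
```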
2016-12-09 17:15:33 -08:00
Yann Collet
a0d742b1e4 introduced HUF_buildCTable_wksp(), to reduce stack memory usage 2016-12-01 17:47:30 -08:00
Yann Collet
643d9a234b replaced usage of FSE_buildCTable by FSE_buildCTable_wksp, using less stack space in the process 2016-12-01 16:24:04 -08:00
Yann Collet
e928f7e16d introduced ext_wksp variants of count to reduce stack memory usage 2016-12-01 16:13:35 -08:00
Yann Collet
d79a9a00d9 Introduced FSE_compress_wksp() and FSE_buildCTable_wksp() to reduce stack memory usage 2016-11-30 15:52:20 -08:00
Yann Collet
25f46dcc0f minor const 2016-11-29 16:59:27 -08:00
Przemyslaw Skibinski
3d18088b38 updated windres 2016-11-17 18:04:41 +01:00
Yann Collet
407a11f63e fixed Visual compatibility 2016-11-03 15:52:01 -07:00
Nick Terrell
d82efd8a70 ZSTD_compress_usingDict() when dict gets loaded
Specify that when `dict == NULL || dictSize < 8` no dictionary
gets loaded.
Also add some periods.
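
A small sketch of the documented behaviour (the level is illustrative):

```
#include <stddef.h>
#include <zstd.h>

/* With dict == NULL (or dictSize < 8) no dictionary is loaded, so this call
 * behaves exactly like plain ZSTD_compressCCtx(). */
static size_t compressNoDict(ZSTD_CCtx* cctx,
                             void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize)
{
    return ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                   src, srcSize,
                                   NULL, 0, 3 /* level */);
}
```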
2016-11-02 18:07:16 -07:00
Yann Collet
ee5b725823 ZSTD_initCStream() optimization : do not allocate a CDict when no dictionary used 2016-10-27 14:20:55 -07:00
Yann Collet
335ad5d4d4 added ZSTD_initDStream_usingDDict() .
slightly optimized ZSTD_initDStream() when no dictionary .
fixed ZSTD_sizeof_CStream() .
2016-10-25 17:47:02 -07:00
Yann Collet
9516234e67 first sketch for ZSTD_initCStream_usingCDict() 2016-10-25 16:19:52 -07:00
Yann Collet
62d9a7ddfd Merge pull request #429 from inikep/btopt2
Btopt2
2016-10-25 14:48:43 -07:00
Przemyslaw Skibinski
5c5f01f3da added ZSTD_btopt2 strategy 2016-10-25 12:25:07 +02:00
Nick Terrell
b2c39a22b0 Fix compiler narrowing warning 2016-10-24 14:50:13 -07:00
Nick Terrell
f698ad6deb Merge remote-tracking branch 'upstream/dev' into fixes
* upstream/dev:
  added doc\zstd_manual.html
  added contrib\gen_html
  zstd_compression_format.md moved to doc/
  Fix small bug in ZSTD_execSequence()
  improved ZSTD_compressBlock_opt_extDict_generic
  protect ZSTD_decodeFrameHeader() from invalid usage, as suggested by @spaskob
  zstd_opt.h: small improvement in compression ratio
  improved dictionary segment merge
  use implicit rules to compile zstd_decompress.c
  detect early impossible decompression scenario in legacy decoder v0.5
  no repeat mode in legacy v0.5
  fixed invalid invocation of dictionary in legacy decoder v0.5
  fix edge case
  fix command line interpretation
  fixed minor corner case
  zstd.h: added the Introduction section
  fixed clang 3.5 warnings
  zstd.h: updated comments
2016-10-24 13:10:13 -07:00
Nick Terrell
f9c9af3c2e Reject dictionaries with incomplete entropy tables
If a dictionary specifies that a symbol has probability zero in its
`matchLength`, `literalLength`, or `offset` FSE table, but the symbol
appears when compressing input, the compressor fails.

Ensure that dictionaries support all `matchLength` and `literalLength`
codes.  They must also support all of the `offset` codes required to
represent every possible offset that can appear in the first block.
2016-10-24 10:42:44 -07:00
Przemyslaw Skibinski
3ee94a7600 zstd_compression_format.md moved to doc/ 2016-10-24 15:58:07 +02:00
Nick Terrell
bfd943ace5 Fix buffer overrun in ZSTD_loadDictEntropyStats()
The table log set by `FSE_readNCount()` was not checked in
`ZSTD_loadDictEntropyStats()`.  This caused `FSE_buildCTable()`
to stack/heap overflow in a few places.

The benchmarks look good; there is no obvious compression performance regression:

  > ./zstds/zstd.opt.0 -i10 -b1 -e10 ~/bench/silesia.tar
   1#silesia.tar       : 211988480 ->  73656930 (2.878), 271.6 MB/s , 716.8 MB/s
   2#silesia.tar       : 211988480 ->  70162842 (3.021), 204.8 MB/s , 671.1 MB/s
   3#silesia.tar       : 211988480 ->  66997986 (3.164), 156.8 MB/s , 658.6 MB/s
   4#silesia.tar       : 211988480 ->  66002591 (3.212), 136.4 MB/s , 665.3 MB/s
   5#silesia.tar       : 211988480 ->  65008480 (3.261),  98.9 MB/s , 647.0 MB/s
   6#silesia.tar       : 211988480 ->  62979643 (3.366),  65.2 MB/s , 670.4 MB/s
   7#silesia.tar       : 211988480 ->  61974560 (3.421),  44.9 MB/s , 688.2 MB/s
   8#silesia.tar       : 211988480 ->  61028308 (3.474),  32.4 MB/s , 711.9 MB/s
   9#silesia.tar       : 211988480 ->  60416751 (3.509),  21.1 MB/s , 718.1 MB/s
  10#silesia.tar       : 211988480 ->  60174239 (3.523),  22.2 MB/s , 721.8 MB/s

  > ./compress_zstds/zstd.opt.1 -i10 -b1 -e10 ~/bench/silesia.tar
   1#silesia.tar       : 211988480 ->  73656930 (2.878), 273.8 MB/s , 722.0 MB/s
   2#silesia.tar       : 211988480 ->  70162842 (3.021), 203.2 MB/s , 666.6 MB/s
   3#silesia.tar       : 211988480 ->  66997986 (3.164), 157.4 MB/s , 666.5 MB/s
   4#silesia.tar       : 211988480 ->  66002591 (3.212), 132.1 MB/s , 661.9 MB/s
   5#silesia.tar       : 211988480 ->  65008480 (3.261),  96.8 MB/s , 641.6 MB/s
   6#silesia.tar       : 211988480 ->  62979643 (3.366),  63.1 MB/s , 677.0 MB/s
   7#silesia.tar       : 211988480 ->  61974560 (3.421),  44.3 MB/s , 678.2 MB/s
   8#silesia.tar       : 211988480 ->  61028308 (3.474),  33.1 MB/s , 708.9 MB/s
   9#silesia.tar       : 211988480 ->  60416751 (3.509),  21.5 MB/s , 710.1 MB/s
  10#silesia.tar       : 211988480 ->  60174239 (3.523),  21.9 MB/s , 723.9 MB/s
2016-10-17 16:55:52 -07:00
Yann Collet
2b361cf2f1 minor opt 2016-10-14 16:09:07 -07:00
Nick Terrell
3b9cdf9220 Fix ubsan failures (pass NULL to memcpy) 2016-10-12 20:54:42 -07:00
Yann Collet
cf409a7e2a fixed : init*_advanced() followed by reset() with different pledgedSrcSize 2016-09-26 16:41:05 +02:00
Yann Collet
97b378a6f8 Streaming : dictionary compression on multiple files / segments can correctly provide srcSize into header (when provided) using pledgedSrcSize. 2016-09-21 17:20:19 +02:00
Yann Collet
993060e0f2 cli : better adaptation to small files 2016-09-21 16:46:08 +02:00
Yann Collet
a6bdf55759 fixed memory leak 2016-09-15 17:02:06 +02:00