Commit Graph

5077 Commits (712318a2440121dd88bb0bcd9ce956b13fb3c5ac)

Author SHA1 Message Date
Yann Collet 712318a244
Merge pull request #1146 from terrelln/fse-fix
[zstd] Fix decompression edge case
2018-05-23 16:41:42 -07:00
Nick Terrell f2d0924b87 Variable declarations 2018-05-23 14:58:58 -07:00
Nick Terrell c92dd11940 Error if reported size is too large in edge case 2018-05-23 14:47:20 -07:00
Nick Terrell a97e9a627a [zstd] Fix decompression edge case
This edge case is only possible with the new optimal encoding selector,
since previously zstd would always choose `set_basic` for small numbers of
sequences.

Fix `FSE_readNCount()` to support buffers < 4 bytes.

Credit to OSS-Fuzz
2018-05-23 12:16:00 -07:00
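
A minimal sketch of the guard technique this commit describes, assuming a hypothetical `readHeaderSafely()` helper (this is not zstd's exact code): when the header buffer holds fewer than the 4 bytes the bit-reader loads at once, copy it into a zero-padded scratch buffer first.

```c
#include <string.h>

/* Sketch: parse from a zero-padded scratch copy when the input is short,
 * still bounding all results by the real srcSize. */
static size_t readHeaderSafely(const void* src, size_t srcSize)
{
    char scratch[4] = {0};
    if (srcSize < 4) {
        memcpy(scratch, src, srcSize);  /* pad the tail with zeroes */
        src = scratch;                  /* safe to read 4 bytes now */
    }
    /* ... normal 32-bit-at-a-time header parsing would go here ... */
    (void)src;
    return 0;
}
```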
Yann Collet 27dc078aa6
Merge pull request #1144 from terrelln/fse-entropy
Approximate FSE encoding costs for selection
2018-05-22 19:25:37 -07:00
Yann Collet 4a498f03dc
Merge pull request #1145 from terrelln/spec
Clarify what happens when Number_of_Sequences == 0
2018-05-22 16:21:40 -07:00
Nick Terrell 73f4c890cd Clarify what happens when Number_of_Sequences == 0 2018-05-22 16:12:33 -07:00
Nick Terrell e3959d5eba Fixes 2018-05-22 16:06:33 -07:00
Nick Terrell 49cf880513 Approximate FSE encoding costs for selection
Estimate the cost for using FSE modes `set_basic`, `set_compressed`, and
`set_repeat`, and select the one with the lowest cost.

* The cost of `set_basic` is computed using the cross-entropy cost
  function `ZSTD_crossEntropyCost()`, using the normalized default count
  and the count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
  previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
  `ZSTD_entropyCost()`, together with the cost of writing the normalized
  count `ZSTD_NCountCost()`.
2018-05-22 14:33:22 -07:00
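
A minimal sketch of the selection logic described in the commit above; the three cost inputs stand in for `ZSTD_crossEntropyCost()`, the `FSE_bitCost()`-based repeat cost, and `ZSTD_entropyCost()` plus `ZSTD_NCountCost()`, and the function name here is hypothetical.

```c
/* Pick the FSE mode with the lowest estimated cost (a sketch). */
typedef enum { set_basic, set_compressed, set_repeat } symbolEncodingType_e;

static symbolEncodingType_e selectEncodingType(
        size_t basicCost,       /* cross-entropy vs. the normalized default count */
        size_t repeatCost,      /* bit cost using the previous table */
        size_t compressedCost,  /* entropy cost + cost of writing the NCount */
        int repeatValid)        /* can the previous table represent the counts? */
{
    symbolEncodingType_e best = set_basic;
    size_t bestCost = basicCost;
    if (repeatValid && repeatCost < bestCost) {
        best = set_repeat;
        bestCost = repeatCost;
    }
    if (compressedCost < bestCost) {
        best = set_compressed;
    }
    return best;
}
```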
Yann Collet 27af35c110
Merge pull request #1143 from facebook/tableLevels
Update table of compression levels
2018-05-19 14:40:37 -07:00
Yann Collet ade583948d Merge branch 'tableLevels' of github.com:facebook/zstd into tableLevels 2018-05-18 18:23:40 -07:00
Yann Collet 5381369cb1 Merge branch 'dev' into tableLevels 2018-05-18 18:23:27 -07:00
Yann Collet ca06a1d82f
Merge pull request #1142 from terrelln/better-dict
[cover] Small compression ratio improvement
2018-05-18 17:19:13 -07:00
Yann Collet 38c2c46823 Merge branch 'dev' into tableLevels 2018-05-18 17:17:45 -07:00
Yann Collet b0b3fb517d updated compression levels for blocks of 256KB 2018-05-18 17:17:12 -07:00
Nick Terrell 7cbb8bbbbf [cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs
gives the algorithm more candidates to choose from for each segment it
selects, and it will loop back to the first epoch when it hits the
last one.

The trade-off is that it now takes longer to select each segment, since it
has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) |  Before  |  After   | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      |   738138 |   746610 |       +1.14% |
| hg-changelog | ~90        |  4295156 |  4285336 |       -0.23% |
| hg-commands  | ~500       |  1095580 |  1079814 |       -1.44% |
| hg-manifest  | ~400       | 16559892 | 16504346 |       -0.34% |

There is some noise in the measurements, since small changes to `k` can
produce large differences, which is why I'm using `steps=256` to try to
minimize the noise. However, the GitHub data set still has some noise.

If I run the GitHub data set on my Mac (which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order), or if I use `steps=1024`, I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 |       -0.50% |
| MacBook    | 738451 | 737132 |       -0.18% |

Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up, trading a much longer
dictionary-building time for a slightly better dictionary.
I tested `2`, `4`, and `16`; `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
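
A rough sketch of the epoch scheme described above, with hypothetical names and an illustrative divisor (the commit settled on shrinking the epoch count by 4x):

```c
#include <stddef.h>

/* Sketch: one segment is selected per epoch, and selection wraps back to
 * the first epoch after the last one. Shrinking the epoch count gives each
 * segment more candidates to choose from. */
static void selectSegments(size_t nbSampleBytes, size_t dictSize,
                           size_t segmentSize, size_t nbSegments)
{
    /* Roughly epochs * segmentSize ~= dictSize, then divided by 4
     * (the factor the commit above found to be a good trade-off). */
    size_t epochs = (dictSize / segmentSize) / 4;
    if (epochs == 0) epochs = 1;
    {   size_t const epochSize = nbSampleBytes / epochs;
        size_t s;
        for (s = 0; s < nbSegments; ++s) {
            size_t const epoch = s % epochs;         /* wrap to epoch 0 */
            size_t const begin = epoch * epochSize;  /* candidate window */
            size_t const end   = begin + epochSize;
            /* ... score candidate segments in [begin, end) and keep the best ... */
            (void)begin; (void)end;
        }
    }
}
```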
Yann Collet 44303428c6
Merge pull request #1139 from fbrosson/prefetch
__builtin_prefetch probably did not exist before gcc 3.1.
2018-05-18 13:23:35 -07:00
fbrosson 291824f49d __builtin_prefetch probably did not exist before gcc 3.1. 2018-05-18 18:40:11 +00:00
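
A sketch of the usual guard pattern this implies (not necessarily the exact macro zstd uses): fall back to a no-op when the builtin is unavailable.

```c
/* Use __builtin_prefetch only on gcc >= 3.1; otherwise compile to a no-op. */
#if defined(__GNUC__) && ((__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))
#  define PREFETCH(ptr) __builtin_prefetch((ptr), 0 /* read */, 3 /* high locality */)
#else
#  define PREFETCH(ptr) do { (void)(ptr); } while (0)
#endif
```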
Yann Collet bd6417de7f
Merge pull request #1140 from fbrosson/cpu-asm
Drop colon in asm snippet to make old versions of gcc happy.
2018-05-18 10:32:16 -07:00
fbrosson 16bb8f1f9e Drop colon in asm snippet to make old versions of gcc happy. 2018-05-18 17:05:36 +00:00
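
For illustration, the kind of change this title suggests (a guess at the pattern, not the actual diff): very old gcc rejects an extended-asm statement whose trailing clobber section is present but empty, so the final colon is dropped.

```c
static void barrierExample(int x)
{
    /* Portable form: omit the empty clobber section entirely. */
    __asm__ __volatile__("" : : "r"(x));

    /* Very old gcc rejected the same statement written with a trailing
     * colon and an empty clobber list:
     *     __asm__ __volatile__("" : : "r"(x) :);
     */
}
```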
Yann Collet 63eeeaa1dd update table levels for blocks <= 16K
also: allow hlog to be slightly larger than windowlog,
as it's apparently good for both speed and compression ratio.
2018-05-16 16:13:37 -07:00
Yann Collet 9938b17d4c
Merge pull request #1135 from facebook/frameCSize
decompress: changed error code when input is too large
2018-05-15 11:02:53 -07:00
Yann Collet b14c4bff96
Merge pull request #1136 from terrelln/fix
Fix failing Travis tests
2018-05-15 11:02:01 -07:00
Nick Terrell 30d9c84b1a Fix failing Travis tests 2018-05-15 09:46:20 -07:00
Yann Collet f372ffc64d
Merge pull request #1127 from facebook/staticDictCost
Improved optimal parser with dictionary
2018-05-14 17:45:50 -07:00
Yann Collet d59cf02df0 decompress: changed error code when input is too large
ZSTD_decompress() can decompress multiple frames sent as a single input.
But the input size must be the exact sum of all compressed frames, no more.

In the case of a mistake on srcSize, being larger than required,
ZSTD_decompress() will try to decompress a new frame after current one, and fail.
As a consequence, it will issue an error code, ERROR(prefix_unknown).

While the error is technically correct
(the decoder could not recognise the header of the _next_ frame),
it's confusing: users will believe that the header of the first frame is wrong,
which is not the case (it's correct).
This makes it harder to see that the error lies in the source size, which is too large.

This patch changes the error code provided in such a scenario.
If (at least) a first frame was successfully decoded,
and the following bytes are garbage values,
the decoder assumes the provided input size is wrong (too large),
and issues the error code ERROR(srcSize_wrong).
2018-05-14 15:32:28 -07:00
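
A minimal sketch of the policy described above, with hypothetical names (the real logic lives inside ZSTD_decompress()'s frame loop):

```c
typedef enum { err_none, err_prefix_unknown, err_srcSize_wrong } errCode_e;

/* Decide what to report when a frame header fails to parse. */
static errCode_e onBadFrameHeader(int nbFramesAlreadyDecoded)
{
    /* Garbage after at least one valid frame: the caller's srcSize is
     * most likely too large, so report that rather than a header error. */
    if (nbFramesAlreadyDecoded > 0) return err_srcSize_wrong;
    return err_prefix_unknown;  /* genuinely unrecognized first frame */
}
```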
Yann Collet c8c67f7c84 Merge branch 'dev' into tableLevels 2018-05-14 11:55:52 -07:00
Yann Collet 174bd3d4a7
Merge pull request #1131 from facebook/zstdcli
minor: control numeric argument overflow
2018-05-14 11:53:58 -07:00
Yann Collet 5d76201fee
Merge pull request #1130 from facebook/man
fix #1115
2018-05-14 11:52:53 -07:00
Yann Collet 902db38798
Merge pull request #1129 from facebook/paramgrill
Paramgrill refactoring
2018-05-14 11:52:41 -07:00
Yann Collet 3870db1ba5 Merge branch 'dev' into tableLevels 2018-05-14 11:52:05 -07:00
Yann Collet 4da0216db0
Merge pull request #1133 from felixhandte/travis-fix
Make Travis CI Run `apt-get update`
2018-05-14 09:59:43 -07:00
W. Felix Handte e26be5a7b3 Travis CI Runs apt-get Update 2018-05-14 11:55:21 -04:00
Yann Collet 2c392952f9 paramgrill: use NB_LEVELS_TRACKED in loop
make it easier to generate/track more levels
than ZSTD_maxClevel()
2018-05-13 17:25:53 -07:00
Yann Collet c9227ee16b update table for 128 KB blocks 2018-05-13 17:15:07 -07:00
Yann Collet b4250489cf update compression levels for large inputs 2018-05-13 01:53:38 -07:00
Yann Collet 9cd5c63771 cli: control numeric argument overflow
exit on overflow
backported from paramgrill
added associated test case
2018-05-12 14:29:33 -07:00
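
A sketch of overflow-checked numeric parsing in the spirit of this commit (readU32FromChar's actual implementation may differ):

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal unsigned from *stringPtr, exiting on overflow. */
static unsigned readU32OrDie(const char** stringPtr)
{
    unsigned result = 0;
    while ((**stringPtr >= '0') && (**stringPtr <= '9')) {
        unsigned const digit = (unsigned)(**stringPtr - '0');
        if (result > (UINT_MAX - digit) / 10) {   /* would overflow */
            fprintf(stderr, "error: numeric argument overflows\n");
            exit(1);
        }
        result = result * 10 + digit;
        (*stringPtr)++;
    }
    return result;
}
```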
Yann Collet 3f89cd1081 minor : factor out errorOut() 2018-05-12 14:09:32 -07:00
Yann Collet b824d213cb fix #1115 2018-05-12 10:21:30 -07:00
Yann Collet 50993901b2 paramgrill: subtle change in level spacing
distance between levels is slightly increased
to compensate for level 1 speed improvements
and the desire for a stronger level 19,
extending the range of speeds covered.
2018-05-12 09:40:04 -07:00
Yann Collet a3f2e84a37 added programmable constraints 2018-05-11 19:43:08 -07:00
Yann Collet 17c19fbbb5 generalized use of readU32FromChar()
and check input overflow
2018-05-11 17:32:26 -07:00
Yann Collet 761758982e replaced FSE_count with FSE_count_simple
to reduce usage of stack memory.

Also: tweaked a few comments, as suggested by @terrelln
2018-05-11 16:03:37 -07:00
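
For context, a sketch of the kind of single-pass count FSE_count_simple performs, needing only one 256-entry table rather than a larger parallel-counting workspace (hypothetical signature):

```c
#include <stddef.h>

/* Count byte occurrences in src; return the largest single count
 * and report the highest symbol value seen. */
static unsigned countSymbolsSimple(unsigned count[256], unsigned* maxSymbolValue,
                                   const unsigned char* src, size_t srcSize)
{
    unsigned largestCount = 0;
    unsigned s;
    size_t i;
    for (s = 0; s < 256; ++s) count[s] = 0;
    for (i = 0; i < srcSize; ++i) count[src[i]]++;
    *maxSymbolValue = 0;
    for (s = 0; s < 256; ++s) {
        if (count[s]) *maxSymbolValue = s;
        if (count[s] > largestCount) largestCount = count[s];
    }
    return largestCount;
}
```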
Yann Collet 66b81817b5
Merge pull request #1128 from facebook/libdir
minor Makefile patch
2018-05-11 11:47:59 -07:00
Yann Collet 3193d692c2 minor patch, ensuring LIBDIR is created before installation
follow-up from #1123
2018-05-11 11:31:48 -07:00
Yann Collet 99ddca43a6 fixed a wrong assertion
`base` can actually overflow
2018-05-10 19:48:09 -07:00
Yann Collet 0d7626672d fixed c++ conversion warning 2018-05-10 18:17:21 -07:00
Yann Collet 09d0fa29ee minor adjusting of weights 2018-05-10 18:13:48 -07:00
Yann Collet 1a26ec6e8d opt: init statistics from dictionary
instead of starting from fake "default" statistics.
2018-05-10 17:59:12 -07:00
Yann Collet 74b1c75d64 btopt : minor adjustment of update frequencies 2018-05-10 16:32:36 -07:00