191 Commits (47685ac8560b0700a144a3ba7a7bc7a52d6859f9)

Author SHA1 Message Date
Jennifer Liu f5228f2c44 Refactoring 2018-07-31 13:58:54 -07:00
Jennifer Liu 4e29bc2469 Use CDict instead of CCtx in analyzeEntropy 2018-07-31 10:36:45 -07:00
Jennifer Liu 612b346ed5 Add explanation for split=100 2018-07-11 15:50:28 -07:00
Jennifer Liu 5021441d86 Change default splitPoint to 100 2018-07-10 11:19:33 -07:00
Jennifer Liu 456f290e31 Change back to splitPoint<=0 2018-07-09 13:53:25 -07:00
Jennifer Liu 7efabb2cf6 Only make 0.0 default splitPoint 2018-07-09 12:26:53 -07:00
Jennifer Liu 015a00af0f Change cover_sum back to 2 parameters and fix splitPoint issues 2018-07-06 14:24:18 -07:00
Jennifer Liu 0bbff01211 Fix testing parameter 2018-07-05 22:40:32 -07:00
Jennifer Liu a085d1aae1 Allow splitPoint==1.0 (using all samples for both training and testing) 2018-07-05 10:38:45 -07:00
Jennifer Liu 0881184c89 Some edits based on pull request comments 2018-07-03 17:53:27 -07:00
Jennifer Liu 16e75e8804 Update minimal training sample size 2018-07-03 12:07:06 -07:00
Jennifer Liu 348e5f77a9 Add split=# to cli 2018-06-29 17:54:41 -07:00
Jennifer Liu 52fbbbcb6b Explicitly cast double to unsigned 2018-06-29 16:17:20 -07:00
Jennifer Liu f9d19b83fb Fix variable declaration problem 2018-06-29 15:46:56 -07:00
Jennifer Liu e061d84016 Another fix to comparator 2018-06-29 15:38:08 -07:00
Jennifer Liu 59797d3328 Fix splitPoint floating point comparison problem 2018-06-29 12:47:03 -07:00
Jennifer Liu 0ef06f2e8a Split samples into train and test sets 2018-06-29 12:33:34 -07:00
Yann Collet fa41bcc2c2 grouped debug functions into debug.h
There were 2 competing sets of debug functions
within zstd_internal.h and bitstream.h.
They were mostly duplicates, and required care to avoid interfering with each other.

There is now a single implementation, shared by both.

Significant change :
The macro variable ZSTD_DEBUG no longer exists,
it has been replaced by DEBUGLEVEL,
which required modifying several source files.
2018-06-13 15:43:09 -04:00
Nick Terrell 7cbb8bbbbf [cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.

The trade-off is that it now takes longer to select each segment, since it
has to look at more data before making a choice.
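
The epoch arithmetic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in `cover.c`: the names `epoch_layout`, `epoch_schedule`, and the `divisor` knob are hypothetical, chosen only to show how fewer epochs mean larger epochs and a wrap-around back to the first epoch.

```python
def epoch_layout(dict_size, segment_size, corpus_size, divisor=1):
    """Return (epochs, epoch_size) for a corpus of corpus_size positions.

    divisor=1 gives epochs * segment_size ~= dict_size; a larger
    divisor (a hypothetical knob) means fewer, larger epochs.
    """
    if segment_size <= 0 or divisor <= 0:
        raise ValueError("segment_size and divisor must be positive")
    epochs = max(1, dict_size // segment_size // divisor)
    epoch_size = corpus_size // epochs
    return epochs, epoch_size


def epoch_schedule(dict_size, segment_size, epochs):
    """Epoch index visited for each selected segment, wrapping around."""
    segments_needed = dict_size // segment_size
    return [i % epochs for i in range(segments_needed)]


if __name__ == "__main__":
    print(epoch_layout(8000, 1000, 80000))             # (8, 10000)
    print(epoch_layout(8000, 1000, 80000, divisor=4))  # (2, 40000)
    print(epoch_schedule(8000, 1000, 2))               # [0, 1, 0, 1, 0, 1, 0, 1]
```

With a divisor of 4 in this toy setup, each epoch covers four times as much data, so each segment is chosen from more candidates, and the schedule revisits each epoch several times before the dictionary is full.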

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) |  Before  |  After   | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      |   738138 |   746610 |       +1.14% |
| hg-changelog | ~90        |  4295156 |  4285336 |       -0.23% |
| hg-commands  | ~500       |  1095580 |  1079814 |       -1.44% |
| hg-manifest  | ~400       | 16559892 | 16504346 |       -0.34% |

There is some noise in the measurements, since small changes to `k` can
have large differences, which is why I'm using `steps=256`, to try to
minimize the noise. However, the GitHub data set still has some noise.

If I run the GitHub data set on my Mac (which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order), or if I use `steps=1024`, I see these results:

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 |       -0.50% |
| MacBook    | 738451 | 737132 |       -0.18% |

Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up, accepting a much longer
dictionary building time in exchange for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
Yann Collet 1da629f2ad
Merge pull request #1104 from terrelln/fast-train
Allow negative compression levels in training
2018-04-09 14:16:20 -07:00
Nick Terrell 569e2abccd Allow negative compression levels in training
* Set `dictCLevel` in `zstdcli.c`.
* Only set to default level if the compression level `== 0`, not `<= 0`.
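
The second bullet can be illustrated with a small Python sketch. It is not the actual `zstdcli.c` logic: `DEFAULT_DICT_CLEVEL` and `resolve_dict_clevel` are hypothetical names, and the value 3 is assumed from the earlier "[dictBuilder] Set default compression level to 3" commit.

```python
DEFAULT_DICT_CLEVEL = 3  # assumed default, taken from the earlier dictBuilder commit


def resolve_dict_clevel(requested):
    """Treat only 0 as 'use the default'; negative levels pass through."""
    return DEFAULT_DICT_CLEVEL if requested == 0 else requested


if __name__ == "__main__":
    for level in (0, -5, 19):
        print(level, "->", resolve_dict_clevel(level))
```

Under the old `<= 0` check, a request such as `-5` would have been silently replaced by the default.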
2018-04-09 12:12:03 -07:00
Björn Ketelaars 462aed6811 zstd requires a stable sort.
On OpenBSD qsort() is not guaranteed to be stable, but their mergesort() is.
This fixes issue #1088. All the hard work has been done by @terrelln.
2018-04-05 07:59:16 +02:00
Yann Collet 9f8ed23b5b bumped version number to v1.3.4
also added a paragraph on using compression level with training mode
as this is a recurrent question (see for example #1004)
2018-01-27 22:23:26 -08:00
Yann Collet 752bae4a48 added warning message
when pathological dataset is detected
(note : cover_optimize needs -v to display the warning)
2018-01-11 11:29:28 -08:00
Yann Collet e8093dde09 fixed #304
Pathological samples may result in the literal section being incompressible.
This case is now detected,
and the literal distribution is replaced by one that can be written into the dictionary.
2018-01-11 11:16:32 -08:00
Yann Collet 218e9fe0fc added a test case for dictBuilder failure
cyclic data set makes the entropy stage fail
now, onto a fix for #304 ...
2018-01-11 09:42:38 -08:00
Yann Collet c173dbd6e7 no longer supported starting C++17 2017-12-04 18:00:53 -08:00
Nick Terrell 6c41adfb28 [libzstd] pthread function prefixed with ZSTD_
* `sed -i 's/pthread_/ZSTD_pthread_/g' lib/{,common,compress,decompress,dictBuilder}/*.[hc]`
* Fix up `lib/common/threading.[hc]`
* `sed -i s/PTHREAD_MUTEX_LOCK/ZSTD_PTHREAD_MUTEX_LOCK/g lib/compress/zstdmt_compress.c`
2017-09-27 11:48:48 -07:00
Yann Collet 77c137b3ae minor comment refactor 2017-09-14 15:12:57 -07:00
Yann Collet 3128e03be6 updated license header
to clarify dual-license meaning as "or"
2017-09-08 00:09:23 -07:00
Nick Terrell 376f435914 [dictBuilder] Set default compression level to 3 2017-08-24 16:21:05 -07:00
Dmitriy Titarenko 20f715d709 Fix displayLevel overflow 2017-08-23 15:56:15 +05:00
Yann Collet bd9c8ca146 Merge pull request #811 from terrelln/segmentSize
[cover] Fix end condition for small dictionary
2017-08-22 14:36:30 -07:00
Nick Terrell 29c2d9a4d0 [cover] Turn down notification for ZDICT subroutines 2017-08-21 14:28:31 -07:00
Nick Terrell 98de3f6847 [cover] Add dictionary size to compressed size 2017-08-21 14:23:17 -07:00
Nick Terrell 9a54a315aa [cover] Convert score to U32 and check for zero 2017-08-21 13:30:07 -07:00
Nick Terrell d49eb40c03 [cover] Stop when segmentSize is less than d 2017-08-21 13:10:03 -07:00
Nick Terrell f306d400c0 [cover] Fix divide by zero 2017-08-21 11:12:11 -07:00
Yann Collet 32fb407c9d updated a bunch of headers
for the new license
2017-08-18 16:52:05 -07:00
Yann Collet b71363b967 check pthread_*_init() success condition 2017-07-19 01:05:40 -07:00
Yann Collet 2bd6440be0 pinned down error code enum values
Note : all error codes are changed by this new version,
but it's expected to be the last change for existing codes.

Codes are now grouped by category, and receive a manually attributed value.
The objective is to guarantee that
error code values will not change in the future
when introducing new codes.
Intentional empty spaces and ranges are defined
in order to keep room for potential new codes.
2017-07-13 17:12:16 -07:00
Yann Collet 590937df20 Merge pull request #739 from facebook/refPrefix
ZSTD_refPrefix
2017-06-29 04:36:03 -07:00
Yann Collet 7d3816183f exposed ZSTD_MAGIC_DICTIONARY in zstd.h
makes it easier to explain ZSTD_dictMode
2017-06-27 13:50:34 -07:00
Nick Terrell 5b7fd7c422 [zdict] Make COVER the default algorithm 2017-06-26 21:09:22 -07:00
Yann Collet ee970398b2 Merge branch 'dev' into advancedAPI2 2017-05-22 12:33:56 -07:00
Nick Terrell a1280406b0 [libzstd] Allow users to define custom visibility 2017-05-19 18:01:59 -07:00
Yann Collet fa3671eac7 changed ZSTD_BLOCKSIZE_ABSOLUTEMAX into ZSTD_BLOCKSIZE_MAX
Also :
change ZSTD_getBlockSizeMax() into ZSTD_getBlockSize()
created ZSTD_BLOCKSIZELOG_MAX
2017-05-19 10:51:30 -07:00
Nick Terrell f376d47c11 [CLI] Switch dictionary builder on CLI to cover 2017-05-02 11:18:27 -07:00
Nick Terrell 020b960e13 [cover] Make optimization faster 2017-05-02 11:02:48 -07:00
Nick Terrell f2d9ef1dc0 [cover] Optimize case where d <= 8 2017-05-02 11:02:43 -07:00
Nick Terrell 865918dd04 Fix typo in zdict.h 2017-05-02 11:02:37 -07:00
Nick Terrell 5152fb2cb2 Convert all tabs to spaces 2017-03-29 18:51:58 -07:00
Yann Collet 4cf0093571 restored bonus rule 2017-03-26 14:51:00 -07:00
Yann Collet 69017bf253 Merge branch 'dev' into LegacyDictBuilder 2017-03-26 14:39:13 -07:00
Yann Collet 582760818f minor refactor
add const
changed `if` to make it easier to add new conditions
2017-03-26 03:04:56 -07:00
Yann Collet 858f72eeb8 fixed dictBuilder issue
dictionary loading would fail during entropy analysis
2017-03-26 02:50:00 -07:00
Yann Collet ecee9f2ef8 fixed conversion warnings 2017-03-26 00:59:14 -07:00
Yann Collet 4c41d37fcc changed test for new syntax
--dictID= and --maxdict=
2017-03-24 18:36:56 -07:00
Yann Collet d41f707e88 minor improvement : remove duplicates with 1 char prefix difference 2017-03-24 17:56:45 -07:00
Yann Collet 96aa3019b2 changed advanced commands --maxdict= and --dictID=
They now work with the `=` variant, which is the recommended one.
Old variant `--dictID #` still works, for compatibility with existing scripts.
Long-term objective is to remove the old variant.
2017-03-24 16:04:29 -07:00
Yann Collet 9da3b215ec Ensure all limits derived from same constants
Now uses ZDICT_DICTSIZE_MIN and ZDICT_CONTENTSIZE_MIN
from zdict.h.

Also : reduced values to 256 and 128 respectively
2017-03-24 15:02:09 -07:00
Yann Collet f332ece468 dictBuilder fails to create dictionary on certain input
Properly expressed with an error code (see zstd_errors.h)
and a cli return code != 0
2017-03-23 16:24:02 -07:00
Sean Purcell 042ba122ae Change g_displayLevel to int and fix DISPLAYUPDATE flush 2017-03-23 11:21:59 -07:00
Nick Terrell 976e325b2e Fix COVER_optimizeTrainFromBuffer() resource leaks
Thanks to @nemequ for reporting the resource leaks.
2017-03-02 15:54:39 -08:00
Nick Terrell 545987996a Fix deprecation warnings for clang with C++14 2017-02-08 17:38:17 -08:00
Nick Terrell 71c5263c00 Attribute cover dictionary code 2017-02-07 11:35:07 -08:00
Nick Terrell 43474313f8 Fix documentation about memory usage 2017-01-27 18:43:05 -08:00
Nick Terrell 2fe9126591 Add multithread support to COVER 2017-01-27 11:56:02 -08:00
Nick Terrell 8d984699db Document memory requirements for COVER algorithm 2017-01-09 18:20:10 -08:00
Nick Terrell 555e281637 Handle large input size in 32-bit mode correctly 2017-01-09 18:20:06 -08:00
Nick Terrell 3a1fefcf00 Simplify COVER parameters 2017-01-02 17:51:38 -08:00
Nick Terrell 96b39f65fa Add COVER dictionary builder 2017-01-02 13:22:51 -08:00
Yann Collet aca113f4f5 fixed ZSTD_sizeof_?Dict() 2016-12-23 22:25:03 +01:00
Nick Terrell 1b5d4a7d53 ZDICT_finalizeDictionary() flipped comparison 2016-12-22 18:14:57 -08:00
Nick Terrell bcbe77e994 ZDICT_finalizeDictionary() flipped comparison
`ZDICT_finalizeDictionary()` had a flipped comparison.
I also allowed `dictBufferCapacity == dictContentSize`.
It might be the case that the user wants to fill the dictionary
completely up, and then let zstd take exactly the space it needs
for the entropy tables.
2016-12-22 18:01:14 -08:00
Nick Terrell 78a0072d5a Fix failing test due to deprecation warning 2016-12-22 17:36:16 -08:00
Yann Collet d76d1a9ef0 added ZDICT_finalizeDictionary() 2016-12-22 20:18:43 +01:00
Yann Collet 0819abe3c1 added ZSTD_createDDict_byReference() body 2016-12-21 19:25:15 +01:00
Yann Collet 1496c3dc47 Fix : size estimation when some samples are very large 2016-12-18 11:58:23 +01:00
Yann Collet d46ecb58a5 added dll compilation tests 2016-12-17 16:28:12 +01:00
Nick Terrell 8de46ab51a Export all API functions 2016-12-16 13:27:30 -08:00
Yann Collet 0a5a5fb7fd Fix #418 : printing selected segments in zdict debug mode can segfault with certain pathological patterns 2016-11-02 13:57:55 -07:00
Yann Collet 52c1bf93fe improved dictionary segment merge 2016-10-18 16:34:58 -07:00
Yann Collet 2b361cf2f1 minor opt 2016-10-14 16:09:07 -07:00
Yann Collet df6797447f update dictionary builder warning comments 2016-09-27 15:14:32 +02:00
Yann Collet 47094ea66b added comment on filePos 2016-09-26 18:03:33 +02:00
Yann Collet 97b378a6f8 Streaming : dictionary compression on multiple files / segments can correctly provide srcSize into header (when provided) using pledgedSrcSize. 2016-09-21 17:20:19 +02:00
Yann Collet d56dbc02d3 removed g_displayLevel 2016-09-02 17:28:41 -07:00
Yann Collet 855766d73d clarified dictionary in format description 2016-09-02 17:04:49 -07:00
Yann Collet d725427a3c g_time => local displayTime 2016-09-02 15:32:39 -07:00
Yann Collet 4ded9e591c added boilerplate 2016-08-30 11:06:28 -07:00
Yann Collet 3b15f1f10f minor refactor 2016-08-30 09:58:50 -07:00
Yann Collet 87c18b2ebd fixed multiple minor warnings for XCode 2016-08-26 01:43:47 +02:00
Yann Collet da3fbcb302 Added ZDICT_getDictID() 2016-08-19 14:23:58 +02:00
Yann Collet a5dbf9f629 Merge pull request #297 from borzunov/dev
Export functions related to dictionary compression from DLL
2016-08-18 15:05:01 +02:00
Yann Collet 49d105cfcf better warning and error messages in case of dictionary training failure (#292) 2016-08-18 15:02:11 +02:00
Alexander Borzunov 0f6f17a14f Rename ZSTDLIB_API to ZDICTLIB_API in zdict.h 2016-08-18 16:47:06 +05:00
Alexander Borzunov 1f48382b1a Export functions related to dictionary compression from DLL 2016-08-18 16:12:49 +05:00
Yann Collet e9b414d825 fixed msan warning (#281) 2016-08-11 22:09:09 +02:00
Yann Collet e0b4a2d40f fixed dictionary generation, reported by Bartosz Taudul 2016-08-03 03:36:03 +02:00