[cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs gives the algorithm more candidates to choose from for each segment it selects; when it reaches the last epoch, it loops back to the first one. The trade-off is that each segment now takes longer to select, since the algorithm has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

Before and After are the total compressed sizes in bytes, as reported by `wc -c`.

| Data set     | k (approx) | Before   | After    | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      | 738138   | 746610   | +1.14%       |
| hg-changelog | ~90        | 4295156  | 4285336  | -0.23%       |
| hg-commands  | ~500       | 1095580  | 1079814  | -1.44%       |
| hg-manifest  | ~400       | 16559892 | 16504346 | -0.34%       |

There is some noise in the measurements, since small changes to `k` can produce large differences in the result, which is why I'm using `steps=256` to try to minimize the noise. However, the GitHub data set still has some noise. If I run the GitHub data set on my Mac, which presumably lists directory entries in a different order so that the dictionary builder sees the files in a different order, or if I use `steps=1024`, I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50%       |
| MacBook    | 738451 | 737132 | -0.18%       |

Question: Should we expose this as a parameter? I don't think it is necessary. Someone might want to turn it up to trade a much longer dictionary building time for a slightly better dictionary. I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16` with a faster running time.

dev
parent 44303428c6
commit 7cbb8bbbbf
@@ -620,7 +620,7 @@ static size_t COVER_buildDictionary(const COVER_ctx_t *ctx, U32 *freqs,
   /* Divide the data up into epochs of equal size.
    * We will select at least one segment from each epoch.
    */
-  const U32 epochs = (U32)(dictBufferCapacity / parameters.k);
+  const U32 epochs = MAX(1, (U32)(dictBufferCapacity / parameters.k / 4));
   const U32 epochSize = (U32)(ctx->suffixSize / epochs);
   size_t epoch;
   DISPLAYLEVEL(2, "Breaking content into %u epochs of size %u\n", epochs,