diff --git a/zstd_compression_format.md b/zstd_compression_format.md index ce17f569..94b1dc5c 100644 --- a/zstd_compression_format.md +++ b/zstd_compression_format.md @@ -76,9 +76,9 @@ allowing streaming operations. General Structure of Zstandard Frame format ------------------------------------------- -| MagicNb | F. Header | Block | (More blocks) | EndMark | -|:-------:|:----------:| ----- | ------------- | ------- | -| 4 bytes | 2-14 bytes | | | 3 bytes | +| MagicNb | Frame Header | Block | (More blocks) | EndMark | +|:-------:|:-------------:| ----- | ------------- | ------- | +| 4 bytes | 2-14 bytes | | | 3 bytes | __Magic Number__ @@ -87,11 +87,11 @@ Value : 0xFD2FB527 __Frame Header__ -2 to 14 Bytes, to be detailed in the next part. +2 to 14 Bytes, detailed in [next part](#frame-header). __Data Blocks__ -To be detailed later on. +Detailed in [next chapter](#data-blocks). That’s where compressed data is stored. __EndMark__ @@ -101,12 +101,12 @@ This last block header may optionally host a __Content Checksum__ . ##### __Content Checksum__ -Content Checksum verify that frame content has been regenrated correctly. +Content Checksum verify that frame content has been regenerated correctly. The content checksum is the result of [xxh64() hash function](https://www.xxHash.com) digesting the original (decoded) data as input, and a seed of zero. Bits from 11 to 32 (included) are extracted to form a 22 bits checksum -stored into the last block header. +stored into the endmark body. ``` mask22bits = (1<<22)-1; contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits; @@ -224,6 +224,10 @@ Provides guarantees on maximum back-reference distance that will be present within compressed data. This information is useful for decoders to allocate enough memory. +`WD` byte is optional. It's not present in `single segment` mode. +In which case, the maximum back-reference distance is the content size itself, +which can be any value from 1 to 2^64-1 bytes (16 EB). + | BitNb | 7-3 | 0-2 | | --------- | -------- | -------- | | FieldName | Exponent | Mantissa | @@ -236,20 +240,16 @@ windowAdd = (windowBase / 8) * Mantissa; windowSize = windowBase + windowAdd; ``` The minimum window size is 1 KB. -The maximum size is (15*(2^38))-1 bytes, which is almost 1.875 TB. +The maximum size is `15*(1<<38)` bytes, which is 1.875 TB. To properly decode compressed data, a decoder will need to allocate a buffer of at least `windowSize` bytes. -Note that `WD` byte is optional. It's not present in `single segment` mode. -In which case, the maximum back-reference distance is the content size itself, -which can be any value from 1 to 2^64-1 bytes (16 EB). - In order to preserve decoder from unreasonable memory requirements, a decoder can refuse a compressed frame which requests a memory size beyond decoder's authorized range. -For better interoperability, +For improved interoperability, decoders are recommended to be compatible with window sizes of 8 MB. Encoders are recommended to not request more than 8 MB. It's merely a recommendation though, @@ -266,10 +266,10 @@ it's up to the caller to make sure it uses the correct dictionary. Field size depends on __Dictionary ID flag__. 1 byte can represent an ID 0-255. 2 bytes can represent an ID 0-65535. -4 bytes can represent an ID 0-(2^32-1). +4 bytes can represent an ID 0-4294967295. It's allowed to represent a small ID (for example `13`) -with a large 4-bytes dictionary ID, losing some efficiency in the process. +with a large 4-bytes dictionary ID, losing some compacity in the process. __Frame Content Size__ @@ -331,7 +331,7 @@ There are 4 block types : while the "compressed" block is just 1 byte (the byte to repeat). - EndMark : this is not a block. Signal the end of the frame. The rest of the field may be optionally filled by a checksum - (see [__Content Checksum__]). + (see [Content Checksum](#content-checksum)). Block sizes must respect a few rules : - In compressed mode, compressed size if always strictly `< decompressed size`. @@ -345,9 +345,9 @@ Where the actual data to decode stands. It might be compressed or not, depending on previous field indications. A data block is not necessarily "full" : since an arbitrary “flush” may happen anytime, -block decompressed content can be any size, up to Block Maximum Size. -Block Maximum Decompressed Size is the smallest of : -- Max back-reference distance +block decompressed content can be any size, +up to Block Maximum Decompressed Size, which is the smallest of : +- Maximum back-reference distance - 128 KB @@ -364,7 +364,9 @@ Its design is pretty straightforward, with the sole objective to allow the decoder to quickly skip over user-defined data and continue decoding. -Skippable frames defined in this specification are compatible with LZ4 ones. +Skippable frames defined in this specification are compatible with [LZ4] ones. + +[LZ4]:http://www.lz4.org __Magic Number__ : @@ -393,8 +395,8 @@ in order to properly allocate destination buffer. See [Data Blocks](#data-blocks) for more details. A compressed block consists of 2 sections : -- Literals section -- Sequences section +- [Literals section](#literals-section) +- [Sequences section](#sequences-section) ### Prerequisites To decode a compressed block, the following elements are necessary : @@ -439,12 +441,11 @@ This is a 2-bits field, describing 4 different block types : | Block Type | Compressed | Repeat | Raw | RLE | - Compressed : This is a standard huffman-compressed block, - starting with a huffman tree description. - See details below. + starting with a huffman tree description. + See details below. - Repeat Stats : This is a huffman-compressed block, - using huffman tree from previous huffman-compressed block. - Huffman tree description will be skipped. - Compressed stream is equivalent to "compressed" block type. + using huffman tree _from previous huffman-compressed literals block_. + Huffman tree description will be skipped. - Raw : Literals are stored uncompressed. - RLE : Literals consist of a single byte value repeated N times. @@ -476,24 +477,24 @@ using a long format, accepting the reduced compacity. __Sizes format for Compressed literals block__ : Note : also applicable to "repeat-stats" blocks. -- Value : 00 : 4 streams - Compressed and regenerated sizes use 10 bits (0-1023) - Total literal header size is 3 bytes -- Value : 01 : _Single stream_ - Compressed and regenerated sizes use 10 bits (0-1023) - Total literal header size is 3 bytes -- Value : 10 : 4 streams - Compressed and regenerated sizes use 14 bits (0-16383) - Total literal header size is 4 bytes -- Value : 10 : 4 streams - Compressed and regenerated sizes use 18 bits (0-262143) - Total literal header size is 5 bytes +- Value : 00 : 4 streams. + Compressed and regenerated sizes use 10 bits (0-1023). + Total literal header size is 3 bytes. +- Value : 01 : _Single stream_. + Compressed and regenerated sizes use 10 bits (0-1023). + Total literal header size is 3 bytes. +- Value : 10 : 4 streams. + Compressed and regenerated sizes use 14 bits (0-16383). + Total literal header size is 4 bytes. +- Value : 10 : 4 streams. + Compressed and regenerated sizes use 18 bits (0-262143). + Total literal header size is 5 bytes. Compressed and regenerated size fields follow big endian convention. #### Huffman Tree description -This section is only present when block type is `Compressed` (`0`). +This section is only present when literals block type is `Compressed` (`0`). Prefix coding represents symbols from an a priori known alphabet by bit sequences (codes), one code for each symbol, @@ -546,7 +547,7 @@ It gives the following serie of weights : The decoder will do the inverse operation : having collected weights of literals from `0` to `4`, it knows the last literal, `5`, is present with a non-zero weight. -The weight of `5` can be deduced by joining to the nearest power of 2. +The weight of `5` can be deducted by joining to the nearest power of 2. Sum of 2^(weight-1) (excluding 0) is : `8 + 4 + 2 + 0 + 1 = 15` Nearest power of 2 is 16. @@ -558,24 +559,17 @@ This is a single byte value (0-255), which tells how to decode the list of weights. - if headerByte >= 242 : this is one of 14 pre-defined weight distributions : - + 242 : 1x1 (+ 1x1) - + 243 : 2x1 (+ 1x2) - + 244 : 3x1 (+ 1x1) - + 245 : 4x1 (+ 1x4) - + 246 : 7x1 (+ 1x1) - + 247 : 8x1 (+ 1x8) - + 248 : 15x1 (+ 1x1) - + 249 : 16x1 (+ 1x16) - + 250 : 31x1 (+ 1x1) - + 251 : 32x1 (+ 1x32) - + 252 : 63x1 (+ 1x1) - + 253 : 64x1 (+ 1x64) - + 254 :127x1 (+ 1x1) - + 255 :128x1 (+ 1x128) + +| value |242|243|244|245|246|247|248|249|250|251|252|253|254|255| +| -------- | +| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128| +|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128| + +_Note_ : complement is by using the "join to nearest power of 2" rule. - if headerByte >= 128 : this is a direct representation, where each weight is written directly as a 4 bits field (0-15). - The full representation occupies ((nbSymbols+1)/2) bytes, + The full representation occupies `((nbSymbols+1)/2)` bytes, meaning it uses a last full byte even if nbSymbols is odd. `nbSymbols = headerByte - 127;` @@ -593,13 +587,13 @@ To decode an FSE bitstream, it is necessary to know its compressed size. Compressed size is provided by `headerByte`. It's also necessary to know its maximum decompressed size. In this case, it's `255`, since literal values range from `0` to `255`, -and the last symbol value is not represented. +and last symbol value is not represented. An FSE bitstream starts by a header, describing probabilities distribution. It will create a Decoding Table. It is necessary to know the maximum accuracy of distribution to properly allocate space for the Table. -For a list of huffman weights, this maximum is 8 bits. +For a list of huffman weights, this maximum is 7 bits. FSE header and bitstreams are described in a separated chapter. @@ -652,13 +646,12 @@ presuming the CPU has enough parallelism available. For single stream, header provides both the compressed and regenerated size. For 4-streams though, header only provides compressed and regenerated size of all 4 streams combined. - In order to properly decode the 4 streams, it's necessary to know the compressed and regenerated size of each stream. Regenerated size is easiest : each stream has a size of `(totalSize+3)/4`, -except the last one, which is up to 3 bytes smaller, to reach totalSize. +except the last one, which is up to 3 bytes smaller, to reach `totalSize`. Compressed size must be provided explicitly : in the 4-streams variant, bitstreams are preceded by 3 unsigned Little Endian 16-bits values. @@ -704,11 +697,11 @@ A match copy command specifies an offset and a length. The offset gives the position to copy from, which can stand within a previous block. -These are 3 symbol types, `literalLength`, `matchLength` and `offset`, +There are 3 symbol types, `literalLength`, `matchLength` and `offset`, which are encoded together, interleaved in a single _bitstream_. -Each symbol decoding consists of a _code_, -which specifies a baseline and a number of additional bits. +Each symbol is a _code_ in its own context, +which specifies a baseline and a number of bits to add. _Codes_ are FSE compressed, and interleaved with raw additional bits in the same bitstream. @@ -717,7 +710,7 @@ followed by optional Probability tables for each symbol type, followed by the bitstream. To decode the Sequence section, it's required to know its size. -This size is deducted from "blockSize - literalSectionSize". +This size is deducted from `blockSize - literalSectionSize`. #### Sequences section header @@ -733,9 +726,9 @@ Let's call its first byte `byte0`. - `if (byte0 == 0)` : there are no sequences. The sequence section stops there. Regenerated content is defined entirely by literals section. -- `if (byte0 < 128)` : nbSeqs = byte0 . Uses 1 byte. -- `if (byte0 < 255)` : nbSeqs = ((byte0-128) << 8) + byte1 . Uses 2 bytes. -- `if (byte0 == 255)`: nbSeqs = byte1 + (byte2<<8) + 0x7F00 . Uses 3 bytes. +- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte. +- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes. +- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes. __Symbol compression modes__ @@ -760,8 +753,8 @@ They follow the same enumeration : - "RLE" : it's a single code, repeated `nbSeqs` times. - "Repeat" : re-use distribution table from previous compressed block. - "FSE" : standard FSE compression. - Symbol type requires a distribution table, - which will be described in next part. + A distribution table will be present. + It will be described in [next part](#distribution-tables). #### Symbols decoding @@ -772,8 +765,9 @@ They define lengths from 0 to 131071 bytes. | Code | 0-15 | | ------ | ---- | -| nbBits | 0 | | value | Code | +| nbBits | 0 | + | Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | @@ -792,7 +786,7 @@ They define lengths from 0 to 131071 bytes. __Default distribution__ -When "compression mode" is defined as "default distribution", +When "compression mode" is "predef"", a pre-defined distribution is used for FSE compression. Here is its definition. It uses an accuracy of 6 bits (64 states). @@ -810,8 +804,8 @@ They define lengths from 3 to 131074 bytes. | Code | 0-31 | | ------ | -------- | -| nbBits | 0 | | value | Code + 3 | +| nbBits | 0 | | Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | @@ -847,8 +841,8 @@ short matchLengths_defaultDistribution[53] = Offset codes are values ranging from `0` to `N`, with `N` being limited by maximum backreference distance. -A decoder is free to limit its maximum `N` supported, -although the recommendation is to support at least up to `22`. +A decoder is free to limit its maximum `N` supported. +Recommendation is to support at least up to `22`. For information, at the time of this writing. the reference decoder supports a maximum `N` value of `28` in 64-bits mode. @@ -863,14 +857,15 @@ if (OFValue > 3) offset = OFValue - 3; OFValue from 1 to 3 are special : they define "repeat codes", which means one of the previous offsets will be repeated. They are sorted in recency order, with 1 meaning the most recent one. +See [Repeat offsets](#repeat-offsets) paragraph. __Default distribution__ -When "compression mode" is defined as "default distribution", +When "compression mode" is defined as "predef", a pre-defined distribution is used for FSE compression. Here is its definition. It uses an accuracy of 5 bits (32 states), -and support a maximum `N` of 28, allowing offset values up to 536,870,908 . +and supports a maximum `N` of 28, allowing offset values up to 536,870,908 . If any sequence in the compressed block requires an offset larger than this, it's not possible to use the default distribution to represent it. @@ -920,7 +915,7 @@ It depends on : __example__ : Presuming an AccuracyLog of 8, and presuming 100 probabilities points have already been distributed, - the decoder may discover value from `0` to `255 - 100 + 1 == 156` (included). + the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included). Therefore, it must read `log2sup(156) == 8` bits. - Value decoded : small values use 1 less bit : @@ -946,7 +941,7 @@ Probability is obtained from Value decoded by following formulae : It means value `0` becomes negative probability `-1`. `-1` is a special probability, which means `less than 1`. -Its effect on distribution table is described in a later paragraph. +Its effect on distribution table is described in next paragraph. For the purpose of calculating cumulated distribution, it counts as one. When a symbol has a probability of `zero`, @@ -988,8 +983,10 @@ each symbol gets attributed as many cells as its probability. Cell allocation is spreaded, not linear : each successor position follow this rule : -`position += (tableSize>>1) + (tableSize>>3) + 3); -position &= tableSize-1;` +``` +position += (tableSize>>1) + (tableSize>>3) + 3; +position &= tableSize-1; +``` A position is skipped if already occupied, typically by a "less than 1" probability symbol. @@ -999,11 +996,11 @@ Each state will decode the current symbol. To get the Number of bits and baseline required for next state, it's first necessary to sort all states in their natural order. -The lower states will need 1 bit more than higher ones. +The lower states will need 1 more bit than higher ones. __Example__ : Presuming a symbol has a probability of 5. -It receives 5 state values. +It receives 5 state values. States are sorted in natural order. Next power of 2 is 8. Space of probabilities is divided into 8 equal parts. @@ -1034,8 +1031,8 @@ All sequences are stored in a single bitstream, read _backward_. It is therefore necessary to know the bitstream size, which is deducted from compressed block size. -The exact last bit of the stream is followed by a set-bit-flag. -The highest bit of last byte is this flag. +The bit of the stream is followed by a set-bit-flag. +Highest bit of last byte is this flag. It does not belong to the useful part of the bitstream. Therefore, last byte has 0-7 useful bits. Note that it also means that last byte cannot be `0`.