improved zstd_compression_format.md

dev
inikep 2016-08-13 09:36:24 +02:00
parent f4f8add621
commit e81f2cb938
1 changed files with 203 additions and 201 deletions

View File

@ -60,8 +60,9 @@ explaining which parameter is unsupported.
Overall conventions
-----------
In this document square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
In this document:
- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
- a naming convention for identifiers is `Mixed_Case_With_Underscores`
Definitions
-----------
@ -106,7 +107,7 @@ The structure of a single Zstandard frame is following:
__`Magic_Number`__
4 Bytes, Little-endian format.
4 Bytes, little-endian format.
Value : 0xFD2FB527
__`Frame_Header`__
@ -122,7 +123,7 @@ __`Content_Checksum`__
An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of [xxh64() hash function](https://www.xxHash.com)
of [xxh64() hash function](http://www.xxhash.org)
digesting the original (decoded) data as input, and a seed of zero.
The low 4 bytes of the checksum are stored in little endian format.
@ -214,11 +215,11 @@ __`Dictionary_ID_flag`__
This is a 2-bits flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field.
It also specifies the size of this field as `Field_Size`.
| Value | 0 | 1 | 2 | 3 |
| -------- | --- | --- | --- | --- |
|Field size| 0 | 1 | 2 | 4 |
|`Flag_Value`| 0 | 1 | 2 | 3 |
| ---------- | --- | --- | --- | --- |
|`Field_Size`| 0 | 1 | 2 | 4 |
### `Window_Descriptor`
@ -230,22 +231,22 @@ The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).
| Bit numbers | 7-3 | 0-2 |
| ----------- | -------- | -------- |
| Field name | Exponent | Mantissa |
| Bit numbers | 7-3 | 0-2 |
| ----------- | ---------- | ---------- |
| Field name | `Exponent` | `Mantissa` |
Maximum distance is given by the following formulae :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
Window_Size = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
In order to preserve decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
@ -287,7 +288,7 @@ the following ranges are reserved for future use and should not be used :
This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is Little-endian.
Format is little-endian.
| `Field_Size` | Range |
| ------------ | ---------- |
@ -309,7 +310,7 @@ The structure of `Data_Block` is following:
|:------------:|:------------:|:------------:|:---------------:|
| 1 bit | 2 bits | 21 bits | n bytes |
The block header uses 3-bytes.
The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
__`Last_Block`__
@ -379,7 +380,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.
__`Magic_Number`__
4 Bytes, Little-endian format.
4 Bytes, little-endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.
@ -387,7 +388,7 @@ __`Frame_Size`__
This is the size, in bytes, of the following `User_Data`
(without including the magic number nor the size field itself).
This field is represented using 4 Bytes, Little-endian format, unsigned 32-bits.
This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
This means `User_Data` cant be bigger than (2^32-1) bytes.
__`User_Data`__
@ -408,7 +409,7 @@ A compressed block consists of 2 sections :
### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `windowSize`,
- Previous decoded blocks, up to a distance of `Window_Size`,
or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
@ -467,19 +468,19 @@ __`Size_Format`__
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.
For values spanning several bytes, convention is Little-endian.
For values spanning several bytes, convention is little-endian.
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
Total literal header size is 1 byte.
`size = h[0]>>3;`
`Literals_Section_Header` has 1 byte.
`Regenerated_Size = Header[0]>>3`
- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
Total literal header size is 2 bytes.
`size = (h[0]>>4) + (h[1]<<4);`
`Literals_Section_Header` has 2 bytes.
`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
Total literal header size is 3 bytes.
`size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
`Literals_Section_Header` has 3 bytes.
`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`
Note : it's allowed to represent a short value (ex : `13`)
using a long format, accepting the reduced compacity.
@ -487,19 +488,19 @@ using a long format, accepting the reduced compacity.
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
- Value : 00 : _Single stream_.
`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Total literal header size is 3 bytes.
Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
`Literals_Section_Header` has 3 bytes.
- Value : 01 : 4 streams.
`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
Total literal header size is 3 bytes.
Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
`Literals_Section_Header` has 3 bytes.
- Value : 10 : 4 streams.
`Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
Total literal header size is 4 bytes.
Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
`Literals_Section_Header` has 4 bytes.
- Value : 11 : 4 streams.
`Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
Total literal header size is 5 bytes.
Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
`Literals_Section_Header` has 5 bytes.
`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
#### `Huffman_Tree_Description`
@ -526,53 +527,53 @@ This specification limits maximum code length to 11 bits.
##### Representation
All literal values from zero (included) to last present one (excluded)
are represented by `weight` values, from 0 to `maxBits`.
Transformation from `weight` to `nbBits` follows this formulae :
`nbBits = weight ? maxBits + 1 - weight : 0;` .
The last symbol's weight is deduced from previously decoded ones,
are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formulae :
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `maxBits`, the depth of the current tree.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
__Example__ :
Let's presume the following Huffman tree must be described :
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
| ------- | --- | --- | --- | --- | --- | --- |
| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
The tree depth is 4, since its smallest element uses 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
It gives the following serie of weights :
| weights | 4 | 3 | 2 | 0 | 1 |
| ------- | --- | --- | --- | --- | --- |
| literal | 0 | 1 | 2 | 3 | 4 |
| `Weight` | 4 | 3 | 2 | 0 | 1 |
| -------- | --- | --- | --- | --- | --- |
| literal | 0 | 1 | 2 | 3 | 4 |
The decoder will do the inverse operation :
having collected weights of literals from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero weight.
The weight of `5` can be deducted by joining to the nearest power of 2.
Sum of 2^(weight-1) (excluding 0) is :
`8 + 4 + 2 + 0 + 1 = 15`
Sum of `2^(Weight-1)` (excluding 0) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest power of 2 is 16.
Therefore, `maxBits = 4` and `weight[5] = 1`.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
##### Huffman Tree header
This is a single byte value (0-255),
which tells how to decode the list of weights.
- if headerByte >= 128 : this is a direct representation,
where each weight is written directly as a 4 bits field (0-15).
The full representation occupies `((nbSymbols+1)/2)` bytes,
meaning it uses a last full byte even if nbSymbols is odd.
`nbSymbols = headerByte - 127;`.
Note that maximum nbSymbols is 255-127 = 128.
- if `headerByte` >= 128 : this is a direct representation,
where each `Weight` is written directly as a 4 bits field (0-15).
The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
meaning it uses a last full byte even if `Number_of_Symbols` is odd.
`Number_of_Symbols = headerByte - 127`.
Note that maximum `Number_of_Symbols` is 255-127 = 128.
A larger serie must necessarily use FSE compression.
- if headerByte < 128 :
- if `headerByte` < 128 :
the serie of weights is compressed by FSE.
The length of the FSE-compressed serie is `headerByte` (0-127).
@ -605,29 +606,29 @@ When both states have overflowed the bitstream, end is reached.
##### Conversion from weights to Huffman prefix codes
All present symbols shall now have a `weight` value.
It is possible to transform weights into nbBits, using this formula :
`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
All present symbols shall now have a `Weight` value.
It is possible to transform weights into Number_of_Bits, using this formula :
`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` .
Symbols are sorted by weight. Within same weight, symbols keep natural order.
Symbols with a weight of zero are removed.
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest weight, prefix codes are distributed in order.
__Example__ :
Let's presume the following list of weights has been decoded :
| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
| ------- | --- | --- | --- | --- | --- | --- |
| weight | 4 | 3 | 2 | 0 | 1 | 1 |
| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Sorted by weight and then natural order,
it gives the following distribution :
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
| ------------ | --- | --- | --- | --- | --- | ---- |
| weight | 0 | 1 | 1 | 2 | 3 | 4 |
| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
#### Literals bitstreams
@ -638,7 +639,7 @@ As seen in a previous paragraph,
there are 2 flavors of Huffman-compressed literals :
single stream, and 4-streams.
4-streams is useful for CPU with multiple execution units and OoO operations.
4-streams is useful for CPU with multiple execution units and out-of-order operations.
Since each stream can be decoded independently,
it's possible to decode them up to 4x faster than a single stream,
presuming the CPU has enough parallelism available.
@ -653,11 +654,11 @@ Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
Compressed size is provided explicitly : in the 4-streams variant,
bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
bitstreams are preceded by 3 unsigned little-endian 16-bits values.
Each value represents the compressed size of one stream, in order.
The last stream size is deducted from total compressed size
and from previously decoded stream sizes :
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
##### Bitstreams read and decode
@ -676,7 +677,7 @@ Starting from the end,
it's possible to read the bitstream in a little-endian fashion,
keeping track of already used bits.
Reading the last `maxBits` bits,
Reading the last `Max_Number_of_Bits` bits,
it's then possible to compare extracted value to decoding table,
determining the symbol to decode and number of bits to discard.
@ -696,7 +697,7 @@ A match copy command specifies an offset and a length.
The offset gives the position to copy from,
which can be within a previous block.
There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
There are 3 symbol types, literals lengths, offsets and match lengths,
which are encoded together, interleaved in a single _bitstream_.
Each symbol is a _code_ in its own context,
@ -708,126 +709,125 @@ The Sequences section starts by a header,
followed by optional Probability tables for each symbol type,
followed by the bitstream.
| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
| ------ | ---------------- | ------------- | ------------------ | --------- |
| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
To decode the Sequence section, it's required to know its size.
This size is deducted from `blockSize - literalSectionSize`.
#### Sequences section header
#### `Sequences_Section_Header`
Consists in 2 items :
- Nb of Sequences
- Flags providing Symbol compression types
- `Number_of_Sequences`
- Symbol compression modes
__Nb of Sequences__
__`Number_of_Sequences`__
This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
This is a variable size field using between 1 and 3 bytes.
Let's call its first byte `byte0`.
- `if (byte0 == 0)` : there are no sequences.
The sequence section stops there.
Regenerated content is defined entirely by literals section.
- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
__Symbol encoding modes__
__Symbol compression modes__
This is a single byte, defining the compression mode of each symbol type.
| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
| ------- | ------ | ------ | ------ | -------- |
|FieldName| LLType | OFType | MLType | Reserved |
|Bit number| 7-6 | 5-4 | 3-2 | 1-0 |
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
The last field, `Reserved`, must be all-zeroes.
`LLType`, `OFType` and `MLType` define the compression mode of
Literal Lengths, Offsets and Match Lengths respectively.
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
literals lengths, offsets and match lengths respectively.
They follow the same enumeration :
| Value | 0 | 1 | 2 | 3 |
| ---------------- | ------ | --- | ---------- | ------ |
| Compression Mode | predef | RLE | Compressed | Repeat |
| Value | 0 | 1 | 2 | 3 |
| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
- "predef" : uses a pre-defined distribution table.
- "RLE" : it's a single code, repeated `nbSeqs` times.
- "Repeat" : re-use distribution table from previous compressed block.
- "Compressed" : standard FSE compression.
- `Predefined_Mode` : uses a predefined distribution table.
- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
- `Repeat_Mode` : re-use distribution table from previous compressed block.
- `FSE_Compressed_Mode` : standard FSE compression.
A distribution table will be present.
It will be described in [next part](#distribution-tables).
#### Symbols decoding
##### Literal Lengths codes
##### Literals Length codes
Literal lengths codes are values ranging from `0` to `35` included.
Literals length codes are values ranging from `0` to `35` included.
They define lengths from 0 to 131071 bytes.
| Code | 0-15 |
| ------ | ---- |
| length | Code |
| nbBits | 0 |
| `Literals_Length_Code` | 0-15 |
| ---------------------- | ---------------------- |
| length | `Literals_Length_Code` |
| `Number_of_Bits` | 0 |
| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Code | 32 | 33 | 34 | 35 |
| -------- | ---- | ---- | ---- | ---- |
| Baseline | 8192 |16384 |32768 |65536 |
| nb Bits | 13 | 14 | 15 | 16 |
| `Literals_Length_Code` | 32 | 33 | 34 | 35 |
| ---------------------- | ---- | ---- | ---- | ---- |
| Baseline | 8192 |16384 |32768 |65536 |
| `Number_of_Bits` | 13 | 14 | 15 | 16 |
__Default distribution__
When "compression mode" is "predef"",
a pre-defined distribution is used for FSE compression.
When `Compression_Mode` is `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Below is its definition. It uses an accuracy of 6 bits (64 states).
```
short literalLengths_defaultDistribution[36] =
short literalsLength_defaultDistribution[36] =
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
-1,-1,-1,-1 };
```
##### Match Lengths codes
##### Match Length codes
Match lengths codes are values ranging from `0` to `52` included.
Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.
| Code | 0-31 |
| ------ | -------- |
| value | Code + 3 |
| nbBits | 0 |
| `Match_Length_Code` | 0-31 |
| ------------------- | ----------------------- |
| value | `Match_Length_Code` + 3 |
| `Number_of_Bits` | 0 |
| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
| Code | 48 | 49 | 50 | 51 | 52 |
| -------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4098 | 8194 |16486 |32770 |65538 |
| nb Bits | 12 | 13 | 14 | 15 | 16 |
| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4098 | 8194 |16486 |32770 |65538 |
| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 |
__Default distribution__
When "compression mode" is defined as "predef",
a pre-defined distribution is used for FSE compression.
When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 6 bits (64 states).
```
@ -840,31 +840,32 @@ short matchLengths_defaultDistribution[53] =
##### Offset codes
Offset codes are values ranging from `0` to `N`,
with `N` being limited by maximum backreference distance.
Offset codes are values ranging from `0` to `N`.
A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing.
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
An offset code is also the nb of additional bits to read,
and can be translated into an `OFValue` using the following formulae :
An offset code is also the number of additional bits to read,
and can be translated into an `Offset_Value` using the following formulae :
```
OFValue = (1 << offsetCode) + readNBits(offsetCode);
if (OFValue > 3) offset = OFValue - 3;
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4
but is limited by [maximum back-reference distance](#window_descriptor).
OFValue from 1 to 3 are special : they define "repeat codes",
Offset_Value from 1 to 3 are special : they define "repeat codes",
which means one of the previous offsets will be repeated.
They are sorted in recency order, with 1 meaning the most recent one.
See [Repeat offsets](#repeat-offsets) paragraph.
__Default distribution__
When "compression mode" is defined as "predef",
a pre-defined distribution is used for FSE compression.
When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 5 bits (32 states),
and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
@ -882,38 +883,38 @@ short offsetCodes_defaultDistribution[53] =
Following the header, up to 3 distribution tables can be described.
When present, they are in this order :
- Literal lengthes
- Literals lengths
- Offsets
- Match Lengthes
- Match Lengths
The content to decode depends on their respective encoding mode :
- Predef : no content. Use pre-defined distribution table.
- RLE : 1 byte. This is the only code to use across the whole compressed block.
- FSE : A distribution table is present.
- Repeat mode : no content. Re-use distribution from previous compressed block.
- `Predefined_Mode` : no content. Use predefined distribution table.
- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
- `FSE_Compressed_Mode` : A distribution table is present.
- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
##### FSE distribution table : condensed format
An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << AccuracyLog` .
on a normalized scale of `1 << Accuracy_Log` .
It's a bitstream which is read forward, in little-endian fashion.
It's not necessary to know its exact size,
since it will be discovered and reported by the decoding process.
The bitstream starts by reporting on which scale it operates.
`AccuracyLog = low4bits + 5;`
Note that maximum `AccuracyLog` for literal and match lengthes is `9`,
`Accuracy_Log = low4bits + 5`.
Note that maximum `Accuracy_Log` for literal and match length is `9`,
and for offsets it is `8`. Higher values are considered errors.
Then follow each symbol value, from `0` to last present one.
The nb of bits used by each field is variable.
The number of bits used by each field is variable.
It depends on :
- Remaining probabilities + 1 :
__example__ :
Presuming an AccuracyLog of 8,
Presuming an `Accuracy_Log` of 8,
and presuming 100 probabilities points have already been distributed,
the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Therefore, it must read `log2sup(156) == 8` bits.
@ -927,17 +928,17 @@ It depends on :
values from 99 to 156 use 8 bits.
This is achieved through this scheme :
| Value read | Value decoded | nb Bits used |
| ---------- | ------------- | ------------ |
| 0 - 98 | 0 - 98 | 7 |
| 99 - 127 | 99 - 127 | 8 |
| 128 - 226 | 0 - 98 | 7 |
| 227 - 255 | 128 - 156 | 8 |
| Value read | Value decoded | Number of bits used |
| ---------- | ------------- | ------------------- |
| 0 - 98 | 0 - 98 | 7 |
| 99 - 127 | 99 - 127 | 8 |
| 128 - 226 | 0 - 98 | 7 |
| 227 - 255 | 128 - 156 | 8 |
Symbols probabilities are read one by one, in order.
Probability is obtained from Value decoded by following formulae :
`Proba = value - 1;`
`Proba = value - 1`
It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means `less than 1`.
@ -952,9 +953,9 @@ This repeat flag tells how many probabilities of zeroes follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bits repeat flag follows, and so on.
When last symbol reaches cumulated total of `1 << AccuracyLog`,
When last symbol reaches cumulated total of `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes cumulated total go above `1 << AccuracyLog`,
If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
distribution is considered corrupted.
Then the decoder can tell how many bytes were used in this process,
@ -969,14 +970,14 @@ to create a unique decoding table.
It follows the following build rule :
The table has a size of `tableSize = 1 << AccuracyLog;`.
The table has a size of `tableSize = 1 << Accuracy_Log`.
Each cell describes the symbol decoded,
and instructions to get the next state.
Symbols are scanned in their natural order for `less than 1` probabilities.
Symbols with this probability are being attributed a single cell,
starting from the end of the table.
These symbols define a full state reset, reading `AccuracyLog` bits.
These symbols define a full state reset, reading `Accuracy_Log` bits.
All remaining symbols are sorted in their natural order.
Starting from symbol `0` and table position `0`,
@ -1005,7 +1006,7 @@ It receives 5 state values. States are sorted in natural order.
Next power of 2 is 8.
Space of probabilities is divided into 8 equal parts.
Presuming the AccuracyLog is 7, it defines 128 states.
Presuming the `Accuracy_Log` is 7, it defines 128 states.
Divided by 8, each share is 16 large.
In order to reach 8, 8-5=3 lowest states will count "double",
@ -1014,13 +1015,13 @@ requiring one more bit in the process.
Numbering starts from higher states using less bits.
| state order | 0 | 1 | 2 | 3 | 4 |
| ----------- | ----- | ----- | ------ | ---- | ----- |
| width | 32 | 32 | 32 | 16 | 16 |
| nb Bits | 5 | 5 | 5 | 4 | 4 |
| range nb | 2 | 4 | 6 | 0 | 1 |
| baseline | 32 | 64 | 96 | 0 | 16 |
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
| state order | 0 | 1 | 2 | 3 | 4 |
| ---------------- | ----- | ----- | ------ | ---- | ----- |
| width | 32 | 32 | 32 | 16 | 16 |
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
| range number | 2 | 4 | 6 | 0 | 1 |
| baseline | 32 | 64 | 96 | 0 | 16 |
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
Next state is determined from current state
by reading the required number of bits, and adding the specified baseline.
@ -1044,9 +1045,9 @@ The bitstream starts with initial state values,
each using the required number of bits in their respective _accuracy_,
decoded previously from their normalized distribution.
It starts by `Literal Length State`,
followed by `Offset State`,
and finally `Match Length State`.
It starts by `Literals_Length_State`,
followed by `Offset_State`,
and finally `Match_Length_State`.
Reminder : always keep in mind that all values are read _backward_.
@ -1056,9 +1057,9 @@ A state gives a code.
A code provides a baseline and number of bits to add.
See [Symbol Decoding] section for details on each symbol.
Decoding starts by reading the nb of bits required to decode offset.
Decoding starts by reading the number of bits required to decode offset.
It then does the same for match length,
and then for literal length.
and then for literals length.
Offset / matchLength / litLength define a sequence.
It starts by inserting the number of literals defined by `litLength`,
@ -1066,11 +1067,11 @@ then continue by copying `matchLength` bytes from `currentPos - offset`.
The next operation is to update states.
Using rules pre-calculated in the decoding tables,
`Literal Length State` is updated,
followed by `Match Length State`,
and then `Offset State`.
`Literals_Length_State` is updated,
followed by `Match_Length_State`,
and then `Offset_State`.
This operation will be repeated `NbSeqs` times.
This operation will be repeated `Number_of_Sequences` times.
At the end, the bitstream shall be entirely consumed,
otherwise bitstream is considered corrupted.
@ -1081,7 +1082,7 @@ otherwise bitstream is considered corrupted.
As seen in [Offset Codes], the first 3 values define a repeated offset.
They are sorted in recency order, with 1 meaning "most recent one".
There is an exception though, when current sequence's literal length is `0`.
There is an exception though, when current sequence's literals length is `0`.
In which case, repcodes are "pushed by one",
so 1 becomes 2, 2 becomes 3,
and 3 becomes "offset_1 - 1_byte".
@ -1112,31 +1113,32 @@ But dictionaries created by `zstd --train` follow a format, described here.
__Pre-requisites__ : a dictionary has a known length,
defined either by a buffer limit, or a file size.
| Header | DictID | Stats | Content |
| ------ | ------ | ----- | ------- |
| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |
__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
__Dict_ID__ : 4 bytes, stored in Little-Endian format.
DictID can be any value, except 0 (which means no DictID).
__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.
_Reserved ranges :_
_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
any `Dictionary_ID` can be used.
However, for public distribution of compressed frames,
some ranges are reserved for future use :
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767 : reserved
- high range : >= (2^31) : reserved
- low range : 1 - 32767
- high range : >= (2^31)
__Stats__ : Entropy tables, following the same format as a [compressed blocks].
__`Entropy_Tables`__ : following the same format as a [compressed blocks].
They are stored in following order :
Huffman tables for literals, FSE table for offset,
FSE table for matchLenth, and FSE table for litLength.
It's finally followed by 3 offset values, populating recent offsets,
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
__Content__ : Where the actual dictionary content is.
__`Content`__ : Where the actual dictionary content is.
Content size depends on Dictionary size.
[compressed blocks]: #the-format-of-compressed_block