commit 9ca73364e6
parent f9cac7a734

updated spec
@@ -712,7 +712,7 @@ static seq_t ZSTD_decodeSequence(seqState_t* seqState)
 if (llCode == 0 && offset <= 1) offset = 1-offset;

 if (offset != 0) {
-    size_t temp = seqState->prevOffset[offset];
+    size_t const temp = seqState->prevOffset[offset];
     if (offset != 1) {
         seqState->prevOffset[2] = seqState->prevOffset[1];
     }
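The hunk above touches the repeat-offset bookkeeping of `ZSTD_decodeSequence()`. For orientation, here is a minimal sketch of the offset-history update rules this code implements, as described under "Offset updates rules" in the spec changes further down; the type and helper names are invented for illustration and do not match the actual decoder.

```c
#include <stddef.h>

/* Illustrative layout : prev[0] is the most recent offset.
 * The spec initializes repeat offsets to 1, 4 and 8 (in order). */
typedef struct { size_t prev[3]; } RepeatOffsets;

/* A "normal" (non-repeat) offset : history is translated by one position,
 * and the new offset takes the first spot. */
static void pushNewOffset(RepeatOffsets* h, size_t newOffset)
{
    h->prev[2] = h->prev[1];
    h->prev[1] = h->prev[0];
    h->prev[0] = newOffset;
}

/* Repeat offset `which` in {1,2,3}, 1 meaning "most recent one".
 * The literal-length == 0 exception (1 means 2, 2 impossible) is resolved
 * before this lookup and is not shown here. */
static size_t useRepeatOffset(RepeatOffsets* h, int which)
{
    size_t const offset = h->prev[which - 1];
    if (which == 2) {              /* swapped with offset 1 */
        h->prev[1] = h->prev[0];
        h->prev[0] = offset;
    } else if (which == 3) {       /* takes first spot, pushing the others by one */
        h->prev[2] = h->prev[1];
        h->prev[1] = h->prev[0];
        h->prev[0] = offset;
    }                              /* which == 1 : history is unmodified */
    return offset;
}
```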
@@ -131,9 +131,6 @@ Typical gains range from ~10% (at 64KB) to x5 better (at <1KB).
 .TP
 .B \-B#
 cut file into independent blocks of size # (default: no block)
-.TP
-.B \-r#
-test all compression levels from 1 to # (default: disabled)


 .SH BUGS
@@ -25,7 +25,7 @@ Introduction
 The purpose of this document is to define a lossless compressed data format,
 that is independent of CPU type, operating system,
 file system and character set, suitable for
-File compression, Pipe and streaming compression
+file compression, pipe and streaming compression,
 using the [Zstandard algorithm](http://www.zstandard.org).

 The data can be produced or consumed,
@@ -99,7 +99,7 @@ __EndMark__
 The flow of blocks ends when the last block header brings an _end signal_ .
 This last block header may optionally host a __Content Checksum__ .

-__Content Checksum__
+##### __Content Checksum__

 Content Checksum verify that frame content has been regenrated correctly.
 The content checksum is the result
@@ -108,7 +108,8 @@ digesting the original (decoded) data as input, and a seed of zero.
 Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
 stored into the last block header.
 ```
-contentChecksum = (XXH64(content, size, 0) >> 11) & (1<<22)-1);
+mask22bits = (1<<22)-1;
+contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
 ```
 Content checksum is only present when its associated flag
 is set in the frame descriptor.
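As a small illustration of the formula in this hunk, a C sketch using XXH64() from xxhash.h with a seed of zero; the helper name is hypothetical.

```c
#include <stddef.h>
#include "xxhash.h"   /* XXH64() */

/* 22-bit content checksum of the regenerated frame content :
 * bits 11 to 32 (included) of XXH64 with a seed of zero. */
static unsigned contentChecksum22(const void* content, size_t size)
{
    unsigned long long const hash = XXH64(content, size, 0);
    unsigned const mask22bits = (1u << 22) - 1;
    return (unsigned)((hash >> 11) & mask22bits);
}
```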
@@ -183,7 +184,7 @@ which requests a memory size beyond decoder's authorized range.
 For broader compatibility, decoders are recommended to support
 memory sizes of at least 8 MB.
 This is just a recommendation,
-as each decoder is free to support higher or lower limits,
+each decoder is free to support higher or lower limits,
 depending on local limitations.

 __Unused bit__
@@ -204,13 +205,14 @@ to signal a feature that must be interpreted in order to decode the frame.
 __Content checksum flag__

 If this flag is set, a content checksum will be present into the EndMark.
-The checksum is a 22 bits value extracted from the XXH64() of data.
-See __Content Checksum__ .
+The checksum is a 22 bits value extracted from the XXH64() of data,
+and stored into endMark. See [__Content Checksum__](#content-checksum) .

 __Dictionary ID flag__

 This is a 2-bits flag (`= FHD & 3`),
-telling if a dictionary ID is provided within the header
+telling if a dictionary ID is provided within the header.
+It also specifies the size of this field.

 | Value | 0 | 1 | 2 | 3 |
 | ------- | --- | --- | --- | --- |
@@ -286,10 +288,10 @@ Format is Little endian.

 When field size is 1, 4 or 8 bytes, the value is read directly.
 When field size is 2, _an offset of 256 is added_.
-It's allowed to represent a small size (ex: `18`) using the 8-bytes variant.
+It's allowed to represent a small size (ex: `18`) using any compatible variant.
 A size of `0` means `content size is unknown`.
 In which case, the `WD` byte will necessarily be present,
-and becomes the only hint to help memory allocation.
+and becomes the only hint to guide memory allocation.

 In order to preserve decoder from unreasonable memory requirement,
 a decoder can refuse a compressed frame
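A hedged sketch of reading the content size field as described in this hunk: little-endian, field sizes of 1, 2, 4 or 8 bytes, with an offset of 256 added to the 2-byte variant. The helper name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only. fieldSize is 1, 2, 4 or 8, as signalled by the frame header. */
static uint64_t readContentSize(const uint8_t* src, size_t fieldSize)
{
    uint64_t value = 0;
    for (size_t i = 0; i < fieldSize; i++)
        value |= (uint64_t)src[i] << (8 * i);      /* little-endian */
    if (fieldSize == 2)
        value += 256;   /* 2-byte variant : "an offset of 256 is added" */
    return value;       /* a size of 0 means "content size is unknown" */
}
```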
@@ -317,8 +319,8 @@ There are 4 block types :
 | ---------- | ---------- | --- | --- | ------- |
 | Block Type | Compressed | Raw | RLE | EndMark |

-- Compressed : this is a Zstandard compressed block,
-detailed in a later part of this specification.
+- Compressed : this is a [Zstandard compressed block](#compressed-block-format),
+detailed in another section of this specification.
 "block size" is the compressed size.
 Decompressed size is unknown,
 but its maximum possible value is guaranteed (see below)
@@ -329,12 +331,12 @@ There are 4 block types :
 while the "compressed" block is just 1 byte (the byte to repeat).
 - EndMark : this is not a block. Signal the end of the frame.
 The rest of the field may be optionally filled by a checksum
-(see frame checksum).
+(see [__Content Checksum__]).

 Block sizes must respect a few rules :
-- In compressed mode, compressed size if always strictly `< contentSize`.
-- Block decompressed size is necessarily <= maximum back-reference distance .
-- Block decompressed size is necessarily <= 128 KB
+- In compressed mode, compressed size if always strictly `< decompressed size`.
+- Block decompressed size is always <= maximum back-reference distance .
+- Block decompressed size is always <= 128 KB


 __Data__
@@ -343,8 +345,8 @@ Where the actual data to decode stands.
 It might be compressed or not, depending on previous field indications.
 A data block is not necessarily "full" :
 since an arbitrary “flush” may happen anytime,
-block content can be any size, up to Block Maximum Size.
-Block Maximum Size is the smallest of :
+block decompressed content can be any size, up to Block Maximum Size.
+Block Maximum Decompressed Size is the smallest of :
 - Max back-reference distance
 - 128 KB

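Expressed as a one-liner (names are hypothetical), the rule in this hunk reads:

```c
#include <stddef.h>

/* Block Maximum Decompressed Size : the smallest of the
 * maximum back-reference distance and 128 KB. */
static size_t blockMaxDecompressedSize(size_t maxBackRefDistance)
{
    size_t const absoluteLimit = 128 * 1024;   /* 128 KB */
    return (maxBackRefDistance < absoluteLimit) ? maxBackRefDistance : absoluteLimit;
}
```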
@@ -388,7 +390,7 @@ This specification details the content of a _compressed block_.
 A compressed block has a size, which must be known.
 It also has a guaranteed maximum regenerated size,
 in order to properly allocate destination buffer.
-See "Frame format" for more details.
+See [Data Blocks](#data-blocks) for more details.

 A compressed block consists of 2 sections :
 - Literals section
@@ -397,15 +399,15 @@ A compressed block consists of 2 sections :
 ### Prerequisites
 To decode a compressed block, the following elements are necessary :
 - Previous decoded blocks, up to a distance of `windowSize`,
-or all frame's previous blocks in "single segment" mode.
+or all previous blocks in "single segment" mode.
 - List of "recent offsets" from previous compressed block.
 - Decoding tables of previous compressed block for each symbol type
-(literals, litLength, matchLength, offset)
+(literals, litLength, matchLength, offset).


 ### Literals section

-Literals are compressed using order-0 huffman compression.
+Literals are compressed using huffman compression.
 During sequence phase, literals will be entangled with match copy operations.
 All literals are regrouped in the first part of the block.
 They can be decoded first, and then copied during sequence operations,
@@ -456,7 +458,7 @@ Sizes format are divided into 2 families :

 For values spanning several bytes, convention is Big-endian.

-__Sizes format for Raw or RLE block__ :
+__Sizes format for Raw or RLE literals block__ :

 - Value : 0x : Regenerated size uses 5 bits (0-31).
 Total literal header size is 1 byte.
@@ -471,7 +473,7 @@ __Sizes format for Raw or RLE block__ :
 Note : it's allowed to represent a short value (ex : `13`)
 using a long format, accepting the reduced compacity.

-__Sizes format for Compressed Block__ :
+__Sizes format for Compressed literals block__ :

 Note : also applicable to "repeat-stats" blocks.
 - Value : 00 : 4 streams
@@ -491,23 +493,23 @@ Compressed and regenerated size fields follow big endian convention.

 #### Huffman Tree description

-This section is only present when block type is _compressed_ (`0`).
+This section is only present when block type is `Compressed` (`0`).

-Prefix coding represents symbols from an a priori known
-alphabet by bit sequences (codes), one code for each symbol, in
-a manner such that different symbols may be represented by bit
-sequences of different lengths, but a parser can always parse
-an encoded string unambiguously symbol-by-symbol.
+Prefix coding represents symbols from an a priori known alphabet
+by bit sequences (codes), one code for each symbol,
+in a manner such that different symbols may be represented
+by bit sequences of different lengths,
+but a parser can always parse an encoded string
+unambiguously symbol-by-symbol.

-Given an alphabet with known symbol frequencies, the Huffman
-algorithm allows the construction of an optimal prefix code
-(one which represents strings with those symbol frequencies
-using the fewest bits of any possible prefix codes for that
-alphabet). Such a code is called a Huffman code.
+Given an alphabet with known symbol frequencies,
+the Huffman algorithm allows the construction of an optimal prefix code
+using the fewest bits of any possible prefix codes for that alphabet.
+Such a code is called a Huffman code.

-Huffman code must not exceed a maximum code length.
+Prefix code must not exceed a maximum code length.
 More bits improve accuracy but cost more header size,
-and requires more memory for decoding operations.
+and require more memory for decoding operations.

 The current format limits the maximum depth to 15 bits.
 The reference decoder goes further, by limiting it to 11 bits.
@@ -552,7 +554,8 @@ Therefore, `maxBits = 4` and `weight[5] = 1`.

 ##### Huffman Tree header

-This is a single byte value (0-255), which tells how to decode the tree.
+This is a single byte value (0-255),
+which tells how to decode the list of weights.

 - if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
 + 242 : 1x1 (+ 1x1)
@@ -896,7 +899,7 @@ The content to decode depends on their respective compression mode :

 An FSE distribution table describes the probabilities of all symbols
 from `0` to the last present one (included)
-on a normalized scale of `2^AccuracyLog` .
+on a normalized scale of `1 << AccuracyLog` .

 It's a bitstream which is read forward, in little-endian fashion.
 It's not necessary to know its exact size,
@@ -952,7 +955,7 @@ This repeat flag tells how many probabilities of zeroes follow the current one.
 It provides a number ranging from 0 to 3.
 If it is a 3, another 2-bits repeat flag follows, and so on.

-When last symbol reaches cumulated total of `2^AccuracyLog`,
+When last symbol reaches cumulated total of `1 << AccuracyLog`,
 decoding is complete.
 Then the decoder can tell how many bytes were used in this process,
 and how many symbols are present.
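A hedged sketch of the zero-run handling described in this hunk. It assumes, as the surrounding context suggests, that the 2-bit repeat flag is read right after a probability of zero; `BitStream` and `readBits()` are hypothetical stand-ins for a forward, little-endian bit reader.

```c
typedef struct BitStream BitStream;              /* hypothetical forward bit reader */
unsigned readBits(BitStream* bs, int nbBits);    /* hypothetical : returns the next nbBits */

/* Called after a probability of zero has been read :
 * returns how many additional symbols also get a probability of zero. */
static int readZeroRun(BitStream* bs)
{
    int zeros = 0;
    unsigned repeat;
    do {
        repeat = readBits(bs, 2);   /* a number ranging from 0 to 3 */
        zeros += (int)repeat;
    } while (repeat == 3);          /* a 3 means another 2-bit repeat flag follows */
    return zeros;
}
```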
@@ -960,15 +963,147 @@ and how many symbols are present.
 The bitstream consumes a round number of bytes.
 Any remaining bit within the last byte is just unused.

-If the last symbol makes cumulated total go above `2^AccuracyLog`,
+If the last symbol makes cumulated total go above `1 << AccuracyLog`,
 distribution is considered corrupted.

 ##### FSE decoding : from normalized distribution to decoding tables

+The distribution of normalized probabilities is enough
+to create a unique decoding table.
+
+It follows the following build rule :
+
+The table has a size of `tableSize = 1 << AccuracyLog;`.
+Each cell describes the symbol decoded,
+and instructions to get the next state.
+
+Symbols are scanned in their natural order for `less than 1` probabilities.
+Symbols with this probability are being attributed a single cell,
+starting from the end of the table.
+These symbols define a full state reset, reading `AccuracyLog` bits.
+
+All remaining symbols are sorted in their natural order.
+Starting from symbol `0` and table position `0`,
+each symbol gets attributed as many cells as its probability.
+Cell allocation is spreaded, not linear :
+each successor position follow this rule :
+
+`position += (tableSize>>1) + (tableSize>>3) + 3);
+position &= tableSize-1;`
+
+A position is skipped if already occupied,
+typically by a "less than 1" probability symbol.
+
+The result is a list of state values.
+Each state will decode the current symbol.
+
+To get the Number of bits and baseline required for next state,
+it's first necessary to sort all states in their natural order.
+The lower states will need 1 bit more than higher ones.
+
+__Example__ :
+Presuming a symbol has a probability of 5.
+It receives 5 state values.
+
+Next power of 2 is 8.
+Space of probabilities is divided into 8 equal parts.
+Presuming the AccuracyLog is 7, it defines 128 states.
+Divided by 8, each share is 16 large.
+
+In order to reach 8, 8-5=3 lowest states will count "double",
+taking shares twice larger,
+requiring one more bit in the process.
+
+Numbering starts from higher states using less bits.
+
+| state order | 0     | 1     | 2      | 3    | 4     |
+| ----------- | ----- | ----- | ------ | ---- | ----- |
+| width       | 32    | 32    | 32     | 16   | 16    |
+| nb Bits     | 5     | 5     | 5      | 4    | 4     |
+| range nb    | 2     | 4     | 6      | 0    | 1     |
+| baseline    | 32    | 64    | 96     | 0    | 16    |
+| range       | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
+
+Next state is determined from current state
+by reading the required number of bits, and adding the specified baseline.
+
+
 #### Bitstream

+All sequences are stored in a single bitstream, read _backward_.
+It is therefore necessary to know the bitstream size,
+which is deducted from compressed block size.
+
+The exact last bit of the stream is followed by a set-bit-flag.
+The highest bit of last byte is this flag.
+It does not belong to the useful part of the bitstream.
+Therefore, last byte has 0-7 useful bits.
+Note that it also means that last byte cannot be `0`.
+
+##### Starting states
+
+The bitstream starts with initial state values,
+each using the required number of bits in their respective _accuracy_,
+decoded previously from their normalized distribution.
+
+It starts by `Literal Length State`,
+followed by `Offset State`,
+and finally `Match Length State`.
+
+Reminder : always keep in mind that all values are read _backward_.
+
+##### Decoding a sequence
+
+A state gives a code.
+A code provides a baseline and number of bits to add.
+See [Symbol Decoding] section for details on each symbol.
+
+Decoding starts by reading the nb of bits required to decode offset.
+It then does the same for match length,
+and then for literal length.
+
+Offset / matchLength / litLength define a sequence, which can be applied.
+
+The next operation is to update states.
+Using rules pre-calculated in the decoding tables,
+`Literal Length State` is updated,
+followed by `Match Length State`,
+and then `Offset State`.
+
+This operation will be repeated `NbSeqs` times.
+At the end, the bitstream shall be entirely consumed,
+otherwise bitstream is considered corrupted.
+
+[Symbol Decoding]:#symbols-decoding
+
+##### Repeat offsets
+
+As seen in [Offset Codes], the first 3 values define a repeated offset.
+They are sorted in recency order, with 1 meaning "most recent one".
+
+There is an exception though, when current sequence's literal length is `0`.
+In which case, 1 would just make previous match longer.
+Therefore, in such case, 1 means in fact 2, and 2 is impossible.
+Meaning of 3 is unmodified.
+
+Repeat offsets start with the following values : 1, 4 and 8 (in order).
+
+Then each block receives its start value from previous compressed block.
+Note that non-compressed blocks are skipped,
+they do not contribute to offset history.
+
+[Offset Codes]: #offset-codes
+
+###### Offset updates rules
+
+When the new offset is a normal one,
+offset history is simply translated by one position,
+with the new offset taking first spot.
+
+- When repeat offset 1 (most recent) is used, history is unmodified.
+- When repeat offset 2 is used, it's swapped with offset 1.
+- When repeat offset 3 is used, it takes first spot,
+pushing the other ones by one position.