updated doc
This commit is contained in:
parent
d916c908e0
commit
00d44abe71
2
NEWS
2
NEWS
@ -2,7 +2,7 @@ v0.7.3
|
|||||||
added : OpenBSD target, by Juan Francisco Cantero Hurtado
|
added : OpenBSD target, by Juan Francisco Cantero Hurtado
|
||||||
|
|
||||||
v0.7.2
|
v0.7.2
|
||||||
fixed : ZSTD_decompressBlock() using multiple consecutive blocks. Reported by Greg Slazinski
|
fixed : ZSTD_decompressBlock() using multiple consecutive blocks. Reported by Greg Slazinski.
|
||||||
fixed : potential segfault on very large files (many gigabytes). Reported by Chip Turner.
|
fixed : potential segfault on very large files (many gigabytes). Reported by Chip Turner.
|
||||||
fixed : CLI displays system error message when destination file cannot be created (#231). Reported by Chip Turner.
|
fixed : CLI displays system error message when destination file cannot be created (#231). Reported by Chip Turner.
|
||||||
|
|
||||||
|
@ -386,7 +386,7 @@ User Data can be anything. Data will just be skipped by the decoder.
|
|||||||
Compressed block format
|
Compressed block format
|
||||||
-----------------------
|
-----------------------
|
||||||
This specification details the content of a _compressed block_.
|
This specification details the content of a _compressed block_.
|
||||||
A compressed block has a size, which must be known in order to decode it.
|
A compressed block has a size, which must be known.
|
||||||
It also has a guaranteed maximum regenerated size,
|
It also has a guaranteed maximum regenerated size,
|
||||||
in order to properly allocate destination buffer.
|
in order to properly allocate destination buffer.
|
||||||
See "Frame format" for more details.
|
See "Frame format" for more details.
|
||||||
@ -396,19 +396,21 @@ A compressed block consists of 2 sections :
|
|||||||
- Sequences section
|
- Sequences section
|
||||||
|
|
||||||
### Prerequisite
|
### Prerequisite
|
||||||
For proper decoding, a compressed block requires access to following elements :
|
To decode a compressed block, it's required to access to following elements :
|
||||||
- Previous decoded blocks, up to a distance of `windowSize`,
|
- Previous decoded blocks, up to a distance of `windowSize`,
|
||||||
or all previous blocks in the same frame "single segment" mode.
|
or all frame's previous blocks in "single segment" mode.
|
||||||
- List of "recent offsets" from previous compressed block.
|
- List of "recent offsets" from previous compressed block.
|
||||||
|
- Decoding tables of previous compressed block for each symbol type
|
||||||
|
(literals, litLength, matchLength, offset)
|
||||||
|
|
||||||
|
|
||||||
### Compressed Literals
|
### Literals section
|
||||||
|
|
||||||
Literals are compressed using order-0 huffman compression.
|
Literals are compressed using order-0 huffman compression.
|
||||||
During sequence phase, literals will be entangled with match copy operations.
|
During sequence phase, literals will be entangled with match copy operations.
|
||||||
All literals are regrouped in the first part of the block.
|
All literals are regrouped in the first part of the block.
|
||||||
They can be decoded first, and then copied during sequence operations,
|
They can be decoded first, and then copied during sequence operations,
|
||||||
or they can be decoded on the flow, as needed by sequences.
|
or they can be decoded on the flow, as needed by sequence commands.
|
||||||
|
|
||||||
| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
|
| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
|
||||||
| ------ | ------------------ | ------- | --------- | --------- | --------- |
|
| ------ | ------------------ | ------- | --------- | --------- | --------- |
|
||||||
@ -417,9 +419,9 @@ Literals can be compressed, or uncompressed.
|
|||||||
When compressed, an optional tree description can be present,
|
When compressed, an optional tree description can be present,
|
||||||
followed by 1 or 4 streams.
|
followed by 1 or 4 streams.
|
||||||
|
|
||||||
#### Block Literal Header
|
#### Literals section header
|
||||||
|
|
||||||
Header is in charge of describing precisely how literals are packed.
|
Header is in charge of describing how literals are packed.
|
||||||
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
|
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
|
||||||
using big-endian convention.
|
using big-endian convention.
|
||||||
|
|
||||||
@ -491,12 +493,31 @@ Compressed and regenerated size fields follow big endian convention.
|
|||||||
#### Huffman Tree description
|
#### Huffman Tree description
|
||||||
|
|
||||||
This section is only present when block type is _compressed_ (`0`).
|
This section is only present when block type is _compressed_ (`0`).
|
||||||
It describes the different leaf nodes of the huffman tree,
|
|
||||||
and their relative weights.
|
Prefix coding represents symbols from an a priori known
|
||||||
|
alphabet by bit sequences (codes), one code for each symbol, in
|
||||||
|
a manner such that different symbols may be represented by bit
|
||||||
|
sequences of different lengths, but a parser can always parse
|
||||||
|
an encoded string unambiguously symbol-by-symbol.
|
||||||
|
|
||||||
|
Given an alphabet with known symbol frequencies, the Huffman
|
||||||
|
algorithm allows the construction of an optimal prefix code
|
||||||
|
(one which represents strings with those symbol frequencies
|
||||||
|
using the fewest bits of any possible prefix codes for that
|
||||||
|
alphabet). Such a code is called a Huffman code.
|
||||||
|
|
||||||
|
Huffman code must not exceed a maximum code length.
|
||||||
|
More bits improve accuracy but cost more header size,
|
||||||
|
and requires more memory for decoding operations.
|
||||||
|
|
||||||
|
The current format limits the maximum depth to 15 bits.
|
||||||
|
The reference decoder goes further, by limiting it to 11 bits.
|
||||||
|
It is recommended to remain compatible with reference decoder.
|
||||||
|
|
||||||
|
|
||||||
##### Representation
|
##### Representation
|
||||||
|
|
||||||
All byte values from zero (included) to last present one (excluded)
|
All literal values from zero (included) to last present one (excluded)
|
||||||
are represented by `weight` values, from 0 to `maxBits`.
|
are represented by `weight` values, from 0 to `maxBits`.
|
||||||
Transformation from `weight` to `nbBits` follows this formulae :
|
Transformation from `weight` to `nbBits` follows this formulae :
|
||||||
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
||||||
@ -552,7 +573,7 @@ This is a single byte value (0-255), which tells how to decode the tree.
|
|||||||
|
|
||||||
- if headerByte >= 128 : this is a direct representation,
|
- if headerByte >= 128 : this is a direct representation,
|
||||||
where each weight is written directly as a 4 bits field (0-15).
|
where each weight is written directly as a 4 bits field (0-15).
|
||||||
The full representation occupies (nbSymbols+1/2) bytes,
|
The full representation occupies ((nbSymbols+1)/2) bytes,
|
||||||
meaning it uses a last full byte even if nbSymbols is odd.
|
meaning it uses a last full byte even if nbSymbols is odd.
|
||||||
`nbSymbols = headerByte - 127;`
|
`nbSymbols = headerByte - 127;`
|
||||||
|
|
||||||
@ -573,7 +594,7 @@ In this case, it's `255`, since literal values range from `0` to `255`,
|
|||||||
and the last symbol value is not represented.
|
and the last symbol value is not represented.
|
||||||
|
|
||||||
An FSE bitstream starts by a header, describing probabilities distribution.
|
An FSE bitstream starts by a header, describing probabilities distribution.
|
||||||
Result will create a Decoding Table.
|
It will create a Decoding Table.
|
||||||
It is necessary to know the maximum accuracy of distribution
|
It is necessary to know the maximum accuracy of distribution
|
||||||
to properly allocate space for the Table.
|
to properly allocate space for the Table.
|
||||||
For a list of huffman weights, this maximum is 8 bits.
|
For a list of huffman weights, this maximum is 8 bits.
|
||||||
@ -583,7 +604,7 @@ FSE header and bitstreams are described in a separated chapter.
|
|||||||
##### Conversion from weights to huffman prefix codes
|
##### Conversion from weights to huffman prefix codes
|
||||||
|
|
||||||
All present symbols shall now have a `weight` value.
|
All present symbols shall now have a `weight` value.
|
||||||
A `weight` directly represent a `range` of prefix codes,
|
A `weight` directly represents a `range` of prefix codes,
|
||||||
following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
|
following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
|
||||||
Symbols are sorted by weight.
|
Symbols are sorted by weight.
|
||||||
Within same weight, symbols keep natural order.
|
Within same weight, symbols keep natural order.
|
||||||
@ -591,7 +612,7 @@ Starting from lowest weight,
|
|||||||
symbols are being allocated to a range of prefix codes.
|
symbols are being allocated to a range of prefix codes.
|
||||||
Symbols with a weight of zero are not present.
|
Symbols with a weight of zero are not present.
|
||||||
|
|
||||||
It can then proceed to transform weights into nbBits :
|
It is then possible to transform weights into nbBits :
|
||||||
`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
|
`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
|
||||||
|
|
||||||
|
|
||||||
@ -639,13 +660,13 @@ each stream has a size of `(totalSize+3)/4`,
|
|||||||
except the last one, which is up to 3 bytes smaller, to reach totalSize.
|
except the last one, which is up to 3 bytes smaller, to reach totalSize.
|
||||||
|
|
||||||
Compressed size must be provided explicitly : in the 4-streams variant,
|
Compressed size must be provided explicitly : in the 4-streams variant,
|
||||||
bitstream is preceded by 3 unsigned short values using Little Endian convention.
|
bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
|
||||||
Each value represent the compressed size of one stream, in order.
|
Each value represents the compressed size of one stream, in order.
|
||||||
The last stream size is deducted from total compressed size
|
The last stream size is deducted from total compressed size
|
||||||
and from already known stream sizes :
|
and from already known stream sizes :
|
||||||
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
|
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
|
||||||
|
|
||||||
##### Bitstreams reading
|
##### Bitstreams read and decode
|
||||||
|
|
||||||
Each bitstream must be read _backward_,
|
Each bitstream must be read _backward_,
|
||||||
that is starting from the end down to the beginning.
|
that is starting from the end down to the beginning.
|
||||||
@ -662,7 +683,7 @@ Starting from the end,
|
|||||||
it's possible to read the bitstream in a little-endian fashion,
|
it's possible to read the bitstream in a little-endian fashion,
|
||||||
keeping track of already used bits.
|
keeping track of already used bits.
|
||||||
|
|
||||||
Extracting `maxBits`,
|
Reading the last `maxBits` bits,
|
||||||
it's then possible to compare extracted value to the prefix codes table,
|
it's then possible to compare extracted value to the prefix codes table,
|
||||||
determining the symbol to decode and number of bits to discard.
|
determining the symbol to decode and number of bits to discard.
|
||||||
|
|
||||||
@ -672,6 +693,32 @@ hence reaching exactly its beginning position with all bits consumed,
|
|||||||
the decoding process is considered faulty.
|
the decoding process is considered faulty.
|
||||||
|
|
||||||
|
|
||||||
|
### Sequences section
|
||||||
|
|
||||||
|
A compressed block is a succession of _sequences_ .
|
||||||
|
A sequence is a literal copy command, followed by a match copy command.
|
||||||
|
A literal copy command specifies a length.
|
||||||
|
It is the number of bytes to be copied (or extracted) from the literal section.
|
||||||
|
A match copy command specifies an offset and a length.
|
||||||
|
The offset gives the position to copy from,
|
||||||
|
which can stand within a previous block.
|
||||||
|
|
||||||
|
These are 3 symbol types, `literalLength`, `matchLength` and `offset`,
|
||||||
|
which are encoded together, interleaved in a single _bitstream_.
|
||||||
|
|
||||||
|
Each symbol decoding consists of a _code_,
|
||||||
|
which specifies a baseline and a number of additional bits.
|
||||||
|
_Codes_ are FSE compressed,
|
||||||
|
and interleaved with raw additional bits in the same bitstream.
|
||||||
|
|
||||||
|
The Sequence section starts by a header,
|
||||||
|
followed by an optional Probability table for each symbol type,
|
||||||
|
followed by the bitstream.
|
||||||
|
|
||||||
|
#### Sequences section header
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Version changes
|
Version changes
|
||||||
|
Loading…
x
Reference in New Issue
Block a user