Zstandard Compression Format
============================

### Notices

Copyright (c) 2016 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

0.2.0 (22/07/16)

Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the Zstandard compression method,
and an optional [xxHash-64 checksum method](http://www.xxhash.org),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore informative fields, such as the checksum.
Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.

Overall conventions
-------------------
In this document, square brackets, i.e. `[` and `]`, are used to indicate optional fields or parameters.

Definitions
-----------
Content compressed by Zstandard is transformed into a Zstandard __frame__.
Multiple frames can be appended into a single file or stream.
A frame is totally independent, has a defined beginning and end,
and a set of parameters which tells the decoder how to decompress it.

A frame encapsulates one or multiple __blocks__.
Each block can be compressed or not,
and has a guaranteed maximum content size, which depends on frame parameters.
Unlike frames, each block depends on previous blocks for proper decoding.
However, each block can be decompressed without waiting for its successor,
allowing streaming operations.

Frame Concatenation
-------------------

In some circumstances, it may be required to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame brings its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file is left outside of this specification.
As an example, the reference `zstd` command line utility is able
to decode all concatenated frames in their sequential order,
delivering the final decompressed result as if it were a single content.

General Structure of Zstandard Frame format
-------------------------------------------
The structure of a single Zstandard frame is as follows:

| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
| 4 bytes        |  2-14 bytes    | n bytes    |                    |   0-4 bytes          |

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0xFD2FB527

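
As an illustration of the little-endian convention, the check below reads the first 4 bytes of a buffer and compares them with the magic number stated above. This is only a sketch: the names `ZSTD_MAGIC_NUMBER` and `has_zstd_magic` are hypothetical and not part of this specification.

```python
import struct

# Magic number as given in this version of the specification.
ZSTD_MAGIC_NUMBER = 0xFD2FB527


def has_zstd_magic(data: bytes) -> bool:
    """Return True if `data` starts with the Zstandard frame magic number."""
    if len(data) < 4:
        return False
    # "<I" reads one unsigned 32-bit value in little-endian byte order.
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == ZSTD_MAGIC_NUMBER
```

Because the value is stored little-endian, a frame starts with the byte sequence `27 B5 2F FD`, lowest byte first.
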
__`Frame_Header`__

2 to 14 Bytes, detailed in [next part](#the-structure-of-frame_header).

__`Data_Block`__

Detailed in [next chapter](#the-structure-of-data_block).
That’s where compressed data is stored.

__`Content_Checksum`__

An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of the [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, with a seed of zero.
The low 4 bytes of the checksum are stored in little endian format.

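
The truncation and byte order can be sketched as follows. The 64-bit digest is assumed to come from an XXH64 implementation (seed 0) that is not shown here; `encode_content_checksum` is a hypothetical helper name.

```python
import struct


def encode_content_checksum(xxh64_digest: int) -> bytes:
    """Keep the low 32 bits of a 64-bit digest and store them little-endian."""
    low32 = xxh64_digest & 0xFFFFFFFF
    return struct.pack("<I", low32)
```

For example, a digest of `0x0123456789ABCDEF` would be stored as the 4 bytes `EF CD AB 89`.
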
The structure of `Frame_Header`
-------------------------------
The `Frame_Header` has a variable size, which uses a minimum of 2 bytes,
and up to 14 bytes depending on optional parameters.
The structure of `Frame_Header` is as follows:

| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
| ------------------------- | --------------------- | ----------------- | ---------------------- |
| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |

### `Frame_Header_Descriptor`

The first byte of the header is called the `Frame_Header_Descriptor`.
It tells which other fields are present.
Decoding this byte is enough to tell the size of `Frame_Header`.

| Bit number | Field name |
| ---------- | ---------- |
| 7-6 | `Frame_Content_Size_flag` |
| 5 | `Single_Segment_flag` |
| 4 | `Unused_bit` |
| 3 | `Reserved_bit` |
| 2 | `Content_Checksum_flag` |
| 1-0 | `Dictionary_ID_flag` |

In this table, bit 7 is the highest bit, while bit 0 is the lowest.

__`Frame_Content_Size_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor >> 6`),
specifying whether the decompressed data size is provided within the header.
The `Flag_Value` can be converted into `Field_Size`,
which is the number of bytes used by `Frame_Content_Size`
according to the following table:

|`Flag_Value`| 0 | 1 | 2 | 3 |
| ---------- | --- | --- | --- | --- |
|`Field_Size`| 0-1 | 2 | 4 | 8 |

When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
if `Single_Segment_flag` is set, `Field_Size` is 1.
Otherwise, `Field_Size` is 0 (content size not provided).

__`Single_Segment_flag`__

If this flag is set,
data must be regenerated within a single continuous memory segment.

In this case, `Frame_Content_Size` is necessarily present,
but the `Window_Descriptor` byte is skipped.
As a consequence, the decoder must allocate a memory segment
of size equal to or bigger than `Frame_Content_Size`.

In order to protect the decoder from unreasonable memory requirements,
a decoder can reject a compressed frame
which requests a memory size beyond the decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is just a recommendation;
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Unused_bit`__

The value of this bit should be set to zero.
A decoder compliant with this specification version shall not interpret it.
It might be used in a future version,
to signal a property which is not mandatory to properly decode the frame.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted to decode the frame correctly.

__`Content_Checksum_flag`__

If this flag is set, a 32-bit `Content_Checksum` will be present at the frame's end.
See `Content_Checksum` paragraph.

__`Dictionary_ID_flag`__

This is a 2-bit flag (`= Frame_Header_Descriptor & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field.

| Value | 0 | 1 | 2 | 3 |
| -------- | --- | --- | --- | --- |
|Field size| 0 | 1 | 2 | 4 |

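
Since the descriptor byte alone determines the size of `Frame_Header`, the field rules above can be gathered into one small routine. The following is a sketch under the rules stated in this section; the function name and the dictionary keys are hypothetical, chosen for illustration only.

```python
def parse_frame_header_descriptor(fhd: int) -> dict:
    """Split a Frame_Header_Descriptor byte into fields and derive field sizes."""
    if not 0 <= fhd <= 0xFF:
        raise ValueError("descriptor must be a single byte")

    fcs_flag = fhd >> 6                    # bits 7-6
    single_segment = bool(fhd & 0x20)      # bit 5
    reserved_bit = bool(fhd & 0x08)        # bit 3 (bit 4 is unused, not interpreted)
    content_checksum = bool(fhd & 0x04)    # bit 2
    dict_id_flag = fhd & 3                 # bits 1-0

    if reserved_bit:
        raise ValueError("Reserved_bit must be zero in this specification version")

    # Frame_Content_Size: flag 0 means 1 byte if Single_Segment_flag is set, else absent.
    if fcs_flag == 0:
        fcs_size = 1 if single_segment else 0
    else:
        fcs_size = (2, 4, 8)[fcs_flag - 1]

    # Window_Descriptor is skipped when Single_Segment_flag is set.
    window_descriptor_size = 0 if single_segment else 1
    dict_id_size = (0, 1, 2, 4)[dict_id_flag]

    return {
        "single_segment": single_segment,
        "content_checksum": content_checksum,
        "window_descriptor_size": window_descriptor_size,
        "dictionary_id_size": dict_id_size,
        "frame_content_size_size": fcs_size,
        "frame_header_size": 1 + window_descriptor_size + dict_id_size + fcs_size,
    }
```

With these rules the header size ranges from 2 bytes (for example descriptor `0x00` or `0x20`) to 14 bytes (descriptor `0xC3`), matching the bounds given for `Frame_Header`.
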
### `Window_Descriptor`

Provides guarantees on maximum back-reference distance
that will be used within compressed data.
This information is important for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

| Bit numbers | 7-3 | 2-0 |
| ----------- | -------- | -------- |
| Field name | Exponent | Mantissa |

Maximum distance is given by the following formulas :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 3.75 TB.

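
The formulas above translate directly into code. This is a minimal sketch; `window_size_from_descriptor` is a hypothetical name, and the bit split follows the `Exponent` / `Mantissa` table.

```python
def window_size_from_descriptor(window_descriptor: int) -> int:
    """Compute windowSize in bytes from a Window_Descriptor byte."""
    if not 0 <= window_descriptor <= 0xFF:
        raise ValueError("Window_Descriptor must be a single byte")
    exponent = window_descriptor >> 3      # upper 5 bits
    mantissa = window_descriptor & 7       # lower 3 bits
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

A descriptor of `0x00` gives the 1 KB minimum, `0xFF` gives `15*(1<<38)` bytes, and `0x68` (exponent 13, mantissa 0) gives exactly 8 MB.
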
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.

In order to protect the decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond the decoder's authorized range.

For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB,
and encoders are recommended to not request more than 8 MB.
It's merely a recommendation though;
decoders are free to support larger or lower limits,
depending on local limitations.

### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
Note that this field is optional. When it's not present,
it's up to the caller to make sure it uses the correct dictionary.

Field size depends on `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.

It's allowed to represent a small ID (for example `13`)
with a large 4-byte dictionary ID, losing some compactness in the process.

_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)

### `Frame_Content_Size`

This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to the value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is Little-endian.

| `Field_Size` | Range |
| ------------ | ---------- |
| 1 | 0 - 255 |
| 2 | 256 - 65791|
| 4 | 0 - 2^32-1 |
| 8 | 0 - 2^64-1 |

When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `Field_Size` is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.

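
The decoding rule, including the 256 offset of the 2-byte variant, can be sketched as below. `decode_frame_content_size` is a hypothetical helper that receives exactly the bytes of the field.

```python
def decode_frame_content_size(field: bytes) -> int:
    """Decode Frame_Content_Size from its 1, 2, 4 or 8 raw bytes."""
    if len(field) not in (1, 2, 4, 8):
        raise ValueError("Field_Size must be 1, 2, 4 or 8 bytes")
    value = int.from_bytes(field, "little")
    if len(field) == 2:
        value += 256   # the 2-byte variant covers 256 - 65791
    return value
```

The size `18` can therefore be stored as the single byte `12` or as the 4 bytes `12 00 00 00`, but not with the 2-byte variant, whose smallest value is 256.
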
The structure of `Data_Block`
-----------------------------
The structure of `Data_Block` is as follows:

| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
|:------------:|:------------:|:------------:|:---------------:|
| 1 bit | 2 bits | 21 bits | n bytes |

The block header uses 3 bytes.

__`Last_Block`__

The lowest bit signals if this block is the last one.
The frame ends right after this block.
It may be followed by an optional `Content_Checksum`.

__`Block_Type` and `Block_Size`__

The next 2 bits represent the `Block_Type`,
while the remaining 21 bits represent the `Block_Size`.
Format is __little-endian__.

There are 4 block types :

| Value | 0 | 1 | 2 | 3 |
| ------------ | ----------- | ----------- | ------------------ | --------- |
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|

- `Raw_Block` - this is an uncompressed block.
  `Block_Size` is the number of bytes to read and copy.
- `RLE_Block` - this is a single byte, repeated N times.
  In which case, `Block_Size` is the size to regenerate,
  while the "compressed" block is just 1 byte (the byte to repeat).
- `Compressed_Block` - this is a [Zstandard compressed block](#the-format-of-compressed_block),
  detailed in another section of this specification.
  `Block_Size` is the compressed size.
  Decompressed size is unknown,
  but its maximum possible value is guaranteed (see below).
- `Reserved` - this is not a block.
  This value cannot be used with the current version of this specification.

Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly `< decompressed size`.
- Block decompressed size is always <= maximum back-reference distance.
- Block decompressed size is always <= 128 KB

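
The 3-byte block header described above can be unpacked as a single little-endian 24-bit value. The sketch below follows the bit layout of the table (`Last_Block` in the lowest bit, then `Block_Type`, then `Block_Size`); the function names are hypothetical.

```python
BLOCK_TYPES = ("Raw_Block", "RLE_Block", "Compressed_Block", "Reserved")


def parse_block_header(header: bytes):
    """Return (last_block, block_type, block_size) from a 3-byte block header."""
    if len(header) != 3:
        raise ValueError("block header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    last_block = bool(value & 1)              # lowest bit
    block_type = BLOCK_TYPES[(value >> 1) & 3]  # next 2 bits
    block_size = value >> 3                   # remaining 21 bits
    if block_type == "Reserved":
        raise ValueError("Reserved block type cannot be used")
    return last_block, block_type, block_size


def stored_content_size(block_type: str, block_size: int) -> int:
    """Number of Block_Content bytes that follow the header in the stream."""
    # An RLE block stores only the byte to repeat; Block_Size is the regenerated size.
    return 1 if block_type == "RLE_Block" else block_size
```

For instance, the header bytes `29 00 00` describe a last `Raw_Block` of 5 bytes, while `42 1F 00` describes a non-last `RLE_Block` regenerating 1000 bytes from a single stored byte.
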
__`Block_Content`__

The `Block_Content` is where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary “flush” may happen anytime,
block decompressed content can be any size,
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- Maximum back-reference distance
- 128 KB

Skippable Frames
----------------

| `Magic_Number` | `Frame_Size` | `User_Data` |
|:--------------:|:------------:|:-----------:|
| 4 bytes | 4 bytes | n bytes |

Skippable frames allow the insertion of user-defined data
into a flow of concatenated frames.
Their design is straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

Skippable frames defined in this specification are compatible with [LZ4] ones.

[LZ4]:http://www.lz4.org

__`Magic_Number`__

4 Bytes, Little-endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__`Frame_Size`__

This is the size, in bytes, of the following `User_Data`
(excluding both the magic number and the size field itself).
This field is represented using 4 Bytes, Little-endian format, as an unsigned 32-bit value.
This means `User_Data` can’t be bigger than (2^32-1) bytes.

__`User_Data`__

The `User_Data` can be anything. Data will just be skipped by the decoder.

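
Skipping such a frame only requires reading its 8-byte prefix. The following sketch shows one way a decoder might do it; `skippable_frame_length` is a hypothetical helper, not part of this specification.

```python
import struct


def skippable_frame_length(data: bytes, pos: int = 0):
    """Return the total length of the skippable frame at `pos`, or None if there is none."""
    if len(data) - pos < 8:
        return None
    # Magic_Number and Frame_Size are both 4-byte little-endian unsigned values.
    magic, frame_size = struct.unpack_from("<II", data, pos)
    # Any of the 16 values 0x184D2A50 .. 0x184D2A5F identifies a skippable frame.
    if magic & 0xFFFFFFF0 != 0x184D2A50:
        return None
    return 8 + frame_size   # magic + size field + User_Data
```

A decoder walking a stream of concatenated frames could advance its position by the returned length and resume decoding at the next frame.
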
The format of `Compressed_Block`
--------------------------------
The size of `Compressed_Block` must be provided using the `Block_Size` field from `Data_Block`.
The `Compressed_Block` has a guaranteed maximum regenerated size,
in order to properly allocate the destination buffer.
See [`Data_Block`](#the-structure-of-data_block) for more details.

A compressed block consists of 2 sections :
- [`Literals_Section`](#literals_section)
- [`Sequences_Section`](#sequences_section)

### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `windowSize`,
  or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
  (literals, litLength, matchLength, offset).

### `Literals_Section`

During the sequence phase, literals will be interleaved with match copy operations.
All literals are regrouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the fly, as needed by sequence commands.

| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |

Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.

#### `Literals_Section_Header`

This header describes how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using little-endian convention.

| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
| --------------------- | ------------- | ------------------ | ------------------- |
| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |

In this representation, bits on the left are the lowest bits.

__`Literals_Block_Type`__

This field uses the 2 lowest bits of the first byte, describing 4 different block types :

| Value | 0 | 1 | 2 | 3 |
| --------------------- | -------------------- | -------------------- | --------------------------- | ----------------------------- |
| `Literals_Block_Type` | `Raw_Literals_Block` | `RLE_Literals_Block` | `Compressed_Literals_Block` | `Repeat_Stats_Literals_Block` |

- `Raw_Literals_Block` - Literals are stored uncompressed.
- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
  starting with a Huffman tree description.
  See details below.
- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
  using Huffman tree _from previous Huffman-compressed literals block_.
  Huffman tree description will be skipped.

__`Size_Format`__

`Size_Format` is divided into 2 families :

- For `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`, both `Compressed_Size`
  and `Regenerated_Size` (the decompressed size) must be decoded.
  `Size_Format` also gives the number of streams.
- For `Raw_Literals_Block` and `RLE_Literals_Block`, it's enough to decode `Regenerated_Size`.

For values spanning several bytes, convention is Little-endian.

__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
  Total literal header size is 1 byte.
  `size = h[0]>>3;`
- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
  Total literal header size is 2 bytes.
  `size = (h[0]>>4) + (h[1]<<4);`
- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
  Total literal header size is 3 bytes.
  `size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`

Note : it's allowed to represent a short value (ex : `13`)
using a long format, accepting the reduced compactness.

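
The three size formats for raw and RLE literals can be decoded with the expressions given above. The sketch below assumes `h` holds the header bytes and that `Size_Format` occupies bits 2-3 of the first byte, right after the 2-bit `Literals_Block_Type`; the function name is hypothetical.

```python
def decode_raw_rle_literals_header(h: bytes):
    """Return (header_size, regenerated_size) for Raw/RLE literals headers."""
    size_format = (h[0] >> 2) & 3
    if size_format & 1 == 0:
        # "x0": only one bit of Size_Format is used, bit 3 already belongs to the size.
        return 1, h[0] >> 3
    if size_format == 1:
        # "01": 12-bit size spread over 2 bytes.
        return 2, (h[0] >> 4) + (h[1] << 4)
    # "11": 20-bit size spread over 3 bytes.
    return 3, (h[0] >> 4) + (h[1] << 4) + (h[2] << 12)
```

The value `13` can thus be read from the single byte `68`, or from the 2-byte form `D4 00`, at the cost of one extra header byte.
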
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :

- Value : 00 : _Single stream_.
  `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
  Total literal header size is 3 bytes.
- Value : 01 : 4 streams.
  `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
  Total literal header size is 3 bytes.
- Value : 10 : 4 streams.
  `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
  Total literal header size is 4 bytes.
- Value : 11 : 4 streams.
  `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
  Total literal header size is 5 bytes.

`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.

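
A possible decoding of these four variants is sketched below. It assumes the fields are packed in the order shown in the `Literals_Section_Header` table (`Literals_Block_Type`, `Size_Format`, `Regenerated_Size`, `Compressed_Size`), lowest bits first, within one little-endian value; that ordering is an interpretation of the table, and the function name is hypothetical.

```python
# Size_Format value -> (header size in bytes, bits per size field, number of streams)
_COMPRESSED_FORMATS = {
    0: (3, 10, 1),
    1: (3, 10, 4),
    2: (4, 14, 4),
    3: (5, 18, 4),
}


def decode_compressed_literals_header(h: bytes):
    """Return (header_size, streams, regenerated_size, compressed_size)."""
    size_format = (h[0] >> 2) & 3
    header_size, bits, streams = _COMPRESSED_FORMATS[size_format]
    if len(h) < header_size:
        raise ValueError("truncated literals header")
    value = int.from_bytes(h[:header_size], "little")
    mask = (1 << bits) - 1
    regenerated_size = (value >> 4) & mask          # after the 4 type/format bits
    compressed_size = (value >> (4 + bits)) & mask  # after Regenerated_Size
    return header_size, streams, regenerated_size, compressed_size
```

Each variant fills its header exactly: 4 + 2×10 bits is 3 bytes, 4 + 2×14 bits is 4 bytes, and 4 + 2×18 bits is 5 bytes.
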
#### `Huffman_Tree_Description`

This section is only present when `Literals_Block_Type` is `Compressed_Literals_Block` (`2`).

Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codewords), one codeword for each symbol,
in a manner such that different symbols may be represented
by bit sequences of different lengths,
but a parser can always parse an encoded string
unambiguously symbol-by-symbol.

Given an alphabet with known symbol frequencies,
the Huffman algorithm allows the construction of an optimal prefix code
using the fewest bits of any possible prefix codes for that alphabet.

Prefix code must not exceed a maximum code length.
More bits improve accuracy but cost more header size,
and require more memory or more complex decoding operations.
This specification limits maximum code length to 11 bits.

|
2016-07-03 16:29:47 -07:00
|
|
|
|
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
##### Representation
|
|
|
|
|
|
2016-07-03 16:29:47 -07:00
|
|
|
|
All literal values from zero (included) to last present one (excluded)
|
2016-07-03 09:49:35 -07:00
|
|
|
|
are represented by `weight` values, from 0 to `maxBits`.
|
|
|
|
|
Transformation from `weight` to `nbBits` follows this formula :
|
|
|
|
|
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
|
|
|
|
The last symbol's weight is deduced from previously decoded ones,
|
|
|
|
|
by completing to the nearest power of 2.
|
|
|
|
|
This power of 2 gives `maxBits`, the depth of the current tree.
|
|
|
|
|
|
|
|
|
|
__Example__ :
|
2016-08-03 07:16:38 -07:00
|
|
|
|
Let's presume the following Huffman tree must be described :
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
2016-07-03 15:42:58 -07:00
|
|
|
|
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
|
|
|
|
| ------- | --- | --- | --- | --- | --- | --- |
|
|
|
|
|
| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
The tree depth is 4, since its longest code uses 4 bits.
|
|
|
|
|
Value `5` will not be listed, nor will values above `5`.
|
|
|
|
|
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
|
|
|
|
|
Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
|
|
|
|
|
It gives the following series of weights :
|
|
|
|
|
|
2016-07-03 15:42:58 -07:00
|
|
|
|
| weights | 4 | 3 | 2 | 0 | 1 |
|
|
|
|
|
| ------- | --- | --- | --- | --- | --- |
|
|
|
|
|
| literal | 0 | 1 | 2 | 3 | 4 |
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
The decoder will do the inverse operation :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
having collected weights of literals from `0` to `4`,
|
|
|
|
|
it knows the last literal, `5`, is present with a non-zero weight.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
The weight of `5` can be deduced by completing to the nearest power of 2.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
Sum of 2^(weight-1) (excluding 0) is :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
`8 + 4 + 2 + 0 + 1 = 15`
|
2016-07-03 09:49:35 -07:00
|
|
|
|
Nearest power of 2 is 16.
|
|
|
|
|
Therefore, `maxBits = 4` and `weight[5] = 1`.
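
The deduction in the example above can be sketched in C as follows. This is only an illustration: the function name `deduceLastWeight` and its error convention (returning `0`) are assumptions made for this sketch, not part of the format. It reads "nearest power of 2" as the smallest power of 2 strictly greater than the sum, since the last symbol always has a non-zero weight.

```c
#include <stddef.h>

/* Returns the deduced weight of the last symbol and stores the tree depth
 * in *maxBits. Returns 0 when the listed weights cannot be completed to a
 * power of 2 by a single additional weight (hypothetical error convention). */
static int deduceLastWeight(const unsigned char *weights, size_t count, int *maxBits)
{
    unsigned sum = 0;
    for (size_t i = 0; i < count; i++)
        if (weights[i] > 0) sum += 1u << (weights[i] - 1);   /* 2^(weight-1) */
    if (sum == 0) return 0;

    /* Smallest power of 2 strictly greater than sum. */
    int depth = 0;
    while ((1u << depth) <= sum) depth++;

    unsigned rest = (1u << depth) - sum;     /* share left for the last symbol */
    if (rest & (rest - 1)) return 0;         /* not a power of 2 : invalid list */

    int lastWeight = 1;
    while ((1u << (lastWeight - 1)) < rest) lastWeight++;
    *maxBits = depth;
    return lastWeight;
}
```

Called with the weights `4, 3, 2, 0, 1` from the example, the sum is 15, so the sketch reports a depth of 4 and a last weight of 1.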
|
|
|
|
|
|
|
|
|
|
##### Huffman Tree header
|
|
|
|
|
|
2016-07-05 01:53:38 -07:00
|
|
|
|
This is a single byte value (0-255),
|
|
|
|
|
which tells how to decode the list of weights.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
- if headerByte >= 128 : this is a direct representation,
|
|
|
|
|
where each weight is written directly as a 4 bits field (0-15).
|
2016-07-05 02:50:37 -07:00
|
|
|
|
The full representation occupies `((nbSymbols+1)/2)` bytes,
|
2016-07-03 09:49:35 -07:00
|
|
|
|
meaning it uses a last full byte even if nbSymbols is odd.
|
2016-07-08 01:42:59 -07:00
|
|
|
|
`nbSymbols = headerByte - 127;`.
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Note that maximum nbSymbols is 255-127 = 128.
|
2016-07-08 01:42:59 -07:00
|
|
|
|
A larger series must necessarily use FSE compression.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
- if headerByte < 128 :
|
|
|
|
|
the series of weights is compressed by FSE.
|
2016-07-08 01:42:59 -07:00
|
|
|
|
The length of the FSE-compressed series is `headerByte` (0-127).
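
As a hedged illustration of the two cases above, the following C sketch classifies a header byte and computes how many bytes follow it. The type `HufHeaderInfo` and the function `parseHufHeaderByte` are hypothetical names introduced for this example only.

```c
typedef struct {
    int isFse;          /* 1 if weights are FSE-compressed, 0 if written directly */
    int nbSymbols;      /* number of listed weights (direct mode only, else 0) */
    int payloadBytes;   /* number of bytes following the header byte */
} HufHeaderInfo;

static HufHeaderInfo parseHufHeaderByte(unsigned char headerByte)
{
    HufHeaderInfo info;
    if (headerByte >= 128) {
        /* Direct representation : one 4-bit field per weight. */
        info.isFse = 0;
        info.nbSymbols = headerByte - 127;
        info.payloadBytes = (info.nbSymbols + 1) / 2;   /* last byte is full even if odd */
    } else {
        /* FSE-compressed : headerByte is the compressed length. */
        info.isFse = 1;
        info.nbSymbols = 0;   /* only known after decoding the FSE stream */
        info.payloadBytes = headerByte;
    }
    return info;
}
```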
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
2016-08-03 07:16:38 -07:00
|
|
|
|
##### FSE (Finite State Entropy) compression of Huffman weights
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
2016-07-08 01:42:59 -07:00
|
|
|
|
The series of weights is compressed using FSE compression.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
It's a single bitstream with 2 interleaved states,
|
2016-07-08 01:42:59 -07:00
|
|
|
|
sharing a single distribution table.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
To decode an FSE bitstream, it is necessary to know its compressed size.
|
|
|
|
|
Compressed size is provided by `headerByte`.
|
2016-07-24 06:35:59 -07:00
|
|
|
|
It's also necessary to know its _maximum possible_ decompressed size,
|
2016-07-08 01:42:59 -07:00
|
|
|
|
which is `255`, since literal values span from `0` to `255`,
|
2016-07-05 02:50:37 -07:00
|
|
|
|
and last symbol value is not represented.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
An FSE bitstream starts with a header describing the distribution of probabilities.
|
2016-07-03 16:29:47 -07:00
|
|
|
|
It will create a Decoding Table.
|
2016-07-08 01:42:59 -07:00
|
|
|
|
The table must be pre-allocated, which requires an upper bound on accuracy.
|
2016-08-03 07:16:38 -07:00
|
|
|
|
For a list of Huffman weights, maximum accuracy is 7 bits.
|
2016-07-08 01:42:59 -07:00
|
|
|
|
|
|
|
|
|
FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
|
|
|
|
|
and so is [FSE bitstream](#bitstream).
|
|
|
|
|
The main difference is that Huffman header compression uses 2 states,
|
|
|
|
|
which share the same FSE distribution table.
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Bitstream contains only FSE symbols (no interleaved "raw bitfields").
|
2016-07-08 01:42:59 -07:00
|
|
|
|
The number of symbols to decode is discovered
|
|
|
|
|
by tracking bitStream overflow condition.
|
|
|
|
|
When both states have overflowed the bitstream, end is reached.
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
|
|
|
|
|
2016-08-03 07:16:38 -07:00
|
|
|
|
##### Conversion from weights to Huffman prefix codes
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
2016-07-03 15:42:58 -07:00
|
|
|
|
All present symbols shall now have a `weight` value.
|
2016-07-24 06:35:59 -07:00
|
|
|
|
It is possible to transform weights into nbBits, using this formula :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
|
|
|
|
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Symbols are sorted by weight. Within same weight, symbols keep natural order.
|
|
|
|
|
Symbols with a weight of zero are removed.
|
|
|
|
|
Then, starting from lowest weight, prefix codes are distributed in order.
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
|
|
|
|
__Example__ :
|
2016-07-15 08:31:13 -07:00
|
|
|
|
Let's presume the following list of weights has been decoded :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
|
|
|
|
| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
|
|
|
|
| ------- | --- | --- | --- | --- | --- | --- |
|
|
|
|
|
| weight | 4 | 3 | 2 | 0 | 1 | 1 |
|
|
|
|
|
|
|
|
|
|
Sorted by weight and then natural order,
|
|
|
|
|
it gives the following distribution :
|
|
|
|
|
|
|
|
|
|
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
|
|
|
|
|
| ------------ | --- | --- | --- | --- | --- | ---- |
|
|
|
|
|
| weight | 0 | 1 | 1 | 2 | 3 | 4 |
|
|
|
|
|
| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
|
2016-07-15 08:31:13 -07:00
|
|
|
|
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
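
The distribution of prefix codes shown in this example can be sketched in C as below. This is one possible way to implement the rule, not necessarily how any given decoder does it; the function name `assignPrefixCodes` and the idea of walking a position through the `2^maxBits` code space are choices made for this illustration.

```c
/* For each symbol s, stores its code length in nbBits[s] and its prefix code
 * in codes[s] (to be read as an nbBits[s]-bit binary number).
 * Symbols with a weight of zero get a length of 0 and no code. */
static void assignPrefixCodes(const unsigned char *weights, int nbSymbols,
                              int maxBits, unsigned *codes, unsigned char *nbBits)
{
    for (int s = 0; s < nbSymbols; s++) { codes[s] = 0; nbBits[s] = 0; }

    unsigned position = 0;   /* position within the 2^maxBits code space */
    for (int w = 1; w <= maxBits; w++) {          /* lowest weight first */
        for (int s = 0; s < nbSymbols; s++) {     /* natural order within a weight */
            if (weights[s] != w) continue;
            nbBits[s] = (unsigned char)(maxBits + 1 - w);
            codes[s] = position >> (w - 1);       /* keep only the top nbBits bits */
            position += 1u << (w - 1);            /* a weight w covers 2^(w-1) positions */
        }
    }
}
```

With the weights of the example (`4, 3, 2, 0, 1, 1` and `maxBits = 4`), literal `4` receives `0000`, literal `5` receives `0001`, literal `2` receives `001`, literal `1` receives `01` and literal `0` receives `1`.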
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#### Literals bitstreams
|
|
|
|
|
|
|
|
|
|
##### Bitstreams sizes
|
|
|
|
|
|
|
|
|
|
As seen in a previous paragraph,
|
2016-08-03 07:16:38 -07:00
|
|
|
|
there are 2 flavors of Huffman-compressed literals :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
single stream, and 4-streams.
|
|
|
|
|
|
|
|
|
|
4-streams is useful for CPUs with multiple execution units and out-of-order execution.
|
|
|
|
|
Since each stream can be decoded independently,
|
|
|
|
|
it's possible to decode them up to 4x faster than a single stream,
|
|
|
|
|
presuming the CPU has enough parallelism available.
|
|
|
|
|
|
|
|
|
|
For single stream, header provides both the compressed and regenerated size.
|
|
|
|
|
For 4-streams though,
|
|
|
|
|
header only provides compressed and regenerated size of all 4 streams combined.
|
|
|
|
|
In order to properly decode the 4 streams,
|
|
|
|
|
it's necessary to know the compressed and regenerated size of each stream.
|
|
|
|
|
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
|
|
|
|
|
except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Compressed size is provided explicitly : in the 4-streams variant,
|
2016-07-25 03:47:02 -07:00
|
|
|
|
bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
|
2016-07-03 16:29:47 -07:00
|
|
|
|
Each value represents the compressed size of one stream, in order.
|
2016-07-03 15:42:58 -07:00
|
|
|
|
The last stream size is deduced from total compressed size
|
2016-07-24 06:35:59 -07:00
|
|
|
|
and from previously decoded stream sizes :
|
2016-07-03 15:42:58 -07:00
|
|
|
|
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
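
The size rules for the 4-streams variant can be put together in a short C sketch. The names `StreamSizes` and `splitFourStreams` are hypothetical, and the sanity checks are assumptions about what a careful decoder might verify; the format text above only gives the formulas.

```c
#include <stddef.h>

typedef struct {
    size_t cSize[4];   /* compressed size of each stream */
    size_t rSize[4];   /* regenerated size of each stream */
} StreamSizes;

/* sizes3 points to the three 16-bit little-endian values preceding the bitstreams.
 * totalCSize includes those 6 bytes. Returns 0 on success, -1 if inconsistent. */
static int splitFourStreams(const unsigned char *sizes3, size_t totalCSize,
                            size_t totalRSize, StreamSizes *out)
{
    size_t used = 6;
    for (int i = 0; i < 3; i++) {
        out->cSize[i] = (size_t)sizes3[2 * i] | ((size_t)sizes3[2 * i + 1] << 8);
        used += out->cSize[i];
    }
    if (used > totalCSize) return -1;
    out->cSize[3] = totalCSize - used;       /* stream4CSize */

    size_t perStream = (totalRSize + 3) / 4;
    if (3 * perStream > totalRSize) return -1;
    for (int i = 0; i < 3; i++) out->rSize[i] = perStream;
    out->rSize[3] = totalRSize - 3 * perStream;   /* up to 3 bytes smaller */
    return 0;
}
```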
|
|
|
|
|
|
2016-07-03 16:29:47 -07:00
|
|
|
|
##### Bitstreams read and decode
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
|
|
|
|
Each bitstream must be read _backward_,
|
|
|
|
|
that is starting from the end down to the beginning.
|
|
|
|
|
Therefore it's necessary to know the size of each bitstream.
|
|
|
|
|
|
|
|
|
|
It's also necessary to know exactly which _bit_ is the last one.
|
|
|
|
|
This is detected by a final bit flag :
|
|
|
|
|
the highest set bit of the last byte is a final-bit-flag.
|
|
|
|
|
Consequently, a last byte of `0` is not possible.
|
|
|
|
|
And the final-bit-flag itself is not part of the useful bitstream.
|
2016-07-24 06:35:59 -07:00
|
|
|
|
Hence, the last byte contains between 0 and 7 useful bits.
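
A minimal C sketch of locating the final-bit-flag follows. It reads "highest bit" as the highest bit that is set in the last byte, which is why a last byte of `0` is impossible; the function name `usefulBitsInLastByte` and the `-1` error value are assumptions of this example.

```c
/* Returns the number of useful bits in the last byte (0 to 7),
 * or -1 if the byte is 0 and therefore carries no final-bit-flag. */
static int usefulBitsInLastByte(unsigned char lastByte)
{
    if (lastByte == 0) return -1;
    int flagPos = 7;
    while (!(lastByte & (1u << flagPos))) flagPos--;   /* find highest set bit */
    return flagPos;   /* bits below the flag are the useful ones */
}
```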
|
2016-07-03 15:42:58 -07:00
|
|
|
|
|
|
|
|
|
Starting from the end,
|
|
|
|
|
it's possible to read the bitstream in a little-endian fashion,
|
|
|
|
|
keeping track of already used bits.
|
|
|
|
|
|
2016-07-03 16:29:47 -07:00
|
|
|
|
Reading the last `maxBits` bits,
|
2016-07-15 08:31:13 -07:00
|
|
|
|
it's then possible to compare extracted value to decoding table,
|
2016-07-03 15:42:58 -07:00
|
|
|
|
determining the symbol to decode and number of bits to discard.
|
|
|
|
|
|
|
|
|
|
The process continues up to reading the required number of symbols per stream.
|
|
|
|
|
If a bitstream is not entirely and exactly consumed,
|
2016-07-15 08:31:13 -07:00
|
|
|
|
meaning it does not reach exactly its beginning position with _all_ bits consumed,
|
2016-07-03 15:42:58 -07:00
|
|
|
|
the decoding process is considered faulty.
|
|
|
|
|
|
2016-07-03 09:49:35 -07:00
|
|
|
|
|
2016-08-04 02:25:52 -07:00
|
|
|
|
### `Sequences_Section`
|
2016-07-03 16:29:47 -07:00
|
|
|
|
|
|
|
|
|
A compressed block is a succession of _sequences_ .
|
|
|
|
|
A sequence is a literal copy command, followed by a match copy command.
|
|
|
|
|
A literal copy command specifies a length.
|
|
|
|
|
It is the number of bytes to be copied (or extracted) from the literal section.
|
|
|
|
|
A match copy command specifies an offset and a length.
|
|
|
|
|
The offset gives the position to copy from,
|
2016-07-15 08:31:13 -07:00
|
|
|
|
which can be within a previous block.
|
2016-07-03 16:29:47 -07:00
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
|
2016-07-03 16:29:47 -07:00
|
|
|
|
which are encoded together, interleaved in a single _bitstream_.
|
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
Each symbol is a _code_ in its own context,
|
|
|
|
|
which specifies a baseline and a number of bits to add.
|
2016-07-03 16:29:47 -07:00
|
|
|
|
_Codes_ are FSE compressed,
|
|
|
|
|
and interleaved with raw additional bits in the same bitstream.
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
The Sequences section starts with a header,
|
|
|
|
|
followed by optional Probability tables for each symbol type,
|
2016-07-03 16:29:47 -07:00
|
|
|
|
followed by the bitstream.
|
|
|
|
|
|
2016-07-20 05:58:49 -07:00
|
|
|
|
| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
|
2016-07-08 06:39:02 -07:00
|
|
|
|
| ------ | ---------------- | ------------- | ------------------ | --------- |
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
To decode the Sequence section, it's required to know its size.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
This size is deduced from `blockSize - literalSectionSize`.
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
|
2016-07-03 16:29:47 -07:00
|
|
|
|
#### Sequences section header
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
It consists of 2 items :
|
|
|
|
|
- Nb of Sequences
|
|
|
|
|
- Flags providing Symbol compression types
|
|
|
|
|
|
|
|
|
|
__Nb of Sequences__
|
|
|
|
|
|
|
|
|
|
This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
|
|
|
|
|
Let's call its first byte `byte0`.
|
|
|
|
|
- `if (byte0 == 0)` : there are no sequences.
|
|
|
|
|
The sequence section stops there.
|
|
|
|
|
Regenerated content is defined entirely by literals section.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
|
|
|
|
|
- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
|
|
|
|
|
- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
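
The three cases above translate directly into C. The sketch below is illustrative only: the function name `readNbSeqs` and the convention of returning the number of bytes consumed (or `0` when the input is too short) are assumptions, not part of the format.

```c
#include <stddef.h>

/* Decodes the variable-size nbSeqs field.
 * Returns the number of bytes consumed (1 to 3), or 0 if src is too short. */
static size_t readNbSeqs(const unsigned char *src, size_t srcSize, unsigned *nbSeqs)
{
    if (srcSize < 1) return 0;
    unsigned byte0 = src[0];
    if (byte0 < 128) {            /* also covers byte0 == 0 : no sequences */
        *nbSeqs = byte0;
        return 1;
    }
    if (byte0 < 255) {
        if (srcSize < 2) return 0;
        *nbSeqs = ((byte0 - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return 0;
    *nbSeqs = src[1] + ((unsigned)src[2] << 8) + 0x7F00;
    return 3;
}
```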
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
2016-07-23 16:21:53 -07:00
|
|
|
|
__Symbol encoding modes__
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
This is a single byte, defining the compression mode of each symbol type.
|
|
|
|
|
|
|
|
|
|
| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
|
|
|
|
|
| ------- | ------ | ------ | ------ | -------- |
|
2016-07-23 16:21:53 -07:00
|
|
|
|
|FieldName| LLType | OFType | MLType | Reserved |
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
The last field, `Reserved`, must be all-zeroes.
|
|
|
|
|
|
2016-07-23 16:21:53 -07:00
|
|
|
|
`LLType`, `OFType` and `MLType` define the compression mode of
|
2016-07-04 07:13:11 -07:00
|
|
|
|
Literal Lengths, Offsets and Match Lengths respectively.
|
|
|
|
|
|
|
|
|
|
They follow the same enumeration :
|
|
|
|
|
|
2016-07-23 07:31:49 -07:00
|
|
|
|
| Value | 0 | 1 | 2 | 3 |
|
|
|
|
|
| ---------------- | ------ | --- | ---------- | ------ |
|
|
|
|
|
| Compression Mode | predef | RLE | Compressed | Repeat |
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
- "predef" : uses a pre-defined distribution table.
|
|
|
|
|
- "RLE" : it's a single code, repeated `nbSeqs` times.
|
|
|
|
|
- "Repeat" : re-use distribution table from previous compressed block.
|
2016-07-23 07:31:49 -07:00
|
|
|
|
- "Compressed" : standard FSE compression.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
A distribution table will be present.
|
|
|
|
|
It will be described in [next part](#distribution-tables).
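
For illustration, the symbol encoding modes byte can be unpacked as in the C sketch below. The enumeration constants, the `SymbolModes` type and the function `parseSymbolModes` are hypothetical names chosen for this example.

```c
enum { MODE_PREDEF = 0, MODE_RLE = 1, MODE_COMPRESSED = 2, MODE_REPEAT = 3 };

typedef struct {
    int llType;   /* mode for Literal Lengths (bits 7-6) */
    int ofType;   /* mode for Offsets         (bits 5-4) */
    int mlType;   /* mode for Match Lengths   (bits 3-2) */
} SymbolModes;

/* Returns 0 on success, -1 if the Reserved field (bits 1-0) is not all-zeroes. */
static int parseSymbolModes(unsigned char modesByte, SymbolModes *modes)
{
    if (modesByte & 3) return -1;
    modes->llType = (modesByte >> 6) & 3;
    modes->ofType = (modesByte >> 4) & 3;
    modes->mlType = (modesByte >> 2) & 3;
    return 0;
}
```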
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
#### Symbols decoding
|
|
|
|
|
|
|
|
|
|
##### Literal Lengths codes
|
|
|
|
|
|
|
|
|
|
Literal lengths codes are values ranging from `0` to `35` included.
|
|
|
|
|
They define lengths from 0 to 131071 bytes.
|
|
|
|
|
|
|
|
|
|
| Code | 0-15 |
|
|
|
|
|
| ------ | ---- |
|
2016-07-08 06:39:02 -07:00
|
|
|
|
| length | Code |
|
2016-07-05 02:50:37 -07:00
|
|
|
|
| nbBits | 0 |
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
|
|
|
|
|
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
|
|
|
|
|
|
|
|
|
| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
|
|
|
|
|
| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|
|
|
|
|
|
|
|
|
|
| Code | 32 | 33 | 34 | 35 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 8192 |16384 |32768 |65536 |
|
|
|
|
|
| nb Bits | 13 | 14 | 15 | 16 |
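
The baselines in the tables above are contiguous: each one equals the previous baseline plus `2^nbBits` of the previous code. The C sketch below uses that observation to rebuild the baselines from the numbers of bits alone. The names `LL_bits` and `buildBaselines` are hypothetical, and a real decoder may simply store both tables.

```c
/* Number of additional bits for each literal length code (0 to 35). */
static const unsigned char LL_bits[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3, 4, 6, 7, 8, 9,10,11,12,
   13,14,15,16 };

/* Fills baseline[code] for nbCodes codes, starting at value `first`. */
static void buildBaselines(const unsigned char *bits, int nbCodes,
                           unsigned first, unsigned *baseline)
{
    unsigned next = first;
    for (int code = 0; code < nbCodes; code++) {
        baseline[code] = next;
        next += 1u << bits[code];   /* each code covers 2^nbBits values */
    }
}
```

A decoded length is then `baseline[code]` plus the value of the `LL_bits[code]` additional bits read from the bitstream.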
|
|
|
|
|
|
|
|
|
|
__Default distribution__
|
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
When "compression mode" is "predef",
|
2016-07-04 07:13:11 -07:00
|
|
|
|
a pre-defined distribution is used for FSE compression.
|
|
|
|
|
|
2016-07-08 06:39:02 -07:00
|
|
|
|
Below is its definition. It uses an accuracy of 6 bits (64 states).
|
2016-07-04 07:13:11 -07:00
|
|
|
|
```
|
|
|
|
|
short literalLengths_defaultDistribution[36] =
|
|
|
|
|
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
|
|
|
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
|
|
|
|
|
-1,-1,-1,-1 };
|
|
|
|
|
```
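
As a sanity check, a pre-defined distribution must add up to the number of states (`1 << accuracy`), with each `-1` entry counting as one. A small C helper can verify this; the name `distributionTotal` is hypothetical.

```c
/* Sums a normalized distribution, counting each "less than 1" (-1) entry as 1. */
static int distributionTotal(const short *distribution, int nbSymbols)
{
    int total = 0;
    for (int s = 0; s < nbSymbols; s++)
        total += (distribution[s] == -1) ? 1 : distribution[s];
    return total;
}
```

Applied to `literalLengths_defaultDistribution`, the total is 64, matching the 6-bit accuracy stated above.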
|
|
|
|
|
|
|
|
|
|
##### Match Lengths codes
|
|
|
|
|
|
|
|
|
|
Match lengths codes are values ranging from `0` to `52` included.
|
|
|
|
|
They define lengths from 3 to 131074 bytes.
|
|
|
|
|
|
|
|
|
|
| Code | 0-31 |
|
|
|
|
|
| ------ | -------- |
|
|
|
|
|
| value | Code + 3 |
|
2016-07-05 02:50:37 -07:00
|
|
|
|
| nbBits | 0 |
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
|
|
|
|
|
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
|
|
|
|
|
|
|
|
|
| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 |
|
|
|
|
|
| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
|
|
|
|
|
|
|
|
|
|
| Code | 48 | 49 | 50 | 51 | 52 |
|
|
|
|
|
| -------- | ---- | ---- | ---- | ---- | ---- |
|
|
|
|
|
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
|
|
|
|
|
| nb Bits | 12 | 13 | 14 | 15 | 16 |
|
|
|
|
|
|
|
|
|
|
__Default distribution__
|
|
|
|
|
|
2016-07-08 06:39:02 -07:00
|
|
|
|
When "compression mode" is defined as "predef",
|
2016-07-04 07:13:11 -07:00
|
|
|
|
a pre-defined distribution is used for FSE compression.
|
|
|
|
|
|
|
|
|
|
Here is its definition. It uses an accuracy of 6 bits (64 states).
|
|
|
|
|
```
|
|
|
|
|
short matchLengths_defaultDistribution[53] =
|
|
|
|
|
{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
|
|
|
|
|
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
|
|
|
|
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
|
|
|
|
|
-1,-1,-1,-1,-1 };
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
##### Offset codes
|
|
|
|
|
|
|
|
|
|
Offset codes are values ranging from `0` to `N`,
|
|
|
|
|
with `N` being limited by maximum backreference distance.
|
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
A decoder is free to limit its maximum `N` supported.
|
|
|
|
|
Recommendation is to support at least up to `22`.
|
2016-07-04 07:13:11 -07:00
|
|
|
|
For information, at the time of this writing,
|
|
|
|
|
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
|
|
|
|
|
|
|
|
|
|
An offset code is also the nb of additional bits to read,
|
|
|
|
|
and can be translated into an `OFValue` using the following formula :
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
OFValue = (1 << offsetCode) + readNBits(offsetCode);
|
|
|
|
|
if (OFValue > 3) offset = OFValue - 3;
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
OFValue from 1 to 3 are special : they define "repeat codes",
|
|
|
|
|
which means one of the previous offsets will be repeated.
|
|
|
|
|
They are sorted in recency order, with 1 meaning the most recent one.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
See [Repeat offsets](#repeat-offsets) paragraph.
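
The translation above can be written as a small self-contained C function. In this sketch, `extraBits` stands for the value returned by `readNBits(offsetCode)`; the `OffsetInfo` type and the function name `decodeOffsetValue` are hypothetical.

```c
typedef struct {
    int isRepeat;     /* 1 if OFValue is a repeat code (1 to 3) */
    unsigned value;   /* repeat code number, or actual offset otherwise */
} OffsetInfo;

/* offsetCode : FSE-decoded offset code (also the number of additional bits).
 * extraBits  : value of those additional bits, as read from the bitstream. */
static OffsetInfo decodeOffsetValue(unsigned offsetCode, unsigned extraBits)
{
    unsigned ofValue = (1u << offsetCode) + extraBits;
    OffsetInfo info;
    if (ofValue > 3) {
        info.isRepeat = 0;
        info.value = ofValue - 3;
    } else {
        info.isRepeat = 1;
        info.value = ofValue;
    }
    return info;
}
```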
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
__Default distribution__
|
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
When "compression mode" is defined as "predef",
|
2016-07-04 07:13:11 -07:00
|
|
|
|
a pre-defined distribution is used for FSE compression.
|
|
|
|
|
|
|
|
|
|
Here is its definition. It uses an accuracy of 5 bits (32 states),
|
2016-07-05 02:50:37 -07:00
|
|
|
|
and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
If any sequence in the compressed block requires an offset larger than this,
|
|
|
|
|
it's not possible to use the default distribution to represent it.
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
short offsetCodes_defaultDistribution[29] =
|
|
|
|
|
{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
|
|
|
|
|
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
#### Distribution tables
|
|
|
|
|
|
|
|
|
|
Following the header, up to 3 distribution tables can be described.
|
2016-07-23 16:21:53 -07:00
|
|
|
|
When present, they are in this order :
|
2016-07-04 07:13:11 -07:00
|
|
|
|
- Literal lengths
|
|
|
|
|
- Offsets
|
|
|
|
|
- Match Lengths
|
|
|
|
|
|
2016-07-23 16:21:53 -07:00
|
|
|
|
The content to decode depends on their respective encoding mode :
|
2016-07-04 07:13:11 -07:00
|
|
|
|
- Predef : no content. Use pre-defined distribution table.
|
|
|
|
|
- RLE : 1 byte. This is the only code to use across the whole compressed block.
|
|
|
|
|
- FSE : A distribution table is present.
|
2016-07-23 16:21:53 -07:00
|
|
|
|
- Repeat mode : no content. Re-use distribution from previous compressed block.
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
##### FSE distribution table : condensed format
|
|
|
|
|
|
|
|
|
|
An FSE distribution table describes the probabilities of all symbols
|
|
|
|
|
from `0` to the last present one (included)
|
2016-07-05 01:53:38 -07:00
|
|
|
|
on a normalized scale of `1 << AccuracyLog` .
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
It's a bitstream which is read forward, in little-endian fashion.
|
|
|
|
|
It's not necessary to know its exact size,
|
|
|
|
|
since it will be discovered and reported by the decoding process.
|
|
|
|
|
|
|
|
|
|
The bitstream starts by reporting on which scale it operates.
|
|
|
|
|
`AccuracyLog = low4bits + 5;`
|
2016-07-22 10:32:07 -07:00
|
|
|
|
Note that maximum `AccuracyLog` for literal and match lengths is `9`,
|
|
|
|
|
and for offsets it is `8`. Higher values are considered errors.
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
Then follows the value of each symbol, from `0` to last present one.
|
|
|
|
|
The nb of bits used by each field is variable.
|
|
|
|
|
It depends on :
|
|
|
|
|
|
|
|
|
|
- Remaining probabilities + 1 :
|
|
|
|
|
__example__ :
|
|
|
|
|
Presuming an AccuracyLog of 8,
|
|
|
|
|
and presuming 100 probabilities points have already been distributed,
|
2016-07-05 02:50:37 -07:00
|
|
|
|
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (included).
Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
__example__ :
Presuming values from 0 to 157 (included) are possible,
255-157 = 98 values are remaining in an 8-bits field.
They are used this way :
first 98 values (hence from 0 to 97) use only 7 bits,
values from 98 to 157 use 8 bits.
This is achieved through this scheme :

| Value read | Value decoded | nb Bits used |
| ---------- | ------------- | ------------ |
|   0 -  97  |   0 -  97     |  7           |
|  98 - 127  |  98 - 127     |  8           |
| 128 - 225  |   0 -  97     |  7           |
| 226 - 255  | 128 - 157     |  8           |
|
|
|
|
|
|
|
|
|
|
Symbols probabilities are read one by one, in order.
|
|
|
|
|
|
|
|
|
|
Probability is obtained from Value decoded by following formula :
|
|
|
|
|
`Proba = value - 1;`
|
|
|
|
|
|
|
|
|
|
It means value `0` becomes negative probability `-1`.
|
|
|
|
|
`-1` is a special probability, which means `less than 1`.
|
2016-07-08 06:39:02 -07:00
|
|
|
|
Its effect on distribution table is described in [next paragraph].
|
2016-07-04 07:13:11 -07:00
|
|
|
|
For the purpose of calculating cumulated distribution, it counts as one.
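
The variable-size field described above can be decoded along the lines of the following C sketch. It is an illustration under assumptions: `raw` is presumed to already hold the upcoming bits of the header (lowest bit first), `maxValue` is the largest value that may appear (remaining probability points + 1), and the names `FseField`, `decodeFseField` and `fseProbability` are hypothetical.

```c
typedef struct {
    int value;      /* decoded value, in [0, maxValue] */
    int bitsUsed;   /* number of bits actually consumed */
} FseField;

static FseField decodeFseField(unsigned raw, unsigned maxValue)
{
    int nbBits = 1;                               /* log2sup(maxValue) */
    while ((maxValue >> nbBits) != 0) nbBits++;

    unsigned full  = 1u << nbBits;
    unsigned half  = full >> 1;
    unsigned spare = (full - 1) - maxValue;       /* values that fit in nbBits-1 bits */

    FseField field;
    unsigned low = raw & (half - 1);
    if (low < spare) {                            /* small value : 1 less bit */
        field.value = (int)low;
        field.bitsUsed = nbBits - 1;
    } else {
        unsigned v = raw & (full - 1);
        if (v >= half) v -= spare;                /* upper range is shifted down */
        field.value = (int)v;
        field.bitsUsed = nbBits;
    }
    return field;
}

/* Converts a decoded value into a probability; -1 means "less than 1". */
static int fseProbability(int value) { return value - 1; }
```

The caller is expected to advance its bit position by `bitsUsed` and to subtract the probability (counting `-1` as one) from the remaining points before decoding the next field.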
|
|
|
|
|
|
2016-07-08 06:39:02 -07:00
|
|
|
|
[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
When a symbol has a probability of `zero`,
|
|
|
|
|
it is followed by a 2-bits repeat flag.
|
|
|
|
|
This repeat flag tells how many probabilities of zeroes follow the current one.
|
|
|
|
|
It provides a number ranging from 0 to 3.
|
|
|
|
|
If it is a 3, another 2-bits repeat flag follows, and so on.
|
|
|
|
|
|
2016-07-05 01:53:38 -07:00
|
|
|
|
When last symbol reaches cumulated total of `1 << AccuracyLog`,
|
2016-07-04 07:13:11 -07:00
|
|
|
|
decoding is complete.
|
2016-07-23 16:21:53 -07:00
|
|
|
|
If the last symbol makes cumulated total go above `1 << AccuracyLog`,
|
|
|
|
|
distribution is considered corrupted.
|
|
|
|
|
|
2016-07-04 07:13:11 -07:00
|
|
|
|
Then the decoder can tell how many bytes were used in this process,
|
|
|
|
|
and how many symbols are present.
|
|
|
|
|
The bitstream consumes a round number of bytes.
|
|
|
|
|
Any remaining bit within the last byte is just unused.
|
|
|
|
|
|
|
|
|
|
##### FSE decoding : from normalized distribution to decoding tables
|
|
|
|
|
|
2016-07-05 01:53:38 -07:00
|
|
|
|
The distribution of normalized probabilities is enough
|
|
|
|
|
to create a unique decoding table.
|
|
|
|
|
|
|
|
|
|
It is built according to the following rules :
|
|
|
|
|
|
|
|
|
|
The table has a size of `tableSize = 1 << AccuracyLog;`.
|
|
|
|
|
Each cell describes the symbol decoded,
|
|
|
|
|
and instructions to get the next state.
|
|
|
|
|
|
|
|
|
|
Symbols are scanned in their natural order for `less than 1` probabilities.
|
|
|
|
|
Symbols with this probability are being attributed a single cell,
|
|
|
|
|
starting from the end of the table.
|
|
|
|
|
These symbols define a full state reset, reading `AccuracyLog` bits.
|
|
|
|
|
|
|
|
|
|
All remaining symbols are sorted in their natural order.
|
|
|
|
|
Starting from symbol `0` and table position `0`,
|
|
|
|
|
each symbol gets attributed as many cells as its probability.
|
|
|
|
|
Cell allocation is spread, not linear :
|
|
|
|
|
each successor position follows this rule :
|
|
|
|
|
|
2016-07-05 02:50:37 -07:00
|
|
|
|
```
|
|
|
|
|
position += (tableSize>>1) + (tableSize>>3) + 3;
|
|
|
|
|
position &= tableSize-1;
|
|
|
|
|
```
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
|
|
|
|
A position is skipped if already occupied,
|
|
|
|
|
typically by a "less than 1" probability symbol.
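
Putting the allocation rules together, a symbol table could be filled as in the C sketch below. This is a sketch under stated assumptions rather than a reference implementation: the function name `spreadSymbols` is hypothetical, the distribution is presumed valid (it sums to the table size), and symbol values are presumed to fit in a byte.

```c
/* proba : normalized probabilities, -1 meaning "less than 1".
 * cells : output, (1 << accuracyLog) entries, each receiving a symbol. */
static void spreadSymbols(const short *proba, int nbSymbols,
                          int accuracyLog, unsigned char *cells)
{
    int tableSize = 1 << accuracyLog;

    /* "Less than 1" symbols take one cell each, starting from the end. */
    int high = tableSize - 1;
    for (int s = 0; s < nbSymbols; s++)
        if (proba[s] == -1) cells[high--] = (unsigned char)s;

    /* Remaining symbols are spread, starting from position 0. */
    int position = 0;
    for (int s = 0; s < nbSymbols; s++) {
        for (int i = 0; i < proba[s]; i++) {
            cells[position] = (unsigned char)s;
            do {
                position += (tableSize >> 1) + (tableSize >> 3) + 3;
                position &= tableSize - 1;
            } while (position > high);   /* skip cells already occupied */
        }
    }
}
```

Because the step is odd and the table size is a power of 2, the walk visits every cell exactly once before returning to position 0.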
|
|
|
|
|
|
|
|
|
|
The result is a list of state values.
|
|
|
|
|
Each state will decode the current symbol.
|
|
|
|
|
|
|
|
|
|
To get the Number of bits and baseline required for next state,
|
|
|
|
|
it's first necessary to sort all states in their natural order.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
The lower states will need 1 more bit than higher ones.
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
|
|
|
|
__Example__ :
|
|
|
|
|
Presuming a symbol has a probability of 5.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
It receives 5 state values. States are sorted in natural order.
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
|
|
|
|
Next power of 2 is 8.
|
|
|
|
|
Space of probabilities is divided into 8 equal parts.
|
|
|
|
|
Presuming the AccuracyLog is 7, it defines 128 states.
|
|
|
|
|
Divided by 8, each share is 16 large.
|
|
|
|
|
|
|
|
|
|
In order to reach 8, 8-5=3 lowest states will count "double",
|
|
|
|
|
taking shares twice larger,
|
|
|
|
|
requiring one more bit in the process.
|
|
|
|
|
|
|
|
|
|
Numbering starts from higher states using fewer bits.
|
|
|
|
|
|
|
|
|
|
| state order | 0 | 1 | 2 | 3 | 4 |
|
|
|
|
|
| ----------- | ----- | ----- | ------ | ---- | ----- |
|
|
|
|
|
| width | 32 | 32 | 32 | 16 | 16 |
|
|
|
|
|
| nb Bits | 5 | 5 | 5 | 4 | 4 |
|
|
|
|
|
| range nb | 2 | 4 | 6 | 0 | 1 |
|
|
|
|
|
| baseline | 32 | 64 | 96 | 0 | 16 |
|
|
|
|
|
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
|
|
|
|
|
|
|
|
|
Next state is determined from current state
|
|
|
|
|
by reading the required number of bits, and adding the specified baseline.
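
One compact way to compute the number of bits and the baseline of each state is sketched below in C. The formula is an equivalent restatement of the share-doubling rule in the example, chosen for this illustration; the names `Transition`, `topBitIndex` and `stateTransition` are hypothetical.

```c
typedef struct {
    int nbBits;     /* bits to read to reach the next state */
    int baseline;   /* value added to those bits */
} Transition;

/* Index of the highest set bit of v (v must be non-zero). */
static int topBitIndex(unsigned v)
{
    int n = 0;
    while (v >>= 1) n++;
    return n;
}

/* proba : number of states attributed to the symbol
 *         (use 1 for a "less than 1" symbol).
 * rank  : index of the state among the symbol's states, in natural order. */
static Transition stateTransition(int proba, int rank, int accuracyLog)
{
    unsigned x = (unsigned)(proba + rank);
    Transition t;
    t.nbBits = accuracyLog - topBitIndex(x);
    t.baseline = (int)(x << t.nbBits) - (1 << accuracyLog);
    return t;
}
```

For a probability of 5 and an `AccuracyLog` of 7, ranks 0 to 4 yield the (nb Bits, baseline) pairs (5, 32), (5, 64), (5, 96), (4, 0) and (4, 16), matching the table above.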
|
2016-07-04 07:13:11 -07:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#### Bitstream
|
2016-07-03 16:29:47 -07:00
|
|
|
|
|
2016-07-05 01:53:38 -07:00
|
|
|
|
All sequences are stored in a single bitstream, read _backward_.
|
|
|
|
|
It is therefore necessary to know the bitstream size,
|
|
|
|
|
which is deduced from compressed block size.
|
|
|
|
|
|
2016-07-08 06:39:02 -07:00
|
|
|
|
The last useful bit of the stream is followed by an end-bit-flag.
|
2016-07-05 02:50:37 -07:00
|
|
|
|
Highest bit of last byte is this flag.
|
2016-07-05 01:53:38 -07:00
|
|
|
|
It does not belong to the useful part of the bitstream.
|
|
|
|
|
Therefore, last byte has 0-7 useful bits.
|
|
|
|
|
Note that it also means that last byte cannot be `0`.
|
|
|
|
|
|
|
|
|
|
##### Starting states
|
|
|
|
|
|
|
|
|
|
The bitstream starts with initial state values,
|
|
|
|
|
each using the required number of bits in their respective _accuracy_,
|
|
|
|
|
decoded previously from their normalized distribution.
|
|
|
|
|
|
|
|
|
|
It starts by `Literal Length State`,
|
|
|
|
|
followed by `Offset State`,
|
|
|
|
|
and finally `Match Length State`.
|
|
|
|
|
|
|
|
|
|
Reminder : always keep in mind that all values are read _backward_.
|
|
|
|
|
|
|
|
|
|
##### Decoding a sequence
|
|
|
|
|
|
|
|
|
|
A state gives a code.
|
|
|
|
|
A code provides a baseline and number of bits to add.
|
|
|
|
|
See [Symbol Decoding] section for details on each symbol.
|
|
|
|
|
|
|
|
|
|
Decoding starts by reading the nb of bits required to decode offset.
|
|
|
|
|
It then does the same for match length,
|
|
|
|
|
and then for literal length.
|
|
|
|
|
|
2016-07-08 06:39:02 -07:00
|
|
|
|
Offset / matchLength / litLength define a sequence.
|
|
|
|
|
It starts by inserting the number of literals defined by `litLength`,
|
|
|
|
|
then continues by copying `matchLength` bytes from `currentPos - offset`.
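
Executing one sequence can be sketched in C as follows. This example assumes, for simplicity, that `dst` holds all previously regenerated data, so offsets reaching outside of it are rejected; the function name `executeSequence` and its error value are hypothetical. The match is copied byte by byte because source and destination may overlap when `offset` is smaller than `matchLength`.

```c
#include <stddef.h>
#include <string.h>

/* Appends litLength literals, then matchLength bytes copied from
 * (current position - offset). pos must not exceed dstCapacity.
 * Returns the new output position, or (size_t)-1 on error. */
static size_t executeSequence(unsigned char *dst, size_t dstCapacity, size_t pos,
                              const unsigned char *literals, size_t litLength,
                              size_t offset, size_t matchLength)
{
    if (litLength > dstCapacity - pos) return (size_t)-1;
    memcpy(dst + pos, literals, litLength);
    pos += litLength;

    if (offset == 0 || offset > pos) return (size_t)-1;
    if (matchLength > dstCapacity - pos) return (size_t)-1;
    for (size_t i = 0; i < matchLength; i++)
        dst[pos + i] = dst[pos + i - offset];   /* overlap-safe forward copy */
    return pos + matchLength;
}
```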
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
|
|
|
|
The next operation is to update states.
|
|
|
|
|
Using rules pre-calculated in the decoding tables,
|
|
|
|
|
`Literal Length State` is updated,
|
|
|
|
|
followed by `Match Length State`,
|
|
|
|
|
and then `Offset State`.
|
|
|
|
|
|
|
|
|
|
This operation will be repeated `nbSeqs` times.
|
|
|
|
|
At the end, the bitstream shall be entirely consumed,
|
|
|
|
|
otherwise bitstream is considered corrupted.
|
|
|
|
|
|
|
|
|
|
[Symbol Decoding]:#symbols-decoding
|
|
|
|
|
|
|
|
|
|
##### Repeat offsets
|
|
|
|
|
|
|
|
|
|
As seen in [Offset Codes], the first 3 values define a repeated offset.
|
|
|
|
|
They are sorted in recency order, with 1 meaning "most recent one".
|
|
|
|
|
|
|
|
|
|
There is an exception though, when current sequence's literal length is `0`.
|
2016-07-30 19:01:57 -07:00
|
|
|
|
In which case, repcodes are "pushed by one",
|
|
|
|
|
so 1 becomes 2, 2 becomes 3,
|
|
|
|
|
and 3 becomes "offset_1 - 1_byte".
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
2016-07-30 19:01:57 -07:00
|
|
|
|
On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
|
|
|
|
Then each block receives its start value from previous compressed block.
|
|
|
|
|
Note that non-compressed blocks are skipped,
|
|
|
|
|
they do not contribute to offset history.
|
|
|
|
|
|
|
|
|
|
[Offset Codes]: #offset-codes
|
|
|
|
|
|
|
|
|
|
###### Offset updates rules
|
|
|
|
|
|
2016-07-30 19:01:57 -07:00
|
|
|
|
A new offset takes the lead in offset history, shifting the others back,
|
|
|
|
|
up to its previous place if it was already present.
|
2016-07-05 01:53:38 -07:00
|
|
|
|
|
2016-07-30 19:01:57 -07:00
|
|
|
|
It means that when repeat offset 1 (most recent) is used, history is unmodified.
|
|
|
|
|
When repeat offset 2 is used, it's swapped with offset 1.
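
The repeat-offset rules, including the exception for a literal length of `0`, can be combined in a C sketch such as the one below. It is an interpretation of the rules stated in this section, not a normative algorithm; the function name `resolveOffset` is hypothetical, and `rep[0]` holds the most recent offset.

```c
/* ofValue   : decoded OFValue (>= 1).
 * litLength : literal length of the current sequence.
 * rep       : offset history, rep[0] being the most recent one.
 * Returns the offset to use, and updates the history. */
static unsigned resolveOffset(unsigned ofValue, unsigned litLength, unsigned rep[3])
{
    unsigned offset;
    if (ofValue > 3) {                       /* new offset : takes the lead */
        offset = ofValue - 3;
        rep[2] = rep[1];
        rep[1] = rep[0];
        rep[0] = offset;
        return offset;
    }
    /* Repeat code, "pushed by one" when there are no literals. */
    unsigned idx = ofValue - 1 + (litLength == 0 ? 1 : 0);   /* 0 to 3 */
    if (idx == 0) return rep[0];             /* most recent : history unmodified */
    offset = (idx == 3) ? rep[0] - 1 : rep[idx];
    if (idx != 1) rep[2] = rep[1];           /* idx == 1 is a plain swap */
    rep[1] = rep[0];
    rep[0] = offset;
    return offset;
}
```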
|
2016-07-03 16:29:47 -07:00
|
|
|
|
|
2016-07-01 11:55:28 -07:00
|
|
|
|
|
2016-07-08 10:16:57 -07:00
|
|
|
|
Dictionary format
|
|
|
|
|
-----------------
|
|
|
|
|
|
|
|
|
|
`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
|
|
|
|
|
But dictionaries created by `zstd --train` follow a format, described here.
|
|
|
|
|
|
|
|
|
|
__Pre-requisites__ : a dictionary has a known length,
|
|
|
|
|
defined either by a buffer limit, or a file size.
|
|
|
|
|
|
|
|
|
|
| Header | DictID | Stats | Content |
|
|
|
|
|
| ------ | ------ | ----- | ------- |
|
|
|
|
|
|
2016-07-25 03:47:02 -07:00
|
|
|
|
__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
|
2016-07-08 10:16:57 -07:00
|
|
|
|
|
2016-07-25 03:47:02 -07:00
|
|
|
|
__DictID__ : 4 bytes, stored in Little-Endian format.
|
2016-07-08 10:16:57 -07:00
|
|
|
|
DictID can be any value, except 0 (which means no DictID).
|
2016-07-08 10:22:16 -07:00
|
|
|
|
It's used by decoders to check if they use the correct dictionary.
|
2016-07-15 08:03:38 -07:00
|
|
|
|
_Reserved ranges :_
|
|
|
|
|
If the frame is going to be distributed in a private environment,
|
|
|
|
|
any dictionary ID can be used.
|
|
|
|
|
However, for public distribution of compressed frames,
|
|
|
|
|
some ranges are reserved for future use :
|
2016-07-15 08:58:13 -07:00
|
|
|
|
|
|
|
|
|
- low range : 1 - 32767 : reserved
|
|
|
|
|
- high range : >= (2^31) : reserved
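
Reading the first two fields of a dictionary can be sketched in C as below. The helper names `readLE32`, `readDictID` and `isReservedDictID`, as well as the `DICT_MAGIC` macro, are hypothetical; only the magic value, the field sizes and the reserved ranges come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC 0xEC30A437u

/* Reads a 4-byte little-endian value. */
static uint32_t readLE32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and stores the ID on success, -1 if the buffer is too short,
 * the Header does not match, or the DictID is 0. */
static int readDictID(const unsigned char *dict, size_t dictSize, uint32_t *dictID)
{
    if (dictSize < 8) return -1;
    if (readLE32(dict) != DICT_MAGIC) return -1;
    *dictID = readLE32(dict + 4);
    return (*dictID == 0) ? -1 : 0;
}

/* Tells whether an ID falls in a range reserved for public distribution. */
static int isReservedDictID(uint32_t id)
{
    return (id >= 1 && id <= 32767) || id >= 0x80000000u;
}
```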
|
2016-07-08 10:16:57 -07:00
|
|
|
|
|
|
|
|
|
__Stats__ : Entropy tables, following the same format as [compressed blocks].
|
|
|
|
|
They are stored in following order :
|
|
|
|
|
Huffman tables for literals, FSE table for offset,
|
2016-07-08 10:22:16 -07:00
|
|
|
|
FSE table for matchLength, and FSE table for litLength.
|
|
|
|
|
It's finally followed by 3 offset values, populating recent offsets,
|
2016-07-25 03:47:02 -07:00
|
|
|
|
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
|
2016-07-08 10:16:57 -07:00
|
|
|
|
|
|
|
|
|
__Content__ : Where the actual dictionary content is.
|
2016-07-08 10:22:16 -07:00
|
|
|
|
Content size depends on Dictionary size.
|
2016-07-08 10:16:57 -07:00
|
|
|
|
|
2016-07-25 02:04:56 -07:00
|
|
|
|
[compressed blocks]: #the-format-of-compressed_block
|
2016-07-08 10:16:57 -07:00
|
|
|
|
|
2016-07-01 11:55:28 -07:00
|
|
|
|
|
2016-06-30 06:40:28 -07:00
|
|
|
|
Version changes
|
|
|
|
|
---------------
|
2016-07-20 05:58:49 -07:00
|
|
|
|
- 0.2.0 : numerous format adjustments for zstd v0.8
|
2016-08-03 07:16:38 -07:00
|
|
|
|
- 0.1.2 : limit Huffman tree depth to 11 bits
|
2016-07-17 07:21:37 -07:00
|
|
|
|
- 0.1.1 : reserved dictID ranges
|
|
|
|
|
- 0.1.0 : initial release
|