New : Zstandard Frame format specification

2016-06-30 15:40:28 +02:00 · 2016-06-30 15:40:28 +02:00 · c35cd5b1db
parent c093208ab8
commit c35cd5b1db
2 changed files with 351 additions and 1 deletions
--- a/5
+++ b/5
@ -168,6 +168,9 @@ bmix32test: clean

 bmi32test: clean
 	CFLAGS="-O3 -mbmi -m32 -Werror" $(MAKE) -C $(PRGDIR) test
+
+staticAnalyze: clean
+	CPPFLAGS=-g scan-build --status-bugs -v $(MAKE) all	
 endif


@ -187,7 +190,7 @@ gcc5install:

 gcc6install:
 	sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
-	sudo apt-get update -y -qq 
+	sudo apt-get update -y -qq
 	sudo apt-get install -y -qq gcc-6-multilib

 arminstall: clean
--- a/zstd_frame_format.md
+++ b/zstd_frame_format.md
@ -0,0 +1,347 @@
+Zstandard Frame Format Description
+==================================
+
+### Notices
+
+Copyright (c) 2016 Yann Collet
+
+Permission is granted to copy and distribute this document
+for any  purpose and without charge,
+including translations into other  languages
+and incorporation into compilations,
+provided that the copyright notice and this notice are preserved,
+and that any substantive changes or deletions from the original
+are clearly marked.
+Distribution of this document is unlimited.
+
+### Version
+
+0.1.0 (30/06/2016)
+
+
+Introduction
+------------
+
+The purpose of this document is to define a lossless compressed data format,
+that is independent of CPU type, operating system,
+file system and character set, suitable for
+File compression, Pipe and streaming compression
+using the [Zstandard algorithm](http://www.zstandard.org).
+
+The data can be produced or consumed,
+even for an arbitrarily long sequentially presented input data stream,
+using only an a priori bounded amount of intermediate storage,
+and hence can be used in data communications.
+The format uses the Zstandard compression method,
+and optional [xxHash-64 checksum method](http://www.xxhash.org),
+for detection of data corruption.
+
+The data format defined by this specification
+does not attempt to allow random access to compressed data.
+
+This specification is intended for use by implementers of software
+to compress data into Zstandard format and/or decompress data from Zstandard format.
+The text of the specification assumes a basic background in programming
+at the level of bits and other primitive data representations.
+
+Unless otherwise indicated below,
+a compliant compressor must produce data sets
+that conform to the specifications presented here.
+It doesn’t need to support all options though.
+
+A compliant decompressor must be able to decompress
+at least one working set of parameters
+that conforms to the specifications presented here.
+It may also ignore informative fields, such as checksum.
+Whenever it does not support a specific parameter within the compressed stream,
+it must produce a non-ambiguous error code
+and associated error message explaining which parameter is unsupported.
+
+
+General Structure of Zstandard Frame format
+-------------------------------------------
+
+| MagicNb |  F. Header | Block | (...) | EndMark |
+|:-------:|:----------:| ----- | ----- | ------- |
+| 4 bytes | 2-14 bytes |       |       | 3 bytes |
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0xFD2FB527
+
+__Frame Header__
+
+2 to 14 Bytes, to be detailed in the next part.
+Most important part of the spec.
+
+__Data Blocks__
+
+To be detailed later on.
+That’s where compressed data is stored.
+
+__EndMark__
+
+The flow of blocks ends when the last block header brings an _end signal_ .
+This last block header may optionally host a __Content Checksum__ .
+
+__Content Checksum__
+
+Content Checksum verify that the full content has been decoded correctly.
+The content checksum is the result
+of [xxh64() hash function](https://www.xxHash.com)
+digesting the original (decoded) data as input, and a seed of zero.
+Bits from 11 to 32 (included) are extracted to form the 22 bits checksum
+stored into the last block header.
+```
+contentChecksum = (XXH64(content, size, 0) >> 11) & (1<<22)-1);
+```
+Content checksum is only present when its associated flag
+is set in the frame descriptor.
+Its usage is optional.
+
+__Frame Concatenation__
+
+In some circumstances, it may be required to append multiple frames,
+for example in order to add new data to an existing compressed file
+without re-framing it.
+
+In such case, each frame brings its own set of descriptor flags.
+Each frame is considered independent.
+The only relation between frames is their sequential order.
+
+The ability to decode multiple concatenated frames
+within a single stream or file is left outside of this specification.
+As an example, the reference `zstd` command line utility is able
+to decode all concatenated frames in their sequential order,
+presenting the final decompressed result as if it was a single frame.
+
+
+Frame Header
+----------------
+
+| FHD     | (WD)      | (Content Size) | (dictID)  |
+| ------- | --------- |:--------------:| --------- |
+| 1 byte  | 0-1 byte  |  0 - 8 bytes   | 0-4 bytes |
+
+Frame header uses a minimum of 2 bytes,
+and up to 14 bytes depending on optional parameters.
+
+__FHD byte__ (Frame Header Descriptor)
+
+|  BitNb  |   7-6  |    5    |   4    |    3     |    2     |    1-0   |
+| ------- | ------ | ------- | ------ | -------- | -------- | -------- |
+|FieldName| FCSize | Segment | Unused | Reserved | Checksum |  dictID  |
+
+In the table, bit 7 is highest bit, while bit 0 is lowest.
+
+__Frame Content Size flag__
+
+This is a 2-bits flag (`= FHD >> 6`),
+telling if original data size is provided within the header
+
+|  Value  |  0  |  1  |  2  |  3  |
+| ------- | --- | --- | --- | --- |
+|FieldSize| 0-1 |  2  |  4  |  8  |
+
+Value 0 is special : it means `0` (data size not provided)
+_if_ the `WD` byte is present.
+Otherwise, it means `1` byte (data size <= 255 bytes).
+
+__Single Segment__
+
+If this flag is set,
+data shall be regenerated within a single continuous memory segment.
+In which case, `WD` byte is not present,
+but `Frame Content Size` field necessarily is.
+The size of the memory segment must be at least `>= Frame Content Size`.
+
+In order to preserve decoder from unreasonable memory requirement,
+a decoder can refuse a compressed frame
+which requests a memory size beyond decoder's authorized range.
+
+__Unused bit__
+
+The value of this bit is unimportant
+and not interpreted by a decoder compliant with this specification version.
+It may be used in a future revision,
+to signal a property which is not required to properly decode the frame.
+
+__Reserved bit__
+
+This bit is reserved for some future feature.
+Its value must be zero.
+A decoder compliant with this specification version must ensure it is not set.
+This bit may be used in a future revision,
+to signal a feature that must be interpreted in order to decode the frame.
+
+__Content checksum flag__
+
+If this flag is set, a content checksum will be present into the EndMark.
+The checksum is a 22 bits value extracted from the XXH64() of data.
+See __Content Checksum__ .
+
+__Dictionary ID flag__
+
+This is a 2-bits flag (`= FHD & 3`),
+telling if a dictionary ID is provided within the header
+
+|  Value  |  0  |  1  |  2  |  3  |
+| ------- | --- | --- | --- | --- |
+|FieldSize|  0  |  1  |  2  |  4  |
+
+__WD byte__ (Window Descriptor)
+
+Provides guarantees on maximum back-reference distance
+that will be used within compressed data.
+This information can then be used by decoder to allocate enough memory.
+
+|  BitNb  |    7-3   |    0-2   |
+| ------- | -------- | -------- |
+|FieldName| Exponent | Mantissa |
+
+Maximum distance is given by the following formulae :
+```
+windowLog = 10 + Exponent;
+windowBase = 1 << windowLog;
+windowAdd = (windowBase / 8) * Mantissa;
+windowSize = windowBase + windowAdd;
+```
+`WD` byte is optional. It's not present in `single segment` mode.
+
+In order to preserve decoder from unreasonable memory requirement,
+a decoder can refuse a compressed frame
+which requests a memory size beyond decoder's authorized range.
+
+__Frame Content Size__
+
+This is the original (uncompressed) size.
+This information is optional, and only present if associated flag is set.
+Content size is provided using 1, 2, 4 or 8 Bytes.
+Format is Little endian.
+
+| Field Size |    Range   |
+| ---------- | ---------- |
+|     0      |      0     |
+|     1      |   0 - 255  |
+|     2      | 256 - 65791|
+|     4      | 0 - 2^32-1 |
+|     8      | 0 - 2^64-1 |
+
+When field size is 1, 4 or 8 bytes, the value is read directly.
+When field size is 2, an offset of 256 is added.
+It's possible to represent a small size of `18` using the 8-bytes variant.
+A size of `0` means `data size is unknown`.
+In which case, the `WD` byte will be the only hint
+to determine memory allocation.
+
+In order to preserve decoder from unreasonable memory requirement,
+a decoder can refuse a compressed frame
+which requests a memory size beyond decoder's authorized range.
+
+__Dictionary ID__
+
+This is a variable size field, which contains a single ID.
+It checks if the correct dictionary is used for decoding.
+Note that this field is optional. If it's not present,
+it's up to the caller to make sure it uses the correct dictionary.
+
+Field size depends on __Dictionary ID flag__.
+1 byte can represent an ID 0-255.
+2 bytes can represent an ID 0-65535.
+4 bytes can represent an ID 0-(2^32-1).
+
+It's possible to represent a small ID (for example `13`) with a large 4-bytes dictionary ID, losing some efficiency in the process.
+
+
+Data Blocks
+-----------
+
+| B. Header |  data  |
+|:---------:| ------ |
+|  3 bytes  |        |
+
+
+__Block Header__
+
+This field uses 3-bytes, format is big-endian.
+
+The 2 highest bits represent the `block type`,
+while the remaining 22 bits represent the block size.
+
+There are 4 block types :
+
+|    Value   |      0     |  1  |  2  |    3    |
+| ---------- | ---------- | --- | --- | ------- |
+| Block Type | Compressed | Raw | RLE | EndMark |
+
+- Compressed : this is a compressed block,
+  following Zstandard's block format specification.
+  The "block size" is the compressed size.
+  Decompressed size is unknown,
+  but its maximum possible value is guaranteed (see later)
+- Raw : this is an uncompressed block.
+  "block size" is the number of bytes to read and copy.
+- RLE : this is a single byte, repeated N times.
+  In which case, the size of the "compressed" block is always 1,
+  and the "block size" is the size to regenerate.
+- EndMark : this is not a block. Signal the end of the frame.
+  The rest of the field may be optionally filled by a checksum
+  (see frame checksum).
+
+Block Size shall never be larger than Block Maximum Size.
+Block Maximum Size is the smallest of :
+- Max back-reference distance
+- 128 KB
+
+
+__Data__
+
+Where the actual data to decode stands.
+It might be compressed or not, depending on previous field indications.
+A data block is not necessarily "full" :
+an arbitrary “flush” may happen anytime. Any block can be “partially filled”.
+Therefore, data can have any size, up to Block Maximum Size.
+Block Maximum Size is the smallest of :
+- Max back-reference distance
+- 128 KB
+
+
+Skippable Frames
+----------------
+
+| Magic Number | Frame Size | User Data |
+|:------------:|:----------:| --------- |
+|   4 bytes    |  4 bytes   |           |
+
+Skippable frames allow the insertion of user-defined data
+into a flow of concatenated frames.
+Its design is pretty straightforward,
+with the sole objective to allow the decoder to quickly skip
+over user-defined data and continue decoding.
+
+Skippable frames defined in this specification are compatible with LZ4 ones.
+
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
+All 16 values are valid to identify a skippable frame.
+
+__Frame Size__
+
+This is the size, in bytes, of the following User Data
+(without including the magic number nor the size field itself).
+4 Bytes, Little endian format, unsigned 32-bits.
+This means User Data can’t be bigger than (2^32-1) Bytes.
+
+__User Data__
+
+User Data can be anything. Data will just be skipped by the decoder.
+
+
+Version changes
+---------------
+
+0.1 : initial release