[zdict] Add a FAQ to the top of zdict.h

The FAQ covers the questions asked in Issue #2566. It first covers why you would want to use a dictionary, then what a dictionary is, and finally it tells you how to train a dictionary, and clarifies some of the parameters. There is definitely more that could be said about some of the advanced trainers, but this should be a good start.
2021-05-05 19:44:24 -07:00 · 2021-05-05 19:44:24 -07:00 · 1874f0844d
commit 1874f0844d
parent fed8589430
1 changed files with 147 additions and 1 deletions
--- a/lib/zdict.h
+++ b/lib/zdict.h
@ -36,6 +36,145 @@ extern "C" {
 #  define ZDICTLIB_API ZDICTLIB_VISIBILITY
 #endif
 /*******************************************************************************
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetion in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary on ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: Its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed,
 * if it doesn't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select a the smallest size that
 * doesn't hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionaries effectiveness. Having more samples will
 * only improve the dictionaries effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built in
 * benchmarking tool to test the dictionary effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictioanry
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please, it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/
 /*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
@ -64,7 +203,14 @@ ZDICTLIB_API size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCap
 typedef struct {
    int      compressionLevel;   /*< optimize for a specific zstd compression level; 0 means default */
    unsigned notificationLevel;  /*< Write log to stderr; 0 = none (default); 1 = errors; 2 = progression; 3 = details; 4 = debug; */
-    unsigned dictID;             /*< force dictID value; 0 means auto mode (32-bits random value) */
+    unsigned dictID;             /*< force dictID value; 0 means auto mode (32-bits random value)
                                  *   NOTE: The zstd format reserves some dictionary IDs for future use.
                                  *         You may use them in private settings, but be warned that they
                                  *         may be used by zstd in a public dictionary registry in the future.
                                  *         These dictionary IDs are:
                                  *           - low range  : <= 32767
                                  *           - high range : >= (2^31)
                                  */
 } ZDICT_params_t;
 /*! ZDICT_finalizeDictionary():