facebook/zstd - zstd - Final Minetest

Author	SHA1	Message	Date
Norbert Lange	0d45540695	decompress: conditionally remove bmi2 from context Use an helper function, which will just return 0 in case the feature is disabled. Allows constant propagation and removal of dead code.	2021-09-26 14:41:37 +02:00
Nick Terrell	189e87bcbe	[lib] Make lib compatible with `-Wfall-through` excepting legacy Switch to a macro `ZSTD_FALLTHROUGH;` instead of a comment. On supported compilers this uses an attribute, otherwise it becomes a comment. This is necessary to be compatible with clang's `-Wfall-through`, and gcc's `-Wfall-through=2` which don't support comments. Without this the linux build emits a bunch of warnings. Also add a test to CI to ensure that we don't regress.	2021-09-23 10:51:18 -07:00
Danila Kutenin	2c2c9e7dfd	Add possible improvements for gcc-11	2021-06-29 09:06:47 +01:00
Danila Kutenin	08a3ddbd28	Add comment for gcc-11	2021-06-08 20:54:21 +01:00
Danila Kutenin	6534c0000f	Be C89 compliant and fix alignment for gcc11	2021-06-08 20:45:57 +01:00
Danila Kutenin	a80d268700	Optimize ZSTD_decodeSequence by another x%	2021-05-29 18:21:10 +01:00
Yann Collet	439e58d060	improved gcc-9 and gcc-10 decoding speed the new alignment setting is better for gcc-9 and gcc-10 by about ~+5%. Unfortunately, it's worse for essentially all other compilers. Make the new alignment setting conditional to gcc-9+.	2021-05-08 00:01:01 -07:00
Yann Collet	6755baf940	update decoder hot loop alignment This seems to bring an additional ~+1.2% decompression speed on average across 10 compilers x 6 scenarios.	2021-05-07 15:18:16 -07:00
Yann Collet	1db5947591	improve decompression speed of long variant by ~+5% changed strategy, now unconditionally prefetch the first 2 cache lines, instead of cache lines corresponding to the first and last bytes of the match. This better corresponds to cpu expectation, which should auto-prefetch following cachelines on detecting the sequential nature of the read. This is globally positive, by +5%, though exact gains depend on compiler (from -2% to +15%). The only negative counter-example is gcc-9.	2021-05-07 11:26:14 -07:00
Yann Collet	ee425faaa7	Merge branch 'dev' into d_prefetch_refactor	2021-05-06 19:49:26 -07:00
Nick Terrell	b052b583e5	[lib] Fix UBSAN warning in ZSTD_decompressSequences()	2021-05-06 15:31:30 -07:00
Yann Collet	7ef6d7b36c	deeper prefetching pipeline for decompressSequencesLong pipeline increased from 4 to 8 slots. This change substantially improves decompression speed when there are long distance offsets. example with enwik9 compressed at level 22 : gcc-9 : 947 -> 1039 MB/s clang-10: 884 -> 946 MB/s I also checked the "cold dictionary" scenario, and found a smaller benefit, around ~2% (measurements are more noisy for this scenario).	2021-05-05 10:04:03 -07:00
Yann Collet	8cde167a27	Merge branch 'dev' into d_prefetch_refactor	2021-05-05 09:13:38 -07:00
Nick Terrell	a494308ae9	[copyright][license] Switch to yearless copyright and some cleanup in the linux-kernel files * Switch to yearless copyright per FB policy * Fix up SPDX-License-Identifier lines in `contrib/linux-kernel` sources * Add zstd copyright/license header to the `contrib/linux-kernel` sources * Update the `tests/test-license.py` to check for yearless copyright * Improvements to `tests/test-license.py` * Check `contrib/linux-kernel` in `tests/test-license.py`	2021-03-30 10:30:43 -07:00
Yann Collet	f5434663ea	Refactor prefetching for the decoding loop Following #2545, I noticed that one field in `seq_t` is optional, and only used in combination with prefetching. (This may have contributed to static analyzer failure to detect correct initialization). I then wondered if it would be possible to rewrite the code so that this optional part is handled directly by the prefetching code rather than delegated as an option into `ZSTD_decodeSequence()`. This resulted into this refactoring exercise where the prefetching responsibility is better isolated into its own function and `ZSTD_decodeSequence()` is streamlined to contain strictly Sequence decoding operations. Incidently, due to better code locality, it reduces the need to send information around, leading to simplified interface, and smaller state structures.	2021-03-19 15:48:17 -07:00
Nick Terrell	f9b1e711ba	[zstd] Fix NULL pointer addition in ZSTD_checkContinuity() Don't start a new section when `dstSize == 0` to avoid NULL pointer addition.	2021-02-05 12:18:06 -08:00
Yann Collet	b9748757b0	fixed minor cast warning	2021-02-05 09:55:54 -08:00
Nick Terrell	66e811d782	[license] Update year to 2021	2021-01-04 17:53:52 -05:00
Yann Collet	0b39531d75	moving all references to `release` branch was previously `master`	2020-12-16 23:00:35 -08:00
Nick Terrell	c465f24457	ZSTD_ prefix mem{cpy,move,set},malloc,calloc,free	2020-08-26 12:26:03 -07:00
Nick Terrell	80f577baa2	Move standard includes to zstd_deps.h	2020-08-26 12:25:08 -07:00
Nick Terrell	614e446000	Merge pull request #2271 from terrelln/small-blocks Small block optimizations	2020-08-24 18:54:33 -07:00
Nick Terrell	52f33a1da5	Fix compiler warnings	2020-08-24 16:09:45 -07:00
Nick Terrell	6f301a7903	Merge pull request #2272 from terrelln/dstSize_tooSmall [fix] Always return dstSize_tooSmall when it is the case	2020-08-24 15:01:17 -07:00
Nick Terrell	6d2f750b37	Document the BMI2 default() functions	2020-08-24 14:44:33 -07:00
Nick Terrell	1302f8d676	[fix] Always return dstSize_tooSmall when it is the case	2020-08-24 13:38:13 -07:00
Nick Terrell	575731b6db	Use ncount=1 when < 4096 symbols	2020-08-18 16:47:53 -07:00
Nick Terrell	612e947c5e	wire up bmi2 support	2020-08-17 16:35:28 -07:00
Nick Terrell	ba1fd17a9f	speed up literal header decoding	2020-08-17 12:17:53 -07:00
Nick Terrell	6004c1117f	speed up small blocks	2020-08-16 23:03:38 -07:00
Carl Woffenden	4c81fae146	Fix clang -Wcomma warning	2020-08-13 16:11:22 +02:00
Nick Terrell	cce0edfdbe	Fix unused variable warnings in fuzzing build mode without asserts Fix unused vairable warnings when `FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION` is defined but asserts are disabled. Fixes #2210.	2020-06-22 12:56:57 -07:00
Nick Terrell	f800e72a3c	[lib] Fix assertion when dictionary is prefix	2020-05-12 14:33:59 -07:00
Nick Terrell	4b88bd3ee0	[lib][fuzz] Assert sequences are valid in round trip tests	2020-05-11 20:38:49 -07:00
Nick Terrell	5717bd39ee	[lib] Fix NULL pointer dereference When the output buffer is `NULL` with size 0, but the frame content size is non-zero, we will write to the NULL pointer because our bounds check underflowed. This was exposed by a recent PR that allowed an empty frame into the single-pass shortcut in streaming mode. * Fix the bug. * Fix another NULL dereference in zstd-v1. * Overflow checks in 32-bit mode. * Add a dedicated test. * Expose the bug in the dedicated simple_decompress fuzzer. * Switch all mallocs in fuzzers to return NULL for size=0. * Fix a new timeout in a fuzzer. Neither clang nor gcc show a decompression speed regression on x86-64. On x86-32 clang is slightly positive and gcc loses 2.5% of speed. Credit to OSS-Fuzz.	2020-05-06 12:09:02 -07:00
W. Felix Handte	6028827fee	Rewrite Include Paths to be Relative Addresses #1998.	2020-05-04 15:20:26 -04:00
W. Felix Handte	5e5f262612	Add (Possibly Empty) Info Strings to All Variadic Error Handling Macro Invocations	2020-05-04 10:58:55 -04:00
Nick Terrell	ac58c8d720	Fix copyright and license lines * All copyright lines now have -2020 instead of -present * All copyright lines include "Facebook, Inc" * All licenses are now standardized The copyright in `threading.{h,c}` is not changed because it comes from zstdmt. The copyright and license of `divsufsort.{h,c}` is not changed.	2020-03-26 17:02:06 -07:00
Nick Terrell	8d0ee37ac0	Align decompress sequences loop to 32+16 bytes The alignment is added before the loop, so this shouldn't hurt performance in any case. The only way it hurts is if there is already performance instability, and we force it to be stable but in the bad case. This consistently gets us into the good case with gcc-{7,8,9} on an Intel i9-9900K and clang-9. gcc-5 is 5% worse than its best case but has stable performance. We get consistently good behavior on my Macbook Pro compiled with both clang and gcc-8. It ends up in the 50% from DSB and 50% from MITE case, but the performance is the same as the 85% DSB case, so thats fine.	2020-03-23 19:40:31 -07:00
Nick Terrell	7627759b4e	Merge pull request #1972 from terrelln/check-cont Move ZSTD_checkContinuity() to zstd_decompress_block.c	2020-01-23 22:02:50 -08:00
Nick Terrell	cb2abc3dbe	Fix performance regression on aarch64 with clang	2020-01-23 17:31:14 -08:00
Nick Terrell	6e3cd5b024	Move ZSTD_checkContinuity() to zstd_decompress_block.c	2020-01-23 12:27:39 -08:00
Nick Terrell	718f00ff6f	Optimize decompression speed for gcc and clang (#1892 ) * Optimize `ZSTD_decodeSequence()` * Optimize Huffman decoding * Optimize `ZSTD_decompressSequences()` * Delete `ZSTD_decodeSequenceLong()`	2019-11-25 18:26:19 -08:00
Nick Terrell	9c1860861e	Fix assert in ZSTD_safecopy In the case that `op >= oend_w` it is possible that `diff < 8` because the two buffers could be adjacent. Credit to OSS-Fuzz, which found the bug. It isn't reproducible because it depends on the memory layout.	2019-10-28 17:51:17 -07:00
Nick Terrell	ddab2a94e8	Pass iend into ZSTD_storeSeq() to allow ZSTD_wildcopy()	2019-09-20 00:56:20 -07:00
Nick Terrell	cdad7fa512	Widen ZSTD_wildcopy to 32 bytes	2019-09-20 00:52:15 -07:00
Nick Terrell	efd37a64ea	Optimize decompression and fix wildcopy overread * Bump `WILDCOPY_OVERLENGTH` to 16 to fix the wildcopy overread. * Optimize `ZSTD_wildcopy()` by removing unnecessary branches and unrolling the loop. * Extract `ZSTD_overlapCopy8()` into its own function. * Add `ZSTD_safecopy()` for `ZSTD_execSequenceEnd()`. It is optimized for single long sequences, since that is the important case that can end up in `ZSTD_execSequenceEnd()`. Without this optimization, decompressing a block with 1 long match goes from 5.7 GB/s to 800 MB/s. * Refactor `ZSTD_execSequenceEnd()`. * Increase the literal copy shortcut to 16. * Add a shortcut for offset >= 16. * Simplify `ZSTD_execSequence()` by pushing more cases into `ZSTD_execSequenceEnd()`. * Delete `ZSTD_execSequenceLong()` since it is exactly the same as `ZSTD_execSequence()`. clang-8 seeds +17.5% on silesia and +21.8% on enwik8. gcc-9 sees +12% on silesia and +15.5% on enwik8. TODO: More detailed measurements, and on more datasets. Crdit to OSS-Fuzz for finding the wildcopy overread.	2019-09-19 21:07:14 -07:00
mgrice	b830599582	Improvements in zstd decode performance Summary: The idea behind wildcopy is that it can be cheaper to copy more bytes (say 8) than it is to copy less (say, 3). This change takes that further by exploiting some properties: 1. it's almost always OK to copy 16 bytes instead of 8, which means fewer copy instructions, and fewer branches 2. A 16 byte chunk size means that ~90% of wildcopy invocations will have a trip count of 1, so branch prediction will be improved. Speedup on Xeon E5-2680v4 is in the range of 3-5%. Measured wildcopy length distributions on silesia.tar: level <=8 <=16 <=24 >24 1 78.05% 11.49% 3.52% 6.94% 3 82.14% 8.99% 2.44% 6.43% 6 85.81% 6.51% 2.92% 4.76% 8 83.02% 7.31% 3.64% 6.03% 10 84.13% 6.67% 3.29% 5.91% 15 77.58% 7.55% 5.21% 9.66% 16 80.07% 7.20% 3.98% 8.75% Test Plan: benchmark silesia, make check	2019-08-29 12:25:56 -07:00
Yann Collet	0b0b83e8f3	fix test 122 it's an unsupported scenario.	2019-08-03 16:51:26 +02:00
mgrice	812e8f2a16	perf improvements for zstd decode (#1668 ) * perf improvements for zstd decode tldr: 7.5% average decode speedup on silesia corpus at compression levels 1-3 (sandy bridge) Background: while investigating zstd perf differences between clang and gcc I noticed that even though gcc is vectorizing the loop in in wildcopy, it was not being done as well as could be done by hand. The sites where wildcopy is invoked have an interesting distribution of lengths to be copied. The loop trip count is rarely above 1, yet long copies are common enough to make their performance important.The code in zstd_decompress.c to invoke wildcopy handles the latter well but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization. See how GCC autovectorizes the loop here: https://godbolt.org/z/apr0x0 Here is the code after this diff has been applied: (left hand side is the good one, right is with vectorizer on) After: https://godbolt.org/z/OwO4F8 Note that autovectorization still does not do a good job on the optimized version, so it's turned off\ via attribute and flag. I found that neither attribute nor command-line flag were entirely successful in turning off vectorization, which is why there were both. silesia benchmark data - second triad of each file is with the original code: file orig compressedratio encode decode change 1#dickens 10192446-> 4268865(2.388), 198.9MB/s 709.6MB/s 2#dickens 10192446-> 3876126(2.630), 128.7MB/s 552.5MB/s 3#dickens 10192446-> 3682956(2.767), 104.6MB/s 537MB/s 1#dickens 10192446-> 4268865(2.388), 195.4MB/s 659.5MB/s 7.60% 2#dickens 10192446-> 3876126(2.630), 127MB/s 516.3MB/s 7.01% 3#dickens 10192446-> 3682956(2.767), 105MB/s 479.5MB/s 11.99% 1#mozilla 51220480-> 20117517(2.546), 285.4MB/s 734.9MB/s 2#mozilla 51220480-> 19067018(2.686), 220.8MB/s 686.3MB/s 3#mozilla 51220480-> 18508283(2.767), 152.2MB/s 669.4MB/s 1#mozilla 51220480-> 20117517(2.546), 283.4MB/s 697.9MB/s 5.30% 2#mozilla 51220480-> 19067018(2.686), 225.9MB/s 665MB/s 3.20% 3#mozilla 51220480-> 18508283(2.767), 154.5MB/s 640.6MB/s 4.50% 1#mr 9970564-> 3840242(2.596), 262.4MB/s 899.8MB/s 2#mr 9970564-> 3600976(2.769), 181.2MB/s 717.9MB/s 3#mr 9970564-> 3563987(2.798), 116.3MB/s 620MB/s 1#mr 9970564-> 3840242(2.596), 253.2MB/s 827.3MB/s 8.76% 2#mr 9970564-> 3600976(2.769), 177.4MB/s 655.4MB/s 9.54% 3#mr 9970564-> 3563987(2.798), 111.2MB/s 564.2MB/s 9.89% 1#nci 33553445-> 2849306(11.78), 575.2MB/s , 1335.8MB/s 2#nci 33553445-> 2890166(11.61), 509.3MB/s , 1238.1MB/s 3#nci 33553445-> 2857408(11.74), 431MB/s , 1210.7MB/s 1#nci 33553445-> 2849306(11.78), 565.4MB/s , 1220.2MB/s 9.47% 2#nci 33553445-> 2890166(11.61), 508.2MB/s , 1128.4MB/s 9.72% 3#nci 33553445-> 2857408(11.74), 429.1MB/s , 1097.7MB/s 10.29% 1#ooffice 6152192-> 3590954(1.713), 231.4MB/s , 662.6MB/s 2#ooffice 6152192-> 3323931(1.851), 162.8MB/s , 592.6MB/s 3#ooffice 6152192-> 3145625(1.956), 99.9MB/s , 549.6MB/s 1#ooffice 6152192-> 3590954(1.713), 224.7MB/s , 624.2MB/s 6.15% 2#ooffice 6152192-> 3323931 (1.851), 155MB/s , 564.5MB/s 4.98% 3#ooffice 6152192-> 3145625(1.956), 101.1MB/s , 521.2MB/s 5.45% 1#osdb 10085684-> 3739042(2.697), 271.9MB/s 876.4MB/s 2#osdb 10085684-> 3493875(2.887), 208.2MB/s 857MB/s 3#osdb 10085684-> 3515831(2.869), 135.3MB/s 805.4MB/s 1#osdb 10085684-> 3739042(2.697), 257.4MB/s 793.8MB/s 10.41% 2#osdb 10085684-> 3493875(2.887), 209.7MB/s 776.1MB/s 10.42% 3#osdb 10085684-> 3515831(2.869), 130.6MB/s 727.7MB/s 10.68% 1#reymont 6627202-> 2152771(3.078), 198.9MB/s 696.2MB/s 2#reymont 6627202-> 2071140(3.200), 170MB/s 595.2MB/s 3#reymont 6627202-> 1953597(3.392), 128.5MB/s 609.7MB/s 1#reymont 6627202-> 2152771(3.078), 199.6MB/s 655.2MB/s 6.26% 2#reymont 6627202-> 2071140(3.200), 168.2MB/s 554.4MB/s 7.36% 3#reymont 6627202-> 1953597(3.392), 128.7MB/s 557.4MB/s 9.38% 1#samba 21606400-> 5510994(3.921), 338.1MB/s 1066MB/s 2#samba 21606400-> 5240208(4.123), 258.7MB/s 992.3MB/s 3#samba 21606400-> 5003358(4.318), 200.2MB/s 991.1MB/s 1#samba 21606400-> 5510994(3.921), 330.8MB/s 974MB/s 9.45% 2#samba 21606400-> 5240208(4.123), 257.9MB/s 919.4MB/s 7.93% 3#samba 21606400-> 5003358(4.318), 198.5MB/s 908.9MB/s 9.04% 1#sao 7251944-> 6256401(1.159), 194.6MB/s 602.2MB/s 2#sao 7251944-> 5808761(1.248), 128.2MB/s 532.1MB/s 3#sao 7251944-> 5556318(1.305), 73MB/s 509.4MB/s 1#sao 7251944-> 6256401(1.159), 198.7MB/s 580.7MB/s 3.70% 2#sao 7251944-> 5808761(1.248), 129.1MB/s 502.7MB/s 5.85% 3#sao 7251944-> 5556318(1.305), 74.6MB/s 493.1MB/s 3.31% 1#webster 41458703-> 13692222(3.028), 222.3MB/s 752MB/s 2#webster 41458703-> 12842646(3.228), 157.6MB/s 532.2MB/s 3#webster 41458703-> 12191964(3.400), 124MB/s 468.5MB/s 1#webster 41458703-> 13692222(3.028), 219.7MB/s 697MB/s 7.89% 2#webster 41458703-> 12842646(3.228), 153.9MB/s 495.4MB/s 7.43% 3#webster 41458703-> 12191964(3.400), 124.8MB/s 444.8MB/s 5.33% 1#xml 5345280-> 696652(7.673), 485MB/s , 1333.9MB/s 2#xml 5345280-> 681492(7.843), 405.2MB/s , 1237.5MB/s 3#xml 5345280-> 639057(8.364), 328.5MB/s , 1281.3MB/s 1#xml 5345280-> 696652(7.673), 473.1MB/s , 1232.4MB/s 8.24% 2#xml 5345280-> 681492(7.843), 398.6MB/s , 1145.9MB/s 7.99% 3#xml 5345280-> 639057(8.364), 327.1MB/s , 1175MB/s 9.05% 1#x-ray 8474240-> 6772557(1.251), 521.3MB/s 762.6MB/s 2#x-ray 8474240-> 6684531(1.268), 230.5MB/s 688.5MB/s 3#x-ray 8474240-> 6166679(1.374), 68.7MB/s 478.8MB/s 1#x-ray 8474240-> 6772557(1.251), 502.8MB/s 736.7MB/s 3.52% 2#x-ray 8474240-> 6684531(1.268), 224.4MB/s 662MB/s 4.00% 3#x-ray 8474240-> 6166679(1.374), 67.3MB/s 437.8MB/s 9.37% 7.51% * makefile changed to only pass -fno-tree-vectorize to gcc * <Replace this line with a title. Use 1 line only, 67 chars or less> Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__) * fix for warning/error with subtraction of void* pointers * fix c90 conformance issue - ISO C90 forbids mixed declarations and code * Fix assert for negative diff, only when there is no overlap * fix overflow revealed in fuzzing tests * tweak for small speed increase	2019-07-11 18:31:07 -04:00

1 2

72 Commits