59 Commits

Author SHA1 Message Date
Ilya Tokar
7c3d1cb3ab Enable STATIC_BMI2 for gcc/clang
Some usage (e.g. BIT_getLowerBit) uses it without checking for MSVC,
so enabling for clang gives a small performance boost.
2022-03-03 15:03:54 -05:00
Nick Terrell
c284569457 [asm] Share portability macros and restrict ASM further
Move portability macros to `lib/common/portability_macros.h`. This file
only contains platform/feature detection (e.g. 0/1 macros). This file is
shared between C and ASM code, so it cannot include any C code.

Rename `HUF_` ASM macros to be `ZSTD_` prefixed, and move to the new
header.

Restrict `ZSTD_ASM_SUPPORTED` to `__GNUC__`, because we need the GAS
assembler.

Finally, only include the ASM code if we are actually going to use it.
This disables it on all Windows platforms, which should resolve the
problem brought up in Issue #2789.
2021-12-02 16:58:04 -08:00
Yann Collet
1bf3d8a475
Merge pull request #2896 from facebook/m68k
Zstandard compiles and run on m68k cpus
2021-12-02 14:25:45 -08:00
Yann Collet
30b9db8ae4 changed macro name to ZSTD_ALIGNOF
for better consistency
2021-12-02 12:57:42 -08:00
Yann Collet
1d025d871b bound alignment backup to sizeof(void*) 2021-12-02 11:30:03 -08:00
Yann Collet
80a13fd645 move the alignment macro to compiler.h
because mem.h is dropped in the Linux kernel.

Changed macro definition order (gcc/clang/msvc before c11)
due to a limitation in the kernel source builder.

Changed the backup to sizeof(),
reverting to previous behavior when no support of alignof() is detected.
2021-12-02 11:20:01 -08:00
Nick Terrell
91f5891dd0 [CircleCI] Fix short-tests-0
short-tests-0 were silently failing. I think because of the && make clean construction. Switch to ; instead.

Also fix all the test failures that were exposed.

`make all` is failing on CircleCI because it is missing Docker. Move that test
to GitHub actions, and switch the pedantic CircleCI test to `make allmost`.
2021-12-01 17:43:46 -08:00
Nick Terrell
5414dd7978 [bmi2] Add lzcnt and bmi target attributes
* When dynamic dispatching to bmi2 add lzcnt and bmi to the
  TARGET_ATTRIBUTE.
* Centralize the bmi2 TARGET_ATTRIBUTE definition to
  BMI2_TARGET_ATTRIBUTE so we can change it in the future.
* Only enable bmi2 when both bmi1 & bmi2 are supported. There shouldn't
  be any cases where bmi2 is supported but bmi1 isn't. But, since we are
  using the instruction we should check bmi1 as well.
2021-11-30 17:54:56 -08:00
Dimitris Apostolou
ebbd675998
Fix typos 2021-11-13 10:04:04 +02:00
Nick Terrell
14772d97be
Merge pull request #2796 from terrelln/linux-fixes
[lib] Make lib compatible with `-Wfall-through` excepting legacy
2021-09-23 16:11:53 -07:00
Nick Terrell
d7ef97a013 [build] Fix oss-fuzz build with the dataflow sanitizer
The dataflow sanitizer requires all code to be instrumented. We can't
instrument the ASM function, so we have to disable it.
2021-09-23 11:48:39 -07:00
Nick Terrell
189e87bcbe [lib] Make lib compatible with -Wfall-through excepting legacy
Switch to a macro `ZSTD_FALLTHROUGH;` instead of a comment. On supported
compilers this uses an attribute, otherwise it becomes a comment.

This is necessary to be compatible with clang's `-Wfall-through`, and
gcc's `-Wfall-through=2` which don't support comments. Without this the
linux build emits a bunch of warnings.

Also add a test to CI to ensure that we don't regress.
2021-09-23 10:51:18 -07:00
Nick Terrell
a5f2c45528 Huffman ASM 2021-09-20 14:46:43 -07:00
makise-homura
3cd085cec3 Clarify no-tree-vectorize usage for ICC and LCC 2021-07-14 20:00:44 +03:00
makise-homura
a5f518ae27 Change zstdcli's main() declaration due to -Wmain on some compilers 2021-07-14 19:55:47 +03:00
makise-homura
d4ad02c721 Add support for MCST LCC compiler 2021-07-10 03:57:06 +03:00
aqrit
dd4f6aa9e6
Flatten ZSTD_row_getMatchMask (#2681)
* Flatten ZSTD_row_getMatchMask

* Remove the SIMD abstraction layer.
* Add big endian support.
* Align `hashTags` within `tagRow` to a 16-byte boundary. 
* Switch SSE2 to use aligned reads.
* Optimize scalar path using SWAR.
* Optimize neon path for `n == 32`
* Work around minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577)

* replace memcpy with MEM_readST

* silence alignment warnings

* fix neon casts

* Update zstd_lazy.c

* unify simd preprocessor detection (#3)

* remove duplicate asserts

* tweak rotates

* improve endian detection

* add cast

there is a fun little catch-22 with gcc: result from pmovmskb has to be cast to uint32_t to avoid a zero-extension
but must be uint16_t to get gcc to generate a rotate instruction..

* more casts

* fix casts

better work-around for the (bogus) warning: unary minus on unsigned
2021-06-09 08:50:25 +03:00
Nick Terrell
a494308ae9 [copyright][license] Switch to yearless copyright and some cleanup in the linux-kernel files
* Switch to yearless copyright per FB policy
* Fix up SPDX-License-Identifier lines in `contrib/linux-kernel` sources
* Add zstd copyright/license header to the `contrib/linux-kernel` sources
* Update the `tests/test-license.py` to check for yearless copyright
* Improvements to `tests/test-license.py`
* Check `contrib/linux-kernel` in `tests/test-license.py`
2021-03-30 10:30:43 -07:00
Nick Terrell
54a4998a80 Add basic tracing functionality 2021-02-05 16:28:52 -08:00
Nick Terrell
66e811d782 [license] Update year to 2021 2021-01-04 17:53:52 -05:00
Nick Terrell
caecd8c211 Allow user to override ASAN/MSAN detection
Rename ADDRESS_SANITIZER -> ZSTD_ADDRESS_SANITIZER and same for
MEMORY_SANITIZER. Also set it to 0/1 instead of checking for defined.
This allows the user to override ASAN/MSAN detection for platforms that
don't support it.
2020-09-24 19:42:04 -07:00
Nick Terrell
9ae0483858 Reorganize zstd_deps.h and mem.h + replace mem.h for the kernel 2020-09-24 19:41:59 -07:00
Nick Terrell
260fc75028 Move __has_builtin() fallback define to compiler.h 2020-09-24 15:51:08 -07:00
Nick Terrell
4d63ee57f5 Move ASAN/MSAN support declarations to compiler.h 2020-09-24 15:51:08 -07:00
Nick Terrell
e3bda594ae Prefer __builtin_prefetch over inline asm
Reorder the ifdefs for the PREFETCH macros so that the compiler builtin is
favored over the inline assembly for aarch64.
2020-08-10 22:17:18 -07:00
Niadb
493fd40dca
Add files via upload 2020-07-28 02:52:15 -06:00
Niadb
74f65f624c
Update compiler.h
clean wording
2020-06-19 09:51:00 -06:00
Niadb
8c115cbe23
Update compiler.h
Added a comment explaining the purpose of the WIN_CDECL macro
2020-06-19 09:48:35 -06:00
Niadb
2962fda93f
Add files via upload 2020-06-19 03:34:05 -06:00
Niadb
a4c8aa5e02
Add files via upload 2020-06-19 03:31:47 -06:00
W. Felix Handte
952427aebf Avoid inline Keyword in C90
Previously we would use it for all gcc-like compilations, even when a
restrictive mode that disallowed it had been selected.
2020-05-04 10:59:15 -04:00
Nick Terrell
5fcbc484c8
Merge pull request #2040 from caoyzh/dev-2
Optimize by prefetching on aarch64
2020-04-08 13:14:47 -07:00
Nick Terrell
ac58c8d720 Fix copyright and license lines
* All copyright lines now have -2020 instead of -present
* All copyright lines include "Facebook, Inc"
* All licenses are now standardized

The copyright in `threading.{h,c}` is not changed because it comes from
zstdmt.

The copyright and license of `divsufsort.{h,c}` is not changed.
2020-03-26 17:02:06 -07:00
caoyzh
7201980650 Optimize by prefetching on aarch64 2020-03-14 15:25:59 +08:00
Bimba Shrestha
85d0efd619 Removing no-tree-vectorize for intel 2020-03-05 10:02:48 -08:00
Nick Terrell
718f00ff6f
Optimize decompression speed for gcc and clang (#1892)
* Optimize `ZSTD_decodeSequence()`
* Optimize Huffman decoding
* Optimize `ZSTD_decompressSequences()`
* Delete `ZSTD_decodeSequenceLong()`
2019-11-25 18:26:19 -08:00
Nick Terrell
5cb7615f1f Add UNUSED_ATTR to ZSTD_storeSeq() 2019-09-20 21:37:13 -07:00
Carl Woffenden
901ea61f83 Tweaks to create a single-file decoder
The CHECK_F macros differ slightly (but eventually do the same thing). Older GCC needs to fallback on the old-style pragma optimisation flags.
2019-08-21 17:49:17 +02:00
Joseph Chen
3855bc4295 Add support for IAR C/C++ Compiler for Arm 2019-07-29 15:25:58 +08:00
mgrice
812e8f2a16 perf improvements for zstd decode (#1668)
* perf improvements for zstd decode

tldr: 7.5% average decode speedup on silesia corpus at compression levels 1-3 (sandy bridge)

Background: while investigating zstd perf differences between clang and gcc I noticed that even though gcc is vectorizing the loop in in wildcopy, it was not being done as well as could be done by hand.  The sites where wildcopy is invoked have an interesting distribution of lengths to be copied.  The loop trip count is rarely above 1, yet long copies are common enough to make their performance important.The code in zstd_decompress.c to invoke wildcopy handles the latter well but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization.

See how GCC autovectorizes the loop here:
https://godbolt.org/z/apr0x0

Here is the code after this diff has been applied: (left hand side is the good one, right is with vectorizer on)
After: https://godbolt.org/z/OwO4F8

Note that autovectorization still does not do a good job on the optimized version, so it's turned off\
 via attribute and flag.  I found that neither attribute nor command-line flag were entirely successful in turning off vectorization, which is why there were both.

    silesia benchmark data - second triad of each file is with the original code:

    file      orig        compressedratio     encode              decode           change
    1#dickens   10192446->   4268865(2.388),       198.9MB/s           709.6MB/s
    2#dickens   10192446->   3876126(2.630),       128.7MB/s           552.5MB/s
    3#dickens   10192446->   3682956(2.767),       104.6MB/s             537MB/s
    1#dickens   10192446->   4268865(2.388),       195.4MB/s           659.5MB/s     7.60%
    2#dickens   10192446->   3876126(2.630),         127MB/s           516.3MB/s     7.01%
    3#dickens   10192446->   3682956(2.767),         105MB/s           479.5MB/s    11.99%
    1#mozilla   51220480->  20117517(2.546),       285.4MB/s           734.9MB/s
    2#mozilla   51220480->  19067018(2.686),       220.8MB/s           686.3MB/s
    3#mozilla   51220480->  18508283(2.767),       152.2MB/s           669.4MB/s
    1#mozilla   51220480->  20117517(2.546),       283.4MB/s           697.9MB/s     5.30%
    2#mozilla   51220480->  19067018(2.686),       225.9MB/s             665MB/s     3.20%
    3#mozilla   51220480->  18508283(2.767),       154.5MB/s           640.6MB/s     4.50%
    1#mr         9970564->   3840242(2.596),       262.4MB/s           899.8MB/s
    2#mr         9970564->   3600976(2.769),       181.2MB/s           717.9MB/s
    3#mr         9970564->   3563987(2.798),       116.3MB/s             620MB/s
    1#mr         9970564->   3840242(2.596),       253.2MB/s           827.3MB/s     8.76%
    2#mr         9970564->   3600976(2.769),       177.4MB/s           655.4MB/s     9.54%
    3#mr         9970564->   3563987(2.798),       111.2MB/s           564.2MB/s     9.89%
    1#nci       33553445->   2849306(11.78),       575.2MB/s ,        1335.8MB/s
    2#nci       33553445->   2890166(11.61),       509.3MB/s ,        1238.1MB/s
    3#nci       33553445->   2857408(11.74),         431MB/s ,        1210.7MB/s
    1#nci       33553445->   2849306(11.78),       565.4MB/s ,        1220.2MB/s     9.47%
    2#nci       33553445->   2890166(11.61),       508.2MB/s ,        1128.4MB/s     9.72%
    3#nci       33553445->   2857408(11.74),       429.1MB/s ,        1097.7MB/s    10.29%
    1#ooffice    6152192->   3590954(1.713),       231.4MB/s ,         662.6MB/s
    2#ooffice    6152192->   3323931(1.851),       162.8MB/s ,         592.6MB/s
    3#ooffice    6152192->   3145625(1.956),        99.9MB/s ,         549.6MB/s
    1#ooffice    6152192->   3590954(1.713),       224.7MB/s ,         624.2MB/s     6.15%
    2#ooffice    6152192->   3323931 (1.851),        155MB/s ,         564.5MB/s     4.98%
    3#ooffice    6152192->   3145625(1.956),       101.1MB/s ,         521.2MB/s     5.45%
    1#osdb      10085684->   3739042(2.697),       271.9MB/s           876.4MB/s
    2#osdb      10085684->   3493875(2.887),       208.2MB/s             857MB/s
    3#osdb      10085684->   3515831(2.869),       135.3MB/s           805.4MB/s
    1#osdb      10085684->   3739042(2.697),       257.4MB/s           793.8MB/s    10.41%
    2#osdb      10085684->   3493875(2.887),       209.7MB/s           776.1MB/s    10.42%
    3#osdb      10085684->   3515831(2.869),       130.6MB/s           727.7MB/s    10.68%
    1#reymont    6627202->   2152771(3.078),       198.9MB/s           696.2MB/s
    2#reymont    6627202->   2071140(3.200),         170MB/s           595.2MB/s
    3#reymont    6627202->   1953597(3.392),       128.5MB/s           609.7MB/s
    1#reymont    6627202->   2152771(3.078),       199.6MB/s           655.2MB/s     6.26%
    2#reymont    6627202->   2071140(3.200),       168.2MB/s           554.4MB/s     7.36%
    3#reymont    6627202->   1953597(3.392),       128.7MB/s           557.4MB/s     9.38%
    1#samba     21606400->   5510994(3.921),       338.1MB/s            1066MB/s
    2#samba     21606400->   5240208(4.123),       258.7MB/s           992.3MB/s
    3#samba     21606400->   5003358(4.318),       200.2MB/s           991.1MB/s
    1#samba     21606400->   5510994(3.921),       330.8MB/s             974MB/s     9.45%
    2#samba     21606400->   5240208(4.123),       257.9MB/s           919.4MB/s     7.93%
    3#samba     21606400->   5003358(4.318),       198.5MB/s           908.9MB/s     9.04%
    1#sao        7251944->   6256401(1.159),       194.6MB/s           602.2MB/s
    2#sao        7251944->   5808761(1.248),       128.2MB/s           532.1MB/s
    3#sao        7251944->   5556318(1.305),          73MB/s           509.4MB/s
    1#sao        7251944->   6256401(1.159),       198.7MB/s           580.7MB/s     3.70%
    2#sao        7251944->   5808761(1.248),       129.1MB/s           502.7MB/s     5.85%
    3#sao        7251944->   5556318(1.305),        74.6MB/s           493.1MB/s     3.31%
    1#webster   41458703->  13692222(3.028),       222.3MB/s             752MB/s
    2#webster   41458703->  12842646(3.228),       157.6MB/s           532.2MB/s
    3#webster   41458703->  12191964(3.400),         124MB/s           468.5MB/s
    1#webster   41458703->  13692222(3.028),       219.7MB/s             697MB/s     7.89%
    2#webster   41458703->  12842646(3.228),       153.9MB/s           495.4MB/s     7.43%
    3#webster   41458703->  12191964(3.400),       124.8MB/s           444.8MB/s     5.33%
    1#xml        5345280->    696652(7.673),         485MB/s ,        1333.9MB/s
    2#xml        5345280->    681492(7.843),       405.2MB/s ,        1237.5MB/s
    3#xml        5345280->    639057(8.364),       328.5MB/s ,        1281.3MB/s
    1#xml        5345280->    696652(7.673),       473.1MB/s ,        1232.4MB/s     8.24%
    2#xml        5345280->    681492(7.843),       398.6MB/s ,        1145.9MB/s     7.99%
    3#xml        5345280->    639057(8.364),       327.1MB/s ,          1175MB/s     9.05%
    1#x-ray      8474240->   6772557(1.251),       521.3MB/s           762.6MB/s
    2#x-ray      8474240->   6684531(1.268),       230.5MB/s           688.5MB/s
    3#x-ray      8474240->   6166679(1.374),        68.7MB/s           478.8MB/s
    1#x-ray      8474240->   6772557(1.251),       502.8MB/s           736.7MB/s     3.52%
    2#x-ray      8474240->   6684531(1.268),       224.4MB/s             662MB/s     4.00%
    3#x-ray      8474240->   6166679(1.374),        67.3MB/s           437.8MB/s     9.37%

                                                                                     7.51%

* makefile changed to only pass -fno-tree-vectorize to gcc

* <Replace this line with a title. Use 1 line only, 67 chars or less>

Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__)

* fix for warning/error with subtraction of void* pointers

* fix c90 conformance issue - ISO C90 forbids mixed declarations and code

* Fix assert for negative diff, only when there is no overlap

* fix overflow revealed in fuzzing tests

* tweak for small speed increase
2019-07-11 18:31:07 -04:00
Josh Soref
a880ca239b Spelling (#1582)
* spelling: accidentally

* spelling: across

* spelling: additionally

* spelling: addresses

* spelling: appropriate

* spelling: assumed

* spelling: available

* spelling: builder

* spelling: capacity

* spelling: compiler

* spelling: compressibility

* spelling: compressor

* spelling: compression

* spelling: contract

* spelling: convenience

* spelling: decompress

* spelling: description

* spelling: deflate

* spelling: deterministically

* spelling: dictionary

* spelling: display

* spelling: eliminate

* spelling: preemptively

* spelling: exclude

* spelling: failure

* spelling: independence

* spelling: independent

* spelling: intentionally

* spelling: matching

* spelling: maximum

* spelling: meaning

* spelling: mishandled

* spelling: memory

* spelling: occasionally

* spelling: occurrence

* spelling: official

* spelling: offsets

* spelling: original

* spelling: output

* spelling: overflow

* spelling: overridden

* spelling: parameter

* spelling: performance

* spelling: probability

* spelling: receives

* spelling: redundant

* spelling: recompression

* spelling: resources

* spelling: sanity

* spelling: segment

* spelling: series

* spelling: specified

* spelling: specify

* spelling: subtracted

* spelling: successful

* spelling: return

* spelling: translation

* spelling: update

* spelling: unrelated

* spelling: useless

* spelling: variables

* spelling: variety

* spelling: verbatim

* spelling: verification

* spelling: visited

* spelling: warming

* spelling: workers

* spelling: with
2019-04-12 11:18:11 -07:00
W. Felix Handte
9d5f3963ff Add Option to Not Request Inlining with ZSTD_NO_INLINE 2018-12-18 13:36:39 -08:00
Yann Collet
626040ab53 changed PREFETCH() macro into PREFETCH_L2()
which is more accurate
2018-11-12 17:05:32 -08:00
Yann Collet
9126da5b5c improve long-range decoder speed
on enwik9 at level 22 (which is almost a worst case scenario),
speed improves by +7% on my laptop (415 -> 445 MB/s)
2018-11-08 12:47:46 -08:00
Yann Collet
5512400677 updated code comments, based on @terrelln review 2018-09-13 16:44:04 -07:00
Yann Collet
2618253da2 fixed PREFETCH() macro
for corner cases and platforms without this instruction
2018-09-12 16:15:37 -07:00
Yann Collet
4de344d505 added conditional prefetch
depending on amount of work to do.
2018-09-12 10:29:47 -07:00
Yann Collet
63a519dbf6 implemented first prefetch
based on dictID.
dictContent is prefetched up to 32 KB
(no contentSize adaptation)
2018-09-11 17:23:44 -07:00
Yann Collet
bbd78df59b add build macro NO_PREFETCH
prevent usage of prefetch intrinsic commands
which are not supported by c2rust
(see https://github.com/immunant/c2rust/issues/13)
2018-07-06 17:06:04 -07:00
fbrosson
291824f49d __builtin_prefetch did probably not exist before gcc 3.1. 2018-05-18 18:40:11 +00:00