The optimal parser is unlikely to be used in the Linux kernel in
practice. There is no reason these functions should be force-inlined:
we gain nothing from it and lose build size.
| Compiler | Before (Bytes) | After (Bytes) | Delta (Bytes) |
|----------|----------------|---------------|---------------|
| gcc-11 | 1142090 | 952754 | -189336 |
| clang-12 | 1228402 | 976290 | -252112 |
This is a temporary solution pending the resolution of PR #2862 in the
`dev` branch.
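For illustration, here is a minimal sketch of the kind of change involved, using a simplified gcc/clang spelling of zstd's `FORCE_INLINE_TEMPLATE` macro (the function names are stand-ins, not the actual zstd code):

```c
#include <stddef.h>

/* Simplified gcc/clang spelling of zstd's FORCE_INLINE_TEMPLATE macro. */
#define FORCE_INLINE_TEMPLATE static inline __attribute__((always_inline))

/* Before: every instantiation is forced into its caller, duplicating the
 * large parser body at each call site. */
FORCE_INLINE_TEMPLATE size_t parse_forced(const char* src, size_t n)
{
    (void)src;
    return n;
}

/* After: a plain static function; the compiler may keep it out of line,
 * so repeated instantiations no longer grow the build. */
static size_t parse_plain(const char* src, size_t n)
{
    (void)src;
    return n;
}
```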
Take the same approach as in PR #2828 to remove the functions that
force-inline many function bodies and `switch` between them. Instead,
create one function per "template" combination, then switch between these
functions. This allows the compiler to break the one large function into
many small functions, which generally helps codegen.
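As a rough sketch of the pattern (hypothetical names, not the actual zstd code), the `switch` moves out of the force-inlined body and into a small top-level dispatcher:

```c
#include <stddef.h>

typedef enum { noDict = 0, extDict = 1 } DictMode;

/* One small function per "template" combination; each can be compiled
 * independently instead of being inlined into one giant body. */
static size_t lazy_noDict(const char* src, size_t n)  { (void)src; return n; }
static size_t lazy_extDict(const char* src, size_t n) { (void)src; return n; }

/* Top-level dispatcher: the switch selects a function, rather than
 * branching inside a single force-inlined body. */
static size_t compressBlock_lazy(DictMode mode, const char* src, size_t n)
{
    switch (mode) {
    case extDict: return lazy_extDict(src, n);
    case noDict:
    default:      return lazy_noDict(src, n);
    }
}
```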
Also, in the `extDict` modes, when there is no ext-dict, call the
top-level function instead of the force-inlined one, to save on code size.
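Continuing the sketch above (again with hypothetical names), the no-ext-dict case can reuse the already-outlined top-level entry point:

```c
/* If the window contains no ext-dict data, defer to the top-level noDict
 * entry point rather than instantiating the force-inlined extDict body. */
static size_t lazy_extDict_entry(int hasExtDict, const char* src, size_t n)
{
    if (!hasExtDict)
        return compressBlock_lazy(noDict, src, n); /* reuse outlined code */
    return lazy_extDict(src, n);
}
```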
I'm specifically doing this because gcc on the parisc architecture doesn't
handle the large function body well, and ends up using a lot of excess
stack space. Outlining these functions fixes it.
Putting a stack marking into every assembly file is required to indicate
that the stack does not need to be executable.
An executable stack conflicts with some security measures, for example
systemd's MemoryDenyWriteExecute=yes.
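The usual ELF idiom for this marking is an empty `.note.GNU-stack` section. Since zstd's `.S` files go through the C preprocessor, it can be guarded roughly like this (a sketch; the exact guard used may differ):

```asm
#if defined(__ELF__) && defined(__GNUC__)
/* Mark the stack as non-executable, so the linker does not request an
 * executable stack for objects built from this file. */
.section .note.GNU-stack, "", %progbits
#endif
```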
Previously, if an index was equal to `reducerValue + 1`, it would get remapped
during index reduction to 1, i.e. `ZSTD_DUBT_UNSORTED_MARK`. This can slightly
affect the parsing of the input by causing tree nodes to be nullified when they
otherwise wouldn't be. That hardly matters from a correctness or efficiency
perspective, but it does impact determinism.
So this commit changes index reduction so that no remapped index can collide
with `ZSTD_DUBT_UNSORTED_MARK`.
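A minimal sketch of the idea (not the exact `ZSTD_reduceTable()` code): reserve the low values so that no surviving index can land on the marker.

```c
#define ZSTD_DUBT_UNSORTED_MARK 1 /* reserved sentinel in the BT matchfinder */

/* Indices that would reduce to 0 or 1 are zeroed out (treated as invalid),
 * so no remapped index can collide with ZSTD_DUBT_UNSORTED_MARK. */
static unsigned reduceIndex(unsigned index, unsigned reducerValue)
{
    if (index < reducerValue + 2)
        return 0;
    return index - reducerValue;
}
```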
Moving the tables into their own file is clearer than finding them somewhere
in the middle of `compress.c`. Also, down the line, it may allow zstd to
feature adjusted tables depending on the target CPU.
Speed up compilation times by splitting each specialized search function
out into its own standalone function. This is faster because compilers can
handle many smaller functions much better than one gigantic function. The
previous approach generated one giant function with `switch` statements and
inlining to select the implementation.
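The new pattern looks roughly like this (hypothetical names; a macro stamps out one small function per specialization, and a tiny dispatcher selects between them):

```c
#include <stddef.h>

/* One small search function per minimum-match-length specialization,
 * generated by a macro instead of inlined into one giant body. */
#define GEN_SEARCH_FN(mls)                                         \
    static size_t search_mls##mls(const char* src, size_t n)       \
    {                                                              \
        (void)src;                                                 \
        return n + (mls); /* stand-in for the real search body */  \
    }

GEN_SEARCH_FN(4)
GEN_SEARCH_FN(5)
GEN_SEARCH_FN(6)

/* The dispatcher switches between compiled functions, so the compiler
 * only ever sees many small bodies rather than one gigantic one. */
static size_t search(unsigned mls, const char* src, size_t n)
{
    switch (mls) {
    case 4:  return search_mls4(src, n);
    case 5:  return search_mls5(src, n);
    default: return search_mls6(src, n);
    }
}
```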
| Compiler | Flags | Dev Time (s) | PR Time (s) | Delta |
|----------|-------------------------------------|--------------|-------------|-------|
| gcc | -O3 | 16.5 | 5.6 | -66% |
| gcc | -O3 -g -fsanitize=address,undefined | 158.9 | 38.2 | -75% |
| clang | -O3 | 36.5 | 5.5 | -85% |
| clang | -O3 -g -fsanitize=address,undefined | 27.8 | 17.5 | -37% |
This also reduces the binary size because the search functions are no
longer inlined into the main body.
| Compiler | Dev libzstd.a Size (Bytes) | PR libzstd.a Size (Bytes) | Delta |
|----------|----------------------------|---------------------------|-------|
| gcc | 1563868 | 1308844 | -16% |
| clang | 1924372 | 1376020 | -28% |
Finally, performance is not significantly impacted by this change; in
fact, we generally see a small speed boost.
| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc | 5 | 110.6 | 110.0 | -0.5% |
| gcc | 7 | 70.4 | 72.2 | +2.5% |
| gcc | 9 | 53.2 | 53.5 | +0.5% |
| gcc | 13 | 12.7 | 12.9 | +1.5% |
| clang | 5 | 113.9 | 110.4 | -3.0% |
| clang | 7 | 67.7 | 70.6 | +4.2% |
| clang | 9 | 51.9 | 52.2 | +0.5% |
| clang | 13 | 12.4 | 13.3 | +7.2% |
The compression strategy is unmodified in this PR, so the compressed size
should be exactly the same. I may follow up with a PR that slightly improves
the compression ratio, if it doesn't cost too much speed.