This assert slows the loop down by 10x. We can get similar
coverage by asserting at the beginning & end of the loop.
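As a rough illustration of the idea (a minimal sketch with hypothetical names, not the actual zstd code), the per-iteration assert becomes invariant checks at the loop boundaries:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: check the invariant once before and once after the
 * hot loop instead of on every iteration. */
static size_t sum_bytes(const unsigned char* ip, const unsigned char* iend,
                        const unsigned char* bound)
{
    size_t total = 0;
    assert(iend <= bound);   /* checked once at the beginning of the loop */
    while (ip < iend) {
        /* previously: assert(ip < bound); paid on every iteration */
        total += *ip++;
    }
    assert(ip <= bound);     /* and once at the end */
    return total;
}
```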
We need this fix because Debian compiles zstd with asserts
enabled. Separately, we should ask them why, and whether they would
consider disabling asserts in their builds, since we don't
optimize for assert-enabled builds.
Fixes Issue #3150.
ZSTD_seqSymbol is a structure that is 64 bits wide in total,
so on aarch64 it can be loaded in a single operation and its
fields extracted by simple shifts or bitfield extracts.
GCC doesn't recognize this and instead generates unnecessary
ldr/ldrb/ldrh operations, which causes a performance drop.
With this change, a 2~4% uplift is observed on silesia and
2.5~6% on cantrbry @L8 on Arm N1.
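A minimal sketch of the intended codegen pattern, assuming the ZSTD_seqSymbol field layout (U16 nextState; BYTE nbAdditionalBits; BYTE nbBits; U32 baseValue) and a little-endian aarch64 target; names are illustrative:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t nextState;
    uint8_t  nbAdditionalBits;
    uint8_t  nbBits;
    uint32_t baseValue;
} seqSymbol;   /* stand-in for ZSTD_seqSymbol; 64 bits in total */

/* One 64-bit load (a single ldr on aarch64)... */
static inline uint64_t seqSymbol_load64(const seqSymbol* s)
{
    uint64_t v;
    memcpy(&v, s, sizeof(v));
    return v;
}

/* ...then extract each field with shifts/masks (little-endian layout). */
static inline uint32_t sym_nextState(uint64_t v) { return (uint32_t)(v & 0xFFFF); }
static inline uint32_t sym_nbAddBits(uint64_t v) { return (uint32_t)((v >> 16) & 0xFF); }
static inline uint32_t sym_nbBits(uint64_t v)    { return (uint32_t)((v >> 24) & 0xFF); }
static inline uint32_t sym_baseValue(uint64_t v) { return (uint32_t)(v >> 32); }
```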
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8
On aarch64, ZSTD_wildcopy uses a simple loop to perform
16B-based memory copies. An existing optimized two-stage copy
can achieve better performance.
Applying it to aarch64 also yields a ~1% uplift on the
silesia corpus.
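A minimal sketch of the two-stage idea (illustrative names and constants; it assumes, as ZSTD_wildcopy does, that the destination has enough slack past `length` to tolerate over-copying):

```c
#include <stddef.h>
#include <string.h>

static void wildcopy_sketch(void* dst, const void* src, size_t length)
{
    unsigned char*       op   = (unsigned char*)dst;
    const unsigned char* ip   = (const unsigned char*)src;
    unsigned char* const oend = op + length;

    /* Stage 1: one unconditional 16B copy covers the common short case. */
    memcpy(op, ip, 16);
    if (length <= 16) return;
    op += 16; ip += 16;

    /* Stage 2: wide strided copy; over-copying past `length` is allowed. */
    do {
        memcpy(op,      ip,      16);
        memcpy(op + 16, ip + 16, 16);
        op += 32; ip += 32;
    } while (op < oend);
}
```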
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be
We found that movemask is either not used properly or consumes too much CPU.
This effort optimizes the movemask emulation on ARM.
For levels 8-9 we saw 3-5% improvements; for level 10 we saw a 1.5%
improvement.
The key idea is not to use pure movemasks but to work with groups of bits.
For rowEntries == 16 and 32 the groups have size 4 and 2, respectively.
Each bit is duplicated within its group; we then AND with a mask so that
only one bit is set per group, which keeps the clear-lowest-bit iteration
`a &= (a - 1)` working.
Also, aarch64 has no 16-bit rotate instructions, only 32- and 64-bit
ones, which is why we see larger improvements for levels 8-9.
The vshrn_n_u16 instruction is used to achieve this: it shifts every u16
right by 4 and narrows it to the lower 8 bits; see the sketch below. The
same trick is used in
[Folly](c570259008/folly/container/detail/F14Table.h (L446)).
It takes 2 cycles according to the Neoverse-N{1,2} guidelines.
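A minimal sketch for the rowEntries == 16 case (illustrative names, not the exact zstd code, and it omits the rotation by the row head):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Builds a 64-bit "grouped movemask": each of the 16 row bytes contributes
 * a 4-bit group, and the final AND keeps a single bit per group so that
 * `a &= (a - 1)` iteration visits one match at a time. */
static uint64_t grouped_matches_16(const uint8_t* row, uint8_t tag)
{
    const uint8x16_t chunk     = vld1q_u8(row);
    const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
    const uint8x8_t  packed    = vshrn_n_u16(equalMask, 4);  /* 0xFF byte -> 0xF nibble */
    const uint64_t   matches   = vget_lane_u64(vreinterpret_u64_u8(packed), 0);
    return matches & 0x8888888888888888ULL;  /* one bit set per 4-bit group */
}
```

The index of a match is then the trailing-zero count of the lowest set bit divided by 4.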
The 64-bit movemask is already well optimized. We have ongoing experiments,
but have not been able to validate that other implementations are reliably faster.
This commit avoids checking whether a hashtable write is safe in two of the
three match-found paths in `ZSTD_compressBlock_fast_noDict_generic`. This
produces a ~0.5% speed-up in compression.
A comment in the code describes why we can skip this check in the other two
paths (the repcode check and the first match check in the unrolled loop).
A downside is that at the new position where we make this check, we have not
yet computed `mLength`. We therefore have to avoid writing *possibly* dangerous
positions, rather than the old check, which only avoided writing *actually*
dangerous positions. This leads to a minuscule loss in ratio (remember that
this scenario can only be triggered at very negative levels or under
incompressibility acceleration).
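A purely conceptual sketch of the trade-off described above (hypothetical identifiers, not the actual zstd fast-compressor code): the old guard needs `mLength`, the new one does not, so it has to be conservative.

```c
#include <stdint.h>
#include <stddef.h>

/* Old position of the check: mLength is known, so only writes that are
 * actually out of bounds get skipped. */
static void recordPos_exact(uint32_t* hashTable, size_t h,
                            uint32_t pos, uint32_t mLength, uint32_t maxEnd)
{
    if (pos + mLength <= maxEnd) hashTable[h] = pos;
}

/* New position of the check: mLength is not computed yet, so any possibly
 * dangerous position is skipped, costing a minuscule amount of ratio. */
static void recordPos_conservative(uint32_t* hashTable, size_t h,
                                   uint32_t pos, uint32_t maxSafePos)
{
    if (pos <= maxSafePos) hashTable[h] = pos;
}
```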
Partial, Meson-only implementation of #2976 for non-MSVC builds.
Due to the prevalence of private symbol reuse, linking to a shared
library is simply unreliable, but we still want to defer to the
shared library for installable applications. By linking to both, we can
share symbols where possible and statically link where needed.
This means we no longer need to manually track every file that needs to
be extracted and reused.
The flip side is that MSVC does not support this at all, so for MSVC
builds we just link to a full static copy even where
-Ddefault_library=shared.
As a side benefit, by using library inclusion rather than including
extra explicit object files, the zstd program shrinks in size slightly
(~4kb).
These need to be explicitly included as we use their private symbols,
but we don't need to recompile them when we can reuse the existing
objects.
Minus 7 compile steps.
The poolTests program already linked to libzstd, and later to
libtestcommon with the libzstd objects included, so this was redundant.
Minus 4 compile steps.