zstd/lib/compress
Danila Kutenin e11783b04d [lazy] Optimize ZSTD_row_getMatchMask for level 8-10
We found that movemask is not used properly or consumes too much CPU.
This effort helps to optimize the movemask emulation on ARM.

For level 8-9 we saw 3-5% improvements. For level 10 we say 1.5%
improvement.

The key idea is not to use pure movemasks but to have groups of bits.
For rowEntries == 16, 32 we are going to have groups of size 4 and 2
respectively. It means that each bit will be duplicated within the group

Then we do AND to have only one bit set in the group so that iteration
with lowering bit `a &= (a - 1)` works as well.

Also, aarch64 does not have rotate instructions for 16 bit, only for 32
and 64, that's why we see more improvements for level 8-9.

vshrn_n_u16 instruction is used to achieve that: vshrn_n_u16 shifts by
4 every u16 and narrows to 8 lower bits. See the picture below. It's
also used in
[Folly](c570259008/folly/container/detail/F14Table.h (L446)).
It also uses 2 cycles according to Neoverse-N{1,2} guidelines.

64 bit movemask is already well optimized. We have ongoing experiments
but were not able to validate other implementations work reliably faster.
2022-05-22 10:44:24 +00:00
..
2022-02-01 10:49:38 -08:00
2022-03-12 08:58:04 +01:00
2022-05-12 12:53:15 -04:00
2022-03-12 08:58:04 +01:00
2022-03-12 08:58:04 +01:00