new version easier to vectorize
leads to smaller code and faster execution
notably at the last recombination stage
(basically, fixed cost per block).
Assembly inspected with godbolt
On my laptop, with `clang` and `-mavx2` :
2K block : 1280 MB/s -> 1550 MB/s
8K block : 1750 MB/s -> 1860 MB/s