Gives a ~40% speedup on x86_64.
However, the generic code remains faster on aarch64.
This is still processing only one block at a time for now.
I'm pretty confident that processing more blocks per round
will eventually give a substantial performance improvement on
all platforms with vector units.