This is essentially a 12-point sinc resampler, unless the input rate is higher
than the output rate, in which case it varies between 12 and 24 points and
applies anti-aliasing to reduce frequencies above Nyquist.
Code provided by Christopher Fitzgerald.
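The snippet below is not the contributed code. It is only a minimal sketch of
how a point count between 12 and 24 and a matching cutoff scale could be derived
from the two rates. The names sinc_points and cutoff_scale, the rounding to an
even count, and the linear growth with the rate ratio are all assumptions made
for illustration.

```c
/* Illustrative limits taken from the description above. */
#define SINC_MIN_POINTS 12
#define SINC_MAX_POINTS 24

/*
 * Hypothetical helper: number of filter points for a given rate pair.
 * Both rates are assumed to be nonzero.
 */
int sinc_points(unsigned int in_rate, unsigned int out_rate)
{
    unsigned long long n;

    /* Same rate or upsampling: the plain 12-point filter is enough. */
    if (in_rate <= out_rate)
        return SINC_MIN_POINTS;

    /* Downsampling: widen the filter with the ratio, rounding up. */
    n = ((unsigned long long)SINC_MIN_POINTS * in_rate + out_rate - 1) / out_rate;
    n += n & 1;                       /* keep the count even (assumption) */
    if (n > SINC_MAX_POINTS)
        n = SINC_MAX_POINTS;          /* never more than 24 points */
    return (int)n;
}

/*
 * Hypothetical helper: factor by which the sinc cutoff is lowered so that
 * content above the output Nyquist frequency is attenuated.
 */
double cutoff_scale(unsigned int in_rate, unsigned int out_rate)
{
    if (in_rate <= out_rate)
        return 1.0;                   /* no anti-aliasing needed */
    return (double)out_rate / (double)in_rate;
}
```

With these assumptions, 48000 Hz to 44100 Hz would use 14 points, and any ratio
of 2:1 or more would use the full 24.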
SSE intrinsics such as _mm_set_ps take their components in reverse order, so
component 0 is the last argument and sits at the lowest address in memory.
_mm_load_* and _mm_loadu_*, and the corresponding stores, do not change the
memory ordering.
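A small check of the ordering, assuming an x86 target with SSE enabled. The
function names are made up for this example; the intrinsics themselves are the
standard ones from xmmintrin.h.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* _mm_set_ps lists the highest component first: (e3, e2, e1, e0). */
void store_set_ps(float out[4])
{
    __m128 v = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    _mm_storeu_ps(out, v);            /* out[0] receives component 0 */
}

/* _mm_setr_ps takes the same components in memory order: (e0, e1, e2, e3). */
void store_setr_ps(float out[4])
{
    __m128 v = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    _mm_storeu_ps(out, v);
}

/* An unaligned load followed by an unaligned store copies the array as-is. */
void load_store_roundtrip(const float in[4], float out[4])
{
    _mm_storeu_ps(out, _mm_loadu_ps(in));
}
```

Both store_set_ps and store_setr_ps leave the array as 0, 1, 2, 3, so only the
argument order of the set intrinsic differs, not the layout in memory.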
Currently SSE4.1 is only detected by using __get_cpuid, i.e. when building with
GCC. Windows' IsProcessorFeaturePresent does not report SSE4.1 capabilities.
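For reference, a check along these lines could look roughly like the sketch
below. It is not the project's actual detection code: cpu_has_sse41 and
HAVE_GET_CPUID are names invented here, and the fallback simply reports no
SSE4.1 when __get_cpuid is unavailable.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>                /* provides __get_cpuid on GCC-compatible compilers */
#define HAVE_GET_CPUID 1          /* hypothetical macro for this example */
#endif

/* Returns 1 if CPUID reports SSE4.1, 0 if not or if the check is unavailable. */
int cpu_has_sse41(void)
{
#ifdef HAVE_GET_CPUID
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1 holds the feature flags; __get_cpuid returns 0 if it is unsupported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (ecx & (1u << 19)) != 0;   /* ECX bit 19 is the SSE4.1 flag */
#else
    return 0;                         /* no detection path on this compiler */
#endif
}
```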