Turns out the C version of the cubic resampler is just slightly faster than
even the SSE3 version of the FIR4 resampler. This is likely because the cubic
resampler avoids the 64KB random-access lookup table and the unaligned loads
that FIR4 relies on, both of which offset the gains from SSE.
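
For illustration, here is a minimal sketch of a table-free 4-point cubic
interpolator of the kind described above. It assumes a Catmull-Rom spline,
which may not match the coefficients the actual resampler uses. The name
cubic_sketch and its parameters are hypothetical.

```c
/* Hypothetical sketch: 4-point Catmull-Rom cubic interpolation.
 * v0..v3 are consecutive input samples; mu in [0,1) is the fractional
 * position between v1 and v2. The coefficients are computed directly
 * from the samples, so no lookup table is needed. */
float cubic_sketch(float v0, float v1, float v2, float v3, float mu)
{
    float a0 = -0.5f*v0 + 1.5f*v1 - 1.5f*v2 + 0.5f*v3;
    float a1 =       v0 - 2.5f*v1 + 2.0f*v2 - 0.5f*v3;
    float a2 = -0.5f*v0            + 0.5f*v2;
    float a3 =                  v1;

    /* Horner evaluation of a0*mu^3 + a1*mu^2 + a2*mu + a3. */
    return ((a0*mu + a1)*mu + a2)*mu + a3;
}
```

At mu = 0 the result is v1 and at mu = 1 it is v2, so consecutive output
segments join without a gap.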
This is essentially a 12-point sinc resampler, unless the source rate is
higher than the output rate, at which point it will vary between 12 and 24
points and apply anti-aliasing to avoid/reduce frequencies going over Nyquist.
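
As a rough illustration of the 12-to-24 point range described above, the
sketch below picks a point count from the two rates. The proportional growth
rule, the rounding to an even count, and the name sinc_points_sketch are all
assumptions made for this example. The actual resampler may select its filter
differently.

```c
/* Hypothetical sketch: choose how many sinc points to use.
 * When the source rate does not exceed the output rate, 12 points are
 * used. Otherwise the count grows in proportion to the rate ratio
 * (an assumed rule) and is capped at 24. */
int sinc_points_sketch(unsigned int src_rate, unsigned int dst_rate)
{
    unsigned long long points;

    if (src_rate <= dst_rate)
        return 12;

    /* ceil(12 * src_rate / dst_rate) using integer arithmetic. */
    points = (12ULL * src_rate + dst_rate - 1) / dst_rate;
    points += points & 1;            /* round up to an even count */
    return points > 24 ? 24 : (int)points;
}
```

With these assumptions, 48000 Hz to 44100 Hz uses 14 points, while 96000 Hz to
48000 Hz and anything steeper uses the full 24.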
Code provided by Christopher Fitzgerald.
SSE uses reverse ordering, such that component 0 is the last in memory.
_mm_load_* and _mm_loadu_*, and the corresponding stores, do not change the
memory ordering.
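
One way to see the ordering note above in practice is the small sketch below,
which assumes an x86 target with SSE available. _mm_set_ps takes its arguments
from the highest element down, so its first argument ends up last in memory,
while an unaligned load followed by an unaligned store leaves the array order
unchanged. The helper names are hypothetical.

```c
#include <xmmintrin.h>

/* Build a vector with _mm_set_ps and write it out. The arguments are
 * given highest element first, so out[] receives {0, 1, 2, 3}. */
void set_then_store_sketch(float out[4])
{
    __m128 v = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    _mm_storeu_ps(out, v);
}

/* Load four floats and store them again. Neither call reorders
 * anything, so out[i] == in[i] for every i. */
void load_then_store_sketch(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, v);
}
```

If in-memory argument order is preferred when building a vector, _mm_setr_ps
takes the same four values in the opposite order.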