Avoid masking the index on each iteration, and instead process up to the point
where the mask would wrap it. This allows for better optimizations, in
particular fewer instructions and better chances for vectorization.
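
A minimal sketch of the idea, assuming a power-of-two ring buffer; ring_read
and its parameters are hypothetical names for illustration, not the actual
code. The inner loop runs over a contiguous span with no per-sample masking:

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Copy `count` samples out of a ring buffer of size mask+1, starting at
 * `offset`. The index is masked once per contiguous run, not once per
 * sample, so the inner loop is a plain linear copy. */
void ring_read(float *dst, const float *ring, size_t mask, size_t offset,
               size_t count)
{
    size_t done = 0;
    while(done < count)
    {
        offset &= mask;
        /* Samples available before the index would wrap. */
        size_t run = min_size(count - done, mask + 1 - offset);
        for(size_t i = 0; i < run; i++)
            dst[done + i] = ring[offset + i];
        done += run;
        offset += run;
    }
}
```
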
Rather than continuously wrapping the buffer index on use, each update reads
the buffer from the front and copies the tail back to the front at the end.
This allows for more efficient accesses in loops.
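
As an illustration only, here is a sketch of the front-loaded history scheme
using a simple moving sum. The History type, the HISTORY and MAX_BLOCK sizes,
and moving_sum are assumptions made for this example:

```c
#include <stddef.h>
#include <string.h>

#define HISTORY   3   /* hypothetical: past samples the filter needs */
#define MAX_BLOCK 64  /* hypothetical: largest block per update */

typedef struct {
    float buf[HISTORY + MAX_BLOCK]; /* zero-initialize before first use */
} History;

/* Sum of each input sample and the HISTORY samples before it.
 * `count` must not exceed MAX_BLOCK. */
void moving_sum(History *st, float *out, const float *in, size_t count)
{
    /* New input goes right after the saved history at the front. */
    memcpy(st->buf + HISTORY, in, count * sizeof(float));
    for(size_t i = 0; i < count; i++)
    {
        float sum = 0.0f;
        for(size_t j = 0; j <= HISTORY; j++)
            sum += st->buf[i + j]; /* linear access, no wrapping */
        out[i] = sum;
    }
    /* Copy the tail to the front for the next update. The regions can
     * overlap when count < HISTORY, hence memmove. */
    memmove(st->buf, st->buf + count, HISTORY * sizeof(float));
}
```
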
This provides better characteristics for an amplitude limiter. In particular,
it uses the peak amplitude instead of the RMS, and the parameters used all but
guarantee that no output sample exceeds the given threshold. The guarantee is
not exact because of floating-point error: the threshold is converted from dB
to log-e for the envelope, then negated and converted to linear amplitude to
apply to the signal. Rounding errors can creep in along the way, so the result
may not saturate perfectly.
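
To show roughly where the rounding comes from, assume an idealized limiter
with instant attack and infinite ratio (an assumption for this sketch; the
real envelope has more parameters). For a threshold T_dB and a peak amplitude
p above it:

```latex
\begin{align*}
T_{\ln} &= T_{\mathrm{dB}} \cdot \frac{\ln 10}{20} \\
g &= \exp\bigl(-(\ln p - T_{\ln})\bigr) \\
y &= g \cdot p = \exp(T_{\ln}) = 10^{T_{\mathrm{dB}}/20}
\end{align*}
```

In exact arithmetic the output peak y lands exactly on the threshold. In
floating point, the log, exp, and multiply steps each round, so y can end up
slightly above it.
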
The offsets and coefficients are controlled by a relatively small set of input
parameters, just with different base constants or different calculations. This
led to numerous redundant checks, since if one value didn't change, the others
that use the same inputs wouldn't have either.
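
One way to avoid the per-value checks is to compare the inputs once and then
recompute everything derived from them. The sketch below is hypothetical:
EffectState, its fields, and the calculations are made up for illustration
and are not the actual effect code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical effect state: a few inputs drive several derived values. */
typedef struct {
    float    delay_sec; /* last applied input */
    float    damping;   /* last applied input */
    unsigned rate;      /* last applied sample rate */
    size_t   tap[2];    /* derived from delay_sec and rate */
    float    coeff[2];  /* derived from damping */
} EffectState;

/* Returns true if the derived values were recomputed. */
bool update_params(EffectState *st, float delay_sec, float damping,
                   unsigned rate)
{
    /* A single check on the inputs replaces one check per derived value. */
    if(st->delay_sec == delay_sec && st->damping == damping
        && st->rate == rate)
        return false;

    st->delay_sec = delay_sec;
    st->damping = damping;
    st->rate = rate;
    st->tap[0] = (size_t)(delay_sec * (float)rate);
    st->tap[1] = st->tap[0] * 2;
    st->coeff[0] = damping;
    st->coeff[1] = damping * 0.5f;
    return true;
}
```
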
For playback, increment the ring buffer's write pointer before queueing audio,
to handle cases where the callback is invoked, advancing the read pointer,
before the write pointer is advanced.
For capture, limit the number of re-queued chunks to the number of fully read
chunks.
In particular, when a buffer list item was completed, the source sample
position was reduced by the size of the next item rather than the size of
the one just completed.
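
A small sketch of the corrected bookkeeping, using a hypothetical BufferItem
list and advance_position helper (neither is the real structure or function):

```c
#include <stddef.h>

/* Hypothetical buffer list item. */
typedef struct BufferItem {
    size_t length;                 /* samples in this item */
    const struct BufferItem *next;
} BufferItem;

/* Skip every fully played item. On return, *pos is the offset within the
 * returned item (or the leftover amount if the list ran out). */
const BufferItem *advance_position(const BufferItem *item, size_t *pos)
{
    while(item && *pos >= item->length)
    {
        /* Subtract the size of the item just completed, and only then
         * move to the next one. Subtracting item->next->length here
         * would be the mistake described above. */
        *pos -= item->length;
        item = item->next;
    }
    return item;
}
```
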
It's not an issue for the final mix, but if one loop has an unaligned count,
the next loop will have unaligned input and output buffer targets, which can
crash the SSE mixers.
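
One possible way to keep intermediate counts aligned, assuming 4-float SSE
vectors, is sketched below. The helper next_mix_count is hypothetical, and
rounding down is an assumption about how the alignment is enforced:

```c
#include <stddef.h>

/* Pick the sample count for one pass of a mixing loop. */
size_t next_mix_count(size_t remaining, size_t max_chunk)
{
    size_t todo = remaining < max_chunk ? remaining : max_chunk;
    /* If more passes follow, round down to a multiple of 4 so the next
     * pass starts on an aligned offset. The final pass may be any size. */
    if(todo < remaining && todo >= 4)
        todo &= ~(size_t)3;
    return todo;
}
```

With a starting buffer that is 16-byte aligned, every pass except the last
then advances by a multiple of four floats, so later passes keep that
alignment.
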