And yes, with the power of the ARM chip, per-sample processing would make possible a number of near-zero-latency effects (1 sample, to be exact, so not zero, but only a fraction of a millisecond), which is very enticing. But that takes us outside the scope of the current audio library, I know. Also, a lot of DSP methods are unusable per-sample, so audio blocks would still be required, and processing audio blocks takes time, so this gets complicated.
.....
I just thought that what I'm talking about is achievable within the bounds of the current lib.
I do understand what you're getting at.
Per-sample processing, or even variable or mixed-size block processing, is outside the scope of this library.
Developing this library is already quite challenging. I'm not going to add complexity to an already difficult project by deviating from the fixed block size approach. I believe the roughly 40 year history of modern software has shown that restricting the scope of a design to a highly uniform structure is a good strategy. But it is a path that requires being able to say "no" to certain features.
The current structure was not picked arbitrarily. I've been working on this library for many months, and in the early days a LOT of time and effort went into investigating the feasibility of doing many types of projects people want with the many design trade-offs. A decision was made, and I'm sticking with it.
Also, can anybody point out how the mixer "mixes" two samples? What formula is used?
It's 16 bit saturating addition. Here's the relevant code from the applyGainThenAdd() function:
Code:
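if (mult == 65536) {
    // unity gain: plain saturating add, two 32 bit words per loop pass
    do {
        uint32_t tmp32 = *dst;  // 2 samples from the destination buffer
        *dst++ = signed_add_16_and_16(tmp32, *src++);
        tmp32 = *dst;
        *dst++ = signed_add_16_and_16(tmp32, *src++);
    } while (dst < end);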
Inside the loop "tmp32 = *dst" reads two audio samples at once (from the "dst" buffer) into a 32 bit temporary variable. Then the next line reads two more audio samples from the "src" buffer. The signed_add_16_and_16() function is merely a QADD16 instruction, which you can find documented here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473j/dom1361289886623.html
It simply adds both 16 bit halves. Each addition is done with "saturation", which is basically the same as clipping in analog circuitry: if the two numbers sum to a value outside the 16 bit audio range, the result is held at the limit instead of wrapping around. The 32 bits are stored back into the "dst" buffer, and the pointer is incremented to the next 32 bits. This is done twice in the loop, so the overhead of checking for the end of the buffer is cut in half.
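For anyone who wants to try the arithmetic on a PC, here is a rough portable sketch of what a QADD16-style add does to one packed 32 bit word. The names saturate16() and qadd16_emulated() are made up for this post and are not part of the library; the real code uses the single hardware instruction.

```c
#include <stdint.h>

/* Clamp a 32 bit intermediate result to the signed 16 bit audio range. */
static int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical portable stand-in for QADD16: the low halves and the high
 * halves of the two words are added independently, each with saturation.
 * The casts to int16_t assume the usual two's complement wraparound. */
static uint32_t qadd16_emulated(uint32_t a, uint32_t b)
{
    int16_t lo = saturate16((int32_t)(int16_t)(a & 0xFFFF) + (int16_t)(b & 0xFFFF));
    int16_t hi = saturate16((int32_t)(int16_t)(a >> 16) + (int16_t)(b >> 16));
    /* Pack the two 16 bit results back into one 32 bit word. */
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

Adding 1 to a sample that is already at 32767 leaves it at 32767 rather than wrapping around to -32768, which is the clipping behaviour described above.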
The mixer was one of the first objects I wrote. I've since learned it's even faster to read 8 audio samples into variables. Even though the code is the same, the Cortex-M4 processor uses a special faster burst mode to access the RAM if you read consecutive 32 bit words back-to-back. So eventually I'll rewrite this to use 4 temporary variables, but the code will be pretty much identical, just 4 copies inside the loop instead of 2, and the operations rearranged.
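As a rough illustration of the "4 temporary variables" idea, the loop might be rearranged something like this. This is only a sketch, not the planned library code: add2x16_sat() and mix_unity_unrolled() are made-up names, the helper is a portable stand-in for QADD16, and whether the compiler actually emits back-to-back loads depends on the toolchain and optimization settings.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for one QADD16 instruction. */
static uint32_t add2x16_sat(uint32_t a, uint32_t b)
{
    int32_t lo = (int16_t)(a & 0xFFFF) + (int16_t)(b & 0xFFFF);
    int32_t hi = (int16_t)(a >> 16) + (int16_t)(b >> 16);
    if (lo > 32767) lo = 32767; else if (lo < -32768) lo = -32768;
    if (hi > 32767) hi = 32767; else if (hi < -32768) hi = -32768;
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Mix src into dst at unity gain, 8 samples (4 words) per loop pass.
 * Assumes `words` is a non-zero multiple of 4. */
static void mix_unity_unrolled(uint32_t *dst, const uint32_t *src, unsigned words)
{
    const uint32_t *end = dst + words;
    do {
        /* Group the reads so consecutive words are loaded back-to-back. */
        uint32_t d0 = dst[0], d1 = dst[1], d2 = dst[2], d3 = dst[3];
        uint32_t s0 = src[0], s1 = src[1], s2 = src[2], s3 = src[3];
        dst[0] = add2x16_sat(d0, s0);
        dst[1] = add2x16_sat(d1, s1);
        dst[2] = add2x16_sat(d2, s2);
        dst[3] = add2x16_sat(d3, s3);
        dst += 4;
        src += 4;
    } while (dst < end);
}
```

The end-of-buffer check now runs once per 8 samples instead of once per 4.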
You might notice the check if "mult" is 65536, which represents unity gain for the mixer channel. If it's some other number, a slightly more complex version is used, where the 2 samples are multiplied by the gain, then right shifted so a multiply by 65536 becomes a multiply by 1. The 2 multiplied values are packed back into a single 32 bit variable, and then the same QADD16 instruction is used to add them. This code might also benefit from loop unrolling a bit.... eventually I'll go through most of these objects and do more optimization. For now, the goal is to get things simply working well with enough speed to be usable.
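The non-unity case can be sketched in portable C roughly as follows. This is an approximation of the steps described above, not the library's actual code: sat16(), apply_gain() and gain_then_add() are made-up names, and saturating the scaled sample before packing is an assumption on my part. The real version would use the Cortex-M4 DSP instructions rather than 64 bit math.

```c
#include <stdint.h>

/* Clamp to the signed 16 bit audio range. */
static int16_t sat16(int64_t x)
{
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Scale one sample by a fixed-point gain where 65536 means 1.0.
 * Assumes >> on a negative value is an arithmetic shift, as on ARM GCC. */
static int16_t apply_gain(int16_t sample, int32_t mult)
{
    return sat16(((int64_t)mult * sample) >> 16);
}

/* Apply the gain to both samples packed in src_word, then add them to the
 * two samples packed in dst_word with 16 bit saturation. */
static uint32_t gain_then_add(uint32_t dst_word, uint32_t src_word, int32_t mult)
{
    int16_t s_lo = apply_gain((int16_t)(src_word & 0xFFFF), mult);
    int16_t s_hi = apply_gain((int16_t)(src_word >> 16), mult);
    int16_t lo = sat16((int32_t)(int16_t)(dst_word & 0xFFFF) + s_lo);
    int16_t hi = sat16((int32_t)(int16_t)(dst_word >> 16) + s_hi);
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

With mult set to 32768 (a gain of 0.5), a pair of samples at 100 comes out as a pair at 50 before being added to the destination.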