Coding audio effects, questions about emulation, synth_dc

Status
Not open for further replies.
I'm getting into writing audio effects for Teensy, and have been reading through the source code for many of the effects already in the library. I'm ok when it comes to much of the maths as I've done DSP stuff before on regular computers, but I've never had to optimise them much and could rely on floats for the most part.

Synth_dc.cpp (https://github.com/PaulStoffregen/Audio/blob/master/synth_dc.cpp) had some things in it that I didn't quite get.

Line 42:
Code:
val = pack_16t_16t(magnitude, magnitude);
Why is this done? I've read through https://github.com/PaulStoffregen/Audio/blob/master/utility/dspinst.h and understand programmatically what it does, but I don't know why you would do it.

Line 43:
Code:
do {
  *p++ = val;
  *p++ = val;
  *p++ = val;
  *p++ = val;
  *p++ = val;
  *p++ = val;
  *p++ = val;
  *p++ = val;
} while (p < end);

Why is this done in blocks of 8? Why not 1, or 16 for example?

Line 60:
Code:
magnitude += increment;
t1 = magnitude;
magnitude += increment;
t1 = pack_16t_16t(magnitude, t1);

I guess I would have expected a simple magnitude += increment. Why do it twice and pack the top halves of each 32 bit value together like that?

Also, is there a way to emulate the Teensy on a desktop or laptop based PC when developing audio effects? I feel like this might be easier to troubleshoot and log data. Or if there's a neat way of setting up value logging through Serial and somehow digesting hundreds of values then that might be helpful. It's useful to graph things sometimes.

Also also, if you have any reading material recommendations on fixed point optimisation / math techniques that might be helpful. Reading through dspinst.h made my head hurt a little!
 
Line 42:
Code:
val = pack_16t_16t(magnitude, magnitude);
Why is this done?

The chip is a 32 bit processor, with 32 bit bus to the memory.

Packing two 16 bit samples into a single 32 bit register allows both to be written to the memory in a single bus cycle.
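
A portable sketch of the packing described above, for experimenting on a desktop. It assumes, based on the description in this thread, that pack_16t_16t keeps the top 16 bits of its first argument as the upper half of the result and moves the top 16 bits of its second argument into the lower half. The names pack_top_halves, low_sample and high_sample are made up for this example and are not part of the library.

```cpp
#include <cstdint>

// Assumed behaviour of pack_16t_16t, written without inline assembly:
// upper half of the result = top 16 bits of a,
// lower half of the result = top 16 bits of b.
constexpr uint32_t pack_top_halves(int32_t a, int32_t b) {
    return (static_cast<uint32_t>(a) & 0xFFFF0000u) |
           (static_cast<uint32_t>(b) >> 16);
}

// Read the two signed 16 bit samples back out of a packed word.
constexpr int16_t low_sample(uint32_t packed) {
    return static_cast<int16_t>(packed & 0xFFFFu);
}

constexpr int16_t high_sample(uint32_t packed) {
    return static_cast<int16_t>(packed >> 16);
}
```

With both arguments equal, as in the synth_dc.cpp line quoted above, the two halves hold the same sample, so one 32 bit store writes two identical 16 bit samples.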

Why is this done in blocks of 8? Why not 1, or 16 for example?

Cortex-M4 has a special hardware optimization for back-to-back memory operations.

Normally, each load or store takes 2 cycles. But when you do them back-to-back, only the first takes 2 cycles. The rest are performed in 1 cycle. So in this case, the 8 writes store 16 packed samples in only 9 clock cycles. On average, audio write speed is 0.56 cycles/sample.
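
The loop from the question can be tried on a desktop with a sketch like the one below. It only shows the structure (eight packed stores per pass, 16 samples per pass); a PC will not reproduce the Cortex-M4 store timing. The block size of 128 samples comes from this thread, and fill_block and kBlockSamples are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSamples = 128;  // assumed block size, in 16 bit samples

// Fill a whole block with one constant sample by writing packed 32 bit
// words. Each pass of the loop performs eight back-to-back stores.
inline void fill_block(uint32_t (&words)[kBlockSamples / 2], int16_t sample) {
    const uint32_t half = static_cast<uint16_t>(sample);
    const uint32_t val = (half << 16) | half;  // same sample in both halves
    uint32_t *p = words;
    uint32_t *end = p + kBlockSamples / 2;
    do {
        *p++ = val;
        *p++ = val;
        *p++ = val;
        *p++ = val;
        *p++ = val;
        *p++ = val;
        *p++ = val;
        *p++ = val;
    } while (p < end);
}
```

The loop relies on the word count being a multiple of 8, which holds for 64 words.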

Expanding this to 16 would store 32 samples in 17 cycles, for a speedup to only 0.53 cycles/sample. There are 2 reasons this wasn't done. First, going from 0.56 to 0.53 is a pretty diminishing return for the larger code size. Second, there's a sort-of unwritten rule in the library not to process more than 16 samples at once. Today the audio block size is 128 samples. Limiting to 16 means it can (in theory) be adjusted by increments of 16. Of course, plenty of other places in the library depend on, or make assumptions about, the 128 sample size. Still, it's nice to keep the processing to 1, 2, 4, 8 or 16 samples and keep the code flexible for future changes to the block size.
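
The arithmetic behind those figures, as a sketch. It assumes the simple cost model given in this reply (2 cycles for the first store in a run, 1 cycle for each store after it, 2 samples per store) and ignores loop overhead. The function names are made up for this example.

```cpp
// Cycles for a run of back-to-back stores under the assumed model:
// the first store costs 2 cycles, each further store costs 1.
constexpr int run_cycles(int stores) {
    return stores + 1;
}

// Each 32 bit store carries two packed 16 bit samples.
constexpr double cycles_per_sample(int stores) {
    return static_cast<double>(run_cycles(stores)) / (2.0 * stores);
}

//  1 store:   2 cycles /  2 samples = 1.0
//  8 stores:  9 cycles / 16 samples = 0.5625
// 16 stores: 17 cycles / 32 samples = 0.53125
```

Under this model the cost approaches 0.5 cycles/sample as the run gets longer, so most of the gain is already there at 8 stores.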

Line 60:
Code:
magnitude += increment;
t1 = magnitude;
magnitude += increment;
t1 = pack_16t_16t(magnitude, t1);

I guess I would have expected a simple magnitude += increment. Why do it twice and pack the top halves of each 32 bit value together like that?

Same as above, and throughout the library, 16 bit samples are packed into 32 bit registers, mainly to cut the number of slow memory accesses in half.
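
A desktop sketch of the ramp step quoted above. It assumes magnitude holds the sample in its top 16 bits with 16 fractional bits below, and that the earlier of the two samples goes into the lower half of the packed word (on a little-endian core that is the lower address, so it comes first in the buffer). ramp_packed is a name made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Advance magnitude twice per packed word and store the top 16 bits of
// both values in a single 32 bit write. Returns the final magnitude.
// The caller must keep magnitude within int32_t range.
inline int32_t ramp_packed(uint32_t *words, std::size_t word_count,
                           int32_t magnitude, int32_t increment) {
    for (std::size_t i = 0; i < word_count; ++i) {
        magnitude += increment;
        const uint32_t first = static_cast<uint32_t>(magnitude) >> 16;
        magnitude += increment;
        const uint32_t second = static_cast<uint32_t>(magnitude) & 0xFFFF0000u;
        words[i] = second | first;  // later sample on top, earlier below
    }
    return magnitude;
}
```

The two additions are still needed because every sample gets its own value; only the store is shared.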

Many of the special multiply instructions can take either half of a packed register as an input, which is another huge incentive to pack pairs of samples into 32 bit registers.
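
A portable sketch of what "either half as an input" means. On the Cortex-M4 each of these combinations is a single instruction from the SMULBB/SMULBT/SMULTB/SMULTT family; the functions below only imitate the result in plain C++ and their names are made up for this example, not taken from dspinst.h.

```cpp
#include <cstdint>

// Extract one signed 16 bit half of a packed word, widened to 32 bits.
constexpr int32_t top_half(uint32_t packed) {
    return static_cast<int16_t>(packed >> 16);
}

constexpr int32_t bottom_half(uint32_t packed) {
    return static_cast<int16_t>(packed & 0xFFFFu);
}

// Signed 16 x 16 -> 32 bit multiplies that pick a half from each operand,
// so no separate unpacking step is needed before the multiply.
constexpr int32_t mul_top_top(uint32_t a, uint32_t b) {
    return top_half(a) * top_half(b);
}

constexpr int32_t mul_top_bottom(uint32_t a, uint32_t b) {
    return top_half(a) * bottom_half(b);
}

constexpr int32_t mul_bottom_top(uint32_t a, uint32_t b) {
    return bottom_half(a) * top_half(b);
}

constexpr int32_t mul_bottom_bottom(uint32_t a, uint32_t b) {
    return bottom_half(a) * bottom_half(b);
}
```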

The processor has 16 registers, 3 of which are reserved, and only 8 of which work with many instructions. That imposes a practical limit to what you can do efficiently. So packing 2 samples per register also doubles the number of input or output samples you can work with in the limited register space, which is the third incentive to pack data this way.


Also, is there a way to emulate the Teensy on a desktop or laptop based PC when developing audio effects? I feel like this might be easier to troubleshoot and log data. Or if there's a neat way of setting up value logging through Serial and somehow digesting hundreds of values then that might be helpful. It's useful to graph things sometimes.

As far as I know, there's no simple way to get a good emulation.

For much of the library code, I've written test programs in C on Linux or scripts in languages like Perl, just to test the algorithm. Usually I just print numbers to stdout, or save data to binary files that can be imported and played by Audacity. While that's pretty primitive compared to a full emulation system that streams data, it's served me pretty well. Especially when fiddling with the packed data and other optimizations, it's really helpful to have reference data for comparison.
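
A minimal sketch of that kind of desktop test program: generate samples, then dump them as headerless 16 bit PCM that Audacity can open through its raw data import. The sine generator stands in for whatever algorithm is under test, and make_sine and write_raw_pcm are names made up for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Generate `count` samples of a sine wave as signed 16 bit values.
std::vector<int16_t> make_sine(double freq_hz, double sample_rate_hz,
                               std::size_t count, double amplitude) {
    const double two_pi = 6.283185307179586;
    std::vector<int16_t> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const double phase =
            two_pi * freq_hz * static_cast<double>(i) / sample_rate_hz;
        out[i] = static_cast<int16_t>(std::lround(amplitude * std::sin(phase)));
    }
    return out;
}

// Write the samples as headerless PCM in the host's byte order.
// Returns true if the file was opened, fully written and closed.
bool write_raw_pcm(const char *path, const std::vector<int16_t> &samples) {
    std::FILE *f = std::fopen(path, "wb");
    if (!f) {
        return false;
    }
    const std::size_t written =
        std::fwrite(samples.data(), sizeof(int16_t), samples.size(), f);
    const bool closed_ok = (std::fclose(f) == 0);
    return closed_ok && written == samples.size();
}
```

For example, write_raw_pcm("tone.raw", make_sine(440.0, 44100.0, 44100, 16000.0)) writes one second of tone. When importing, choose signed 16 bit PCM, mono, and the byte order of the machine that wrote the file (little-endian on a typical PC). The 44100 Hz rate is an assumption for the example; use whatever rate the algorithm is meant to run at.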

Also also, if you have any reading material recommendations on fixed point optimisation / math techniques that might be helpful. Reading through dspinst.h made my head hurt a little!

This book has probably the most approachable description of the processor features. But it's still far from a light & easy read.

http://www.amazon.com/dp/B00G9856GU/

About general DSP, there's tons of references. Many tend to be heavy with very abstract math and light on implementation details. Sadly, there seems to be a terrible lack of books or approachable info about fixed point techniques.
 
Amazing reply, thank you so much. Pushing 2 16bit samples through at a time makes sense, knowing that gives context to many of the functions in dspinst.h.
 