Audio Processing Benchmarking on Teensy 3.2 and K66 (Teensy 3.6)

chipaudette · Sep 8, 2016

Hi All,

I'm working on some audio processing projects and I'm trying to choose a good embedded platform for doing some DSP. But, I also want to be able to program it in a relatively simple way. I'm coming from Arduino and I've done a few projects with Teensy because of the Teensyduino bridge. For me and my fellow audio hackers, I think that some version of Teensy would work well. But is it fast enough?

I've started to do a bunch of audio processing benchmarks on a range of boards: a few Arduino boards, a Maple, a Teensy 3.2, and a K66 board to approximate the results that we should see with the upcoming Teensy 3.6.

Right now, I've completed the testing for FIR filtering. My FIR filter is just a naive C implementation for now; follow-on tests using the CMSIS DSP library will come later. With my naive C implementation, I've tested each of the boards using a range of filter lengths. I've also tested using Int32 and Float32.
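
For readers who want a concrete picture, a naive Float32 FIR filter of the kind described here might look like the sketch below. The name fir_naive_f32 and its signature are illustrative assumptions, not the code used in the benchmark, and samples before the start of the input are simply treated as zero.

```c
#include <stddef.h>

/* Naive direct-form FIR: y[n] = sum over k of h[k] * x[n - k].
 * Hypothetical helper for illustration; input samples before
 * x[0] are treated as zero. */
void fir_naive_f32(const float *x, float *y, size_t nsamples,
                   const float *h, size_t ntaps)
{
    for (size_t n = 0; n < nsamples; n++) {
        float acc = 0.0f;
        /* Stop at k == n so that x[n - k] never reads before x[0]. */
        for (size_t k = 0; k < ntaps && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

The cost is one multiply-add per tap per sample, so the time per sample should grow roughly linearly with filter length. An Int32 version would have the same structure with an integer accumulator.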

If you're interested in the results, check out the link below. Note that the K66 board (ie, Teensy 3.6) is 25x faster than Teensy 3.2 using floats. Wow!

https://openaudio.blogspot.com/2016/09/benchmarking-fir-filtering.html

My next step is to write up my results for my FFT/IFFT benchmarking. Not surprisingly, they follow a pattern similar to these FIR results.

Has anyone done similar benchmarking comparing Teensy to a "real" DSP platform (like the TI C5xxx or TI C6xxx series)?

Chip

Frank B · Sep 8, 2016

The Audiolibrary (did you try it?) uses the DSP-Extensions, which are even faster. It has filtering and fft, too.

PaulStoffregen · Sep 8, 2016

Yes, use the audio lib. Leveraging DMA to bring data in and out saves a lot of CPU time, allowing everything else to run better.

chipaudette · Sep 9, 2016

Thanks so much for the comments.

FrankB, I did not (yet) use the DSP extensions because not all of my platforms have DSP extensions. I was looking to do a straight apples-to-apples comparison across units with the software being as similar as possible. Along those lines, the AudioLibrary was not available for all of my platforms (e.g. the FRDM-K66F), so I stayed away from it on this first test. For my upcoming post on FFT speed, I manually invoked the CMSIS fft functions (ie, not through the AudioLibrary) along with a generic C FFT, so we'll get to see the impact of the DSP acceleration.

Paul, thanks for the reminder about the AudioLibrary using DMA. My tests to date do not include the ADC/DAC pathway to get the audio into and out of the hardware. My tests were purely about the computations. If I end up using the Teensy for my projects (as seems likely given the K66F results), I'll be sure to use the DMA (maybe through the library, maybe not...depends on what hacking I feel like doing).

As a different question, can the AudioLibrary's processing functions (not necessarily the DMA and stuff) be used in other platforms besides the Teensy?

Chip

PaulStoffregen · Sep 9, 2016

You should definitely check out the DSP extensions, if your samples are 16 bits. For FFT it makes a pretty incredible improvement.

chipaudette · Sep 14, 2016

I've now done my benchmarking tests for FFT instead of for FIR. I compare Arduino to Maple to Teensy 3.2 to an NXP K66 board. I also try generic C versus the CMSIS DSP routines. If you're interested, you can check out my discussion here:

http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html

The K66 board absolutely screams, especially on Float32 data. Wow. The Teensy 3.6 (which uses the K66) is gonna be great for audio processing.

Chip

chipaudette · Oct 8, 2016

I just got my Teensy 3.5 and 3.6 (thanks Paul!) so I ran my FFT benchmarks on the new boards. They're so fast!

For discussion and results, see: https://openaudio.blogspot.com/2016/10/benchmarking-teensy-36-is-fast.html

I love the speed of the new boards on floating-point data. I even found that the Teensy 3.6 is a little faster than the NXP FRDM-K66F board, which uses the same CPU. That result was a little surprising...perhaps it's a difference in the compiler settings under the hood? Or perhaps it's a difference in the version of the CMSIS FFT library? I don't know.

Regardless, the Teensy 3.6 screams. I'm so looking forward to using it in my audio projects.

Chip

chipaudette · Feb 8, 2017

In continuing to work with audio on the Teensy 3.6, I found that some operations are really fast (arithmetic, FIR, FFT) while others are surprisingly slow (sqrt, exp, log, pow, log10). It turns out that, if you call the floating-point specific form of the functions (sqrtf, expf, logf, powf, log10f), you get *way* faster performance...

In my tests, sqrtf() is 30x faster. expf(), logf(), powf(), and log10f() are all accelerated by 10x. This is just a tremendous difference. It pays to be explicit.

More details (and code!) are here: http://openaudio.blogspot.com/2017/02/for-speedy-float-math-specify-float.html

Why doesn't the compiler automatically substitute the "f" version when I give it floating-point arguments?

Chip

Theremingenieur · Feb 8, 2017

The non "f" functions deal with 64bit floats (double precision) for which there isn't any FPU support in the M4F core, thus, everything is done in software which is s-l-o-w... The "f" functions deal with 32bit floats for which there is hardware FPU support, thus, logically, they are much more efficient (but less precise).

That's why the compiler leaves the choice to the developer, who can trade speed for precision or vice versa, and does no automatic substitution.

PaulStoffregen · Feb 9, 2017

It's standard C library API that the "f" versions are 32 bit float and the ones without are 64 bit double.

PaulStoffregen · Feb 9, 2017

If you ever redo this test, I hope you'll include exp2f(). It's particularly important for some algorithms.

In the audio library's state variable filter, I implemented an integer only polynomial approximation of exp2. The code is here:

https://github.com/PaulStoffregen/Audio/blob/master/filter_variable.cpp#L110

Actually, as you can see in the code, I implemented 2 different exp2 algorithms, both using the Cortex-M4 DSP extensions.

Later this year I'm planning to make many more things in the library have "control voltage" inputs. Currently only this filter, the multiplier (variable gain) and FM sine wave have signal control. In many cases you want control signals to act as a log scale, so this or some other similar fast algorithm is needed.

Synthetech · Feb 9, 2017

Interesting...
So are you saying the three objects mentioned have a logarithmic response to a linear input? (I hope I have the terms right.)

If I understand correctly, they will respond like an audio-taper pot?

PaulStoffregen · Feb 10, 2017

The filter definitely does. Linear change in the control signal is mapped onto a user-defined number of octaves. Like every object in the audio library, it's documented in the design tool. Here's a link:

https://www.pjrc.com/teensy/gui/?info=AudioFilterStateVariable

Look at the octaveControl() function and the notes section.

The multiplier (variable gain) is definitely linear!

The FM sine wave varies a factor of 2. This is on my TODO list to make configurable.....

chipaudette · Feb 15, 2017

Since I need to do lots of conversions into dB space and back out into linear space, I need to do calls to log10f(x) and powf(10,x) at audio rates. I wasn't satisfied with the speed of the standard function calls. So, I reformulated them and increased their speed by 3x. Not bad!

My substitution for powf(10.0,x) is exact, which is nice. My substitution for log10f, however, uses an approximation for log2. It's a good approximation, but it's not exact. Regardless, being 3x faster is pretty good.

Code:

// powf(10.0f, x) is exactly expf(logf(10.0f) * x)
#define pow10f(x) expf(2.302585092994046f * (x))

// log10f(x) is exactly log2f(x) / log2f(10.0f); the constant is 1/log2f(10.0f)
#define log10f_fast(x)  (log2f_approx(x) * 0.3010299956639812f)

More info (and the log2 approximation code) is at: http://openaudio.blogspot.com/2017/02/faster-log10-and-pow.html

Also, in my next round of work on this, I'll definitely look at exp2f() as suggested above.

Thanks for your interest!

Chip

Theremingenieur · Feb 16, 2017

I'm very interested in that, too! I need to do wave shaping by non-linear filtering in real time (ca. 192 kS/s), so I'm eagerly awaiting your exp2f() results and optimizations.

PaulStoffregen · Feb 16, 2017

I'm really curious to know how the float library compares with these integer-only algorithms, from the state variable filter.

Code:

                // exp2 polynomial suggested by Stefan Stenzel on "music-dsp"
                // mail list, Wed, 3 Sep 2014 10:08:55 +0200
                int32_t x = n << 3;
                n = multiply_accumulate_32x32_rshift32_rounded(536870912, x, 1494202713);
                int32_t sq = multiply_32x32_rshift32_rounded(x, x);
                n = multiply_accumulate_32x32_rshift32_rounded(n, sq, 1934101615);
                n = n + (multiply_32x32_rshift32_rounded(sq,
                        multiply_32x32_rshift32_rounded(x, 1358044250)) << 1);
                n = n << 1;

Code:

                // exp2 algorithm by Laurent de Soras
                // http://www.musicdsp.org/showone.php?id=106
                n = (n + 134217728) << 3;
                n = multiply_32x32_rshift32_rounded(n, n);
                n = multiply_32x32_rshift32_rounded(n, 715827883) << 3;
                n = n + 715827882;

Both of these compute the fractional part of the exponent. These few extra lines combine the result with the integer portion:

Code:

                n = n >> (6 - (control >> 27)); // 4 integer control bits
                fmult = multiply_32x32_rshift32_rounded(fcenter, n);
                if (fmult > 5378279) fmult = 5378279;
