Fast Convolution Filtering with Teensy 4.0 and audio board

Just to confirm: minimum phase lowpass (real valued) FIR filters built with MATLAB also work well with the code and seem to exhibit low latency (as far as I can measure) and good filter effects. (latest code is on github)

I built 512-, 1024-, 2048-, and 4096-tap filters, and they all have the same 128-sample delay. So this problem is solved: the partitioned convolution does what it should.

MATLAB's Filterbuilder has some severe restrictions: minimum phase is only possible with real-valued lowpass filters and equiripple design, not with complex-valued bandpass filters and a selectable window [all of which would be needed for SDR IQ filtering]. Additionally, calculating a 4096-tap minimum-phase filter takes 31 minutes on a standard 2.4 GHz laptop!

Now, the last and most complex problem to solve is the following:

* we need an algorithm that runs on the Teensy 4.0 to calculate COMPLEX bandpass FIR filter coefficients with minimum phase OR
* an algorithm that is able to transform linear phase coefficients to minimum phase (that runs on the T4) OR
* MATLAB code to calculate complex bandpass minimum phase filter coefficients for sizes 512, 1024, 2048, 4096 . . .
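For the second option, the textbook approach is the real-cepstrum (folded-cepstrum) method: keep the magnitude response, discard the linear phase, and reconstruct the unique minimum-phase counterpart via the cepstrum. Below is a double-precision sketch of that standard technique (my own illustration, not code from this project), with a naive O(N^2) DFT standing in for the CMSIS FFT that a T4 port would use; since everything stays complex throughout, it should in principle also apply to complex bandpass coefficients, though I have not verified stopband behaviour for that case.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
using std::vector;

static const double PI = std::acos(-1.0);

// Naive O(N^2) DFT -- stands in for arm_cfft_f32 on the Teensy.
static vector<cd> dft(const vector<cd>& x, bool inverse) {
    const size_t N = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    vector<cd> X(N);
    for (size_t k = 0; k < N; ++k) {
        cd acc(0.0, 0.0);
        for (size_t n = 0; n < N; ++n)
            acc += x[n] * std::polar(1.0, sign * 2.0 * PI * double(k * n) / N);
        X[k] = acc;
    }
    if (inverse) for (auto& v : X) v /= double(N);
    return X;
}

// Cepstral minimum-phase transform: keeps the magnitude response of h,
// replaces its phase with the minimum-phase response.
vector<cd> minimum_phase(const vector<cd>& h, size_t nfft) {
    vector<cd> x(h); x.resize(nfft, cd(0.0, 0.0));
    vector<cd> H = dft(x, false);

    // log magnitude (tiny floor avoids log(0) at deep stopband nulls)
    vector<cd> logmag(nfft);
    for (size_t k = 0; k < nfft; ++k)
        logmag[k] = cd(std::log(std::abs(H[k]) + 1e-12), 0.0);

    // cepstrum, then fold: double the positive-quefrency half,
    // zero the negative-quefrency half (DC and Nyquist kept once)
    vector<cd> c = dft(logmag, true);
    for (size_t n = 1; n < nfft / 2; ++n) c[n] *= 2.0;
    for (size_t n = nfft / 2 + 1; n < nfft; ++n) c[n] = cd(0.0, 0.0);

    // back to the spectrum, exponentiate, inverse-transform, truncate
    vector<cd> C = dft(c, false);
    for (auto& v : C) v = std::exp(v);
    vector<cd> hmin = dft(C, true);
    hmin.resize(h.size());
    return hmin;
}
```

Note that `nfft` should be much larger than the filter length to keep cepstral aliasing small; this brute-force DFT is only for clarity, so a T4 port would need the CMSIS FFT to be practical at 4096 taps.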

Any volunteer or ideas on that would be very welcome!
Could you explain why "we need an algorithm that runs on the Teensy 4.0 to calculate COMPLEX bandpass FIR filter coefficients..."?
@tschrama: I see (at least) three different applications for low latency partitioned convolution:

1.) guitar cabinet impulse response (IR) filtering: this is already possible with the existing code, using recorded impulse responses of up to 0.49 sec at a 44.1 ksps sample rate (= 21632 taps). So all "standard" IR lengths (512 samples, 1024 samples, 170 msec, 400 msec, 500 msec) can be used [the 500 ms version has to be cut to 21632 samples, but should work]. However, I still have to change the code, because at the moment the IR needs a zero inserted after every sample. I will make that change in the next few days, so one can enter the impulse response as it is, without having to insert zeros.

2.) low-latency lowpass filtering with a high number of FIR filter taps: this is already possible, but you have to calculate the FIR filter coefficients yourself. The coefficients have to be minimum-phase coeffs for the filter to work with low latency; if you use linear-phase coefficients, the filter works, but the latency is half the filter length. There has to be zero insertion in the filter coeffs as well, but I will work on that (see above). At the moment, I use the MATLAB FilterBuilder option for calculating minimum-phase coeffs, but the filter design is restricted to real-valued lowpass equiripple filters and is extremely slow, taking 31 minutes to calculate a 4096-tap filter on a 2.4 GHz laptop.

3.) Software defined radio main filtering for the IQ incoming signals: this is not possible with the code at the moment, because it needs complex bandpass FIR coefficients with minimum phase response.

Sorry for not being clear enough in my last post, but I was talking only about the third option needing complex coeffs :).
The partitioned convolution code has been changed now (is on github), so that any recorded impulse responses can be directly put into the code (without zero insertion).
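Presumably the change just moves the former "zero insertion" into the code itself: as I read it, the operation is nothing more than interleaving the real IR into the {re, im} layout that a complex-to-complex FFT expects, with every imaginary part set to zero. A trivial hypothetical helper (my own sketch, not the actual project code):

```cpp
#include <cstddef>

// Interleave a real impulse response into the interleaved {re, im} buffer
// layout used by a complex-to-complex FFT: each real tap is followed by a
// zero imaginary part (the "zero insertion" formerly done by hand).
void interleave_ir(const float* ir, float* cplx, size_t taps) {
    for (size_t i = 0; i < taps; ++i) {
        cplx[2 * i]     = ir[i];   // real part: the recorded sample
        cplx[2 * i + 1] = 0.0f;    // imaginary part: zero
    }
}
```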

If you have impulse response files as WAV files, a quick and easy way to get them into the code is to use Audacity --> Analyze --> Sample Data Export --> specify the number of samples to export and the Linear measurement scale. Then put the coeffs into an .h file and use them with the code :). [Use the replace function in Word or any other program to insert commas; BTW, ^p is Word's code for a paragraph break.]
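Instead of the Word search-and-replace, a small converter can do the comma insertion and .h formatting in one go. A hypothetical sketch (the `samples_to_header` function and its output layout are my own invention, not part of the project):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Turn exported sample values (one per line from Audacity's Sample Data
// Export) into a C array initializer -- the job done by hand above with
// Word's replace function (^p -> ", ").
std::string samples_to_header(const std::vector<float>& s,
                              const std::string& name) {
    std::ostringstream out;
    out << "const float32_t " << name << "[" << s.size() << "] = {\n";
    for (size_t i = 0; i < s.size(); ++i) {
        out << s[i] << "f";
        if (i + 1 < s.size()) out << ",";
        out << ((i % 8 == 7) ? "\n" : " ");   // 8 values per line
    }
    out << "};\n";
    return out.str();
}
```

The emitted `float32_t` type is the CMSIS typedef the sketches already use, so the generated file can be included directly.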
I don't see the problem. You can take (i.e. measure) the impulse response of any filter and feed it to a convolver.
And why do those coefficients need to be calculated on a T4? Why not store your IR as a file on a Teensy?
Ah, I see, software defined radio... that needs those complex coefficients and minimum phase.

But radio isn't bothered by latency, is it? So why are you trying to use a method that is designed for low latency?
Thanks! I know a thing or two about MATLAB, IRs and guitar... but please excuse my total lack of radio knowledge.
@Frank: Congratulations, it looks like you solved all of the issues except the one you need for SDR.
I wondered if you were planning on embedding this routine into a Teensy Audio library object, like I did with your older routine?
Yes, it seems all options work, except the one I am most interested in and which I was originally targeting ;-). If there is any volunteer to put the code into an audio lib LOW LATENCY IMPULSE RESPONSE PLAYER object, that would be perfect!
I never dealt with audio lib objects, so I am a total beginner in that respect. Also, one would have to decide whether this object only works on a T4, and how and where one would store the impulse responses. Additionally: should the algorithm be stereo, like it is now, or mono? In the latter case, the algorithm would have to be altered internally, which would require some work, but it would enable about double the IR size [i.e. a bit less than one second at 44.1 ksps on a Teensy 4.0; with external RAM this could probably be extended a bit, but then processor load comes into play even when overclocked, so more than 2 seconds will probably not be possible even with an overclocked T4 with external RAM].
Hi Frank: Yes, I thought you were really interested in the SDR filters, and was a bit surprised when you drifted into the guitar cabinet simulation aspect of it. The fact that the guitar IR files were minimum latency did help you make progress with the routine, which was a nice coincidence. For the library object, I think it would be possible to use a T3.6 if one kept the tap length shorter, to fit the 3.6's smaller memory. I was thinking exactly the same as you: for guitar cab simulation the signal is mono, so it would make sense to code it mono and double the maximum number of taps. I don't think external RAM would be an option. The i.MX RT MCU could handle external SRAM using a parallel address/data bus, but those pins are not brought out on the T4. While SPI/QSPI SRAM is possible, it would be too slow for this application, I think. But a 1 sec impulse length is perfectly fine for cabinet simulation, and convolution reverb would work OK for times under 1 sec using your routine (or something like it). Really "lush" reverb is generally several seconds long, and that is out of reach for the T4.
There are some fellows on the forum who could probably embed your code into a Teensy audio library object, as well as tailor it so that it would compile differently to match either a T3.6 or T4. I don't have that expertise currently, but I did code your original routine into a Teensy lib object, so I guess I could do a T4 version for this one as well. Personally, I think that to be really useful for guitar cabinet simulation, one really has to be able to pick the desired IR file at run time, probably from an SD card, and load the impulse into SRAM, instead of having one or more hard-coded files in program flash. That is how I did it with the lib object I wrote for your original routine.
If one of the real programming "Experts" steps up to do the library coding, I'll defer to them. If not, I can certainly give it a try.
Hmm, I have gone through the variables again and I fear we cannot save memory when going to a MONO version. These are the variables used in the Stereo version:

FFT length is 256
partitionsize is 128
nfor is number of blocks --> 169 for an IR of 21632 taps

const float32_t PROGMEM impulse_response[21632];
float32_t DMAMEM maskgen[512]; // SAME FOR MONO
float32_t DMAMEM fmask[169][512]; // SAME FOR MONO, however maybe one could cut this into half by intelligent file management: BUT I think that would not make a difference, because this variable has to be either in DMAMEM or in RAM1 . . . so we cannot partition equally anymore
float32_t DMAMEM fftin[512]; // SAME FOR MONO
float32_t DMAMEM accum[512]; // SAME FOR MONO
float  fftout[169][512]; //  SAME FOR MONO, because output of a real-to-complex FFT would have the same size as this complex-to-complex FFT

float32_t DMAMEM float_buffer_L [128];  //  SAME FOR MONO
float32_t DMAMEM float_buffer_R [128]; //  SAME FOR MONO
float32_t DMAMEM last_sample_buffer_L [128];  //  SAME FOR MONO
float32_t DMAMEM last_sample_buffer_R [128]; //  SAME FOR MONO

We have two very large variables, one in RAM1 [fftout], the other in RAM2 [fmask]. So if we cut one of them in half, we still have the other array at full size, filling up one part of the RAM. So I do not think it's worth coding a MONO version; what do you think?
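For the record, the arithmetic behind those sizes (my own back-of-envelope, using the numbers listed above) can be written down directly:

```cpp
#include <cstddef>

// Memory arithmetic for the partitioned-convolution buffers listed above:
// FFT length 256 gives a partition (block) size of 128 samples, and each
// stored FFT frame holds 2*256 interleaved {re, im} floats = 512.
constexpr size_t FFT_L        = 256;
constexpr size_t PARTITION    = FFT_L / 2;   // 128-sample blocks
constexpr size_t FRAME_FLOATS = 2 * FFT_L;   // the [..][512] dimension

constexpr size_t partitions(size_t taps) {   // "nfor" in the code
    return (taps + PARTITION - 1) / PARTITION;
}
constexpr size_t frame_array_bytes(size_t taps) {  // size of fmask or fftout
    return partitions(taps) * FRAME_FLOATS * sizeof(float);
}
```

So a 21632-tap IR needs 169 partitions, and each of the two big arrays (fmask in RAM2, fftout in RAM1) then occupies 169 × 512 × 4 = 346,112 bytes, which is why halving only one of them does not free up the other memory block.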

I am not sure about where to put the IRs. Maybe for the T4 (no SD card!) hardcoding a nice selection of usable IRs in FLASH is sufficient for an on-stage real-time version of the guitar cabinet simulator? However, for the T3.6 you are right, using an SD card would be useful. Maybe I will test whether the code will run on the T3.6 and how many taps it can use at most . . .
Hi Frank: Since I wasn't the one who wrote the code, I neglected to think about 1) the filter mask having to be the same size, mono or stereo, and 2) the memory being in two distinct blocks (although I only finished my Soundfont synthesizer project a month ago, and had to do the same juggling to fit all of its large arrays in using DMA memory). So you are right, of course: if one went to mono and tried to increase the number of taps further, the complex filter mask would be the biggest array and would still be in the DMA block. But if you moved a few of the other DMAMEM arrays out into the "normal" SRAM, it would free up space in DMAMEM for a larger filter mask. Since fftout would only be half the size, due to only one channel, I think one could increase the tap count significantly, but not double it as I assumed. The other consideration here is that if your routine were made into a Teensy library, the chances are good that other users would want to add other audio library function blocks to the program for other features. At that point, audio block memory would need to increase. So, if for no other reason, going to mono and shrinking the fftout array by half would free up some space in "normal" SRAM for use by the library audio blocks.
The IR storage is the other issue. Right now your .h files have the IR array as constants, so they should reside in program flash. You are only loading one of them now, so the space taken up by the others is not used. I had loaded your 08/11/2019 version from Github; when I compile it, it uses only 105,120 bytes of flash. So you could easily add many more IR .h files without filling up the 2 MB program flash. I know that the T4 transfers the program itself from program flash to SRAM at bootup. I would assume that any constants in program flash would also get transferred to SRAM, BUT I AM NOT CERTAIN OF THIS. Your 08/11/2019 version, which includes only one fixed filter mask of 16384 taps (the Marshall 197 impulse response), takes 386,640 bytes of SRAM. I don't know whether the 386,640 figure includes the IR constants or not. If it does, you wouldn't be able to store many IR files in flash at once without overflowing the SRAM once they were transferred (assuming that they are transferred from flash to SRAM).
That is why I like the SD card route. People using the routine as a library object might not want to go into the library code to specify which .h file to use, or for that matter, want to translate an off-the-shelf IR WAV file into a form suitable for a .h file. The demo program that I wrote for my convolution library will accept an IR file in WAV format and import it directly into the coefficient array (although my demo only read in 513 taps; strictly speaking you need FFT/2 + 1 samples for an FIR filter).
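A sketch of what that WAV import involves: a minimal 16-bit PCM parser that walks the RIFF chunks, finds "data", and scales samples to ±1.0 floats for the coefficient array. This is my own illustration, not Brian's library code; it assumes a little-endian host and mono data, and skips the fmt-chunk validation a robust version would do.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal parser for a 16-bit PCM WAV image held in memory. Walks the RIFF
// chunk list, and on finding "data", converts the int16 samples to floats
// scaled to +/-1.0. Returns false if the RIFF/WAVE framing is missing.
bool wav16_to_floats(const std::vector<uint8_t>& wav, std::vector<float>& out) {
    if (wav.size() < 12 || std::memcmp(&wav[0], "RIFF", 4) ||
        std::memcmp(&wav[8], "WAVE", 4))
        return false;
    size_t pos = 12;
    while (pos + 8 <= wav.size()) {
        uint32_t sz;
        std::memcpy(&sz, &wav[pos + 4], 4);          // little-endian chunk size
        if (!std::memcmp(&wav[pos], "data", 4)) {
            for (size_t i = 0; i + 1 < sz && pos + 8 + i + 1 < wav.size(); i += 2) {
                int16_t s;
                std::memcpy(&s, &wav[pos + 8 + i], 2);
                out.push_back(s / 32768.0f);          // scale to +/-1.0
            }
            return true;
        }
        pos += 8 + sz + (sz & 1);                     // chunks are word-aligned
    }
    return false;
}
```

On a Teensy the byte buffer would of course come from `SD.open()` reads rather than living in RAM all at once, but the chunk walking is the same.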
Even though the T4 doesn't have the SD card built in, it has the IO pins needed to add an SD card. Also, the Audio Shield itself has an SD card socket on it, and chances are most people, including myself, use that for the CODEC.
That said, if one could fit 1.5+ MB of IR files into T4 program flash, and if constants are not transferred to sram on bootup, then your method has a lot of advantages.
I have been playing around with the code a couple of days on two hardware setups:

* T4 and DAC PCM5102a & ADC PCM1808
* T3.6 and Teensy audio shield rev B

I have also tried T4 and the Teensy audio shield, but I failed again in getting this setup to run, I get drizzle noise again in all situations.

Also, even with the T3.6 and the Teensy audio shield I now cannot find a configuration that is satisfactory; I always get drizzling noise/artefacts. I already soldered a 100 Ohm resistor into the MCLK line on the audio shield itself, but the problem is still not solved, although I use proper 1 cm headers to connect the T3.6 and my audio shield. (Yes, I use a stereo microscope for soldering and checking all solder connections ;-))

I begin to think I have a faulty Teensy audio shield, maybe . . .

The good thing is, that I found out that the code also can run on the Teensy 3.6 for IRs up to a length of about 7552 taps.

With the T4, it can now process IRs with a length of up to about 24000 taps, that is a little more than 500msec and has a processor load of 50% with that length.

So, with the frustrating hardware issues continuing, I will stop development of the low latency partitioned convolution code now, because I cannot proceed further at the moment.

Mainly because I have no access to the SD card (which would be necessary for loading IRs):

* with Teensy 3.6 (SD card) I cannot eliminate the noise
* and with the T4.0 setup the audio is nice, but I cannot use an SD card (because my T4.0 will not work with my Teensy audio shield, and I see no way to solder an SD card holder to my T4 myself; did anybody do that already?)

So I feel I cannot develop this further now, until I have access to a working audio shield that will play with the T4 (rev D audio shields are impossible to obtain in Central Europe at the moment [even Digikey does not have it in stock] and it is unclear to me whether they will exhibit the same noise problems as my rev B audio shield with the T3.6.)

So, everybody feel free to use the code to build a convolution object or whatever you want. The latest and optimized code is on github.

@Brian, thanks for your thoughts and comments and measurements and all the help!
Hi Frank: You're very welcome!
It would seem like you must have a bad Teensy Audio shield. It definitely should work with the audio shield mounted directly onto the T3.6 via short headers. I assume you are testing it with something simple like the audio pass-through demo, thus eliminating any code errors. (although you can get things working with your alternate ADC/DAC, so the code must be OK)
My old audio shield works great on the T4- you can see the length of my wiring in post #14. I only use a 100 ohm resistor on MCLK.
For your conv. filter testing, I just used the SD card socket on the audio shield. But for my T4 Soundfont Synth, I used only a PT8211 DAC. For the SD card socket, I hand-wired one up from the footprint on the T4 PCB bottom. I had to use wire-wrap wire for this, but the wiring is about 11 cm long, and that works fine.
Maybe you'll get a new Rev D board from neurofun's vendor suggestion.
I'll have to take a look at your latest github code. I guess you must be conditionally compiling for the T3.6/T4, as I thought the T3.6 had no separate DMA RAM section, and indeed much less RAM overall.
Thanks for the hints on where to get the T4 audio board, I will try it there.

I added the T3.6 convolution code to the github.

Have fun with the Teensy!
Hi Frank: Over the weekend I adapted your uniformly partitioned (U.P.) convolution code into a Teensy Audio library. It seems to work identically. However, in class libraries you can't declare an array in DMAMEM. Also, in your program, depending upon which (fixed) impulse you choose, the compiler is able to declare the fftout and fmask arrays with an appropriate size. You can't do that in a class library, because that would be a dynamic declaration, and that's not allowed.
So, I don't use DMAMEM and as a result, my max impulse size is only about 1/2 of what you can get.
But, I think it may be possible to declare one or both of those big arrays in the main program using DMAMEM, and just pass pointers to them to the library routines. They are both 2 dimensional arrays though, so manipulating them via pointers is more work, especially in the section of code following your comment " // doing 8 of these complex multiplies inside one loop saves a HUGE LOT of processor cycles"
I'll play with this first before posting it to github
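One way to get 2-D indexing back without hand-rolled pointer arithmetic is to pass a pointer-to-array-of-512 into the object: the compiler then generates the same index math as for a native 2-D array. A hypothetical sketch (the `ConvolutionUP` class and its members are my own names, not Brian's actual library):

```cpp
#include <cstddef>

// Frame layout used by the convolution code: 512 floats per partition.
constexpr size_t FRAME = 512;

// The library object keeps only a pointer-to-array, so the caller can put
// the actual storage wherever it likes (e.g. in DMAMEM in the main sketch)
// while the update code still writes fftout_[block][idx] with plain 2-D
// indexing -- the compiler computes the element offsets.
class ConvolutionUP {
public:
    ConvolutionUP(float (*fftout)[FRAME], size_t nblocks)
        : fftout_(fftout), nblocks_(nblocks) {}

    void clear() {
        for (size_t b = 0; b < nblocks_; ++b)
            for (size_t i = 0; i < FRAME; ++i)
                fftout_[b][i] = 0.0f;
    }

    float& at(size_t b, size_t i) { return fftout_[b][i]; }

private:
    float (*fftout_)[FRAME];   // pointer to rows of 512 floats
    size_t nblocks_;
};
```

In the main sketch one would then write something like `DMAMEM static float fftout[169][512]; ConvolutionUP conv(fftout, 169);` (DMAMEM being the Teensy-specific attribute), keeping the large array outside the class while the library keeps its natural indexing.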
First of all I'd like to thank @bmiller and @DD4WH for their awesome work here. I'm playing with the code for its FFT capability as I need high resolution FFT/iFFT to perform inter-channel correlation of stereo audio for extraction and synthesis, not for analysis or FIR convolution. I'm working with a Teensy 4.0 on slightly different hardware using a CS42448 with TDM I/O to provide 2 input and 4 output channels. For now I'm only looking at getting 2 channels of audio in and through the float conversion/FFT/iFFT/conversion and output. In so doing I'm having some audio artifacts that appear with each iteration of the loop() and I'm wondering if it's the TDM hardware/AudioBuffer interaction or the way I'm configuring the software. I've played with each of the posted variations briefly and they all appear to exhibit the same problem, but to be sure - with nc=128 and FFT_L=4096 on the 11/01/19 code from post #35, are either of you experiencing audio 'overlap' clicks at the loop frequency? Given the information you've each posted I suspect this is a nuance of the hardware differences but it seems prudent to ask. I'll be pulling the scope out to see what the output looks like - maybe that will help narrow down the cause.
Thanks for any feedback you might be able to provide!
@Brian: that sounds really really good, that would be a major achievement to have a partitioned convolution audio object!

@highly: hard to say what's going wrong, but it seems you misunderstood the reason for a <partitioned> convolution: the FFT size should stay small (256), so that the latency is small. If you have a 128-coefficient filter and an FFT size of 4096, that does not make sense to me. And of course you would have to overlap your input buffers in order to have nice audio. For your purpose, a normal convolution object, such as Brian's already existing audio object, would probably be perfect. Or you could try the NON-partitioned convolution code from the very first post in this thread.
@highly: Neither Frank nor I are using TDM I/O. We're both using the standard (stereo) I2S interface: in his case a PCM1808 ADC and a separate DAC, and I'm just using the Audio Shield. When I run Frank's code, I get no artifacts with my Audio Shield. I've converted his code into a Teensy audio library object, and get no artifacts with that either. So I am guessing it might be the TDM interface, but I've never used it myself.
Both of our routines wait until blocks have arrived from both channels. Maybe it's a bit different with TDM, as it can handle more channels. Just a complete guess, though.
@Frank: Thanks. I have made more progress. I now declare the fftout array in the main program, as DMAMEM, and just pass a pointer into the audio library. So I can now make the impulse longer, like yours. But in the audio update routine, where all the calculations take place, I now have only a pointer to the fftout array. So that complex multiply you do near the end had to be rewritten using pointers to the two arrays for all eight complex multiply operations. I was feeling really proud of myself, because I was eliminating all of the multiplications/additions needed for the various array index calculations by clever manipulation of the pointers; I figured my way would be much quicker. However, it is about twice as slow!!! (2 ms instead of 1.2 ms for yours.) I am just now realizing that the compiler is probably so smart that it is converting that whole complex multiplication routine of yours into one call to the CMSIS complex multiply routine. Those CMSIS routines are super fast, as they use instructions that we can't directly access in C. If I am right about this, I hope to be able to replace my (pointer-access) complex multiply routine with one that just calls the CMSIS complex multiply routine. I have to admit, I'm getting right at (or above) the limit of my programming experience here.
I wonder if you were aware of the CMSIS complex multiply optimization when you put this comment in your code:
doing 8 of these complex multiplies inside one loop saves a HUGE LOT of processor cycles

Brian, that sounds good!

I don't think the compiler is smart enough to use the CMSIS routines; no, it probably doesn't. But AFAIK the CMSIS routine does the same as I did: not using one loop with an index, but putting four or eight single instructions into one loop pass. That's the whole trick ;-).

Someone in this forum (I don't remember who, it's a long time ago) pointed that out: the optimization in the CMSIS cplxmultiply routine is just doing several instructions in a row, without loop indexing. It seems the indexing takes a lot of time . . . !?

I was really astonished how much the execution time is influenced by variable access and indexing; there is probably a lot more room for optimization in the code :).

All the best,
