FFT/Convolution object for Teensy Audio library (guitar cabinet simulation)

bmillier
I've written a floating-point FFT-based convolution filter object for the Teensy Audio library. It implements a 513-tap convolution filter by converting the signal and filter mask into the frequency domain, where the convolution can be performed much more quickly. In other words, the convolution/filtering can be done in real time at the 44,100 Hz sample rate used in the Audio library. I based my code on the excellent work done by Frank, DD4WH, on his convolution SDR (posted on this forum). Note that a conventional FIR filter object already exists in the Teensy Audio library, but it is limited to 200 taps and uses fixed-point math.
Besides standard 513-tap FIR filtering applications, this routine can be used for guitar amplifier cabinet simulation, such as is found in commercial DAW plug-ins. In this case an "impulse" file (a WAV file) is used in place of the 513-point FIR filter coefficient array.
The convolution routine is done in floating point, which eliminates the limitations imposed by the fixed-point (Q15) routines generally used in the Teensy Audio library. However, this means that some modifications must be made to the Teensyduino distribution to add the floating-point CMSIS DSP routines needed by this object. These modifications are identical to those needed for the DD4WH SDR project, also found on this forum. A link to these instructions is contained in the repository I've created on GitHub for this FFT/convolution routine, at

https://github.com/bmillier/Teensy-FFT-Convolution-Filter

In the repository I've included the .cpp and .h files needed for this library object, as well as two examples:
1) a 513-tap FIR filter example
2) a guitar amplifier cabinet simulation example (uses the Teensy 3.6's on-board SD socket to host an SD card containing the cabinet simulation WAV file(s))
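
To give an idea of how the object drops into a standard Audio library sketch, here is a skeleton. The class and method names below ("AudioFilterConvolution", "begin()") are illustrative only; see the .h file and the two examples in the repository for the actual names.

```cpp
// Skeleton only: "AudioFilterConvolution" and "begin()" are illustrative
// names; check the repository's .h file for the real class and methods.
#include <Audio.h>

AudioInputI2S            audioIn;
AudioFilterConvolution   conv;          // the FFT/convolution object
AudioOutputI2S           audioOut;
AudioConnection          c1(audioIn, 0, conv, 0);
AudioConnection          c2(conv, 0, audioOut, 0);
AudioControlSGTL5000     codec;

const float32_t fir513[513] = { /* 513 FIR taps or cabinet impulse samples */ };

void setup() {
  AudioMemory(24);                      // the object buffers 4 x 128-sample blocks
  codec.enable();
  conv.begin(fir513, 513);              // illustrative: load the filter mask
}

void loop() {}
```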

While the FFT-based convolution routine is very efficient, it does require 64% of the processor's execution capacity for 1 out of every 4 audio update cycles (the FFT/convolution processing must be done in 512-sample blocks, while the audio library works in blocks of 128 samples). Therefore, there are definite limits on how much additional audio processing you could add in your application.
I developed my audio library code by first modifying DD4WH's "inline" SDR sketch code to perform only the FFT/convolution that I needed. I haven't included my sketch-based (in-line code) version of the routine in the repository, but I could add it if people were interested in developing guitar amplifier cabinet simulation for the Teensy without actually adding this new object to the Teensy Audio library.
I used the following source extensively to understand the underlying theory of FFT convolution. If, like me, you are not a math wizard, this is the best treatment of the subject I have come across.

The Scientist and Engineer's Guide to Digital Signal Processing
By Steven W. Smith, Ph.D.

found at www.dspguide.com
You can buy the book, or read the various chapters in PDF, for free, at that website. Thanks Dr. Smith for this excellent book!
 
Awesome, congratulations! I will test when I have time. I am guessing that the .wav file has to be exactly the right length for this to work?
 
@Adrian - Thanks. The convolution algorithm requires 513 samples in the impulse WAV file. The impulse files I have seen are all long (500 KB or greater) but contain just 513 samples followed by zero padding for the rest of the file. My sample program just reads 513 samples and ignores the rest. If the impulse file were shorter than 513 samples, it wouldn't work.
The 513-sample convolution size isn't arbitrary. The larger the convolution size, the more accurate the modelling; 513 is the size that works with the 1024-pt FFT routines in the CMSIS library. Also, the larger the convolution size, the more latency (delay) there is in the processing. For live performance, guitar players find latencies >25 ms noticeable. My 513-sample convolution has a latency of 19.5 ms, which is close to the minimum you can get at the 44,100 Hz sample rate.
 
Hmm... that's an 11 millisecond impulse (513/44100)? I use redwire impulses (and the kefir 2 VST) for all of my recording... I'll have a look at how long the impulses are... I had it in my mind that they are about a second long... they do have a very small active footprint, though; as you say, a lot of padding.

Live, certainly, 20 ms latency is doable. I'd say you are right on the border there, as you say. I will have a play when I have time.

Again, congratulations on getting this done!! awesome. A really great achievement.
 
@adrian - Yes, the process involves bringing in 512 audio samples (4 blocks) at 2.9 ms per block, or 11.6 ms. Then I do the convolution math, which is pretty fast. Then I have to send out the 512 audio samples, in 4 blocks, again at 2.9 ms per block. There is a 1-block overlap in this, so 7 x 2.9 ms = 20.3 ms. I measured 19.5 ms; there must have been a bit of inaccuracy in my measurement.
I have to play by the rules of the Teensy Audio library, but the PC/Mac VSTs could be faster, as they have 10X more processing power and don't have to work within the Teensy Audio framework. Actually, as I was writing this, I realized I could probably decrease my latency by using only 128 "new" audio samples (plus 384 earlier ones), doing the 513-pt convolution, and then sending out only the last 128 samples of the convolved result. This would take 4X the processing power, however, straining the Teensy. Even now, I have to add 512 "old" samples in with the 512 "new" ones and do the math using 1024-pt FFTs, so that the block-based (circular) FFT convolution behaves like linear convolution on a continuous signal.
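
To make the mechanics concrete, here is a bare-bones sketch of this kind of 1024-pt fast convolution, in the overlap-save form, using the CMSIS floating-point FFT. The actual library code differs in details (buffering, possibly overlap-add or a real FFT), so treat this as an illustration of the math only.

```cpp
// Bare-bones 1024-pt fast convolution (overlap-save form) with the CMSIS
// floating-point FFT. Illustration only; the library code differs in detail.
#include "arm_math.h"
#include "arm_const_structs.h"

#define NFFT  1024
#define BLOCK 512   // new samples consumed per call (4 x 128-sample blocks)

static float32_t maskSpectrum[2 * NFFT];  // FFT of the zero-padded 513-tap mask
static float32_t buf[2 * NFFT];           // interleaved complex work buffer

// Once, at setup: zero-pad the 513 filter taps to 1024 points and transform.
void prepareMask(const float32_t *taps) {
  for (int i = 0; i < NFFT; i++) {
    maskSpectrum[2 * i]     = (i < 513) ? taps[i] : 0.0f;  // real part
    maskSpectrum[2 * i + 1] = 0.0f;                        // imaginary part
  }
  arm_cfft_f32(&arm_cfft_sR_f32_len1024, maskSpectrum, 0, 1);
}

// Every 512 new samples: feed 512 previous + 512 new in, get 512 filtered out.
void convolve512(const float32_t *oldS, const float32_t *newS, float32_t *out) {
  for (int i = 0; i < BLOCK; i++) {     // pack real samples as complex
    buf[2 * i]               = oldS[i];
    buf[2 * i + 1]           = 0.0f;
    buf[2 * (i + BLOCK)]     = newS[i];
    buf[2 * (i + BLOCK) + 1] = 0.0f;
  }
  arm_cfft_f32(&arm_cfft_sR_f32_len1024, buf, 0, 1);       // forward FFT
  arm_cmplx_mult_cmplx_f32(buf, maskSpectrum, buf, NFFT);  // spectral multiply
  arm_cfft_f32(&arm_cfft_sR_f32_len1024, buf, 1, 1);       // inverse FFT
  for (int i = 0; i < BLOCK; i++)       // last 512 samples are the valid ones
    out[i] = buf[2 * (i + BLOCK)];
}
```

The forward and inverse transforms dominate the cost; the complex multiply is cheap by comparison.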
 
Nice work! If needed, there are several ways to reduce the latency further:

1) Instead of one big FFT, you can use uniform-partitioned FFT convolution to operate in blocks of 128. Slightly less efficient, but avoids the buffering latency and also distributes the computation.
2) You can reduce the group delay of the filter by designing (or converting to) a minimum-phase FIR filter. This has the same magnitude response, but instead of a symmetric FIR with a group delay of 256 samples, it moves the peak of the impulse response to the start.
 
Nice work! If needed, there are several ways to reduce the latency further:

1) Instead of one big FFT, you can use uniform-partitioned FFT convolution to operate in blocks of 128. Slightly less efficient, but avoids the buffering latency and also distributes the computation.

I'm not quite sure what uniform-partitioned means, but the idea of operating on a single 128 block at a time seems like a much better idea than filling the buffer with 4 blocks. I guess shortening the impulse is out of the question... 512 samples is, I guess, reasonably close to the wind, resolution-wise? Or maybe you could get away with 256-sample impulses, working within the limitations of the machinery.

The idea of a ring buffer could help (as bmillier has suggested)?

2) You can reduce the group delay of the filter by designing (or converting to) a minimum-phase FIR filter. This has the same magnitude response, but instead of a symmetric FIR with a group delay of 256 samples, it moves the peak of the impulse response to the start.

Officially out of my depth... very impressed by bmillier's achievement.
 
For partitioned convolution, the filter and the output are the same, only the machinery is different. The basic idea is to think of your long filter as the sum of shorter filters (where all but the first include some delay). This paper does a good job explaining it: http://www.ericbattenberg.com/school/partconvDAFx2011.pdf
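
To sketch the idea in code (my rough, untested outline, assuming the Teensy's 128-sample blocks, 256-point CMSIS FFTs, and a 513-tap filter split into 5 partitions of 128 taps, the last one zero-padded):

```cpp
// Rough outline of uniform-partitioned convolution. Assumptions: block
// size N = 128, NFFT = 256, 5 filter partitions of 128 taps each.
#include <string.h>
#include "arm_math.h"
#include "arm_const_structs.h"

#define N     128
#define NFFT  256
#define PARTS 5

static float32_t H[PARTS][2 * NFFT];    // spectra of the filter partitions
static float32_t fdl[PARTS][2 * NFFT];  // frequency-domain delay line
static float32_t acc[2 * NFFT];         // spectral accumulator
static float32_t prod[2 * NFFT];        // scratch for one partition product
static float32_t prev[N];               // previous input block (overlap-save)
static int head = 0;                    // slot holding the newest input spectrum

// Call once: transform each 128-tap slice of the filter, zero-padded to 256.
void partitionFilter(const float32_t *taps, int numTaps) {  // numTaps = 513
  for (int p = 0; p < PARTS; p++) {
    for (int i = 0; i < NFFT; i++) {
      int t = p * N + i;
      H[p][2 * i]     = (i < N && t < numTaps) ? taps[t] : 0.0f;
      H[p][2 * i + 1] = 0.0f;
    }
    arm_cfft_f32(&arm_cfft_sR_f32_len256, H[p], 0, 1);
  }
}

// Call once per 128-sample audio block.
void processBlock(const float32_t *in, float32_t *out) {
  head = (head + PARTS - 1) % PARTS;    // rotate so 'head' is the new slot
  for (int i = 0; i < N; i++) {         // pack 128 old + 128 new samples
    fdl[head][2 * i]           = prev[i];
    fdl[head][2 * i + 1]       = 0.0f;
    fdl[head][2 * (i + N)]     = in[i];
    fdl[head][2 * (i + N) + 1] = 0.0f;
    prev[i] = in[i];
  }
  arm_cfft_f32(&arm_cfft_sR_f32_len256, fdl[head], 0, 1);  // one forward FFT

  memset(acc, 0, sizeof(acc));
  for (int p = 0; p < PARTS; p++) {     // older spectra meet later partitions
    arm_cmplx_mult_cmplx_f32(fdl[(head + p) % PARTS], H[p], prod, NFFT);
    arm_add_f32(acc, prod, acc, 2 * NFFT);
  }
  arm_cfft_f32(&arm_cfft_sR_f32_len256, acc, 1, 1);        // one inverse FFT
  for (int i = 0; i < N; i++)
    out[i] = acc[2 * (i + N)];          // keep the valid last 128 samples
}
```

Per 128-sample block this is one 256-point FFT, five complex multiplies, and one inverse FFT, which is how the load gets spread evenly across audio updates.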

My other point was, if you don't care about the phase response, you can convert a linear-phase FIR to a minimum-phase FIR and get rid of almost all of the group delay (filterlength/2 samples). EDIT: In this case, the cabinet IR is already minimum-phase.

With both of these combined, you have close to zero delay (regardless of filter length). This is important for something like a real-time convolutional reverb.
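
For reference, here is the standard homomorphic (real-cepstrum) recipe for the minimum-phase conversion. This is an offline step you'd run on a PC when preparing the filter, so the naive O(N^2) DFT below is for clarity, not speed.

```cpp
// Offline linear-phase -> minimum-phase conversion via the real cepstrum.
#include <complex>
#include <vector>
#include <cmath>

using cplx = std::complex<double>;
static const double PI = 3.141592653589793;

static std::vector<cplx> dft(const std::vector<cplx> &x, bool inverse) {
  size_t N = x.size();
  std::vector<cplx> X(N);
  double sgn = inverse ? 1.0 : -1.0;
  for (size_t k = 0; k < N; k++) {
    cplx sum(0, 0);
    for (size_t n = 0; n < N; n++)
      sum += x[n] * std::polar(1.0, sgn * 2.0 * PI * (double)(k * n) / N);
    X[k] = inverse ? sum / (double)N : sum;
  }
  return X;
}

// Same magnitude response, energy packed at the start of the impulse response.
std::vector<double> minimumPhase(const std::vector<double> &taps, size_t N = 4096) {
  std::vector<cplx> x(N, cplx(0, 0));   // heavy zero-padding improves accuracy
  for (size_t i = 0; i < taps.size(); i++) x[i] = taps[i];
  std::vector<cplx> X = dft(x, false);
  for (auto &v : X) v = std::log(std::abs(v) + 1e-12);  // floored log magnitude
  std::vector<cplx> c = dft(X, true);   // real cepstrum
  for (size_t n = 1; n < N / 2; n++) {  // fold: keep the causal part, doubled
    c[n] *= 2.0;
    c[N - n] = 0.0;
  }
  std::vector<cplx> C = dft(c, false);
  for (auto &v : C) v = std::exp(v);    // exponentiate back to a spectrum
  std::vector<cplx> h = dft(C, true);
  std::vector<double> out(taps.size());
  for (size_t i = 0; i < out.size(); i++) out[i] = h[i].real();
  return out;
}
```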
 
For partitioned convolution, the filter and the output are the same, only the machinery is different. The basic idea is to think of your long filter as the sum of shorter filters (where all but the first include some delay). This paper does a good job explaining it: http://www.ericbattenberg.com/school/partconvDAFx2011.pdf

Partitioned convolution is definitely the way to handle reverb and very long responses.
I was wondering if it would be useful for a 'normal' band-pass filter (say, a 1024-point FFT with 129 FIR coefficients).
Any experience?
 
@sublinear - A quick look at the paper you cited seems to confirm that this would be a lower-latency way to go. I'm mostly a hardware guy, and it was challenging just to code up the "standard" FFT convolution routine that I found in Steven Smith's DSP book, which I cited in my GitHub repo. And yes, the cabinet impulse files are minimum-phase; you can see that by plotting them.
@WMXZ - I suspect you are correct in thinking that this method is used for the convolution reverb VST plug-ins that I have on my DAW. The filter masks there are several seconds long, of course. In my case, I use these plug-ins only for post-processing, where any latency can be eliminated, as all of the samples are available "in advance", so to speak.
I don't actually play guitar (live, anyway). I wrote this library object on a challenge from a guitar-playing friend to whom I had introduced the Teensy.
 
@WMXZ For a filter with L=129, I think the standard FFT convolution (N=128 and NFFT=256) is the way to go. With a minimum-phase filter, you'll have near-zero added latency and great efficiency. You can also use a real-FFT to almost double the efficiency of the FFT and complex-multiply.

For long filters (say L > 512) I think uniform-partitioned convolution may be the best solution, assuming you need minimum latency. If latency doesn't matter, one long FFT (as bmillier is doing) is more efficient.

For a medium-length filter (say L=513) you can actually use a single FFT to process at the native block size (N=128). This requires an annoying 640-point FFT (you'd have to add a radix-5 stage) and is suboptimal in efficiency, but it does work. A better example might be N=128, L=385 and NFFT=512. The efficiency would fall somewhere between direct FIR filtering and a more balanced FFT convolution.
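
A quick note on where these FFT sizes come from (a standard identity, not specific to this library): for overlap-save or overlap-add fast convolution with block size N and filter length L, you need

NFFT >= N + L - 1

so N = 128 with L = 513 requires 128 + 513 - 1 = 640 points (hence the radix-5 stage), while L = 385 fits exactly into 512.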
 
@WMXZ For a filter with L=129, I think the standard FFT convolution (N=128 and NFFT=256) is the way to go. With a minimum-phase filter, you'll have near-zero added latency and great efficiency. You can also use a real-FFT to almost double the efficiency of the FFT and complex-multiply.

Ignoring latency for the moment,
let me assume that I have 3000 x 896 samples to filter with a 129-coefficient low-pass filter.

Considering only the FFT and IFFT with an overlap-discard approach,

to filter all the data, I would need
3000 x 1024-point FFT-IFFT combos (step size 896)
7000 x 512-point FFT-IFFT combos (step size 384)
or
21000 x 256-point FFT-IFFT combos (step size 128)

Using as a computational metric 2*FFT_LENGTH*log2(FFT_LENGTH) for each FFT-IFFT combo,
this would result in
61,440,000 ( 3000 x 20480)
64,512,000 ( 7000 x 9216)
86,016,000 (21000 x 4096)
FFT operations.

Clearly, the 1024-point FFT is more efficient than the 256-point FFT.
This is obvious, as the step size is 896 for the 1024-point FFT and only 128 for the 256-point FFT.

This is all clear and well known.

Now, what about partitioning the filter (instead of 129 taps) into 4 x 33 (?)
and carrying out
12000 x 256-point FFT-IFFT combos (step size 224).
One would get (ignoring the additional complex operations)
49,152,000 (12000 x 4096)
FFT operations.

Apart from the fact that it is not clear how to partition a 129-point filter,
a low-latency filter seems to beat, or draw with, a standard 1024-point FFT-IFFT combo.

Now, the questions are:
- is this correct?
- how would one partition a 129-point filter? Would one generate a 4 x 33 = 132 filter for partitioning, or work with (32+32+32+33) sub-filters?
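
For anyone who wants to reproduce these numbers, a small stand-alone PC program using the same metric and overlap-discard step sizes:

```cpp
// Quick check of the operation counts above, using the same
// 2*FFT_LENGTH*log2(FFT_LENGTH) metric per FFT-IFFT combo.
#include <cstdio>
#include <cmath>

int main() {
  const long totalSamples = 3000L * 896L;
  const int  taps = 129;
  const int  nfft[3] = {1024, 512, 256};
  for (int i = 0; i < 3; i++) {
    int  step   = nfft[i] - taps + 1;   // overlap-discard step size
    long combos = totalSamples / step;
    long ops    = combos * (long)(2 * nfft[i] * std::log2((double)nfft[i]));
    std::printf("NFFT=%4d step=%3d combos=%5ld ops=%ld\n",
                nfft[i], step, combos, ops);
  }
  // Partitioned case: 4 x 33-tap sub-filters with 256-point FFTs (step 224)
  long combos = totalSamples / (256 - 33 + 1);
  std::printf("partitioned: combos=%ld ops=%ld\n", combos, combos * 4096L);
  return 0;
}
```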
 
I think you're right; the theoretical operation count for partitioned convolution seems to always be lower than unpartitioned. In practice, when the FFTs get too small (a 32-point RFFT is only a 16-point FFT), the overhead and housekeeping really start to hurt.

I'd consider these tunable parameters, and choose whatever benchmarks as the fastest.
 
Hello,
Very interesting discussions here; I am very much interested in these developments.
I just wanted to point out that the critical point in a real-time system is not necessarily the algorithm's efficiency, but rather balancing the CPU load over time, as the main limitation is to never reach 100% load, to avoid dropping audio blocks.
As I am developing a distortion modeler, I am naturally interested. However, in my application the modeling tasks already take up to 85% CPU load on a Teensy 3.5. If I move to a Teensy 3.6, that would mean approximately 55% load, leaving about 30% CPU for a cabinet impulse response model.

Correct me if I am wrong, but I have the impression that if the FFT code were partitioned in 4, the load would be spread mostly evenly and could reach something close to 20%-30% (estimated from 64% / 4 plus some overhead). Do you think that this estimate makes sense?
If yes, this would be a great addition to my modeler and I could try to contribute to such a development.
 
@jcugnoni
I know my code method pretty well, but have not thought a lot about the segmented method outlined by the others. Certainly breaking the FFT down into 4 segments will spread the load out by a factor of 4. But I am pretty certain that four 256-pt FFT/iFFTs will take significantly longer to execute than the one 1024-pt FFT/iFFT that I do. The single 1024-pt FFT/iFFT requires padding the FFT arrays with trailing zeros and saving/adding part of the results from the previous iteration; there must be a fair bit of this type of extra calculation needed for the segmented method too, over and above just adding the results of the 4 separate FFT/iFFTs.
It would be nice if other math experts would enhance my library object, or write their own, to achieve the same results with lower latency/load on the MCU. At present, it is beyond my math skill level. I may dig in more later.
 
Hi Bill, this Convolution object looks like a very nice addition to the audio library!

A while ago, I also thought about how to decrease the latency of the convolution process in the context of my Teensy Convolution SDR, and I found some nice papers which I would like to share with you here:

https://dsp.stackexchange.com/quest...-convolution-plugins-process-audio-so-quickly

http://www.cs.ust.hk/mjg_lib/bibs/DPSu/DPSu.Files/Ga95.PDF

Here is pseudocode for the sliding DFT:
http://www.convict.lu/Jeunes/ultimate_stuff/RFT.pdf

"Sliding phase vocoder"
http://cs.bath.ac.uk/jpff/PAPERS/spv-icmc2007.pdf

However, I also found that this is way too complex for me to implement.
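
To give a flavor of the sliding DFT idea from the pseudocode link above: each incoming sample updates a DFT bin in O(1). A single-bin sketch (ignoring the damping factor that real implementations need for numerical stability):

```cpp
// Single-bin sliding DFT: each new sample updates bin k in O(1) via
//   S_k(n) = W * ( S_k(n-1) + x(n) - x(n-N) ),  W = e^(j*2*pi*k/N).
// Sketch only; real implementations use |W| slightly below 1 to keep
// rounding errors from accumulating.
#include <complex>

struct SlidingDFTBin {
  static const int N = 1024;            // analysis window length
  std::complex<float> W, S;             // twiddle factor and running bin value
  float history[N];                     // last N input samples
  int pos;

  explicit SlidingDFTBin(int k) : S(0, 0), pos(0) {
    W = std::polar(1.0f, 2.0f * 3.14159265f * (float)k / (float)N);
    for (int i = 0; i < N; i++) history[i] = 0.0f;
  }
  std::complex<float> update(float x) {
    S = W * (S + x - history[pos]);     // slide the window by one sample
    history[pos] = x;
    pos = (pos + 1) % N;
    return S;                           // DFT bin k of the most recent N samples
  }
};
```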

But I would like to highly encourage everyone to try to build that, because it could be used for many many purposes!

All the best, have fun!

Frank DD4WH
 
Hi Frank: Thanks. I spent a bit of time looking over these, as the comments by sublinear re segmented convolution seemed to be a viable way to decrease latency (though I am not sure why 20-30 msec of latency would be an issue in an SDR, like it is for live guitar playing). Math is not my forte, so take the following for what it is worth.
These papers seem to be striving for low to zero latency. As such, they are operating in an environment where each audio sample is available immediately as it is converted by the codec; similarly, each result can be sent out individually (as far as I can understand). In our case, we have to live with the 128-sample buffer inherent in the Audio library. Therefore, things they do, such as substituting a direct-form convolution (FIR) in place of the FFT convolution for the first 32 or 64 samples (where it is more efficient than the block FFT form), don't seem to make much sense when you have 128 samples available all at once. Sticking with radix-2 FFTs, the only combination applicable to us would be a 64-sample FIR followed by a 128-pt FFT-style convolution. I don't think there is much efficiency to be gained there.
That said, if one could substitute four 256-pt FFT/iFFTs in place of one large 1024-pt FFT/iFFT, by segmentation, then the latency could be reduced at the expense of MCU utilization. All else being equal, I guess the MCU utilization would go from 64% on every 4th audio block (i.e. 16% average) up to
16% x 4 x (log2 256)/(log2 1024)
= 16% x 4 x (8/10)
= 16% x 3.2 = 51.2%
This would be 51.2% on each block, unlike my implementation.

(Maybe less if you could do only one 1024 pt iFFT at the end instead of four 256 pt iFFTs.)
If I get ambitious, I may try this segmented FFT method.
 
That said, if one could substitute four 256-pt FFT/iFFTs in place of one large 1024-pt FFT/iFFT, by segmentation, then the latency could be reduced at the expense of MCU utilization.
Just dealing with FFTs: are you trying to compute a 1024-pt FFT with four 256-pt FFTs in sequential updates? Meaning, instead of doing one 1024-pt FFT every four blocks (with 50% overlap), you would do a 256-pt FFT every block?
 
@duff - The segmented FFT technique was brought up by sublinear in this thread, not me. In theory, convolution is a linear operation, so I guess you can break it up this way. But I haven't coded/tried it out yet.
 
I was just researching that and saw your post. One thing with "segmented FFTs" is that they would increase latency, because you still need to collect all the samples for the 1024-pt FFT. Getting a 1024-pt FFT from four 256-pt FFTs can't be done 256 samples at a time and then combined on the last one; you have to interleave the samples when doing the 256-pt FFTs, so you need all the samples before doing the first 256-pt FFT. This is where the increase in latency would come in. It would, though, average out the overall FFT CPU usage, since you are doing smaller FFTs across multiple audio updates.
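
For reference, the standard decimation-in-time identity behind this: if x_r[n] = x[4n + r] (r = 0..3) are the four interleaved subsequences and X_r are their 256-point FFTs, then

X[k] = sum over r of W^(r*k) * X_r[k mod 256], with W = e^(-j*2*pi/1024)

Each subsequence draws samples from the whole 1024-sample window, which is why none of the small FFTs can start until every sample has arrived.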
 
Could somebody in this thread please make an FFT/convolution block for the Teensy 3.2? I know the Teensy 3.2 only has fixed point, but I hope it can achieve this to some extent. It would be very helpful.
 
I wouldn't say it can't be done, but there would be a lot of trade-offs in signal quality. Using a Teensy 3.6 at a 180 MHz clock, my routine uses about 75% of the MCU's computing power on every 4th audio block it processes, and about 25% on the other 3 audio blocks (because the 1024-pt FFT, multiply, and iFFT routines need 512 new samples at a time to process; the audio sample blocks are only 128 samples in size, so 4 of them have to be collected, then processed).
A T3.2 running at 72 or 96 MHz would not be fast enough to do this, even assuming the fixed-point CMSIS DSP functions (T3.2) ran at the same speed as the hardware floating-point ones I use on the T3.6. I guess you could reduce the Audio library's sample rate to 22,050 Hz from its default 44,100 Hz, which might help.
But I think that doing the equivalent of a 513-pt convolution filter, which is what I am doing, would suffer greatly from the lack of precision of the fixed-point DSP routines. Overall, I doubt that it would provide satisfactory results, especially since you are expecting 16-bit/44.1 kHz quality from the hardware itself.
 
I'd prefer brickwall filtering with the Teensy 3.2. I know it cannot do something like the Teensy 3.6, but could you suggest the best filtering mechanism (brickwall filtering) achievable using a Teensy 3.2?
 
Hello all!
I must compare waveforms at regular intervals. Is it possible to save the FFT results to a file (with the FIR mask applied) and later play these results from the file through the iFFT? What would be the best option: save with the FIR applied, for later playback just through the iFFT, or save without the FIR and later play back through the FIR and iFFT? Thanks in advance!
 