Improvements to FFT

h4yn0nnym0u5e

Hi folks

I've been tinkering with the FFT code in the audio library, and think I've made some improvements:
  • processing is now spread across update() calls, so the peak CPU load has dropped (slightly)
  • works with smaller values of AUDIO_BLOCK_SAMPLES, and some larger ones (up to the FFT size)
    • I could probably make it work with even larger block sizes, but mostly people want smaller to reduce latency
    • there was untested and broken code for 64-sample blocks, apparently for Teensy LC, dating from 2015: I removed this
  • AudioAnalyzeFFT1024 can now do averaging, as AudioAnalyzeFFT256 already could
  • there's now only one pair of files, analyze_fft.cpp and .h, as pretty much all the code is common to both classes
It should all work exactly the same as before for existing code, with the exception that AudioAnalyzeFFT1024 objects now consume an extra 2kB of RAM for the averaging. If that's a problem it'd be very easy to revert it, and add a new object called something like AudioAnalyzeFFT1024Averaging.

The relevant branch can be found here. As ever, if you try it and find issues, please post a simple sketch which demonstrates the problem, that can be compiled using the Arduino IDE, and that requires no extra hardware.
 
Sorry, but could you explain the algorithm: how do you distribute, say, a 1024 point FFT over eight 128-sample data blocks (i.e. 8 update() calls)? This is not obvious from the code, at least to me.
 
To a large extent, I don't. I'm not even sure it's mathematically feasible to compute a partial FFT with only some of the data available, and finish it later... Someone with maths skillz will no doubt be along to clear that up :)

The original code stored 8 128-sample blocks, and every 4th update windowed all 1024 samples and computed the FFT, which means 1 in every 4 updates takes much more CPU. What I've done is to compute (part of) the windowed data on each update, and only compute the FFT on the 4th update. That 4th update is still more computationally intensive than the others, but at least this spreads the load a bit better.

Additionally, I noticed that it's not necessary to store 8 blocks, only 4. On each update an old block is windowed and discarded, and the new block is stored (in the old block's slot) and windowed. For AudioAnalyzeFFT256 it's a bit simpler, because the FFT is just computed every 256 samples - there's no 50% overlap as there is with AudioAnalyzeFFT1024.
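To make the scheme above concrete, here's a minimal sketch: float arithmetic and a Hann window stand in for the library's Q15 code, and all the names are made up, so treat it as an illustration rather than the actual implementation. Each arriving block is windowed straight into the second half of the frame, while the block stored 4 updates earlier is windowed into the first half and its storage slot reused:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the window-on-arrival scheme (hypothetical names, float maths).
constexpr std::size_t FFT_SIZE    = 1024;
constexpr std::size_t BLOCK       = 128;
constexpr std::size_t HALF_BLOCKS = (FFT_SIZE / 2) / BLOCK; // only 4 blocks stored
constexpr double kPi = 3.14159265358979323846;

struct OverlapWindower {
    std::array<float, FFT_SIZE> window{};  // applied piecewise, block by block
    std::array<float, FFT_SIZE> frame{};   // windowed samples, ready for the FFT
    std::array<std::array<float, BLOCK>, HALF_BLOCKS> stored{};
    std::size_t slot = 0;   // which stored block is oldest
    bool primed = false;    // false until the first 4 blocks have arrived

    OverlapWindower() {
        for (std::size_t i = 0; i < FFT_SIZE; ++i)  // Hann window
            window[i] = float(0.5 - 0.5 * std::cos(2.0 * kPi * i / FFT_SIZE));
    }

    // Called once per audio update; returns true when frame[] holds a full
    // 1024-sample windowed frame, i.e. every 4th update, with 50% overlap.
    bool update(const float* block) {
        std::size_t hi = FFT_SIZE / 2 + slot * BLOCK; // new block: second half
        std::size_t lo = slot * BLOCK;                // 4-updates-old block: first half
        for (std::size_t i = 0; i < BLOCK; ++i) {
            frame[hi + i]   = window[hi + i] * block[i];
            frame[lo + i]   = window[lo + i] * stored[slot][i];
            stored[slot][i] = block[i];               // reuse the old block's slot
        }
        bool last = (slot == HALF_BLOCKS - 1);
        slot = last ? 0 : slot + 1;
        if (last && primed) return true;              // run the FFT on frame[] here
        if (last) primed = true;
        return false;
    }
};
```

Feeding eight consecutive 128-sample blocks leaves frame[i] equal to window[i] times the corresponding input sample for the most recent 1024 samples, with only FFT_SIZE/2 raw samples ever stored.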

All the above changes with a change of AUDIO_BLOCK_SAMPLES, but that's now taken care of automatically at compile time - if you use 32-sample blocks, then 16 are stored and an FFT is computed every 16th update(). That wasn't the case before - the code just broke horribly due to being littered with magic numbers and assumptions that the value of AUDIO_BLOCK_SAMPLES would be 128.
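The compile-time sizing amounts to something like the following sketch (constant names are illustrative, not the library's):

```cpp
#include <cassert>

// Sketch: derive storage and scheduling from AUDIO_BLOCK_SAMPLES at
// compile time, instead of hard-coding 128 everywhere.
#ifndef AUDIO_BLOCK_SAMPLES
#define AUDIO_BLOCK_SAMPLES 32   // e.g. small blocks chosen for low latency
#endif

constexpr int FFT_SIZE = 1024;
static_assert(FFT_SIZE % AUDIO_BLOCK_SAMPLES == 0,
              "block size must divide the FFT size");
constexpr int BLOCKS_PER_FFT = FFT_SIZE / AUDIO_BLOCK_SAMPLES; // 32 here
constexpr int BLOCKS_STORED  = BLOCKS_PER_FFT / 2;             // 50% overlap: 16
// ...so with 32-sample blocks, 16 blocks are stored and an FFT is
// computed every 16th update(), with no magic numbers in the code.
```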
 
The FFT requires all the data up-front before it can start. But you don't have to do all the FFT computation in one go once you have that data: you could do a few stages at a time on each update, though that would increase latency. A 1024 point FFT has 10 stages, so doing 3 per update, for instance, would take 4 updates to finish instead of 1.
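To illustrate the stage-splitting idea (this is a toy double-precision radix-2 FFT, not the CMSIS-DSP API, which doesn't appear to expose individual stages), here's a transform whose butterfly stages can be run a few at a time across successive calls:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cf = std::complex<double>;
constexpr double kPi = 3.14159265358979323846;

// Toy radix-2 DIT FFT that can be paused between butterfly stages,
// so the work could be spread over several update() calls.
struct StagedFFT {
    std::vector<cf> a;
    std::size_t len = 2;            // butterfly span of the next stage to run

    explicit StagedFFT(std::vector<cf> data) : a(std::move(data)) {
        // bit-reversal permutation, done once up front
        std::size_t n = a.size();
        for (std::size_t i = 1, j = 0; i < n; ++i) {
            std::size_t bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) std::swap(a[i], a[j]);
        }
    }

    // Run up to `count` stages; returns true once the FFT is complete.
    bool run_stages(int count) {
        std::size_t n = a.size();
        for (; count > 0 && len <= n; len <<= 1, --count) {
            double ang = -2.0 * kPi / double(len);
            for (std::size_t i = 0; i < n; i += len)
                for (std::size_t j = 0; j < len / 2; ++j) {
                    cf w = std::polar(1.0, ang * double(j));
                    cf u = a[i + j], v = a[i + j + len / 2] * w;
                    a[i + j]           = u + v;
                    a[i + j + len / 2] = u - v;
                }
        }
        return len > n;
    }
};
```

A 16-point transform has 4 stages, so run_stages(3) finishes it in 2 calls; the same pattern at 1024 points (10 stages, 3 per call) gives the 4-update schedule mentioned above.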

One issue with the standard FFT classes is that they use 16 bit intermediate results, whereas for numerical precision you need more (guard bits). The result is that the signal-to-noise ratio of the FFT spectra is definitely worse than it could be.
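A toy model of the guard-bit problem (this is not the actual CMSIS q15 FFT code, just an illustration): each fixed-point radix-2 stage typically scales by 1/2 to avoid overflow and rounds back to the word size, so a 16-bit pipeline loses roughly one bit of precision per stage, while wider intermediates keep guard bits:

```cpp
#include <cassert>
#include <cmath>

// Push a value through `stages` scale-and-round steps, simulating the
// per-stage quantization of a fixed-point FFT with narrow or wide words.
double through_stages(double x, int stages, bool wide_intermediates) {
    double v = x;
    for (int s = 0; s < stages; ++s) {
        v *= 0.5;                                           // per-stage scaling
        double step = wide_intermediates ? 1.0 / (1 << 30)  // ~32-bit words
                                         : 1.0 / (1 << 14); // ~16-bit words
        v = std::round(v / step) * step;                    // quantize
    }
    return v * (1 << stages);                               // undo the scaling
}
```

For 10 stages (a 1024-point FFT) the 16-bit path's round-trip error comes out orders of magnitude larger than the wide path's, which is the S/N loss described above.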
 
Does the ARM library give API access to each of those individual stages? I had a quick look and didn’t spot anything, but could easily have missed it, there was a lot of terminology that went way over my head :)

To my mind an extra 12ms of latency wouldn’t be a big issue, but I know Paul’s hot on keeping backwards compatibility so it’d probably be a case of adding a couple of new FFT classes … at which point improvement to the S/N ratio would be on the table.

Thinking about it, it would possibly make sense to do only windowing into the buffer in the audio update(), and have the read-out methods do the FFT. It uses about 4% CPU every 12ms, and there must be many applications which don’t need a new set of values that often.
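A sketch of that idea, with a counter standing in for the expensive FFT pass and hypothetical names throughout: update() only windows data and raises a flag, and the transform runs at most once per new frame, and only if somebody actually reads the results.

```cpp
#include <array>
#include <cassert>

// Sketch: defer the FFT from the audio update() to the read-out call.
struct LazyFFT {
    std::array<float, 512> bins{};   // magnitude results
    int fft_runs = 0;                // stand-in for the ~4%-CPU FFT pass
    bool frame_ready  = false;       // a full windowed frame is buffered
    bool output_valid = false;       // bins[] match the latest frame?

    void onFrameComplete() {         // called from the audio update()
        frame_ready  = true;
        output_valid = false;        // previous results are now stale
    }
    float read(int bin) {
        if (frame_ready && !output_valid) {
            ++fft_runs;              // real code: run the FFT into bins[] here
            output_valid = true;
        }
        return bins[bin];
    }
};
```

A real version would also need to disable the audio interrupt around the buffer copy so update() can't land mid-read; the sketch skips that.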
 
@h4yn0nnym0u5e , I guess your approach is not good, and is destroying the spectrum.
For a 1024 point FFT you need 1024 points spanning 8 blocks of 128 samples. Of course, you can construct your own algorithm, process the 8 blocks individually and combine the frequencies. For this to work efficiently, the algorithm consists of a 1024 point FFT on 128 input samples, pruning the zero-valued data, that is, the samples that do not contribute to the calculation. Every 1024 pt FFT would then work on a different 128 samples. The results (frequencies) are then accumulated appropriately. Pruning algorithms for the first 128 samples are trivial and were published years and years ago. I have not seen algorithms that prune at both the beginning and end of the time series, but as the FFT assumes cyclic data, this may be possible.
 
I don't think I've destroyed anything, though I'm prepared to be proved wrong :) I've probably explained myself poorly, though...

The 1024 point FFT is done on 1024 points taken from 8 consecutive blocks, as before. However, what I've improved (I think) is to copy each block to the buffer as it is received, windowing it using the relevant section of the 1024-value window array. In the old code all the copying was done when the 8th block arrived, followed by another pass across the buffer to apply the window. The copy / window is not a huge burden compared to the FFT, of course - it's just a bunch of multiplications - so there's still a spike in CPU use when the FFT is done, it's just smaller by about 0.3% (4.15% down to 3.85%, but some of that is other audio processing).
 