Sustituting arm_rfft_q15() into the Audio Library FFT functions?

Status
Not open for further replies.

DerekR

Well-known member
Has anybody thought about (or already completed) substituting the arm_rfft_q15() real FFT function as the workhorse in the Audio Library FFT blocks? I have been using it in several projects recently, and it is SIGNIFICANTLY faster and halves the buffer storage required. It would be well worthwhile if it hasn't been done already.

If it hasn't, I'll take a pass at it...

What this function does is take a purely "real" input data sequence of length N, and returns a complex buffer (in the arm format of ... real/imag/real/imag...) also of total length N - but containing only the N/2 positive frequency lines (notice I refuse to say "bins" - they ARE NOT bins ;) - my pet peeve). The only funniness is that line 0 is not considered to be complex, it contains two "real" quantities: the dc term, and the Nyquist frequency term, but that is easily taken care of.

The routine relies on the fact that the two halves of the FFT of a real sequence are complex conjugates, and thus redundant.
 
(notice I refuse to say "bins" - they ARE NOT bins ;) - my pet peeve).
I take on the challenge of a rebuttal. IMO, the FFT provides a spectral density, and consequently it is legitimate to call these frequency bins.
increasing the FFT resolution consequently decreases the bin width an the contained spectral energy.
So I would be curious to read your arguments (even as PM). may be you are right and I'm wrong.
 
@WMXZ
Hehehe - completely off-topic :D, but I accept the challenge, we will meet at dawn tomorrow behind the barn and settle this thing once and for all! You've got this ancient academic on one of his favorite topics.

First, the units of the FFT are categorically not of spectral density! The units are simply amplitude. Sorry about that chief, but that is just the way it is - the DFT/FFT is simply a sum of numbers (amplitudes) multiplied by sines and cosines (which are dimensionless). As I point out below, however, the units of the continuous Fourier Transform are amplitude density...
The lines in the FFT have zero width - you couldn't squeeze anything into a bin there if you tried, and there is a complete nothingness between them. They are simply a set of complex numerical coefficients in a complex polynomial. Its like me taking a polynomial like
y = 3 +1x +5x^2 + 7x^3 -2x^4
and describing the set of coefficients that describe the polynomial, ie [3 1 5 7 -2], as "bins". In the DFT/FFT case the lines values are the numerical coefficients of terms of the form exp(-j.2pi.m.n/N). Same argument.

I have attached a handout from an old course I taught for many years, to provide a more mathematical basis:
View attachment SamplingDFT.pdf
and I will just outline my argument here:
1) We start in the continuous time domain, with the Fourier Series representation of periodic functions as a summation of discrete frequency (harmonically related) sinusoids, producing a line spectrum where each line has no width - it is purely a discrete spectrum, and nothing exists between the lines. Its units are also amplitude (again, as I will point out below, there is a very
strong relationship between the Fourier Series and the DFT/FFT)
2) The Fourier Transform is a spectral representation of a finite duration (aperiodic, or one-shot) continuous function f(t), and is derived by considering the behavior in the limit of the Fourier Series of a periodic-extension of f(t) as the period goes to infinity. In other words we assume that f(t) is in fact periodic, but the period is so long that we have not yet come to the end of the first period! The lines become more and more closely packed together, and in the limit the line spacing becomes zero and we end up with a continuous function of frequency.
The units of the Fourier Transform are of an "amplitude-density", such as volts/Hz.

2) Now we move to the discrete-time domain and the Discrete-Fourier-Transform DFT (of which the FFT is just an algorithmic implementation).
A sampled waveform is modeled mathematically by assuming that the continuous waveform f(t) has been multiplied by a periodic train of Dirac delta functions (which are infinitely narrow pulses of infinite amplitude, but with an area of unity, and integrated (see the attachment). This process eliminates ALL knowledge of the waveform f(t) except AT the sampling instants. Once we have the set of samples, the concept of continuous time doesn't exist any more, all we have is a bunch of numbers.
The DFTs (forward and inverse) are mathematical relationships between two sets of discrete complex numbers; continuous time and frequency do not enter into the transforms. But wait - there's more - there's an implicit assumption that each of the two data sets each are a single period of a PERIODIC pair of data sets, and when we say that {F_m} (M = 0...N)
F_m = sum(n = 0 to N-1) f_n exp(-j.2pi.n.m/N)
is the DFT of the sequence {f_n}, we are saying that it can be considered as a sum of complex exponentials - just as we imply when we use the Fourier Series representation of a periodic continuous f(t).
Indeed, another way of interpreting the DFT is as the Fourier Series of a periodic-extension of f(t). As I said at the outset the line spectrum of a Fourier Series has no interpretation of bins and/or line width. You must be very careful when you relate the DFT back and forth to the continuous world that (most of us) live in.

So where does the concept of "bins" in the DFT come from? Maybe somehow related to the concept of reconstruction of a bandlimited waveform between samples from its sample set using the sinc function (Whitakker or cardinal, see the attachment) interpolation formula?

My concept of "binning" is the grouping of nearby values into a common container, such as when constructing a histogram, or estimating a probalility-density-function (pdf). In my 40+ years of teaching this stuff I saw many, many outright errors and misunderstandings created by attempting to "bin" results from DFT analysis. I prefer to steer away from the term as far as I can!

See you out behind the barn at dawn tomorrow :rolleyes:
 
Last edited:
I'll do a direct timing comparison on the length 256 and 1024 versions and get back tomorrow...

BTW - the arm_rfft_q15() and arm_rfft_q31() in the more recent arm_math that I am using with Teensyduino have just a single version, and specify the length of the FFT in the init function.

I've made a start on making a new version of AudioAnalyzeFFT256...
 
Please also keep an eye on the flash memory usage. One of the reasons we've stayed on the old version is even the shorter lengths bringing in a huge lookup table that burns up too much flash. Hopefully that won't be an issue in 1.5.
 
Please also keep an eye on the flash memory usage. One of the reasons we've stayed on the old version is even the shorter lengths bringing in a huge lookup table that burns up too much flash. Hopefully that won't be an issue in 1.5.
I'll go back and check but I'm pretty sure they fixed the lookup tables flash sizes which result in significantly smaller tables than the current version. Plus with the new CMSIS-DSP I can speed even more my fft optimization because I can use two 512pt FFT instead of 4 256pt FFT's plus the stock routines are better optimized also.
 
I did the timing comparisons for N = 256, 512, and 1024, directly and the improvement seems to be nowhere as great as I recall from a couple of weeks ago.
T3.6 at 180MHz.
Here's the quickie sketch
Code:
// Quick sketch for FFT loop timing:
// DR      9/30/2017
//----------------------------------------------------------
#include "arm_math.h"

#define forward    0
#define bitReverse 1
#define fftLength  256  // <-- change value here.
//
arm_rfft_instance_q15 rfft_inst;
arm_cfft_radix4_instance_q15 cfft_inst;
int16_t rdata[fftLength], rfft_buffer[fftLength];
int16_t cdata[2*fftLength];

//---------------------------------------------------
void setup() {
   Serial.begin(9600);
   while(!Serial){}
   arm_rfft_init_q15(&rfft_inst, fftLength, forward, bitReverse);
   arm_cfft_radix4_init_q15(&cfft_inst, fftLength, forward, bitReverse);
}

//-----------------------------
void loop() {
  // Need to do this here because rfft destroys its input buffer...
  for (int i=0; i<fftLength; i++) {
   rdata[i]     = (int16_t)(20000 + 4096*cos(2*3.1415926*i/16) + 6788*sin(2*3.1415926*i/4));  // real
   cdata[2*i]   = (int16_t)(20000 + 4096*cos(2*3.1415926*i/16) + 6788*sin(2*3.1415926*i/4));  // complex
   cdata[2*i+1] = 0; 
  }
  uint32_t t0 = micros();
  arm_rfft_q15(&rfft_instance, rdata, rfft_buffer);
  Serial.print(micros() - t0);
  t0 = micros();
  arm_cfft_radix4_q15(&cfft_instance, cdata);
  Serial.print("\t"); Serial.println(micros() - t0);
}
Times are in usecs for a single pass through a forward FFT with BitReversal:
N = 256 rfft: 91 cfft: 102
N = 512 rfft: 165 cfft: 216 <-- may not be valid since 512 is not radix 4!
N = 1024 rfft: 396 cfft: 483

One thing I can't be sure about is whether I am now using the same version of arm_cfft_radix4_q15() that is supplied with Teensyduino, since that has long gone from my system.
Sorry if I created a false impression of the speed-up, I seemed to remember a much bigger improvement.
 
Hehehe - completely off-topic :D, but I accept the challenge, we will meet at dawn tomorrow behind the barn and settle this thing once and for all! You've got this ancient academic on one of his favorite topics.

....
I have attached a handout from an old course I taught for many years, to provide a more mathematical basis:
....

See you out behind the barn at dawn tomorrow :rolleyes:

[OT]Thanks for document, that is what I' was hoping for.
My reasoning was starting from the Fourier integrals having a delta_omega and that spectral lines do physically not occur (cannot wait ad infinitum) and that we have hardly strictly periodic functions but mostly band limited stochastic processes.
Anyhow, I will try to consolidate my views with your notes, and will try hard to find something I can disagree with; and then we will meet inside the barn for a drink (outside has too unpleasant ending) [/OT]
 
I did the timing comparisons for N = 256, 512, and 1024, directly and the improvement seems to be nowhere as great as I recall from a couple of weeks ago.
T3.6 at 180MHz.
Here's the quickie sketch
Times are in usecs for a single pass through a forward FFT with BitReversal:
N = 256 rfft: 91 cfft: 102
N = 512 rfft: 165 cfft: 216 <-- may not be valid since 512 is not radix 4!
N = 1024 rfft: 396 cfft: 483

One thing I can't be sure about is whether I am now using the same version of arm_cfft_radix4_q15() that is supplied with Teensyduino, since that has long gone from my system.
Sorry if I created a false impression of the speed-up, I seemed to remember a much bigger improvement.

IMO, it is very hard to directly compare rfft and cfft
rfft does first some radix 4 FFTs followed by a spectral unshuffling (effectively a Radix2)
That may also been the reason for limited numbers of rffts is older CMSIS DSP library.
So either FFT size is optimal for rfft of for cfft but never for both.

IMO, where the advantage comes from is that rfft does not waste processing, say take the audio library, which does 1 cfft every 256 samples (2 blocks), using rfft I would do a transform every 512 samples (4 blocks) doubling the resolution at about 2/3 of the time (cfft: 102 us for N=256 vs rfft: 165 us for N=512) so you gain spectral resolution and some time savings (51 us/block for cfft and 41 us/block for rfft)
 
I did the timing comparisons for N = 256, 512, and 1024, directly and the improvement seems to be nowhere as great as I recall from a couple of weeks ago.

One thing I can't be sure about is whether I am now using the same version of arm_cfft_radix4_q15() that is supplied with Teensyduino, since that has long gone from my system.
Sorry if I created a false impression of the speed-up, I seemed to remember a much bigger improvement.
Teensyduino's current version doesn't have 256pt or a 1024pt rfft, you must have updated your CMSIS-DSP, look at your arm_math.h file to see what version.
 
Now I recall where I got the idea as to how fast the latest rffts are. I have been using arm_rfft_fast_f32() in my new Audio block STFT based channel-vocoder (512 channels max) for electronic music synthesis - to be posted later this week. (A full phase-vocoder is in the works.)

I included the floating point version into the benchmark sketch I posted two nights ago and here are the results, all times in usecs (Teensy 3.6 @ 180MHz):

N = 256:
arm_rfft_q15: 91
arm_rfft_fast_f32: 105
arm_cfft_radix4_q15: 102
arm_cfft_radix2_q15: 148

N = 512:
arm_rfft_q15: 165
arm_rfft_fast_f32: 225
arm_cfft_radix2_q15: 311 <- Note: corrected from previous post by moving to radix-2

N = 1024:
arm_rfft_q15: 396
arm_rfft_fast_f32: 404
arm_cfft_radix4_q15: 483
arm_cfft_radix2_q15: 650

Conclusion: The f32 version is lightning fast for floating point - only marginally slower than the q15 rfft version, and about 25% faster than the old radix-4 (q15 arm_cfft_radix4_q15()). I'll be looking for reasons NOT to use floating-point in most of my FFT processing :D. (Haven't taken into account above the time needed for q15 <--> f32 conversion.)
 
Status
Not open for further replies.
Back
Top