FFT benchmarks on teensy3

Status
Not open for further replies.

jimlux

Active member
I ran some benchmarks on a teensy3 using a fairly simple fixed point FFT. The original routine used program memory for the sin table and a FIX_MPY routine to do scaled multiplies, but since the teensy has 32 and 16 bit ints and a lot more data memory, I modified it to just use the native multiply operation. This was done for *timing tests* only, so I've not checked the numerical accuracy.

   N      us
1024    8167
 512    3750
 256    1711
 128     771
  64     349
  32     154
  16      73

The last file below is a matlab/octave script to generate the FFT sine table with specified length and precision.
Test program to call FFT and time it
View attachment ffttest.ino
the sine table (1024 long, 32767 max value)
View attachment ffttab1024.h
the fft routine
View attachment fixfft.cpp
the .h to go with above
View attachment fixfft.h

matlab/octave to generate sine table
View attachment ffttab.m.txt
 
Yes, times in microseconds, for 1024 complex points. The time required grows a bit more slowly than N*log(N), probably because there's some fixed overhead that becomes a smaller part of the overall time.

There are other FFT libraries out there that might be faster (some mentioned on this forum), but I was looking for some generalized numbers for planning purposes, so it was easy to just code this up and run it.
 
There are some definite bugs in my test code: everything was declared as char, as opposed to int or long. I'm fixing it all now, and initial results are that the corrected version runs about 20% slower:
   N      us
1024   10109
 512    4665
 256    2131
 128     965
  64     433
  32     194
  16      84

Is there any info out there on the relative speed of multiplying two 16 bit ints to get a 32 bit long, versus multiplying long*long? I suppose this is why people use optimized libraries written in ASM: trusting the compiler to get it right is tough.

I'm going for "everything in a long" as a sort of bounding case, but still using a 16 bit sine table. I figure nobody is going to be doing 32k-point transforms without some smarter analysis, so the truncation error from the shorter sine table is tolerable. I can probably get rid of some of the on-the-fly scaling checks, too.
 
I know how long the processor takes, my question is more about "what instructions does the compiler emit"...

for instance, what's the relative speed of:

Code:
long ilong, qlong, rlong, slong;
int ishort, qshort, rshort, sshort;

ilong = qlong * slong;    // long * long
ilong = qlong * sshort;   // long * short
ilong = qshort * sshort;  // short * short

and what's the truncation behavior?

I'm just coding up all the alternatives and trying them. That's sort of the acid test, anyway.

For instance, the original routine used "if (x<0) x = -x; if (x>MAXVAL) { bump shift flag};" and it turns out that " if (abs(x)>MAXVAL) {bump shift flag};" is not only simpler to read, but runs faster.
 
You can see what instructions the compiler used by disassembling the .elf file. First, use Arduino's File > Preferences to turn on verbose info while compiling. Then you can see the compiler commands, which will show you the full pathname to the temporary directory where it's compiling the code. You need that to get the .elf file.

Then you can use "arm-none-eabi-objdump -d ffttest.cpp.elf > ffttest.txt" to turn the elf file into a disassembly that shows the instructions.

The arm-none-eabi-objdump program is located in hardware/tools/arm-none-eabi/bin, so you'll need to add that to your path or type the pathname to run it from the command line.


For details on truncation behavior, and pretty much every other detail of the instructions, you can look them up in this ARM reference manual. Chapter 4 has a nice summary and chapter 7 gives the exact details for all instructions.

http://www.pjrc.com/teensy/beta/DDI0403D_arm_architecture_v7m_reference_manual.pdf

The "DSP" instructions on pages 133-134 are meant for optimizing this sort of code, but they work on 16 bit signed integers packed two per 32 bit CPU register. You can't simply apply them to normal C code. The surrounding code needs to manipulate pairs of 16 bit numbers in 32 bit variables, which also doubles the speed of moving the data in and out of the CPU.
 
Thanks, Paul..

But I think I'm going to attack this by good old empiricism. There's only a few ways to code it (simply) in C, and all I want to do is know if there's something simple that makes it faster or slower (e.g. mixing ints and longs). Then I'll get my timing numbers.. If that's "good enough" then I'm done. If that's not, I'll probably look into the CMSIS-DSP libraries, instead..
 
The other excellent but expensive reference is this book:

http://www.amazon.com/Definitive-Guide-Cortex®-M3-Cortex®-M4-Processors-ebook/dp/B00G9856GU

It is loaded with a LOT of nice insights about making best use of the instructions, which doesn't seem to be available anywhere else.

But even this book lacks a table or definitive info that shows how many clock cycles each instruction really takes (with no wait states from the buses). Almost all instructions are single cycle, unless they access memory or cause a change in the program counter (which requires a pipeline refill). But some like the divide instruction apparently take 2 or more cycles.

Of course, when you throw in the flash latency, flash bandwidth (32 bit wide on Teensy 3.0, 64 bit wide on Teensy 3.1), flash cache, bus arbitration when the DMA controller or USB DMA accesses RAM, dual buses to the RAM, and maybe other complex factors, well, it gets quite tricky to really predict exactly how many cycles something will take. It's just so much more complex than a slow 8 bit CPU without multiple bus masters.
 
But I think I'm going to attack this by good old empiricism.

Yes, that's probably the best way. But you asked for the detailed info, so I'm doing the best I can to give it to you.

If that's not, I'll probably look into the CMSIS-DSP libraries, instead..

I'm sad to say the CMSIS-DSP fixed point FFT appears to be quite buggy. It has some numerical precision or rounding bug. It works for large signals, but smaller signals result in only zeros or very coarse resolution in the output bins.

In fact, this weekend I'm looking into why it doesn't work properly. I actually have an older copy of the code you posted, before the conversion to 8 bits, which I'm experimenting with as a reference.
 
Hmm, looks like the CMSIS version does indeed work. There's some sort of bug in how I'm using it in the audio library... which will be tomorrow's project!

Here's the speed comparison I'm seeing on all 3, for a 256 point FFT.

Code:
122       0       0       0  --        0      -1  --     0.00     0.00      0.00
123       0       0       0  --        0       0  --     0.01     0.00      0.01
124      49      -1      49  --       49       0  --    49.54    -0.02     49.54
125       0      -1       1  --       -1       0  --    -0.05    -0.00      0.05
126       0      -1       1  --        0       0  --     0.00     0.00      0.00
127       0      -1       1  --        0       0  --    -0.05     0.00      0.05
cmsis: 234 us
fixed: 1142 us
float: 14752 us

The code is attached.
 

Attachments

  • test_fix_fft2.zip
    9.5 KB
With more investigation this morning, it seems arm_cmplx_mag_q15() and probably arm_sqrt_q15() are the parts of the math library with the bug. The FFT works fine, with a tiny bit more noise due to less careful rounding, but that's pretty reasonable for a nearly 5X speedup!
 
I'm pretty sure arm_cfft_radix4_q15 is ok. If you try running the code from reply #12, it's pretty easy to see arm_cfft_radix4_q15, fix_fft and fft.Forward (double precision floating point) are all returning pretty much the same values. I did scale the float one by dividing by 256. The two fixed point ones seem to do that automatically.

It looks like I incorrectly applied arm_cmplx_mag_q15() in analyze_fft256.cpp. In hindsight (after a couple solid days of fiddling), it's becoming clear that it implements a fractional square root, but what's needed here is a scaled integer square root.

I believe I'll (finally) get analyze_fft256.cpp working properly later today.......
 
I fixed analyze_fft256.cpp yesterday.

Both arm_cfft_radix4_q15 and arm_cmplx_mag_q15 were working properly, but applying arm_cmplx_mag_q15 to the output of the FFT was not valid, since the FFT output is not in fractional format. Replacing it with a sum of squares and regular square root worked.
 