1024 point FFT, 30% more efficient (experimental)

Status
Not open for further replies.

kpc

Well-known member
I optimized the 1024 point fft a little.
In the processorUsageMax(), the percentage drops from 68% to a mere 12% (72 MHz teensy 3.1)
Most of this comes from distributing the processing power equally over all time-slots.

But also the FFT algorithm has changed. I could get it 30% more efficient.

This table presents the approximate number of cycles used in the current and the new version.

Code:
                   old     new    delta
copying data:      8288    4452   -46%
applying window:   13073   9270   -23%
fft:               98030   67812  -31%
calc. magnitude:   12652   11856   -6%

I think by unrolling loops and improving memory access, some 5000 cycles can still be gained.
I hope I caught all overflow problems due to bit-growth. If anybody wants to try, i would like to hear any of these problems.

The file can just replace the existing analyze_fft1024.cpp/h in the Audio library
In the header file only two changes were made, 2 more audio_block_t's and the reduction of the cmsis fft from 1024 to 256.

In the original file Paul(?) mentions a TODO
// TODO: support averaging multiple copies
I have some ideas to implement this, it might only take about 4000 cycles and has a variable length averaging. But I first need to do more testing on the core of the FFT.

EDIT: for the latest code see post: https://forum.pjrc.com/threads/27905-1024-point-FFT-30-more-efficient-%28experimental%29?p=66678&viewfull=1#post66678
 

Attachments

  • fasterfft1024.zip
    5 KB · Views: 319
Last edited:
Nice, I've been looking at using this and will look at it as soon as Beta-7 is running here. Will this change apply equally to all the FFT variants?

In looking at the 256 .vs. 1024 and the 256 uses 12% instead of 52% as I saw it - but somehow it takes twice as long? I assumed the 256 would take less time to return samples? I was seeing 12ms between samples on 1024 and 22ms on 256 - is the algorithm that different somehow or taking more samples?
 
Last edited:
@KPC - I can't get a compile to complete using these files on Win7x64 with new 1.21-beta7. Under 50% it just stops as quoted below. I have reverted the two files and in about 10 seconds it completes with no warning

If it helps here is my FFT sketch working with factory libs. It is the
"C:\Users\tim\Documents\Arduino\hardware\teensy\avr\libraries\Audio\examples\Analysis\FFT\FFT.ino"
example with perf data and modified looping changing the sinewave data to watch the buckets move: View attachment T_FFT_SINEWAVE_edit.ino

C:\Users\mobil_000\Documents\Arduino/hardware/tools/arm/bin/arm-none-eabi-gcc -c -g -Os -Wall -ffunction-sections -fdata-sections -MMD -nostdlib -mthumb -mcpu=cortex-m4 -D__MK20DX256__ -DTEENSYDUINO=121 -DARDUINO=10600 -DF_CPU=96000000 -DARDUINO_ARCH_AVR -DUSB_SERIAL -DLAYOUT_US_ENGLISH -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\cores\teensy3 -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Wire -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\SPI -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\SD -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\utility C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\data_windows.c -o C:\Users\MOBIL_~1\AppData\Local\Temp\build1765179341746263744.tmp\Audio\data_windows.c.o
C:\Users\mobil_000\Documents\Arduino/hardware/tools/arm/bin/arm-none-eabi-g++ -c -g -Os -Wall -ffunction-sections -fdata-sections -MMD -nostdlib -fno-exceptions -felide-constructors -std=gnu++0x -fno-rtti -mthumb -mcpu=cortex-m4 -D__MK20DX256__ -DTEENSYDUINO=121 -DARDUINO=10600 -DF_CPU=96000000 -DARDUINO_ARCH_AVR -DUSB_SERIAL -DLAYOUT_US_ENGLISH -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\cores\teensy3 -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Wire -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\SPI -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\SD -IC:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\utility C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\analyze_fft1024.cpp -o C:\Users\MOBIL_~1\AppData\Local\Temp\build1765179341746263744.tmp\Audio\analyze_fft1024.cpp.o
In file included from C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\cores\teensy3/WProgram.h:15:0,
from C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\cores\teensy3/Arduino.h:1,
from C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\cores\teensy3/AudioStream.h:34,
from C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\analyze_fft1024.h:30,
from C:\Users\mobil_000\Documents\Arduino\hardware\teensy\avr\libraries\Audio\analyze_fft1024.cpp:4:


BEFORE: fft1024=51,52 all=53.02,53.64 Memory: 4,8 Ready:1440 Not:5742280 # Samples:120 Sample Time:1393 Time per sample:11
Sketch uses 92,808 bytes (35%) of program storage space. Maximum is 262,144 bytes.
Global variables use 13,428 bytes (20%) of dynamic memory, leaving 52,108 bytes for local variables. Maximum is 65,536 bytes

AFTER: ... see above for now
 
Last edited:
Sorry, have not tried the new arduino yet. Could you list the real error. Your respons stops just before the real error appears. On 1.06 it your .ino compiles without a problem.
As to your first question: The fft256 returns an fft for every 256 samples. The fft1024 returns an fft every 512 samples. The latter uses half overlapping samples to compute the spectrum (which is sound, considering you throw away roughly half the information due to windowing).
Furthermore, my implementation uses just a little bit of pipelining. This causes the latency between the input and the output to be approx 1280 samples, instead of approx. 1088 samples.
My strategy of using four smaller fft's to calculate the result of the complete fft, might also transfer to 256, there you split it in four fft64 blocks. I might have a look at this, but not before I am certain the fft1024 works without problems.

I am just thinking: the preprocessor/compiler might choke on the
Code:
uint32_t twiddle_1024[] __attribute__((aligned(4))) =
                { TW1024_LOOP512(0) };
Could you replace it by
Code:
uint32_t twiddle_1024[512] __attribute__((aligned(4))) = { 0 };
And remove the preprocessor stuff, just to check it compiles?

Edit. The latency mentioned above is for streaming conditions. If you stop restart the fft every time, the latency is much worse, then the original still has approx 1088 samples latency. My fft then has approximately 1680 samples initial latency. This is due to the fact of equilizing the processing power per time slot. By reorginizing the block and calculate everything on the 8-th slot, like in the original one, you still get the 30% improvement of the fft, but not the improvement on max slot usage.
 
Last edited:
I think by unrolling loops and improving memory access, some 5000 cycles can still be gained.
I hope I caught all overflow problems due to bit-growth. If anybody wants to try, i would like to hear any of these problems.

you may remove one ROR in "untangle_256" by using SHSAX instead of SHASX

Test code is
Code:
#include "arm_math.h"

// separate two spectra
// Fn = (Hn + conj(HN-n)/2
// Gn = -i(Hn - conj(HN-n))/2
//
// Z1 = (X + conj(Y)/2
// z1r = (xr+yr)/2
// z1i = (xi-yi)/2
//
// Z2 = -i(Z - conj(Y))/2
// z2r = (xi+yi)/2
// z2i = -(xr-yr)/2 = (yr-xr)/2

//T3.1 is little endian [0],[1] = [r],[i] = [bottom,top]
//
void setup()
{
  while(!Serial);
  short xx[2]; xx[0]=4; xx[1]=8;
  short yy[2]; yy[0]=2; yy[1]=4;
  short z1[2];
  short z2[2];

  // direct method
  z1[0]=(xx[0]+yy[0])/2;
  z1[1]=(xx[1]-yy[1])/2;
  
  z2[0]=(xx[1]+yy[1])/2;
  z2[1]=(yy[0]-xx[0])/2;

  Serial.printf("%d %d, %d %d: %d %d, %d %d\n",
  xx[0],xx[1],yy[0],yy[1],z1[0],z1[1],z2[0],z2[1]);
  
// using dsp

int rn= *(int*)xx;
int rm= *(int*)yy;

int *r1 = (int*)z1;
int *r2 = (int*)z2;

int rm1=__ROR(rm,16);
*r1=__SHSAX(rn,rm1);
*r2=__SHSAX(rm1,rn);

Serial.printf("%d %d, %d %d: %d %d, %d %d\n",
  xx[0],xx[1],yy[0],yy[1],z1[0],z1[1],z2[0],z2[1]);
  
}
void loop()
{}
 
By the way it seems there were changes to those files in the latest 1.21-Beta7 release. I had one open on this machine in an editor and it told me it changed. No idea what - or how much but you are downstream of something.

I'll try again - I let the IDE sit a bit. That is where the text stopped before I killed it - It didn't error out, just printed those 'preceding' lines that I didn't see on the 'factory' version compile.

I'm on diff machine now win7 not win8.1 - I'll swap the files and make the indicated change. BTW I'm on a Teensy 3.1.
 
@wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.
 
@wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.

I only compared it with my own code. However, I only do a 256 point fft on 2 data streams.
 
Ever tried or thought about doing a radix 4 FFT step instead of two radix 2 steps from 256 to 1024?
After all, it ends up with a mixed radix FFT.
 
COMPILES and RUNS : I swapped the Twiddle code in cpp, I didn't do any other edits. [did a fresh download and code swap on diff machine]

BEFORE: fft1024=51,52 all=53.02,53.64 Memory: 4,8 Ready:1440 Not:5742280 # Samples:120 Sample Time:1393 Time per sample:11
Sketch uses 92,808 bytes (35%) of program storage space. Maximum is 262,144 bytes.
Global variables use 13,428 bytes (20%) of dynamic memory, leaving 52,108 bytes for local variables. Maximum is 65,536 bytes

AFTER: fft1024=9,10 all=10.73,12.17 Memory: 6,10 Ready:1440 Not:6134857 # Samples:120 Sample Time:1393 Time per sample:11
Sketch uses 93,464 bytes (35%) of program storage space. Maximum is 262,144 bytes.
Global variables use 15,492 bytes (23%) of dynamic memory, leaving 50,044 bytes for local variables. Maximum is 65,536 bytes.

I can't attest to the FFT Bucket accuracy - though I can attach something of a bucket graph shortly.

But to make notes on my previously existing 'stats' lines as I read it below - summary is I made 6.8% more "idle" passes through the loop() over the same time collecting the sample SINEWAVE data. The CPU usage may have dropped to 20% - but the spare time I got looping raised to 106.8% by my measure, so I assume somehow extra MCU idle waiting got pushed outside the FFT counts that didn't show up in loops?:

fft1024=9,10 :: FFT1024 CPU usage dropped from peak 52 down to 10
all=10.73,12.17 :: Overall CPU usage way down with slight change in spread - some extra work incurred outside FFT under 1%
Memory: 6,10 :: Increased memory usage, not sure how big these blocks are?
Ready:1440 Not:6134857 :: Count of loop() passes where new sample was Ready or Not, less overhead gave 7% more passes through loop()
# Samples:120 :: Each of my 'cycles' takes 120 FFT samples and displays it, so the 1440 count above means this sample was the 10th pass
Sample Time:1393 :: Total millis() time looping waiting for these 120 samples to complete
Time per sample:11 :: 11 milliseconds per loop seems to be a constant based on the wait for FFT data collection I presume.
 
@wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.

@kpc: I have not checked this in your code, but I recall from other real to complex FFT's that the spectrum array has to be 1 complex sample longer than the input array (to store the Nyquist point). Did you consider this, or in the end simply drop the highest frequency?
 
Here is output from my sample runs - using same code provided above. The same SINEWAVE values are being processed, but the results are unique. This is a portion of the steps from 100Hz sinewave in 100Hz increments up to 4,000Hz. The left is KPC result and the right is 'factory' PJRC as it was in the 1.21-b7 release. The numbers displayed are the 100ths values of the numbers returned and they should on average add to 98 {sinewave.amplitude(.98);}. The looping makes three hits at each of 40 steps {sinewave.frequency(Freq);}, thus 120 Samples before repeating. {Image is a snip of ~6 changes}

Results from SINEWAVE input and myFFT.windowFunction(AudioWindowHanning1024);
KPCvsPJRC.png
 
Last edited:
@wnmxz: My main focus was for a streaming FFT. I specifically used radix-2 to give me more freedom to spread the processing over each state. This also easily allows skipping the negative frequencies in the final step. The streaming application also adds some more overhead, then when I would when implementing it as an off-line FFT.
I just dropped the Nyquist frequency, because any frequency contents here should have been removed by an anti-aliasing filter anyway. If no filter is present, the highest frequency bins will also be ruined by the aliasing anyway.
For the cases where the Nyquist frequency is important, like convolution, you need to keep the negative frequencies as well.
Dropping the Nyquist point also allows me to use the same fft-buffer in the pipelining. Although the specific point can easily be calculated without FFT. Just like DC is proportional to the sum of all samples, the Nyquist point is proportional to the sum of the difference of each pair of samples (s[0]-s[1]+s[2]-s[3]...)
As said, I do not see any good reason for maintaining the Nyquist point if you are throwing away the negative frequencies anyway, or am I missing an obvious application?
 
@defragster. The magnitudes are really different. When I check, the scaling of both is exactly the same. The error between both is only a few LSB due to bit-truncation.
Maybe check the twiddle factors. the first should be 0x00007fff (IIRC)

Both the original and my fft code give the same response to your code. Peak values around 48
 
Last edited:
@kpc: Indeed I was going to say your results are diminished for sure. Each transition to start the test shows one or two zero lines from the 4000Hz transition to 100Hz, where the factory code even has '1' data on the unshown right edge in bins 72-80. And new FFT doesn't get near 98 (.98 with no decimals) but 23 on average so some loss/filtering is adding up. Odd that CPU load dropped 400% and only get 7% more free passes in the otherwise empty loop()
[ if (myFFT.available()) { ... } else NoFFT += 1; ]

"twiddle factors"? :: Everything is the default in the code above and the library, I'd need more specifics to adjust anything. I'll be back online in 12hrs if I can make any changes, but FFT example code as adjusted code looks like a good testbed if the sinewave is the right signal and this the right FFT. My next step was going to be repeating this code and putting in EPROM which of the 11 FFT's to setup() for on boot, since it can't be changed after that.

It was about 30 years ago last I looked at FFT's, and that was running pre-compiled code on a VAX on image data as I learned C. I just need to 'see' a couple of sounds from a mic as the same Teensy 'watches' what somebody is doing and where he is as he runs around with this and a 9DOF on his wrist piping events by radio back to a stationary Teensy. I might even be to just go with sound intensity for my main event - but the pre-cursor will have some voice snippets and then a unique tone played for .3 seconds and I could really shut the FFT down after that so I can focus on the 9DOF. That's why I was hoping the 256 window FFT would be faster than the 1024. I still don't know why it takes twice as long to sample?
 
The penalty of my function is a higher latency. Maybe your code is getting out of sync due to the increased latency.
As to the twiddle factors, you now say
Everything is the default in the code above and the library.
Whereas before you said
I swapped the Twiddle code in cpp, I didn't do any other edits.
Thats why I was affraid of them.
Sorry I still don't get what you are trying to say with
I still don't know why it takes twice as long to sample?
 
For the 'old' behaviour without the pipelining, buth with the 30% enhancement, try this instead

Code:
void AudioAnalyzeFFT1024::update(void)
{
        audio_block_t *block;

        block = receiveReadOnly();
        if (!block) return;
        blocklist[4 + (state & 3)] = block;

        switch(state) {
        case 0: // startup states, just waiting for data
        case 1:
        case 2:
                break;
        case 3: // final statup state
                for (unsigned i = 0; i < 4; i++)
                        blocklist[i] = blocklist[i + 4];
                break;
        case 4:
        case 5:
        case 6:
                break;
        case 7:
                prepare_256(blocklist, buffer, 0);
                prepare_256(blocklist, buffer + 1024, 1);
                if (window) {
                        apply_window_256(buffer, window, 0);
                        apply_window_256(buffer + 1024, window, 1);
                }
                arm_cfft_radix4_q15(&fft_inst, buffer);
                arm_cfft_radix4_q15(&fft_inst, buffer + 1024);
                untangle_256(buffer);
                untangle_256(buffer + 1024);
                radix2_512(buffer);
                radix2_512(buffer + 1024);
                radix2_1024(buffer);
                calc_magnitude_512(buffer + 1024, output);
                outputflag = true;
                for (unsigned i = 0; i < 4; i++) {
                        release(blocklist[i]);
                        blocklist[i] = blocklist[i + 4];
                }
                state = 4 - 1; // -1 because of state++
                break;
        }
        state++;
}
 
Indeed added latency - outside the FFT cycle count region. Not sure how to account for where it is, perhaps I could throw in an interval timer task to both cases and count the work it can do. I'll do that perhaps for a different measure of overall system throughput.

Sorry for the confusion -
'Your code with twiddle change as needed to compile' - versus - the unchanged PJRC code and the basically unchanged (except for instrumentation edits) FFT sample.

Not related to your code but the source sample before I saw :
My naïve assumption was FFT#256 would run faster than FFT#1024. Indeed the fft@256 CPU hit is 12%, but the overall run time for my 120 Sample loop is Double at 2785 for #256 versus 1393 for #1024 milliseconds.

[edit] - I'll swap the AudioAnalyzeFFT1024::update(void) code above in coming hours and report.
 
Last edited:
kpc: Here twiddle code as used - when I swap the comment line the compile just hangs over a minute.
//uint32_t twiddle_1024[] __attribute__((aligned(4))) = { TW1024_LOOP512(0) };
uint32_t twiddle_1024[512] __attribute__((aligned(4))) = { 0 };

fft1024=36,37 all=37.83,38.55 Memory: 4,8 Ready:1440 Not:6128811 # Samples:120 Sample Time:1393 Time per sample:11

I swapped in the above replacement for .update() and the freq change points are now non-zero. Overall per sample magnitude (#23) is unchanged, FFT CPU use jumped from 10 up to 37, and my loop() entry measure of efficiency improvement is nearly unchanged at 1.067.
Not:6128811 / Not:5742280 versus Not:6134857 / Not:5742280

So somehow the loop rollups were causing loss on the transition edges, and doing their work hidden from the cycle counts attributed to the FFT code. But are not behind the loss of magnitude 23 .vs. 98.

Next time I'll change to sum the magnitudes on each line, so I can start each FFT output line with that.

KPC_Unrolled.PNG
 
O, I see, I am sorry, setting the twiddle to zero was just to check the cause of the compilation error. I was never intended as a solutiion.
Put this in the setup to precalculate the twiddle factors
Code:
for (int i = 0; i < 512; i++)
  twiddle_1024[i] = (uint32_t) ( \
  (((int16_t) min(32767,max(-32767,round(32768*sin(-M_PI/512*(i)))))) << 16) | \
  (((int16_t) min(32767,max(-32767,round(32768*cos(-M_PI/512*(i))))))& 0xffff));
 
In short: Yes
Longer: Wherever, as long as it is set before you rely on the results of the fft.

Remove the backslashes at the and of the line, these were copied from the define
You might also need to add this:
extern uint32_t twiddle_1024[512]
so that setup can find the array
 
Last edited:
Okay I had done that and added the extern uint32_t twiddle_1024[512];

gives error: 'twiddle' was not declared in this scope

Need to make global in cpp source - in header I suppose. I'm rushing to something else . . .
 
IT LOOKS GOOD ! [Resolved - somehow my pasted twiddle_1024 was just 'twiddle' - as it says in the error above.]

The First still rolled loops has the right magnitude and also stabilizes on the two lines between freq changes like the PJRC code so it passes a sanity check now. Though turning the UNROLL back on not only compromised the data - but it actually gave me 4,500 fewer passes through loop() after 1440 samples! So this code is just causing unmeasured 'latency' - and consuming more RAM buffers in the process!

KPC1024=37,37 all=38.74,39.35 Memory: 4,8 Ready:1440 Not:6119463 # Samples:120 Sample Time:1393 Time per sample:11

KPC_Twiddled.PNG


and turning back on the rolled up update() loops:
KPC1024=9,11 all=10.64,12.66 Memory: 6,10 Ready:1440 Not:6114810 # Samples:120 Sample Time:1392 Time per sample:11
Fixed Twiddle - Unrolled Loops :: still has missing freq transition lines - but maybe you can fix that?

KPC_TwiddledUnrolled.PNG
 
Last edited:
I checked again and there might be some problem with the timing between your code and my fft (apart from the latency). If I move this part
Code:
                calc_magnitude_512(buffer + 1024, output);
                if (state >= 12)
                        outputflag = true;
from case 13 to the end of case 12, it seems to give better results. I cannot understand why.
 
Status
Not open for further replies.
Back
Top