Optimized 256 point FFT: 47% more efficient

Status
Not open for further replies.

kpc

Well-known member
This is a new thread about an optimized version of the FFT, in this case the 256 point FFT.
I decided not to continue the old thread about the 1024 point FFT, since much of that discussion was oriented towards general FFT usage and related items, and not necessarily towards the optimized version.

I did not test all possible averaging values extensively. If anybody wants to test, please report any errors.
Anyway, here are the results for the 256 point FFT:
Code:
// all numbers for averaging of 8.
// All cycles are approximate figures based on the cycle counter.
// other interrupt/dma sources have not been explicitly disabled while testing.
//
//                      old             optimized       delta
// copying data          11432            8688          -24%
// applying window       22816           11472          -50%
// fft                  180496           90248          -50%
// integration+output    17506           11080          -36%
//
// total                232250          121488          -47%
//
// max cpu @ 72 MHz     14%             10%             -29% or -4%point
The code can be optimized further; I could not get the compiler to generate the code I wanted. By inserting assembly code, I could easily save another 2400 cycles.
I mentioned earlier the idea of more balanced CPU usage over the cycles. I eventually abandoned that idea, since the code became very unreadable (yes, even more than my code in the attachment). The CPU usage for 8-time averaging alternates between 4% and 10% across cycles.

As to what I did:
copying data: writing two values at the same time; this saved comparatively little.
applying window: again, applying the window to two values at once.
fft: by using one complex FFT for two real FFTs simultaneously, I save 50%.
integration: no need to disentangle the FFT spectra in most cases.
output: replaced the division by a multiplication.
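The fft step above can be sketched as follows. This is a minimal host-side illustration of the standard two-real-FFTs-in-one-complex-FFT trick, using a naive DFT in place of the CMSIS routines; all names here are mine, not from the attached code:

```cpp
// Pack signal x into the real parts and y into the imaginary parts,
// run ONE complex transform, then untangle with the symmetry
//   X[k] = (Z[k] + conj(Z[N-k])) / 2,   Y[k] = (Z[k] - conj(Z[N-k])) / (2i)
#include <cassert>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// naive O(N^2) DFT, for demonstration only (the library uses CMSIS)
static std::vector<cd> dft(const std::vector<cd>& in) {
    const double PI = 3.141592653589793;
    size_t n = in.size();
    std::vector<cd> out(n);
    for (size_t k = 0; k < n; k++) {
        cd s = 0;
        for (size_t t = 0; t < n; t++)
            s += in[t] * std::polar(1.0, -2.0 * PI * k * t / n);
        out[k] = s;
    }
    return out;
}

// transform two real signals x, y with a single complex DFT
void two_real_ffts(const std::vector<double>& x, const std::vector<double>& y,
                   std::vector<cd>& X, std::vector<cd>& Y) {
    size_t n = x.size();
    std::vector<cd> z(n);
    for (size_t t = 0; t < n; t++) z[t] = cd(x[t], y[t]);  // pack
    std::vector<cd> Z = dft(z);                            // one complex FFT
    X.resize(n); Y.resize(n);
    for (size_t k = 0; k < n; k++) {                       // untangle
        cd zc = std::conj(Z[(n - k) % n]);
        X[k] = 0.5 * (Z[k] + zc);
        Y[k] = cd(0.0, -0.5) * (Z[k] - zc);
    }
}
```

Presumably this same symmetry is also why the integration step can often skip the disentangling: when only averaged magnitudes are needed, the powers of Z[k] and Z[N-k] together already carry the combined power of both frames.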

Defragster mentioned in this post that he could not verify the improvement. The following example shows, in my eyes, a reasonable way to verify it. As said, the CPU usage alternates between 4% and 10%, so 7% on average. Since the original FFT was 14%, this matches the 50% reduction in cycles nicely. The free CPU time on average therefore goes from 86% to 93%, an 8% increase. Running the following example, the counter reaches 178070 for the old version and 191350 for the optimized version (@72 MHz). So based on the counter I see a 7.5% increase, which matches the predicted value nicely.
Code:
#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <SD.h>

AudioInputI2S i2sin;
AudioAnalyzeFFT256 fft;
AudioConnection pc1(i2sin, fft);

void setup() {
  Serial.begin(9600); // start USB serial (baud rate is ignored on Teensy)
  AudioMemory(12);
  fft.averageTogether(8);
}

void loop() {
  uint32_t cntr = 0;
  // make sure we start at a proper point 
  while (!fft.available());
  // just keep counting until next fft data available
  while (!fft.available())
    cntr++;
  // finally output data
  Serial.println(cntr);
}

Finally, the code is attached. It is a drop-in replacement for the analyze_fft256.c/.h files.

Edit: For the latest code, see this post
 

Attachments

  • faster_fft256.zip
    5.8 KB · Views: 128
Swapped the PJRC files - compiles and runs.

Now I just need to redirect the signal to my microphone to look at the data - is this the right start? Last time I'm not sure I got there:

AudioInputAnalog adc1(A3); // Pin 17
AudioAnalyzeFFT256 fft;
AudioConnection patchCord1(adc1, fft);

instead of:
AudioInputI2S i2sin;
AudioAnalyzeFFT256 fft;
AudioConnection pc1(i2sin, fft);
 
kpc: I was pointing out that I didn't have a metric showing the improvement you expected - there was a gap between my simple measure and what the audio CPU% increase suggested.

For some reason, when I ran this code before swapping in your files, it seemed to NEVER complete - so I can't see a before & after value set.
 
Oh, sorry, this is due to a bug in the original code. In the original header file, outputflag should have been declared volatile, so please change it.
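For illustration, here is a minimal host-side sketch of the pattern and the fix (the helper names are hypothetical; only outputflag comes from the library):

```cpp
// A flag written from interrupt context and polled from the main loop must
// be volatile; otherwise the compiler may cache it in a register, and a
// polling loop like "while (!fft.available());" never sees the update.
#include <cassert>

volatile bool outputflag = false;   // the corrected declaration

// stands in for the audio update() interrupt setting the flag
void fake_update_isr() { outputflag = true; }

// mirrors the read-and-clear behavior of available()
bool available() {
    if (outputflag) { outputflag = false; return true; }
    return false;
}
```

On the host this compiles and behaves the same either way; the difference only shows up when the write really happens in an interrupt behind the optimizer's back, which is exactly the never-completing loop seen above.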

Edit: I also overlooked the obvious approach for spreading the fft over time. I will supply code later that presents an alternating 6%/8% load over the cycles instead of 4%/10%. This will also reduce latency by one cycle. But fundamentally it is exactly the same implementation, just spread somewhat differently over time.
 
Thanks. I think not, but: might that missing volatile on 'bool outputflag;' in the PJRC code explain the troubles I saw in the bins when I was looking at FFT_256 before? Maybe instead of not showing done, it would show done too early?

Made that edit to PJRC code and at 72MHz I get : 178620 and at 96MHz I get: 246381 (not OPT: 218534)

with kpc code : at 72MHz I get :191367 and at 96MHz I get 260233 (not OPT: 231029)

At 72MHz that looks like 107.1%; at 96MHz, 105.6% (not OPT: 105.7%). These numbers match the ~6% I saw (at 96MHz) in the FFT_1024 from my loop() entry count; the oddity was the CPU reporting it was taking over 4x less CPU%. I haven't added that code yet.

I did a for() to collect groups of 50 before printing, to minimize the scrolling, and tracked the overall sample time for averageTogether(8): I get 46.421 millis().
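For what it's worth, that 46.421 ms matches the expected output period. This is my arithmetic, assuming the standard ~44117 Hz Teensy audio rate and non-overlapping 256-sample frames:

```cpp
// One averaged output spans avg frames of fft_len fresh samples each:
//   period = avg * fft_len / fs
// With avg = 8, fft_len = 256, fs = 44117 Hz this gives ~46.42 ms,
// in line with the measured 46.421 millis().
#include <cassert>
#include <cmath>

double output_period_ms(int avg, int fft_len, double fs) {
    return 1000.0 * avg * fft_len / fs;
}
```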

I haven't busted open the bins yet to see the output. Please confirm: the first bin "0" I read is the DC component, and not a SAMP_FREQ/256 bin? And bins #1-127 represent the usable region?

This still HANGS between println's, https://forum.pjrc.com/threads/27736-variable-I2S-master-clock-(and-sampling)-frequency?p=68487&viewfull=1#post68487
Code:
  Serial.println("HELLO");
  PDB0_MOD = (F_BUS / 8000.0) - 1;           // PDB period for an 8 kHz trigger
  PDB0_SC |= PDB_SC_LDOK | PDB_SC_LDMOD(1);  // latch the new MOD value
  Serial.println("HELLO 1");
 
Yes, the first bin is the DC component.
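So, assuming the usual bin spacing of Fs/256, bin k is centered at k * Fs / 256 (my helper, not a library function):

```cpp
// Bin center frequencies for a 256-point FFT: bin 0 is DC, bins 1..127
// cover up to just below Nyquist. fs defaults to the standard Teensy
// audio rate (~44117 Hz); pass 8000.0 if the sample clock is changed.
#include <cassert>
#include <cmath>

double bin_center_hz(int k, double fs = 44117.0) {
    return k * fs / 256.0;
}
```

At the default rate each bin is about 172 Hz wide; at an 8 kHz sample clock that shrinks to 31.25 Hz, which is the resolution gain mentioned below.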

As for the clock modification, maybe the timing kills something in the serial? I merely checked on an oscilloscope whether I could modify the frequency behavior.
As a minimal example that doesn't work, is the contents of loop() just the 4 lines you gave?
I will check this later; maybe continue the discussion about it in the other thread.
 
Thanks. The lines shown are within setup(); I will watch for feedback here or there. With only 256 bins, 8kHz samples should give the needed resolution. Tomorrow I'll dump the bins to TFT lines, and then try to get my analog mic in place.
 
OK, I managed to spread the load of the fft over all audio block cycles, at the cost of some efficiency. Just like the 1024 point fft, I leave the majority of the fft to the CMSIS function and replace only the last radix-4 stage by two radix-2 stages.
With an averaging of 8, the number of cycles spent is 43% lower than for the original function. But the max CPU usage has also dropped by 43%. Both figures being equal shows that the workload is well balanced. For other averaging numbers this balance also stays approximately the same.
Also fixed a small indexing bug when using an odd number of averages.
Code:
// all numbers for averaging of 8.
// All cycles are approximate figures based on the cycle counter.
// other interrupt/dma sources have not been explicitly disabled while testing.
//
//                      old             optimized       delta
// copying data          11432            9580          -16%
// applying window       22816           12484          -45%
// fft                  180496           98632          -45%
// integration+output    17506           11370          -35%
//
// total                232250          132066          -43%
//
// max cpu @ 72 MHz     14%             8%              -43% or -6%point
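The load-spreading described above can be sketched as a simple per-block state machine. This is a hedged structural illustration only (the names and the slicing into four steps are mine, not taken from the attached code):

```cpp
// Spread a large transform over successive audio update() interrupts:
// keep a step counter and do one slice of the work per 128-sample block,
// so the per-interrupt load stays near the average instead of spiking
// in the block where the whole FFT used to run.
#include <cassert>

struct SpreadFFT {
    int step = 0;
    bool output_ready = false;

    void update() {                 // called once per audio block
        switch (step) {
        case 0: /* copy + window first half of input */      break;
        case 1: /* copy + window second half,
                   run the bulk of the CMSIS FFT */          break;
        case 2: /* first radix-2 pass */                     break;
        case 3: /* second radix-2 pass, magnitudes, flag */
                output_ready = true;                         break;
        }
        step = (step + 1) % 4;      // advance to the next slice
    }
};
```

The trade-off matches the tables: total cycles rise slightly versus the 4%/10% version (132066 vs 121488), but the peak per-block load drops from 10% to 8%.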
 

Attachments

  • faster_fft256_2.zip
    7.2 KB · Views: 129
Awesome - exactly what I think I need! Got it copied; hope to get back to it in the next 24hr. Getting the ADC lib read timing figured out to do the PlayQueue. Lower CPU demand is welcome.

Looking for an 8k sample rate - with the ADC I can control oversampling/averaging and still stay with LOW_SPEED, which I assume ends up with better accuracy. And storing each sample myself will allow monitoring input peak/duration and exact timing before the FFT even starts, as my most critical sound will be over full scale.
 