Forum Rule: Always post complete source code & details to reproduce any issue!
Page 1 of 3 1 2 3 LastLast
Results 1 to 25 of 67

Thread: 1024 point FFT, 30% more efficient (experimental)

Hybrid View

  1. #1
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285

    1024 point FFT, 30% more efficient (experimental)

    I optimized the 1024 point fft a little.
    In the processorUsageMax(), the percentage drops from 68% to a mere 12% (72 MHz teensy 3.1)
    Most of this comes from distributing the processing power equally over all time-slots.

    But also the FFT algorithm has changed. I could get it 30% more efficient.

    This table presents the approximate number of cycles used in the current and the new version.

    Code:
                       old     new    delta
    copying data:      8288    4452   -46%
    applying window:   13073   9270   -23%
    fft:               98030   67812  -31%
    calc. magnitude:   12652   11856   -6%
    I think by unrolling loops and improving memory access, some 5000 cycles can still be gained.
    I hope I caught all overflow problems due to bit-growth. If anybody wants to try, i would like to hear any of these problems.

    The file can just replace the existing analyze_fft1024.cpp/h in the Audio library
    In the header file only two changes were made, 2 more audio_block_t's and the reduction of the cmsis fft from 1024 to 256.

    In the original file Paul(?) mentions a TODO
    // TODO: support averaging multiple copies
    I have some ideas to implement this, it might only take about 4000 cycles and has a variable length averaging. But I first need to do more testing on the core of the FFT.

    EDIT: for the latest code see post: https://forum.pjrc.com/threads/27905...ll=1#post66678
    Attached Files Attached Files
    Last edited by kpc; 03-05-2015 at 06:37 AM.

  2. #2
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    Nice, I've been looking at using this and will look at it as soon as Beta-7 is running here. Will this change apply equally to all the FFT variants?

    In looking at the 256 .vs. 1024 and the 256 uses 12% instead of 52% as I saw it - but somehow it takes twice as long? I assumed the 256 would take less time to return samples? I was seeing 12ms between samples on 1024 and 22ms on 256 - is the algorithm that different somehow or taking more samples?
    Last edited by defragster; 03-01-2015 at 06:57 AM.

  3. #3
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    @KPC - I can't get a compile to complete using these files on Win7x64 with new 1.21-beta7. Under 50% it just stops as quoted below. I have reverted the two files and in about 10 seconds it completes with no warning

    If it helps here is my FFT sketch working with factory libs. It is the
    "C:\Users\tim\Documents\Arduino\hardware\teensy\av r\libraries\Audio\examples\Analysis\FFT\FFT.ino"
    example with perf data and modified looping changing the sinewave data to watch the buckets move: T_FFT_SINEWAVE_edit.ino

    C:\Users\mobil_000\Documents\Arduino/hardware/tools/arm/bin/arm-none-eabi-gcc -c -g -Os -Wall -ffunction-sections -fdata-sections -MMD -nostdlib -mthumb -mcpu=cortex-m4 -D__MK20DX256__ -DTEENSYDUINO=121 -DARDUINO=10600 -DF_CPU=96000000 -DARDUINO_ARCH_AVR -DUSB_SERIAL -DLAYOUT_US_ENGLISH -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\cores\teensy3 -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Audio -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Wire -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\SPI -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\SD -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Audio\utility C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\libraries\Audio\data_windows.c -o C:\Users\MOBIL_~1\AppData\Local\Temp\build17651793 41746263744.tmp\Audio\data_windows.c.o
    C:\Users\mobil_000\Documents\Arduino/hardware/tools/arm/bin/arm-none-eabi-g++ -c -g -Os -Wall -ffunction-sections -fdata-sections -MMD -nostdlib -fno-exceptions -felide-constructors -std=gnu++0x -fno-rtti -mthumb -mcpu=cortex-m4 -D__MK20DX256__ -DTEENSYDUINO=121 -DARDUINO=10600 -DF_CPU=96000000 -DARDUINO_ARCH_AVR -DUSB_SERIAL -DLAYOUT_US_ENGLISH -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\cores\teensy3 -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Audio -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Wire -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\SPI -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\SD -IC:\Users\mobil_000\Documents\Arduino\hardware\tee nsy\avr\libraries\Audio\utility C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\libraries\Audio\analyze_fft1024.cpp -o C:\Users\MOBIL_~1\AppData\Local\Temp\build17651793 41746263744.tmp\Audio\analyze_fft1024.cpp.o
    In file included from C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\cores\teensy3/WProgram.h:15:0,
    from C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\cores\teensy3/Arduino.h:1,
    from C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\cores\teensy3/AudioStream.h:34,
    from C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\libraries\Audio\analyze_fft1024.h:30,
    from C:\Users\mobil_000\Documents\Arduino\hardware\teen sy\avr\libraries\Audio\analyze_fft1024.cpp:4:

    BEFORE: fft1024=51,52 all=53.02,53.64 Memory: 4,8 Ready:1440 Not:5742280 # Samples:120 Sample Time:1393 Time per sample:11
    Sketch uses 92,808 bytes (35%) of program storage space. Maximum is 262,144 bytes.
    Global variables use 13,428 bytes (20%) of dynamic memory, leaving 52,108 bytes for local variables. Maximum is 65,536 bytes

    AFTER: ... see above for now
    Last edited by defragster; 03-01-2015 at 06:24 AM.

  4. #4
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    Sorry, have not tried the new arduino yet. Could you list the real error. Your respons stops just before the real error appears. On 1.06 it your .ino compiles without a problem.
    As to your first question: The fft256 returns an fft for every 256 samples. The fft1024 returns an fft every 512 samples. The latter uses half overlapping samples to compute the spectrum (which is sound, considering you throw away roughly half the information due to windowing).
    Furthermore, my implementation uses just a little bit of pipelining. This causes the latency between the input and the output to be approx 1280 samples, instead of approx. 1088 samples.
    My strategy of using four smaller fft's to calculate the result of the complete fft, might also transfer to 256, there you split it in four fft64 blocks. I might have a look at this, but not before I am certain the fft1024 works without problems.

    I am just thinking: the preprocessor/compiler might choke on the
    Code:
    uint32_t twiddle_1024[] __attribute__((aligned(4))) =
                    { TW1024_LOOP512(0) };
    Could you replace it by
    Code:
    uint32_t twiddle_1024[512] __attribute__((aligned(4))) = { 0 };
    And remove the preprocessor stuff, just to check it compiles?

    Edit. The latency mentioned above is for streaming conditions. If you stop restart the fft every time, the latency is much worse, then the original still has approx 1088 samples latency. My fft then has approximately 1680 samples initial latency. This is due to the fact of equilizing the processing power per time slot. By reorginizing the block and calculate everything on the 8-th slot, like in the original one, you still get the 30% improvement of the fft, but not the improvement on max slot usage.
    Last edited by kpc; 03-01-2015 at 08:04 AM.

  5. #5
    Senior Member
    Join Date
    Jul 2014
    Posts
    2,714
    Quote Originally Posted by kpc View Post
    I think by unrolling loops and improving memory access, some 5000 cycles can still be gained.
    I hope I caught all overflow problems due to bit-growth. If anybody wants to try, i would like to hear any of these problems.
    you may remove one ROR in "untangle_256" by using SHSAX instead of SHASX

    Test code is
    Code:
    #include "arm_math.h"
    
    // separate two spectra
    // Fn = (Hn + conj(HN-n)/2
    // Gn = -i(Hn - conj(HN-n))/2
    //
    // Z1 = (X + conj(Y)/2
    // z1r = (xr+yr)/2
    // z1i = (xi-yi)/2
    //
    // Z2 = -i(Z - conj(Y))/2
    // z2r = (xi+yi)/2
    // z2i = -(xr-yr)/2 = (yr-xr)/2
    
    //T3.1 is little endian [0],[1] = [r],[i] = [bottom,top]
    //
    void setup()
    {
      while(!Serial);
      short xx[2]; xx[0]=4; xx[1]=8;
      short yy[2]; yy[0]=2; yy[1]=4;
      short z1[2];
      short z2[2];
    
      // direct method
      z1[0]=(xx[0]+yy[0])/2;
      z1[1]=(xx[1]-yy[1])/2;
      
      z2[0]=(xx[1]+yy[1])/2;
      z2[1]=(yy[0]-xx[0])/2;
    
      Serial.printf("%d %d, %d %d: %d %d, %d %d\n",
      xx[0],xx[1],yy[0],yy[1],z1[0],z1[1],z2[0],z2[1]);
      
    // using dsp
    
    int rn= *(int*)xx;
    int rm= *(int*)yy;
    
    int *r1 = (int*)z1;
    int *r2 = (int*)z2;
    
    int rm1=__ROR(rm,16);
    *r1=__SHSAX(rn,rm1);
    *r2=__SHSAX(rm1,rn);
    
    Serial.printf("%d %d, %d %d: %d %d, %d %d\n",
      xx[0],xx[1],yy[0],yy[1],z1[0],z1[1],z2[0],z2[1]);
      
    }
    void loop()
    {}

  6. #6
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    By the way it seems there were changes to those files in the latest 1.21-Beta7 release. I had one open on this machine in an editor and it told me it changed. No idea what - or how much but you are downstream of something.

    I'll try again - I let the IDE sit a bit. That is where the text stopped before I killed it - It didn't error out, just printed those 'preceding' lines that I didn't see on the 'factory' version compile.

    I'm on diff machine now win7 not win8.1 - I'll swap the files and make the indicated change. BTW I'm on a Teensy 3.1.

  7. #7
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    @wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.

  8. #8
    Senior Member
    Join Date
    Jul 2014
    Posts
    2,714
    Quote Originally Posted by kpc View Post
    @wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.
    I only compared it with my own code. However, I only do a 256 point fft on 2 data streams.

  9. #9
    Senior Member
    Join Date
    Jul 2014
    Posts
    2,714
    Ever tried or thought about doing a radix 4 FFT step instead of two radix 2 steps from 256 to 1024?
    After all, it ends up with a mixed radix FFT.

  10. #10
    Senior Member
    Join Date
    Jul 2014
    Posts
    2,714
    Quote Originally Posted by kpc View Post
    @wmxz: good catch, was struggling with this bit for a while. All those ror's seemed redundant. Never thought someone would actually bother reading the code.
    @kpc: I have not checked this in your code, but I recall from other real to complex FFT's that the spectrum array has to be 1 complex sample longer than the input array (to store the Nyquist point). Did you consider this, or in the end simply drop the highest frequency?

  11. #11
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    Here is output from my sample runs - using same code provided above. The same SINEWAVE values are being processed, but the results are unique. This is a portion of the steps from 100Hz sinewave in 100Hz increments up to 4,000Hz. The left is KPC result and the right is 'factory' PJRC as it was in the 1.21-b7 release. The numbers displayed are the 100ths values of the numbers returned and they should on average add to 98 {sinewave.amplitude(.98);}. The looping makes three hits at each of 40 steps {sinewave.frequency(Freq);}, thus 120 Samples before repeating. {Image is a snip of ~6 changes}

    Results from SINEWAVE input and myFFT.windowFunction(AudioWindowHanning1024);
    Click image for larger version. 

Name:	KPCvsPJRC.png 
Views:	174 
Size:	52.7 KB 
ID:	3728
    Last edited by defragster; 03-01-2015 at 09:46 AM. Reason: FFT type noted

  12. #12
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    COMPILES and RUNS : I swapped the Twiddle code in cpp, I didn't do any other edits. [did a fresh download and code swap on diff machine]

    BEFORE: fft1024=51,52 all=53.02,53.64 Memory: 4,8 Ready:1440 Not:5742280 # Samples:120 Sample Time:1393 Time per sample:11
    Sketch uses 92,808 bytes (35%) of program storage space. Maximum is 262,144 bytes.
    Global variables use 13,428 bytes (20%) of dynamic memory, leaving 52,108 bytes for local variables. Maximum is 65,536 bytes

    AFTER: fft1024=9,10 all=10.73,12.17 Memory: 6,10 Ready:1440 Not:6134857 # Samples:120 Sample Time:1393 Time per sample:11
    Sketch uses 93,464 bytes (35%) of program storage space. Maximum is 262,144 bytes.
    Global variables use 15,492 bytes (23%) of dynamic memory, leaving 50,044 bytes for local variables. Maximum is 65,536 bytes.

    I can't attest to the FFT Bucket accuracy - though I can attach something of a bucket graph shortly.

    But to make notes on my previously existing 'stats' lines as I read it below - summary is I made 6.8% more "idle" passes through the loop() over the same time collecting the sample SINEWAVE data. The CPU usage may have dropped to 20% - but the spare time I got looping raised to 106.8% by my measure, so I assume somehow extra MCU idle waiting got pushed outside the FFT counts that didn't show up in loops?:

    fft1024=9,10 :: FFT1024 CPU usage dropped from peak 52 down to 10
    all=10.73,12.17 :: Overall CPU usage way down with slight change in spread - some extra work incurred outside FFT under 1%
    Memory: 6,10 :: Increased memory usage, not sure how big these blocks are?
    Ready:1440 Not:6134857 :: Count of loop() passes where new sample was Ready or Not, less overhead gave 7% more passes through loop()
    # Samples:120 :: Each of my 'cycles' takes 120 FFT samples and displays it, so the 1440 count above means this sample was the 10th pass
    Sample Time:1393 :: Total millis() time looping waiting for these 120 samples to complete
    Time per sample:11 :: 11 milliseconds per loop seems to be a constant based on the wait for FFT data collection I presume.

  13. #13
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    @wnmxz: My main focus was for a streaming FFT. I specifically used radix-2 to give me more freedom to spread the processing over each state. This also easily allows skipping the negative frequencies in the final step. The streaming application also adds some more overhead, then when I would when implementing it as an off-line FFT.
    I just dropped the Nyquist frequency, because any frequency contents here should have been removed by an anti-aliasing filter anyway. If no filter is present, the highest frequency bins will also be ruined by the aliasing anyway.
    For the cases where the Nyquist frequency is important, like convolution, you need to keep the negative frequencies as well.
    Dropping the Nyquist point also allows me to use the same fft-buffer in the pipelining. Although the specific point can easily be calculated without FFT. Just like DC is proportional to the sum of all samples, the Nyquist point is proportional to the sum of the difference of each pair of samples (s[0]-s[1]+s[2]-s[3]...)
    As said, I do not see any good reason for maintaining the Nyquist point if you are throwing away the negative frequencies anyway, or am I missing an obvious application?

  14. #14
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    @defragster. The magnitudes are really different. When I check, the scaling of both is exactly the same. The error between both is only a few LSB due to bit-truncation.
    Maybe check the twiddle factors. the first should be 0x00007fff (IIRC)

    Both the original and my fft code give the same response to your code. Peak values around 48
    Last edited by kpc; 03-01-2015 at 10:06 AM.

  15. #15
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    @kpc: Indeed I was going to say your results are diminished for sure. Each transition to start the test shows one or two zero lines from the 4000Hz transition to 100Hz, where the factory code even has '1' data on the unshown right edge in bins 72-80. And new FFT doesn't get near 98 (.98 with no decimals) but 23 on average so some loss/filtering is adding up. Odd that CPU load dropped 400% and only get 7% more free passes in the otherwise empty loop()
    [ if (myFFT.available()) { ... } else NoFFT += 1; ]
    "twiddle factors"? :: Everything is the default in the code above and the library, I'd need more specifics to adjust anything. I'll be back online in 12hrs if I can make any changes, but FFT example code as adjusted code looks like a good testbed if the sinewave is the right signal and this the right FFT. My next step was going to be repeating this code and putting in EPROM which of the 11 FFT's to setup() for on boot, since it can't be changed after that.

    It was about 30 years ago last I looked at FFT's, and that was running pre-compiled code on a VAX on image data as I learned C. I just need to 'see' a couple of sounds from a mic as the same Teensy 'watches' what somebody is doing and where he is as he runs around with this and a 9DOF on his wrist piping events by radio back to a stationary Teensy. I might even be to just go with sound intensity for my main event - but the pre-cursor will have some voice snippets and then a unique tone played for .3 seconds and I could really shut the FFT down after that so I can focus on the 9DOF. That's why I was hoping the 256 window FFT would be faster than the 1024. I still don't know why it takes twice as long to sample?

  16. #16
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    The penalty of my function is a higher latency. Maybe your code is getting out of sync due to the increased latency.
    As to the twiddle factors, you now say
    Everything is the default in the code above and the library.
    Whereas before you said
    I swapped the Twiddle code in cpp, I didn't do any other edits.
    Thats why I was affraid of them.
    Sorry I still don't get what you are trying to say with
    I still don't know why it takes twice as long to sample?

  17. #17
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    Indeed added latency - outside the FFT cycle count region. Not sure how to account for where it is, perhaps I could throw in an interval timer task to both cases and count the work it can do. I'll do that perhaps for a different measure of overall system throughput.

    Sorry for the confusion -
    'Your code with twiddle change as needed to compile' - versus - the unchanged PJRC code and the basically unchanged (except for instrumentation edits) FFT sample.
    Not related to your code but the source sample before I saw :
    My na´ve assumption was FFT#256 would run faster than FFT#1024. Indeed the fft@256 CPU hit is 12%, but the overall run time for my 120 Sample loop is Double at 2785 for #256 versus 1393 for #1024 milliseconds.
    [edit] - I'll swap the AudioAnalyzeFFT1024::update(void) code above in coming hours and report.
    Last edited by defragster; 03-01-2015 at 06:29 PM.

  18. #18
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    For the 'old' behaviour without the pipelining, buth with the 30% enhancement, try this instead

    Code:
    void AudioAnalyzeFFT1024::update(void)
    {
            audio_block_t *block;
    
            block = receiveReadOnly();
            if (!block) return;
            blocklist[4 + (state & 3)] = block;
    
            switch(state) {
            case 0: // startup states, just waiting for data
            case 1:
            case 2:
                    break;
            case 3: // final statup state
                    for (unsigned i = 0; i < 4; i++)
                            blocklist[i] = blocklist[i + 4];
                    break;
            case 4:
            case 5:
            case 6:
                    break;
            case 7:
                    prepare_256(blocklist, buffer, 0);
                    prepare_256(blocklist, buffer + 1024, 1);
                    if (window) {
                            apply_window_256(buffer, window, 0);
                            apply_window_256(buffer + 1024, window, 1);
                    }
                    arm_cfft_radix4_q15(&fft_inst, buffer);
                    arm_cfft_radix4_q15(&fft_inst, buffer + 1024);
                    untangle_256(buffer);
                    untangle_256(buffer + 1024);
                    radix2_512(buffer);
                    radix2_512(buffer + 1024);
                    radix2_1024(buffer);
                    calc_magnitude_512(buffer + 1024, output);
                    outputflag = true;
                    for (unsigned i = 0; i < 4; i++) {
                            release(blocklist[i]);
                            blocklist[i] = blocklist[i + 4];
                    }
                    state = 4 - 1; // -1 because of state++
                    break;
            }
            state++;
    }

  19. #19
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    kpc: Here twiddle code as used - when I swap the comment line the compile just hangs over a minute.
    //uint32_t twiddle_1024[] __attribute__((aligned(4))) = { TW1024_LOOP512(0) };
    uint32_t twiddle_1024[512] __attribute__((aligned(4))) = { 0 };

    fft1024=36,37 all=37.83,38.55 Memory: 4,8 Ready:1440 Not:6128811 # Samples:120 Sample Time:1393 Time per sample:11

    I swapped in the above replacement for .update() and the freq change points are now non-zero. Overall per sample magnitude (#23) is unchanged, FFT CPU use jumped from 10 up to 37, and my loop() entry measure of efficiency improvement is nearly unchanged at 1.067.
    Not:6128811 / Not:5742280 versus Not:6134857 / Not:5742280
    So somehow the loop rollups were causing loss on the transition edges, and doing their work hidden from the cycle counts attributed to the FFT code. But are not behind the loss of magnitude 23 .vs. 98.

    Next time I'll change to sum the magnitudes on each line, so I can start each FFT output line with that.

    Click image for larger version. 

Name:	KPC_Unrolled.PNG 
Views:	225 
Size:	30.1 KB 
ID:	3733

  20. #20
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    O, I see, I am sorry, setting the twiddle to zero was just to check the cause of the compilation error. I was never intended as a solutiion.
    Put this in the setup to precalculate the twiddle factors
    Code:
    for (int i = 0; i < 512; i++)
      twiddle_1024[i] = (uint32_t) ( \
      (((int16_t) min(32767,max(-32767,round(32768*sin(-M_PI/512*(i)))))) << 16) | \
      (((int16_t) min(32767,max(-32767,round(32768*cos(-M_PI/512*(i))))))& 0xffff));

  21. #21
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    setup() in the sketch?

  22. #22
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    In short: Yes
    Longer: Wherever, as long as it is set before you rely on the results of the fft.

    Remove the backslashes at the and of the line, these were copied from the define
    You might also need to add this:
    extern uint32_t twiddle_1024[512]
    so that setup can find the array
    Last edited by kpc; 03-01-2015 at 08:05 PM.

  23. #23
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    Okay I had done that and added the extern uint32_t twiddle_1024[512];

    gives error: 'twiddle' was not declared in this scope

    Need to make global in cpp source - in header I suppose. I'm rushing to something else . . .

  24. #24
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,175
    IT LOOKS GOOD ! [Resolved - somehow my pasted twiddle_1024 was just 'twiddle' - as it says in the error above.]

    The First still rolled loops has the right magnitude and also stabilizes on the two lines between freq changes like the PJRC code so it passes a sanity check now. Though turning the UNROLL back on not only compromised the data - but it actually gave me 4,500 fewer passes through loop() after 1440 samples! So this code is just causing unmeasured 'latency' - and consuming more RAM buffers in the process!

    KPC1024=37,37 all=38.74,39.35 Memory: 4,8 Ready:1440 Not:6119463 # Samples:120 Sample Time:1393 Time per sample:11

    Click image for larger version. 

Name:	KPC_Twiddled.PNG 
Views:	148 
Size:	32.9 KB 
ID:	3739


    and turning back on the rolled up update() loops:
    KPC1024=9,11 all=10.64,12.66 Memory: 6,10 Ready:1440 Not:6114810 # Samples:120 Sample Time:1392 Time per sample:11
    Fixed Twiddle - Unrolled Loops :: still has missing freq transition lines - but maybe you can fix that?

    Click image for larger version. 

Name:	KPC_TwiddledUnrolled.PNG 
Views:	134 
Size:	37.0 KB 
ID:	3740
    Last edited by defragster; 03-01-2015 at 10:27 PM.

  25. #25
    Senior Member
    Join Date
    Jan 2015
    Location
    frisia
    Posts
    285
    I checked again and there might be some problem with the timing between your code and my fft (apart from the latency). If I move this part
    Code:
                    calc_magnitude_512(buffer + 1024, output);
                    if (state >= 12)
                            outputflag = true;
    from case 13 to the end of case 12, it seems to give better results. I cannot understand why.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •