
Thread: Speed test and optimizations--some surprises!

  1. #1

    Speed test and optimizations--some surprises!

    I have a very math-intensive bit of code running a giant pile of things, including a bunch of LUTs. So I sat down to see which of the various optimization settings gave me the fastest code.
    Space? Who cares! I want speed!
    So, at the start of the code I throw a pin high, at the end I throw it low, and I measure the pulse width on the o-scope.
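    For reference, that pin-toggle measurement can be sketched like this. It's a minimal sketch: the pin number and the math loop are placeholders, and digitalWriteFast is stubbed so the harness runs off-target; on a real Teensy you'd use the core's own digitalWriteFast and probe the pin.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Arduino/Teensy call (assumption: the real code uses
// digitalWriteFast on a Teensy); records the last level written.
static int pin_state = -1;
static void digitalWriteFast(int /*pin*/, int level) { pin_state = level; }

const int SCOPE_PIN = 2;   // hypothetical pin wired to the o-scope probe

volatile uint32_t sink;    // volatile sink so the work isn't optimized away

void timedSection() {
    digitalWriteFast(SCOPE_PIN, 1);               // rising edge = start of section
    uint32_t acc = 0;
    for (int i = 0; i < 1000; i++) acc += i * i;  // stand-in for the real math
    sink = acc;
    digitalWriteFast(SCOPE_PIN, 0);               // falling edge = end; pulse width = run time
}
```

    The pulse width seen on the scope is the section's execution time under whatever optimization setting is compiled in.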
    Going from Fast to Fastest, with LTO on every other setting, I got:
    15.56 us
    16.36 us
    14.44 us
    14.28 us
    14.52 us
    14.12 us
    with the last being "Fastest with LTO".
    It was funny that "Faster" was faster than "Fastest" without LTO, but hey, that's how things go.

    So, for giggles, I then ran "Smallest" and "Smallest with LTO" and got:
    13.72 us
    13.64 us

    Yep, "Smallest with LTO", IN MY CASE, is faster than "Fastest with LTO".

    The takeaway message I'd throw out there is:
    if speed is king, test your code; don't rely on the compiler's settings.

    --mjlg

  2. #2
    Member hoek67
    Quote Originally Posted by Marcus LaGrone View Post
    ...
    so, for giggles, I then ran "smallest" and "smallest with LTO" and got:
    13.72 us
    13.64 us

    yep, "smallest with LTO" IN MY CASE is faster than "FASTEST with LTO"

    the take away message I'd throw out there is:
    if speed is king, test your code; don't rely on the compiler's settings.

    --mjlg
    Yes... optimizations have always been a tricky beast to tame. Sometimes the fastest solution defies all logic. Sometimes it seems a simple lookup table (LUT) would beat a semi-complex bit of arithmetic hands down... but that's not always the case.
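    As a toy illustration of that trade-off (a hypothetical example, not code from this thread): a one-byte transform can be computed either through a table or with a couple of ALU operations, and which one wins depends on whether the table load hits cache.

```cpp
#include <cassert>
#include <cstdint>

// 256-entry table for the transform x -> (x*3)>>1, precomputed once.
static uint8_t lut[256];

void initLut() {
    for (int i = 0; i < 256; i++) lut[i] = uint8_t((i * 3) >> 1);
}

uint8_t viaLut(uint8_t x)  { return lut[x]; }                  // one memory load (may miss cache)
uint8_t viaMath(uint8_t x) { return uint8_t((x * 3) >> 1); }   // a couple of ALU ops, no load
```

    On a fast core with a cold cache the arithmetic can easily beat the table; only measurement on the target settles it.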

    I had an example where I got a 10% speed increase in a video frame renderer just by ... moving the damn temp buffer.

    Originally, the buffer I used to process each scanline was a "spare" buffer in global memory that I knew would always be free when rendering video... so I used it. It was a 2K buffer, so ample room.

    Since I only needed a 160*2 = 320-byte buffer, I decided to put it on the stack. Winner... a 10% increase!

    Basically, optimizing anything in the "inner-inner-inner" loop that executes for every pixel adds up to a big gain (or loss) quickly.

    From my experience, accessing the static buffer requires an instruction plus a full 32-bit address, whereas the smaller buffer on the stack is addressed with a short SP-relative displacement, which is more efficient. Also, since all the variables I use are local... nice cache performance!
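    A minimal sketch of that change (the names and the per-pixel work are illustrative, assuming a GCC-style ARM target): the global buffer forces the compiler to materialize a 32-bit absolute address, while the stack buffer is reached through short SP-relative displacements.

```cpp
#include <cassert>
#include <cstdint>

// Before: a large shared global buffer (illustrative name).
static uint8_t spare_buffer[2048];          // absolute 32-bit address per access

uint32_t renderScanlineGlobal(const uint8_t* src) {
    uint8_t* line = spare_buffer;           // compiler must load the full address
    uint32_t sum = 0;
    for (int i = 0; i < 160 * 2; i++) {
        line[i] = src[i] ^ 0x55;            // stand-in for the per-pixel work
        sum += line[i];
    }
    return sum;
}

// After: a buffer sized to what's actually needed, on the stack.
uint32_t renderScanlineStack(const uint8_t* src) {
    uint8_t line[160 * 2];                  // SP-relative, short displacement encodings
    uint32_t sum = 0;
    for (int i = 0; i < 160 * 2; i++) {
        line[i] = src[i] ^ 0x55;
        sum += line[i];
    }
    return sum;
}
```

    The two functions compute the same thing; only the buffer placement, and hence the addressing mode and cache locality, differs.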

    Also, I got a very pleasant surprise (we don't get too many of those, do we!). My test routine of video with sound originally took 1500 seconds, and after some small changes it came down to 918 seconds (a big difference). I decided to turn the sound off (5000-6000 Hz) to see what difference that would make, and amazingly it was almost identical at 919 seconds. As I don't have the actual AD5330 breakout yet, I had just coded it against the missing hardware, so I frantically rechecked it. I put an LED on/off toggle in the sound interrupt and it flashes, so I can assume it is actually sending.

    My video renderer uses DMA without waiting, so I process the next scanline while the previous one is being sent to the display. From what I could figure out, mem-to-mem is a lot faster than mem-to-SPI, so there are a lot of spare cycles that would usually be wasted; now they get used.
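    That ping-pong pattern can be sketched roughly like this. The DMA calls here are stubs standing in for whatever non-blocking SPI/DMA API is actually in use; this is an assumption-laden illustration of the overlap, not the actual renderer.

```cpp
#include <cassert>
#include <cstdint>

// --- stubs for a non-blocking DMA transfer (hypothetical API) ---
static bool dma_busy = false;
static const uint8_t* dma_src = nullptr;
static void startDmaTransfer(const uint8_t* buf, int /*len*/) { dma_busy = true; dma_src = buf; }
static bool dmaBusy() { bool b = dma_busy; dma_busy = false; return b; }  // stub: reports done after one poll

constexpr int LINE_BYTES = 160 * 2;
static uint8_t lineBuf[2][LINE_BYTES];      // two buffers: one sending, one being filled

static void renderScanline(uint8_t* dst, int y) {
    for (int i = 0; i < LINE_BYTES; i++) dst[i] = uint8_t(y + i);  // stand-in pixel work
}

void renderFrame(int height) {
    int cur = 0;
    renderScanline(lineBuf[cur], 0);                    // fill the first line
    for (int y = 1; y <= height; y++) {
        startDmaTransfer(lineBuf[cur], LINE_BYTES);     // kick off the send, don't wait
        cur ^= 1;                                       // switch to the other buffer
        if (y < height) renderScanline(lineBuf[cur], y);// render next line DURING the transfer
        while (dmaBusy()) {}                            // wait only before reusing the sent buffer
    }
}
```

    The CPU work and the SPI transfer overlap; the only stall is the final wait before a buffer is reused, which is short whenever rendering a line takes longer than sending one.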

    As a result of more than a week of optimizations, I'm now getting 20+ frames/sec on an 84 MHz processor with 128*128 DMA SPI on an SSD1351, with sound via the AD5330.

    I should also add... I'm a big fan of "inline".

  3. #3
    Quote Originally Posted by hoek67 View Post
    I should also add... I'm a big fan of "inline".
    I continue to be amused when inline does and doesn't offer an improvement in speed... 1d6 please :P

    --mjlg

  4. #4
    Senior Member+ MichaelMeissner's Avatar
    The problem with inline is that it oftentimes inlines too much. If you have a small leaf function (one that doesn't call anything else) that only uses a few registers, inlining can often be a win.

    But at other times, if you inline multiple functions, it can force register spilling (the compiler doesn't have enough registers to hold everything important in the function, so it has to store one register's value to memory in order to free up a register for something else). These extra stores and loads can slow things down. This particularly shows up if you have an error function that is rarely called but uses a lot of resources: if you don't actually call the function, the resources dedicated to the inlined copy could have been used for something else.
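    A sketch of that pattern (hypothetical names, GCC attributes): keep the rarely-taken error path out-of-line so its register pressure stays out of the hot caller, and let the tiny leaf get inlined.

```cpp
#include <cassert>
#include <cstdint>

static int last_error = 0;

// Cold path: forcing it out-of-line keeps its register needs (imagine
// string formatting, logging, etc.) out of every caller's hot path.
__attribute__((noinline)) void reportError(int code) {
    last_error = code;
}

// Hot leaf: small, few registers, calls nothing - a good inline candidate.
static inline uint32_t scale(uint32_t x) { return (x * 3u) >> 1; }

uint32_t process(uint32_t x) {
    if (x == 0) {            // rarely taken
        reportError(-1);     // register spills stay inside reportError()
        return 0;
    }
    return scale(x);         // likely inlined: no call overhead on the hot path
}
```

    The same idea is behind GCC's `__builtin_expect` / hot-cold splitting: the goal is that the common path compiles as if the error handling didn't exist.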

    And hopefully GCC gets better over time (though I do have some benchmarks that have regressed). Right now, Teensy is using GCC 5.4. The original 5.1 compiler was released two years ago (the bug-fix release 5.4 came out last June, but the main functionality was frozen in the 5.1 release). The 6.1 compiler was released last year, with the last bug-fix release in December. The GCC 7.1 compiler was released a few days ago. Perhaps GCC 6.3 is better than GCC 5.4 (but perhaps not). The only real answer is to try it on your own code and see what is better.

  5. #5
    Senior Member+ defragster's Avatar
    Quote Originally Posted by Marcus LaGrone View Post
    ...
    so, for giggles, I then ran "smallest" and "smallest with LTO" and got:
    13.72 us
    13.64 us

    yep, "smallest with LTO" IN MY CASE is faster than "FASTEST with LTO"

    the take away message I'd throw out there is:
    if speed is king, test your code; don't rely on the compiler's settings.

    --mjlg
    Test indeed - the MCUs have internal RAM/flash performance tradeoffs as well. Quite possibly the smallest build's reduced code size fit the cache at hand, so the hottest execution path ran from higher-bandwidth memory without being evicted and reloaded from flash. Smallest doesn't mean it isn't compiling smart - rather that it's not going overboard growing the code in the process - and some library code is also neutered to fit.

    FASTRUN decoration on some small critical function might fit in RAM and may get better treatment without compromising overall operation. FASTRUN might be almost as good as inline (?) in some cases, since a single copy can be called quickly from RAM rather than multiple inlined copies growing the code and needing to access FLASH to run.
    Last edited by defragster; 05-09-2017 at 03:23 AM. Reason: opps ... FASTRUN
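    For anyone new to it, FASTRUN is used like a storage-class decoration on the function definition. A minimal sketch follows; the fallback #define here only approximates what the Teensy core does (on-target, the core headers supply FASTRUN and the linker script copies the `.fastrun` section into RAM at startup).

```cpp
#include <cassert>
#include <cstdint>

// Off-target stand-in (assumption): the real Teensy core defines FASTRUN
// as a section attribute that the linker script places in RAM.
#ifndef FASTRUN
#define FASTRUN __attribute__((section(".fastrun")))
#endif

// One copy of the hot routine, executed from RAM on-target, callable from
// anywhere - unlike inline, which duplicates the body at every call site.
FASTRUN uint32_t hotChecksum(const uint8_t* p, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) sum += p[i];
    return sum;
}
```

    The trade-off: RAM for code is scarce, so the decoration pays off only on genuinely hot routines.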

  6. #6
    Quote Originally Posted by defragster View Post
    RAMFUNC decoration on some small critical function might fit in RAM and may get better treatment without compromising overall operation as well. RAMFUNC might be almost as good as inline (?) in some cases where a single copy can be called quickly in RAM rather than multiple copies growing the code and needing to access FLASH to run.
    So what is the practical difference between RAMFUNC and FASTRUN?

    FASTRUN has been helpful today! Wish I had seen it earlier...

    --mjlg

  7. #7
    Senior Member+ defragster's Avatar
    Yeah - FASTRUN - is what I meant . . .

  8. #8
    Senior Member+ MichaelMeissner's Avatar
    In terms of LTO, I'm currently doing SPEC 2006 runs on the released GCC 7.1. For the options I'm using, out of the 29 benchmarks in SPEC 2006 INT/FP, 14 benchmarks were faster using LTO than without it. One benchmark was nearly 36% faster. On the other hand, 3 benchmarks were slower; the worst had a 7.5% slowdown. So yeah, I imagine in the next year I or somebody in the group will try to reduce those slowdowns. But in the compiler field we often play whack-a-mole: you optimize one case, and it causes something else to slow down.
    Last edited by MichaelMeissner; 05-09-2017 at 04:27 AM.

  9. #9
    A few rounds of changes later and using FASTRUN the code is now faster with "Faster w LTO" as opposed to "smallest"

    Quick question: is there any way to tell the compiler (Teensyduino) to stick EVERYTHING in RAM if possible?
    I can stick FASTRUN all over hither and yon, but having it also apply FASTRUN to things in the #includes would be nice.

    --mjlg

  10. #10
    Senior Member+ defragster's Avatar
    What about Smallest w/LTO?

    The reserved RAM cache of code from FLASH can hold 8K maybe once or twice in some fashion.

    There are books - and parts of posts from Paul - that likely cover such architectural issues, but as with all 'optimizations' there are trade-offs.

    Things like this: Optimizing Performance on Kinetis K-series MCUs

    I don't see posted code, so no idea of the actual utility or overall size of the 'test code'. The MCU has built-in efficiencies and abilities; compromising those during compiling and linking will limit performance at some point.

  11. #11
    Senior Member tni
    Quote Originally Posted by defragster View Post
    The reserved RAM cache of code from FLASH can hold 8K maybe once or twice in some fashion.
    Only Teensy 3.6 has a large cache (a flash cache plus an extra 8 KB code cache).

    T3.2 / T3.5 only have a small flash cache. For Teensy 3.5, the flash cache is 128 bytes (16 64-bit entries) plus a 64-bit prefetch buffer.

  12. #12
    Senior Member+ defragster's Avatar
    Quote Originally Posted by tni View Post
    Only Teensy 3.6 has a large cache (Flash cache and extra 8kb cache).

    T3.2 / T3.5 only have a small Flash cache. For Teensy 3.5, the Flash cache is 128 bytes (16 64-bit entries) + a 64-bit prefetch buffer.
    Indeed - I don't see an MCU noted above, and the capabilities vary across them. As with the optimization/LTO options, there isn't one answer across the board.

  13. #13
    Senior Member tni
    Teensy 3.6 benefits a lot less from FASTRUN (with double, the code size is much bigger and overflows the cache):
    Code:
    #define C(x) case x: { volatile T v = 42; v = v+v; break; }
    
    #define test_loop(type)  \
        using T = type;\
        for(volatile int ol = 0; ol < 100000; ol++) {\
            for(int i = 0; i < 20; i++) {\
                switch(i) {\
                    C(0); C(1); C(2); C(3); C(4); \
                    C(5); C(6); C(7); C(8); C(9); \
                    C(10); C(11); C(12); C(13); C(14);\
                    C(15); C(16); C(17); C(18); C(19); \
                }\
            }\
        }
    
    __attribute__ ((__noinline__)) void test_loop_no_fastrun_int() {
        test_loop(int)
    }
    
    __attribute__ ((__noinline__)) FASTRUN void test_loop_fastrun_int() {
        test_loop(int)
    }
    
    __attribute__ ((__noinline__)) void test_loop_no_fastrun_double() {
        test_loop(double)
    }
    
    __attribute__ ((__noinline__)) FASTRUN void test_loop_fastrun_double() {
        test_loop(double)
    }
    
    using fn_t = void (*)();
    
    __attribute__ ((__noinline__)) void benchFn(fn_t fn, const char* desc1, const char* desc2) {
        uint32_t start_time = millis();
        fn();
        uint32_t end_time = millis();
        uint32_t duration = end_time - start_time;
        Serial.printf("Duration [%s, %s]: %u\n", desc1, desc2, duration);
    }
    
    
    #define BENCH_ALL(type) \
        benchFn(test_loop_no_fastrun_##type, #type,      "no fastrun          ");\
        benchFn(test_loop_fastrun_##type, #type,         "fastrun             ");\
        if(is_k66) {\
            LMEM_PCCCR = 0;\
            benchFn(test_loop_no_fastrun_##type, #type,      "no fastrun, no cache");\
            benchFn(test_loop_fastrun_##type, #type,         "fastrun, no cache   ");\
            LMEM_PCCCR = 0x85000003;\
        }
    
    
    void setup() {
        Serial.begin(9600);
        delay(2000);
        bool is_k66 = false;
    #ifdef __MK66FX1M0__
        is_k66 = true;
    #endif
        BENCH_ALL(int)
        Serial.println();
        BENCH_ALL(double)
    }
    
    void loop() {}
    Teensy 3.6 @ 180MHz
    Duration [int, no fastrun ]: 234
    Duration [int, fastrun ]: 235
    Duration [int, no fastrun, no cache]: 276
    Duration [int, fastrun, no cache ]: 235

    Duration [double, no fastrun ]: 835
    Duration [double, fastrun ]: 890
    Duration [double, no fastrun, no cache]: 1780
    Duration [double, fastrun, no cache ]: 891


    Teensy 3.5 @ 120MHz
    Duration [int, no fastrun ]: 396
    Duration [int, fastrun ]: 352

    Duration [double, no fastrun ]: 1900
    Duration [double, fastrun ]: 1336

    Teensy 3.2 @ 96MHz
    Duration [int, no fastrun ]: 497
    Duration [int, fastrun ]: 440

    Duration [double, no fastrun ]: 2005
    Duration [double, fastrun ]: 1672
    Last edited by tni; 05-12-2017 at 06:57 PM.

  14. #14
    Senior Member+ defragster's Avatar
    @tni :: The code in your posts is so instructive

    Quote Originally Posted by tni View Post
    Teensy 3.6 benefits a lot less from FASTRUN (with double, the code size is much bigger and overflows the cache):

    Teensy 3.6 @ 180MHz

    Duration [double, no fastrun ]: 835
    Duration [double, fastrun ]: 890
    Duration [double, no fastrun, no cache]: 1581
    I should code this (and other questions) - but assume T_3.6 fastrun with 'no cache' performance wouldn't suffer much
    Duration [double, fastrun, no cache]: 890

  15. #15
    Senior Member tni
    Quote Originally Posted by defragster View Post
    I should code this (and other questions) - but assume T_3.6 fastrun with 'no cache' performance wouldn't suffer much
    No difference. Post above updated.

  16. #16
    Senior Member+ defragster's Avatar
    Quote Originally Posted by tni View Post
    No difference. Post above updated.
    Thanks - that shows the T3.6's large cache has consistent value independent of FASTRUN.
