Speed test and optimizations--some surprises!

I have a very math-intensive bit of code running a giant pile of things, including a bunch of LUTs. So I sat down to see which of the various optimization settings gave me the fastest code.
Space? Who cares! I want speed!
At the start of the code I throw a pin high, at the end I throw it low, and I measure the pulse width on the o-scope.
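Something like this minimal harness (the pin number and the kernel stub are made up):
Code:
const int TIMING_PIN = 2;                  // any free pin, probed by the scope

__attribute__((noinline)) void runMathKernel() {
    // ... the math-heavy code under test goes here ...
}

void setup() { pinMode(TIMING_PIN, OUTPUT); }

void loop() {
    digitalWriteFast(TIMING_PIN, HIGH);    // rising edge: kernel starts
    runMathKernel();
    digitalWriteFast(TIMING_PIN, LOW);     // falling edge: kernel done; pulse width = run time
    delay(1);                              // gap so pulses are distinct on the scope
}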
Going from "Fast" to "Fastest with LTO", alternating each level with its LTO variant, I got:
15.56 us (Fast)
16.36 us (Fast with LTO)
14.44 us (Faster)
14.28 us (Faster with LTO)
14.52 us (Fastest)
14.12 us (Fastest with LTO)
It was funny that "Faster" was faster than "Fastest" without the LTO optimization, but hey, that's how things go.

So, for giggles, I then ran "Smallest" and "Smallest with LTO" and got:
13.72 us (Smallest)
13.64 us (Smallest with LTO)

yep, "smallest with LTO" IN MY CASE is faster than "FASTEST with LTO"

The takeaway message I'd throw out there is:
if speed is king, test your code; don't rely on the compiler's settings.

--mjlg
 

Yes... optimizations have always been a tricky beast to tame. Sometimes the fastest solution defies all logic. Sometimes it seems a simple lookup table (LUT) would beat a semi-complex bit of arithmetic hands down... but that's not always the case.

I had an example where I got a 10% speed increase in a video frame renderer just by... moving the damn temp buffer.

Originally, the buffer I used to process each scanline was a "spare" buffer in global memory that I knew would always be free when rendering video... so I used it. It was a 2K buffer, so ample room.

Since I only needed 160*2 bytes, I decided to put the buffer on the stack. Winner... a 10% increase!

Basically, optimizing anything in the "inner-inner-inner" loop that executes for every pixel adds up to a big gain (or loss) quickly.

In my experience, accessing the static buffer requires an instruction plus a full 32-bit address, whereas the smaller buffer on the stack is reached with just a displacement off the stack pointer, which is more efficient. Also, since all the variables I use are local... nice cache performance!
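Roughly the before/after, with illustrative names (the real renderer is of course more involved):
Code:
static uint8_t spare_buf[2048];            // before: 2K "spare" global buffer

void render_line_global(int y) {
    uint8_t *line = spare_buf;             // compiler must materialize a 32-bit address
    for (int x = 0; x < 160 * 2; x++) line[x] = (uint8_t) x;   // placeholder per-pixel work
}

void render_line_stack(int y) {
    uint8_t line[160 * 2];                 // after: right-sized buffer on the stack,
    for (int x = 0; x < 160 * 2; x++) line[x] = (uint8_t) x;   // reached via SP + small offset
}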

Also, I got a very pleasant surprise (we don't get too many of those, do we!!!). My test routine of video with sound originally took 1500 seconds, and after some small changes it came down to 918 seconds (a big difference). I decided to turn the sound off (5000-6000 Hz) and see what difference that would make, and amazingly it was almost identical at 919 seconds. As I don't have the actual AD5330 breakout yet, I just coded it against the missing hardware, so I frantically rechecked. I put an LED on/off in the sound interrupt and it flashes, so I can assume it is actually sending.
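That LED-in-the-ISR sanity check looks something like this minimal sketch (the ~5.5 kHz rate, the divider, and the DAC write are assumptions; IntervalTimer is Teensy's timer API):
Code:
IntervalTimer soundTimer;
volatile uint32_t isrCount = 0;
volatile bool ledOn = false;

void soundISR() {
    // ... write the next sample to the DAC here ...
    if ((++isrCount & 0x0FFF) == 0) {      // every 4096 ticks (~0.7 s at 5.5 kHz)
        ledOn = !ledOn;
        digitalWriteFast(LED_BUILTIN, ledOn);  // visible blink proves the ISR fires
    }
}

void setup() {
    pinMode(LED_BUILTIN, OUTPUT);
    soundTimer.begin(soundISR, 182);       // 182 us period ~= 5.5 kHz sample rate
}

void loop() {}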

My video renderer uses DMA without waiting, so I process the next scanline while the previous one is being sent to the display. From what I could figure out, mem-to-mem is a lot faster than mem-to-SPI, so there are a lot of spare cycles that would usually be wasted, and now they're being used.
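In outline (spiDmaBusy / spiDmaSend / renderLine are hypothetical stand-ins for the real driver calls):
Code:
const int LINE_BYTES = 128 * 2;             // one 16-bpp scanline on a 128x128 display
uint8_t linebuf[2][LINE_BYTES];             // double buffer: render one, send the other

bool spiDmaBusy();                          // hypothetical: DMA transfer still in flight?
void spiDmaSend(const uint8_t *buf, int n); // hypothetical: start transfer, don't wait
void renderLine(uint8_t *buf, int y);       // hypothetical: CPU-side scanline work

void renderFrame() {
    for (int y = 0; y < 128; y++) {
        uint8_t *buf = linebuf[y & 1];      // alternate buffers each line
        renderLine(buf, y);                 // render while the previous DMA runs
        while (spiDmaBusy()) { }            // blocks only if rendering beat the transfer
        spiDmaSend(buf, LINE_BYTES);        // kick off this line and move on
    }
}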

As a result of more than a week of optimizations, I'm now getting 20+ frames/sec on an 84 MHz processor driving a 128*128 SSD1351 over DMA SPI, with sound via the AD5330.

I should also add... I'm a big fan of "inline".
 
The problem with inline is that it oftentimes inlines too much. If you have a small leaf function (one that doesn't call anything else) that only uses a few registers, inlining it can often be a win.

But at other times, if you inline multiple functions, it can force register spilling (where the compiler doesn't have enough registers to hold everything important, and has to store a register value to memory in order to free it up for something else). These extra stores and loads can slow things down. This particularly shows up if you have an error function that is rarely called but uses a lot of resources: if you don't actually call the function, the registers that were dedicated to it could have been used for something else.
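A sketch of that idea (names made up): pushing the rarely-taken error path out of line keeps its register needs out of the hot loop.
Code:
// Cold, out-of-line error reporter: its printf register pressure stays out
// of the caller's loop because the compiler can't inline it.
__attribute__((noinline, cold))
static void reportOverflow(int i, int sum) {
    Serial.printf("overflow at %d: %d\n", i, sum);
}

static inline int accumulate(const int16_t *s, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += s[i];
        if (__builtin_expect(sum > 1000000, 0))   // branch hinted as unlikely
            reportOverflow(i, sum);               // plain call on the cold path
    }
    return sum;
}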

And hopefully GCC gets better over time (though I do have some benchmarks that have regressed). Right now, Teensy is using GCC 5.4. The original 5.1 compiler was released two years ago (the bug-fix release 5.4 came out last June, but the main functionality was frozen in 5.1). The 6.1 compiler was released last year, with the last bug-fix release in December. The GCC 7.1 compiler was released a few days ago. Perhaps GCC 6.3 is better than GCC 5.4 (but perhaps not). The only real answer is to try it on your own code and see what is better.
 
...
Yep, "Smallest with LTO" IN MY CASE is faster than "Fastest with LTO".
...
if speed is king, test your code; don't rely on the compiler's settings.

Test indeed - MCUs have internal RAM/FLASH trade-offs as well. Quite possibly "Smallest" reduced the code size enough that the hottest execution path fit the cache on hand, so it ran from higher-bandwidth memory without dumping the cache to reload and run from FLASH. "Smallest" doesn't mean the compiler isn't compiling smart - just that it doesn't go overboard growing the code in the process - and some library code is neutered to fit as well.

A FASTRUN decoration on some small critical function might fit it in RAM and get better treatment, without compromising overall operation. FASTRUN might be almost as good as inline (?) in some cases, where a single copy can be called quickly from RAM rather than multiple inlined copies growing the code and needing to access FLASH to run.
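For example (function and body made up), using Teensyduino's FASTRUN macro:
Code:
// FASTRUN asks the linker to place this function in RAM, copied from FLASH at startup.
FASTRUN void scaleSamples(int16_t *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = (buf[i] * 3) >> 2;    // hot inner loop executes from RAM, not FLASH
}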
 
A RAMFUNC decoration on some small critical function might fit it in RAM and get better treatment, without compromising overall operation. RAMFUNC might be almost as good as inline (?) in some cases, where a single copy can be called quickly from RAM rather than multiple inlined copies growing the code and needing to access FLASH to run.

So what is the practical difference between RAMFUNC and FASTRUN?

FASTRUN has been helpful today! Wish I had seen it earlier...

--mjlg
 
In terms of LTO, I'm currently doing SPEC 2006 runs on the released GCC 7.1. For the options I'm using, out of the 29 benchmarks in SPEC 2006 INT/FP, 14 benchmarks were faster using LTO than not using it. One benchmark was nearly 36% faster. On the other hand, 3 benchmarks were slower; the worst had a 7.5% slowdown. So yeah, I imagine in the next year I or somebody in the group will try to reduce those slowdowns. But in the compiler field we often play whack-a-mole: you optimize one case, and it causes something else to slow down.
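For reference, enabling LTO with plain GCC just means passing -flto at both compile and link time, so the optimizer sees all translation units together and can, for example, inline across files. A typical invocation (file names made up):
Code:
gcc -O2 -flto -c a.c              # stores GIMPLE IL alongside the object code
gcc -O2 -flto -c main.c
gcc -O2 -flto a.o main.o -o prog  # the cross-file optimization happens here, at link time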
 
A few rounds of changes later, and using FASTRUN, the code is now faster with "Faster with LTO" than with "Smallest".

Quick question: is there any way to tell the compiler (Teensyduino) to stick EVERYTHING in RAM if possible?
I can stick FASTRUN all over hither and yon, but having it also apply FASTRUN to things in the #included files would be nice.

--mjlg
 
What about Smallest w/LTO?

The reserved RAM cache of code from FLASH can hold 8K maybe once or twice in some fashion.

There are books - and posts from Paul - that likely cover such architectural issues, but as with all 'optimizations' there are trade-offs.

Things like this: Optimizing Performance on Kinetis K-series MCUs

I don't see posted code, so I have no idea of the actual utility or overall size of the 'test code'. The MCU has built-in efficiencies and abilities; compromising them in compiling and linking will limit performance at some point.
 
The reserved RAM cache of code from FLASH can hold 8K maybe once or twice in some fashion.

Only Teensy 3.6 has a large cache (Flash cache plus an extra 8 KB cache).

T3.2 / T3.5 only have a small Flash cache. For Teensy 3.5, the Flash cache is 128 bytes (16 64-bit entries) + a 64-bit prefetch buffer.
 
Only Teensy 3.6 has a large cache (Flash cache plus an extra 8 KB cache).

T3.2 / T3.5 only have a small Flash cache. For Teensy 3.5, the Flash cache is 128 bytes (16 64-bit entries) + a 64-bit prefetch buffer.

Indeed - I don't see an MCU noted above, and the capabilities vary across them. As with the optimization/LTO options, there isn't one answer across the board.
 
Teensy 3.6 benefits a lot less from FASTRUN (with double, the code size is much bigger and overflows the cache):
Code:
// Each case is a trivial volatile read-modify-write, so the compiler can't
// collapse the switch; with T = double the case bodies get much bigger.
#define C(x) case x: { volatile T v = 42; v = v+v; break; }

#define test_loop(type)  \
    using T = type;\
    for(volatile int ol = 0; ol < 100000; ol++) {\
        for(int i = 0; i < 20; i++) {\
            switch(i) {\
                C(0); C(1); C(2); C(3); C(4); \
                C(5); C(6); C(7); C(8); C(9); \
                C(10); C(11); C(12); C(13); C(14);\
                C(15); C(16); C(17); C(18); C(19); \
            }\
        }\
    }

__attribute__ ((__noinline__)) void test_loop_no_fastrun_int() {
    test_loop(int)
}

// FASTRUN places the function in RAM; without it, code executes from Flash.
__attribute__ ((__noinline__)) FASTRUN void test_loop_fastrun_int() {
    test_loop(int)
}

__attribute__ ((__noinline__)) void test_loop_no_fastrun_double() {
    test_loop(double)
}

__attribute__ ((__noinline__)) FASTRUN void test_loop_fastrun_double() {
    test_loop(double)
}

using fn_t = void (*)();

__attribute__ ((__noinline__)) void benchFn(fn_t fn, const char* desc1, const char* desc2) {
    uint32_t start_time = millis();
    fn();
    uint32_t end_time = millis();
    uint32_t duration = end_time - start_time;
    Serial.printf("Duration [%s, %s]: %lu\n", desc1, desc2, (unsigned long) duration);
}

// LMEM_PCCCR (code cache control) only exists on the K66 / Teensy 3.6; guarding
// the register writes keeps this sketch compiling on T3.2 / T3.5 as well.
static void setCodeCache(bool enable) {
#ifdef __MK66FX1M0__
    LMEM_PCCCR = enable ? 0x85000003 : 0;   // re-enable + invalidate, or disable
#else
    (void) enable;
#endif
}

#define BENCH_ALL(type) \
    benchFn(test_loop_no_fastrun_##type, #type,      "no fastrun          ");\
    benchFn(test_loop_fastrun_##type, #type,         "fastrun             ");\
    if(is_k66) {\
        setCodeCache(false);\
        benchFn(test_loop_no_fastrun_##type, #type,      "no fastrun, no cache");\
        benchFn(test_loop_fastrun_##type, #type,         "fastrun, no cache   ");\
        setCodeCache(true);\
    }


void setup() {
    Serial.begin(9600);
    delay(2000);
    bool is_k66 = false;
#ifdef __MK66FX1M0__
    is_k66 = true;
#endif
    BENCH_ALL(int)
    Serial.println();
    BENCH_ALL(double)
}

void loop() {}

Teensy 3.6 @ 180MHz
Duration [int, no fastrun          ]: 234
Duration [int, fastrun             ]: 235
Duration [int, no fastrun, no cache]: 276
Duration [int, fastrun, no cache   ]: 235

Duration [double, no fastrun          ]: 835
Duration [double, fastrun             ]: 890
Duration [double, no fastrun, no cache]: 1780
Duration [double, fastrun, no cache   ]: 891


Teensy 3.5 @ 120MHz
Duration [int, no fastrun          ]: 396
Duration [int, fastrun             ]: 352

Duration [double, no fastrun          ]: 1900
Duration [double, fastrun             ]: 1336

Teensy 3.2 @ 96MHz
Duration [int, no fastrun          ]: 497
Duration [int, fastrun             ]: 440

Duration [double, no fastrun          ]: 2005
Duration [double, fastrun             ]: 1672
 
@tni: The code in your posts is so instructive :)

Teensy 3.6 benefits a lot less from FASTRUN (with double, the code size is much bigger and overflows the cache):

Teensy 3.6 @ 180MHz

Duration [double, no fastrun          ]: 835
Duration [double, fastrun             ]: 890
Duration [double, no fastrun, no cache]: 1581

I should code this (and other questions), but I assume T3.6 FASTRUN 'no cache' performance wouldn't suffer much:
Duration [double, fastrun, no cache   ]: 890
 