Help me optimise this code

Sure:

Code:
float CrossCorrelation::ComputeHead(int sensorID) {

    float max = 0.0f;

    //unrolled loop: the "0" index is incremented in each repeated copy, as in a for loop
    m_targetSignal[0] = m_dataStorage.headSimilaritySignal.at(sensorID).peek(0);
    max = (m_targetSignal[0] > max) ? m_targetSignal[0] : max;
    //code above repeats N times

    std::max(max, 10.0f); //signals below 10 are not measured.

    max = 1.0f/max;
    float result = 0.0f;
    
    //unrolled loop
    float temp = (m_targetSignal[0] *= max) - m_templateSignal[0];
    result += (std::fabs(temp));
    //code above repeats N times

    return result;
}
 
However, a big part of the optimisation is that I only process every other sensor reading, since the requirements for this calculation are not that demanding. So one time I process one sensor pair, and the next time I process the other. I read two sensors at a time, and the processing happens on the previous sensor pair while waiting for the next pair of sensor readings to finish.
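Roughly like this (just a sketch of the scheduling; startRead/readDone/process and NUM_PAIRS are made-up names, not my actual code):
Code:
constexpr int NUM_PAIRS = 2;
void startRead(int pair);   // begin a non-blocking conversion for one sensor pair
bool readDone(int pair);    // true once that conversion has finished
void process(int pair);     // run the correlation on the stored samples

void run() {
    int current = 0;
    startRead(current);                // kick off the first conversion
    for (;;) {
        while (!readDone(current)) {}  // wait for this pair to finish converting
        int finished = current;
        current = (current + 1) % NUM_PAIRS;
        startRead(current);            // next pair converts in the background...
        process(finished);             // ...while we process the finished pair
    }
}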
 
Ok, there's a bug in the code: you should assign the std::max() result to max.

Depending on the architecture, the loop unrolling may actually hurt performance in terms of instruction cache misses, and because some architectures have branch prediction that makes loop branching practically free. Not sure in your case with your target MCU, just speaking from a CPU developer background :) If you have a compile-time constant loop count (as you do, since you unroll manually by simply copying the code rather than using something like Duff's device), it can be beneficial to leave the loop unrolling to the compiler, which can do it to the "optimal degree" for a given target. Anyway, it's probably best to check the generated assembly to see if there are further optimizations, but it looks good to me :)
 

You mean: "std::max(max, 10.0f)"?
That's intentional for my use case. I don't care about max values below 80, and 10 is just noise.

I used a regular for loop but it was way slower so I unrolled it manually. I haven’t gotten to the GCC #pragma way of doing it yet because I’m too lazy to change it 😝
 
You mean: "std::max(max, 10.0f)"?
That's intentional for my use case. I don't care about max values below 80, and 10 is just noise.

Yeah, but that statement has no side effect (i.e. it does nothing). It should be "max = std::max(max, 10.0f);" to clamp values below 10.0f to 10.0f.

For the unrolling test, did you use a fixed loop count, or the "i < m_signalLength" condition as in your original code? If you use "i < N" instead (where N is an integer literal) to match your unrolling, the compiler can do the unrolling itself, because N is known at compile time.
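E.g. something like this (just a sketch; 64 is a placeholder for your real signal length):
Code:
constexpr int N = 64;  // placeholder - must match the actual signal length
float result = 0.0f;
// the loop bound is a compile-time constant, so the compiler is free to unroll this
for (int i = 0; i < N; ++i) {
    float temp = (m_targetSignal[i] *= max) - m_templateSignal[i];
    result += std::fabs(temp);
}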
 
Ah, I missed that. Thanks, will fix. It doesn't really do anything because the signals are usually in the thousands. It's just there to avoid dividing by zero, though that's pretty much impossible anyway.
 
Depending on the architecture, the loop unrolling may actually hurt performance in terms of instruction cache misses, and because some architectures have branch prediction that makes loop branching practically free. Not sure in your case with your target MCU, just speaking from a CPU developer background :)
Well, this is a forum for Teensy, and I would guess that it will run on a T4. (BTW, the architecture is also known ;-), and it has no "zero-overhead loops" like the ESP32.)
Frankzappa, don't get confused! Normally the program runs in the lower RAM, and that is exactly as fast as a cache would be. In fact, there is no cache for this case.
At most it would matter on a T3, but there the instruction cache is relatively large, at least large enough for your inner loops with 2x or 4x unrolling.
The data, yes, should not be in the upper RAM (heap) if possible. That is slower, and is actually cached.
Beyond that, just ignore instruction caches. That only makes things complicated and, as said, plays no role on the T4. Hardly on the T3 either. (Unless you do "bad" things on the T4 and execute the code directly from flash.)
 

I just change something and measure the elapsed micros to see which is faster. The unrolled version was twice as fast so I kept it. It doesn't matter what anyone says: if it's faster, it's faster.

However, the code before was written in a flexible way and could take an array of any size; now that it's unrolled it only works with a single size.

Maybe I should try again with the for loop but with the size known at compile time.

I got the Teensy 4 so I don't have to care so much about performance and can just focus on good coding practices.
 
Well, this is a forum for Teensy, and I would guess that it will run on a T4. (BTW, the architecture is also known ;-), and it has no "zero-overhead loops" like the ESP32.)
Teensy 4 seems to have branch prediction and also a much bigger cache than Teensy LC, for example. If you expand the code size by unrolling, the MCU has to fetch the code from flash more frequently, slowing things down due to memory latency. It obviously depends a lot on the case what the performance impact is, e.g. what code other than the loop you are running, whether it ends up evicting the loop code from the cache, and what the size of the loop code is. Instead of trying to analyze the code, it's better just to try it out and profile, but you need to be aware of these things to understand the factors that influence code performance and know what to try. I don't know the details of T4 branch prediction, but correctly predicted branches can have zero cost because of pipelining, and that would be easy to achieve with a fixed-size for loop. The performance benefit of keeping the loop, when it costs nothing, is that there's less code to fetch from flash.

I just change something and measure the elapsed micros to see which is faster. The unrolled version was twice as fast so I kept it. It doesn't matter what anyone says: if it's faster, it's faster.
Yeah, but it sounds like you were comparing apples to oranges. If you have a fixed-size loop, the compiler can actually decide to (partially) unroll the loop, knowing the loop size and the architecture it's targeting. For example, if you use a fixed loop size of 64 in your for-loop condition, the compiler could end up doing 16 iterations and unrolling the loop 4 times, knowing the cache size, the branch prediction capabilities and the compiled loop code size. But if you give the loop size as m_signalLength, the compiler doesn't know the size, because it's defined at run time, and you rob the compiler of the chance to do that optimization. I'm not saying the compiler does the unrolling automatically for a fixed-size loop, but at least it could; the best is just to try it out :)
 
T_4.x by default runs all code from RAM1 ITCM at processor speed - unless code is marked 'FLASHMEM' to keep it in flash, it never needs/gets cached - the 32KB instruction cache is only for flash-based code.
The T_4.x's 1062 MCU can also dual-issue instructions that don't 'conflict', so it also has a better set of prediction and prep.

There is also a free-running ARM_DWT_CYCCNT that counts the clock ticks between calls for short/precision use - so at 600 MHz it has better resolution than microseconds, and is also ~10X faster to read (since micros() uses ARM_DWT_CYCCNT too and does math from the last millis systick to get the us value, taking ~36 clocks).
This measures the time between calls of an _isr:
Code:
uint32_t now = ARM_DWT_CYCCNT;  // last and cyclesPulse are uint32_t variables kept across calls
cyclesPulse = now - last;
last = now;
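E.g. to time the function in question (cc here is just a stand-in for whatever instance holds it):
Code:
uint32_t start = ARM_DWT_CYCCNT;
volatile float r = cc.ComputeHead(0);  // volatile keeps the result from being optimized away
uint32_t cycles = ARM_DWT_CYCCNT - start;
// at 600 MHz: elapsed microseconds = cycles / 600.0f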
 
Teensy 4 seems to have branch prediction and also a much bigger cache than Teensy LC, for example.
Even with branch prediction, a register must be incremented or decremented and a comparison/jump must be performed. This is not without cost. (The jump takes 1 cycle if the prediction is correct, plus 1 cycle minimum for the inc/dec, and that only if register pressure is low enough that the loop counter can be held in a register.)
Normally, the code is executed in RAM, which is as fast as a cache would be. But in this case there is not even a cache; why would there be, it would be superfluous.
If you expand the code size by unrolling, the MCU has to fetch the code from flash more frequently
Again, it still runs from RAM :) It does not fetch it from flash - that happens once, at startup.
Yeah, but it sounds like you were comparing apples to oranges. If you have a fixed-size loop, the compiler can actually decide to (partially) unroll the loop, knowing the loop size and the architecture it's targeting.
The guy who writes the code knows it, too. In addition, you can always force unrolling by two if you know the count is even. The #pragma I mentioned takes the number.
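For reference, it looks like this (GCC 8 or newer; the loop body is just an example):
Code:
#pragma GCC unroll 2  // ask GCC to unroll the following loop by two
for (int i = 0; i < m_signalLength; ++i) {
    result += std::fabs((m_targetSignal[i] *= max) - m_templateSignal[i]);
}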

Yes, trying it out is always best.
 
Even with branch prediction, a register must be incremented or decremented and a comparison/jump must be performed. This is not without cost. (The jump takes 1 cycle if the prediction is correct, plus 1 cycle minimum for the inc/dec, and that only if register pressure is low enough that the loop counter can be held in a register.)
The looping adds extra work for the processor, but it doesn't necessarily add extra cost, because of ILP - there's a difference, and it's the cost we care about, not the work. Here's a good explanation of what that means: https://stackoverflow.com/questions...iction-overhead-of-perfectly-predicted-branch

Normally, the code is executed in RAM, which is as fast as a cache would be.
Ok, that's interesting, I didn't know RAM access on the T4 is as fast as cache access, which is quite contrary to modern CPUs. Or that the code is loaded to RAM at startup.

The guy who writes the code knows it, too.
The point was that the information needed for the optimization wasn't provided to the compiler, making it an apples-to-oranges comparison. It would be interesting to know if the compiler would actually unroll the loop automatically for the T4 if the loop size were fixed.
 
...
Ok, that's interesting, I didn't know RAM access on the T4 is as fast as cache access, which is quite contrary to modern CPUs. Or that the code is loaded to RAM at startup.
...

pjrc.com/store/teensy41.html#memory

The note here of 'tightly coupled' was to say RAM1 runs at F_CPU - that is the RAM1 that holds ITCM and DTCM in that 512KB. The other 512KB only clocks at F_CPU/4.
Code:
RAM
1024K of memory is available for variables and data. Half of this memory (RAM1) is accessed as tightly coupled memory for maximum performance. The other half (RAM2) is optimized for access by DMA. Normally large arrays & data buffers are placed in RAM2, to save the ultra-fast RAM1 for normal variables.

Code can be moved/left in FLASH and runs through a 32KB instruction cache.
RAM2 and other DATA areas have their own 32KB data cache.

On startup all 'FASTRUN'/default code is moved to RAM1 as ITCM, and DTCM gets what is left, where the ITCM consumes the RAM in 32KB blocks - thus the padding shown when building. No cache ever sees any of those 512KB. That padding space can be located (I posted code) and used.
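For reference, the placement looks like this (the function/buffer here are just examples, not from the code above):
Code:
FLASHMEM void printBanner() { Serial.println("hi"); }  // rarely used: stays in flash, runs via the 32KB cache
FASTRUN void hotCode() {}                              // explicit, but RAM1/ITCM is already the default
DMAMEM static int16_t bigBuffer[8192];                 // large buffer in RAM2, keeps fast RAM1 free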
 
@JarkkoL
This piece of silicon, quite impressive for a microcontroller, is nevertheless miles away from a modern desktop CPU. It is also helpful to simply compare the orders-of-magnitude lower transistor count or power consumption to see the bigger picture. Bottom line: you'll see that unrolled loops are always faster (or at least equal in speed). Before you generalize and apply things to a microcontroller (and specifically this one), you should take these different performance capabilities into account. There are many interesting articles about the CM7 - just read them. Most of the details can be found somewhere (e.g. also details about the branch prediction, which is very simple) - and I have read many articles about it.
I am tired of participating in these superfluous and pointless discussions. Besides, there is no need to speculate about GCC. It's all documented, and if you can't find a piece of information you can just try it out. If you don't know something, wild speculation based on completely different architectures (and then claiming it's fact) is the worst thing you can do.

Just try it out. Take it as a chance to prove me wrong or to learn.

Have fun, and bye,
I'm out of here, I prefer to use my time wisely.
 
I don't think you give the CM7 the credit it deserves regarding branch prediction, which, coupled with the dual-issue pipe, enables zero-cost loops. Not sure how you managed to miss that while reading those articles. A fixed-size loop in particular is a very simple case for any branch predictor. I don't find it superfluous to build a good mental model of the target architecture and of the capabilities/limitations of compiler optimizations in order to apply good practices for writing high-performance code. That can be quite useful knowledge when you actually optimize code.
 