I don't think you give the CM7 the credit it deserves regarding branch prediction, which, coupled with the dual-issue pipeline, enables zero-cost loops. Not sure how you managed to miss it while reading those articles. Fixed size...
The looping adds extra work for the processor, but it doesn't necessarily add extra cost because of ILP (instruction-level parallelism). There's a difference, and cost is what we care about, not work. Here's a good explanation of what that means:...
Teensy 4 seems to have branch prediction and also a much bigger cache than, for example, Teensy LC. If you expand the code size by unrolling, the MCU has to fetch code from flash more frequently, slowing things...
Yeah, but that statement has no side effect (i.e. it does nothing). It should be "max=std::max(max, 10.0f);" to clamp values below 10.0f up to 10.0f.
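To make the difference concrete, here's a minimal sketch. The function names ClampPeakBroken and ClampPeakFixed are made up for illustration and are not from the code under discussion; only the std::max call and the 10.0f floor come from the comment above.

```cpp
#include <algorithm>

// Hypothetical helper mirroring the buggy statement: std::max returns the
// larger value, but nothing stores it, so `max` is left unchanged.
float ClampPeakBroken(float max) {
    // The original statement was the bare call; the cast to void is added
    // here only so compilers that mark std::max [[nodiscard]] stay quiet.
    static_cast<void>(std::max(max, 10.0f));
    return max;
}

// Hypothetical helper with the fix: assign the result back to `max`.
float ClampPeakFixed(float max) {
    max = std::max(max, 10.0f);  // values below 10.0f are raised to 10.0f
    return max;
}
```

With an input of 3.0f the first version hands back 3.0f untouched, while the second returns 10.0f. Inputs already above 10.0f pass through both unchanged.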
For unrolling, did you have a fixed loop count for your test or did you use...
OK, there's a bug in the code: you should assign the std::max() result to max.
Depending on the architecture, loop unrolling may actually hurt performance in terms of instruction cache misses and because some...
You could remove the second loop in NormalizeSignal() and just return the max, then perform the normalization while calculating the abs delta in ComputeManhattan().
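A rough sketch of that idea, under several assumptions: I don't know the real signatures, so FindPeak (standing in for a NormalizeSignal() that only returns the max) and ComputeManhattanNormalized are hypothetical names, the signals are assumed to be float vectors, and normalization is assumed to mean dividing by the peak magnitude, floored at 10.0f.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for NormalizeSignal() with its second loop removed:
// one pass to find the peak magnitude, floored at 10.0f (assumed clamp).
float FindPeak(const std::vector<float>& signal) {
    float peak = 0.0f;
    for (float s : signal) {
        peak = std::max(peak, std::fabs(s));
    }
    return std::max(peak, 10.0f);
}

// Hypothetical fused version of ComputeManhattan(): each sample is scaled
// by its signal's peak inside the same loop that sums the abs deltas.
float ComputeManhattanNormalized(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    const float inv_a = 1.0f / FindPeak(a);  // one divide per signal
    const float inv_b = 1.0f / FindPeak(b);
    const std::size_t n = std::min(a.size(), b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::fabs(a[i] * inv_a - b[i] * inv_b);
    }
    return sum;
}
```

This saves one full write pass over each signal and leaves the input buffers unmodified. Whether it's actually faster on your target is something to measure rather than assume.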