Performance of unaligned reads from program memory

Status
Not open for further replies.

JarkkoL

Well-known member
Hi,

What's the performance hit of unaligned memory reads from program memory on Teensy? I'm porting some code to Cortex-M0+, which simply crashes upon unaligned reads, so I have to write this code without unaligned reads, which imposes some performance overhead due to the double fetch from program memory. So I'm just wondering if I should use this same code on Teensy as well if there's significant overhead upon unaligned reads, or keep the unaligned-read implementation if it's more optimal. This is a performance-intensive part of the code, thus I'm interested.

Thanks, Jarkko
 
That might depend on how the unaligned reads come about and what solution works on the Teensy LC. If it's just reading a word (16-bit or 32-bit?), or part of a longer read, the workaround might be efficient or ugly.
 
The performance impact is minimal.

However, unaligned access is only supported when using the normal 16 ARM core registers. Loads and stores that target the FPU registers do not support unaligned addresses, so you need to be careful with 32-bit float on Teensy 3.5 & 3.6. An unaligned access to a float will typically crash on those FPU-equipped chips, just like integer access does on Cortex-M0+.

Unaligned access is also not supported across the 0x1FFFFFFF to 0x20000000 barrier. The M4 core has 2 separate buses and the RAM is implemented as 2 physical banks. Unaligned access only works when accessing the same memory bank over the same physical bus. If you attempt an unaligned access across this barrier, your code will crash, just like integers do on Cortex-M0+. Because of this special barrier, unaligned access is not recommended unless you take special precautions.

However, flash memory will always be in the 0x00000000 to 0x1FFFFFFF range. The flash memory does use wait states, but it always has a small cache which is fed by a wide bus. Teensy 3.6 also has a larger 8K cache. On some chips the flash bus is 8 bytes, on others 16 bytes. So if your unaligned access stays within 1 flash bus width, you'll suffer at most 1 flash latency. The 2nd access will come from the cache.

Even if you suffer 2 cache misses, it's still not many cycles. For most ordinary programming, it's probably not worth worrying.

But obviously aligned access is faster. If you're implementing something like a FIR filter or correlation/convolution against fixed patterns (stored in flash), running at a fast sample rate, where you need to read hundreds of words from a table for every sample, then you'll probably care. For that sort of programming, you'll probably also want to structure your code to leverage some of the other special hardware features.

The main thing you'd want to do is leverage the M4's special memory burst optimization. Normally a load or store takes 2 cycles. But on M4, if you have 2 or more "related" load or store instructions in a row, only the first takes 2 cycles; the rest are single cycle. You can see this used in the audio library in several places, like the RMS analysis.

Specifically for the flash memory you asked about, you'd probably want to align your data not only to 32-bit words, but to the actual flash bus width, so you can use the M4 burst optimization and suffer the flash latency only once for a burst of 2 or 4 words. There are a number of other tricks, like the DSP extensions (also used in the audio lib), which can really speed up signal processing code.

However, for normal programming, these sorts of intense optimizations take a lot of work that's rarely worth the effort. Usually the effort is better spent avoiding if-else (assuming you can craft an expression which handles all cases) or finding ways to inline small functions. The M4 has no branch prediction, so avoiding branches is usually the low-hanging optimization fruit. Unaligned access overhead is small, so if you have some reason not to keep things aligned in flash, it's rarely a big deal.

Just remember that the chips with FPU don't actually support unaligned access for floats, and if you apply this to RAM, be aware of that barrier between the RAM banks. You must align in those cases.
 
Thanks for the detailed answer Paul!
I was mainly worried that splitting the read in two would always issue two fetches over the bus. It's good to know Teensy actually has a flash cache. This is for a performance-intensive loop and something I'd like to have well optimized.

Cheers, Jarkko
 