Preventing caching with DMA transfers: is it necessary? (see post body)

honey_the_codewitch

I'm on a Teensy 4.1 (IMXRT1062)

I've been inspecting the code for *_t3 graphics libraries like SSD1351_t3, and ST7789_t3.
I noticed there is a complicated anti-caching scheme happening wherein multiple buffers are being juggled and memcpy'd. According to (I think?) Paul's comments, this copying is happening to avoid caching.

I've adapted the code to do partial screen updates like LVGL usually expects, and therefore I can fit every transfer within 32KB with no trouble.

Here's the thing. I was running into issues using Paul's scheme because I couldn't get the ISR handler chaining to cut off the transfer at the appropriate number of bytes in all cases.

So I eliminated the anti-caching scheme altogether just to see if it would still work, and stressed it with a fire demo I made to blast it at the SPI rate limits (70MHz for the ST7789 I have). Now the interrupt handler just fires the completion code: no continuation necessary, no memcpys, nothing like that.

There's plenty of code, so I'll link to the testbed I checked in so you folks can peruse it. Here's the relevant source file. You'll see it's largely the same as Paul's code except the DMA bits have been dramatically simplified.
https://github.com/codewitch-honey-...pi_driver_t4/src/source/lcd_spi_driver_t4.cpp

One major difference here is that when I use the above code I am not using malloc to allocate the transfer buffers. They are static arrays. But everything seems to work. I'm not getting stale image data or anything.

Questions:

1. Do I not need the anti-caching scheme because those transfer buffers are maybe being created in the fast 128KB region instead of being allocated on the general heap? Is that what's likely happening above? If not, then why does it work?

2. Is it safe? Is this something intermittent that might not be showing up for me, but likely will depending on the circumstances? If so, is there a way I can encourage the problem to show itself?

3. Am I right that there's in essence a 32KB limit on the transfer size? If I implement chaining could that potentially introduce caching issues?

4. I've heard there's a way to disable caching on major NXP chips for memory set aside for DMA transfers. I'm not sure about the 1062 specifically, but it seems to me that would be a lot more efficient than doing memcpys between multiple buffers, and a lot less sketchy too. Is there a reason for the memcpy scheme that I'm missing?

I've included the ILI9341_t3 code under ./lib so you can see Paul's code for the chaining and anti-caching scheme I'm talking about.

I'd just like some details because scanning TRMs gives me a headache in my eye and I don't usually find what I'm looking for anyway. Some people have a gift for that sort of sifting. I'm comically bad at it.
 
DTCM isn't cached. The easy answer is to allocate your buffers in DTCM (the default for all static and global variables without any special keywords), but then the downside is you'll be consuming that precious fast memory which everything else also wants.

If you use other memory, the answer is to call arm_dcache_flush(mybuffer, sizeof(mybuffer)) after you've written the last data but before you start DMA. It's defined in imxrt.h, if you want to look at the details.


You probably also want to use __attribute__ ((aligned(32))) on your buffer so it's aligned to a 32-byte cache row, and make its total size a multiple of 32 bytes. Here's an example.
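A minimal sketch of such a buffer, assuming it is placed in cached RAM with the Teensy core's DMAMEM keyword. The buffer dimensions and the names round_up_32, tx_buffer and fill_and_flush are hypothetical, and the #else branch only provides do-nothing stand-ins so the sketch also compiles off-target.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__IMXRT1062__)
#include <Arduino.h>  // Teensy 4.x core: provides DMAMEM and arm_dcache_flush()
#else
// Host-side stand-ins (assumptions for illustration only); they do nothing.
#define DMAMEM
static inline void arm_dcache_flush(void *, uint32_t) {}
#endif

// Round a byte count up to the next multiple of the 32-byte cache row.
constexpr size_t round_up_32(size_t n) {
    return (n + 31u) & ~static_cast<size_t>(31u);
}

// Hypothetical partial-update buffer: 240 pixels x 20 rows x 2 bytes per pixel.
constexpr size_t kRawBytes = 240u * 20u * 2u;
constexpr size_t kBufBytes = round_up_32(kRawBytes);

// Buffer outside DTCM, aligned to a cache row, size a multiple of 32 bytes.
static DMAMEM uint8_t tx_buffer[kBufBytes] __attribute__((aligned(32)));

void fill_and_flush(uint8_t value) {
    // Write the pixel data first...
    for (size_t i = 0; i < kRawBytes; ++i) {
        tx_buffer[i] = value;
    }
    // ...then push it out of the data cache before DMA reads the RAM.
    arm_dcache_flush(tx_buffer, sizeof(tx_buffer));
    // The DMA transfer would be started here.
}
```

The flush goes after the last write and before the transfer starts, matching the order described above.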


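But aligning is really only critical when you use the cache functions that delete (discard cached data without writing it back), such as arm_dcache_delete(). Those work on whole 32-byte rows, so an unaligned buffer can take a neighboring variable's unwritten data with it. Normally that's not done with display buffers, as you want subsequent writes to the buffer to enjoy caching speed as much as possible. When you only flush, aligning just gives you a small speed boost by not having to flush 1 more row.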
 
3. Am I right that there's in essence a 32KB limit on the transfer size? If I implement chaining could that potentially introduce caching issues?

To answer this specific question, the 32K limit is on the number of iterations of the major loop. So if you configure the minor loop to move just 1 byte, then you have a 32KB limit. But usually you would have the minor loop move at least 4 bytes, because moving just 1 byte on a 32-bit bus means you're using only 8 of the 32 wires! It can be configured to move quite a lot per minor loop iteration. The DMA engine offers an amazing number of features, which can be pretty awesome, but also a bit daunting to choose among. Like so many things, it's best to start simple and leverage initial success to explore the more complex ways.
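The arithmetic can be sketched as follows, assuming the major-loop counter is 15 bits wide (at most 32767 iterations, with channel linking disabled). The function names and the 240 x 320, 16-bit frame size are hypothetical, chosen only to illustrate the numbers.

```cpp
#include <cstdint>

// Assumption: 15-bit major-loop iteration counter, so at most 32767 iterations.
constexpr uint32_t kMaxMajorIterations = 32767;

// Largest transfer one major loop can cover for a given minor-loop size.
constexpr uint32_t max_bytes_per_transfer(uint32_t bytes_per_minor_loop) {
    return kMaxMajorIterations * bytes_per_minor_loop;
}

// Number of separate (or chained) transfers needed for total_bytes.
constexpr uint32_t transfers_needed(uint32_t total_bytes,
                                    uint32_t bytes_per_minor_loop) {
    return (total_bytes + max_bytes_per_transfer(bytes_per_minor_loop) - 1) /
           max_bytes_per_transfer(bytes_per_minor_loop);
}

// Hypothetical full frame: 240 x 320 pixels at 2 bytes per pixel.
constexpr uint32_t kFrameBytes = 240u * 320u * 2u;

static_assert(max_bytes_per_transfer(1) == 32767, "1-byte minor loop");
static_assert(max_bytes_per_transfer(4) == 131068, "4-byte minor loop");
static_assert(transfers_needed(kFrameBytes, 1) == 5, "full frame, 1-byte minor loop");
static_assert(transfers_needed(kFrameBytes, 4) == 2, "full frame, 4-byte minor loop");
```

Whether a wider minor loop suits a given peripheral depends on how that peripheral issues its DMA requests, so treat this as sizing arithmetic only.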
 
In case anyone finds this thread later, here's what I've found specifically in my tests.

Paul basically said as much, but I'll just hit the highlights:

You can use DTCM memory (by declaring a static array) and everything works great without calling any sort of cache flushing method.

You can use other memory, but you have to call arm_dcache_flush() on it after the last write and before starting the DMA transfer, as Paul said.
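Those two cases could be folded into one helper that flushes only when the buffer lies outside DTCM. This is a rough sketch: the 0x20200000 boundary is an assumption about the Teensy 4.x memory map (DTCM starting at 0x20000000, cached RAM from 0x20200000 upward) and should be checked against the linker script. The names needs_cache_flush and prepare_for_dma are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__IMXRT1062__)
#include <Arduino.h>  // Teensy 4.x core: provides arm_dcache_flush()
#else
// Host-side stand-in (assumption for illustration only); it does nothing.
static inline void arm_dcache_flush(void *, uint32_t) {}
#endif

// Assumed start of cached memory on Teensy 4.x; DTCM sits below this address.
constexpr uintptr_t kCachedMemoryStart = 0x20200000u;

// True when a buffer at this address would need a flush before DMA reads it.
constexpr bool needs_cache_flush(uintptr_t address) {
    return address >= kCachedMemoryStart;
}

// Call after the last write to the buffer and before starting the transfer.
void prepare_for_dma(void *buffer, size_t length) {
    if (needs_cache_flush(reinterpret_cast<uintptr_t>(buffer))) {
        arm_dcache_flush(buffer, static_cast<uint32_t>(length));
    }
}
```

With a helper like this, the same driver code can accept either a static DTCM array or a buffer allocated elsewhere.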
 