Maximizing External RAM DMA Speed

jdnle

Hello, I've been working on a direct 40-pin display driver for the Teensy 4.1 for a while now. I've made major progress, but the main bottleneck I've faced is memory speed. I'm trying to drive a 480 * 800 pixel screen with 16-bit color, and the frame buffer (750 KB) is over the size of memory that can fit in RAM2 (512 KB, DMA accessible). One option is to split the buffer across both RAM regions and have interrupts copy the necessary data into the DMA buffer, but it'd be much cleaner (and use no CPU) if I could put the frame buffer in an external PSRAM chip.

I'm trying to drive the screen in RGB mode (HSYNC, VSYNC, DATA_ENABLE, CLK, and data pins) at 30 fps. To minimize CPU usage, I'm using the LCDIF module in the chip to do all the timing. As far as I can tell, not a single mainstream Teensy 4.1 project has used the module, so figuring out how to use it correctly was a nightmare. But it works, and I can answer questions if anyone else wants to use it for generating RGB mode timing signals.

I need 800 * 480 * 2 bytes * 30 fps = ~23 MB/s of DMA throughput. It looks like the main PSRAM chip PJRC provides can clock up to 132 MHz, so I'd assume it could handle the throughput I need. However, when using PSRAM as the source, I get a glitchy screen. With a tiny frame buffer located in DMARAM it works just fine, but if I want a full-size buffer in external RAM (or just 0x70000000), it's not fast enough. I'm not sure if this is a latency problem or if FLEXSPI's DMA interface just can't go that fast, but I'd assume it can with the right settings.
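As a sanity check on the numbers above, here is a minimal sketch of the arithmetic in plain C++. The constant and function names are made up for illustration; only the panel size, color depth, frame rate, and the 512 KB RAM2 size come from the thread.

```cpp
#include <cstdint>

// Panel geometry and refresh rate taken from the post (names are hypothetical).
constexpr uint32_t kWidth = 800;
constexpr uint32_t kHeight = 480;
constexpr uint32_t kBytesPerPixel = 2;      // RGB565
constexpr uint32_t kFps = 30;
constexpr uint32_t kRam2Bytes = 512 * 1024; // RAM2 on the Teensy 4.1

// Size of one full frame in bytes.
constexpr uint32_t frameBytes() { return kWidth * kHeight * kBytesPerPixel; }

// Average number of bytes per second the DMA source has to deliver.
constexpr uint32_t bytesPerSecond() { return frameBytes() * kFps; }

static_assert(frameBytes() == 768000, "one frame is 750 KB");
static_assert(frameBytes() > kRam2Bytes, "one frame does not fit in RAM2 alone");
static_assert(bytesPerSecond() == 23040000, "roughly 23 MB/s on average");
```

Note that this is only the average rate. During the active part of each line the source has to keep up with the pixel clock (2 bytes per clock), so the peak rate is higher than the average, and per-access latency probably matters as much as raw bandwidth.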

There also appear to be multiple DMA interfaces for the FLEXSPI module so I'm not sure where to start.

So far the only thing I've found is a snippet to increase the clock speed (which offers some improvement, but not enough).
C++:
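// Meant to gate the FLEXSPI2 clock off during the change.
// Note: CCM_CCGR_OFF is 0, so as written this line clears no bits.
CCM_CCGR7 &= ~CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
    | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2); // CLK_SEL(2) = PLL3 PFD1 (~664.6 MHz), / 5 = ~132 MHz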
CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);

What settings can I use to maximize DMA transfer speed from the external PSRAM chip?

@PaulStoffregen you seem to be an expert on these types of things, any ideas?

The following are just tips and techniques I've learned along the way with the project:

Since most of the data pins of the LCDIF module aren't routed, it's more or less impossible to use it for the color data. Instead, I use DMA channels to manage the color data. I have the DATA_ENABLE (pin 12) and CLK (pin 10) signals routed back into two other pins to trigger DMA transfers. The main color-data channel triggers on the CLK rising edge and copies color data to GPIO1. Two other DMA channels enable and disable the color-data DMA channel on the rising and falling edge of the DATA_ENABLE signal, respectively.

A major hurdle was figuring out how to store the buffers efficiently. I made sure my 16 color data pins (RGB565) were the top 16 bits of GPIO1 for convenience. As far as I can tell, DMA can only access the GPIO1 register using a 32-bit transfer, but I wanted to store my data as 16-bit values and not pad every single color sample. After many hours of scouring the manual and frantically changing registers, I found an answer. I'm sure I'm not the first person to have figured this technique out, but it may be useful in many other DMA applications.

C++:
  frameDMA.begin();
  // This is the secret sauce. The minor loop reads two 16-bit values from FRAMEBUFFER and writes them to GPIO1 as one 32-bit value.
  // Setting bit 31 enables an offset that is applied to the source address after the minor loop is complete (data is copied to GPIO1). [Note 1]
  // In this case, our offset is -2. So after we complete the minor loop and copy 4 bytes of data, we shift our source address back by 2 bytes.
  // Due to the way GPIO1 is configured, the extra data on the lower bits has no effect, since the other pins are configured to be on GPIO6.
  // A side effect of this method is that the length of FRAMEBUFFER needs to be PIXEL_CNT + 1, as the first 16 bits of the buffer
  // will never be copied to the high bits of GPIO1.

  // Source MLOFFSET = -2, NBYTES = 4
  frameDMA.TCD->NBYTES_MLOFFYES = (1 << 31) | ((((1 << 20) - 1) & (-2)) << 10) | 4;
  // source is our frame buffer
  frameDMA.TCD->SADDR = FRAMEBUFFER;
  // minor loop will offset by 2 bytes to create the 32-bit value [Note 2]
  frameDMA.TCD->SOFF = 2;
  // 16-bit source
  frameDMA.TCD->ATTR_SRC = 1;
  // when the major loop is complete, shift source address to reset it
  frameDMA.TCD->SLAST = -sizeof(uint16_t) * (PIXEL_CNT + 1);
  // we have PIXEL_CNT total transfers per major loop
  frameDMA.TCD->BITER = frameDMA.TCD->CITER = PIXEL_CNT;
  // destination is GPIO1
  frameDMA.TCD->DADDR = &GPIO1_DR;
  // don't offset destination address on transfer
  frameDMA.TCD->DOFF = 0;
  // 32-bit destination
  frameDMA.TCD->ATTR_DST = 2;
  // do nothing to destination when major loop is complete
  frameDMA.TCD->DLASTSGA = 0;
  // trigger on CLK signal from LCDIF
  frameDMA.triggerAtHardwareEvent(DMAMUX_SOURCE_XBAR1_0);
  frameDMA.enable();

Note 1: this only applies when minor loop mapping is enabled in the DMA control register (the EMLM flag, as far as I can tell). However, it is set by default by the DMAChannel.h library.
Note 2: Ideally we'd use SOFF = 0 with an MLOFFSET of 2, but the DMA controller reports an error with that configuration.
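To make the overlapping-read trick easier to follow, here is a small host-side model in plain C++. It is only a sketch: `encodeNbytesMloff` and `simulateFrame` are hypothetical helper names, not part of DMAChannel.h, and the model assumes the two 16-bit reads are packed little-endian into the 32-bit write (which matches the remark above that the first buffer entry never reaches the high bits).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the same value as the NBYTES_MLOFFYES line in the snippet:
// bit 31 = source minor-loop offset enable, bits 29..10 = 20-bit signed offset,
// bits 9..0 = bytes per minor loop.
constexpr uint32_t encodeNbytesMloff(int32_t offset, uint32_t nbytes) {
    return (1u << 31)
         | ((static_cast<uint32_t>(offset) & 0xFFFFFu) << 10)
         | (nbytes & 0x3FFu);
}

// Software model of one major loop: every minor loop reads buf[i] and buf[i + 1]
// (SOFF = 2), emits one 32-bit write, then the -2 offset makes the next minor
// loop start at buf[i + 1].
std::vector<uint32_t> simulateFrame(const std::vector<uint16_t>& buf) {
    std::vector<uint32_t> writes;
    for (std::size_t i = 0; i + 1 < buf.size(); ++i) {
        const uint32_t low = buf[i];       // lands on the low 16 bits (ignored pins)
        const uint32_t high = buf[i + 1];  // lands on the high 16 bits (color pins)
        writes.push_back((high << 16) | low);
    }
    return writes;
}
```

With a buffer of PIXEL_CNT + 1 entries the model produces PIXEL_CNT writes, and the pixel that actually shows up on the color pins for write i is entry i + 1. That is why the buffer needs one extra leading entry.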
 
How much more space do you need in RAM2 to fit the framebuffer? With some tweaks to the linker script it's possible to reallocate space from RAM1 to RAM2...
 
How would I go about doing that? I've tried tinkering with values in imxrt1062_t41.ld, but I have no experience editing linker scripts. After some tinkering I got the IOMUXC_GPR_GPR17 register to report that I've allocated memory banks to OCRAM, but I have no idea how to unify it with RAM2.

Code:
/* configure 9 banks as OCRAM to fit the framebuffer */
_flexram_bank_config = 0x55556AAA | ((1 << (_itcm_block_count * 2)) - 1);
/* move the stack pointer back by 9 banks so it starts in DTCM not OCRAM */
_estack = ORIGIN(DTCM) + ((16 - _itcm_block_count - 9) << 15);
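For anyone checking the bank-config constant by hand: each of the 16 FlexRAM banks takes two bits in the config word (00 = unused, 01 = OCRAM, 10 = DTCM, 11 = ITCM). The following is a rough sketch in plain C++ that decodes the value used above; the helper names are invented for illustration, and the ITCM block count is passed in as a parameter because the real value comes from the linker.

```cpp
#include <cstdint>

// Two-bit bank types used in the FlexRAM bank configuration word.
enum BankType : uint32_t { kUnused = 0, kOcram = 1, kDtcm = 2, kItcm = 3 };

// Same expression as the linker script line, with the ITCM block count as an argument.
inline uint32_t bankConfig(uint32_t itcmBlocks) {
    return 0x55556AAAu | ((1u << (itcmBlocks * 2)) - 1u);
}

// Counts how many of the 16 banks are assigned to the given type.
inline int countBanks(uint32_t config, BankType type) {
    int count = 0;
    for (int bank = 0; bank < 16; ++bank) {
        if (((config >> (bank * 2)) & 0x3u) == type) {
            ++count;
        }
    }
    return count;
}
```

With no ITCM blocks, 0x55556AAA decodes to 9 OCRAM banks and 7 DTCM banks; each ITCM block then converts one of the low DTCM banks to ITCM and leaves the 9 OCRAM banks (288 KB) untouched.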
 
Solved it through intense guesswork. I modified the MEMORY section to the following:
Code:
MEMORY
{
    ITCM (rwx):  ORIGIN = 0x00000000, LENGTH = 224K
    DTCM (rwx):  ORIGIN = 0x20000000, LENGTH = 224K
    OCRAM (rwx): ORIGIN = 0x20280000, LENGTH = 288K
    RAM (rwx):   ORIGIN = 0x20200000, LENGTH = 800K
    FLASH (rwx): ORIGIN = 0x60000000, LENGTH = 7936K
    ERAM (rwx):  ORIGIN = 0x70000000, LENGTH = 16384K
}
I just assumed a region called OCRAM existed, so I defined it to start 512 KB past RAM's start address, which in theory makes the two a contiguous region. It failed to compile with my buffer length, so I increased the length of RAM to 800K. However, I'm using the Arduino IDE, and "teensy_post_compile" gets extremely upset that I'm using more than 512 KB of RAM. That's not a big issue, since all it does is cancel the automatic upload and I can upload manually, but is there any way to make teensy_post_compile aware of the configuration? Also, does anyone know a way to use a linker file on a per-project basis in the Arduino IDE? (Right now I just went into the core directory and edited it manually.)
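The layout arithmetic behind that MEMORY block can be checked with a few lines of plain C++. This is just a sketch using the addresses and sizes quoted in the thread; the constant names are hypothetical, and the frame buffer size includes the extra 16-bit entry needed by the DMA offset trick.

```cpp
#include <cstdint>

constexpr uint32_t kRamOrigin   = 0x20200000;           // RAM (RAM2) origin from the MEMORY block
constexpr uint32_t kRam2Bytes   = 512 * 1024;           // stock RAM2 size
constexpr uint32_t kOcramOrigin = 0x20280000;           // OCRAM origin from the MEMORY block
constexpr uint32_t kOcramBytes  = 9 * 32 * 1024;        // 9 FlexRAM banks of 32 KB = 288 KB
constexpr uint32_t kPixelCount  = 800 * 480;
constexpr uint32_t kFrameBytes  = (kPixelCount + 1) * 2; // RGB565 plus one extra entry

// True when OCRAM starts exactly where the stock RAM2 region ends.
constexpr bool regionsContiguous() { return kRamOrigin + kRam2Bytes == kOcramOrigin; }

// Size of the merged region, matching the 800K length given to RAM.
constexpr uint32_t mergedBytes() { return kRam2Bytes + kOcramBytes; }

// Bytes left over in the merged region after placing one frame buffer.
constexpr uint32_t spareBytes() { return mergedBytes() - kFrameBytes; }

static_assert(regionsContiguous(), "OCRAM follows RAM2 without a gap");
static_assert(mergedBytes() == 800 * 1024, "merged region is 800 KB");
static_assert(kFrameBytes <= mergedBytes(), "frame buffer fits in the merged region");
```

Only about 50 KB is left in the merged region after the frame buffer, so anything else that normally lives in RAM2 has to fit in that remainder.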

Now it all works and I was able to push the LCD to 40fps!
 