Hello, I've been working on a direct 40-pin display driver for the Teensy 4.1 for a while now. I've made major progress, but the main bottleneck I've faced is memory speed. I'm trying to drive a 480 * 800 pixel screen with 16-bit color, which needs 768,000 bytes for the frame buffer, a bit more than fits in RAM2 (DMA accessible). One option is to split the buffer across both internal RAM regions and have interrupts copy the necessary data into the DMA buffer, but it'd be much cleaner (and use no CPU) if I could put the frame buffer in an external PSRAM chip.
I'm trying to drive the screen in RGB mode (HSYNC, VSYNC, DATA_ENABLE, CLK, and data pins) at 30 fps. To minimize CPU usage, I'm using the LCDIF module in the chip to do all the timing. As far as I can tell, not a single mainstream Teensy 4.1 project has used the module, so figuring out how to use it correctly was a nightmare. But it works, and I can answer questions if anyone else wants to use it for generating RGB mode timing signals.
I need 800 * 480 * 2 bytes * 30 fps = ~23 MB/s (about 184 Mbit/s) of DMA throughput. It looks like the main PSRAM chip PJRC provides can clock up to 132 MHz, so I'd assume it could handle the throughput I need. However, when using PSRAM as the source, I get a glitchy screen. With a tiny frame buffer located in DMARAM it works just fine, but with a full-size buffer in external RAM (or just at 0x70000000), it's not fast enough. I'm not sure if this is a latency problem or if FLEXSPI's DMA interface just can't go that fast, but I'd assume it can with the right settings.
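To make the arithmetic above easy to re-check for other resolutions, here is a minimal host-side sketch in plain C++ (not Teensy code). The function names are made up for illustration. The 4-bit bus width is an assumption about how the PSRAM is attached (quad SPI), and the "ideal" figure ignores command, address, and latency cycles, so real throughput will be lower.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second the display consumes at a given resolution and frame rate.
constexpr std::uint64_t requiredBytesPerSecond(std::uint32_t width, std::uint32_t height,
                                               std::uint32_t bytesPerPixel, std::uint32_t fps) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel * fps;
}

// Idealised peak of a serial memory bus: busBits transferred every clock,
// with no command/address/latency overhead (an optimistic upper bound).
constexpr std::uint64_t idealBusBytesPerSecond(std::uint64_t clockHz, std::uint32_t busBits) {
    return clockHz * busBits / 8;
}

// Fraction of the idealised bus bandwidth the display would consume.
inline double busUtilisation(std::uint64_t required, std::uint64_t available) {
    assert(available > 0);
    return static_cast<double>(required) / static_cast<double>(available);
}

// 800 x 480, RGB565, 30 fps -> 23,040,000 bytes/s (~23 MB/s).
static_assert(requiredBytesPerSecond(800, 480, 2, 30) == 23040000ULL, "frame traffic");
// ASSUMPTION: 4-bit (quad) data bus at 132 MHz -> 66,000,000 bytes/s ideal peak.
static_assert(idealBusBytesPerSecond(132000000ULL, 4) == 66000000ULL, "ideal bus peak");
```

Under those assumptions the display needs roughly a third of the ideal bus bandwidth, which suggests the limit is more likely per-access overhead than raw clock speed, though that is a guess rather than a measurement.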
There also appear to be multiple DMA interfaces for the FLEXSPI module so I'm not sure where to start.
So far the only thing I've found is a snippet to increase the clock speed (which offers some improvement, but not enough).
C++:
CCM_CCGR7 &= ~CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
    | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2); // 528 / 5 = ~132 MHz
CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);
What settings can I use to maximize DMA transfer speed from the external PSRAM chip?
@PaulStoffregen you seem to be an expert on these types of things, any ideas?
The following are some tips and techniques I've learned along the way with the project:
Since most of the data pins of the LCDIF module aren't routed, it's effectively impossible to use it for the color data. Instead I use DMA channels to manage the color data. I have the DATA_ENABLE (pin 12) and CLK (pin 10) signals routed back into two other pins to trigger DMA transfers. The main color data channel triggers on the CLK rising edge and copies color data to GPIO1. Two other DMA channels enable and disable the color-data DMA channel on the rising and falling edges of the DATA_ENABLE signal, respectively.
A major hurdle was figuring out how to store the buffers efficiently. I made sure my 16 color data pins (RGB565) were the top 16 bits of GPIO1 for convenience. As far as I can tell, DMA can only access the GPIO1 register using a 32-bit transfer, but I wanted to store my data as 16-bit values and not pad every single color sample. After many hours of scouring the manual and frantically changing registers, I found an answer. I'm sure I'm not the first person to have figured this technique out, but it may be useful in many other DMA applications.
C++:
frameDMA.begin();
// This is the secret sauce. The minor loop reads two 16-bit values from FRAMEBUFFER and writes them to GPIO1 as one 32-bit value.
// Setting bit 31 (SMLOE) enables an offset that is applied to the source address after the minor loop is complete (data is copied to GPIO1). [Note 1]
// In this case, our offset is -2. So after we complete the minor loop and copy 4 bytes of data, we shift our source address back by 2 bytes.
// Due to the way GPIO1 is configured, the extra data on the lower bits has no effect, since the other pins are configured to be on GPIO6.
// A side effect of this method is that FRAMEBUFFER needs a length of PIXEL_CNT + 1, as the first 16 bits of the buffer
// will never be copied to the high bits of GPIO1
// Source MLOFFSET = -2 (20-bit field at bits 29..10), NBYTES = 4 (bits 9..0)
frameDMA.TCD->NBYTES_MLOFFYES = (1u << 31) | ((((1u << 20) - 1u) & (uint32_t)(-2)) << 10) | 4u;
// source is our frame buffer
frameDMA.TCD->SADDR = FRAMEBUFFER;
// minor loop will offset by 2 bytes to create the 32-bit value [Note 2]
frameDMA.TCD->SOFF = 2;
// 16-bit source
frameDMA.TCD->ATTR_SRC = 1;
// when the major loop is complete, shift source address to reset it
frameDMA.TCD->SLAST = -(int32_t)(sizeof(uint16_t) * (PIXEL_CNT + 1));
// we have PIXEL_CNT total transfers per major loop
// (CITER/BITER are 15-bit counts when channel linking is off, so PIXEL_CNT must not exceed 32767)
frameDMA.TCD->BITER = frameDMA.TCD->CITER = PIXEL_CNT;
// destination is GPIO1
frameDMA.TCD->DADDR = &GPIO1_DR;
// don't offset destination address on transfer
frameDMA.TCD->DOFF = 0;
// 32-bit destination
frameDMA.TCD->ATTR_DST = 2;
// do nothing to destination when major loop is complete
frameDMA.TCD->DLASTSGA = 0;
// trigger on CLK signal from LCDIF
frameDMA.triggerAtHardwareEvent(DMAMUX_SOURCE_XBAR1_0);
frameDMA.enable();
Note 1: The minor loop offset only takes effect when minor loop mapping is enabled in the DMA control register (the EMLM bit). The DMAChannel.h library sets this by default.
Note 2: Ideally we'd use SOFF = 0 with an MLOFFSET of 2, but the DMA controller reports an error with that configuration.
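As a sanity check on the addressing trick, here is a minimal host-side model in plain C++ (not Teensy code, and not a model of the real eDMA engine). It only mimics the source-address arithmetic described in the comments above: two 16-bit reads per minor loop, a little-endian 32-bit write, then stepping the source back by 2 bytes. The names `encodeNbytesMloffYes` and `simulateMinorLoops` are hypothetical helpers invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack NBYTES_MLOFFYES with the source offset enabled:
// SMLOE = bit 31, MLOFF = signed 20-bit field at bits 29..10, NBYTES = bits 9..0.
constexpr std::uint32_t encodeNbytesMloffYes(std::int32_t mloff, std::uint32_t nbytes) {
    return (1u << 31)
         | ((static_cast<std::uint32_t>(mloff) & ((1u << 20) - 1u)) << 10)
         | (nbytes & 0x3FFu);
}

// Same value as the expression written to NBYTES_MLOFFYES in the snippet above.
static_assert(encodeNbytesMloffYes(-2, 4) == 0xBFFFF804u, "MLOFF = -2, NBYTES = 4");

// Returns the 32-bit words that would be written to the destination register,
// one per minor loop, for a frame buffer of 16-bit samples.
std::vector<std::uint32_t> simulateMinorLoops(const std::vector<std::uint16_t>& frame,
                                              std::size_t pixelCount) {
    assert(frame.size() >= pixelCount + 1);  // the buffer needs one extra leading sample
    std::vector<std::uint32_t> written;
    std::size_t index = 0;                   // source position, in 16-bit units
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const std::uint32_t low  = frame[index];      // first read  -> bits 15..0
        const std::uint32_t high = frame[index + 1];  // second read -> bits 31..16
        written.push_back((high << 16) | low);
        index += 2;  // two reads with SOFF = 2 advance the source by 4 bytes
        index -= 1;  // MLOFF = -2 bytes steps it back by one 16-bit sample
    }
    return written;
}
```

With a buffer of `{0xAAAA, 0x1111, 0x2222, 0x3333}` and a pixel count of 3, the upper halves of the written words are `0x1111`, `0x2222`, `0x3333`, so the first sample only ever lands on the low (unused) bits. That matches the PIXEL_CNT + 1 sizing mentioned in the comments.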