Teensy 4.1 PSRAM Random Access Latency

nherzing

New member
My question is about the random-access latency limits of the PSRAM on the Teensy 4.1.

I was hoping to achieve worst-case sub-microsecond tiny random-access reads (reading a single byte) from the PSRAM. The majority of the time I see sub-microsecond reads (well below that limit), but I occasionally hit a read that takes particularly long. My understanding from reading the datasheets and startup.c is that caching, prefetching, and other complicating factors make it non-trivial to tell how long a worst-case random-access read should take.

My question (for someone who understands this far deeper than I do) is two-fold:

1. Are worst-case sub-microsecond tiny random-access reads from PSRAM feasible?
2. If so, any suggestions on how one might achieve this? I'm guessing it would require rewriting the FlexSPI configuration to optimize short random reads?

I wrote a little sketch to get basic timing data on sequential vs. random-access reads; hopefully it helps illustrate the kinds of reads I'm talking about. You can see that, on average, the random-access reads are well below a microsecond, but there are outliers that cause issues for my use case (interfacing with an externally clocked system). You can also see that sequential access is an order of magnitude faster than random access on the PSRAM.

I'm a software dev with little hardware experience so sorry if I'm missing something obvious. Happy to provide any additional details that might be helpful. Thanks!

Code:
// Output running locally for 10000 reads:
//Random access
//Fast RAM: 221 us
//PSRAM: 6950 us
//Sequential access
//Fast RAM: 202 us
//PSRAM: 319 us

EXTMEM uint8_t extmem_data[0x10000];
uint8_t data[0x10000];

uint32_t memory_test() {
  while (!Serial) ;
  uint32_t result[10]; // prevent optimizing the loops away

  uint32_t num_samples = 100000;
  uint32_t num_iters = 10000;
  uint32_t *idxes = (uint32_t *)malloc(num_samples * sizeof(uint32_t)); // size in bytes: count * element size

  // fill in random indices to access
  for (int i = 0; i < num_samples; i++) {
    idxes[i] = random(0x10000);
  }

  Serial.println("Random access");
  elapsedMicros took = 0;
  for (int i = 0; i < num_iters; i++) {
    result[i % 10] = data[idxes[i]];
  }
  unsigned long res = took;
  Serial.printf("Fast RAM: %d\n", res);

  took = 0;
  for (int i = 0; i < num_iters; i++) {
    result[i % 10] = extmem_data[idxes[i]];
  }
  res = took;
  Serial.printf("PSRAM: %d\n", res);

  Serial.println("Sequential access");
  took = 0;
  for (int i = 0; i < num_iters; i++) {
    result[i % 10] = data[i];
  }
  res = took;
  Serial.printf("Fast RAM: %d\n", res);

  took = 0;
  for (int i = 0; i < num_iters; i++) {
    result[i % 10] = extmem_data[i];
  }
  res = took;
  Serial.printf("PSRAM: %d\n", res);  

  free(idxes);
  return result[5];
}

void setup() {
  memory_test();
}

void loop() {
  // put your main code here, to run repeatedly:

}
 
There are tests posted that invalidate the cache, which would let you see the true uncached timing behavior. It seems that was in Paul's PSRAM test sketch posted on the forum and on his GitHub.

There is a 32KB cache used across the external pad chips (PSRAM or FLASH) and the upper 256KB of on-chip RAM2 in the 1062 - it may also cover the 'boot' flash where code resides?

The interface to the QSPI PSRAM is controlled by the 1062 processor. It chooses what resides in the cache over time and also how much to read ahead when a byte is requested. The block size of the read-ahead helps sequential access, since subsequent bytes will already be local to the 1062. For random access, when only part of a block is used before the next block is requested, the next read may be delayed while the prior read completes in some fashion? Perhaps the amount of read-ahead is specified when the chip is initialized?

Not having dealt with the low-level init, others may have the answer - or comparing the code called from startup.c against the 1062 Ref Manual would give it.

<edit> : PJRC PSRAM test code : github.com/PaulStoffregen/teensy41_psram_memtest/blob/master/teensy41_psram_memtest.ino
> uses arm_dcache_flush_delete()
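
For reference, a rough sketch of forcing an uncached read before timing it. The buffer and function names are illustrative; arm_dcache_flush_delete() is the Teensy 4 core helper used in the memtest sketch linked above:
Code:
#include <Arduino.h>

EXTMEM uint8_t psram_buf[4096];   // illustrative buffer in PSRAM

// Evict the buffer from the data cache, then time a single read so it
// really has to go out over FlexSPI2 to the PSRAM chip.
void time_uncached_read(uint32_t offset) {
  arm_dcache_flush_delete(psram_buf, sizeof(psram_buf)); // write back + invalidate

  uint32_t start = ARM_DWT_CYCCNT;
  volatile uint8_t v = psram_buf[offset];                // uncached read
  uint32_t cycles = ARM_DWT_CYCCNT - start;

  Serial.printf("offset %lu: %lu cycles (value %u)\n",
                (unsigned long)offset, (unsigned long)cycles, v);
}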
 
You should use the cycle counter ARM_DWT_CYCCNT for measuring such short times.

Those tests should be done with interrupts disabled. If you're occasionally seeing a much longer time, it could be due to an interrupt occurring during the measurement.
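
A minimal sketch of that measurement pattern (the buffer and function names are illustrative; noInterrupts()/interrupts() are the Arduino wrappers for disabling and re-enabling interrupts):
Code:
#include <Arduino.h>

EXTMEM uint8_t buf[0x10000];   // illustrative PSRAM buffer

// Time one byte read with the DWT cycle counter. Interrupts are disabled
// around the measured window so a stray ISR can't inflate the result.
uint32_t time_one_read(uint32_t idx) {
  noInterrupts();
  uint32_t start = ARM_DWT_CYCCNT;
  volatile uint8_t v = buf[idx];
  uint32_t elapsed = ARM_DWT_CYCCNT - start;
  interrupts();
  (void)v;           // keep the read from being optimized away
  return elapsed;    // CPU cycles (600 MHz default, so 600 cycles per microsecond)
}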
 
> You should use the cycle counter ARM_DWT_CYCCNT for measuring such short times.
>
> Those tests should be done with interrupts disabled. If you're occasionally seeing a much longer time, it could be due to an interrupt occurring during the measurement.

Good point Paul - that's twice I've skipped mentioning the really cool ARM_DWT_CYCCNT, since I didn't look at the posted code before giving the generic info above.

See this post for example of use if needed: pjrc.com/threads/62385-Timer-interrupts-gt-1MHz
> On T_4.1 the ARM_DWT_CYCCNT is already running.
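
For completeness, a small hedged sketch of making sure the cycle counter is running before using it. On the Teensy 4.x the startup code already does this, so the two register writes (names from the Teensy core headers) are redundant there but harmless:
Code:
#include <Arduino.h>

void setup() {
  // Enable the DWT/trace unit and start the free-running cycle counter.
  // Teensy 4.x startup already does this; repeating it is harmless.
  ARM_DEMCR |= ARM_DEMCR_TRCENA;
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;
  Serial.begin(115200);
}

void loop() {
  Serial.println(ARM_DWT_CYCCNT);   // raw cycle count; wraps about every 7 s at 600 MHz
  delay(500);
}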
 
It will be interesting to see how well your numbers come out.

As mentioned, the same cache is, I believe, used for more or less everything other than the ITCM/DTCM areas of memory.

Will be good to see if you can determine when you are getting cache hits and misses.

Sort of a secondary note: I do know that, for example, DMA out of this region is slower than from both PROGMEM and DMAMEM...

In another thread, while I was trying to track down a bug that turned out to be in the SPI library, I was doing a DMA operation to output an image from memory to an RA8876...
I tested outputs from flash memory, then DMAMEM and PSRAM... I could not do the image from lower memory as it would not fit.

Here is a logic analyzer output showing the transfers. I did DMAMEM twice to see if maybe after the first output it would help the second (it did not).
[attachment: logic analyzer capture of the four DMA outputs]


All four output groups shown are doing the same output; the only difference is the pointer to the data. As you can see, the last one took a lot longer to finish, and that was the PSRAM.
Note: the image is 243800*2 bytes long.
 
@KurtE, et al.

Just remembered something that may affect the PSRAM performance as compared to DMAMEM or PROGMEM. Right now the PSRAM is set to default to 88 MHz in startup.c. However, we tested this at 132 MHz and it seems to work without issue. To change the PSRAM clock, add this to your setup before accessing the PSRAM - it may make a difference:
Code:
  // Raise the FlexSPI2 (PSRAM) clock to ~132 MHz
  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
  CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
      | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2); // PLL3 PFD1 (664.62 MHz) / 5 ≈ 132.9 MHz
  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);
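
For reference, a hedged sketch of wrapping that change in a helper (the function name is made up) and reading back CCM_CBCMR to confirm the resulting FlexSPI2 clock, using the same selector/divider decode that appears in the test code later in this thread:
Code:
#include <Arduino.h>

// Hypothetical helper: raise the FlexSPI2 (PSRAM) clock, then decode CCM_CBCMR
// to report the frequency actually selected. Call early in setup(), before
// touching any EXTMEM data.
static void set_psram_clock_132mhz() {
  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);   // gate FlexSPI2 while reprogramming
  CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
      | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2);
  CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);    // ungate again

  // Root clock per the CLK_SEL field, divided by (PODF + 1).
  const float roots[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  float mhz = roots[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
  Serial.printf("FlexSPI2 clock: %.1f MHz\n", mhz);
}
Calling this at the top of setup() and checking the printed value is a quick way to confirm the change took effect.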
 
Thanks everyone for the helpful suggestions. I dug in a bit more to understand the performance of different access patterns and how data gets fetched from PSRAM.

It seems data is fetched from the PSRAM in 32-byte bursts, which lines up with the datasheet for one of the compatible PSRAM chips. You can observe this in the timing of sequential reads: every 32nd read takes about 350 cycles (per ARM_DWT_CYCCNT), while reads of the following 31 bytes take only about 11 cycles each.

Looking at random-access reads, I expected worst-case reads of about 350 cycles, since they would rarely hit the cache. Instead, I frequently saw up to double that. I don't know what explains this. Any thoughts?

In my quest for the fastest worst-case random-access read, I ended up hand-writing code to use the IP Command interface directly (Section 27.5.9 of the 1062 Reference Manual). This gave me the most consistent timing for random reads, at around 330 cycles. Almost all of that time (~290 cycles) is spent spin-waiting on FLEXSPI_INTR_IPRXWA. There's obviously a correlation between that time and the PSRAM clock. Any suggestions for improving or speeding this up would be appreciated.

This is sadly still too slow for my application of using the Teensy to emulate a banked memory controller clocked at 1MHz. I may need a different approach.

My full code (the timing code and the manual FlexSPI code) is included below.


Code:
#include <Arduino.h>

EXTMEM uint8_t extmem_data[0x10000];

inline static uint8_t flexspi2_read(uint32_t addr) {
  FLEXSPI2_IPCR0 = addr;                           // address for the IP command
  FLEXSPI2_IPCR1 = FLEXSPI_IPCR1_ISEQID(5);        // LUT sequence index 5 (the read sequence used here)

  FLEXSPI2_IPCMD = FLEXSPI_IPCMD_TRG;              // trigger the IP command

  while (!(FLEXSPI2_INTR & FLEXSPI_INTR_IPRXWA)) ; // spin until the RX FIFO watermark flag is set

  uint32_t data = FLEXSPI2_RFDR0;                  // first word from the IP RX FIFO
  FLEXSPI2_INTR = FLEXSPI_INTR_IPCMDDONE | FLEXSPI_INTR_IPRXWA; // clear flags, pop the FIFO
  return data;                                     // low byte is the requested byte
}

uint32_t spi_test(int num_iters) {
  uint8_t result[10];
  volatile uint32_t cycles = 0;

  uint32_t *idxes = (uint32_t *)malloc(num_iters * sizeof(uint32_t)); // size in bytes: count * element size

  // fill in random indices to access
  for (int i = 0; i < num_iters; i++) {
    idxes[i] = random(0x10000);
  }  

  for (int i = 0; i < num_iters; i++) {
    cli();
    cycles = ARM_DWT_CYCCNT;

    result[i % 10] = flexspi2_read(idxes[i]);

    uint32_t res = ARM_DWT_CYCCNT - cycles;
    Serial.printf("%d: (%d) %d\n", i, idxes[i], res);
    sei();
  }
  free(idxes);
  Serial.println("Done!");

  return result[5];
}

uint32_t rand_test(uint8_t *data, int num_iters) {
  uint8_t result[10];
  volatile uint32_t cycles = 0;

  uint32_t *idxes = (uint32_t *)malloc(num_iters * sizeof(uint32_t)); // size in bytes: count * element size

  // fill in random indices to access
  for (int i = 0; i < num_iters; i++) {
    idxes[i] = random(0x10000);
  }  

  for (int i = 0; i < num_iters; i++) {
    cli();
    cycles = ARM_DWT_CYCCNT;

    result[i % 10] = data[idxes[i]];

    uint32_t res = ARM_DWT_CYCCNT - cycles;
    Serial.printf("%d: (%d) %d\n", i, idxes[i], res);
    sei();
  }
  free(idxes);
  Serial.println("Done!");

  return result[5];
}

uint32_t seq_test(uint8_t *data, int num_iters) {
  uint8_t result[10];
  volatile uint32_t cycles = 0;

  for (int i = 0; i < num_iters; i++) {
    cli();
    cycles = ARM_DWT_CYCCNT;
    result[i % 10] = data[i];
    uint32_t res = ARM_DWT_CYCCNT - cycles;

    Serial.printf("%d: %d\n", i, res);
    sei();

  }  
  Serial.println("Done!");

  return result[5];
}

void setup() {
  while (!Serial) ;

  const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  const float frequency = clocks[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
  Serial.printf("CCM_CBCMR=%08X (%.1f MHz)\n", CCM_CBCMR, frequency);

  for (int i = 0; i < 0x10000; i++) {
    extmem_data[i] = i;
  }
  
  Serial.println("Seq Test");
  arm_dcache_flush_delete((void *)extmem_data, 0x10000);  
  seq_test(extmem_data, 0x100);

  Serial.println("Rand Test");
  arm_dcache_flush_delete((void *)extmem_data, 0x10000);  
  rand_test(extmem_data, 0x100);

  Serial.println("SPI TEST");
  arm_dcache_flush_delete((void *)extmem_data, 0x10000);    
  spi_test(0x100);
}

void loop() {}
 
> @KurtE, et al.
>
> Just remembered something that may affect the PSRAM performance as compared to DMAMEM or PROGMEM. Right now the PSRAM is set to default to 88 MHz in startup.c. However, we tested this at 132 MHz and it seems to work without issue. To change the PSRAM clock, add this to your setup before accessing the PSRAM - it may make a difference:
> (132 MHz clock-change code as in post #6 above)

Thanks Mike,

I tried it with the test program that outputs an image using DMA to the RA8876, and it appears to work...
Before the change it took about 279 ms to output the image; with the change that came down to 222 ms.
Note: the time to output from FLASHMEM and PROGMEM is in the range of about 115 ms.
 
@KurtE
Thanks Kurt. At least it helped performance. I wonder if there is anything else that could speed things up with PSRAM?


@nherzing ....

I reset the PSRAM clock to 132 MHz per post #6 and it reduced the cycles from 331 down to 271.

Code:
Seq Test: ~11 cycles, ~250 on 32-byte boundaries
Rand Test: ~9 to 464
SPI Test: max 271, roughly 269 on average
 
> Thanks everyone for the helpful suggestions. I dug in a bit more to understand the performance of different access patterns and how data gets fetched from PSRAM.
>
> It seems data is fetched from the PSRAM in 32-byte bursts, which lines up with the datasheet for one of the compatible PSRAM chips. You can observe this in the timing of sequential reads: every 32nd read takes about 350 cycles (per ARM_DWT_CYCCNT), while reads of the following 31 bytes take only about 11 cycles each.
> ...

The 32-byte read for the first access has to complete before the second random access can start its own 32-byte read, so a back-to-back miss can cost roughly two bursts. It seems like it should be able to return the requested byte partway through, though.

The test would perhaps be better if the for loop stored the time value into an array parallel to idxes[] and the printing of both arrays moved outside, into another loop (a sketch of that is below)?

Also, putting the index array on the heap (uint32_t *idxes = malloc(num_samples)) rather than in RAM1 fast low memory (a global allocation) isn't helping, as the malloc'd RAM2 is slower.

Oh yeah - and there is the chip clock speed setup, 88 vs. 132 MHz.
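
A minimal sketch of that restructuring (names are made up for illustration): the timings go into an array parallel to idxes[], both arrays live in RAM1 as globals rather than on the heap, and all printing happens after the measurement loop:
Code:
#include <Arduino.h>

EXTMEM uint8_t extmem_data[0x10000];  // same size buffer as in the tests above

static const int NUM_ITERS = 256;
static uint32_t idxes[NUM_ITERS];     // indices to read (globals land in RAM1, not heap/RAM2)
static uint32_t cycles[NUM_ITERS];    // per-read cycle counts, parallel to idxes[]

void rand_test_quiet() {
  for (int i = 0; i < NUM_ITERS; i++) idxes[i] = random(0x10000);
  arm_dcache_flush_delete(extmem_data, sizeof(extmem_data));  // start with a cold cache

  noInterrupts();                     // no ISRs and no Serial inside the timed loop
  for (int i = 0; i < NUM_ITERS; i++) {
    uint32_t start = ARM_DWT_CYCCNT;
    volatile uint8_t v = extmem_data[idxes[i]];
    cycles[i] = ARM_DWT_CYCCNT - start;
    (void)v;
  }
  interrupts();

  for (int i = 0; i < NUM_ITERS; i++) // print only after measuring
    Serial.printf("%d: (%lu) %lu\n", i, (unsigned long)idxes[i], (unsigned long)cycles[i]);
}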
 
> @KurtE, et al.
>
> Just remembered something that may affect the PSRAM performance as compared to DMAMEM or PROGMEM. Right now the PSRAM is set to default to 88 MHz in startup.c. However, we tested this at 132 MHz and it seems to work without issue. To change the PSRAM clock, add this to your setup before accessing the PSRAM - it may make a difference:
> (132 MHz clock-change code as in post #6 above)
holy crap this is old but this right here saved me so much time trying to figure out why my programs were getting the same performance as an ESP32-S3. The ESP32-S3's PSRAM runs at the same speed as the Teensy default, and clocking it to 132 MHz made my program run way, way faster due to lower memory latency, now that the processor could actually get the data quickly enough to work on it.
 