How fast is Teensy 4.1 SRAM? (reading)

oneoverf
What kind of speed can I expect when reading from EXTMEM on a Teensy 4.1 with 16 MB of PSRAM? Thanks!

Edit: Sorry for the typo in the title - it should say PSRAM, but I didn't catch it until after I posted :p
 
It depends... It is slower than the other memory regions, but again it depends on your usage - in particular, how often your reads/writes are satisfied by the cache versus how often they actually have to go out to the physical memory.

So, for example, DMA operations are considerably slower, as DMA always talks directly to the physical memory.
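
As a minimal sketch of what that means in practice (not from this thread - the buffer name and the omitted DMA setup are hypothetical; arm_dcache_flush / arm_dcache_delete are the Teensy 4 core's flush-only and invalidate-only counterparts to the arm_dcache_flush_delete call used in the memtest sketch below):
Code:
// Hypothetical example: cache maintenance around a DMA transfer using a
// buffer placed in EXTMEM (PSRAM).  The DMA setup itself is omitted.
EXTMEM static uint8_t dma_buf[4096];   // Teensy core places this in external PSRAM

void prepare_buffer_for_dma_read()
{
  for (size_t i = 0; i < sizeof(dma_buf); i++) dma_buf[i] = i;  // CPU writes land in the data cache
  arm_dcache_flush(dma_buf, sizeof(dma_buf));  // push cached data out to physical PSRAM for the DMA engine
  // ... start the peripheral's DMA transfer here (device specific) ...
}

void use_buffer_filled_by_dma()
{
  arm_dcache_delete(dma_buf, sizeof(dma_buf)); // discard any stale cached copy so the CPU re-reads what DMA wrote
  // ... now read dma_buf[] normally ...
}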

There is a datasheet up on the PJRC product page: https://www.pjrc.com/store/psram.html

I am not sure what QSPI speed it is running at. I see the startup code configures the clock for 88 MHz, but I have not looked to see whether that is the actual speed.
 
Trying to read the datasheet makes me realize how much I don't know, but so far I think I understand that:

- The PSRAM will clock at 84 MHz for Linear Burst operation (which will let me read across the 1024-byte page boundaries).
- "QPI Fast Quad Read" operation takes 18 clock cycles.
- The read operation gets four bits of data at a time.

So, theoretically, two read operations (one byte) = 36 clock cycles, which at 84 MHz would give 84M / 36 ≈ 2.3 MB/s.

Is this roughly correct? I realize I'm way out of my depth here. Thanks a lot for replying.
 
Paul made this PSRAM test sketch: PaulStoffregen/teensy41_psram_memtest/blob/master/teensy41_psram_memtest.ino

It shows clock speed and tests various data patterns with write and read verify without any cache support.

The code below adds an attempt to track the time spent doing the writes and reads - it might be somewhat right:
Code:
EXTMEM Memory Test, 16 Mbyte
 CCM_CBCMR=B5AE8304 (88.0 MHz)
testing with fixed pattern 5A698421		fill us:651559 MB/s:25.75		test us:589824 MB/s:28.44
...
testing with fixed pattern FFFFFFFF		fill us:650328 MB/s:25.80		test us:589825 MB/s:28.44
testing with fixed pattern 00000000		fill us:650331 MB/s:25.80		test us:589825 MB/s:28.44
 test ran for 73.31 seconds
 1824 MB's test ran at 25.48 MB/sec overall
All memory tests passed :-)
Code:
uint32_t rTime;
uint32_t rTCnt = 0;
extern "C" uint8_t external_psram_size;

bool memory_ok = false;
uint32_t *memory_begin, *memory_end;

bool check_fixed_pattern(uint32_t pattern);
bool check_lfsr_pattern(uint32_t seed);

void setup()
{
  while (!Serial) ; // wait
  pinMode(13, OUTPUT);
  uint8_t size = external_psram_size;
  Serial.printf("EXTMEM Memory Test, %d Mbyte\n", size);
  if (size == 0) return;
  const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  const float frequency = clocks[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
  Serial.printf(" CCM_CBCMR=%08X (%.1f MHz)\n", CCM_CBCMR, frequency);
  memory_begin = (uint32_t *)(0x70000000);
  memory_end = (uint32_t *)(0x70000000 + size * 1048576);
  elapsedMillis msec = 0;
  if (!check_fixed_pattern(0x5A698421)) return;
  if (!check_lfsr_pattern(2976674124ul)) return;
  if (!check_lfsr_pattern(1438200953ul)) return;
  if (!check_lfsr_pattern(3413783263ul)) return;
  if (!check_lfsr_pattern(1900517911ul)) return;
  if (!check_lfsr_pattern(1227909400ul)) return;
  if (!check_lfsr_pattern(276562754ul)) return;
  if (!check_lfsr_pattern(146878114ul)) return;
  if (!check_lfsr_pattern(615545407ul)) return;
  if (!check_lfsr_pattern(110497896ul)) return;
  if (!check_lfsr_pattern(74539250ul)) return;
  if (!check_lfsr_pattern(4197336575ul)) return;
  if (!check_lfsr_pattern(2280382233ul)) return;
  if (!check_lfsr_pattern(542894183ul)) return;
  if (!check_lfsr_pattern(3978544245ul)) return;
  if (!check_lfsr_pattern(2315909796ul)) return;
  if (!check_lfsr_pattern(3736286001ul)) return;
  if (!check_lfsr_pattern(2876690683ul)) return;
  if (!check_lfsr_pattern(215559886ul)) return;
  if (!check_lfsr_pattern(539179291ul)) return;
  if (!check_lfsr_pattern(537678650ul)) return;
  if (!check_lfsr_pattern(4001405270ul)) return;
  if (!check_lfsr_pattern(2169216599ul)) return;
  if (!check_lfsr_pattern(4036891097ul)) return;
  if (!check_lfsr_pattern(1535452389ul)) return;
  if (!check_lfsr_pattern(2959727213ul)) return;
  if (!check_lfsr_pattern(4219363395ul)) return;
  if (!check_lfsr_pattern(1036929753ul)) return;
  if (!check_lfsr_pattern(2125248865ul)) return;
  if (!check_lfsr_pattern(3177905864ul)) return;
  if (!check_lfsr_pattern(2399307098ul)) return;
  if (!check_lfsr_pattern(3847634607ul)) return;
  if (!check_lfsr_pattern(27467969ul)) return;
  if (!check_lfsr_pattern(520563506ul)) return;
  if (!check_lfsr_pattern(381313790ul)) return;
  if (!check_lfsr_pattern(4174769276ul)) return;
  if (!check_lfsr_pattern(3932189449ul)) return;
  if (!check_lfsr_pattern(4079717394ul)) return;
  if (!check_lfsr_pattern(868357076ul)) return;
  if (!check_lfsr_pattern(2474062993ul)) return;
  if (!check_lfsr_pattern(1502682190ul)) return;
  if (!check_lfsr_pattern(2471230478ul)) return;
  if (!check_lfsr_pattern(85016565ul)) return;
  if (!check_lfsr_pattern(1427530695ul)) return;
  if (!check_lfsr_pattern(1100533073ul)) return;
  if (!check_fixed_pattern(0x55555555)) return;
  if (!check_fixed_pattern(0x33333333)) return;
  if (!check_fixed_pattern(0x0F0F0F0F)) return;
  if (!check_fixed_pattern(0x00FF00FF)) return;
  if (!check_fixed_pattern(0x0000FFFF)) return;
  if (!check_fixed_pattern(0xAAAAAAAA)) return;
  if (!check_fixed_pattern(0xCCCCCCCC)) return;
  if (!check_fixed_pattern(0xF0F0F0F0)) return;
  if (!check_fixed_pattern(0xFF00FF00)) return;
  if (!check_fixed_pattern(0xFFFF0000)) return;
  if (!check_fixed_pattern(0xFFFFFFFF)) return;
  if (!check_fixed_pattern(0x00000000)) return;
  Serial.printf(" test ran for %.2f seconds\n", (float)msec / 1000.0f);
  Serial.printf(" %d MB's test ran at %.2f MB/sec overall\n", 2 * rTCnt * external_psram_size, 2 * rTCnt * external_psram_size * 1024.0 / (float)msec);
  Serial.println("All memory tests passed :-)");
  memory_ok = true;
}

bool fail_message(volatile uint32_t *location, uint32_t actual, uint32_t expected)
{
  Serial.printf(" Error at %08X, read %08X but expected %08X\n",
                (uint32_t)location, actual, expected);
  return false;
}

// fill the entire RAM with a fixed pattern, then check it
bool check_fixed_pattern(uint32_t pattern)
{
  volatile uint32_t *p;
  Serial.printf("testing with fixed pattern %08X\t", pattern);
  rTime = micros();
  for (p = memory_begin; p < memory_end; p++) {
    *p = pattern;
  }
  rTime = micros() - rTime;
  Serial.printf( "\tfill us:%d MB/s:%.2f\t", rTime, external_psram_size * 1024 * 1024.0 / rTime );
  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);
  rTime = micros();
  for (p = memory_begin; p < memory_end; p++) {
    uint32_t actual = *p;
    if (actual != pattern) return fail_message(p, actual, pattern);
  }
  rTime = micros() - rTime;
  Serial.printf( "\ttest us:%d MB/s:%.2f\n", rTime, external_psram_size * 1024 * 1024.0 / rTime );
  rTCnt++;
  return true;
}

// fill the entire RAM with a pseudo-random sequence, then check it
bool check_lfsr_pattern(uint32_t seed)
{
  volatile uint32_t *p;
  uint32_t reg;

  Serial.printf("testing with pseudo-random sequence, seed=%u\t", seed);
  reg = seed;
  uint32_t rTime;
  rTime = micros();
  for (p = memory_begin; p < memory_end; p++) {
    *p = reg;
    for (int i = 0; i < 3; i++) {
      if (reg & 1) {
        reg >>= 1;
        reg ^= 0x7A5BC2E3;
      } else {
        reg >>= 1;
      }
    }
  }
  rTime = micros() - rTime;
  Serial.printf( "\tfill us:%d MB/s:%.2f\t", rTime, external_psram_size * 1024 * 1024.0 / rTime );
  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);
  reg = seed;
  rTime = micros();
  for (p = memory_begin; p < memory_end; p++) {
    uint32_t actual = *p;
    if (actual != reg) return fail_message(p, actual, reg);
    //Serial.printf(" reg=%08X\n", reg);
    for (int i = 0; i < 3; i++) {
      if (reg & 1) {
        reg >>= 1;
        reg ^= 0x7A5BC2E3;
      } else {
        reg >>= 1;
      }
    }
  }
  rTime = micros() - rTime;
  Serial.printf( "\ttest us:%d MB/s:%.2f\n", rTime, external_psram_size * 1024 * 1024.0 / rTime );
  rTCnt++;
  return true;
}

void loop()
{
  digitalWrite(13, HIGH);
  delay(100);
  if (!memory_ok) digitalWrite(13, LOW); // rapid blink if any test fails
  delay(100);
}
 
You really should write some small test programs using the ARM cycle counter to measure the actual speed. When you test, use "volatile" on the variables or pointers your code uses, so the compiler doesn't optimize away your memory access. Unless you're planning to use DMA, you probably should not call the cache maintenance functions. But if you want to explore how much the processor's cache memory is helping, comparing with and without manipulating the cache is the way to find out.

You might also try testing without "volatile", since your real code probably won't unnecessarily use volatile variables & pointers, as you almost certainly want your code to enjoy the many speed benefits of the compiler's optimizations. But without volatile, the compiler can often optimize away simple tests, so you need to be careful when testing to make sure the input data comes from real hardware or some other source the compiler can't anticipate.
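
For instance, here is a minimal sketch of that kind of measurement (my own code, not Paul's - the buffer name and size are arbitrary; ARM_DWT_CYCCNT is the free-running Cortex-M7 cycle counter the Teensy 4 startup code enables, and F_CPU_ACTUAL is the current CPU clock):
Code:
// Minimal sketch: time a linear read of a buffer in PSRAM using the ARM cycle
// counter.  Buffer name and size are arbitrary choices for illustration.
EXTMEM static uint32_t buf[65536];                // 256 KB placed in PSRAM by the core

void measure_linear_read()
{
  volatile uint32_t *p = buf;                     // volatile so the reads can't be optimized away
  uint32_t sum = 0;
  uint32_t start = ARM_DWT_CYCCNT;                // CPU cycle counter, enabled by the Teensy 4 startup code
  for (size_t i = 0; i < 65536; i++) sum += p[i];
  uint32_t cycles = ARM_DWT_CYCCNT - start;
  float seconds = (float)cycles / (float)F_CPU_ACTUAL;
  Serial.printf("read %u bytes in %u cycles, %.2f MB/s (sum=%u)\n",
                (unsigned)sizeof(buf), (unsigned)cycles,
                (float)sizeof(buf) / seconds / 1e6f, (unsigned)sum);
  // Results depend heavily on whether the data is already cached; run it twice
  // (or flush the cache first) to see both cases.
}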

So about your original question, I believe we're configuring for 88 MHz clock. Each byte takes 2 clock cycles because the PSRAM data path is 4 bits wide, so the raw burst speed is 44 Mbyte/sec. But there is some overhead for chip select, 6 clocks for the 24 bit address, 6 more wait clocks. The FlexSPI port reads in blocks to a fairly large buffer (can't recall if it's 512 bytes or 1K...) so the overhead stuff tends to be minor. But it can add latency. The buffer also means some cache misses get fetched from the buffer rather than re-reading. Some writes also get temporarily buffered, and that's after the ARM cache which is write-back allowing your program to continue without waiting.
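
A rough back-of-envelope using those numbers (the command/chip-select clock count below is my guess, so treat it as illustrative only):
Code:
// Illustrative arithmetic only, based on the figures in the post above.
constexpr float clk_mhz   = 88.0f;           // FlexSPI clock
constexpr float raw_mbps  = clk_mhz / 2.0f;  // 2 clocks per byte on the 4-bit bus = 44 MB/s burst
constexpr int   overhead  = 2 + 6 + 6;       // command/CS clocks (a guess) + 24-bit address + wait clocks
constexpr int   burst_len = 32;              // bytes, e.g. one cache line
constexpr float eff_mbps  = burst_len * clk_mhz / (overhead + 2 * burst_len);
// eff_mbps works out to roughly 36 MB/s for 32-byte bursts; larger FlexSPI
// buffer fills amortize the overhead further and approach the 44 MB/s raw rate.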

It all adds up to a huge number of complex factors, which is why I would recommend doing real code tests using the ARM cycle counter to measure how long the particular usage patterns you will actually use end up taking. Many "normal" programs have little or no performance impact because the cache & buffer are so effective. But other cases like very long FIR filters get almost no cache benefit. Real testing with your actual application is the most practical way to know.
 
Thanks, guys, this is good news. I have a fully loaded Teensy ordered, so I will do some real-world tests as soon as I get it.
 
The posted code, based on Paul's, is a worst case, cache or not: it runs end to end of all of PSRAM writing values, then a second loop confirms those values are present as expected, so it will blow through the cache. That example code only voids the cache between the write and the read - but the write pass finishes at the end of PSRAM and the read pass starts back at the beginning.

Hacked it to drop the cache flush - the difference in that code looks like 0.35 seconds out of the 73.37-second runtime as run here.

Paul once noted the expected throughput to PSRAM ... not sure if it agrees with the 25 to 28 MB/sec that p#4 indicates as calculated?
The lower end of those numbers is due to the overhead of deciding the value to write/expect in the pseudo-random case.

NOTE: The code above was EDITED - the start time was not being set in the random-write case, which made that calculation wrong.
 
The posted code, based on Paul's, is a worst case...

I wouldn't call that worst case. It accesses memory nicely in linear order, so most (around seven out of eight) memory accesses are cache hits:

Code:
for (p = memory_begin; p < memory_end; p++) {
   *p = pattern;
}

I'd say worst case would be every memory access touches a different cache line and forces a miss. According to Cortex-M7 documentation, the data cache has a line length of 32 bytes, which equals eight uint32s. Here's a pathological access pattern that would bring in a new cache line for every access:

Code:
for (uint32_t offset = 0; offset < 8; offset++ ) {
   for (p = memory_begin + offset; p < memory_end; p += 8) {
      *p = pattern;
   }
}

I modified the teensy41_psram_memtest.ino to compare consecutive linear access with what I'm calling pathological access (https://gist.github.com/ericfont/4195abf303e1a39846176ae548c77a78). When running with 16 MB at 132 MHz, my code reports:

linear addressing: Write 457 milliseconds, flush 6 milliseconds, read 408 milliseconds
pathological addr: Write 6172 milliseconds, flush 7 milliseconds, read 3203 milliseconds

So it seems that the pathological reading takes 7.85 times as long as regular linear reading. This is because every memory access is a cache miss, instead of only one out of eight accesses being a miss.

(I should also note according to https://blog.feabhas.com/2020/11/in...ache-part-3-optimising-software-to-use-cache/ "the Cortex-M7 data cache does not support automatic prefetch.")

And it seems that pathological writing takes 1.93 times as long as pathological reading (6172/3203). This is because every write requires bringing in a new cache line *AND* writing that cache line back when it inevitably gets evicted.

I also observe that the linear writing takes only 1.12 times as long as the linear reading (457/408), which is a much lower ratio than 2. I believe that is because there is a store buffer (https://developer.arm.com/documentation/ddi0489/f/memory-system/l1-caches/store-buffer?lang=en) which can hold one cache-line worth of data and so the writes don't always have to wait for the cache line to be brought in before moving on.

I should also note there is a __builtin_prefetch function (https://developer.arm.com/documenta...-practice/Prefetching-with---builtin-prefetch), which maps to ARM prefetch instructions and could potentially speed up memory access if future memory addresses are known ahead of time and there is bandwidth to spare for manually prefetching lines.
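
For example, a sketch of manual prefetching (the distance of four cache lines ahead is an arbitrary guess to tune, and the loop assumes the range is a multiple of eight words):
Code:
// Sketch of manual prefetching with GCC's __builtin_prefetch (emits an ARM
// prefetch hint).  Assumes (end - begin) is a multiple of 8 uint32_t.
uint32_t sum_with_prefetch(const uint32_t *begin, const uint32_t *end)
{
  uint32_t sum = 0;
  for (const uint32_t *p = begin; p < end; p += 8) {  // 8 uint32_t = one 32-byte cache line
    __builtin_prefetch(p + 32);               // hint: start fetching ~4 lines ahead (may run past end; the hint is harmless)
    for (int i = 0; i < 8; i++) sum += p[i];  // work on the current line while the prefetch is in flight
  }
  return sum;
}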
 
So it seems that the pathological reading takes 7.85 times as long as regular linear reading. This is because every memory access is a cache miss, instead of only one out of eight accesses being a miss.

You could probably do even worse, if you really wanted, by scattering access across wider ranges so the FlexSPI buffer doesn't help.
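
Something along these lines, for instance (a sketch - the 4 KB stride is just chosen to be comfortably larger than whatever the FlexSPI buffer turns out to be):
Code:
// Sketch of an even-worse access pattern: hop by more than the FlexSPI buffer
// (assumed to be at most 1 KB per the earlier post), so neither the data cache
// nor that buffer can satisfy the next access.
void scatter_write(uint32_t *begin, uint32_t *end, uint32_t pattern)
{
  const size_t stride = 4096 / sizeof(uint32_t);   // 4 KB hops (assumption)
  for (size_t offset = 0; offset < stride; offset++) {
    for (volatile uint32_t *p = begin + offset; p < end; p += stride) {
      *p = pattern;
    }
  }
}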
 
I wouldn't call that worst case. It accesses memory nicely in linear order, so most (around seven out of eight) memory accesses are cache hits:
...

The cache was cleared before the read - and IIRC a read will fill 32 bytes at once, so only 1 in 8 reads will wait for a cache fill.

Except for the local-buffer fills that are read in at once to the 32-byte cache line, it doesn't otherwise get help from the cache. Yes, a worse worst case would be avoiding the subsequent 28 bytes after using 4 of the 32 read into the cache.

So that represents perhaps a general worst case with minimal cache help.

As Paul notes in p#5, actual use needs to be tested. But with some attention, general use that doesn't void or avoid the cache logic would be expected to do better.
 