Thread: How fast is Teensy 4.1 SRAM? (reading)

  1. #1
    Junior Member
    Join Date
    Dec 2021
    Location
    Michigan USA
    Posts
    7

    How fast is Teensy 4.1 SRAM? (reading)

    What kind of speed can I expect when reading from EXTMEM on a Teensy 4.1 with 16 MB of PSRAM? Thanks!

    Edit: Sorry for typo in title, should say PSRAM, but I didn't catch it until after I posted :P
    Last edited by oneoverf; 05-22-2022 at 02:40 PM.

  2. #2
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    10,533
    It depends... It is slower than the other memory regions, but again it depends on your usage. In particular, it depends on how often your reads and writes hit the cache versus how often the processor actually has to read or write the physical memory.

    So, for example, DMA operations are considerably slower, as DMA always talks directly to the physical memory.

    There is a datasheet up on the PJRC product page: https://www.pjrc.com/store/psram.html

    I am not sure what speed the QSPI bus is actually running at. I see the code configures the clock for 88 MHz, but I have not checked whether that is the real speed.

  3. #3
    Junior Member
    Join Date
    Dec 2021
    Location
    Michigan USA
    Posts
    7
    Trying to read the datasheet makes me realize how much I don't know, but so far I think I understand that:

    - The PSRAM will clock at 84MHz for Linear Burst operation (which will let me read across the 1024-byte page boundaries).
    - "QPI Fast Quad Read" operation takes 18 clock cycles.
    - The read operation gets four bits of data at a time.

    So, theoretically, two read operations = 36 clock cycles, which at 84 MHz would give 84M / 36 = 2.3 MB/s.

    Is this roughly correct? I realize I'm way out of my depth here. Thanks a lot for replying.

  4. #4
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    16,010
    Paul made this PSRAM test sketch: PaulStoffregen/teensy41_psram_memtest/blob/master/teensy41_psram_memtest.ino

    It shows clock speed and tests various data patterns with write and read verify without any cache support.

    The code below adds an attempt to track the time spent doing the writes and reads. It might be somewhat right?
    Code:
    EXTMEM Memory Test, 16 Mbyte
     CCM_CBCMR=B5AE8304 (88.0 MHz)
    testing with fixed pattern 5A698421		fill us:651559 MB/s:25.75		test us:589824 MB/s:28.44
    ...
    testing with fixed pattern FFFFFFFF		fill us:650328 MB/s:25.80		test us:589825 MB/s:28.44
    testing with fixed pattern 00000000		fill us:650331 MB/s:25.80		test us:589825 MB/s:28.44
     test ran for 73.31 seconds
     1824 MB's test ran at 25.48 MB/sec overall
    All memory tests passed :-)
    Code:
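    uint32_t rTime;      // elapsed microseconds for the most recent fill or verify pass
    uint32_t rTCnt = 0;  // number of pattern tests completed so far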
    extern "C" uint8_t external_psram_size;
    
    bool memory_ok = false;
    uint32_t *memory_begin, *memory_end;
    
    bool check_fixed_pattern(uint32_t pattern);
    bool check_lfsr_pattern(uint32_t seed);
    
    void setup()
    {
      while (!Serial) ; // wait
      pinMode(13, OUTPUT);
      uint8_t size = external_psram_size;
      Serial.printf("EXTMEM Memory Test, %d Mbyte\n", size);
      if (size == 0) return;
      const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
      const float frequency = clocks[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
      Serial.printf(" CCM_CBCMR=%08X (%.1f MHz)\n", CCM_CBCMR, frequency);
      memory_begin = (uint32_t *)(0x70000000);
      memory_end = (uint32_t *)(0x70000000 + size * 1048576);
      elapsedMillis msec = 0;
      if (!check_fixed_pattern(0x5A698421)) return;
      if (!check_lfsr_pattern(2976674124ul)) return;
      if (!check_lfsr_pattern(1438200953ul)) return;
      if (!check_lfsr_pattern(3413783263ul)) return;
      if (!check_lfsr_pattern(1900517911ul)) return;
      if (!check_lfsr_pattern(1227909400ul)) return;
      if (!check_lfsr_pattern(276562754ul)) return;
      if (!check_lfsr_pattern(146878114ul)) return;
      if (!check_lfsr_pattern(615545407ul)) return;
      if (!check_lfsr_pattern(110497896ul)) return;
      if (!check_lfsr_pattern(74539250ul)) return;
      if (!check_lfsr_pattern(4197336575ul)) return;
      if (!check_lfsr_pattern(2280382233ul)) return;
      if (!check_lfsr_pattern(542894183ul)) return;
      if (!check_lfsr_pattern(3978544245ul)) return;
      if (!check_lfsr_pattern(2315909796ul)) return;
      if (!check_lfsr_pattern(3736286001ul)) return;
      if (!check_lfsr_pattern(2876690683ul)) return;
      if (!check_lfsr_pattern(215559886ul)) return;
      if (!check_lfsr_pattern(539179291ul)) return;
      if (!check_lfsr_pattern(537678650ul)) return;
      if (!check_lfsr_pattern(4001405270ul)) return;
      if (!check_lfsr_pattern(2169216599ul)) return;
      if (!check_lfsr_pattern(4036891097ul)) return;
      if (!check_lfsr_pattern(1535452389ul)) return;
      if (!check_lfsr_pattern(2959727213ul)) return;
      if (!check_lfsr_pattern(4219363395ul)) return;
      if (!check_lfsr_pattern(1036929753ul)) return;
      if (!check_lfsr_pattern(2125248865ul)) return;
      if (!check_lfsr_pattern(3177905864ul)) return;
      if (!check_lfsr_pattern(2399307098ul)) return;
      if (!check_lfsr_pattern(3847634607ul)) return;
      if (!check_lfsr_pattern(27467969ul)) return;
      if (!check_lfsr_pattern(520563506ul)) return;
      if (!check_lfsr_pattern(381313790ul)) return;
      if (!check_lfsr_pattern(4174769276ul)) return;
      if (!check_lfsr_pattern(3932189449ul)) return;
      if (!check_lfsr_pattern(4079717394ul)) return;
      if (!check_lfsr_pattern(868357076ul)) return;
      if (!check_lfsr_pattern(2474062993ul)) return;
      if (!check_lfsr_pattern(1502682190ul)) return;
      if (!check_lfsr_pattern(2471230478ul)) return;
      if (!check_lfsr_pattern(85016565ul)) return;
      if (!check_lfsr_pattern(1427530695ul)) return;
      if (!check_lfsr_pattern(1100533073ul)) return;
      if (!check_fixed_pattern(0x55555555)) return;
      if (!check_fixed_pattern(0x33333333)) return;
      if (!check_fixed_pattern(0x0F0F0F0F)) return;
      if (!check_fixed_pattern(0x00FF00FF)) return;
      if (!check_fixed_pattern(0x0000FFFF)) return;
      if (!check_fixed_pattern(0xAAAAAAAA)) return;
      if (!check_fixed_pattern(0xCCCCCCCC)) return;
      if (!check_fixed_pattern(0xF0F0F0F0)) return;
      if (!check_fixed_pattern(0xFF00FF00)) return;
      if (!check_fixed_pattern(0xFFFF0000)) return;
      if (!check_fixed_pattern(0xFFFFFFFF)) return;
      if (!check_fixed_pattern(0x00000000)) return;
      Serial.printf(" test ran for %.2f seconds\n", (float)msec / 1000.0f);
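      // Each test writes all of PSRAM and then reads it back, hence the factor of 2.
      // Note: this multiplies megabytes by 1024 and divides by milliseconds, so the
      // printed overall rate is about 2.4% higher than megabytes / seconds.
      Serial.printf(" %d MB's test ran at %.2f MB/sec overall\n", 2 * rTCnt * external_psram_size, 2 * rTCnt * external_psram_size * 1024.0 / (float)msec);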
      Serial.println("All memory tests passed :-)");
      memory_ok = true;
    }
    
    bool fail_message(volatile uint32_t *location, uint32_t actual, uint32_t expected)
    {
      Serial.printf(" Error at %08X, read %08X but expected %08X\n",
                    (uint32_t)location, actual, expected);
      return false;
    }
    
    // fill the entire RAM with a fixed pattern, then check it
    bool check_fixed_pattern(uint32_t pattern)
    {
      volatile uint32_t *p;
      Serial.printf("testing with fixed pattern %08X\t", pattern);
      rTime = micros();
      for (p = memory_begin; p < memory_end; p++) {
        *p = pattern;
      }
      rTime = micros() - rTime;
      Serial.printf( "\tfill us:%d MB/s:%.2f\t", rTime, external_psram_size * 1024 * 1024.0 / rTime );
      arm_dcache_flush_delete((void *)memory_begin,
                              (uint32_t)memory_end - (uint32_t)memory_begin);
      rTime = micros();
      for (p = memory_begin; p < memory_end; p++) {
        uint32_t actual = *p;
        if (actual != pattern) return fail_message(p, actual, pattern);
      }
      rTime = micros() - rTime;
      Serial.printf( "\ttest us:%d MB/s:%.2f\n", rTime, external_psram_size * 1024 * 1024.0 / rTime );
      rTCnt++;
      return true;
    }
    
    // fill the entire RAM with a pseudo-random sequence, then check it
    bool check_lfsr_pattern(uint32_t seed)
    {
      volatile uint32_t *p;
      uint32_t reg;
    
      Serial.printf("testing with pseudo-random sequence, seed=%u\t", seed);
      reg = seed;
      rTime = micros();
      for (p = memory_begin; p < memory_end; p++) {
        *p = reg;
        for (int i = 0; i < 3; i++) {
          if (reg & 1) {
            reg >>= 1;
            reg ^= 0x7A5BC2E3;
          } else {
            reg >>= 1;
          }
        }
      }
      rTime = micros() - rTime;
      Serial.printf( "\tfill us:%d MB/s:%.2f\t", rTime, external_psram_size * 1024 * 1024.0 / rTime );
      arm_dcache_flush_delete((void *)memory_begin,
                              (uint32_t)memory_end - (uint32_t)memory_begin);
      reg = seed;
      rTime = micros();
      for (p = memory_begin; p < memory_end; p++) {
        uint32_t actual = *p;
        if (actual != reg) return fail_message(p, actual, reg);
        //Serial.printf(" reg=%08X\n", reg);
        for (int i = 0; i < 3; i++) {
          if (reg & 1) {
            reg >>= 1;
            reg ^= 0x7A5BC2E3;
          } else {
            reg >>= 1;
          }
        }
      }
      rTime = micros() - rTime;
      Serial.printf( "\ttest us:%d MB/s:%.2f\n", rTime, external_psram_size * 1024 * 1024.0 / rTime );
      rTCnt++;
      return true;
    }
    
    void loop()
    {
      digitalWrite(13, HIGH);
      delay(100);
      if (!memory_ok) digitalWrite(13, LOW); // rapid blink if any test fails
      delay(100);
    }
    Last edited by defragster; 05-23-2022 at 06:13 AM. Reason: edit for missing start time set in pseudo case

  5. #5
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    26,395
    You really should write some small test programs using the ARM cycle counter to measure the actual speed. When you test, use "volatile" on the variables or pointers your code uses, so the compiler doesn't optimize away your memory access. Unless you're planning to use DMA, you probably should not call the cache maintenance functions. But if you want to explore how much the processor's cache memory is helping, comparing with and without manipulating the cache is the way to find out.

    You might also try testing without "volatile", since your real code probably won't unnecessarily use volatile variables & pointers, as you almost certainly want your code to enjoy the many speed benefits of the compiler's optimizations. But without volatile, the compiler can often optimize away simple tests, so you need to be careful when testing to make sure the input data comes from real hardware or some other source the compiler can't anticipate.
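
    A minimal sketch of the kind of measurement described above, written as plain C++ so the timing logic can also be checked off-target. The names `time_reads`, `ReadTiming`, and `mbytes_per_second` are made up for this illustration. The counter is passed in as a callable, on the assumption that on a Teensy 4.1 it would return `ARM_DWT_CYCCNT` and that the CPU clock would be taken from `F_CPU_ACTUAL`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Result of one timed read pass.
struct ReadTiming {
  uint32_t cycles;    // elapsed counter ticks (unsigned subtraction tolerates one wrap)
  uint32_t checksum;  // sum of the words read, so the loads have a visible result
};

// Reads `count` 32-bit words through a volatile pointer and times the loop
// with a caller-supplied counter. Assumption: on a Teensy 4.x the counter
// would be something like [] { return ARM_DWT_CYCCNT; }.
template <typename CounterFn>
ReadTiming time_reads(const volatile uint32_t *src, size_t count, CounterFn now) {
  assert(src != nullptr);
  uint32_t sum = 0;
  const uint32_t start = now();
  for (size_t i = 0; i < count; i++) {
    sum += src[i];  // volatile access: the compiler must perform every load
  }
  const uint32_t stop = now();
  return ReadTiming{stop - start, sum};
}

// Converts a cycle count into MB/s (10^6 bytes per second) for a given CPU clock.
inline double mbytes_per_second(size_t bytes, uint32_t cycles, double cpu_hz) {
  if (cycles == 0) return 0.0;
  return (double)bytes * cpu_hz / ((double)cycles * 1e6);
}
```

    On the Teensy, a call might look like `time_reads((const volatile uint32_t *)0x70000000, 1024, [] { return ARM_DWT_CYCCNT; })`, followed by `mbytes_per_second(1024 * 4, result.cycles, F_CPU_ACTUAL)`. Running it once on a cold buffer and once right after a previous pass is one way to see how much the cache is helping.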

    So about your original question, I believe we're configuring for 88 MHz clock. Each byte takes 2 clock cycles because the PSRAM data path is 4 bits wide, so the raw burst speed is 44 Mbyte/sec. But there is some overhead for chip select, 6 clocks for the 24 bit address, 6 more wait clocks. The FlexSPI port reads in blocks to a fairly large buffer (can't recall if it's 512 bytes or 1K...) so the overhead stuff tends to be minor. But it can add latency. The buffer also means some cache misses get fetched from the buffer rather than re-reading. Some writes also get temporarily buffered, and that's after the ARM cache which is write-back allowing your program to continue without waiting.
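
    As a rough illustration of how the per-burst overhead eats into the 44 Mbyte/sec raw figure, here is a small model. It is only a sketch: `BurstModel` and both functions are hypothetical names, the 12 overhead clocks are just the 6 address plus 6 wait clocks mentioned above, and command clocks, chip select time, and the FlexSPI buffer are all ignored, so treat the results as an upper-bound estimate.

```cpp
#include <cassert>
#include <cstddef>

// Assumed model of one quad-wide read burst: a fixed number of overhead
// clocks, then 2 clocks per byte (4 data bits per clock).
struct BurstModel {
  double clock_mhz;          // PSRAM clock, 88 MHz per this thread
  unsigned overhead_clocks;  // e.g. 6 address + 6 wait = 12; command and CS time ignored
};

// Raw data rate with zero overhead, in MB/s (10^6 bytes per second).
inline double raw_mbytes_per_second(const BurstModel &m) {
  return m.clock_mhz / 2.0;
}

// Estimated rate when every burst of `burst_bytes` pays the overhead once.
inline double burst_mbytes_per_second(const BurstModel &m, size_t burst_bytes) {
  assert(burst_bytes > 0);
  const double clocks = m.overhead_clocks + 2.0 * (double)burst_bytes;
  return (double)burst_bytes * m.clock_mhz / clocks;
}
```

    With `BurstModel{88.0, 12}`, a 32-byte burst (one Cortex-M7 data cache line) comes out near 37 MB/s and a 1024-byte burst near 43.7 MB/s, which is why longer bursts make the overhead look minor.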

    It all adds up to a huge number of complex factors, which is why I would recommend doing real code tests using the ARM cycle counter to measure how long the particular usage patterns you will actually use end up taking. Many "normal" programs have little or no performance impact because the cache & buffer are so effective. But other cases like very long FIR filters get almost no cache benefit. Real testing with your actual application is the most practical way to know.

  6. #6
    Junior Member
    Join Date
    Dec 2021
    Location
    Michigan USA
    Posts
    7
    Thanks, guys, this is good news. I have a fully loaded Teensy ordered, so I will do some real-world tests as soon as I get it.

  7. #7
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    16,010
    Quote Originally Posted by oneoverf View Post
    Thanks, guys, this is good news. I have a fully loaded Teensy ordered, so I will do some real-world tests as soon as I get it.
    The code I posted, based on Paul's sketch, is a worst case, cache or not: it runs end to end through all of PSRAM writing values, then a second loop confirms those values are present as expected. So it will blow through the cache. That example code only voids the cache between write and read, but the write finishes at the end of PSRAM and the read starts at the beginning, so the cache would not hold the data being read anyway.

    Hacked it to drop the cache flush - and the diff in that code looks like 0.35 seconds out of 73.37 seconds runtime as run here.

    Paul once noted the expected throughput to PSRAM ... not sure if it agrees with the 25 to 28 MB/sec that the output in post #4 shows as calculated?
    And the lower end of the numbers it shows is due to the overhead of deciding the value to write/expect in the pseudo-random case.
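
    For comparing the measured numbers against the 44 Mbyte/sec raw burst rate Paul gave in post #5, the arithmetic can be sketched like this. The function names `throughput_mbs` and `fraction_of_raw` are made up for the example; the inputs are the microsecond timings printed in post #4.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in MB/s (10^6 bytes per second) for `bytes` moved in `micros`
// microseconds. Bytes per microsecond is numerically the same as MB/s.
inline double throughput_mbs(uint64_t bytes, uint32_t micros) {
  assert(micros > 0);
  return (double)bytes / (double)micros;
}

// Fraction of the raw burst rate that a measured rate achieves.
inline double fraction_of_raw(double measured_mbs, double raw_mbs) {
  assert(raw_mbs > 0.0);
  return measured_mbs / raw_mbs;
}
```

    Fed the fixed-pattern numbers from post #4 for 16 MB of PSRAM, the 589824 us verify pass works out to about 28.44 MB/s, roughly 65% of the raw figure, and the 651559 us fill pass to about 25.75 MB/s, roughly 59%.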

    NOTE: Code above EDITED as there was a missing set of the start time in the random write case making calc wrong.
