Additional PSRAM ID that works plus goodies

I can easily wait another month for my project to utilize 32MB. Right now, for most normal things, 16MB is an acceptable minimum. Most situations only need 8MB TBH, but the additional memory will allow me to do what I am doing at a higher resolution.
As soon as the project is out of beta, and as soon as my beta testers find all the nasty bugs, I'll post the project.

Thanks everyone for supporting the larger RAM and thanks to those who helped me untangle DMA.
 
Further to issues encountered with getting the AudioEffectDelayExternal object working, I've circled back to the PSRAM memory test, i.e. using the FlexSPI2 interface to give memory in the EXTMEM area. Long story short, all appears well with the latest 1.60 beta 5, but there are issues if pre-fetching is enabled, as suggested in this PR by @jmarsh. The issues are not picked up by the existing PSRAM test, but they are if a fairly heavily modified version of it is used. Code below, I'll do a PR soon.
C++:
/*
   Note that test fails with ISSI 16MByte parts if pre-fetch is on
   Speed  prefetch Duration (16MB)
   105.6    on       61.16
    88.0    on       70.20
   105.6    off      72.63
    88.0    off      82.49

   Original test passes using 16MB ISSI PSRAM
   (note different algorithm, so duration NOT comparable)
   Speed  prefetch Duration (16MB)
   105.6    on       48.19
*/
extern "C" uint8_t external_psram_size;

bool memory_ok = false;
uint32_t *memory_begin, *memory_end;

bool check_fixed_pattern(uint32_t pattern);
bool check_lfsr_pattern(uint32_t seed);

void setup()
{
  while (!Serial) ; // wait
  pinMode(13, OUTPUT);
  uint8_t size = external_psram_size, size1 = FLEXSPI2_FLSHA1CR0 >> 10;
  Serial.printf("EXTMEM Memory Test, %d MByte (%d+%d)\n", size, size1, size - size1);
  if (size == 0) return;
  const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  const float frequency = clocks[(CCM_CBCMR >> 8) & 3] / (float)(((CCM_CBCMR >> 29) & 7) + 1);
  Serial.printf(" CCM_CBCMR=%08X (%.1f MHz)\n", CCM_CBCMR, frequency);
  Serial.printf(" Pre-fetch is %sabled\n", (FLEXSPI2_AHBCR & FLEXSPI_AHBCR_PREFETCHEN) ? "en" : "dis");
  memory_begin = (uint32_t *)(0x70000000);
  memory_end = (uint32_t *)(0x70000000 + size * 1048576);
  elapsedMillis msec = 0;
  if (!check_fixed_pattern(0x5A698421)) return;
  if (!check_lfsr_pattern(2976674124ul)) return;
  if (!check_lfsr_pattern(1438200953ul)) return;
  if (!check_lfsr_pattern(3413783263ul)) return;
  if (!check_lfsr_pattern(1900517911ul)) return;
  if (!check_lfsr_pattern(1227909400ul)) return;
  if (!check_lfsr_pattern(276562754ul)) return;
  if (!check_lfsr_pattern(146878114ul)) return;
  if (!check_lfsr_pattern(615545407ul)) return;
  if (!check_lfsr_pattern(110497896ul)) return;
  if (!check_lfsr_pattern(74539250ul)) return;
  if (!check_lfsr_pattern(4197336575ul)) return;
  if (!check_lfsr_pattern(2280382233ul)) return;
  if (!check_lfsr_pattern(542894183ul)) return;
  if (!check_lfsr_pattern(3978544245ul)) return;
  if (!check_lfsr_pattern(2315909796ul)) return;
  if (!check_lfsr_pattern(3736286001ul)) return;
  if (!check_lfsr_pattern(2876690683ul)) return;
  if (!check_lfsr_pattern(215559886ul)) return;
  if (!check_lfsr_pattern(539179291ul)) return;
  if (!check_lfsr_pattern(537678650ul)) return;
  if (!check_lfsr_pattern(4001405270ul)) return;
  if (!check_lfsr_pattern(2169216599ul)) return;
  if (!check_lfsr_pattern(4036891097ul)) return;
  if (!check_lfsr_pattern(1535452389ul)) return;
  if (!check_lfsr_pattern(2959727213ul)) return;
  if (!check_lfsr_pattern(4219363395ul)) return;
  if (!check_lfsr_pattern(1036929753ul)) return;
  if (!check_lfsr_pattern(2125248865ul)) return;
  if (!check_lfsr_pattern(3177905864ul)) return;
  if (!check_lfsr_pattern(2399307098ul)) return;
  if (!check_lfsr_pattern(3847634607ul)) return;
  if (!check_lfsr_pattern(27467969ul)) return;
  if (!check_lfsr_pattern(520563506ul)) return;
  if (!check_lfsr_pattern(381313790ul)) return;
  if (!check_lfsr_pattern(4174769276ul)) return;
  if (!check_lfsr_pattern(3932189449ul)) return;
  if (!check_lfsr_pattern(4079717394ul)) return;
  if (!check_lfsr_pattern(868357076ul)) return;
  if (!check_lfsr_pattern(2474062993ul)) return;
  if (!check_lfsr_pattern(1502682190ul)) return;
  if (!check_lfsr_pattern(2471230478ul)) return;
  if (!check_lfsr_pattern(85016565ul)) return;
  if (!check_lfsr_pattern(1427530695ul)) return;
  if (!check_lfsr_pattern(1100533073ul)) return;
  if (!check_fixed_pattern(0x55555555)) return;
  if (!check_fixed_pattern(0x33333333)) return;
  if (!check_fixed_pattern(0x0F0F0F0F)) return;
  if (!check_fixed_pattern(0x00FF00FF)) return;
  if (!check_fixed_pattern(0x0000FFFF)) return;
  if (!check_fixed_pattern(0xAAAAAAAA)) return;
  if (!check_fixed_pattern(0xCCCCCCCC)) return;
  if (!check_fixed_pattern(0xF0F0F0F0)) return;
  if (!check_fixed_pattern(0xFF00FF00)) return;
  if (!check_fixed_pattern(0xFFFF0000)) return;
  if (!check_fixed_pattern(0xFFFFFFFF)) return;
  if (!check_fixed_pattern(0x00000000)) return;
  Serial.printf(" test ran for %.2f seconds\n", (float)msec / 1000.0f);
  Serial.println("All memory tests passed :-)");
  memory_ok = true;
}


///////////////////////////////////////////////////////////////////
// Use memcpy() etc. to do fast reads from / writes to PSRAM,
// with a length that will often cross any page boundary, i.e.
// avoiding the typical multiples of 32 or 1024 bytes. If there's
// an issue, the page start may be corrupted by the end of a write,
// and get picked up by the subsequent read. This won't of course
// cause an issue with the fixed values...
///////////////////////////////////////////////////////////////////
uint32_t reg;

#define BLK_SIZE 255 // 255*uint32_t is 1020 bytes
uint32_t regMulti[BLK_SIZE];

bool new_fail_message(uint32_t* pm, volatile uint32_t *location, int count)
{
  //Serial.printf(" Error at %08X, read %08X but expected %08X\n",
  //  (uint32_t)location, actual, expected);
  Serial.printf("Error at %08X\n",
                (uint32_t)location);
  uint32_t* pr = regMulti;
  //uint32_t* pm = location;
  while (count > 0)
  {
    // print up to 16 words per row; clamp the last row so we never
    // read past the end of regMulti[] or the caller's buffer
    int n = count < 16 ? count : 16;
    Serial.printf("%08X: ", (uint32_t) location);
    for (int i = 0; i < n; i++) Serial.printf("%08X ", pr[i]); // expected
    Serial.print("\n          ");
    for (int i = 0; i < n; i++) Serial.printf("%08X ", pm[i]); // read back
    Serial.print("\n          ");
    for (int i = 0; i < n; i++) Serial.printf("%s ", pm[i] == pr[i] ? "        " : "^^^^^^^^");
    Serial.println();
    count -= n;
    location += n;
    pr += n;
    pm += n;
  }
  return false;
}

///////////////////////////////////////////////////////////////////
// fill the entire RAM with a fixed pattern, then check it
///////////////////////////////////////////////////////////////////
void nextRegFixed(uint32_t pattern)
{
  for (int i = 0; i < BLK_SIZE; i++) regMulti[i] = pattern;
}


bool check_fixed_pattern(uint32_t pattern)
{
  volatile uint32_t *p;
  Serial.printf("testing with fixed pattern %08X\n", pattern);

  p = memory_begin;
  nextRegFixed(pattern); // do once, value is fixed

  while (p < memory_end)
  {
    if (memory_end - p > BLK_SIZE)
    {
      memcpy((void*) p, regMulti, sizeof regMulti);
      p += sizeof regMulti / sizeof * p;
    }
    else
    {
      int count = memory_end - p;
      memcpy((void*) p, regMulti, count * sizeof * p);
      p += count;
    }
  }

  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);

  p = memory_begin;
  while (p < memory_end)
  {
    int cmpres = 999;
    uint32_t memBuff[BLK_SIZE];
    int count = memory_end - p;

    if (count > BLK_SIZE)
    {
      memcpy(memBuff, (void*) p, sizeof memBuff);
      cmpres = memcmp(memBuff, regMulti, sizeof regMulti);
      p += sizeof regMulti / sizeof * p;
      count = BLK_SIZE;
    }
    else
    {
      memcpy(memBuff, (void*) p, count * sizeof * p);
      cmpres = memcmp(memBuff, regMulti, count * sizeof * p);
      p += count;
    }
    if (0 != cmpres) return new_fail_message(memBuff, p - count, count);
    //Serial.printf(" reg=%08X\n", reg);
  }

  return true;
}


///////////////////////////////////////////////////////////////////
// fill the entire RAM with a pseudo-random sequence, then check it
///////////////////////////////////////////////////////////////////
uint32_t nextReg(void)
{
  uint32_t retval = reg;
  for (int i = 0; i < 3; i++) {
    // https://en.wikipedia.org/wiki/Xorshift
    reg ^= reg << 13;
    reg ^= reg >> 17;
    reg ^= reg << 5;
  }
  return retval;
}


void nextRegMulti(void)
{
  for (int i = 0; i < BLK_SIZE; i++)
    regMulti[i] = nextReg();
}


bool check_lfsr_pattern(uint32_t seed)
{
  volatile uint32_t *p;

  Serial.printf("testing with pseudo-random sequence, seed=%u\n", seed);
  reg = seed;
  p = memory_begin;
  while (p < memory_end)
  {
    nextRegMulti();
    if (memory_end - p > BLK_SIZE)
    {
      memcpy((void*) p, regMulti, sizeof regMulti);
      p += sizeof regMulti / sizeof * p;
    }
    else
    {
      int count = memory_end - p;
      memcpy((void*) p, regMulti, count * sizeof * p);
      p += count;
    }
  }

  arm_dcache_flush_delete((void *)memory_begin,
                          (uint32_t)memory_end - (uint32_t)memory_begin);

  reg = seed;
  p = memory_begin;
  while (p < memory_end)
  {
    int cmpres = 999;
    uint32_t memBuff[BLK_SIZE];
    int count = memory_end - p;

    nextRegMulti();
    if (count > BLK_SIZE)
    {
      memcpy(memBuff, (void*) p, sizeof memBuff);
      cmpres = memcmp(memBuff, regMulti, sizeof regMulti);
      p += sizeof regMulti / sizeof * p;
      count = BLK_SIZE;
    }
    else
    {
      memcpy(memBuff, (void*) p, count * sizeof * p);
      cmpres = memcmp(memBuff, regMulti, count * sizeof * p);
      p += count;
    }
    if (0 != cmpres) return new_fail_message(memBuff, p - count, count);
    //Serial.printf(" reg=%08X\n", reg);
  }
  return true;
}


void loop()
{
  digitalWrite(13, HIGH);
  delay(100);
  if (!memory_ok) digitalWrite(13, LOW); // rapid blink if any test fails
  delay(100);
}
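
The durations in the header comment can be turned into a speed gain with a little arithmetic. A minimal sketch, taking the four timings at face value; `reductionPercent` is a name made up for this example and is not part of the test sketch:

```cpp
// Percentage reduction in run time when going from `baseline` seconds
// (prefetch off) to `improved` seconds (prefetch on).
// Hypothetical helper, for illustration only.
constexpr double reductionPercent(double baseline, double improved)
{
  return 100.0 * (baseline - improved) / baseline;
}

// Timings copied from the header comment of the test sketch (seconds, 16MB):
//   105.6 MHz: 72.63 off, 61.16 on
//    88.0 MHz: 82.49 off, 70.20 on
constexpr double gainAt105 = reductionPercent(72.63, 61.16);
constexpr double gainAt88  = reductionPercent(82.49, 70.20);
```

Both values come out at around 15%, for the whole test including the time spent generating the patterns.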
 
Just the 16MB one, I believe from my testing. The datasheets are a bit coy on the subject, but I think the 8MB ones can do a burst read through a page boundary, provided the /CE is only held asserted (low) for <8μs.
 
Done some work on this, and it appears a limited version of the pre-fetch code could be used to get a 7% speed improvement even with the ISSI chip. The original PSRAM can get to 15% faster, so maybe detection at startup would be the way to go.
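
To make the "detection at startup" idea concrete, here is a rough sketch of the decision only. Everything in it is an assumption for illustration: the `PrefetchMode` enum and `choosePrefetchMode()` are hypothetical names, and it guesses that a 16MB chip is the ISSI part purely from its size. The two size arguments correspond to `external_psram_size` and `FLEXSPI2_FLSHA1CR0 >> 10` as used in the test sketch above; the actual register settings are not shown.

```cpp
#include <cstdint>

// Hypothetical prefetch policies, named after the gains measured in testing
enum class PrefetchMode { Off, Restricted, Full };

// Pick a prefetch policy from the detected PSRAM sizes (in MBytes).
// ASSUMPTION: any chip of 16MB or more is treated as the ISSI part, which
// needs the restricted setting; smaller chips get the full setting.
PrefetchMode choosePrefetchMode(uint8_t totalMB, uint8_t firstChipMB)
{
  if (totalMB == 0 || firstChipMB > totalMB) return PrefetchMode::Off;
  const unsigned secondChipMB = (unsigned)totalMB - firstChipMB;
  if (firstChipMB >= 16 || secondChipMB >= 16) return PrefetchMode::Restricted;
  return PrefetchMode::Full;
}
```

With this policy an 8MB or 8+8MB board would keep the full (roughly 15%) gain, and a board with any 16MB chip would fall back to the restricted (roughly 7%) setting.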

Only worth thinking further about once Teensyduino 1.60 development starts getting bandwidth again ... even then, maybe it's a 1.61 feature ... so maybe 2027, based on recent release cycle timings.
 
After a number of frustrating delays, these 16MB PSRAM parts are finally available. The chip mfr had to do a chip run to complete the order which took longer than they originally planned.

ProtoSupplies.com will add these as additional order options to our various custom versions of the Teensy 4.1 with either 32MB PSRAM or 16MB PSRAM + 256MB/2Gb Flash configurations added.

For now, the maximum quantity of raw chips that can be ordered by DIY’ers will be limited to quantity 4 until long-term supply of the parts is better understood. Pricing is initially set at $5.95/ea. I believe SparkFun will also be offering these parts which will hopefully help increase overall availability.

For those that don’t really need the extra memory, the current 8MB PSRAM chips are generally the better option. They are lower cost per byte and can be pushed to higher speeds (166MHz vs 120MHz) for any speed demons out there. When mated with a Flash chip, that limits the bus speed to about 132MHz for the 8MB PSRAM and still 120MHz for the 16MB PSRAM. If you need 16MB PSRAM and no Flash, two 8MB parts are cheaper and potentially faster.

Keep in mind that the 16MB PSRAM parts will require the use of the current Teensyduino 1.60 Beta 5 or later release.

If any issues related to these specific 16MB PSRAM parts are identified, please post to this thread to give visibility. @PaulStoffregen and @h4yn0nnym0u5e have setups to help investigate and @defragster will have a setup soon as well.

Bare chip link
Teensy Fully Loaded – Breadboard compatible version of Teensy 4.1
Teensy Fully Loaded for Prototyping System – PCB Baseboard compatible version of Teensy 4.1
International Customer Info
 
Looks like I have another reason to wrap up a 1.60 release "soon".

Is there any lingering config issue for these 16MB parts? I seem to recall something about prefetch, but it's not currently on my radar.
 
I delayed as long as I could!

@h4yn0nnym0u5e and @jmarsh were working on the prefetch stuff as noted above starting in post #127 and captured in https://github.com/PaulStoffregen/cores/pull/708. Prefetch is a nice speed improvement, but the 16MB parts require some extra hand-holding which h4yn0nnym0u5e seems to have a handle on. Probably not a must have for the 1.60 release unless it is a low risk drop-in.

I am not aware of any other lingering issues. I am testing all installed parts at 120MHz with no issues noted, so the new 105.6MHz QSPI bus speed seems fine.
 
I wouldn't want to see the regular 8MB PSRAM unconditionally lose half of its potential prefetch speed gain if the PR was merged. I believe there are better ways of handling it.
 
I believe there are better ways of handling it.
Yes indeed. My PR on your PR shows the register settings needed to keep the ISSI part happy, but they’re macro-controlled rather than based on detecting the part at run-time, which would be the correct solution. I was aiming to let you make those changes to your satisfaction, rather than more-or-less stealing your PR from you :)

Note that an updated version of the PSRAM memory test is needed to verify correct operation with the ISSI part and prefetch enabled. The original test passes even with the prefetch at the too-large setting.
 
@KenHahn sent a dual 16MB PSRAM T_4.1 that should arrive in the next 12-24 hours.

I noted a planned alternate test here, and to Ken: run the PJRC PSRAM test's fill on the #1 16MB chip, copy #1 to #2, then run the verify on the #2 16MB chip. That could show if something was odd.

Note that an updated version of the PSRAM memory test is needed to verify correct operation
Reading the linked PR #4 ...

It suggests that the above "test", doing the "copy #1>>#2" in a way that is "fast (using memcpy), and an awkward length", might trigger issues?

Question: is there a single boundary of concern across the 8MB of the 16MB PSRAM, or is there another factor involved?
And Q2: will this be observed as a FAULT or as data corruption?
And Q3: will there be a boundary issue going from chip #1 16MB to chip #2 16MB?

Other feedback on the 'copy test' is welcome. Would copying [end to start] keep the cache from helping, without an intervening cache_flush_delete?

32MB is 33554432 bytes.
A 288-byte transfer, then 32864 transfers of 1021 bytes, would move them all and would cross the 8MB boundary with an odd read.

Is the prime 1021 a good awkward number?
Or is it better to start with an ODD-length transfer and then only transfer an EVEN number of bytes, so that no transfer ever starts on an EVEN byte address (until the last byte)?
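
The arithmetic of that schedule can be checked on a PC before trying it on hardware. A minimal sketch: the constants are the numbers from this post, while `scheduleStraddles()` is a hypothetical helper that only reports whether some transfer spans a given byte offset. It says nothing about how the PSRAM behaves.

```cpp
#include <cstdint>

constexpr uint32_t kTotalBytes = 32u * 1024u * 1024u; // 33554432
constexpr uint32_t kFirstChunk = 288;                  // initial short transfer
constexpr uint32_t kChunk      = 1021;                 // prime, "awkward" length
constexpr uint32_t kChunkCount = 32864;

// The schedule must cover the whole 32MB exactly
static_assert(kFirstChunk + kChunkCount * kChunk == kTotalBytes,
              "schedule does not add up to 32MB");

// True if some transfer in the schedule starts below `boundary` and
// ends above it, i.e. a single transfer spans that byte offset.
bool scheduleStraddles(uint32_t boundary)
{
  uint32_t start = 0;
  uint32_t len = kFirstChunk;
  while (start < kTotalBytes) {
    const uint32_t end = start + len; // one past the last byte moved
    if (start < boundary && end > boundary) return true;
    start = end;
    len = kChunk;
  }
  return false;
}
```

For example, `scheduleStraddles(8u * 1024u * 1024u)` and `scheduleStraddles(16u * 1024u * 1024u)` both return true, so the 8MB boundary and the chip #1 to chip #2 boundary each fall in the middle of a 1021-byte transfer.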
 
For the 8MB parts, the datasheet says:
[image attachment: excerpt from the 8MB part's datasheet]


For the ISSI 16MB parts, we're told:
[image attachment: excerpt from the ISSI 16MB part's documentation]


So the stress test needs to ensure that there are plenty of burst reads or writes which could cross the 1024-byte page boundaries. The original PSRAM test doesn't appear to trigger this, even with arbitrary-length prefetch enabled, because it only stores and checks one uint32_t at a time. My updated test pre-computes 255 values, so 1020 bytes, then copies to / from PSRAM using memcpy(), which with luck triggers a burst write / read, in turn causing prefetch to be used. Roughly 254 of every 256 accesses will cross a page boundary (only blocks starting at page offset 0 or 4 fit inside one page), each with a different overlap amount.
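
The fraction of blocks that cross a page boundary can be counted directly. A small host-side sketch, assuming the blocks are laid end to end from a page-aligned start as they are in the test; `countPageCrossings()` is a name invented for this example:

```cpp
#include <cstdint>

// Count how many consecutive blocks of `blockBytes`, laid end to end from
// offset 0, touch more than one page of `pageBytes`. Only whole blocks
// that fit inside `totalBytes` are counted.
uint32_t countPageCrossings(uint32_t totalBytes, uint32_t blockBytes,
                            uint32_t pageBytes)
{
  uint32_t crossings = 0;
  for (uint32_t start = 0; start + blockBytes <= totalBytes; start += blockBytes) {
    const uint32_t firstPage = start / pageBytes;
    const uint32_t lastPage  = (start + blockBytes - 1) / pageBytes;
    if (firstPage != lastPage) crossings++;
  }
  return crossings;
}
```

Over one full cycle of 256 blocks, `countPageCrossings(256 * 1020, 1020, 1024)` gives 254, which is close to the figure quoted above. By contrast, aligned 4-byte accesses (`countPageCrossings(4096, 4, 1024)`) never cross a page, which fits with the original one-word-at-a-time test not provoking the problem.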

Actual testing shows:
  • without prefetch, both test programs pass both PSRAM sizes
  • with prefetch according to PR#708 (15% speed gain)
    • Paul's original test program passes both PSRAM sizes
    • my revised test program passes 8MB PSRAM, and fails the 16MB one
  • with a more restricted prefetch setting (7% speed gain), my revised test program passes both PSRAM sizes
Whether an even more rigorous test could provoke errors with the restricted prefetch I don't know - for example, the test doesn't attempt DMA access to the PSRAM. It's definitely something to keep a close eye on, and have a moderately simple way for users to turn off prefetch and set the bus speed.

Given the fact that 16MB parts are only just becoming available, it may be premature to adopt the prefetch PR, which is a bit unfair considering it's been on GitHub for over two years, and so became available during the Teensyduino 1.59 beta campaign, let alone 1.60 ...
 
Re p#143: Indeed, PJRC first released 8MB PSRAM (and FLASH) support with functional, safe, conservative settings.

Doing the same to allow 16MB to be "usable" would be the best starting point. Users can edit as needed before the next release if it works for them.

If that is the state of the current 1.60 beta 5, then that just needs to be confirmed with the arrival of a new batch of production 16MB chips.

If a new 'improved' standard for 8MB is proven, maybe it simply isn't used when a 16MB part is detected?
 
Quick test with the p#127 code on two T_4.1s, with 16MB and 16+16MB PSRAM installed by @KenHahn:


EXTMEM Memory Test, 32 MByte (16+16)
CCM_CBCMR=95AE8304 (105.6 MHz)
Pre-fetch is disabled
...
test ran for 145.28 seconds
All memory tests passed :)


EXTMEM Memory Test, 16 MByte (16+0)
CCM_CBCMR=95AE8304 (105.6 MHz)
Pre-fetch is disabled
...
test ran for 72.55 seconds
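
For anyone wondering how the CCM_CBCMR value in these logs turns into 105.6 MHz, the decoding can be pulled out of setup() in the test sketch and run on its own. A minimal sketch using the same clock table and bit fields as that code; the function name `flexspi2ClockMHz` is made up here:

```cpp
#include <cstdint>

// Decode the FlexSPI2 clock in MHz from a CCM_CBCMR register value, using
// the same source table and bit fields as setup() in the test sketch.
float flexspi2ClockMHz(uint32_t cbcmr)
{
  const float clocks[4] = {396.0f, 720.0f, 664.62f, 528.0f};
  const uint32_t sel = (cbcmr >> 8) & 3;        // clock source select
  const uint32_t div = ((cbcmr >> 29) & 7) + 1; // divider is field value + 1
  return clocks[sel] / (float)div;
}
```

For 0x95AE8304 the source select is 3 (528 MHz) and the divider field is 4, so the result is 528 / 5 = 105.6 MHz. A divider field of 5 would give 528 / 6 = 88.0 MHz, the other speed listed in the test's header comment.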
 
As long as the non-prefetch or restricted prefetch keeps up with DMA, for my project I am fine with it.
I do various tricks with DMA and SPI buffers to allow the cycles to tolerate any bursts/stalls on the PSRAM anyway.
This is kind of why a few bytes of buffering exist on DMA and SPI: there are going to be a few collisions with the CPU's RAM accesses on the crossbars anyway if you access PSRAM from the CPU, and the same holds true for the different sections of RAM.

Mostly what I have tested is a set of DMA channels doing a circular read-modify-write cycle, where the reads from PSRAM are pushed to SPI, and writes are pulled from SPI to PSRAM, as a digital sample-modification system. Basically the result is that the previous read caches the area for the next write.
IIRC there are 8 buffers available on the PSRAM controller, which is more than enough for the end goal of 2 SPI streams doing circular RMW. It has been able to keep up with 1MByte/second (8MHz) quite easily, and I usually only run at half of that speed, 500KBytes/sec (4MHz), as the default.
For what I am doing, the normal data stream is approximately 250kbits/sec, so 4MHz is actually a 16x over-sample rate, and a 2MHz 8x over-sample rate would be ample in most cases. Sometimes the data stream can get up to 1MHz rates, and that's still fine, since it is 4x over-sample.
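
The rates above are simple ratios, written out here so they can be checked at compile time. A small sketch; the helper names are invented for this example, and it assumes one SPI clock per bit:

```cpp
#include <cstdint>

// SPI clock (Hz) needed to move `bytesPerSec`, assuming one clock per bit
constexpr uint32_t spiClockForBytesPerSec(uint32_t bytesPerSec)
{
  return bytesPerSec * 8u;
}

// Over-sample ratio of an SPI clock relative to the data stream's bit rate
constexpr uint32_t oversampleRatio(uint32_t spiClockHz, uint32_t dataBitsPerSec)
{
  return spiClockHz / dataBitsPerSec;
}

static_assert(spiClockForBytesPerSec(1000000) == 8000000, "1MByte/s is 8MHz");
static_assert(spiClockForBytesPerSec(500000) == 4000000, "500KBytes/s is 4MHz");
static_assert(oversampleRatio(4000000, 250000) == 16, "4MHz over 250kbit/s");
static_assert(oversampleRatio(2000000, 250000) == 8, "2MHz over 250kbit/s");
static_assert(oversampleRatio(4000000, 1000000) == 4, "4MHz over a 1MHz stream");
```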
 