Additional PSRAM ID that works plus goodies

Not sure how the 120 MHz setting got into the startup code - I shared results with Ken, and we both saw 130 MHz fail but no failures at 120 MHz on working chips. Ken did find some chips to be bad in an OOB test he ran - a higher percentage than he ever saw with 8MB chips, IIRC.
I have been using 120 MHz in all the chip testing I do lately, to gain confidence that the new 16MB parts can all handle it. That does seem to be about the max speed they are happy at when using 2 chips. With 1 chip, it tested OK up to 144 MHz but failed at 166 MHz. With a Flash chip added, I think it topped out at 132 MHz, which is about the max for most Flash chips anyway.

For the record, I have found 1 bad 16MB chip out of the roughly 75 I have tested. It happened within the first few I tested, which raised some concerns, but I haven't seen another failure since, so hopefully it was just an outlier. With the 8MB parts I have had only 1 bad chip out of a couple thousand tested.
 
Can one of you please clarify what speeds you are getting (MB/s) for PSRAM read and write with default clock per TD 1.60? Are these different for 16MB chips versus 8MB chips?
 
They're not going to be different between chips, since the FlexSPI clock speed sets the transfer rate - i.e. there's no non-deterministic behaviour such as polling involved.
 
some of the PDF verbiage caused wonder? ... some boundary question.
Yes, the page boundary differences can definitely cause issues, but not with TD1.60. I'm glad you've done some independent testing of that.

I’ve been playing with prefetch and DMA memory-to-memory transfers today, but more work is needed before reaching a proper conclusion about what’s safe to enable. So far I haven’t got to interleaving transfers with computing the next buffer, so DMA is a lot slower because of the need to mess with the cache.
 
OK, here are some results based on a revised version of Paul's original test program, combined with an adapted version of @jmarsh's PR#708. As previously noted, the differing constraints on page boundaries between 8MB and 16MB parts mean that the AHB prefetch has to be reined in a bit with the latter, so you can't reliably achieve the same speed gains.

The revised test gives some information about actual memory transfer speeds, which can't usefully be inferred from the overall test duration because of test overheads. The theoretical speed may be a bit suspect - I tried to calculate it based on addressing overhead and prefetch burst length, but it may well be wrong.
| Size [MB] | Speed [MHz] | Prefetch | Block size [bytes] | To mem [MB/s] | From mem [MB/s] | DMA to [MB/s] | DMA from [MB/s] | Tests | Test duration [sec] | CPU speed [MB/s] | DMA speed [MB/s] | Test speed [MB/s] | Theoretical speed [MB/s] | Versus base: per block | Versus base: overall test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 105.6 | FALSE | 2044 | 34.27 | 34.54 | 15.49 | 8.67 | 57 | 77.38 | 34.41 | 12.08 | 23.57 | 43.32 | - | - |
| 16 | 120 | TRUE | 2044 | 40.5 | 47.05 | 16.94 | 30.64 | 57 | 65.31 | 43.78 | 23.79 | 27.93 | 49.23 | 27.2% | 18.5% |
| 16 | 105.6 | TRUE | 2044 | 34.49 | 41.66 | 15.47 | 29.91 | 57 | 71.03 | 38.08 | 22.69 | 25.68 | 43.32 | 10.7% | 8.9% |
| 16 | 120 | FALSE | 2044 | 40.1 | 38.66 | 16.94 | 9.41 | 57 | 72.35 | 39.38 | 13.18 | 25.21 | 49.23 | 14.5% | 7.0% |
| 8 | 105.6 | TRUE | 2044 | 34.49 | 51.8 | 15.47 | 36.93 | 57 | 31.7 | 43.15 | 26.20 | 28.77 | 43.32 | 25.4% | 22.1% |
| 8 | 120 | TRUE | 2044 | 41 | 58.84 | 16.94 | 37 | 57 | 29.22 | 49.92 | 26.97 | 31.21 | 49.23 | 45.1% | 32.4% |

The revised test creates blocks of test data in Teensy's RAM, then copies them to and from PSRAM using either memcpy() or DMA. Only a couple of the 57 tests are set to use DMA. The blocks are 2044 bytes long, so they actually cross two page boundaries; it doesn't seem to make a lot of difference, and can easily be changed:
C++:
//#define BLK_SIZE 255 // 255*uint32_t is 1020 bytes
#define BLK_SIZE 511 // 511*uint32_t is 2044 bytes

At the moment the prefetch is similarly controlled by macro definitions:
C++:
#define USE_PREFETCH
#define LIMIT_PREFETCH_SIZE    // needed for ISSI 16MB part
If you have a 16MB part, and define USE_PREFETCH but do not define LIMIT_PREFETCH_SIZE, then the test will fail. The 8MB part seems fine with using unlimited prefetch, and thus gives the maximum possible PSRAM speed.

I confess I don't understand why the DMA copy speed is so poor - I wouldn't have expected contention for memory bandwidth would be that significant, but maybe it is. Maybe a proper DMA guru can enlighten us!
 
Wondering if the DMA - while slower in transfer - opens up any more usable CPU time for a net gain in some way?

That is, how much work could "other code" do during the test's "while (!copyIsComplete()) {}" loop - not using PSRAM during those 2 seconds - without interfering with the DMA transfer?
Code:
bool copyIsComplete(void)
{
  bool result = true;
  if (usingDMA)
  {
    if (copyDMA.complete())     // the quoted snippet was truncated here;
      copyDMA.clearComplete();  // a plausible completion: clear the DMA flag
    else
      result = false;           // DMA still running
  }
  return result;
}

Odd there would be contention when it is not doing anything but waiting?
 
I confess I don't understand why the DMA copy speed is so poor - I wouldn't have expected contention for memory bandwidth would be that significant, but maybe it is. Maybe a proper DMA guru can enlighten us!
What is the write size (set by ATTR_DST) ?
The column headings don't really make the direction clear, "to mem", "from mem", "DMA to", "DMA from" are all ambiguous when you're copying between two different memory regions.
 
I just checked, as intended ATTR_DST is 2, i.e. it's writing 32 bits per transfer.

Yes, sorry, it was based on a hasty spreadsheet I made to record my results, as I was beginning to lose track. "To mem" means "using memcpy() from RAM buffer to PSRAM", "from mem" is "memcpy from PSRAM to RAM buffer", "DMA to" is "DMA RAM to PSRAM", and "DMA from" is "DMA PSRAM to RAM".

Wondering if the DMA - while slower in transfer - opens up any more usable CPU time for a net gain in some way?
Oh absolutely, yes. I'd sort of intended to do that by having two RAM buffers, and computing or comparing one while the other was being used for DMA. But the DMA results are so poor there didn't seem to be any merit in trying that. But it does sort of confirm that using PSRAM as a screen buffer isn't that bad, if you're short of RAM; the speed penalty for graphics operations is no worse than ~50%, and async updates are dominated by SPI speed anyway.
 
64 bits per transfer may be better since it's the burst size?

I find using PSRAM for a framebuffer really depends on the refresh rate, it generally can't handle anything beyond 640x480 (25MHz pixel clock).
 
Here is the table by @h4yn0nnym0u5e with the modified labels (rows in order of increasing speed).

| Size [MB] | Speed [MHz] | Prefetch | Block size [bytes] | memcpy RAM to PSRAM [MB/s] | memcpy PSRAM to RAM [MB/s] | DMA RAM to PSRAM [MB/s] | DMA PSRAM to RAM [MB/s] | Tests | Test duration [sec] | CPU speed [MB/s] | DMA speed [MB/s] | Test speed [MB/s] | Theor. speed [MB/s] | Versus base: per block | Versus base: overall test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 105.6 | FALSE | 2044 | 34.27 | 34.54 | 15.49 | 8.67 | 57 | 77.38 | 34.41 | 12.08 | 23.57 | 43.32 | - | - |
| 16 | 120 | FALSE | 2044 | 40.1 | 38.66 | 16.94 | 9.41 | 57 | 72.35 | 39.38 | 13.18 | 25.21 | 49.23 | 14.5% | 7.0% |
| 16 | 105.6 | TRUE | 2044 | 34.49 | 41.66 | 15.47 | 29.91 | 57 | 71.03 | 38.08 | 22.69 | 25.68 | 43.32 | 10.7% | 8.9% |
| 16 | 120 | TRUE | 2044 | 40.5 | 47.05 | 16.94 | 30.64 | 57 | 65.31 | 43.78 | 23.79 | 27.93 | 49.23 | 27.2% | 18.5% |
| 8 | 105.6 | TRUE | 2044 | 34.49 | 51.8 | 15.47 | 36.93 | 57 | 31.7 | 43.15 | 26.20 | 28.77 | 43.32 | 25.4% | 22.1% |
| 8 | 120 | TRUE | 2044 | 41 | 58.84 | 16.94 | 37 | 57 | 29.22 | 49.92 | 26.97 | 31.21 | 49.23 | 45.1% | 32.4% |
 
64 bits per transfer may be better since it's the burst size?
Hmmm ... getting late here, but I tried a 1020-byte block, i.e. 255*uint32_t values; that lets me set NBYTES=4, 12, 20 or 68 (since 255=1*3*5*17), and divide CITER and BITER by the same factor. (The last block isn't full size, so it has to stick with NBYTES=4.) So the transfers are 32, 96, 160 or 544 bits. There's a very small speed improvement - with NBYTES=68, the DMA copy from PSRAM to RAM goes from 8.67MB/s to 9.75MB/s. Useful, but not earth-shattering, and your transfer size does have to be a non-prime!

@joepasquariello, I think I just spotted a minor error in post#160. You've ranked by overall test speed, but I believe the better ranking is by "CPU speed" aka average of memcpy() read and write speeds. That eliminates the test overhead effect.
 
Okay. Here is the table re-ranked as suggested:
I believe the better ranking is by "CPU speed" aka average of memcpy() read and write speeds. That eliminates the test overhead effect.

| Size [MB] | Speed [MHz] | Prefetch | Block size [bytes] | memcpy RAM to PSRAM [MB/s] | memcpy PSRAM to RAM [MB/s] | DMA RAM to PSRAM [MB/s] | DMA PSRAM to RAM [MB/s] | Tests | Test duration [sec] | CPU speed [MB/s] | DMA speed [MB/s] | Test speed [MB/s] | Theor. speed [MB/s] | Versus base: per block | Versus base: overall test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 105.6 | FALSE | 2044 | 34.27 | 34.54 | 15.49 | 8.67 | 57 | 77.38 | 34.41 | 12.08 | 23.57 | 43.32 | - | - |
| 16 | 105.6 | TRUE | 2044 | 34.49 | 41.66 | 15.47 | 29.91 | 57 | 71.03 | 38.08 | 22.69 | 25.68 | 43.32 | 10.7% | 8.9% |
| 16 | 120 | FALSE | 2044 | 40.1 | 38.66 | 16.94 | 9.41 | 57 | 72.35 | 39.38 | 13.18 | 25.21 | 49.23 | 14.5% | 7.0% |
| 16 | 120 | TRUE | 2044 | 40.5 | 47.05 | 16.94 | 30.64 | 57 | 65.31 | 43.78 | 23.79 | 27.93 | 49.23 | 27.2% | 18.5% |
| 8 | 105.6 | TRUE | 2044 | 34.49 | 51.8 | 15.47 | 36.93 | 57 | 31.7 | 43.15 | 26.20 | 28.77 | 43.32 | 25.4% | 22.1% |
| 8 | 120 | TRUE | 2044 | 41 | 58.84 | 16.94 | 37 | 57 | 29.22 | 49.92 | 26.97 | 31.21 | 49.23 | 45.1% | 32.4% |
 
Hmmm ... getting late here, but I tried a 1020-byte block, i.e. 255*uint32_t values; that lets me set NBYTES=4, 12, 20 or 68 (since 255=1*3*5*17), and divide CITER and BITER by the same factor. (The last block isn't full size, so it has to stick with NBYTES=4.) So the transfers are 32, 96, 160 or 544 bits. There's a very small speed improvement - with NBYTES=68, the DMA copy from PSRAM to RAM goes from 8.67MB/s to 9.75MB/s. Useful, but not earth-shattering, and your transfer size does have to be a non-prime!
I actually meant changing ATTR_DST to 3, so that 8 bytes at a time are written to PSRAM (matching the burst length).
 