Call to arms | Teensy + SDRAM = true

Rezo · Feb 16, 2024

I have a library that supports 8080 for the ili948x
Uses FlexIO and DMA. So it has no cpu hold up.
You can use that with the camera

jmarsh · Feb 16, 2024

Rezo said:
I have a library that supports 8080 for the ili948x
Uses FlexIO and DMA. So it has no cpu hold up.
You can use that with the camera

But that is using up memory space and bandwidth for a framebuffer. If the display was connected to the SEMC the CPU could write directly to the display's memory. All you would have to do is put values in RAM and the corresponding pixels would appear on the screen, no SDRAM even required.

Rezo · Feb 16, 2024

Right so piping it directly to the display without the need for a ram buffer.
It can be done, but you’d need a new dev board for that.
I have seen many implementations of LCDs on memory driver on the STM32 platform.

KurtE · Feb 17, 2024

KurtE said:
Quick update on the MicroMod board update, that I am not sure yet if I will order yet or not...

Decided to hold off on that for now... Did lay out quick and dirty semi-shield for the Sparkfun ATP board
Slight downside is that two Teensy pins are not available (22, 28), but...

The only real components to solder in are the SDIO connector and resistor for backlight...
Everything else is simple through hole connectors... Ordered a set from OSHPARK

Rezo · Feb 20, 2024

@PaulStoffregen I have been playing around with the eLCDIF and a 24bit 800*480 RGB display using my experimental eLCDIF_t4 library

My first experiment was to display a static image loaded into SDRAM - that worked fine.

My next experiment was to bind it to LVGL and see what a GUI performance looks like - this is where I hit some issues.
While for the most part it "worked", I could see that there was a caching issue and I was getting artifacts on the screen, even though I was flushing the cache before each buffer write.

I went back to the eLCDIF application notes and also had a look at the L1 cache application notes, and both mentioned that the eLCDIF (and PXP) need a no cache region of SDRAM, and this is the reason why the EVK board has 2MB of no cache SDRAM - for the LCD and PXP.

So I added the following lines to startup.c, creating 4MB of no cache memory in SDRAM, at the end of the stack

Code:

SCB_MPU_RBAR = 0x81C00000 | REGION(i++); // SEMC: SDRAM, NO CACHE, starts at 0x81C00000, i.e. at 28MB
SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_4M;

I then used a function to allocate and align two screen sized frame buffers to that region and voila, it works!
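For readers following along, here is a minimal sketch of what such an allocation function could look like. It is not the code from the eLCDIF_t4 library: the `NoCacheArena` type, the 64-byte alignment, and the 4 bytes per pixel are all assumptions chosen for illustration. Only the base address and the 4MB size come from the MPU lines above.

```cpp
#include <cassert>
#include <cstdint>

// Taken from the MPU setup above: 4 MB of non-cached SDRAM at 0x81C00000.
constexpr uint32_t kNoCacheBase = 0x81C00000u;
constexpr uint32_t kNoCacheSize = 4u * 1024u * 1024u;

// Hypothetical bump allocator over the non-cached window.
struct NoCacheArena {
  uint32_t next = kNoCacheBase;  // next free address in the window

  // Returns the address of a block of `bytes`, aligned to `align`
  // (a power of two), or 0 if the window is exhausted.
  uint32_t alloc(uint32_t bytes, uint32_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uint32_t start = (next + align - 1) & ~(align - 1);
    uint32_t end = kNoCacheBase + kNoCacheSize;
    if (start > end || bytes > end - start) return 0;
    next = start + bytes;
    return start;
  }
};

// Bytes needed for one frame at a given resolution and pixel size.
constexpr uint32_t frameBytes(uint32_t w, uint32_t h, uint32_t bytesPerPixel) {
  return w * h * bytesPerPixel;
}
```

With 800*480 pixels at 4 bytes each, one frame is 1,536,000 bytes, so two frames fit in the 4MB window and a third does not. On the Teensy the returned address would be cast to a pixel pointer.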

Is this something you would be willing to add to cores in the next release of TD?
Or perhaps make configure_cache WEAK so that it can be overridden by the user to add the no cache region?

KurtE · Feb 20, 2024

Rezo said:
Is this something you would be willing to add to cores in the next release of TD?
Or perhaps make configure_cache WEAK so that it can be overridden by the user to add the no cache region?

For what it's worth, back in the T4 beta cycle, I was running into issues with the caching getting in the way.
Sometimes the cache flush and/or delete could help, other times a royal PIA...

That is why I found it sort of amusing that we used the keyword DMAMEM for the upper memory, as in many ways that
is the worst memory to use for DMA. Although now there are other equally bad (or worse) regions.

I thought at the time, it would be nice if the tools menu had a memory menu item that allowed you to maybe adjust this. But...

I sort of like the idea of, at a minimum, making the function weak so that it can be overridden!
Would be nice if one could then put an override in a variant. Although I am not expecting variant support to be put into the
official code base.

From Paul's last post here.

PaulStoffregen said:
I'm currently working on improvements to Teensy Loader to properly handle more than 1 Teensy. So please understand I probably won't make time for quite a while to review pull requests for new features or anything that isn't an immediate problem.

My guess is it might be a while before he might pull in a simple change such as you propose...
But could be wrong.

KurtE · Feb 20, 2024

Rezo said:
While for the most part it "worked", I could see that there was a caching issue and I was getting artifacts on the screen, even though I was flushing the cache before each buffer write.

Forgot to ask/mention in previous post. When you said before each buffer write: not sure what you meant here.
That is, what does "Before" mean?
If it was before your code updated the buffer to the new contents, then flushing before this would not help.
If the Before was before the write to the device (i.e. before the DMA or logical DMA), then this would be the right time.

Have similar issues with things like doing continuous DMA operations within ILI9341_t3n and needing to flush the cache before each
frame starts to output...

Rezo · Feb 20, 2024

@KurtE "Before" means before I set the LCDIF_NEXT_BUF register to point to the next frame the eLCDIF needs to push out.
So I call arm_dcache_flush_delete after the frame is rendered and just before I set the eLCDIF LCDIF_NEXT_BUF register
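The ordering question can be illustrated with a toy host-side model. This is only a sketch: `ToyCache` is a made-up stand-in for a write-back data cache, and `deviceRead` stands in for the eLCDIF fetching from memory. A real cache also evicts lines on its own schedule, so stale data on hardware tends to show up as partial artifacts rather than a fully stale frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy write-back cache: CPU writes stay in `dirty` until flushed;
// the display controller (deviceRead) only ever sees `ram`.
struct ToyCache {
  std::vector<uint32_t> ram;
  std::map<size_t, uint32_t> dirty;

  explicit ToyCache(size_t words) : ram(words, 0) {}

  void cpuWrite(size_t i, uint32_t v) {
    assert(i < ram.size());
    dirty[i] = v;
  }

  // Stand-in for a cache flush over the whole buffer.
  void flush() {
    for (const auto &kv : dirty) ram[kv.first] = kv.second;
    dirty.clear();
  }

  uint32_t deviceRead(size_t i) const {
    assert(i < ram.size());
    return ram[i];
  }
};

// Flush first, then render: the new pixel is still only in the cache.
inline uint32_t flushThenRender() {
  ToyCache c(4);
  c.flush();
  c.cpuWrite(0, 0xFFFFFFu);
  return c.deviceRead(0);
}

// Render first, then flush just before handing the buffer to the device.
inline uint32_t renderThenFlush() {
  ToyCache c(4);
  c.cpuWrite(0, 0xFFFFFFu);
  c.flush();
  return c.deviceRead(0);
}
```

In this model only `renderThenFlush()` lets the device see the new pixel, which matches the sequence described here: render, flush, then point LCDIF_NEXT_BUF at the buffer.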

I used the same approach in my ILI948x FlexIO library for the DMA transfers, as the buffers were in DMAMEM, and it worked very well, butter smooth over the 8 bit bus @16 bit color depth

PaulStoffregen · Feb 20, 2024

Rezo said:
Is this something you would be willing to add to cores in the next release of TD?

Let's first make an effort to add it in the SDRAM_t4 library. Maybe disable interrupts, turn off the cache, add that extra region, then turn the caches back on.

That is, if arm_dcache_flush_delete() and arm_dcache_flush() aren't enough...

Rezo · Feb 20, 2024

PaulStoffregen said:
Let's first make an effort to add it in the SDRAM_t4 library. Maybe disable interrupts, turn off the cache, add that extra region, then turn the caches back on.

@mjs513 @defragster think we could do this?
Perhaps we can add a function to the class such as uint32_t setNoCacheRegion(uint8_t size)
Based on the size (we can use a struct to set values) we can return the calculated base address for the non cached region start
Size options would be 1M, 2M, 4M, 8M, 16M, 32M
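A rough sketch of the address calculation such a function might do, under stated assumptions: 32MB of SDRAM mapped at 0x80000000 (matching the begin/end addresses printed by the memory test in this thread), with the uncached window placed at the top. The helper names are hypothetical and nothing here touches the MPU; it only shows the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Assumed memory map: 32 MB of SDRAM at 0x80000000.
constexpr uint32_t kSdramBase = 0x80000000u;
constexpr uint32_t kSdramSize = 32u * 1024u * 1024u;

// Hypothetical helper: base address of an uncached window of `sizeMB`
// megabytes (1, 2, 4, 8, 16 or 32). Returns 0 for any other size.
inline uint32_t noCacheBase(uint32_t sizeMB) {
  bool powerOfTwo = sizeMB != 0 && (sizeMB & (sizeMB - 1)) == 0;
  if (!powerOfTwo || sizeMB > 32) return 0;
  uint32_t bytes = sizeMB * 1024u * 1024u;
  // The ARMv7-M MPU needs the base to be a multiple of the region size.
  // That holds here because the end of SDRAM is a multiple of 32 MB.
  return kSdramBase + kSdramSize - bytes;
}

// ARMv7-M MPU_RASR SIZE field: the region covers 2^(SIZE + 1) bytes.
inline uint32_t rasrSizeField(uint32_t sizeMB) {
  assert(sizeMB != 0 && (sizeMB & (sizeMB - 1)) == 0);
  uint32_t log2Bytes = 20;  // 1 MB = 2^20 bytes
  while (sizeMB > 1) {
    sizeMB >>= 1;
    ++log2Bytes;
  }
  return log2Bytes - 1;
}
```

For a 4MB request this gives 0x81C00000, the same base used in the startup.c experiment, and a SIZE field of 21.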

defragster · Feb 20, 2024

Rezo said:
@mjs513 @defragster think we could do this?
Perhaps we can add a function to the class such as uint32_t setNoCacheRegion(uint8_t size)
Based on the size (we can use a struct to set values) we can return the calculated base address for the non cached region start
Size options would be 1M, 2M, 4M, 8M, 16M, 32M

If Paul suggests that, it would be worth a try to emulate the needed parts of that config function to add that new non-cache region.

As I read the startup.c code for the configure_cache() it seemed like altering the cache/MPU when already set might be 'disallowed'
> // TODO: check if caches already active - skip?

Also, given the way it steps through creating the {i++} regions and then enables them, I wasn't sure that adding one more region later would have the desired effect. Does the last "i" value from "SCB_MPU_RBAR = 0x80000000 | REGION(i++);" need to be known to create the next region?

Rezo · Feb 20, 2024

If we can reset it, and run through the same sequence of region setup, then we should be good - no?
or, we know what the value of i is, as we have 11 regions set up in configure_cache and the next region can be 12.
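An alternative to hard-coding the next index is to look for the first region that is not enabled. The sketch below is a host-side model only: the array stands in for the per-region MPU_RASR values, which on hardware would be read by writing each region number to MPU_RNR and reading MPU_RASR back (not shown). The function name is made up, and the region count would come from the DREGION field of MPU_TYPE.

```cpp
#include <cassert>
#include <cstdint>

// ENABLE is bit 0 of MPU_RASR on ARMv7-M.
constexpr uint32_t kRasrEnable = 1u << 0;

// Returns the index of the first region whose ENABLE bit is clear,
// or -1 if every region is already in use.
inline int firstFreeRegion(const uint32_t *rasr, int regionCount) {
  assert(rasr != nullptr);
  for (int n = 0; n < regionCount; ++n) {
    if ((rasr[n] & kRasrEnable) == 0) return n;
  }
  return -1;
}
```

Region numbers start at zero, so with regions 0 through 10 enabled in a 16-entry table this returns 11.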

PaulStoffregen · Feb 20, 2024

Perhaps better to first ask, does the hardware truly need cache disable? Seems hard to believe.

The effect of the cache isn't even visible outside the ARM core, other than fewer actual accesses to memory. That's why arm_dcache_flush_delete() and arm_dcache_flush() exist, to cause everything you've recently written to the frame buffer to actually write out to the memory the hardware outside the ARM core can see.

Rezo said:
I went back to the eLCDIF application notes and also had a look at the L1 cache application notes, and both mentioned that the eLCDIF (and PXP) need a no cache region of SDRAM, and this is the reason why the EVK board has 2MB of no cache SDRAM - for the LCD and PXP.

Does the hardware truly need this? Or is it really just the driver code NXP publishes lacks cache flushing?

Rezo · Feb 20, 2024

PaulStoffregen said:
Does the hardware truly need this? Or is it really just the driver code NXP publishes lacks cache flushing?

As I mentioned, I tried to flush the cache but it had no effect at all.

I can't confirm nor dismiss that the hardware needs a region with cache disabled, but from all the code examples I have observed (NXP, LVGL) and application notes, they all seem to use a non cached region, but don't have any mention as to why.

EDIT: went through the code examples again, and they are flushing the cache. Not sure where I saw a different implementation, I've been looking at so many examples these last few days.

I'll have another go at it, but in every test I conducted with cache enabled, there were always artifacts, until I set the no cache region.

jmarsh · Feb 20, 2024

It's relatively simple to walk through all the MPU regions and add a new one at the end. Shouldn't need to disable/enable the data cache as long as it's done before the SDRAM/SEMC initialization.
I agree that it should not be necessary though as long as the code is flushing when required; I had a quick look at your git repo and from what I can tell, it uses a simple height*width calculation to decide how much data to flush which isn't taking into account how many bytes are used per pixel i.e. it assumes 8 bits / 1 byte, which means only the top third of the frame would be getting flushed for 24 bpp.

(What would be nice, is having FlushAll/CleanAll functions that simply go through every line in each set/way of the cache rather than the existing functions that only work based on address/size... since the cache is only 32KB, flushing any contiguous amount of data larger than that results in a lot of wasted cycles.)
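To put rough numbers on the wasted cycles, here is a small sketch that counts cache-line operations for each approach. It assumes a 32-byte line (the Cortex-M7 data cache line size) and the 32KB cache size quoted in the post. It counts operations only and says nothing about their individual cost.

```cpp
#include <cstdint>

constexpr uint32_t kLineBytes = 32;          // Cortex-M7 D-cache line size
constexpr uint32_t kCacheBytes = 32 * 1024;  // 32 KB data cache, per the post

// Cache-line operations issued by an address-range flush of `bytes`
// (assumes the buffer starts on a line boundary).
constexpr uint32_t lineOpsByAddress(uint32_t bytes) {
  return (bytes + kLineBytes - 1) / kLineBytes;
}

// Cache-line operations issued by walking every set and way once.
constexpr uint32_t lineOpsBySetWay() {
  return kCacheBytes / kLineBytes;
}
```

For an 800*480 frame at 4 bytes per pixel, the address-range flush issues 48,000 line operations, while a set/way walk would issue 1,024.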

Rezo · Feb 20, 2024

Ugh… I should have multiplied the screen resolution by sizeof(uint32_t) in the flush call..
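A minimal sketch of the corrected size calculation, assuming each pixel is stored as a uint32_t as the sizeof(uint32_t) remark suggests. The `flushBytes` helper is made up for illustration and is not from the library.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes that must be flushed for one full frame.
// PixelT is the type stored per pixel in the frame buffer.
template <typename PixelT>
constexpr size_t flushBytes(size_t width, size_t height) {
  return width * height * sizeof(PixelT);
}
```

For 800*480 this gives 1,536,000 bytes with uint32_t pixels, against 384,000 bytes when only width*height is used. On the Teensy the result would be passed as the size argument of the flush call.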

How did I miss that one?

Will test in a couple of hours - I'm sure that’s going to fix it.

But regardless, the L1 Cache application note does recommend using non cached regions for DMA buffers, specifically with the LCD and PXP

Code:

Always recommended to use non-cacheable regions for DMA buffers. The software can use the MPU to configure a non-cacheable memory region to use as a shared buffer between the CPU and DMA. For example:
• The frame buffer for eLCDIF display
• The input and output buffer for PXP channel

defragster · Feb 20, 2024

Working without cache attention seems called for and beneficial. And only reads suffer when the cache does not cover that region, which is only used for writes? I have posted the NXP note twice: reads drop to ~1/4 speed without cache, while writes drop by 1 MB/sec, from 323 to 322, by their measure.

@Rezo - did you try the Speed scan on the no cap devboard? Ran again here with twin 6.8's and 240 MHz working without any of 5 ReReads showing problems. Without a CAP it seems 206 was generally testing well - not that OC is desired - but if testing it to work allows other parts to work as some prior post suggested 166 didn't keep up with something?

jmarsh · Feb 21, 2024

defragster said:
Working without cache attention seems called for and beneficial. And only reads suffer when the cache does not cover that region, which is only used for writes? I have posted the NXP note twice: reads drop to ~1/4 speed without cache, while writes drop by 1 MB/sec, from 323 to 322, by their measure.

I don't think that's always accurate, especially for non-sequential writes - tile / character based rendering for example, would greatly benefit from writes to separate lines being collected in the cache before being flushed to RAM.

defragster · Feb 21, 2024

jmarsh said:
I don't think that's always accurate, especially for non-sequential writes - tile / character based rendering for example, would greatly benefit from writes to separate lines being collected in the cache before being flushed to RAM.

Interesting - that case in use here would be easy to test.

Rezo · Feb 21, 2024

defragster said:
@Rezo - did you try the Speed scan on the no cap devboard? Ran again here with twin 6.8's and 240 MHz working without any of 5 ReReads showing problems. Without a CAP it seems 206 was generally testing well - not that OC is desired - but if testing it to work allows other parts to work as some prior post suggested 166 didn't keep up with something?

I did not actually.
Which sketch should I run?

defragster · Feb 21, 2024

Rezo said:
I did not actually.
Which sketch should I run?

Posted this just now: https://github.com/mjs513/SDRAM_t4/tree/main/examples/OneScanCap

and running with DevBoard v 4.0 and twin 6.8 pF caps the summary is:

Code:

Test results 57 tests with 5 ReReads:
     At 133 MHz in 157 seconds with 0 read errors
     At 166 MHz in 142 seconds with 0 read errors
     At 196 MHz in 132 seconds with 0 read errors
     At 206 MHz in 130 seconds with 0 read errors
     At 216 MHz in 128 seconds with 0 read errors
     At 227 MHz in 125 seconds with 0 read errors
     At 240 MHz in 123 seconds with 0 read errors
     At 254 MHz in 121 seconds with 378902 read errors (0.0158%)
     At 270 MHz in 119 seconds with 1249975738 read errors (52.2838%)

    SDRAM One Scan CAP test Complete {v1.1}
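A note on reading the percentages: they appear consistent with dividing the error count by 57 tests * 5 rereads * 8,388,608 words (32MB read as 32-bit words). That denominator is inferred from the printed numbers and has not been checked against the sketch source, so treat the following as a sketch under that assumption.

```cpp
#include <cstdint>

// Assumption: every reread covers the full 32 MB as 32-bit words.
constexpr uint64_t kWords = (32ull * 1024 * 1024) / 4;  // 8,388,608 words

// Error percentage over all word reads in a run.
constexpr double errorPercent(uint64_t errors, uint32_t tests, uint32_t rereads) {
  return 100.0 * static_cast<double>(errors) /
         static_cast<double>(static_cast<uint64_t>(tests) * rereads * kWords);
}
```

Under that assumption, 378902 errors over 57 tests with 5 rereads works out to about 0.0158%, matching the 254 MHz line.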

Rezo · Feb 21, 2024

@defragster
Test with no cap

Code:

Test results 57 tests with 5 ReReads:
     At 166 MHz in 142 seconds with 0 read errors
     At 196 MHz in 132 seconds with 0 read errors
     At 206 MHz in 130 seconds with 0 read errors
     At 216 MHz in 128 seconds with 2284 read errors (0.0001%)
     At 227 MHz in 125 seconds with 18768902 read errors (0.7851%)
     At 240 MHz in 123 seconds with 442376913 read errors (18.5037%)
     At 254 MHz in 121 seconds with 1964872408 read errors (82.1863%)
     At 270 MHz in 119 seconds with 1971314664 read errors (82.4558%)

    SDRAM One Scan CAP test Complete {v1.1}

defragster · Feb 21, 2024

Rezo said:
Test with no cap

Nice, as expected no cap seems to work up to 206 MHz - based on prior runs by one or more others, and here.

Odd though it should have a line for 133 MHz if the downloaded sketch LINE#1 wasn't changed from: #define FIRST_SPEED 0

And looking at the top results line - I should add F_CPU_ACTUAL for ref in case user runs it at other than 600 MHz. As some testing may be done at altered F_CPU on purpose, and sometimes the IDE just is left at the wrong speed from prior use.
Also adding ":Note tested CAP here pF=" to last line for easy ref to the CAP used in the test.
And a note that the test takes about 15 minutes as written to complete.

Github updated to V1.2 : https://github.com/mjs513/SDRAM_t4/tree/main/examples/OneScanCap

Code:

Test summary: 57 tests with 5 ReReads at F_CPU_ACTUAL 600 Mhz:
     At 133 MHz in 160 seconds with 0 read errors
     At 166 MHz in 145 seconds with 0 read errors
     At 196 MHz in 136 seconds with 0 read errors
     At 206 MHz in 133 seconds with 0 read errors
     At 216 MHz in 132 seconds with 0 read errors
     At 227 MHz in 128 seconds with 0 read errors
     At 240 MHz in 127 seconds with 0 read errors
     At 254 MHz in 124 seconds with 1010230 read errors (0.0423%)
     At 270 MHz in 122 seconds with 1269880817 read errors (53.1163%)

    SDRAM One Scan CAP test Complete {v1.2} :Note tested CAP here pF= 13.6 w/2*6.8

Full output during run:

SUCCESS sdram.init() Default config runs in about 15 minutes.

Progress:: '#'=fixed, '.'=PsuedoRand patterns: when no Errors other wise first pass with error a-z or A-Z
If built with DUAL Serial second SerMon will show details.

Compile Time:: C:\Users\TimLabs\Documents\GitHub\EVKB_1060\examples\OneScanCap\OneScanCap.ino Feb 21 2024 14:40:57
SDRAM Memory Test, 32 Mbyte F_CPU_ACTUAL 600 Mhz begin@ 80000000 end@ 82000000

Start 57 tests with 5 reads 132.92 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 160.12 seconds at 133 MHz

Start 57 tests with 5 reads 166.15 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 145.09 seconds at 166 MHz

Start 57 tests with 5 reads 196.36 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 136.64 seconds at 196 MHz

Start 57 tests with 5 reads 205.71 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 133.74 seconds at 206 MHz

Start 57 tests with 5 reads 216.00 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 132.02 seconds at 216 MHz

Start 57 tests with 5 reads 227.37 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 128.45 seconds at 227 MHz

Start 57 tests with 5 reads 240.00 MHz ... wait::#############............................................
Test result: 0 read errors (0.0000%)
Extra info: ran for 127.59 seconds at 240 MHz

Start 57 tests with 5 reads 254.12 MHz ... wait::#a###a####a##AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Test result: 1010230 read errors (0.0423%)
Extra info: ran for 124.31 seconds at 254 MHz

Start 57 tests with 5 reads 270.00 MHz ... wait::aaa##a#a#aaa#AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Test result: 1269880817 read errors (53.1163%)
Extra info: ran for 122.39 seconds at 270 MHz

Test summary: 57 tests with 5 ReReads at F_CPU_ACTUAL 600 Mhz:
At 133 MHz in 160 seconds with 0 read errors
At 166 MHz in 145 seconds with 0 read errors
At 196 MHz in 136 seconds with 0 read errors
At 206 MHz in 133 seconds with 0 read errors
At 216 MHz in 132 seconds with 0 read errors
At 227 MHz in 128 seconds with 0 read errors
At 240 MHz in 127 seconds with 0 read errors
At 254 MHz in 124 seconds with 1010230 read errors (0.0423%)
At 270 MHz in 122 seconds with 1269880817 read errors (53.1163%)

SDRAM One Scan CAP test Complete {v1.2} :Note tested CAP here pF=

defragster · Feb 21, 2024

Wondered what happened with an MCU OC build - went to 720 MHz - Not good ... Should the freq math calc's be adjusted?:

Code:

Start 57 tests with 5 reads 132.92 MHz ... wait::##########CrashReport:
  A problem occurred at (system time) 18:41:47
  Code was executing from address 0x2A2
  CFSR: 400
    (IMPRECISERR) Data bus error but address not related to instruction
  Temperature inside the chip was 50.55 °C
  Startup CPU clock speed is 720MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected

Updated on github as linked above to use this with CrashReport - version # not changed.
Edited this to stop repeated restarts. Wondered about the case of OC and high temp: rather than adding set_arm_clock() to drop speed, did this with "wfi", which sleeps the MCU but still allows USB Upload without Button. The delay(10) is to make sure CrashReport gets delivered:

Code:

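  if (CrashReport) {
    Serial.print(CrashReport);  // print the report saved from the previous fault
    delay(10);                  // give USB time to deliver the text
    while (1) asm ("wfi");      // sleep between interrupts instead of re-running the test
  }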

Not sure why the fault occurs when bumping the MCU from 600 to 720 MHz - DevBoard not likely to fail from overheating - but "wfi" should be a quick way to stop heating if so.

Above Crash on first 133 MHz pass - rebuilt to start at 166 MHz and it has the same early Crash.

Running at 528 MHz completes fine - and recording the time per test group some idea of perf is recorded:

Code:

Test summary: 57 tests with 5 ReReads at F_CPU_ACTUAL 528 Mhz:
     At 133 MHz in 171 seconds with 0 read errors
     At 166 MHz in 157 seconds with 0 read errors
     At 196 MHz in 147 seconds with 0 read errors
     At 206 MHz in 145 seconds with 0 read errors
     At 216 MHz in 143 seconds with 0 read errors
     At 227 MHz in 141 seconds with 0 read errors
     At 240 MHz in 138 seconds with 0 read errors
     At 254 MHz in 136 seconds with 1177852 read errors (0.0493%)
     At 270 MHz in 134 seconds with 1201202286 read errors (50.2437%)

    SDRAM One Scan CAP test Complete {v1.2} :Note tested CAP here pF=13.6

Note: 816 MHz OC also fails with diff CrashReport:

Code:

SDRAM Memory Test, 32 Mbyte   F_CPU_ACTUAL 816 Mhz begin@ 80000000  end@ 82000000

Start 57 tests with 5 reads 132.92 MHz ... wait::#####CrashReport:
  A problem occurred at (system time) 19:19:56
  Code was executing from address 0x25FC
  CFSR: 82
    (DACCVIOL) Data Access Violation
    (MMARVALID) Accessed Address: 0x2000288C (Stack problem)
      Check for stack overflows, array bounds, etc.
  Temperature inside the chip was 51.22 °C
  Startup CPU clock speed is 816MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected

defragster · Feb 22, 2024

More fun trivia? Seemed the "wfi" would wake to allow incoming USB Serial data so indeed this works:

Code:

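  if (CrashReport) {
    Serial.print(CrashReport);  // print the report saved from the previous fault
    Serial.print("Any Key to continue ...");
    delay(50);                  // give USB time to deliver the text
    while (1) {
      if ( Serial.available() ) break;  // incoming USB data wakes the MCU
      asm ("wfi");                      // otherwise sleep until the next interrupt
    }
    while ( Serial.available() ) {      // echo and discard the typed input
      Serial.print((char)Serial.read());
    }
  }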

> github updated as v1.3: https://github.com/mjs513/SDRAM_t4/tree/main/examples/OneScanCap
That stops annoying repeat Crashes, should sleep the MCU in case of high temp from OC or other, and allows continuing when useful, or No Button Upload.
At 816 MHz it repeats Crash at earliest 133 MHz test - or before those prints even show up after sdram.init() prints.

At 720 MHz with Continue, it seemed to progress through more speed increases a few times, getting farther without error or crash, but not to completion as observed.
Here is the abbreviated crude spew edited to show it is running faster than above at 600 MHz with some Crash and Continue:

Code:

Extra info: ran for 146.39 seconds at 133 MHz // {.vs. 160 sec}
...
Extra info: ran for 131.55 seconds at 166 MHz // {.vs. 145 sec}
...
Extra info: ran for 121.96 seconds at 196 MHz // {.vs. 136 sec}
...
Extra info: ran for 119.91 seconds at 206 MHz // {.vs. 133 sec}

So - it can work - but something underlying is wrong in the SDRAM access at higher MCU speeds.

Ran again at 720 after github update: 133, Crash: Continue> 133-254, Crash: Continue> 133-227 Crash.
These Crash temps 54.58 and 53.91 - test print of temps before they were removed to upload were idling at ~43° C

<edit> : Execution at F_CPU 396 works:

Code:

Test summary: 57 tests with 5 ReReads at F_CPU_ACTUAL 396 Mhz:
     At 133 MHz in 200 seconds with 0 read errors
     At 166 MHz in 185 seconds with 0 read errors
     At 196 MHz in 177 seconds with 0 read errors
     At 206 MHz in 174 seconds with 0 read errors
     At 216 MHz in 172 seconds with 0 read errors
     At 227 MHz in 170 seconds with 0 read errors
     At 240 MHz in 167 seconds with 0 read errors
     At 254 MHz in 165 seconds with 458026 read errors (0.0192%)
     At 270 MHz in 163 seconds with 1080675058 read errors (45.2023%)

    SDRAM One Scan CAP test Complete {v1.3} :Note tested CAP here pF=13.6
