RAM optimization for large arrays

The other option is to look at some of the custom boards floating around here running 8 MB SDRAM
Thanks for the advice: any suggestions? It would be great if they supported the Arduino IDE.
I found the Milk-V Duo board, which seems promising.

@Paul : do not hesitate to delete this reply if you think it is not appropriate in this forum.
 
@Lesept ...

The other option is to look at some of the custom boards floating around here running 8 MB SDRAM
@Dogbone06 has put forth a design with 32 MB of SDRAM using the PJRC bootloader. There are a thread or two dedicated to that, as well as some others covering display and camera usage. It is based on the Teensy MicroMod since it uses a 16 MB Flash chip, and there is a Variant Thread underway that seeks to normalize usage with the PJRC TeensyDuino CORES as installed.

The original V_4 has evolved to V_4.5, and the latest V_5.0 PCB brings updates to layout and features. It started in Dec 2023 with startup and access code put forth by PJRC to interface the 166 MHz SDRAM, and it seems to run well at 200 MHz and just over. It seems one or two others have already taken those @Dogbone06 design files into alternate builds.
 
Of course, headers are included in compilation.
But if the total size of the code is larger than RAM1, how is the code copied into the FASTRUN zone?
Should I mark my large arrays with PROGMEM so they are not loaded into RAM1?
 
This thread appears to have a life of its own! Here is my 2 cents worth... Probably worth about that much:
Of course it's only possible if they are const data.
The build will fail at link time if the code size is larger than RAM1.
If your arrays are initialized as part of the build process and are larger than will fit into RAM1, so that you need to put them into PSRAM
(DMAMEM or EXTMEM), you will need to mark them as PROGMEM and therefore const. Otherwise, the startup code will first try to copy the
data into RAM1. If the data needs to be changeable, you will then have to copy it out of PROGMEM into your desired memory region.
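A minimal sketch of that pattern: a const table marked PROGMEM stays in flash, and a writable copy is staged in RAM2 via DMAMEM. The `#ifndef` guard is only there so the fragment also compiles off-target; on a Teensy 4.x the core defines PROGMEM and DMAMEM for you, and the array name and sizes here are just illustration.

```cpp
#include <string.h>
#include <stdint.h>

#ifndef PROGMEM            // host-side fallback so the sketch compiles anywhere;
#define PROGMEM            // the Teensy 4.x core supplies the real definitions
#define DMAMEM
#endif

// const + PROGMEM keeps the large table in flash, out of RAM1.
PROGMEM static const uint32_t table_flash[1024] = { 1, 2, 3 /* ... */ };

// Writable working copy, placed in RAM2 on a Teensy via DMAMEM.
DMAMEM static uint32_t table_work[1024];

void setup_table(void) {
    // One-time copy from flash into the writable region.
    memcpy(table_work, table_flash, sizeof(table_flash));
}
```

On the Teensy 4.x (ARM) there is no need for `pgm_read_*` accessors; PROGMEM data can be read directly, so a plain `memcpy` works.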

Note: I'm not sure which of these memory regions is faster: Flash, PSRAM, ... All of them other than RAM1 go through the cache.

Putting stuff on the stack works, unless you do too much of it; there is no warning when you run out. See the memory section of the product page:
https://www.pjrc.com/store/teensy41.html#memory
[memory layout diagram from the Teensy 4.1 product page]

Note: the "local variables" region in the diagram is the stack, which grows from the top down. If you use too much, it will start to
overwrite other variables, which can lead to very unpredictable behavior. I have had a few sketches where I ran into this; in some of
them we put in code that initialized the whole stack to a known pattern and, at times, walked the stack trying to
figure out how much of it had been used up.

Dynamic memory - As mentioned, malloc uses RAM2 and extmem_malloc uses the PSRAM.
It has been a while since I tried it, but I believe you can also create one or more heaps in RAM1 as well,
using the SMALLOC code that is in the core (the same code extmem_malloc is built on).
For example, you could create a global array of some size that is placed in DTCM, and then create a heap using that memory...

Also, I don't remember whether it worked to create a heap in the wasted space of ITCM.
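To show the idea of carving a private heap out of a static RAM1 array, here is a deliberately tiny bump allocator standing in for the core's SMALLOC pool machinery (which is what you would actually use on a Teensy); the pool size and function names are made up for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// A static array like this lands in DTCM (RAM1) on a Teensy 4.x by default.
static uint8_t dtcm_pool[16 * 1024];
static size_t  dtcm_used = 0;

// Hypothetical allocator: hand out 8-byte-aligned slices of the pool.
void *dtcm_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;                      // round up to 8-byte alignment
    if (dtcm_used + n > sizeof(dtcm_pool)) return NULL;  // pool exhausted
    void *p = &dtcm_pool[dtcm_used];
    dtcm_used += n;
    return p;
}
```

A real implementation would use the core's smalloc pool functions so allocations can also be freed; this sketch only shows where the backing memory comes from.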

That is, RAM1 is actually allocated in 32 KB chunks.
If your code in ITCM is, for example, 32K + 1 byte, it will use up two chunks (64K), and the remaining DTCM is then 512K - 64K in size.
That is shown in a build like:
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:96416, data:32248, headers:8548   free for files:7989252
   RAM1: variables:53568, code:86952, padding:11352   free for local variables:372416
   RAM2: variables:12416  free for malloc/new:511872

So if you can shrink the code enough to fit into one less chunk, you will have 32 KB more space for data.
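The chunk arithmetic above is easy to check. This little helper (names are mine, not from the core) computes the ITCM padding the linker reports for a given code size; feeding it the code size from the build output above reproduces the padding figure shown there.

```cpp
#include <stdint.h>

// RAM1 is handed to ITCM in 32 KB chunks; whatever the last chunk
// does not use shows up as "padding" in the build report.
enum { ITCM_CHUNK = 32 * 1024 };

uint32_t itcm_padding(uint32_t code_bytes) {
    uint32_t chunks = (code_bytes + ITCM_CHUNK - 1) / ITCM_CHUNK;  // round up
    return chunks * ITCM_CHUNK - code_bytes;
}
```

For the build above: `itcm_padding(86952)` gives 11352, matching the `padding:11352` line, and one byte over a chunk boundary (32769 bytes) wastes nearly a whole 32 KB chunk.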

At one point I think a few of us (@defragster?) tried creating a heap within the wasted space?

Try it - if it were me, I would try out some of the different approaches and see which one works.

Good luck
 
ITCM now gets flagged as read-only in the MPU so the padding isn't writable.
It was always claimed 'Read Only', but writing still worked when last checked with this code (last edit Feb 2023).
Code:
...
// LAST PRINT
  printf( "End of Free ITCM = %u [%X] \n", ptrFreeITCM + sizeofFreeITCM, ptrFreeITCM + sizeofFreeITCM);
// This now causes CrashReport
  for ( uint32_t ii = 0; ii < sizeofFreeITCM; ii++) ptrFreeITCM[ii] = 1;
It is indeed now READ ONLY: running the same code generates a CrashReport on the first write, with T_4.1 online building:
Code:
Memory Usage on Teensy 4.1:
  FLASH: code:39084, data:9160, headers:9096   free for files:8069124
   RAM1: variables:10016, code:34080, padding:31456   free for local variables:448736
   RAM2: variables:12416  free for malloc/new:511872
And running to CrashReport and repeat ...
Code:
++++++++++++++++++++++
Size of Free ITCM in Bytes = 31452
Start of Free ITCM = 34084 [8524]
End of Free ITCM = 65536 [10000]

T:\T_Drive\tCode\Memory\T4MemInfo\T4MemInfo.ino Sep 10 2024 09:18:32
CrashReport:
  A problem occurred at (system time) 9:20:0
  Code was executing from address 0x600016E2
  CFSR: 82
    (DACCVIOL) Data Access Violation
    (MMARVALID) Accessed Address: 0x8524
  Temperature inside the chip was 42.01 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected
_stext        00000000
_etext        00008520 +34080b
_sdata        20000000
_edata        200022c0 +8896b
_sbss         200022c0
_ebss         20002720 +1120b
curr stack    2006ffa0 +448640b
_estack       20070000 +96b
_heap_start   20203080
__brkval      20203080 +0b
_heap_end     20280000 +511872b
_extram_start 70000000
_extram_end   70000000 +0b

<ITCM>  00000000 .. 0000ffff
<DTCM>  20000000 .. 2006ffff
<RAM>   20200000 .. 2027ffff
<FLASH> 60000000 .. 607fffff
<PSRAM> 70000000 .. 707fffff

avail STACK   448640 b   438 kb
avail HEAP    511872 b   499 kb
avail PSRAM  8388608 b  8192 kb


++++++++++++++++++++++
Size of Free ITCM in Bytes = 31452
Start of Free ITCM = 34084 [8524]
End of Free ITCM = 65536 [10000]
 
A one-line CORES edit can make the unused RAM1/ITCM 'padding' available as READWRITE.
In the example above that area is over 30 KB, but it can be less than 1 KB depending on the build.

This risks self-modifying (virus) code, as well as the dangers of miscalculation or access-pointer misuse.
Test code above seems to be posted 2/23/2023 as: https://forum.pjrc.com/index.php?threads/memory-usage-teensy-4-1.72235/post-321283

There may be a more surgical edit, but replacing the commented line below allowed writes to the ITCM padding to run to completion:
Code:
FLASHMEM void configure_cache(void)
{
...   
    SCB_MPU_RBAR = 0x00000000 | REGION(i++); // ITCM
    SCB_MPU_RASR = MEM_NOCACHE | READWRITE | SIZE_512K;
    //SCB_MPU_RASR = MEM_NOCACHE | READONLY | SIZE_512K;
 
@jmarsh - was it a post you made showing at least one build/source edit that dropped ITCM code by some measurable amount? Having to do with C++ reservation or other [fault or output stubs?]?

It would be better to reduce the padding at build time, directly freeing RAM1/DTCM space, rather than hacking CORES and shoehorning data into whatever variable-sized space remains: move little-used code to FLASH, or apply other optimizations that reduce ITCM 32 KB block usage and padding 'waste'.

And using nanolib compromised only float output but gave a good reduction in size? Other ideas?
 
The CPU has a 32KB data cache (which is as fast as RAM1), so there's not much point manually copying any dataset smaller than that from PSRAM.

Any idea how this cache works? I'm assuming it would only cache specific memory that it has accessed.

My completely untested theory (and probably wrong) was that if you were accessing even only 50% of the items in a PSRAM array out of order, it would be faster to read the whole array onto the stack sequentially (if it fits), do your multiple operations on it, then send it back to PSRAM.

I'm not sure if you can break your larger arrays into smaller chunks to do the work. Regardless, I think your only real options (short of going SDRAM) are constant data in FLASHMEM and large arrays in PSRAM.

I think there are specific burst sizes as well as caching etc. to consider. @defragster will know a lot more about that than me.

In any case, profiling is key. It's the only way you're going to find out what's faster. Profile using ARM_DWT_CYCCNT. If values aren't being cached effectively, you can try optimizations such as reading individual values into local variables, operating on them, then writing them back to PSRAM.
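A sketch of that ARM_DWT_CYCCNT profiling pattern: on a Teensy 4.x the core's headers define ARM_DWT_CYCCNT as the free-running cycle counter register; the fallback macro and the helper name here are my own, added only so the fragment compiles off-target.

```cpp
#include <stdint.h>

#ifndef ARM_DWT_CYCCNT                 // host-side stand-in for the Teensy
static uint32_t fake_cycles = 0;       // core's DWT cycle counter macro
#define ARM_DWT_CYCCNT (fake_cycles += 100)
#endif

// Measure how many CPU cycles fn() takes. Unsigned subtraction gives the
// correct result even if the 32-bit counter wraps during the measurement.
template <typename F>
uint32_t cycles_of(F fn) {
    uint32_t t0 = ARM_DWT_CYCCNT;
    fn();
    return ARM_DWT_CYCCNT - t0;
}
```

Typical usage would be `cycles_of([&]{ process(psram_array); })` for each variant you want to compare, averaging several runs since caching makes the first pass slower.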
 
Any idea how this cache works? I'm assuming it would only cache specific memory that it has accessed.
It is (mostly) transparent: when you fetch a byte from PSRAM, the entire 32-byte cacheline containing that byte is fetched into the cache. Subsequent reads/writes use the cache until the cacheline is evicted, either manually or by being replaced with another (which will not happen for a contiguous block of 32 KB or less). So it gives the same benefit as manually copying a chunk into RAM1, modifying it, and copying it back, except that there is no manual copying involved and the write-back happens only when needed (only for data that has changed, and only if the cache space is needed for something else). In fact, both of those memcpy calls would already be operating through the data cache, so staging the data in RAM1 just doubles the work.
There are prefetch instructions that allow loading a cacheline ahead of time; these can help with "random" access because the CPU is pipelined.
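The 32-byte-line behavior described above is just address arithmetic, which these two small helpers (names are mine, for illustration) make concrete: which aligned line an address falls in, and how many lines a buffer touches, i.e. how many bytes actually move between PSRAM and the cache.

```cpp
#include <stdint.h>

// The Cortex-M7 data cache operates on 32-byte lines: touching any byte
// pulls the whole aligned 32-byte line into the cache.
enum { CACHE_LINE = 32 };

// Base address of the cacheline containing addr.
uintptr_t line_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

// Number of cachelines a buffer of `bytes` bytes starting at addr spans.
uint32_t lines_spanned(uintptr_t addr, uint32_t bytes) {
    uintptr_t first = line_base(addr);
    uintptr_t last  = line_base(addr + bytes - 1);
    return (uint32_t)((last - first) / CACHE_LINE + 1);
}
```

Note how a 2-byte read that straddles a line boundary costs two full 64-byte-worth fetches, which is why packing related data into the same line (and aligning buffers) matters for PSRAM throughput.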
 
@jmarsh's p#37 notes details on the cache - ideally the 32 KB cache affords efficient access with minimal over-reading, writing back only changed lines. Profiling of the use case might still show the cache being less effective given the PSRAM overhead; when possible, time-critical data should be moved to memory with less access overhead if space allows.

Some numbers on MB/s throughput would provide an expected performance baseline. It is quite high given the design at hand; whether it starves the processor for certain operations would be found by putting the obvious things in the obvious places based on needed space, then tweaking as possible.

The PJRC PSRAM test code runs a complete set of tests avoiding the cache, to exercise worst-case function. Those are single-task tests. Another prominent use of PSRAM has been DMA-buffered display writes, giving fast and effective updates while other code generates the next screen buffer.
 