RAM optimization for large arrays

Lesept

Hello
I'm very new to Teensy, but quite experienced in Arduino and ESP32.
I switched to Teensy 4.1 for programming and executing an AI application, which uses very large arrays of float (or chars). I have a Teensy 4.1 with 8MB PSRAM and 256 MB Flash.
I read the "Teensy 4.1 Memory Configuration and Use" page and I have some questions on how to optimize the memory usage for execution speed.

I see that the code is copied into RAM1 at runtime. But I may not know the size of the code so I may not know the amount of available memory in RAM1. Then, if I declare a large array, I'm not sure that it will fit in RAM1.

In short, I think it is hard to know the optimum placement of each array I use when I design the code. Then I can't choose to use DMAMEM or EXTMEM.

So here are my questions:
  • Should I check the size of each array and decide to put them in RAM1, RAM2 or PSRAM by myself, or should I let the Teensy decide the optimum memory usage?
  • Are there any function calls to check each memory type usage at runtime, such as external_psram_used or RAM1_used?
 
Just look at the memory usage info that prints every time you compile.

If using Arduino IDE with the console panel very small, you may need to increase its size or scroll up to see the memory usage info.
 
Thanks, I know that I can have the global view of the memory usage, but I'd like to have a finer view of it, meaning knowing for each array if it is stored in RAM1, RAM2 or PSRAM.
So, do I need to do the math by myself and optimize the storage place for each array (and declare them as DMAMEM or EXTMEM if necessary), or does it work nicely and automatically during execution (if there is no room left in RAM1 for a given array, it will be stored in RAM2, and once RAM2 is full, in PSRAM)?
 
What you're asking doesn't really make sense. The location of an array is determined either by its definition (if it is static) or the functions used to allocate it (if it is dynamic).
 
Total RAM (RAM1 + RAM2) is only 1MB, so if you have code + data (static or dynamic) that approach or exceed 1MB, you'll need to assign some data to PSRAM (EXTMEM) and/or keep some code in FLASH (PROGMEM). Don't worry about it too much in advance. You can cross that bridge when you come to it.
 
So, do I need to do the math by myself and optimize the storage place for each array (and declare them as DMAMEM or EXTMEM if necessary)

Yes, for static and global variables you need to use DMAMEM or EXTMEM in your code if you want them placed in those memory areas.

But I wouldn't call that "do the math". It's simply a matter of typing the keyword for the memory storage you want. The linker does all the actual math of assigning all your variables to specific memory locations. If you really want to check the actual locations it assigned, you can find a .sym file the compiler creates in a temporary folder (in Arduino IDE, use File > Prefs to turn on verbose output during compile, so you can see the compiler commands with full pathnames). Or at runtime you can cast a pointer to the variable into an integer and print the number.
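As a rough sketch of the two ideas above (choosing the memory with a keyword, then checking an address at runtime), something like the following could be used. The array names and sizes are hypothetical, and the address ranges in `regionOf` are assumptions based on the Teensy 4.1 memory map, so check them against your own .sym file. The empty macro fallbacks exist only so the sketch also compiles on a desktop.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(ARDUINO)
#include <Arduino.h>   // on a Teensy this provides DMAMEM and EXTMEM
#endif
// Empty fallbacks so the sketch also compiles on a desktop (for testing only).
#ifndef DMAMEM
#define DMAMEM
#endif
#ifndef EXTMEM
#define EXTMEM
#endif

// Hypothetical arrays: names and sizes are for illustration only.
float bias[64];                    // no keyword: RAM1
DMAMEM float activations[20000];   // RAM2
EXTMEM float weights[921600];      // PSRAM (921600 * 4 bytes = 3686400 bytes)

enum class MemRegion { Ram1, Ram2, Flash, Psram, Other };

// Classify an address. The ranges are assumptions, not taken from this thread.
MemRegion regionOf(std::uintptr_t addr) {
    if (addr >= 0x20000000u && addr < 0x20080000u) return MemRegion::Ram1;   // data part of RAM1
    if (addr >= 0x20200000u && addr < 0x20280000u) return MemRegion::Ram2;
    if (addr >= 0x60000000u && addr < 0x70000000u) return MemRegion::Flash;
    if (addr >= 0x70000000u && addr < 0x71000000u) return MemRegion::Psram;
    return MemRegion::Other;
}
```

On the board, calling `regionOf(reinterpret_cast<std::uintptr_t>(weights))` should then report PSRAM, and the same call on `bias` should report RAM1.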
 
Thanks a lot for all your answers.
I think I must provide more details on what I'm doing.

I use an AI framework which trains a deep neural network and exports C/C++ code. This code is made of several .c and .cpp files, plus a number of .h files which contain the declarations of large float arrays.
I want to execute this code on the Teensy 4.1. If I compile it on a Linux server, I can execute it readily.
But using an embedded system requires more thorough memory management.

To run the network, the code executes all the layers one after the other. They are chained, because the output of one layer is the input of the next one. These inputs and outputs are also float arrays, and their sizes change from one layer to another. To manage this I can either use dynamic allocation or allocate the largest required size once and for all, but the latter takes more memory.
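The "allocate the largest size once" option could be sketched as two fixed buffers that swap roles after each layer. This is only an illustration: the layer sizes are made up, and `runLayer` is a hypothetical placeholder standing in for the real exported layer code.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical sizes (in floats); the real ones come from the exported network.
constexpr std::size_t kInputSize = 250;
constexpr std::size_t kLayerOutputSizes[] = {250, 1024, 512, 64};
constexpr std::size_t kNumLayers = sizeof(kLayerOutputSizes) / sizeof(kLayerOutputSizes[0]);

// Largest input or output size over the whole network.
constexpr std::size_t maxBufferSize() {
    std::size_t m = kInputSize;
    for (std::size_t s : kLayerOutputSizes) m = (s > m) ? s : m;
    return m;
}

// Two buffers, each big enough for any layer's input or output.
static float bufferA[maxBufferSize()];
static float bufferB[maxBufferSize()];

// Placeholder layer (hypothetical): adds 1 to the input, repeating it if needed.
void runLayer(std::size_t layer, const float* in, std::size_t inSize,
              float* out, std::size_t outSize) {
    (void)layer;
    for (std::size_t i = 0; i < outSize; i++) out[i] = in[i % inSize] + 1.0f;
}

// Expects the network input in bufferA. Returns a pointer to the final output.
const float* runNetwork(std::size_t* finalSize) {
    float* in = bufferA;
    float* out = bufferB;
    std::size_t inSize = kInputSize;
    for (std::size_t layer = 0; layer < kNumLayers; layer++) {
        const std::size_t outSize = kLayerOutputSizes[layer];
        runLayer(layer, in, inSize, out, outSize);
        std::swap(in, out);   // this layer's output is the next layer's input
        inSize = outSize;
    }
    *finalSize = inSize;
    return in;
}
```

The cost is two buffers of the maximum size instead of exactly sized ones, but nothing is allocated or freed while the network runs.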

Then all layers use 2 additional arrays, a large one and a smaller one. The sizes also change from one layer to another. For example, in the test network I'm using currently, the largest array is 921600 floats (around 3MB), while the smallest one has only 64 floats.
So for each layer I have 4 arrays: input, weights (large), bias (smaller) and output. The weights and bias arrays are used only once, they live inside a block of { }. For the large 3MB array it's easy: it must be put in PSRAM. But for other layers, the sizes are smaller and they can fit either in RAM1 or RAM2.
But to type the keyword for the memory storage I want, I must first know whether they will fit or not.

The easiest choice would be to always use PSRAM, but the speed would decrease. So I need to decide, for some of these arrays, whether to put them in RAM1 or RAM2, RAM1 being the best choice for speed. This is what I called 'doing the math': analyze all the layers one by one and check whether the arrays fit in RAM1 or, if it is already full, in RAM2 or, failing that, in PSRAM, keeping in mind that the code already takes up some space in RAM1.
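To make this bookkeeping concrete, here is a minimal planning sketch that could be run on a desktop before writing the declarations. It assumes all listed arrays are alive at the same time and simply tries RAM1, then RAM2, then PSRAM in the order the arrays are given. `planPlacement` and `ArrayInfo` are hypothetical names, and the free-space figures passed in are whatever you read from the compile report.

```cpp
#include <cstddef>
#include <vector>

enum class Placement { Ram1, Ram2, Psram, DoesNotFit };

struct ArrayInfo {
    const char* name;     // label for your own bookkeeping
    std::size_t floats;   // number of float elements
};

// Greedy first-fit: fastest memory first, in the order the arrays are listed.
// The free-space arguments are in bytes and are supplied by the caller.
std::vector<Placement> planPlacement(const std::vector<ArrayInfo>& arrays,
                                     std::size_t ram1Free,
                                     std::size_t ram2Free,
                                     std::size_t psramFree) {
    std::vector<Placement> plan;
    plan.reserve(arrays.size());
    for (const ArrayInfo& a : arrays) {
        const std::size_t bytes = a.floats * sizeof(float);
        if (bytes <= ram1Free) {
            ram1Free -= bytes;
            plan.push_back(Placement::Ram1);
        } else if (bytes <= ram2Free) {
            ram2Free -= bytes;
            plan.push_back(Placement::Ram2);
        } else if (bytes <= psramFree) {
            psramFree -= bytes;
            plan.push_back(Placement::Psram);
        } else {
            plan.push_back(Placement::DoesNotFit);
        }
    }
    return plan;
}
```

Listing the most frequently accessed arrays first gives them the first chance at RAM1. The result is only a plan: each array still has to be declared with the matching keyword by hand.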

My question is: do I need to do this, or can I let the linker do it for me (hopefully optimized)? If so, how do I declare the arrays? Simply as
Code:
float input[250];
?

Paul, you said:
Or at runtime you can cast a pointer to the variable into an integer and print the number.
What are the addresses of the beginning of RAM1, RAM2 and PSRAM?
 
My question is: do I need to do this, or can I let the linker do it for me (hopefully optimized)? If so, how do I declare the arrays? Simply as
Code:
float input[250];
?
The linker will use RAM1 by default and complain (fail compilation) if it doesn't fit. So unless that actually happens you don't need to worry about manually specifying which memory to use.
 
The weights and bias arrays are used only once, they live inside a block of { }

If some of this stuff is compile time constants, you can use PROGMEM to put it into the flash memory (base address is 0x60000000). Whether that gives any benefit compared to PSRAM is an open question...
 
Thanks.
In my case, with all the .h files, does the code size indicated at compile time include these header files or not?

The linker will use RAM1 by default and complain (fail compilation) if it doesn't fit. So unless that actually happens you don't need to worry about manually specifying which memory to use.
Actually, I do for several reasons:
  • The size of input and output arrays change during execution, so if they are stored in RAM1 (which is the best option for speed) then the remaining RAM1 changes from layer to layer,
  • I cannot find the optimum place for all arrays just by trial and error.
Also, you can dynamically allocate to PSRAM, so you can decide at run-time where you want your arrays.
Thanks, but can I also dynamically allocate in RAM1 and RAM2?
 
If some of this stuff is compile time constants, you can use PROGMEM to put it into the flash memory (base address is 0x60000000). Whether that gives any benefit compared to PSRAM is an open question...
Yes, that's the slowest solution. I tried it on an ESP32 and obtained (for a given neural network, much smaller) an execution time of 7 seconds.
If all the arrays were located in RAM, I got 20 ms.
That's the reason why I need to optimize the location of each array.
 
Thanks, but can I also dynamically allocate in RAM1 and RAM2?
Yes. malloc() goes to RAM2. Stack is in RAM1.
 
Yes. malloc() goes to RAM2. Stack is in RAM1.
Thanks for your answer.

To sum up what I understood:
  • If I declare (for example) float x[20000]; it is placed in RAM1 unless there is not enough space in which case it is placed in RAM2.
  • If I malloc an array of 20000 * sizeof(float) it is placed in RAM2. If there is not enough space, it is placed in PSRAM
  • If I extmem_malloc my array, it is placed directly in PSRAM.
Is this correct?
 
No. Nothing automatically moves. It either fits where you put it, or it doesn’t. Please just write some test programs. You’ll see.
 
I think you're going to have trouble trying to dynamically allocate arrays of varying size and get them into RAM1, RAM2 or PSRAM (unless you know the size at compile time).

Have you increased your PSRAM speed yet? The code below should set it to 132 MHz (I think the default is 88 MHz).

Code:
// Reset the PSRAM (FlexSPI2) clock to 132 MHz
CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_OFF);
CCM_CBCMR = (CCM_CBCMR & ~(CCM_CBCMR_FLEXSPI2_PODF_MASK | CCM_CBCMR_FLEXSPI2_CLK_SEL_MASK))
    | CCM_CBCMR_FLEXSPI2_PODF(4) | CCM_CBCMR_FLEXSPI2_CLK_SEL(2); // PODF(4) divides the selected clock by 5, for roughly 132 MHz
CCM_CCGR7 |= CCM_CCGR7_FLEXSPI2(CCM_CCGR_ON);

I suggest running the psram test to check https://github.com/PaulStoffregen/teensy41_psram_memtest/blob/master/teensy41_psram_memtest.ino

Sequential array access in PSRAM can be pretty quick. If my testing was correct (I wrote my own PSRAM test), setting 2 million floats in an array on an 8 MB PSRAM chip at 132 MHz took 41.5 milliseconds, and reading the 2 million floats back took 23.5 milliseconds (all done sequentially, writing a fixed value).

I also suggest that if you're doing multiple operations on array elements in PSRAM, you prefetch them into local variables if possible.
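A minimal illustration of that prefetch idea: each element is read once into a local variable, all the arithmetic is done on the local, and the result is written back once. `scaleAndClamp` is a hypothetical function name, and the compiler may already generate similar code on its own, so it is worth measuring before and after.

```cpp
#include <cstddef>

// Scales, offsets and clamps an array in place, touching each element's
// memory only twice (one read, one write) however many steps are applied.
void scaleAndClamp(float* data, std::size_t count, float scale, float offset) {
    for (std::size_t i = 0; i < count; i++) {
        float v = data[i];          // single read from the (possibly slow) array
        v = v * scale + offset;     // work on the local copy
        if (v < 0.0f) v = 0.0f;     // clamp negatives to zero
        data[i] = v;                // single write back
    }
}
```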

For smaller arrays stored on psram, if you keep ram1 pretty empty you could also check the size and copy them directly to the stack, operate on them, then copy them back into psram (as sequential access is a lot faster)
 
Thanks.
In my case, with all the .h files, does the code size indicated at compile time include these header files or not?
Of course, headers are included in compilation.

Actually, I do for several reasons:
  • The size of input and output arrays change during execution, so if they are stored in RAM1 (which is the best option for speed) then the remaining RAM1 changes from layer to layer,
  • I cannot find the optimum place for all arrays just by trial and error.
If you declare an array as float input[250]; you cannot change its size at runtime. If you're using dynamic arrays (allocated with malloc), those will go in RAM2. If you're worried about exhausting RAM2, use extmem_malloc() instead, which will use PSRAM first and then fall back to RAM2 if PSRAM runs out. If you still run out of memory (assuming you do check for memory allocation failures...), then the program has consumed all of PSRAM and RAM2 and is simply too large.
Thanks, but can I also dynamically allocate in RAM1 and RAM2?
You can't dynamically allocate space in RAM1. Any free space in RAM1 is used for the program's stack.
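A small sketch of dynamic allocation with the failure check mentioned above. `allocFloats` and `freeFloats` are hypothetical helper names. On a Teensy 4.1 build they call `extmem_malloc()` and `extmem_free()`, and on any other platform they fall back to plain `malloc()` and `free()` so the sketch can be tried on a desktop.

```cpp
#include <cstddef>
#include <cstdlib>

#if defined(ARDUINO_TEENSY41)
#include <Arduino.h>   // declares extmem_malloc() and extmem_free()
#endif

// Returns a buffer of `count` floats, or nullptr if the allocation failed.
float* allocFloats(std::size_t count) {
#if defined(ARDUINO_TEENSY41)
    void* p = extmem_malloc(count * sizeof(float));
#else
    void* p = std::malloc(count * sizeof(float));   // desktop stand-in
#endif
    return static_cast<float*>(p);
}

// Releases a buffer obtained from allocFloats().
void freeFloats(float* p) {
#if defined(ARDUINO_TEENSY41)
    extmem_free(p);
#else
    std::free(p);
#endif
}
```

The caller should always test the returned pointer against nullptr before using it, since a failed allocation is the only warning you get that the memory is exhausted.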
 
For smaller arrays stored on psram, if you keep ram1 pretty empty you could also check the size and copy them directly to the stack, operate on them, then copy them back into psram (as sequential access is a lot faster)
Thanks, this seems interesting, could you please explain more or give me a short example?
 
Well, I was thinking that sequential access is the fastest way to use PSRAM.

If you need to make a lot of modifications to an array in PSRAM while also accessing many of its elements, it might be quicker to copy it into an empty array inside your function. That array will be on the stack, so access should be fast, and it goes out of scope when the function returns.

But you would have to profile it for speed. It depends on how much you're doing with your arrays.

Something like this:

Code:
#include <string.h>  // memcpy

EXTMEM float myArray[500];  // Array in PSRAM

void processArray()
{
    // Local copy on the stack. It is not initialized (values are random),
    // so declaring it should take almost no time.
    float localArray[500];

    // Copy from PSRAM to local RAM
    memcpy(localArray, myArray, sizeof(myArray));

    // Do your processing on localArray
    for (int i = 0; i < 500; i++)
    {
        localArray[i] += 1.0f;
    }

    // Copy back to PSRAM
    memcpy(myArray, localArray, sizeof(myArray));
}
 
OK, I see, thanks a lot.
Actually, there are 2 kinds of arrays in my application:
  • input and output : they are used and changed throughout the execution of the application.
  • all the others (weights and biases) are just read only. But they can be so large that they only fit in PSRAM.
So the first 2 arrays (input and output) should be placed in RAM2, but I may not be able to copy the big arrays into RAM for faster access. But I'll try...

In your example code, does the first memcpy call copy the array into RAM1 or RAM2? I guess the stack is in RAM1, but what if there is not enough space in RAM1 to copy the array? Does it stop execution or just not copy the array?
 
The first memcpy will copy it to the stack, as it's a local array in a function, so it's in RAM1. Being on the stack, it won't need to be deleted.

But yes, RAM1 is limited to 512 KB, which the stack shares with code and other variables. I don't think there is a limit on the stack size, other than not exceeding what remains of RAM1.
 
The CPU has a 32KB data cache (which is as fast as RAM1), so there's not much point manually copying any dataset smaller than that from PSRAM.
 