Extremely high RAM1 usage for code on Teensy 4.1, out of space in RAM1

Nuclear_Man_D · Mar 17, 2024

Hello,

I am attempting to run an incredibly large application on the Teensy 4.1. Most of the code was originally written for Linux, so it was never forced to be small enough to run on any microcontroller, not even the largest. Normally, a program like this fails to fit because it uses many globals with high RAM usage, or because the application is too large to fit in flash, but here neither is true. Interestingly, this program is using more than 280k in the RAM1 bank for code alone:

My question is, what part of my code is using this RAM? My hypothesis is vtables, since I have tons of classes, some of which have a lot of abstract methods. The thing is, I know from looking at the T4.1's core source code that it already has a lot of classes in it, and to my knowledge those classes don't occupy this much RAM. It seems like if this data could be moved to flash, I'd have plenty of space for it, since I'm using hardly any of the flash capacity (relatively speaking). Also, is this sort of RAM1 usage for code normal?

I should probably note my OS, optimization settings, and other details of my configuration, if only for context:
- I am running Linux Mint (uname -a: "Linux nuclaer-machine 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux")
- Arduino 1.8.19 with TeensyDuino 1.58. Compiling through Arduino, no CMake or Makefile.
- Currently only testing with an SD card connected. I will connect other hardware later but there is no wiring here to be related to the RAM usage.
- Optimization setting is set at "Faster". "Fastest" makes the memory consumption far worse, "Fast" does not improve RAM, and "Smallest Code" prevents the Teensy4.1 from booting as I discuss later in this post. "Debug" does not compile.

The sketch is broken into three parts: The small part that I edit in the Arduino editor, the OS I wrote for Arduino a few years ago, and the new application code I added. I don't want to share that application code for various reasons, but the OS part utilizing a very large amount of the space is located at my gitlab server here: https://git.nuclaer-servers.com/Nuclaer/ntios-2020. If I run a nearly identical setup with the application code removed, the RAM code usage is still high, so I know that the OS portion is using a huge amount of this RAM1 code space:

Given that I do have a style of programming reflected in both the application and the OS code, I'd think it's likely that whatever is eating up this code block in the OS is the same mechanism for the application code. The compilation results also seem to indicate that the OS is using more of this block.

For comparison, this is the utilization of the ASCIITable example sketch:

To confirm, I have moved to PROGMEM the few large global arrays I had, namely three icons and two pixel fonts. I saw a drop in RAM1 usage, but it wasn't enough. Nonetheless, I don't think that would be considered code memory in RAM1, so I don't think it really has much to do with the question in this post.

I want to mention again that I couldn't get the "smallest code" optimization option to run on Teensy 4.1, even for the ASCIITable example sketch, so it is not my code at fault. The Teensy, after being flashed with such code, will not connect over USB and cannot be flashed automatically; the button to manually initiate flashing must be pressed. This is really beyond the scope of this post, and deserves its own thread, but I wanted to mention it. "Debug" optimization mode also did not compile, but I did not investigate it much.

Anyway, back to the original question: What in my code is using so much space in RAM1? Furthermore, how can I move it to flash, even if this reduces my execution speed?

More generally, what specific parts of our compiled code (vtables, for example) go to which RAM and flash blocks? This is vaguely described in many places, but I'm referring to the more specific and intense technical details.

kd5rxt-mark · Mar 17, 2024

As for the answer to your "is this sort of RAM1 usage for code normal?" question, the short answer is a definite "probably." By default, all of your code gets moved to RAM1 for faster execution. To alleviate some of this, you can selectively configure some of your code functions that are not speed sensitive to execute out of flash instead (e.g. there's probably no need for your setup() to execute quickly). This is done by preceding the function definition with the keyword FLASHMEM. Give that a try & see how much of a positive change (actually, reduction) that might make in your RAM1 usage.
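
For illustration, a minimal sketch of tagging one function with FLASHMEM while leaving another in RAM1. FLASHMEM itself comes from the Teensy core; the empty fallback define is only there so the snippet also builds off-target, and both function names are hypothetical.

```c
#include <stdint.h>

/* On Teensy 4.x the core headers define FLASHMEM. This empty fallback is an
   assumption added so the snippet also compiles with a host compiler. */
#ifndef FLASHMEM
#define FLASHMEM
#endif

/* Hypothetical one-time setup helper: not speed sensitive, so it is tagged
   FLASHMEM and executes from flash instead of being copied to RAM1. */
FLASHMEM uint32_t init_config_checksum(const uint8_t *cfg, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += cfg[i];
    }
    return sum;
}

/* Hypothetical hot-path helper: left untagged, so it gets the default
   placement (RAM1) described above. */
uint32_t mix_sample(uint32_t a, uint32_t b)
{
    return (a >> 1) + (b >> 1);
}
```

Comparing the memory usage summary printed at the end of the build before and after tagging a function should show whether the tag had the intended effect.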

Good luck & have fun !!

Mark J Culross
KD5RXT

defragster · Mar 17, 2024

Nuclear_Man_D said:
I saw a drop in RAM1 usage, but it wasn't enough. Nonetheless, I don't think that would be considered code memory in RAM1,

RAM1's 512KB holds, as noted, any and all code not marked FLASHMEM, plus all RAM allocated at compile time, whether initialized user variables or otherwise allocated. The remainder is used as stack.

Just like CODE can sit in flash with FLASHMEM - static data can sit in flash PROGMEM or with proper 'const' declarations. DMAMEM can also be used from slower RAM2 (but covered with 32KB data cache) to reserve space - but never initialized automatically. So less used RAM could be moved to DMAMEM and manually initialized/copied from PROGMEM stored data. Also RAM2 holds the heap used for dynamic memory malloc() type requests if that could help move items from RAM1 or if used by the 'application' at hand.
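
A rough sketch of the "keep the data in PROGMEM, copy it into DMAMEM by hand" idea from the paragraph above. PROGMEM and DMAMEM come from the Teensy core; the empty fallback defines, the icon data, and the function names are assumptions made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* The Teensy core defines these; the empty fallbacks are an assumption so the
   snippet also compiles with a host compiler. */
#ifndef PROGMEM
#define PROGMEM
#endif
#ifndef DMAMEM
#define DMAMEM
#endif

/* Hypothetical 8x8 icon bitmap, stored only in flash. */
static const uint8_t icon_flash[8] PROGMEM = {
    0x3C, 0x42, 0xA5, 0x81, 0xA5, 0x99, 0x42, 0x3C
};

/* Working copy in RAM2. DMAMEM variables are not initialized automatically,
   so no initializer is given here. */
DMAMEM static uint8_t icon_ram2[sizeof(icon_flash)];

/* Copy the flash data into RAM2; intended to be called once, e.g. from setup(). */
void load_icon(void)
{
    memcpy(icon_ram2, icon_flash, sizeof(icon_flash));
}

/* Read one row of the RAM2 copy; the row index wraps at 8. */
uint8_t icon_row(unsigned row)
{
    return icon_ram2[row & 7u];
}
```

If the data is never modified at run time, reading it straight from the PROGMEM copy avoids the second buffer entirely.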

And for ref the 'padding' value is the unused portion of a 32KB code block - it sits IDLE and unused. Removing enough code to push that over 32KB will result in 32KB more RAM1 available for variables or local variables.
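
The block arithmetic can be sketched as below. This is a simplified model under stated assumptions: RAM1 is taken as 512KB, code occupies whole 32KB blocks, and the "280k" figure from the first post is read as 280 * 1024 bytes. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define RAM1_BYTES  (512u * 1024u)  /* total RAM1, per the post above */
#define BLOCK_BYTES (32u * 1024u)   /* code is allocated in 32KB blocks */

/* Number of 32KB blocks needed to hold code_bytes of RAM1 code (rounded up). */
uint32_t itcm_blocks(uint32_t code_bytes)
{
    return (code_bytes + BLOCK_BYTES - 1u) / BLOCK_BYTES;
}

/* The "padding": the unused tail of the last code block. */
uint32_t itcm_padding(uint32_t code_bytes)
{
    return itcm_blocks(code_bytes) * BLOCK_BYTES - code_bytes;
}

/* RAM1 left over for variables and stack once the code blocks are reserved. */
uint32_t ram1_left_for_data(uint32_t code_bytes)
{
    assert(code_bytes <= RAM1_BYTES);  /* guard against unsigned underflow */
    return RAM1_BYTES - itcm_blocks(code_bytes) * BLOCK_BYTES;
}
```

With 280 * 1024 = 286720 bytes of code, this model gives 9 blocks, 8192 bytes of padding, and 229376 bytes left for data. Trimming the code to 262144 bytes (exactly 8 blocks) would free one whole block, leaving 262144 bytes for data.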

Nuclear_Man_D · Mar 18, 2024

kd5rxt-mark said:
By default, all of your code gets moved to RAM1 for faster execution. To alleviate some of this, you can selectively configure some of your code functions that are not speed sensitive to execute out of flash instead

This is very helpful - I had to move a lot of functions to flash but this made execution of the program possible. I hadn't known about the FLASHMEM macro before!

defragster said:
DMAMEM can also be used from slower RAM2 (but covered with 32KB data cache) to reserve space - but never initialized automatically. So less used RAM could be moved to DMAMEM and manually initialized/copied from PROGMEM stored data.

This is also helpful - I don't have many globals, I had already moved all global constants to PROGMEM, but I saved a few KiB moving my globals to DMAMEM. Interestingly, moving certain globals to DMAMEM breaks TeensyThreads, although I haven't found the pattern for this yet. Well, almost all of my memory allocation is handled, either directly or indirectly, by malloc/free due to the way my program is set up.

defragster said:
And for ref the 'padding' value is the unused portion of a 32KB code block - it sits IDLE and unused. Removing enough code to push that over 32KB will result in 32KB more RAM1 available for variables or local variables.

I was actually wondering how this mechanism worked! I had been guessing the padding was for byte/word alignment for ARM architecture. Good to know!

I decided to run Paul's CoreMark (https://github.com/PaulStoffregen/CoreMark) with/without use of FLASHMEM and DMAMEM to see what the real effects on speed look like. Here's what I got without any modifications to the sketch:

Here is with FLASHMEM on all functions:

As you can see, the difference is not noticeable at all, but it saves about 4k of RAM1. I ran this test again and got the exact same results.
Now, if I then switch the CoreMark settings to use malloc (change MEM_METHOD to MEM_MALLOC in core_portme.h, and implement portable_free and portable_malloc), this makes CoreMark use RAM2. The results are a little worse, but by very little:

All benchmarks were run with optimization as 'Faster' and clock speed at 600 MHz.
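
For reference, the two malloc hooks mentioned above might be filled in roughly like this. This is a sketch, not the exact code used for the runs: the ee_size_t typedef is a stand-in for the one the CoreMark port header provides, and the bodies simply forward to the C library allocator, which on Teensy 4.1 draws from the RAM2 heap as described earlier in the thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the typedef that the CoreMark port header normally provides
   (assumption, so the snippet is self-contained). */
typedef size_t ee_size_t;

/* Hand out benchmark memory from the C library heap. */
void *portable_malloc(ee_size_t size)
{
    return malloc(size);
}

/* Return benchmark memory to the heap. */
void portable_free(void *p)
{
    free(p);
}
```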

If these benchmarks are accurate, then in reality there is no reason not to use FLASHMEM and DMAMEM on everything, unless you need a 1-2% speed increase. I did not test with overclocking; this is all at 600 MHz. Honestly, I would have expected the speed to change by at least 30%, if not by a factor of a few, knowing how slow some of these memories can be compared to a tightly-coupled RAM. Seems I was wrong.

I think I'm going to consider modifying the linker settings to see if I can move all the compiled code to flash in light of this. I found that a huge chunk, perhaps up to 170k, of the RAM1 being used is not code in my sketch, but in the Teensy 4.1 core or in libraries (namely the RA8875 library from Adafruit). Temporarily removing use of certain libraries had a huge impact on my RAM1 usage. I'd rather not modify and hack up the libraries, and implementing much of it myself is not prudent, hence the idea of changing the linker script or whatever settings are necessary.

The only issues I'd see with the benchmark are that it either wouldn't accurately represent the executed code, or that it is using library functions that are in RAM1, giving it a speed boost. From what I can tell, the CoreMark implementation uses almost no library functions (if any), and in my case CoreMark seems to be doing similar operations to my own code. So, I think it will accurately represent the difference in execution speed if all code is in flash only, at least for my case.

PaulStoffregen · Mar 18, 2024

All of the important CoreMark code fits within the Cortex-M7's 32K instruction cache and all its data fits within the 32K data cache.

Nuclear_Man_D · Mar 19, 2024

PaulStoffregen said:
All of the important CoreMark code fits within the Cortex-M7's 32K instruction cache and all its data fits within the 32K data cache.

Well... That would certainly explain the "good" results! That was a silly thing for me to overlook. I think it would be interesting to see the accurate results, so I tried modifying the startup code to disable use of the cache. I wouldn't normally go to such effort just to run CoreMark, but I didn't find any information online about the tangible speed differences, and perhaps it would be useful to others besides myself. I don't understand this cache setup well though, and couldn't find documentation online (the link in the code is a 404 now), but I think my modification may have worked, as my CoreMark score is more than an order of magnitude lower. Here's the modified configure_cache function:

C:

FLASHMEM void configure_cache(void)
{
    // Top part of the function is not changed

    uint32_t i = 0;
    SCB_MPU_RBAR = 0x00000000 | REGION(i++); //https://developer.arm.com/docs/146793866/10/why-does-the-cortex-m7-initiate-axim-read-accesses-to-memory-addresses-that-do-not-fall-under-a-defined-mpu-region
    SCB_MPU_RASR = SCB_MPU_RASR_TEX(0) | NOACCESS | NOEXEC | SIZE_4G;
    
    SCB_MPU_RBAR = 0x00000000 | REGION(i++); // ITCM
    SCB_MPU_RASR = MEM_NOCACHE | READWRITE | SIZE_512K;

    // TODO: trap regions should be created last, because the hardware gives
    //  priority to the higher number ones.
    SCB_MPU_RBAR = 0x00000000 | REGION(i++); // trap NULL pointer deref
    SCB_MPU_RASR =  DEV_NOCACHE | NOACCESS | SIZE_32B;

    SCB_MPU_RBAR = 0x00200000 | REGION(i++); // Boot ROM (no longer cached)
    SCB_MPU_RASR = MEM_NOCACHE | READONLY | SIZE_128K;

    SCB_MPU_RBAR = 0x20000000 | REGION(i++); // DTCM
    SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_512K;
    
    SCB_MPU_RBAR = ((uint32_t)&_ebss) | REGION(i++); // trap stack overflow
    SCB_MPU_RASR = SCB_MPU_RASR_TEX(0) | NOACCESS | NOEXEC | SIZE_32B;

    SCB_MPU_RBAR = 0x20200000 | REGION(i++); // RAM (AXI bus) (no longer cached)
    SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_1M;

    SCB_MPU_RBAR = 0x40000000 | REGION(i++); // Peripherals
    SCB_MPU_RASR = DEV_NOCACHE | READWRITE | NOEXEC | SIZE_64M;

    SCB_MPU_RBAR = 0x60000000 | REGION(i++); // QSPI Flash (no longer cached, even though not used)
    SCB_MPU_RASR = MEM_NOCACHE | READONLY | SIZE_16M;

    SCB_MPU_RBAR = 0x70000000 | REGION(i++); // FlexSPI2 (no longer cached, even though not used)
    SCB_MPU_RASR = MEM_NOCACHE | READWRITE | NOEXEC | SIZE_16M;

    // The rest here is unmodified
}

It's nice to know the external RAM is cached! I did not know that. Well, I reran my tests with the code above to configure the cache, and here are my results:

Data Location	Code Location	CoreMark (-O3)	CoreMark (-O2)
RAM1 (stack)	RAM1 (no FLASHMEM)	2392.92	2406.74
RAM2 (malloc)	RAM1 (no FLASHMEM)	533.18	547.62
RAM1 (stack)	Flash (FLASHMEM)	126.12	210.42
RAM2 (malloc)	Flash (FLASHMEM)	124.59	209.86

This makes much more sense, thanks Paul. It's interesting that the speed when using just RAM1 is essentially the same as cached, that is essentially the tight coupling? I knew it was fast but not that it was as fast as cache. Also - the speed using -O3 is slower consistently, and I did double check that I didn't mix up the samples. I would have expected -O3 to be faster on at least one pair of tests, but this didn't happen.
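
To put the table in relative terms, here is a small sketch that divides the all-RAM1 score by each of the others. The scores are copied from the -O2 column above (cache disabled); the array and function names are made up for the example.

```c
/* CoreMark scores from the -O2 column of the table above, in row order:
   RAM1 data + RAM1 code, RAM2 data + RAM1 code,
   RAM1 data + FLASHMEM code, RAM2 data + FLASHMEM code. */
static const double coremark_o2[4] = {2406.74, 547.62, 210.42, 209.86};

/* Slowdown factor of a given table row relative to the all-RAM1 row.
   The row index wraps at 4. */
double slowdown_vs_ram1(unsigned row)
{
    return coremark_o2[0] / coremark_o2[row & 3u];
}
```

By this measure, with the cache off, moving only the data to RAM2 costs roughly a factor of 4.4, and running the code from flash costs roughly a factor of 11.4 regardless of where the data lives.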

Of course, in real life the cache will be used, so these numbers are quite pessimistic I would assume. What are your thoughts, is this useful? Was this already known?

It would be cool if we could easily make good use of the entirety of the flash, but currently we need to put the FLASHMEM keyword in front of dozens of functions. Maybe there is a better way to handle programs larger than a few hundred kilobytes?

PaulStoffregen · Mar 19, 2024

Nuclear_Man_D said:
It's interesting that the speed when using just RAM1 is essentially the same as cached, that is essentially the tight coupling? I knew it was fast but not that it was as fast as cache.

Yep, TCM is basically the same speed as cache. It can be slightly slower in some cases, like heavy use of DMA.

There's a reason why we default to use of RAM1.

Nuclear_Man_D said:
What are your thoughts, is this useful? Was this already known?

I don't recall anyone ever running (and sharing) this specific test. But yeah, the speed of various memory has been discussed many times. So has effectiveness of the cache, which usually turns out to be quite good. But there are some notable applications where the cache is quite unhelpful. Direct FIR filter is the one that I remember most vividly.

-O3 offering little or no benefit while consuming quite a bit of extra code size has also been discussed several times. There's a reason we default to -O2.
