Looking for Memory Management Tips and Tricks

macardoso

Hi All,

Related to my previous post but more general. I have a large project running on a Teensy 4.1 to make a real-time inverted pendulum controller. The Teensy is responsible for all tasks except torque amplification of the AC servo motor (managed by its own drive). I have a lot of libraries and code involved, and the Teensy is filling up in RAM1. I still have plenty of RAM2 and Flash. I have spent several dozen hours trying to get this working.

Libraries Used:
  • LVGL
  • Arduino
  • SPI
  • IntervalTimer
  • Arduino FreeRTOS
  • timers.h (FreeRTOS Timers)
  • Custom RS485 Serial Host library for the servo drive (Allen Bradley Ultra 3000)
  • ILI9341_T4 display driver
  • QuadEncoder
  • FT6336U touchscreen
Right now I'm getting excellent real-time performance out of the system, but integrating the display and its functionality is eating up the limited RAM1 I have.

In my latest build, I started to link data to a couple "LEDs" on the display and while it compiles, it overruns RAM1 with dynamic allocation during boot. This results in the double blink error code of the built-in LED. My compiler shows:

Code:
Memory Usage on Teensy 4.1:
  FLASH: code:384388, data:404968, headers:8336   free for files:7328772
   RAM1: variables:148992, code:328076, padding:32372   free for local variables:14848
   RAM2: variables:281376  free for malloc/new:242912

Flash has 7.3MB free and RAM2 has 242kB free, but RAM1 has only 14kB left after static allocation, and that overflows once dynamic allocation starts.

Things I've tried:
  • ILI9341_T4 frame buffers and diff buffers to RAM2 (170kB static)
  • LVGL draw buffer moved to RAM2 (25kB static)
  • "Chart recorder" history buffer moved to RAM2 (2.4kB static)
  • Chart head/count variables moved to RAM2 (small)
  • Encoder Snapshots moved to RAM2 (small)
  • DAC sample history moved to RAM2 (small)
  • Digital input state flags moved to RAM2 (small)
  • QuadEncoder objects moved to RAM2 (0.5kB static)
  • FT6336U touch controller object moved to RAM2 (0.2kB static)
  • U3K_SerialHost object moved to RAM2 (1.5kB static)
  • LVGL FPS counter globals moved to RAM2 (small)
  • Ultra3000 status arrays / lookup tables moved to RAM2 (0.2kB)
  • FreeRTOS tasks and stacks moved to RAM2
    • All task stacks moved to RAM2 using DMAMEM (14kB dynamic)
    • All task TCBs moved to RAM2 (1.2kB dynamic)
    • Startup task stack + TCB moved to RAM2
    • FreeRTOS heap moved to RAM2 (up to 40kB dynamic)
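
For readers following the list above, here is a minimal sketch of what "moved to RAM2" can look like for a static buffer. It assumes the Teensy 4.x core's DMAMEM attribute, whose variables are not initialized at startup, so the code clears them itself. The names chartHistory, initChartHistory and pushChartSample, and the 600-sample size (600 floats = 2400 bytes), are hypothetical and not taken from the project. The empty fallback macro exists only so the sketch compiles on a desktop compiler.

```cpp
#if defined(ARDUINO)
#include <Arduino.h>   // provides DMAMEM on Teensy 4.x
#endif
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fallback for non-Teensy builds only (assumption: host-side compile check).
#ifndef DMAMEM
#define DMAMEM
#endif

// Hypothetical chart history: 600 float samples = 2400 bytes, placed in RAM2.
constexpr std::size_t kChartSamples = 600;
DMAMEM static float chartHistory[kChartSamples];
static std::size_t chartHead = 0;   // next write position
static std::size_t chartCount = 0;  // number of valid samples

// DMAMEM variables are not zeroed or initialized at startup,
// so clear the buffer explicitly before first use.
void initChartHistory() {
    std::memset(chartHistory, 0, sizeof(chartHistory));
    chartHead = 0;
    chartCount = 0;
}

// Store one sample in the ring buffer, overwriting the oldest when full.
void pushChartSample(float value) {
    chartHistory[chartHead] = value;
    chartHead = (chartHead + 1) % kChartSamples;
    if (chartCount < kChartSamples) {
        ++chartCount;
    }
}
```

Calling initChartHistory() once from setup() before any task touches the buffer avoids reading whatever happened to be in RAM2 at power-up.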

Things I still plan to try:
  • Move LVGL Stack to RAM2

My Questions:
  1. I think much of the variable use in RAM1 is coming from libraries I reference. Is there any way to move these to RAM2?
  2. Program memory in RAM1 is 328kB. Any reasonable way to load/unload parts of this from Flash at runtime to make this chunk smaller or would that tank performance?
  3. Are there strategies / tools to evaluate memory in a more granular way?
  4. Am I just filled up and nothing more I can do?
I can attach project source code, but I'm not sure how appropriate it is to throw a big project on the thread or how useful it would be to answering the questions.

Thanks so much for your support!
 
Use FLASHMEM on functions that aren't performance critical. That moves them completely out of RAM1 so they run directly from flash. The flash memory is slow, but it's cached, so you only suffer that slowness for cache misses.
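
A minimal sketch of that advice, assuming the Teensy 4.x core macros FLASHMEM and FASTRUN. The function names, the lookup-table contents and the gains are hypothetical. On Teensy 4.x, ordinary functions are copied into RAM1 by default, so FASTRUN is written out here only to make the contrast visible. The empty fallback macros exist only so the sketch compiles off-target.

```cpp
#if defined(ARDUINO)
#include <Arduino.h>   // provides FLASHMEM and FASTRUN on Teensy 4.x
#endif
#include <cstdint>

// Fallbacks for non-Teensy builds only (assumption: host-side compile check).
#ifndef FLASHMEM
#define FLASHMEM
#endif
#ifndef FASTRUN
#define FASTRUN
#endif

// Hypothetical one-time setup code: runs once at boot, so executing it
// from cached flash costs little and it no longer occupies RAM1.
FLASHMEM uint32_t buildStatusLookup(uint8_t *table, uint32_t count) {
    uint32_t checksum = 0;
    for (uint32_t i = 0; i < count; ++i) {
        table[i] = static_cast<uint8_t>(i * 7u);
        checksum += table[i];
    }
    return checksum;
}

// Hypothetical gains for illustration only.
constexpr float kAngleGain = 12.0f;
constexpr float kRateGain = 0.8f;

// Hypothetical control-loop step: called at a high rate, so it stays in RAM1.
FASTRUN float controlStep(float angleError, float rateError) {
    return kAngleGain * angleError + kRateGain * rateError;
}
```

Good FLASHMEM candidates are typically setup routines, UI screen construction and error reporting. ISRs and anything in the control path are usually better left in RAM1.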
 
Since the sticky thread recommended project photos, they are attached here.
  • Teensy 4.1
  • TI DAC8563 eval. board (+/-10V analog out to servo drive as torque reference)
  • 4 inputs, 4 outputs, optoisolated 24VDC
  • MAX14789 transceiver for RS485 serial host interface to servo drive
  • Up to 4 quadrature encoders to Teensy hardware, RS422 differential receivers
  • 3.1" IPS capacitive touchscreen (ILI9341) running LVGL 9.2.2
  • USB serial debug to PC
  • (8) debug pins to MSO (task timing)
  • 24VDC input with buck converters for 5V and 3V3 rails
Will all go onto a credit card sized PCB once I prove it all out.

Attachments: IMG_6897.jpg, IMG_6896.jpg, IMG_6898.jpg, image001.png
 
Are you using the "small" compile option? That might also help.
I haven't yet, although I'll give it a try. I'm using high-speed, low-latency ISRs up to 64kHz, fast RTOS task context switching and preemption, SPI (2 channels @ 60MHz), UART, I2C, DMA, etc. From what I've read, the "Smallest Code" compile option sounds like a high risk of breaking many of these things. Compiling now with different options and I'll report back.
 
Tried a couple; I still need to evaluate task timing and DAC output with a scope. I'm not very comfortable with how inconsistent the results are between compile options.

Faster: 14.8kB available - crashes
Faster with LTO: 20.0kB - works?
Fast: 47.6kB - crashes in strange ways
Fast with LTO: 52.7kB - works somewhat - stack overflow TaskLVGL, probably can fix
Smallest Code: 117.1kB - crashes, instruction access violation in LVGL task
Smallest Code with LTO: 121.2kB available - works?
Fastest: -17.8kB - No upload, insufficient RAM1
Fastest with LTO: -45.6kB - No upload, insufficient RAM1

If Fast with LTO or Smallest Code with LTO works reliably, then that would be a good option I think.
 
Added FLASHMEM to a couple of non-critical functions.
  • Smallest Code with LTO: 153.9kB free
  • Faster (Default): 47.6kB
Think I'm safe again for a while...
 
I don't understand what you mean by "overruns RAM1 with dynamic allocation" - dynamic memory allocation uses RAM2.

You have to be very careful moving entire objects and stacks into RAM2, quite often if any code uses DMA or other hardware-based memory transfers it is written to assume the memory is uncached.
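
As a sketch of what that caution can mean in practice: a buffer in cached RAM2 that is shared with DMA is usually kept aligned to the Cortex-M7's 32-byte data cache line, and the cache is flushed or invalidated around each transfer. The example assumes the Teensy 4.x core functions arm_dcache_flush() and arm_dcache_delete(). The buffer and helper names are hypothetical. Off-target the helpers compile to no-ops, so only the alignment check does anything there.

```cpp
#if defined(ARDUINO)
#include <Arduino.h>   // provides DMAMEM and the arm_dcache_* helpers on Teensy 4.x
#endif
#include <cstddef>
#include <cstdint>

// Fallback for non-Teensy builds only (assumption: host-side compile check).
#ifndef DMAMEM
#define DMAMEM
#endif

constexpr std::size_t kCacheLine = 32;  // Cortex-M7 data cache line size in bytes

// Hypothetical receive buffer in RAM2, aligned and sized to whole cache lines
// so it never shares a line with unrelated variables.
DMAMEM static uint8_t rxBuffer[64] __attribute__((aligned(32)));

// True when a buffer starts on a cache-line boundary and covers whole lines.
bool isCacheLineSafe(const void *p, std::size_t len) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr % kCacheLine == 0) && (len % kCacheLine == 0);
}

// Call after the CPU fills a buffer and before DMA reads it:
// writes any dirty cache lines back to RAM2.
void prepareForDmaRead(void *buf, uint32_t len) {
#if defined(__IMXRT1062__)
    arm_dcache_flush(buf, len);
#else
    (void)buf;
    (void)len;
#endif
}

// Call after DMA has written a buffer and before the CPU reads it:
// discards stale cache lines so the CPU sees the new data.
void finishDmaWrite(void *buf, uint32_t len) {
#if defined(__IMXRT1062__)
    arm_dcache_delete(buf, len);
#else
    (void)buf;
    (void)len;
#endif
}
```

Library code written for RAM1 typically contains neither call, which is one way an object can break after being moved into RAM2 as a whole.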
 
That's a good question. I'm very much a novice here, so please bear with me. My understanding is that malloc() adds all new variables to RAM2, however several libraries (notably FreeRTOS) use newlib allocation that places dynamic allocation into RAM1. I would get _sbrk_r errors at runtime, during startup of the code. This would happen when I had less than 20kB available in RAM1. LVGL might also have issues with this, although I think LVGL 9 switched to malloc() internally.
 
malloc (which is the newlib allocator) calls _sbrk_r when it needs more heap space so they both allocate from RAM2. The C++ keyword "new" is also just a wrapper around malloc. So I don't see how it's possible for anything to do dynamic allocation from RAM1.
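
One way to check this on the hardware is to look at the address a pointer actually holds. The sketch below classifies an address by region. The ranges are assumptions based on the Teensy 4.1 memory map as I understand it from the core's linker script (ITCM at 0x00000000, DTCM at 0x20000000, RAM2 at 0x20200000, flash at 0x60000000, optional PSRAM at 0x70000000) and should be checked against your core version. The names MemRegion, classifyAddress and regionName are hypothetical.

```cpp
#include <cstdint>

enum class MemRegion { ITCM, DTCM, RAM2, Flash, PSRAM, Other };

// Classify an address using assumed Teensy 4.1 region boundaries.
MemRegion classifyAddress(std::uintptr_t a) {
    if (a < 0x00080000u) return MemRegion::ITCM;                       // RAM1, code side
    if (a >= 0x20000000u && a < 0x20080000u) return MemRegion::DTCM;   // RAM1, data side
    if (a >= 0x20200000u && a < 0x20280000u) return MemRegion::RAM2;   // DMAMEM and heap
    if (a >= 0x60000000u && a < 0x60800000u) return MemRegion::Flash;  // FLASHMEM, PROGMEM
    if (a >= 0x70000000u && a < 0x71000000u) return MemRegion::PSRAM;  // EXTMEM, if fitted
    return MemRegion::Other;
}

// Human-readable name for logging over USB serial.
const char *regionName(MemRegion r) {
    switch (r) {
        case MemRegion::ITCM:  return "ITCM (RAM1)";
        case MemRegion::DTCM:  return "DTCM (RAM1)";
        case MemRegion::RAM2:  return "RAM2";
        case MemRegion::Flash: return "Flash";
        case MemRegion::PSRAM: return "PSRAM";
        default:               return "Other";
    }
}
```

On the Teensy, printing regionName(classifyAddress((uintptr_t)ptr)) for a pointer returned by malloc(), for the address of a local variable inside a task, and for a global would show where each really lives. That would settle whether a given library's allocations land in RAM1 or RAM2.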
 
Use FLASHMEM on functions that aren't performance critical. That moves them completely out of RAM1 so they run directly from flash. The flash memory is slow, but it's cached, so you only suffer that slowness for cache misses.
Paul, I followed this and seemingly saved a bunch of RAM1. Then due to some errors with moving data structures into RAM2 and crashing things, I reverted back to an old save point and tried again (with much more granular checks between changes) and I can't seem to save any RAM1 with FLASHMEM calls. Looking at the Teensy memory (https://protosupplies.com/learn/prototyping-system-for-teensy-4-1-working-with-teensy-4-1-memory/) it looks like only FASTRUN code lives in RAM1. What am I missing?
 
@macardoso: RAM is allocated in chunks (16K / 32K ?), so until you move enough functions from RAM to FLASH to free up an entire chunk, the primary thing that you will see will be a gradual change in the "padding" number as you move more and more.

Hope that helps . . .

Mark J Culross
KD5RXT
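
The chunk behaviour can be checked against the numbers in the first post. Assuming RAM1 is 512 KB in total and the code side is reserved in 32 KB banks, the arithmetic below reproduces the reported padding (32372) and free space (14848) exactly. The function names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kRam1Total = 512u * 1024u;  // 524288 bytes of RAM1
constexpr uint32_t kBank = 32u * 1024u;        // assumed bank size: 32768 bytes

// RAM1 reserved for code: code size rounded up to whole banks.
constexpr uint32_t itcmReserved(uint32_t codeBytes) {
    return ((codeBytes + kBank - 1u) / kBank) * kBank;
}

// Unused tail of the last code bank, reported as "padding".
constexpr uint32_t paddingBytes(uint32_t codeBytes) {
    return itcmReserved(codeBytes) - codeBytes;
}

// What is left for the stack and local variables.
constexpr uint32_t freeForLocals(uint32_t codeBytes, uint32_t variableBytes) {
    return kRam1Total - itcmReserved(codeBytes) - variableBytes;
}

// Bytes of code that must leave RAM1 to release the last bank.
constexpr uint32_t bytesOverBankBoundary(uint32_t codeBytes) {
    return codeBytes % kBank;
}

// Figures from the reported build: code 328076, variables 148992.
static_assert(itcmReserved(328076u) == 360448u, "11 banks of 32 KB");
static_assert(paddingBytes(328076u) == 32372u, "matches reported padding");
static_assert(freeForLocals(328076u, 148992u) == 14848u, "matches reported free space");
static_assert(bytesOverBankBoundary(328076u) == 396u, "only 396 bytes into bank 11");
static_assert(freeForLocals(328076u - 396u, 148992u) == 47616u, "one bank released");
```

Under these assumptions the reported build is only 396 bytes into its eleventh bank, so moving a few hundred bytes of code to FLASHMEM would release a full 32 KB and leave 47616 bytes free. That is consistent with the 47.6kB figure reported for the default build after FLASHMEM was added.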
 
With FreeRTOS the stack may be elsewhere. But in that case there's really no reason to create more free RAM1 since nothing will use it.
 
Thank you all so much for sharing your knowledge. Stumbling my way through some of the intricacies of Teensy. Looking into how "padding" changes with these tweaks. I'm guessing once it hits 32k then I'll see a big jump in available RAM1 and padding will drop commensurately.

With FreeRTOS the stack may be elsewhere. But in that case there's really no reason to create more free RAM1 since nothing will use it.

Is there a downside to assigning the FreeRTOS stack to RAM2? I also looked into the option to use application allocated heap for FreeRTOS through the config file. Unsure how large this really gets in practice or advantages/disadvantages of manually setting this up.
 
Is there a downside to assigning the FreeRTOS stack to RAM2?

RAM1 is faster because it's connected by the "tightly coupled" buses. RAM2 connects by slower AXI bus.

But RAM2 is cached, so for most access cache hits give the same speed. RAM2 is only slower for cache misses. How much difference that makes, if any, really depends on how your program uses the memory.

However, RAM1 is the fastest memory, so using it for frequently accessed stuff gives the best performance.
 