Multiple issues using TeensyThreads on T3.5. Dynamic heap allocation problems.

VictorFS

Hello all!

I recently started using the TeensyThreads library in order to have two different tasks working concurrently. Everything was working fine until I started using dynamic memory allocation via malloc, new, the String class, and others. I know that dynamic allocation is discouraged on MCUs, but in my full code (20k+ lines) I am taking all precautions to avoid heap fragmentation and other issues.

The minimum working example that I could come up with showing the issues I am having is shown below. I am using Teensy 3.5 with PlatformIO Teensy 4.17.0 (but have also tested with Arduino IDE 1.8.19 and newest Teensyduino 1.57 and the issue is still reproducible for me).

Code:
#include <Arduino.h>
#include <TeensyThreads.h>

#define SIZE_ALLOC 500
uint32_t bytes_alloc = 0;

void thread_func()
{
  while(1)
  {
    threads.delay(500);
    char* pointer = (char*)malloc(SIZE_ALLOC);
    if(pointer)
    {
      bytes_alloc += SIZE_ALLOC;
      Serial.printf("Thread allocated %d bytes on the heap. Addr = %p . Total bytes allocated so far = %d\n", SIZE_ALLOC, pointer, bytes_alloc);
    }
    else
    {
      Serial.printf("Failed to allocate %d bytes on the heap!\n" , SIZE_ALLOC);
    }
  }
}

void setup()
{
  Serial.begin(9600);
  delay(5000);
  threads.addThread(thread_func, 0, 8192); // The thread allocates around 5 kB on the heap and then malloc always returns NULL
  //threads.addThread(thread_func); // The thread allocates just one time and then the function stops
  Serial.println(threads.threadsInfo()); // Information for thread 0 (main) is incorrect, information for thread 1 is missing
}

void loop() 
{
  // Do stuff
}

When I run this code, the output on my serial monitor is:
Code:
_____
0:Stack size:10240|Used:537067512|Remains:-537057272|State:RUNNING|

Thread allocated 500 bytes on the heap. Addr = 0x1fff3958 . Total bytes allocated so far = 500
Thread allocated 500 bytes on the heap. Addr = 0x1fff3b50 . Total bytes allocated so far = 1000
Thread allocated 500 bytes on the heap. Addr = 0x1fff3d48 . Total bytes allocated so far = 1500
Thread allocated 500 bytes on the heap. Addr = 0x1fff3f40 . Total bytes allocated so far = 2000
Thread allocated 500 bytes on the heap. Addr = 0x1fff4138 . Total bytes allocated so far = 2500
Thread allocated 500 bytes on the heap. Addr = 0x1fff4330 . Total bytes allocated so far = 3000
Thread allocated 500 bytes on the heap. Addr = 0x1fff4528 . Total bytes allocated so far = 3500
Thread allocated 500 bytes on the heap. Addr = 0x1fff4720 . Total bytes allocated so far = 4000
Thread allocated 500 bytes on the heap. Addr = 0x1fff4918 . Total bytes allocated so far = 4500
Thread allocated 500 bytes on the heap. Addr = 0x1fff4b10 . Total bytes allocated so far = 5000
Thread allocated 500 bytes on the heap. Addr = 0x1fff4d08 . Total bytes allocated so far = 5500
Failed to allocate 500 bytes on the heap!
Failed to allocate 500 bytes on the heap!
Failed to allocate 500 bytes on the heap!

So the problems I can see are:
  • threads.threadsInfo() displays incorrect used/remaining stack bytes for thread 0 (the main thread) and shows no information for thread 1 (the one we created).
  • malloc() inside the thread function returns NULL after around 5.5 kB has been allocated, despite hundreds of kB being left in the heap.
  • If I change the line "threads.addThread(thread_func, 0, 8192);" to "threads.addThread(thread_func);", thread_func seems to execute only once and then crashes. However, the state of the thread remains RUNNING.

Am I missing something big here? Can someone shed some light on this issue? I have tested with two different Teensy 3.5 boards and the results are the same. I have also tested with one Teensy 4.1, and there the behavior of malloc() changes: it works until the heap is full (as expected). However, I think this is merely because Teensy 4.1 uses a separate "RAM2" for dynamic memory allocation.

Thank you!
 
I was playing around with the code and was actually able to make the program crash with much simpler code:

Code:
#include <Arduino.h>
#include "TeensyThreads.h"

void thread_func()
{
  while(1)
  {
    threads.delay(500);
    Serial.printf("Thread print!\n");
  }
}

void setup()
{
  Serial.begin(9600);
  delay(3000);
  threads.addThread(thread_func); // The thread prints just one time and then crashes
}

void loop() 
{
  // Do something
}

The result I get is that the thread prints just one time and then crashes. If I change Serial.printf to Serial.println it works. The malloc problems I described above still happen, though. I am starting to think that all of those issues are related. Or I am just really stupid and missing something obvious...
 
I was playing around with the code and was able to actually make the program crash with a much simpler code

I don't have any personal experience with the T3.5 (I've used T4.x for all of my projects). What optimization level are you building at? Does changing the optimization level cause any change in the resulting behavior?

Mark J Culross
KD5RXT
 
I think I began using the library below to reduce the code size associated with printf(), but the documentation makes a point of noting that it is not only small, but also thread-safe and does not use malloc().

https://github.com/embeddedartistry/arduino-printf
 
Hi Mark, thanks for the answer. I don't have access to the code during the weekend, but I didn't change any compilation flags or settings; they are all default. I used PlatformIO and later installed a fresh copy of the Arduino IDE and the latest Teensyduino without changing any parameters, just selecting the correct upload target (Teensy 3.5). When I get back to the office I will take a look at the optimization levels.
 
Thanks Joe! I will definitely give this library a try. I think the whole code will benefit from it. Unfortunately, I will still need to use malloc() or new for things like Strings and object constructors, so the problem with malloc() still remains...

The thing is, I sense that something is really off with my threading, and that the problem observed with malloc() is not the root cause itself, just something that uses the heap and exposes some unknown underlying memory conflict. From my first post, even the thread information print comes out incorrect: the number of threads shown is wrong, the used and unused stack memory for thread zero is incorrect, etc.

I would be very glad if someone with a Teensy 3.5 could upload this small code to their device and check whether the output is the same as mine.
 
Just uploaded your code snippet to the Teensy 3.5 and yes, I got the same results as you did using "printf". However, if you change it to just "print" it works, i.e., it continuously prints "Thread print!".


The interesting thing is that in your original sketch in post #1 printf seems to be printing from the thread without a problem.


The only other issue is that inside the thread in post #1 you keep allocating new buffers without ever freeing them. If you add free(pointer) inside the if(pointer) test, it will run on the T3.5. I think you said that it will even fail on a T4.1 when it runs out of memory.
 
I also ran the code on 3.5 and got the error. If I move the malloc() code from the thread function to loop(), it runs okay there until memory is exhausted, so there is something about how threads and malloc() interact. I tried a few things like changing default stack size, etc. I also tried using new (std::nothrow) instead of malloc(), and that did at least seem to result in a different number of successful allocations with different stack sizes, but the relationship between stack size and successful allocations still didn't make sense.
 
Indeed, the malloc() code does do something on first call with regard to finding and using a HEAP to build on?

If I call the function below from setup() before starting the thread, whatever it allocates (and then frees) seems to become available to malloc() calls made from the thread, except for the last ~5K. This is shown below for allocSize values of 10,000 and 100,000; 15,000 behaves the same way.

Calling the code below from setup() with allocSize = 10000 only works up to:
Code:
...
Thread allocated 500 bytes on the heap. Addr = 0x1fff4d08 . Total bytes allocated so far = 5500
Failed to allocate 500 bytes on the heap!
But calling it with allocSize = 100000 gets much further:
Code:
...
Thread allocated 500 bytes on the heap. Addr = 0x2000ab78 . Total bytes allocated so far = 94500
Thread allocated 500 bytes on the heap. Addr = 0x2000ad70 . Total bytes allocated so far = 95000
Failed to allocate 500 bytes on the heap!

Manual Alloc and Free before starting threads seems to reserve space?:
Code:
void alloc_func()
{
  char* pointer = NULL;
  uint32_t allocSize = 10000;
  pointer = (char*)malloc(allocSize);
  if (pointer)
  {
    Serial.printf("SETUP allocated %d bytes on the heap. Addr = %p . Total bytes allocated so far = %d\n", allocSize, pointer, bytes_alloc);
    free( pointer );
  }
  else
  {
    Serial.printf("Failed to allocate %d bytes on the heap!\n" , allocSize);
  }
}
 
Indeed, the malloc() code does do something on first call with regard to finding and using a HEAP to build on?

Manual Alloc and Free before starting threads seems to reserve space?:

malloc() works as expected in the main thread (loop), but not in the thread created via addThread(). Something to do with that thread's stack having been allocated from the heap?
 
malloc() works as expected in the main thread (loop), but not in the thread created via addThread(). Something to do with that thread's stack having been allocated from the heap?

I didn't post the full code, but the alloc_func() from post #9 was just called from setup() in the existing code. After that, malloc() no longer behaved the same way when called from the thread.

The malloc code has some dependencies that may not like being called from the threads' fake stacks?
And the first caller sets up the base of the malloc RAM ...
Code:
#define STACK_POINTER() ((char *)AVR_STACK_POINTER_REG)
...
char *__brkval = NULL;	// first location not yet allocated
...
void *
malloc(size_t len)
{
...
	/*
	 * Step 3: If the request could not be satisfied from a
	 * freelist entry, just prepare a new chunk.  This means we
	 * need to obtain more memory first.  The largest address just
	 * not allocated so far is remembered in the brkval variable.
	 * Under Unix, the "break value" was the end of the data
	 * segment as dynamically requested from the operating system.
	 * Since we don't have an operating system, just make sure
	 * that we don't collide with the stack.
	 */
	if (__brkval == 0)
		__brkval = __malloc_heap_start;
	cp = __malloc_heap_end;
	if (cp == 0)
		cp = STACK_POINTER() - __malloc_margin;
	if (cp <= __brkval)
	  /*
	   * Memory exhausted.
	   */
	  return 0;
	avail = cp - __brkval;

For thread usage it would probably be better to have local memory management? Allocate a static block, perhaps, and manually 'deal with it'? Except that wouldn't cover
... still need to use malloc() or new for things like Strings and object constructors
> but maybe those things need to be avoided, or done otherwise/statically, for threading to be 'safe'?
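
As a rough illustration of the "static block, managed manually" idea, here is a minimal sketch of a fixed-size block pool. The StaticPool name and its interface are hypothetical and not part of TeensyThreads or the Teensy core. It only hands out blocks of one size from a static array, so it never calls malloc() and never looks at the stack pointer. It is not thread-safe by itself; the assumption is that each thread owns its own pool, or that access is guarded by a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical fixed-block pool backed by a static array. It never calls
// malloc() and never inspects the stack pointer.
template <std::size_t BlockSize, std::size_t BlockCount>
class StaticPool {
public:
  StaticPool() {
    // Thread every block onto the free list.
    for (std::size_t i = 0; i < BlockCount; ++i) {
      release(storage_ + i * kStride);
    }
  }

  // Returns one block of at least BlockSize bytes, or nullptr if none is left.
  void* allocate() {
    if (free_list_ == nullptr) return nullptr;
    Node* n = free_list_;
    free_list_ = n->next;
    --available_;
    return n;
  }

  // Gives a block previously obtained from allocate() back to the pool.
  void release(void* block) {
    assert(block != nullptr);
    free_list_ = new (block) Node{free_list_};
    ++available_;
  }

  // Number of blocks currently free.
  std::size_t available() const { return available_; }

private:
  struct Node { Node* next; };

  static constexpr std::size_t kAlign = alignof(std::max_align_t);
  static constexpr std::size_t kMin =
      BlockSize < sizeof(Node) ? sizeof(Node) : BlockSize;
  // Round each block up so every block start is suitably aligned.
  static constexpr std::size_t kStride = (kMin + kAlign - 1) / kAlign * kAlign;

  alignas(std::max_align_t) unsigned char storage_[kStride * BlockCount];
  Node* free_list_ = nullptr;
  std::size_t available_ = 0;
};
```

A thread could then declare something like `static StaticPool<500, 8> pool;` at file scope and call `pool.allocate()` / `pool.release(p)` where it would otherwise call malloc() / free(). As noted above, this does not help with String or with constructors that allocate internally.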
 
Some of the answers may be found in the original thread started by @ftrias: https://forum.pjrc.com/threads/41504-Teensy-3-x-multithreading-library-first-release. Going through it I did find where he gave an explanation of how the stack works for threads:

https://forum.pjrc.com/threads/4150...-first-release?p=243592&viewfull=1#post243592


@tni mentioned in that thread:
There is a ton of pitfalls - not even 'new' / 'malloc' are thread safe (as in you will get memory corruption).
 
Yes, many pitfalls. It seems like dynamic allocation will work only if it is limited to the main thread, where the stack pointer is the system stack pointer. Probably a good idea to also limit Strings and printing to that thread. The OP didn't say what the application was or how the multi-threading is used.
 
Mjs513, that's correct. The memory leak (calling malloc() without ever calling free()) would crash the sketch eventually under any circumstances. My idea was just to show that malloc() fails after I have allocated around 5 kB of RAM, despite there being much more free RAM available. In my full program (20k+ lines), I don't use heap allocation inside a loop, and I have monitored the RAM usage of the program using Teensy 3.x RAM Monitor, so I am certain that the RAM usage is not growing over time. Does your threads.threadsInfo() function also return incorrect info? Thanks!
 
Exactly! I have performed some of those tests myself and also couldn't find any clear relationship between the stack size of the threads and how many bytes could be allocated using malloc() inside the thread before returning NULL pointers, but it does make a small difference sometimes. One thing that is bugging me is the print provided by the threads.threadsInfo() function. Is your resulting print also something like:
Code:
0:Stack size:10240|Used:537067512|Remains:-537057272|State:RUNNING|
? Not only does it not show the thread with threadID = 2, but the "used" and "remains" byte counts are also very odd. This function outputs the same incorrect information for me on both Teensy 3.5 and Teensy 4.1, which in my opinion could be a clue that something is wrong beyond just a malloc() problem.
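
A quick check of the numbers in that line, as a small sketch (the constant and function names are made up for illustration). It shows two things that can be verified from the output alone: "Remains" is exactly "Stack size" minus "Used", and "Used" is 0x2002FFF8 when written in hex. That value looks more like a RAM address than a byte count, since the malloc() results printed in this thread run from 0x1fff3958 up to 0x2000ad70. One possible reading, which is only a guess, is that for thread 0 the library subtracts from a stack base that was never filled in.

```cpp
#include <cstdint>

// Figures copied from the threadsInfo() line quoted above.
constexpr std::int64_t kStackSize = 10240;
constexpr std::int64_t kUsed      = 537067512;
constexpr std::int64_t kRemains   = -537057272;

// True if "Remains" is simply "Stack size" minus "Used".
constexpr bool remains_matches_subtraction() {
  return kStackSize - kUsed == kRemains;
}

// "Used" reinterpreted as a 32-bit address.
constexpr std::uint32_t used_as_address() {
  return static_cast<std::uint32_t>(kUsed);
}

static_assert(remains_matches_subtraction(), "Remains == Stack size - Used");
static_assert(used_as_address() == 0x2002FFF8u, "Used is 0x2002FFF8 in hex");
```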
 
Indeed, the malloc() code does do something on first call with regard to finding and using a HEAP to build on?

That's a very interesting take! So a "quick and dirty" workaround for the issue would be to allocate "X" bytes of RAM using malloc in the main thread before adding any new thread, which will initialize some internal malloc() pointers. We can free this RAM, and then all threads created with addThread will be able to allocate around "X" - 5kB of RAM from the heap. Is that correct?

I will definitely give this a try tomorrow. I know that the most correct thing to do would be to refactor the code to not use dynamic heap allocations at all, but the time that would take would be enormous, as my project has been in development for years. The application is also not very critical (a datalogger for sensors with many communication options), and I could tolerate the risk.
 
malloc() works as expected in the main thread (loop), but not in the thread created via addThread(). Something to do with that thread's stack having been allocated from the heap?

One of the tests I performed was to provide static (global) char buffers to the addThread function to use as the stack. It didn't change my results :(
 
I saw similarly odd numbers for Remains in the threadsInfo() output.

Allocating and freeing a block in setup() before adding any threads MIGHT be a workaround, but it might just give way to another issue later on.

The malloc code uses those fixed reference points, and it isn't clear they are valid when threads are in use.

Maybe it does work and the 5K missing is accounting for the Thread RAM usage?
 
Yes, many pitfalls. It seems like dynamic allocation will work only if it is limited to the main thread, where the stack pointer is the system stack pointer. Probably a good idea to also limit Strings and printing to that thread.

Please correct me if I am wrong, but my understanding is: the heap and the main thread stack are placed at opposite ends of the memory. The heap grows up and the stack grows down. We don't even know whether the stack and heap have collided unless we place markers in memory and keep checking their values. The heap is global, whereas the stack is thread-local. So I don't understand why malloc() or "new" would need to rely on the system stack pointer for anything. However, I do understand that malloc() itself is not thread-safe and could lead to memory corruption if it runs in two different threads at the same time. But wouldn't that limitation be overcome if we could just disable interrupts during the malloc(), free() and realloc() calls?
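
To make the "serialize the allocator calls" idea concrete, here is a minimal host-side sketch. The wrapper names (locked_malloc, locked_free, outstanding_allocations) are hypothetical, and std::mutex is only a stand-in: on a Teensy the assumption is that one would substitute whatever locking the threading library provides, or briefly disable interrupts. Note what this does and does not address: it keeps two threads from being inside the allocator at the same time, but it would not change any check the allocator makes against the caller's stack pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Stand-in lock; on the target this would be replaced by the threading
// library's own primitive (an assumption, not shown here).
static std::mutex g_heap_lock;
// Live allocations made through the wrappers, kept for debugging.
static std::size_t g_outstanding = 0;

// malloc() with the heap lock held for the duration of the call.
void* locked_malloc(std::size_t n) {
  std::lock_guard<std::mutex> guard(g_heap_lock);
  void* p = std::malloc(n);
  if (p != nullptr) ++g_outstanding;
  return p;
}

// free() with the heap lock held; a null pointer is ignored.
void locked_free(void* p) {
  if (p == nullptr) return;
  std::lock_guard<std::mutex> guard(g_heap_lock);
  assert(g_outstanding > 0);  // catches a free with no matching allocation
  --g_outstanding;
  std::free(p);
}

// Number of blocks allocated through the wrappers and not yet freed.
std::size_t outstanding_allocations() {
  std::lock_guard<std::mutex> guard(g_heap_lock);
  return g_outstanding;
}
```

The counter is there so a leak like the one in the post #1 test sketch shows up as a steadily growing number. Wrappers like these also only cover calls that go through them; String and operator new would still reach the allocator directly.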

I will try to make some improvements to my code, following your suggestions. For printing, I will try to use the "arduino-printf" library as recommended. I will also try to create all the needed objects and Strings in the main thread and reserve enough space so realloc is not required when the string grows. I will also experiment with the "hack" found by defragster and check if it just works or leads to any problem elsewhere.

The OP didn't say what the application was or how the multi-threading is used.

I am using one thread to perform all communication with an external 4G modem via hardware serial. It uses AT commands so there is a lot of "send command, wait, check reply" involved. With threading, the main loop can keep doing other things while we wait.
 
Malloc seems to have the idea there is one HEAP - but each Thread seems to have some reserved space from RAM - how that is allocated would be shown in TeensyThreads? Does it do malloc() for the requested Stack space?

Depending on how that is ordered the malloc code ... part shown above ... could be making bad calculations when called in Thread context versus expected main() code context ... i.e. ref to Stack Pointer.

Note: I did GLOBAL search for AVR_STACK_POINTER_REG and that is not apparent in the code base - so maybe it is a compiler value? No idea what that maps to. Especially on the ARM, not AVR, processor in use?
 
Looking at TeensyThreads.cpp, no malloc - just :
threadp = new ThreadInfo();
stack = new uint8_t[stack_size];

Added threads get default 1K or specified Stack and Thread0 main loop() gets 10K.

Not sure what that might add to clarify ... but that is the code ...
 
Yes, TeensyThreads uses "new", which maps directly to malloc(). But there is also the option to provide an already existing buffer to serve as the stack for the newly created thread. In my full code I am already doing that, because it seems less prone to bugs and makes it easier to track how much memory is used (memory usage of global static variables is visible at compile time). It did not help with the problems I am facing with threads and dynamic allocation, though. I think this is because the stack allocation is performed by the main thread before the new thread is created, so it is not the new thread that is calling malloc() but the main thread, where malloc() seems to be working fine.

Thank you very much for the help! Tomorrow I will try to apply the suggestions I got from this thread and experiment more with the code.
 
Please correct me if I am wrong, but my understanding is: the heap and the main thread stack are placed at opposite ends of the memory.

This is indeed the default (no threads) behavior for Teensy 2 and 3.

On Teensy 4, the default has the stack start at the top of DTCM (RAM1) memory, and the heap is located in RAM2 memory.
 
malloc() will report that no memory is available if the requested allocation would collide with the stack, as defined by the value of the stack pointer at the time of the call to malloc(). With TeensyThreads, each thread has its own stack. By default, thread stacks are allocated from the heap, so the stack pointer will be no more than "stack size" bytes above the bottom of the heap. You can also pass in the address of an array elsewhere in RAM to use as the stack, but either way, the calculation of available heap space in malloc() will not be correct for any thread except the main thread, which uses the system stack. If what I'm saying here is correct, you can use however much heap space malloc() thinks is available to each thread, but the only thread from which you can access all of the heap is the main thread.
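
The check described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the actual Teensy core or newlib code: the heap_can_grow function is hypothetical, and the two stack pointer values are illustrative guesses (only the heap address is taken from the output in post #1).

```cpp
#include <cstddef>
#include <cstdint>

// Simplified model: the heap may only grow if the new break stays below
// the stack pointer of whoever is calling. Not the real allocator code.
constexpr bool heap_can_grow(std::uintptr_t heap_break, std::size_t request,
                             std::uintptr_t stack_pointer) {
  return heap_break + request < stack_pointer;
}

// Address of the last successful allocation printed in post #1.
constexpr std::uintptr_t kHeapBreak = 0x1fff4d08;
// Assumed: the main thread's stack pointer sits far above the heap.
constexpr std::uintptr_t kMainStackPtr = 0x2002fff8;
// Assumed: the thread's stack was itself allocated from the heap earlier,
// so its stack pointer sits below the current break.
constexpr std::uintptr_t kThreadStackPtr = 0x1fff3000;

static_assert(heap_can_grow(kHeapBreak, 500, kMainStackPtr),
              "from the main thread the heap still appears to have room");
static_assert(!heap_can_grow(kHeapBreak, 500, kThreadStackPtr),
              "from the thread the same request appears to hit the stack");
```

This model only covers the moment the allocator asks for more heap. Requests that can be satisfied from memory the allocator already holds would presumably not reach the check, which may be why a block allocated and freed in setup() appears to stay usable from threads; that reading is a guess and has not been confirmed in this thread.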

EDIT: I did some testing with multiple threads, each one trying to allocate memory, and also doing the same thing in the main thread. The results are "weird", with only the main thread consistently able to malloc() until the heap is actually exhausted. With multiple threads doing allocation, a given thread will have a failure on one pass, and then perhaps succeed on the next pass. I _think_ the total amount of malloc'd memory is correct, so as I said above, you may be able to allocate some memory from any thread, but I don't see an obvious way to know in advance how much heap will be available to a given thread. The results change from run to run. Here's the test code.

Code:
#include <Arduino.h>
#include <TeensyThreads.h>

#define SIZE_ALLOC 1000 // JWP 500
uint32_t total_bytes_alloc = 0;
uint32_t thread_bytes_alloc[5] = {0,0,0,0,0};

void thread_loop( int threadID )
{
  bool fail = false;
  while(1)
  {
    threads.delay(50); // JWP (500)
    char* pointer = (char*)malloc(SIZE_ALLOC);
    if(pointer)
    {
      fail = false;
      thread_bytes_alloc[threadID] += SIZE_ALLOC;
      total_bytes_alloc += SIZE_ALLOC;
      Serial.printf("Thread %1d malloc %d. Addr = %p. thread_total = %d. total = %d\n",
                        threadID, SIZE_ALLOC, pointer,
                        thread_bytes_alloc[threadID], total_bytes_alloc);
    }
    else
    {
      if (fail == false)
        Serial.printf("Thread %1d malloc failed.\n" , threadID);
      fail = true;
    }
  } 
}

void thread_func( void *param )
{
  thread_loop( (int)param );
}

void setup()
{
  Serial.begin(9600);
  delay(1000); // JWP (5000);
  threads.addThread(thread_func, (void*)1, 8192); // 8192 bytes stack alloc'd from heap 
  threads.addThread(thread_func, (void*)2, 8192); 
  threads.addThread(thread_func, (void*)3, 8192); 
  threads.addThread(thread_func, (void*)4, 8192); 
  //threads.addThread(thread_func); // default stack of 1024 bytes is alloc'd from heap
  Serial.println(threads.threadsInfo()); // Information for thread 0 (main) is incorrect, information for thread 1 is missing
}

void loop() 
{
  thread_loop(0);
}
 
You are correct! That's exactly the behavior I was noticing. In my case I have tested with only the main thread + 1 created thread. Sometimes thread 1 is unable to allocate memory, but if the main thread tries to allocate something, it always succeeds, and MAYBE thread 1 is then able to malloc() again because the main thread's malloc() seems to have "unlocked" malloc() for the thread. However, this behavior is very weird and cannot be relied on; we cannot know in advance what the behavior will be, as you pointed out.

However, I have also tested code similar to yours (but with only the main thread and one extra thread), with the difference that I followed the "hack" found by defragster, which is to use malloc() in the setup() function to allocate X bytes before doing anything thread related. We can then free this pointer immediately and proceed to add the threads. Doing this, both the thread and the main thread were able to reliably malloc() around X - 5 kB of RAM. After that, it continues with the "weird" behavior we saw before, where the thread is only able to malloc() sometimes. I don't know how this "hack" would affect other things in the system, but somehow it seems to have worked. For sure, I need to do many more tests before I feel safe using this :)
 