Multiple issues using TeensyThreads on T3.5. Dynamic heap allocation problems.

Defragster, I tried to run the code with the modifications you mentioned in your last post (#48) and could also reproduce the behavior you have shown. Very weird. I have no idea what caused this.

I tried a few more things, including having the thread stacks on the stack of the main thread, and I think the fundamental problem is the same. If you call malloc() with different stack pointers, you can overwrite memory you did not intend. I added a call to memset() in thread_loop() to set all malloc'd bytes to 0xFF. When I do this, only the main thread continues to run.

That's interesting. I modified my last code (post #46) to memset all allocated heap and also use our stack and was still able to successfully and reliably malloc() from any thread. I have also performed a simple check for heap and stack integrity, and it seems to work. I don't know if the test I made was enough to prove anything, but what do you think?

Code:
#include <Arduino.h>
#include <TeensyThreads.h>
#define SIZE_ALLOC 2500 // JWP 500
#define STK_CHK_SIZE 800

uint32_t total_bytes_alloc = 0;
uint32_t thread_bytes_alloc[5] = {0,0,0,0,0};
bool threads_created = false;

void thread_loop( int threadID )
{
  int numLoops = 0;
  char stkSig[STK_CHK_SIZE];
  for ( int ii = 0; ii < STK_CHK_SIZE; ii++ ) stkSig[ii] = (char)threadID; // Set markers on our thread stack
  while(1) {
    numLoops++;
    char* pointer = (char*)malloc(SIZE_ALLOC);
    if (pointer) {
      thread_bytes_alloc[threadID] += SIZE_ALLOC;
      total_bytes_alloc += SIZE_ALLOC;
      memset(pointer, (char)threadID, SIZE_ALLOC); // Set markers on the heap we allocated

      Serial.print("Thread ");
      Serial.print(threadID);
      Serial.print(" loop #");
      Serial.print(numLoops);
      Serial.print(" malloc ");
      Serial.print(SIZE_ALLOC);
      Serial.print(". Addr = 0x");
      Serial.print((uint32_t)pointer, HEX);
      Serial.print(". thread_total = ");
      Serial.print(thread_bytes_alloc[threadID]);
      Serial.print(". total = ");
      Serial.println(total_bytes_alloc);

      bool heap_corrupted = false;
      for ( int ii = 0; ii < SIZE_ALLOC; ii++ )
      {
        if(pointer[ii] != (char)threadID) heap_corrupted = true; // Check for heap integrity
      }
      if(heap_corrupted) Serial.printf("Thread #%d heap corrupted!\n", threadID);
    }
    else
    {
      Serial.print("Thread ");
      Serial.print(threadID);
      Serial.print(" loop #");
      Serial.print(numLoops);
      Serial.println(" malloc failed!");
    }
    bool stack_corrupted = false;
    for ( int ii = 0; ii < STK_CHK_SIZE; ii++ )
    {
      if(stkSig[ii] != (char)threadID) stack_corrupted = true; // Check for stack integrity
    }
    if(stack_corrupted) Serial.printf("Thread #%d stack corrupted!\n", threadID);
    threads.delay(100);
  } 
}

void thread_func( void *param )
{
  thread_loop( (int)param );
}

void setup()
{
  Serial.begin(9600);
  delay(3000);
}

void loop() 
{
  int threadStackSize = 1024;
  int threadCount = 4;
  char threadsStack[threadCount][threadStackSize]; // The thread stacks are created on the stack of loop()
  if(!threads_created)
  {
    // Only runs once
    for (int i=1; i<=threadCount; i++)
      threads.addThread(thread_func, (void*)i, threadStackSize, threadsStack[i-1]); // thread stacks are now on the main stack
    Serial.println(threads.threadsInfo()); // Information for thread 0 (main) is incorrect, information for thread 1 is missing
    threads_created = true;
  }
  thread_loop(0);
}

Does this code also work for you? I stopped using printf as it seems to cause many problems, and in my full code I have already started to use the arduino-printf library, which is malloc free.
 
Not sure if this points to issue with TeensyThreads code or just malloc not liking the thread world as presented.

I think that's correct. It's not just TeensyThreads, but rather any RTOS with a stack per task. And T4.x is okay because it uses smalloc() rather than malloc?

I tried this malloc alternative (https://github.com/thi-ng/tinyalloc) and it seems to work quite well. Just add the "c" and "h" files to your sketch folder, and here is my test program modified to use it, with a static array of 128K bytes. I can't say I've done a lot of testing with it, but it works correctly in this test program. Of course this does not address the use of new...

Code:
#include <Arduino.h>
#include <TeensyThreads.h>
extern "C" {
#include "tinyalloc.h"
}

#define SIZE_ALLOC 1024
uint32_t total_bytes_alloc = 0;
uint32_t thread_bytes_alloc[5] = {0,0,0,0,0};

void thread_loop( int threadID )
{
  bool fail = false;
  while(1) {
    char* pointer = (char*)ta_alloc(SIZE_ALLOC);
    if (pointer) {
      fail = false;
      thread_bytes_alloc[threadID] += SIZE_ALLOC;
      total_bytes_alloc += SIZE_ALLOC;
      Serial.printf("Thread %1d ta_alloc %d. Addr = %p. thread_total = %d. total = %d\n",
                        threadID, SIZE_ALLOC, pointer,
                        thread_bytes_alloc[threadID], total_bytes_alloc);
    }
    else {
      if (fail == false)
        Serial.printf("Thread %1d ta_alloc failed.\n" , threadID);
      fail = true;
    }
    threads.delay(50); // JWP (500)
  } 
}

void thread_func( void *param )
{
  thread_loop( (int)param );
}

const int threadStackSize = 1024;
const int threadCount = 4;
char ta_buffer[128*1024];

void setup()
{
  Serial.begin(9600);
  while (!Serial) {} // JWP (5000);
  
  // initialize tinyalloc
  ta_init( (void*)ta_buffer, (void*)(ta_buffer+sizeof(ta_buffer)), sizeof(ta_buffer)/SIZE_ALLOC, 16, 4 );

  for (int i=0; i<threadCount; i++)
    threads.addThread( thread_func, (void*)(i+1), threadStackSize ); // stack alloc'd from heap 
  Serial.println(threads.threadsInfo());
}

void loop() 
{
  thread_loop( 0 );
}
 
That's another very interesting library I will definitely check. Once again, thanks Joe. I am already using the arduino-printf library, which was your suggestion, and so far it fits my code like a glove. I barely had to make any changes to the original code.

Unfortunately, the biggest problem I am facing right now isn't direct heap allocation using malloc(), but the underlying use of malloc() made by the String class and by object constructors. I can't imagine a way to re-route all the calls those things make from malloc to ta_alloc. :(

I will try to make some more tests to see if any workaround to still use malloc reliably inside the threads is possible. If not, I have a lot of work left to do rewriting all my code :(
 
I think we’ve shown that malloc is not reliable from threads other than main on T3. Let’s talk about why you need String in more than one thread. There is no reason to use multiple threads for anything that is logically sequential. Even if you are writing strings out multiple serial ports in multiple threads, you can construct all strings in the main thread and pass them to the “comm” threads.
 
That's correct. While I agree that it would be possible to just use malloc from the main thread, it would require drastic changes in my code because Strings are used, and sometimes when they grow in size, realloc() is used to create a bigger buffer on the heap. Nowadays, if I were to rewrite the code knowing all the pitfalls of using Strings, dynamic memory allocation and threading on Teensy, I would do many things very differently. Unfortunately, when the code was started I did not have such knowledge, so my insistence on being able to use malloc() is mainly to comply with legacy code.

That being said, one thing is still not clear to me: in my reply #51 in this thread, I posted code where I THINK I was able to reliably malloc from any thread until the heap is exhausted. I also checked the heap integrity and the stack for overruns and could not find any errors. Did you run this code? I ask because your conclusion is that "malloc is not reliable from threads other than main on T3", so I imagine that you saw that code and noticed that something was wrong, something I could not perceive. Could you please tell me what it is?

Since then, I have tried many different things with that code: added entropy to the allocation size and delay times, made one thread free heap allocated by another thread, applied realloc(), checked for heap fragmentation, and many other things. So far, all my tests were successful. If you wish, I can post the full test sketch including those tests (about 180 lines). My conclusion, however, is unclear. Do you think that it is still unreliable? Thank you very much for your time. :D
 
Hi Victor. My phone does not show message numbers, so I’m not sure which one you mean. Is it the one with mutexes? I will try it later. If new uses malloc(), you can switch all allocations to tinyalloc by simply writing your own C function called malloc() which calls ta_alloc(). Just be sure to declare extern “C” and it will be linked instead of the one in libc.
 
Hi Victor. My phone does not show message numbers, so I’m not sure which one you mean. Is it the one with mutexes? I will try it later.

Hi Joe! The reply I am referring to is the one that ends with "Does this code also work for you? I stopped using printf as it seems to cause many problems, and in my full code I have already started to use the arduino-printf library, which is malloc free." That one doesn't use mutexes. I left them out for simplicity, but they would be ideal, because if a thread is interrupted by another one while it is inside malloc(), that would probably cause memory corruption. In the code I posted, my main change was to allocate the stacks of the 4 threads directly on the stack of the main thread (the loop() function). The idea is that this makes malloc() work as intended, since the heap end stays below the stack pointer. So far, at least in my testing, it is working as intended. One thing we need to make sure of is that the buffer allocated for the thread stacks never goes out of scope (it is a local, placed on the program's stack), so the loop() function can never return. But there are some easy workarounds for this without any compromises that I am aware of.

If new uses malloc(), you can switch all allocations to tinyalloc by simply writing your own C function called malloc() which calls ta_alloc(). Just be sure to declare extern “C” and it will be linked instead of the one in libc.

Wow, that would make things super easy. But would it also link this ta_alloc() for libraries outside of my main sketch? Example: imagine that a 4G modem library uses Arduino Strings (WString.h) internally, which use malloc(), realloc() and free() from libc. If I write my own malloc() function in my main sketch, would this malloc also override the one used by WString?
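For the mutex point mentioned above, here is a minimal sketch of what a lock around direct malloc()/free() calls could look like. It is written in portable C++ so it can be compiled on a desktop; LockedHeap and CountingLock are hypothetical names, and the lock type is a template parameter so that any type with lock() and unlock() can be plugged in (on Teensy, presumably the mutex type from TeensyThreads, which is an assumption and not tested here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>   // std::lock_guard only; works with any lock()/unlock() type

// Wraps malloc/free so that only one caller at a time is inside the allocator.
template <typename Mutex>
class LockedHeap {
public:
    explicit LockedHeap(Mutex& m) : mutex_(m) {}

    void* alloc(std::size_t n) {
        std::lock_guard<Mutex> guard(mutex_);  // released when guard goes out of scope
        return std::malloc(n);
    }

    void release(void* p) {
        std::lock_guard<Mutex> guard(mutex_);
        std::free(p);
    }

private:
    Mutex& mutex_;
};

// Stand-in lock, used only to show the call pattern on a desktop compiler.
struct CountingLock {
    int locks = 0;
    int unlocks = 0;
    void lock()   { ++locks; }
    void unlock() { ++unlocks; }
};

void locked_heap_demo() {
    CountingLock lock;
    LockedHeap<CountingLock> heap(lock);
    void* p = heap.alloc(64);
    assert(p != nullptr);
    heap.release(p);
    // One lock/unlock pair per allocator call.
    assert(lock.locks == 2 && lock.unlocks == 2);
}
```

Note that a wrapper like this only protects calls that go through it. Allocations made internally by String or by new would still reach malloc() without the lock.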
 
The last code I posted, with .yield() ending each alloc, was not made safe with a mutex, but I assume each alloc finished and reached the yield before its time slice ended, preventing multiple callers. Then, in the last posting, a counter variable allowed only one thread to alloc at a time, and the problem still persisted.

And TeensyThreads core structs indicated corruption - didn't go back to print more in ThreadsInfo or other debug.

Just seems internal malloc gets hit in unexpected ways it isn't designed to handle without a fixed understanding of heap/stack ram placement. Not sure if it might be solved using .LD offset values for reference or other edit.
 
Sure, concurrent access to malloc() from multiple threads is a problem that should be addressed. However, the main problem looks to be related to how malloc understands the RAM. From my testing, if I place the thread stacks inside the main thread's stack (that is, above the heap), the problem seems to be fixed. Did you test with the code I posted in my post #51? That one works reliably for me.
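To make the "above the heap" argument concrete, here is a simplified model of the kind of check involved. It assumes the core's _sbrk() compares the requested new heap end against the current stack pointer minus a margin, which is my reading of the Teensy 3.x core and should be verified against its source. The addresses and the HeapModel name are made up for illustration; this is plain C++ that runs on a desktop, not the actual core code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified model of an sbrk-style heap that refuses to grow into the stack.
// Addresses are plain integers so the logic can run anywhere.
struct HeapModel {
    std::uintptr_t brk;           // current end of heap (grows upward)
    std::size_t    stack_margin;  // safety gap kept below the stack pointer

    // Returns true and advances brk if the request fits below (sp - margin).
    bool sbrk(std::size_t incr, std::uintptr_t sp) {
        if (sp < stack_margin || brk + incr >= sp - stack_margin) {
            return false;  // malloc() would see this as "out of memory"
        }
        brk += incr;
        return true;
    }
};

// Hypothetical addresses, chosen only for illustration.
constexpr std::uintptr_t kHeapStart       = 0x1000;
constexpr std::uintptr_t kMainSp          = 0x9000;  // main stack, above the heap
constexpr std::uintptr_t kThreadSpInHeap  = 0x0800;  // thread stack located below the heap end
constexpr std::uintptr_t kThreadSpOnStack = 0x8000;  // thread stack carved out of the main stack

void sbrk_model_demo() {
    HeapModel heap{kHeapStart, 64};
    assert(heap.sbrk(1024, kMainSp));           // main thread: heap can grow
    assert(!heap.sbrk(1024, kThreadSpInHeap));  // sp below heap end: growth refused
    assert(heap.sbrk(1024, kThreadSpOnStack));  // sp above heap end: growth allowed
    assert(heap.brk == kHeapStart + 2048);
}
```

In this model a thread whose stack pointer sits below the heap end can never grow the heap, while a thread whose stack was carved out of the main stack passes the same check. It says nothing about the corruption reported earlier in the thread, only about why stack placement could matter.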


I’m sure libc’s malloc is in a separate file, so even though it’s in a lib file, it should get replaced at link time.

That's great! This will be my next attempt at solving the issue if the testing I am performing now doesn't show any improvement. Many thanks for the tip! Right now I have modified my main code to allocate the thread stacks as a local buffer on the main thread's stack. So far it is working. I am keeping the equipment under observation to see if any crash happens in the next hours.
 
Sorry, didn't try when you noted it worked, I got moved to other tasks here ...

Wondering if there is anything telling or unique about the 'new found' malloc code that might point to a solution or change for the currently included code?

Didn't get to download that code and try yet - would be interesting if it led to a testably/provably better general purpose and 'thread safe' solution. Will click download before I head off for current chore and look soon ...

<edit>: Github grab shows license is 'Apache ' not MIT ... not sure if that blends with including in PJRC installed code ....
 
From my testing, if I place the threads stacks inside the main thread stack (that is, above the heap), the problem seems to have been fixed. Did you test with the code I posted on my post #51 ? This one works reliably for me.

Victor, your program does run correctly, but if I simply comment out the checking for stack corruption, it does not run at all. This is what I have seen over and over with this issue. A working program will "break" with a simple change, and a program that doesn't work at all will suddenly run with a minor change that should not affect the results. I think you are mistaken that allocating the thread stacks on the system stack is a reliable fix.

Also, I tried replacing the libc malloc() with a local function that calls ta_alloc(), and that does not work. I know you don't want to do this, but it may not be as difficult as you think to move all of your dynamic memory allocation into the main thread.
 
Joe, sorry for the delay, I didn't have access to my Teensy and computer during the weekend. I was not able to reproduce the error mentioned by you when we check for stack corruption. I tried to comment out many different combinations of lines and none that I've tried seems to have crashed my program. Which exact combination of lines did you comment out?

As for replacing malloc() with ta_alloc(), it's a pity that it didn't work :( But thank you for the attempt. I have come to agree with you that indeed trying to make malloc() work reliably from the threads may be even more time consuming than moving all my dynamic allocations to the main thread.

Wondering if there is anything telling or unique about the 'new found' malloc code that might point to a solution or change for the currently included code? Didn't get to download that code and try yet - would be interesting if it led to a testably/provably better general purpose and 'thread safe' solution. Will click download before I head off for current chore and look soon ...
<edit>: Github grab shows license is 'Apache ' not MIT ... not sure if that blends with including in PJRC installed code ....

I think what ta_alloc does differently from malloc is that it doesn't rely on checking the stack pointer to find out the start and end of the heap, how many bytes are left, etc. Instead, a statically allocated buffer of known size is given to ta_alloc, so it works as just a memory chunk distributor for that memory region. It seems simpler than the standard malloc() found in the Teensy cores. Like malloc(), ta_alloc also doesn't seem to be thread safe.
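As an illustration of the "memory chunk distributor over a static buffer" idea, here is a deliberately small fixed-size block pool. It is not tinyalloc's algorithm (tinyalloc handles variable sizes), and BlockPool is a hypothetical name. The point is only that nothing in it looks at the stack pointer. Like ta_alloc, it is not thread safe as written.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Fixed-size block pool over its own static-sized buffer.
// No sbrk, no stack-pointer checks: it only hands out slices of storage_.
template <std::size_t BlockSize, std::size_t BlockCount>
class BlockPool {
    struct Node { Node* next; };
    static_assert(BlockSize >= sizeof(Node), "block must hold a free-list link");
    static_assert(BlockSize % alignof(std::max_align_t) == 0,
                  "block size must keep every block aligned");

public:
    BlockPool() {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            free_ = new (storage_ + i * BlockSize) Node{free_};
        }
    }

    // Returns one block, or nullptr when the pool is exhausted.
    void* alloc() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // Puts a block obtained from alloc() back on the free list.
    void release(void* p) {
        if (p == nullptr) return;
        free_ = new (p) Node{free_};
    }

private:
    alignas(std::max_align_t) unsigned char storage_[BlockSize * BlockCount];
    Node* free_ = nullptr;
};

void block_pool_demo() {
    BlockPool<64, 2> pool;
    void* a = pool.alloc();
    void* b = pool.alloc();
    assert(a != nullptr && b != nullptr && a != b);
    assert(pool.alloc() == nullptr);  // both blocks are in use
    pool.release(a);
    assert(pool.alloc() == a);        // the freed block is reused
}
```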
 
As for replacing malloc() with ta_alloc(), it's a pity that it didn't work :( But thank you for the attempt. I have come to agree with you that indeed trying to make malloc() work reliably from the threads may be even more time consuming than moving all my dynamic allocations to the main thread.

You haven't said how many threads you have or what they do, but it may be fairly easy to do an initial test by converting the threads that use malloc/new/String to ordinary functions and call them in a loop from your main thread.
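A rough sketch of what "convert the thread to an ordinary function called from the main loop" could look like for a send/wait/parse sequence is shown below. The names (ModemTask, ModemState) and the 1000 ms timeout are hypothetical, and the actual modem I/O is left out. The step function never blocks, so the main loop can call it on every pass alongside its other work.

```cpp
#include <cassert>
#include <cstdint>

enum class ModemState { Idle, WaitingReply, Done, TimedOut };

// One modem exchange expressed as a non-blocking state machine.
struct ModemTask {
    ModemState    state      = ModemState::Idle;
    std::uint32_t sent_at_ms = 0;
    std::uint32_t timeout_ms = 1000;

    // Called once per pass of the main loop; returns immediately.
    void step(std::uint32_t now_ms, bool reply_available) {
        switch (state) {
        case ModemState::Idle:
            // Real code would write the command to the modem here.
            sent_at_ms = now_ms;
            state = ModemState::WaitingReply;
            break;
        case ModemState::WaitingReply:
            if (reply_available) {
                state = ModemState::Done;      // real code would parse the reply here
            } else if (now_ms - sent_at_ms >= timeout_ms) {
                state = ModemState::TimedOut;  // replaces a blocking delay() loop
            }
            break;
        case ModemState::Done:
        case ModemState::TimedOut:
            break;
        }
    }
};

void modem_task_demo() {
    ModemTask ok;
    ok.step(0, false);     // command "sent"
    ok.step(500, false);   // still waiting
    assert(ok.state == ModemState::WaitingReply);
    ok.step(600, true);    // reply arrived
    assert(ok.state == ModemState::Done);

    ModemTask late;
    late.step(0, false);
    late.step(1000, false);  // timeout reached with no reply
    assert(late.state == ModemState::TimedOut);
}
```

The cost of this approach is that every place that used to wait with threads.delay() becomes an explicit state, which is a lot of work for a large communication loop.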

I think what ta_alloc does differently from malloc is that it doesn't rely on checking the stack pointer to find out the start and end of the heap, how many bytes are left, etc. Instead, a statically allocated buffer of known size is given to ta_alloc, so it works as just a memory chunk distributor for that memory region. It seems simpler than the standard malloc() found in the Teensy cores. Like malloc(), ta_alloc also doesn't seem to be thread safe.

Yes, I think that's right.
 
Sorry, I forgot to answer this. I use the main thread 0 (loop) to perform most of the operations needed, such as data logging, mathematical computations, sensor data acquisition, SD card manipulation, USB command parsing, etc. Thread 1 is a communication thread. It works like a second loop, interfacing with a serial modem, checking and sending data, performing some network activities, etc. Most of this work involves some kind of sequential "send command, wait for reply, parse reply", so it is easy to use threads.delay() instead of the standard delay() and keep the modem library almost without any modifications. Although it seems simple, the communication thread is a loop with more than 6000 lines of code, because there is a lot of housekeeping to be done to keep the connections working. It uses dynamic allocation to create Strings and modem objects.

For me, it is not so easy to move all malloc/new/String to the main thread because those two threads are independent and asynchronous. But it seems like I have no option.

I also have a third thread, which is much smaller. All it does is blink an LED with different patterns. This is a 60-line loop and it only uses threads.delay() and digitalWrite().
 
Your design sounds reasonable in terms of having the serial port i/o and the LED blinking in separate threads. You say the main thread and comm thread are asynchronous, but are they completely independent, or do they pass information back and forth? The basic requirement is to allocate dynamic objects in the main thread and pass them to the comm thread via the thread argument (or some other method) or simply make them global.
 
but are they completely independent, or do they pass information back and forth?

The main thread 0 can send some commands to be executed on the comm thread 1, but not the other way around. In order to do this, I have created a thread-safe command queue, with all operations protected by mutexes. The main thread (which generates the commands) can place new commands on the queue at any time. For thread 1, there is a specific point in its loop where it dequeues commands and executes them.
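For readers following along, a queue of the kind described above might look roughly like the sketch below. This is not Victor's code. CommandQueue, Command and NoopLock are hypothetical names, the storage is a fixed array so the queue itself never touches the heap, and the lock type is a template parameter so that any type with lock() and unlock() can be used.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>   // std::lock_guard only; works with any lock()/unlock() type

// Fixed-capacity FIFO whose operations all run under a lock.
template <typename T, std::size_t Capacity, typename Mutex>
class CommandQueue {
public:
    explicit CommandQueue(Mutex& m) : mutex_(m) {}

    // Producer side: returns false when the queue is full.
    bool push(const T& item) {
        std::lock_guard<Mutex> guard(mutex_);
        if (count_ == Capacity) return false;
        items_[(head_ + count_) % Capacity] = item;
        ++count_;
        return true;
    }

    // Consumer side: returns false when there is nothing to dequeue.
    bool pop(T& out) {
        std::lock_guard<Mutex> guard(mutex_);
        if (count_ == 0) return false;
        out = items_[head_];
        head_ = (head_ + 1) % Capacity;
        --count_;
        return true;
    }

private:
    Mutex&      mutex_;
    T           items_[Capacity]{};
    std::size_t head_  = 0;
    std::size_t count_ = 0;
};

// Placeholders for illustration only.
struct NoopLock { void lock() {} void unlock() {} };
struct Command  { int id; int arg; };

void command_queue_demo() {
    NoopLock lock;
    CommandQueue<Command, 2, NoopLock> queue(lock);
    assert(queue.push(Command{1, 10}));
    assert(queue.push(Command{2, 20}));
    assert(!queue.push(Command{3, 30}));  // full: the producer must retry later

    Command c{0, 0};
    assert(queue.pop(c) && c.id == 1);    // first in, first out
    assert(queue.pop(c) && c.id == 2);
    assert(!queue.pop(c));                // empty
}
```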

The main problem in avoiding dynamic memory allocation in thread 1 is that it uses a lot of Strings to do text manipulation that is directly related to the modem comm. It wouldn't make much sense to move this to the main thread. The solution would be to go back to char arrays, but that's a lot of work. Just to give an idea, I searched for "String" in my code and found 1800 occurrences :(
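To give an idea of what one of those String-to-char-array conversions could look like, here is a small sketch that formats a modem command into a caller-supplied buffer with snprintf(). The function name and the "AT+FOO" command are made up for illustration. Nothing here allocates from the heap, and a buffer that is too small is reported instead of silently truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Builds "AT+<cmd>=<value>\r" into out. Returns false if the text does not fit.
bool build_at_command(char* out, std::size_t out_size, const char* cmd, int value) {
    int n = std::snprintf(out, out_size, "AT+%s=%d\r", cmd, value);
    return n >= 0 && static_cast<std::size_t>(n) < out_size;
}

void char_buffer_demo() {
    char buf[32];
    assert(build_at_command(buf, sizeof buf, "FOO", 7));
    assert(std::strcmp(buf, "AT+FOO=7\r") == 0);

    char small[4];
    assert(!build_at_command(small, sizeof small, "FOO", 7));  // would not fit
}
```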
 
Also Joe, could you please tell me exactly which lines you commented out in the code from my post #51 that made it not work at all? I have tried to reproduce this in my code here, and commented out many different sections, but the allocations were still working. Many thanks!
 
I was wrong. Sorry about that. The change that I made affected the printing, but not the allocation. Maybe this is okay. I'm really not sure. Have you tried it in your application?
 
No problem! So, I am already testing this in my application and so far it is working. However, I am not confident about it yet.

This is the full story: I have some occasional reboots in my application where the program is saved by the watchdog. Sometimes the code runs for two days in a row before it crashes and reboots. Sometimes it works for only a few hours. Those random reboots are the reason I started an investigation to see what is going on. I have no idea where the code crashes, because I couldn't recognize any pattern. The TeensyThreads library was one of the things I started to suspect, because I have had some other problems in the past related to threading (such as problems printing to an SSD1306 OLED display, missing incoming characters from hardware serial, and so on), which I managed to fix. Then I started to be suspicious about memory issues, because the threads.threadsInfo() function returns incorrect results, so I thought that it might be a clue showing that the problem is memory related. That was when I experimented with malloc and found the problem that originated this forum post.

Unfortunately, my application still crashes from time to time, so either:
1 - The crashes were not caused by malloc problems, or
2 - The workaround I am testing didn't completely solve the malloc problems

Right now, I am leaning more towards option 1. I think that even if malloc was a problem and the workaround fixed the issue, there is something else in my code that is the cause of the crashes.
 
Are you sure that no thread switches can occur while the threads are executing non-reentrant code? The easiest way to do that is to set the TeensyThreads timeslice to a very large value so that thread switches occur only when your code calls yield(), and to be sure that each thread calls yield() or delay() within its timeslice.
 
Wow, we are following the same line of thought. That is exactly the thing I am trying today. I have set the time slices to 100000 milliseconds and used only yield() and delay() to switch context. The application with this modification has been running for 7 hours now and so far no problems. I will keep checking to see what happens. Sometimes the application was crashing many many hours or even days after startup, so I will need to wait and see.

Do you know if SPI communication (using SPI Library), I2C (using Wire) and Hardware Serial communication codes are non-reentrant? I know that they are not thread safe, but each resource is being used by only one thread in my code. USB Serial, EEPROM, SD and the command queue can be accessed by both threads and are already completely wrapped by mutexes.
 
You are doing the right thing by using each one from only one thread. They are not reentrant, and they are not thread-safe.
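As a small illustration of what "not reentrant" means in practice, here is a generic example. It is unrelated to the actual SPI, Wire or HardwareSerial sources, and the function names are made up. The first function keeps its result in a static buffer, so a second call (for example from another thread after a context switch) overwrites what the first caller is still using. The second version lets each caller own its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Non-reentrant: every caller shares one static buffer.
const char* label_static(int id) {
    static char buf[16];
    std::snprintf(buf, sizeof buf, "dev-%d", id);
    return buf;
}

// Reentrant: the caller supplies the storage.
void label_into(char* out, std::size_t out_size, int id) {
    std::snprintf(out, out_size, "dev-%d", id);
}

void reentrancy_demo() {
    const char* a = label_static(1);
    const char* b = label_static(2);       // overwrites the buffer a points to
    assert(a == b);
    assert(std::strcmp(a, "dev-2") == 0);  // the first result is gone

    char x[16];
    char y[16];
    label_into(x, sizeof x, 1);
    label_into(y, sizeof y, 2);
    assert(std::strcmp(x, "dev-1") == 0);
    assert(std::strcmp(y, "dev-2") == 0);
}
```

Here the two calls happen one after the other in a single thread, but a preemptive thread switch between them would have the same effect, which is why keeping each such resource in exactly one thread avoids the problem.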
 
Just a quick update. I currently have 3 different Teensy 3.5 boards running my full application (with huge slice times for each thread and the thread stacks placed on the main thread's stack). No crashes or reboots yet! :)
 
Good news. Thanks for letting us know. What was your timeslice before?
 