Teensy 3.x multithreading library first release

In these instances where 4 threads are used, there are actually 8 allocated, so does this not create additional overhead that could be alleviated by a linked list that creates only the number of threads needed and no more?

I am just considering things here as the comment posted here from the code triggered a thought that while this library is amazing and we are GREATLY thankful for it, there are opportunities to optimize and make it more efficient with a smaller footprint.
 
I don't think it uses the 8 stacks if you use fewer. However, before I used threads in my dual LCD setup, the jitter was very noticeable when streaming the animations to both LCDs in a single loop. Now the speed is so fast it's as if only one is connected, but there are actually 2, running at top speed :)
 
Tady which?

Teensyduino and Teensy 3.5, quad threads, 13 I2C, 8 SPI, servo, digital HVAC on touchscreen, animated, speed-controlled volume via CAN bus and radio's root console, animated digital push-to-start, keyless & alarm integration, too much to list... IMG_0151.jpg

Without TeensyThreads, the single loop would be a nightmare to deal with, considering the latency of a single thread with all the devices attached :)
 
Hehe nice work..
I had in mind the two lcd code you were talking about ;)
What is that project on the picture btw?
 
I don't think it uses the 8 stacks if you use fewer. However, before I used threads in my dual LCD setup, the jitter was very noticeable when streaming the animations to both LCDs in a single loop. Now the speed is so fast it's as if only one is connected, but there are actually 2, running at top speed :)

You are correct.

Code:
class ThreadInfo {
  public:
    int stack_size;
    uint8_t *stack=0;
    int my_stack = 0;
    software_stack_t save;
    volatile int flags = 0;
    int priority = 0;
    void *sp;
    int ticks;
};

typedef void (*ThreadFunction)(void*);
typedef void (*ThreadFunctionInt)(int);
typedef void (*ThreadFunctionNone)();

The hard-coded thread array allocates 8 instances of this class.

Each of those 8 contains this:
Code:
typedef struct {
  uint32_t r4;
  uint32_t r5;
  uint32_t r6;
  uint32_t r7;
  uint32_t r8;
  uint32_t r9;
  uint32_t r10;
  uint32_t r11;
  uint32_t lr;
#ifdef __ARM_PCS_VFP
  uint32_t s0;
  uint32_t s1;
  uint32_t s2;
  uint32_t s3;
  uint32_t s4;
  uint32_t s5;
  uint32_t s6;
  uint32_t s7;
  uint32_t s8;
  uint32_t s9;
  uint32_t s10;
  uint32_t s11;
  uint32_t s12;
  uint32_t s13;
  uint32_t s14;
  uint32_t s15;
  uint32_t s16;
  uint32_t s17;
  uint32_t s18;
  uint32_t s19;
  uint32_t s20;
  uint32_t s21;
  uint32_t s22;
  uint32_t s23;
  uint32_t s24;
  uint32_t s25;
  uint32_t s26;
  uint32_t s27;
  uint32_t s28;
  uint32_t s29;
  uint32_t s30;
  uint32_t s31;
  uint32_t fpscr;
#endif
} software_stack_t;

It uses 1596 bytes without even doing anything. It seems a more dynamic approach could snip that down a bit.
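As a rough check of where those bytes go, the register save area alone can be counted on a desktop compiler. This is only a sketch under stated assumptions: `SaveAreaNoFpu` and `SaveAreaFpu` are hypothetical stand-ins for the two shapes `software_stack_t` takes without and with `__ARM_PCS_VFP`, and `kMaxThreads` is the hard-coded count of 8 mentioned above. The remaining `ThreadInfo` fields are not counted because their sizes depend on the target's pointer width.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for software_stack_t on a CPU without an FPU:
// r4..r11 plus lr = 9 words.
struct SaveAreaNoFpu {
    uint32_t r[8];   // r4..r11
    uint32_t lr;
};

// Hypothetical stand-in for software_stack_t when __ARM_PCS_VFP is defined:
// the 9 words above plus s0..s31 and fpscr = 42 words.
struct SaveAreaFpu {
    uint32_t r[8];   // r4..r11
    uint32_t lr;
    uint32_t s[32];  // s0..s31
    uint32_t fpscr;
};

constexpr std::size_t kMaxThreads = 8;  // size of the fixed thread array

// Bytes spent on saved registers across the whole fixed array.
constexpr std::size_t saveBytesNoFpu() { return kMaxThreads * sizeof(SaveAreaNoFpu); }
constexpr std::size_t saveBytesFpu()   { return kMaxThreads * sizeof(SaveAreaFpu); }

static_assert(sizeof(SaveAreaNoFpu) == 36,  "9 words of 4 bytes");
static_assert(sizeof(SaveAreaFpu)   == 168, "42 words of 4 bytes");
```

By this count the saved registers account for 8 x 168 = 1344 bytes on an FPU part, which is most of the 1596 bytes reported, and 8 x 36 = 288 bytes on a part without an FPU.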
 
Thanks for all your ideas. I guess 12K of storage is a bit much if you're not using it all. Of course, the part inside the `#ifdef __ARM_PCS_VFP` block is only compiled for CPUs with floating point, so for the Teensy 3.2, LC, etc., it uses quite a bit less memory.

In any case, here is my thinking. One possibility is we can put the state on the stack so there is only one allocation. Or we can allocate a ThreadInfo every time a thread is created (and release when done). These scenarios are problematic because even if the thread is finished, you should still be able to inquire about its state. So when do you deallocate the ThreadInfo structure? At the very least, I can move the "save" structure that stores the registers to the stack. That doesn't need to survive the end of the thread. Most implementations of threads do this. I didn't in order to simplify debugging.

Right now, the library has a fixed array and just overwrites unused ThreadInfo items. We could change that to a linked list that allocates new ThreadInfo items for new threads when all existing items are used. When threads stop, the library just marks these items as empty and then new threads reuse these empty items. In this way, threads are allocated but never deallocated.
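The allocate-but-never-deallocate idea could look roughly like the sketch below. It is not the library's code: `ThreadSlot`, `ThreadList`, and the `EMPTY`/`RUNNING`/`ENDED` states are hypothetical names, and keeping a separate `ENDED` state (so a finished thread can still be queried before its slot is reused) is an assumption layered on top of the description above.

```cpp
#include <cstddef>

// Hypothetical, trimmed-down record; a real ThreadInfo carries more fields.
enum SlotState { EMPTY, RUNNING, ENDED };

struct ThreadSlot {
    int id = 0;
    SlotState state = EMPTY;
    ThreadSlot *next = nullptr;
};

class ThreadList {
public:
    // Reuse the first EMPTY slot; allocate a new node only when none is free.
    ThreadSlot *acquire() {
        for (ThreadSlot *s = head_; s != nullptr; s = s->next) {
            if (s->state == EMPTY) {
                s->state = RUNNING;
                return s;
            }
        }
        ThreadSlot *s = new ThreadSlot;  // intentionally never deleted
        s->id = count_++;
        s->state = RUNNING;
        if (tail_ != nullptr) {
            tail_->next = s;   // append to the end of the list
        } else {
            head_ = s;         // first node in the list
        }
        tail_ = s;
        return s;
    }

    // The thread has finished, but its state can still be inspected.
    void finish(ThreadSlot *s) { s->state = ENDED; }

    // Nobody needs the old state any more: make the slot reusable.
    void release(ThreadSlot *s) { s->state = EMPTY; }

    // Number of nodes ever allocated (the list never shrinks).
    int allocated() const { return count_; }

private:
    ThreadSlot *head_ = nullptr;
    ThreadSlot *tail_ = nullptr;
    int count_ = 0;
};
```

With this shape a program that only ever runs 4 threads at once allocates 4 records, and the list only grows when every existing record is busy.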
 
Would it be possible/make sense to have the user code create the static stack space needed for each thread and pass the pointer for use in the linked list as they get created? Optionally if the user passes a NULL pointer do the alloc as the thread is created?
 
Thanks for all your ideas. I guess 12K of storage is a bit much if you're not using it all. Of course, the part inside the `#ifdef __ARM_PCS_VFP` block is only compiled for CPUs with floating point, so for the Teensy 3.2, LC, etc., it uses quite a bit less memory.

In any case, here is my thinking. One possibility is we can put the state on the stack so there is only one allocation. Or we can allocate a ThreadInfo every time a thread is created (and release when done). These scenarios are problematic because even if the thread is finished, you should still be able to inquire about its state. So when do you deallocate the ThreadInfo structure? At the very least, I can move the "save" structure that stores the registers to the stack. That doesn't need to survive the end of the thread. Most implementations of threads do this. I didn't in order to simplify debugging.

Right now, the library has a fixed array and just overwrites unused ThreadInfo items. We could change that to a linked list that allocates new ThreadInfo items for new threads when all existing items are used. When threads stop, the library just marks these items as empty and then new threads reuse these empty items. In this way, threads are allocated but never deallocated.

True. Alternatively, another boolean in the struct for hasEnded would resolve the issue with determining a threadFinished state.

I did compile that on a 3.5; the byte count was the result of sizeof(threads).

This is certainly something I am highly interested in as all through college, I was the person to whom people went when they had linked list questions. I think they are amazing data structures and I am fascinated by them daily (I know, I need a hobby or something).


Even using a single int as flags, we could determine 7 different states just by handling the bitwise operations. We could bring the footprint down, and maybe make context switching a little faster or improve execution times.
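A minimal sketch of that bitwise handling, assuming one bit per state packed into a single byte. The enumerator names and the three helper functions are hypothetical, chosen only to illustrate the idea.

```cpp
#include <cstdint>

// One bit per state, so all 7 fit in a single byte with one bit to spare.
enum ThreadStateBits : uint8_t {
    ST_EMPTY     = 1u << 0,
    ST_RUNNING   = 1u << 1,
    ST_ENDED     = 1u << 2,
    ST_ENDING    = 1u << 3,
    ST_JOIN_WAIT = 1u << 4,
    ST_PAUSED    = 1u << 5,
    ST_RESUMING  = 1u << 6
};

// Note: these are read-modify-write operations, so on a real scheduler they
// would need protection against a context switch in the middle.
inline void setState(volatile uint8_t &flags, uint8_t bit) {
    flags = static_cast<uint8_t>(flags | bit);
}

inline void clearState(volatile uint8_t &flags, uint8_t bit) {
    flags = static_cast<uint8_t>(flags & ~bit);
}

inline bool hasState(const volatile uint8_t &flags, uint8_t bit) {
    return (flags & bit) != 0;
}
```

Because each state is its own bit, a thread can carry more than one at a time (for example running while another thread waits to join it), which a plain enumerated value could not express.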

For example (and I have been working since 6am, so this would be a really, really rough skeleton of pseudocode):

Code:
struct ThreadInfo {
    uint32_t stack_size = 0; // How much memory will be needed by a calloc()? (a uint8_t would cap this at 255 bytes)
    uint8_t *stack = nullptr; // Pointer to the memory given to the thread
    uint8_t my_stack = 0;
    software_stack_t *softStack = nullptr; // Pointer to the stack this thread had/has access to. (Have to be able to implement critical sections of code) (can also reuse pointers to "dead" threads)
    volatile uint8_t states = 0; // ([FIRST_RUN], [STARTED], [STOPPED]) 3 flags, 1 byte
    volatile uint8_t threadStates = 0; // ([EMPTY], [RUNNING], [ENDED], [ENDING], [JOIN_WAIT], [PAUSED], [RESUMING]) 7 flags, 1 byte
    uint8_t priority = 0;
    void *sp = nullptr;
    int ticks = 0; // Maybe make it atomic?
    // And for those of us who have debugged code with an insane amount of threading
    String threadName;
};

struct ThreadNode {
    int numTotalThreads = 0; // This will be the total length of the list (running threads or otherwise)
    ThreadNode *prev = nullptr;
    ThreadInfo *threadInfo = nullptr;
    ThreadNode *next = nullptr;
};

ThreadNode *head = nullptr;
ThreadNode *tail = nullptr;

If we allocate threads, they get added to the list with a priority. That's a simple insert algorithm. As threads are given processor time, their priority drops a bit. This creates a priority thread list.
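The insert step might be sketched like this. Everything here is an assumption for illustration: `PrioNode`, `PrioList`, and the three functions are hypothetical names, a larger number is taken to mean higher priority, and "drops a bit" is modelled as decrementing by one per time slice.

```cpp
// Hypothetical node: only the fields needed to show the ordering logic.
struct PrioNode {
    int priority = 0;
    PrioNode *prev = nullptr;
    PrioNode *next = nullptr;
};

struct PrioList {
    PrioNode *head = nullptr;
    PrioNode *tail = nullptr;
};

// Insert so the list stays sorted, highest priority first.
// A new node goes behind existing nodes of equal priority.
void insertByPriority(PrioList &list, PrioNode *n) {
    PrioNode *cur = list.head;
    while (cur != nullptr && cur->priority >= n->priority) {
        cur = cur->next;
    }
    // cur is now the first node with lower priority, or nullptr (append).
    n->next = cur;
    n->prev = (cur != nullptr) ? cur->prev : list.tail;
    if (n->prev != nullptr) n->prev->next = n; else list.head = n;
    if (cur != nullptr) cur->prev = n; else list.tail = n;
}

// Detach a node from the list without freeing it.
void unlinkNode(PrioList &list, PrioNode *n) {
    if (n->prev != nullptr) n->prev->next = n->next; else list.head = n->next;
    if (n->next != nullptr) n->next->prev = n->prev; else list.tail = n->prev;
    n->prev = nullptr;
    n->next = nullptr;
}

// After a thread has used a time slice, lower its priority and re-sort it.
void chargeTimeSlice(PrioList &list, PrioNode *n) {
    unlinkNode(list, n);
    if (n->priority > 0) {
        --n->priority;
    }
    insertByPriority(list, n);
}
```

The scheduler would then always pick `head`, and a thread that keeps getting processor time gradually sinks behind the others.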

I will clean this up, but it's just my first thoughts. The entire struct (node) can be deleted once the thread has completed.
 
Would it be possible/make sense to have the user code create the static stack space needed for each thread and pass the pointer for use in the linked list as they get created? Optionally if the user passes a NULL pointer do the alloc as the thread is created?

That's what it does right now.

But perhaps there is some confusion. There are two things that need space. (1) The stack that stores variables as functions get called. So as one function calls another and another, this space gets filled up. Each thread needs one. This is allocated dynamically or you can pass your own pointer. (2) The state of the thread, as stored in ThreadInfo. This is the fixed part that could save some space by being dynamic as well.
 
What is a good general approach to debugging the threads? I have 3 threads running. The main thread is just a heartbeat turning the internal LED on the Teensy 3.6 on and off. One thread reads the serial port data and another thread processes the serial data. The read and process threads are synced up by a mutex.

If I leave the controller running, the main thread crashes every once in a while (the LED stops blinking).
 
If you get a total lockup, one of the main causes is a mutex running into itself due to recursion. If you post code we may find the issue.
 
The main thread calls heartbeat after it starts the other two threads.

Code:
void heartbeat()
{
    pinMode(LED_PIN_INTERNAL, OUTPUT);
    while(true)
    {
        digitalWrite(LED_PIN_INTERNAL, LOW);
        threads.delay(500);
        digitalWrite(LED_PIN_INTERNAL, HIGH);
        threads.delay(500);
    }
}

The two other threads are listed below. There is no data coming in on the serial port. Both mutexes are locked in the constructor.

Code:
void BaseControl::readSerialData()
{
    uint32_t dataRead = 0x00;
    char serialDataCharRead = 0x00;
    bool storeData = false;
    bool serialDataReadComplete = false;

    while(true)
    {
        while (Serial.available())
        {
            if (m_inputData.index < sizeof(m_inputData.buffer))
            {
                serialDataCharRead = Serial.read();
                dataRead = (serialDataCharRead << 24) | (dataRead >> 8);

                if (storeData == true)
                {
                    m_inputData.buffer[m_inputData.index] = serialDataCharRead;
                    m_inputData.index++;

                    if (dataRead == COMMAND_CODES_COMMON_COMMAND_END)
                    {
                        serialDataReadComplete = true;
                        storeData = false;
                        break;
                    }
                }

                if (storeData == false && dataRead == COMMAND_CODES_COMMON_COMMAND_START)
                {
                    storeData = true;

                    // Copy the command start value into buffer to allow data structure reuse
                    memcpy((void*) m_inputData.buffer, (void *) &dataRead, sizeof(dataRead));
                    m_inputData.index = sizeof(dataRead);
                }
            }
            else
            {
                // Can't store any more data, need to process what is in the buffer
                serialDataReadComplete = true;
                storeData = false;
                break;
            }
        }

        if (serialDataReadComplete == true)
        {
            // Wait for the other thread to signal to copy the serial data in to the buffer
            m_lockCopyData.lock();

            m_dataToBeProcessed = m_inputData;

            // Signal the other thread that the copy of the serial data in to the buffer is done
            m_lockProcessData.unlock();

            // Clear read buffer
            m_inputData.clear();
            serialDataReadComplete = false;
        }
    }
}

void BaseControl::processBaseCommand()
{
    while (true)
    {
        // Signal the other thread to copy the new command in to the buffer
        m_lockCopyData.unlock();
        // Wait for the other thread to signal the copy of the serial data in to the buffer is done
        m_lockProcessData.lock();

        if (m_dataToBeProcessed.index > 0)
        {
            const uint32_t commandCode = getCommandCode();

            switch (commandCode)
            {
                case COMMAND_CODES_COMMON_SUB_SYSTEM_ID:
                    processSubsystemIDRequest();
                    break;

                default:
                    processSubsystemCommand(commandCode);
                    break;
            }

            // Clear the decode data buffer
            m_dataToBeProcessed.clear();
        }
    }
}
 
You have two threads manually toggling the mutexes; the safest method would be to use a scope lock. The mutex doesn't "block" the way you're trying to code it, handing turns back and forth between processing and reading. You should be processing the data as soon as it's received, not having the two run in parallel while the time slices switch back and forth. Mutexes protect resources/vars; they're not for signalling when to process data... that's for your code to handle, not a mutex. Regardless, those 2 functions take little to no resources and don't deserve 2 threads, as they more or less run together as one.
 
The threads are set up to allow future changes; for example, the first thread pushes all the commands onto a vector while the second one takes the commands off the vector to process. The two mutexes used to be counting semaphores running in FreeRTOS for task syncs. I'm switching things over to threads to see how it runs. The mutex will eventually protect the command vector.
 
Just saying, all I'm running is scope locks among 4 threads and ~75K of code, no problems here. Your method requires you to keep track of your locks among the functions, but when your code gets bigger, so does the tracking.
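To make the scope-lock idea concrete for the command container discussed above, here is a desktop-C++ sketch using `std::mutex` and `std::lock_guard`. On the Teensy the equivalents would be the library's own mutex and scope-lock types (`Threads::Mutex` and `Threads::Scope`, as far as I recall from the README, so treat those names as an assumption). `CommandQueue`, `push`, and `tryPop` are hypothetical names.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// The mutex guards the container and nothing else. It is never used to
// signal "your turn" between threads.
class CommandQueue {
public:
    // Called by the reader thread once a complete command has arrived.
    void push(uint32_t command) {
        std::lock_guard<std::mutex> guard(lock_);  // locked here
        items_.push_back(command);
    }                                              // unlocked automatically

    // Called by the processing thread; returns false if nothing is queued.
    bool tryPop(uint32_t &out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) {
            return false;  // the guard still releases the lock on this path
        }
        out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex lock_;
    std::deque<uint32_t> items_;
};
```

Each function locks and unlocks within its own scope, so there is no lock state to track across functions, and a thread never unlocks a mutex that a different thread locked. The processing thread would simply poll `tryPop` and yield when it returns false.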
 
Sounds like he's got some time on that, though personally I'm hoping for easier implementation of a real RTOS on the T4.
 
Sanity check. I have a Teensy that communicates through UART to a slow cellular modem part (a SIM800). The library I'm using is blocking, meaning the Teensy stops and waits for the response of the modem (or cellular network). This can sometimes take seconds. I was thinking about rewriting the library to make it non-blocking, but maybe it's also possible to put the communication functions in a separate thread using the multithreading library? How would one proceed with this?
 