TeensyThreads: About Time Slice, Efficiency, Thread Priority and Cycle Time

cebersp

Hi,
this is about TeensyThreads, the library which is part of the distribution. Perhaps these findings are useful to others.
Documentation is here: https://github.com/ftrias/TeensyThreads
Discussion is here: https://forum.pjrc.com/threads/4150...library-first-release?highlight=TeensyThreads

It is a multitasker, and perhaps a hidden gem for some users.
You can define up to 16 threads, which are loops, including the normal Arduino loop().
A lowest-priority timer interrupt regularly cuts into whatever is running, including library code (but not interrupt routines). It stores the registers of the running thread, restores the registers of the next thread, and resumes that thread where it was interrupted. In this way it cycles through all threads. Each thread gets a certain time slice for computing. A thread can call threads.yield() to give up the rest of its time slice, so that the next thread starts immediately.

From the documentation it was not really obvious to me how to set up different priorities and how to choose the right slice time. So I did some experiments with the code below.
Code:
threads.setSliceMicros(10); // Setting is needed!
This sets the time slice for one tick to 10 microseconds.
Code:
  int id1= threads.addThread(thread_func, 1); // Start first function with parameter 1
  int id2= threads.addThread(thread_func2, 1); // Start second function with parameter 1
  threads.setTimeSlice(id2, 10); // set priority. This Thread gets 10 times more time
In this case the second thread gets 10 ticks × 10 µs = 100 µs per cycle.

In my application, I wanted some time-critical audio data handling outside the audio library to run with very high priority, and the graphical user interface with ST7735_t3 to run with rather low priority. The audio library itself is driven by its own interrupt.
If you have 5 threads running, one of them the normal Arduino loop, then their ticks, together with the time for interrupt-driven code, add up to a cycle time. A specific thread runs again only after this cycle time has passed. So you want a short tick. On the other hand, switching threads takes time too, so very short ticks have low efficiency.
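
As a rough sketch of this arithmetic: if every thread uses up its full time slice, the cycle time is approximately the tick length times the sum of all ticks. The helper below is hypothetical (cycleMicros is not part of TeensyThreads) and assumes that thread switching and interrupt-driven code take no time, so real cycle times will be somewhat longer.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper, not part of TeensyThreads.
// Estimates the cycle time in microseconds for threads that never yield.
// Assumption: switching overhead and interrupt time are negligible.
uint32_t cycleMicros(uint32_t tickMicros,
                     std::initializer_list<uint32_t> ticksPerThread) {
  assert(tickMicros > 0);  // a zero-length tick makes no sense
  uint32_t totalTicks = 0;
  for (uint32_t ticks : ticksPerThread) {
    totalTicks += ticks;  // each thread contributes its own number of ticks
  }
  return totalTicks * tickMicros;
}
```

For example, cycleMicros(10, {1, 10}) gives 110 µs for a 10 µs tick with one thread of 1 tick and one thread of 10 ticks. Threads that call threads.yield() or threads.delay() right away add very little and can be left out of the list.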

ThreadEff.jpg
The test uses two counters; the counts after 1 second are compared for different tick lengths.
The picture shows the efficiency for a Teensy 4.1 running at 600 MHz. Two counter threads run as loops without threads.yield(), the first with a time slice of 1 tick and the second with 10 ticks. With a tick length of 10 microseconds, the efficiency is still 95% for the high-priority second thread and 85% for the low-priority first thread.

ThreadsTab.JPG
The table shows the counts in thousands (with 10 times the time you should get 10 times the count) and the maximum cycle time. For a tick length of 10 microseconds, with one thread of 1 tick and one thread of 10 ticks, you get a cycle time of 110 µs. In this case, the time for interrupt-driven code is negligible.
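
The efficiency numbers can be recomputed from the counts listed in the header comment of the test sketch. The snippet below is only one way to do it: efficiencyPercent is a hypothetical helper, and taking the count at the 1000 µs tick as the 100% reference is an assumption, since the plot may use a different baseline.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: efficiency in percent relative to a reference count.
// Assumption: the reference is the count measured with the longest tick
// (1000 µs), where switching overhead is smallest.
double efficiencyPercent(uint32_t countK, uint32_t referenceCountK) {
  assert(referenceCountK > 0);  // avoid division by zero
  return 100.0 * countK / referenceCountK;
}
```

With the counts for a 10 µs tick, efficiencyPercent(7687, 9084) and efficiencyPercent(86931, 90955) come out close to the 85% and 95% quoted above.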

Code:
/* To find Out about Timing and Priorities of TeensyThreads
 *  CWE 05.02.2022
 *  Teensy 4.1 @600MHz
 * Slice: 1µs 1: 623k 2: 47413k Max: 14 Min: 12 
 * Slice: 3µs 1: 4436k 2: 77669k Max: 42 Min: 33 
 * Slice: 10µs 1: 7687k 2: 86931k Max: 110 Min: 110 
 * Slice: 30µs 1: 8620k 2: 89595k Max: 330 Min: 330 
 * Slice: 100µs 1: 8955k 2: 90598k Max: 1100 Min: 1100 
 * Slice: 300µs 1: 9070k 2: 91062k Max: 3300 Min: 3300 
 * Slice: 1000µs 1: 9084k 2: 90955k Max: 11000 Min: 11000
 * Slice: not set 1: 52770k 2: 47977k Max: 21000 Min: 21000 => no priority
 * https://github.com/ftrias/TeensyThreads
*/

#include <TeensyThreads.h>

volatile long int count = 0, oldCount, count2 = 0, oldCount2;
volatile uint32_t lastMicros, diffMicros, maxMicros=0, minMicros= 100000;

void thread_func(int inc) {
  while(1) count += inc;
}

void thread_func2(int inc) {
  while(1) count2 += inc;
}

void measCycle() { // Measure Cycle Time for all Threads in µsec 
  lastMicros= micros();
  while(1) {
    uint32_t nowMicros= micros();
    diffMicros= nowMicros-lastMicros;
    maxMicros= max(maxMicros, diffMicros);
    minMicros= min(minMicros, diffMicros);
    lastMicros= nowMicros;
    threads.yield();
  }  
}

int slice= 10; // Variation here

void setup() {
  threads.setSliceMicros(slice); // Setting is needed!
  //threads.setSliceMillis(500);
  int id1= threads.addThread(thread_func, 1);
  int id2= threads.addThread(thread_func2, 1);
  threads.setTimeSlice(id2, 10); // set priority. This Thread gets 10 times more time
  int id3= threads.addThread(measCycle);
}

void loop() {
  int getCount= count;
  int getCount2= count2;
  Serial.printf("Slice: %dµs 1: %dk 2: %dk Max: %d Min: %d \n", slice, 
    (getCount-oldCount)/1000, (getCount2-oldCount2)/1000, maxMicros, minMicros);
  oldCount= getCount;
  oldCount2= getCount2;
  maxMicros=0;
  minMicros= 100000;
  threads.delay(1000);
  //delay2(1000);
  //delay(1000);
}


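// Busy-wait alternative to threads.delay(), kept for comparison.
void delay2(uint32_t ms)
{
  uint32_t start = millis(); // unsigned, so the difference survives rollover
  while(millis() - start < ms);
}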

So all in all, for a Teensy 4.1 at 600 MHz a tick length of around 10 microseconds seems to be a sweet spot, with still rather good efficiency and a low cycle time.

Have fun, Christof
 
Thank you for this analysis. It is a big help to understand how this system works.

I am looking at creating a data logger, and I am considering this approach. I want to sample the analog inputs at 100 kHz (10 µs period) and store the samples on an SD card.

1. The readerThread needs to read from the analog inputs and then delay for 10 µs. It will read these samples into a buffer. When the buffer is full, it will send a pointer to the buffer to the SDWriterThread. The readerThread will then continue with another buffer.

2. The SDWriterThread will wait until a buffer shows up, and then it will write that data to the SD card. This code will not have any delays in it (as far as I know).

I think I need to set the time slice to 5 µs (5 + 5 = 10) and then use yield in the reader thread, leaving the SDWriterThread to run with the rest of the reader's time slice plus its own 5 µs time slice. So if the reader takes 2 µs to read the samples, the SDWriterThread will get 3 µs + 5 µs to run before the scheduler swaps it out to run the reader again.

Does this sound right?
 
Just a quick reply. If you want precise 100 kHz sampling, you may want to use an IntervalTimer and do the sampling in its ISR. I don't know the internals of TeensyThreads, but relying on the sum of time slices to schedule the sampling may not be as reliable as you want. The SdFat example program TeensySdioLogger shows how to write to SD in a way that avoids the long delays that can occur on write. Writes are usually fast but can take up to 40 ms, so you need to size the ring buffer to hold that much data. If you combine that example program with sampling in an IntervalTimer ISR, you don't need TeensyThreads to do what you want.
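
To put a number on the ring buffer size: at 100 kHz, a 40 ms stall means 4000 samples arrive before the writer gets going again. A minimal sketch of that calculation follows, with hypothetical helper names and an assumed sample size of 2 bytes (the posts above do not say how wide the samples are).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for sizing a logger ring buffer.

// Samples that arrive while the SD card is busy, rounded up.
uint32_t samplesDuringStall(uint32_t sampleRateHz, uint32_t stallMillis) {
  // 64-bit intermediate so rate * time cannot overflow.
  const uint64_t scaled = static_cast<uint64_t>(sampleRateHz) * stallMillis;
  return static_cast<uint32_t>((scaled + 999) / 1000);
}

// Minimum buffer size in bytes, without any safety margin.
uint32_t ringBufferBytes(uint32_t sampleRateHz, uint32_t stallMillis,
                         uint32_t bytesPerSample) {
  assert(bytesPerSample > 0);  // sample width is an input, e.g. 2 for 16 bit
  return samplesDuringStall(sampleRateHz, stallMillis) * bytesPerSample;
}
```

samplesDuringStall(100000, 40) is 4000, so with 2-byte samples the buffer needs at least 8000 bytes, plus some margin.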
 
Many thanks, Christof! This is exactly what I am looking for at the moment.

BadMisterFrosty, I am currently doing something very similar to what you are planning. I am using an ISR triggered by the ADC data-ready line, which puts the data into a circular buffer. In a state machine thread I am chasing the tail of the buffer and putting the data out onto either a serial port or an Ethernet port, i.e. real-time streaming.

With this information about improving thread efficiency and changing timeslices, I feel that I can make it more robust and handle higher throughputs.
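
For readers who have not built one: a circular buffer between an ISR and a thread could look roughly like the sketch below. This is not the code from the post above, just a generic single-producer, single-consumer ring buffer. The class name RingBuffer and the use of std::atomic for the indices are illustrative choices, not something taken from TeensyThreads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal single-producer / single-consumer ring buffer sketch.
// N must be a power of two so the index can be wrapped with a mask.
template <typename T, size_t N>
class RingBuffer {
  static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

 public:
  // Called by the producer (e.g. the ISR). Returns false if the buffer is full.
  bool push(const T &value) {
    const uint32_t head = head_.load(std::memory_order_relaxed);
    const uint32_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) return false;  // full: the sample is dropped
    data_[head & (N - 1)] = value;
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  // Called by the consumer (e.g. the streaming thread). Returns false if empty.
  bool pop(T &out) {
    const uint32_t tail = tail_.load(std::memory_order_relaxed);
    const uint32_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return false;  // empty
    out = data_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

  // Number of elements currently stored.
  size_t size() const {
    return head_.load(std::memory_order_acquire) -
           tail_.load(std::memory_order_acquire);
  }

 private:
  T data_[N];
  std::atomic<uint32_t> head_{0};  // written only by the producer
  std::atomic<uint32_t> tail_{0};  // written only by the consumer
};
```

The ISR would call push() with each new sample, and the streaming thread would call pop() until it returns false. If push() returns false the buffer has overrun and the sample is lost, so it makes sense to count those events.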
 