Zilch cooperative multi-tasking for Teensy 3.x/LC

Status
Not open for further replies.
I like the approach of statically allocated tasks. That way you can size the heap/stack without considering the number of tasks.

I guess I'll have to figure out some way of handling the floating-point registers when switching between two tasks. A non-context-switch example would be funcA using floating point but not finishing its floating-point operation when funcB is called from funcA. Does the hardware handle that for us?

I'm fairly new to ARM and Arduino, but the way I'll use Zilch is to call only create() and yield(). No task can call yield() in the middle of a floating-point operation, so no task can possibly interrupt another task's floating-point. The only constraint is that if there are interrupts, the ISRs should do no floating-point. I'll add mailboxes and queues for communication between tasks and between ISRs and tasks, and one timer interrupt for delays and for timeouts on mailboxes and queues.
 
Hi Duff (and others),

This stuff looks interesting and may be something I might try for my well monitoring stuff... Sorry if my questions are a little too basic.

I was wondering if there are some good examples of how to do some fairly simple things, for example handling input from a serial port. Suppose I have a simple task that waits for input from an XBee or, in my newer cases, a LoRa radio. If it receives data, it would probably do some simple processing.

On Linux, I would create a thread, probably set up an event object and then tell the thread to wait for that event to happen.
Currently on Teensy, I have the main loop cycle through whichever objects might be there. Alternatively I could use serialEvent().

What is the preferred way to do this in zilch? So far my guess is to create a task that does something like:
Code:
static void task1(void *arg) {
    while ( 1 ) {
        while (Serial1.available()) {
            // process current stuff
        }  
        yield();
    }
}
Which I guess would work. I can also imagine the yield() being replaced by a delay() to reduce how often the task is called when it has nothing to do.

Again looks like some fun stuff to try out.

Kurt
 
Kurt, your code looks reasonable, though I'm not experienced with Arduino. I've done lots of systems with similar cooperative tasking. As for using delay(), I think it's something to avoid. The number of swaps would be reduced, but only by spending time spinning inside delay(). There's no real advantage to that, and it would make your system less responsive.
 
Based on Duff's reply about serialEvent() to my initial late-night look, what you show is what I was expecting. Maybe just transfer any waiting data to a local buffer for the sketch or another task to process?

Since the T_3.6 cycles loop() at [1 million/sec @240 MHz or 750,000 @180 MHz] with yield() each loop() going through USB plus Serial[1-6] I don't think you have to worry about calling too much when you only add one for each active port. Better to make the quick check than to miss/corrupt input or complicate it with stopping/restarting in any way.

NOTE:: My timing sample is using USB and Serial2. I hacked yield() to do only USB & Serial2 and my loop count is now :: 1,752,557. The loop() is minimal - but checking the unused Serial1, 3, 4, 5, 6 adds a lot of overhead. Even in every third second, when I transfer 126 bytes looped back on Serial2 and then print them to USB, the loop() count only drops to 1,739,802.

It is my intent to replicate that same sketch (a sample I wrote for a post reply last week) under Zilch tasks to compare the cycle rates. I just plugged in a T_3.0 and ran that loopback loop() timing test. At 96 MHz the T_3.0 does 260,384 loops with yield() only checking USB and Serial2; restoring all the serialEvent checks to yield() drops it to 199,661.
 
Thanks, I probably should look at the code. Was not sure if the implementation of delay was updated, to logically take the thread off of a logical ready list, and only put it back on if the amount of time had elapsed...
 
What is the preferred way to do this in zilch? So far my guess is to create a task that does something like:
Code:
static void task1(void *arg) {
    while ( 1 ) {
        while (Serial1.available()) {
            // process current stuff
        }  
        yield();
    }
}
Which I guess would work. I can also imagine the yield() being replaced by a delay() to reduce how often the task is called when it has nothing to do.
Yes, that would work just fine, though calling 'yield' is actually better for context switch performance since 'delay' has some overhead. Using a while loop is fine, but if you hang around in there too long without calling yield your other tasks will be starved for CPU time.
 
Thanks, I probably should look at the code. Was not sure if the implementation of delay was updated, to logically take the thread off of a logical ready list, and only put it back on if the amount of time had elapsed...
No, 'delay' fortunately calls yield but 'delayMicroseconds' does not. The scheduler is just a simple round robin with no ready list or such. If you want something periodic just use IntervalTimer; these tasks will not block interrupts. This library sits in between the standard big loop and an RTOS. What's nice is you can use something like the Audio library with really no problem, because all the Audio objects' update functions will always preempt a Zilch task. Then you can have, for example, tasks for FFT, peak, etc. while updating an OLED in a multitasked way.
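
For a task that only needs to do work every so often, one alternative to delay() is to check the elapsed time on each turn and yield straight away when nothing is due. The following is only a minimal sketch of that check: the `Periodic` struct and its member names are hypothetical and are not part of Zilch or the Teensy core, and the clock value is passed in so the logic can be tried off-target.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Zilch or the Teensy core).
struct Periodic {
    uint32_t period_ms;  // desired interval between runs
    uint32_t last_ms;    // clock value at the last run

    // Returns true when at least period_ms has passed since the last run.
    // Unsigned subtraction keeps the comparison valid across counter rollover.
    bool due(uint32_t now_ms) {
        if (static_cast<uint32_t>(now_ms - last_ms) < period_ms)
            return false;
        last_ms = now_ms;
        return true;
    }
};
```

Inside a task this would be used as: loop forever, call `due(millis())`, do the work only when it returns true, then call yield() either way so the other tasks keep running.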
 
I got my prior serial test post to work with the serialEvent replaced by task2 below. I don't have the post link in the sketch as I usually do - but that user wanted to 'read a serial string terminated by 10ms deadtime'. So that is the rest of the sketch that watches an elapsedMillis() for 10 ms since last character and then prints the string received. It is a single Teensy 3.6 in this case with Rx><Tx crossed. Each 3 seconds it prints Serial2 back to itself to receive as below. Other than note below about overall timing it is running the Serial2 transmission printing to USB and functioning the same.

NOTE: I also added the timing loop from that code and with Zilch in control the 1 second loop count is 380,938 at 240 MHz versus 1,000,000 in native Arduino code. And that is with Zilch yield() not polling the other 5 serial ports which as noted above in native Arduino gave a count of 1,752,557 loops/second. Other than task 1 and 3 blinking the LED and then delay() & yield() it is the same functionality.


Code:
//void serialEvent2() {
/*******************************************************************/
// 2nd task
static void task2(void *arg) {
  while (1) {
    char ich;
    if ( Serial2.available() ) {
      if ( w_time > 0 ) {
        if ( w_time > 20 )
          Serial.println();
        Serial.print( ":"); // debug
      }
      while ( Serial2.available() ) {
        ich = Serial2.read();
        t3_buf[t3_idx] = ich;
        ++t3_idx;
      }
      t3_buf[t3_idx] = 0;
      w_time = 0; // reset timer since data arrived
    }
    yield();
  }
}
 
init_stack() contains the code shown below, which stores the current value of MSP in os.loop_stack_root, but then, rather than setting os.task[0].stack_top and os.frame[0].sp to the VALUE of os.loop_stack_root (MSP), it sets them to the ADDRESS of os.loop_stack_root. I think it should assign the value of os.loop_stack_root to both "stack_top" and "sp" of the 0th task. This error doesn't break anything because "stack_top" is only for debug and frame[0].sp gets overwritten by the first call of task_swap(), i.e. before it is ever used.

// asm to save current stack pointer for the main task in os object
asm volatile( "MRS %[result], MSP\n" : [result] "=r" ( os.loop_stack_root ) );
os.task[0].address = 1;
os.task[0].stack_top = &os.loop_stack_root;
os.frame[0].sp = &os.loop_stack_root;
 
I'm thinking of not using the 'loop' at all anyway. It becomes an oddball task for which I can't figure out how to use a predefined memory space for its stack. I had to travel this weekend so I didn't do any work on this. I'm leaning hard toward using a single memory pool that tasks get their stack space from, so that it's easier to implement some basic stack integrity checks. This would mean that the user would have to allocate a total memory amount and then make sure that the stack memory allocated for each task fits within the pool. Right now I'm working on optimizing the context switch even more, removing the bloat from the 'yield' function: we don't need to update the tasks' state, because they cannot preempt each other and we already know what state they're in.
 
I'm thinking of not using the 'loop' at all anyway. It becomes an oddball task for which I can't figure out how to use a predefined memory space for its stack.
I'm not sure what you mean. The stack of the loop task is the "system" stack. You don't need to initialize its SP, and you don't need to figure out where it is. The first time yield() gets called, the loop (0th) task frame gets saved. I don't see anything wrong with letting the user allocate the stacks. That way they can choose where they want them.

Right now I'm working on optimizing the context switch even more, removing the bloat from the 'yield' function: we don't need to update the tasks' state, because they cannot preempt each other and we already know what state they're in.

I agree with that. You really don't need to use bit masks. Since tasks are in an array, the code is simpler and more readable if you define "ntasks", where ntasks=1 for the loop task, and curtask, which is 0 for loop task and 1..N for additional tasks. With that approach, yield() looks like this:

void yield( void ) {
    // return if only the main context is running
    if (ntasks <= 1)
        return;
    // get pointers to prev/next tasks and increment curtask w/ wrap
    volatile stack_frame_t *prev = &os.task[curtask].frame;
    if (++curtask >= ntasks)
        curtask = 0;
    volatile stack_frame_t *next = &os.task[curtask].frame;
    // make the swap
    task_swap( prev, next );
}
 
I hear you, but having the oddball loop task will mean that you will have a hard time performing any checks against it. Almost no scheduler or RTOS out there that I've seen uses the loop as a task or thread.

As far as the bit masks and nTasks go, I've been working on something more basic. Knowing that the stack structures have sequential addresses, I hope to use this information in the context switch assembly. If I can pull it off, I will cut about 90% of the instructions in 'yield' from before and make it scream in switch speed! There are a few details, like programming, that need to be done now :p.
 
The loop task exists whether it executes loop() or another function. I like to put a while(1) in loop() so it's like other tasks. You should not assume all tasks have the same stack size, if that's what you mean by sequential. Better to keep tasks and swapping separate, in my opinion.
 
The loop task exists whether it executes loop() or another function. I like to put a while(1) in loop() so it's like other tasks.
I can't see a good reason to keep it. I just updated the library here, you will now have to make another task for code you put in loop. I've updated the example to show this also.

You should not assume all tasks have the same stack size, if that's what you mean by sequential. Better to keep tasks and swapping separate, in my opinion.
No, I don't assume that. You can have a stack of any size you want, but it has to be 4-byte aligned, which is taken care of by the uint32_t stack arrays.

I did however optimize 'yield'; it should now be much faster, though I still need to measure it.
 
I didn't get to your new code tonight. But I got a dedicated T_3.6 doing FreqMeasure counting LED blinks, which I set up while helping on another thread.

Indeed on the prior version one quarter of the blinks come from loop() code. Where I did 3 tasks with just my qBlink(); and put {qBlink(); return;} for loop.

The speed of that with ZILCH is 1MHz, if I do an Arduino sketch with nothing but qBlink() I get 2.2222 MHz [with yield only checking USB Serial]. If I restore yield() in Arduino to check all 6 Serial# T_3.6 ports that drops to 681,818.

qBlink() is one of these - where the one commented slows it down by over 5% taking 1,000,000 to 937,500.
Code:
#define qBlink() {GPIOC_PTOR=32;} // Toggle LED_BUILTIN   GPIOC_PTOR = (1 << 5);
//#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))

// ...

static void task3(void *arg) {
  while ( 1 ) {
    qBlink();
    yield();
  }
}

I can clean and post the FreqMeasure sketch - for me with no hardware tools, but a spare T_3.6, it was an easy way to gauge processor throughput without adding overhead instrumentation in the sketch. If I wired up another couple of T_3.6's I could monitor time in each task if they each blinked a unique pin. Now I wonder how fast interrupts are for counting these frequencies - maybe one T_3.6 could catch them?

As always - thanks to Koromix for TYQT - it is a great SerMon tool! With a bit of a hack to pins_teensy.c the newest TYQT is reliably online in just over 400ms on Windows!
Code:
	delay(20);  // split the delay(400);
	usb_init();
	delay(380);
 
I didn't get to your new code tonight. But I got a dedicated T_3.6 doing FreqMeasure counting LED blinks, which I set up while helping on another thread.
Cool, I'm looking for use case examples.

The faster yield looks good. Very nice. Do you still intend to support Teensy LC?
Yes, 'yield' is much faster now. I messed around with that far too long, but I'm happy for now. Yes, I will get on with porting to the LC, just not until after Thanksgiving. I still need to look at the floating-point stuff for the 3.5/3.6 also.
 
It would be a change to your pure round robin, but having a way to raise/lower a task's priority seems like a natural need. One way would be to build a RUN LIST at init/begin (I haven't looked at the new yield yet): with, say, 5 tasks, a low-priority task would be put in the list once, normal tasks twice, and a high-priority task three times, so a yielding low-priority task would come back less often than a yielding high-priority one. Other than the size of the queue/run list, this would have no impact on the yield processing overhead:
(task#/priority).begin:: 1/HIGH, 2/NORM, 3/NORM, 4/NORM, 5/LOW would run in a cyclic order of: task#:: 1, 2, 3, 4, 1, 5, 1, 2, 3, 4 (repeat).
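
To make the run-list idea concrete, here is a minimal host-side sketch. It is not how Zilch schedules tasks, and every name in it (`run_list`, `next_task`, `slots_for`) is hypothetical; the list contents are simply the cyclic order given above, hard-coded once at startup so that the per-yield cost is one index increment with wrap.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical static run list: task numbers in the cyclic order from the
// post above (task 1 HIGH x3, tasks 2-4 NORM x2, task 5 LOW x1).
static const uint8_t run_list[] = { 1, 2, 3, 4, 1, 5, 1, 2, 3, 4 };
static const size_t  run_len    = sizeof(run_list) / sizeof(run_list[0]);
static size_t        run_pos    = 0;  // slot of the task currently running

// Advance to the next slot with wrap and return the task number to run.
// In a scheduler this lookup would sit where yield() picks the next frame.
uint8_t next_task() {
    if (++run_pos >= run_len)
        run_pos = 0;
    return run_list[run_pos];
}

// Count how many slots a task gets in one full cycle of the run list.
size_t slots_for(uint8_t task) {
    size_t n = 0;
    for (size_t i = 0; i < run_len; ++i)
        if (run_list[i] == task)
            ++n;
    return n;
}
```

Whether this would drop into the library cleanly depends on how its task and frame structures are indexed, which this sketch does not model.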

QUESTION: can Task#5 CALL Task#1 - i.e. is there a 'USER' way to jump to the stored stack/task? Maybe that is a way for a low priority task to call a high priority task on demand, as a way of allowing the user to prioritize with no overhead in yield()? Having this would allow the sketch to programmatically adjust priority at runtime.

The FreqMeas came from this: FreqMeasure-stops-reporting-values-after-a-few-seconds - not directly a use case for Zilch, but an external tool that seemed to fit, since adding in prints and spew really messes up the actual perf. Though as below it might be informative to make the one Zilch task be the FreqMeas timer on the same T_3.6.

In that sample I run loop() with qBlink() on pin 13 while at the same time running FreqMeas on pin 3 of the same Teensy, as a simple way to know FreqMeas was working. That made it easy to try things that improved the cycle rate. When I changed the one math line the original user did from double to float, it jumped something like 20% on the T_3.6: "float frequency = FreqMeasure.countToFrequency(sum / count);". In doing that, any simple "if()" that seemed like a way to save cycles by reducing other overhead ended up dropping the net cycle count.

It does show that when little is happening that the current yield() with serialEvent checks times SIX limits the top end - and ZILCH is just repurposing that overhead. When I get it on the new Zilch - I wanted a baseline first - I should see improvement later today.
 
Quick update using the fresh Zilch_Beta v1.1 - I commented out the worker thread in the updated memory_layout example to have the fast qBlink() in each of the three tasks.

2,400,000 is the reported frequency from those three tasks hitting qBlink()! It seems the new yield() - and no loop() - processing has much less overhead. ( IDE 1.6.12 on a T_3.6 at 240 MHz )

This compares well to the version before - 2.4 MHz versus 1 MHz - and now beats/equals RAW Arduino at 2.2 MHz with USB serialEvent() still in PJRC yield().

This only shows the RAW maximum task switch rate with no workload. With globals ii and jj, even the ii++ in each task drops it to 2.1 MHz, and adding the conditional jj++ drops it to 2 MHz.
Code:
static void task3(void *arg) {
  while ( 1 ) {
    qBlink();
    ii++;
    if ( ii % 2 ) jj++;
    yield();
  }
}

However, making this change results in a 3.157 MHz update rate, showing that qBlink() isn't the limiting factor: if the task stays and works twice as long, the time lost to task switch overhead goes down and the blink rate goes up as expected. FreqMeas can keep up and runs plenty fast, though I think the 'edge' may be near. [ Without the millis() compare it actually went up to 3.333 MHz at %2 but down to 1.7 at %3 ]
Code:
static void task3(void *arg) {
  while ( 1 ) {
    qBlink();
    ii++;
    if ( ii % 2 ) {
      if ( millis() % 2 ) {
        jj++;
      }
      yield();
    }
  }
}

I'll make a version with a self contained FreqMeas() task ...
 
QUESTION: can Task#5 CALL Task#1 - i.e. is there a 'USER' way to jump to the stored stack/task? Maybe that is a way for a low priority task to call a high priority task on demand, as a way of allowing the user to prioritize with no overhead in yield()? Having this would allow the sketch to programmatically adjust priority at runtime.
Hmmm, good question. Yes, you can call another task explicitly, but doesn't that defeat the purpose of having multitasking? Also it could get into a weird loop where it blocks other tasks, but try it out and let me know. Though I would say I won't support this in the library.

As far as priorities go, I don't want to go down that path now or ever. I would say use ChibiOS or FreeRTOS, which are ported to Teensy, or use IntervalTimer or some other ISR-related timer to have high priority code run at predictable intervals.

Right now I'm exploring the use of the PSP stack pointer for thread code, i.e. non-ISR code (ISRs always use the MSP stack pointer). Might be fool's gold: this is how most RTOSes for ARM do it, but for tasks (fibers) this could cause problems.
 
I have a working version of memory_layout_qBlink2.ino (attached) that prints out the cycle rate for all ZILCH tasks, with worker() enabled with a smaller stack and a slower print rate so the stacks fit on the screen. This can run on a single Teensy where LED pin 13 is wired to pin 3 of the same Teensy. Or a second Teensy can run this and print, in which case taskFM() can be removed from the Teensy under test, with the wire going from TeensyA pin 13 to TeensyB pin 3.

Your call:
> I was hoping there might be a way to have taskA call taskB through the ZILCH scheme. I did not want to call it 'manually' through the "C" interface - it would have weirdness for sure and wouldn't have ZILCH stack data. I just wanted an out-of-order, on-demand call to a ZILCH taskB. I pictured a special yieldTo( TaskB ), perhaps, that would pop taskA, run taskB one time, then resume taskA.
> Given I have not yet looked at the ZILCH code, it seemed like it might fall out naturally to prepopulate the call list with different numbers of references to different tasks once at startup, without any complexity during normal run.

It will be easy to have a task hang (forget to put in yield()) or have a task end prematurely. Two "#ifdef DEBUG" things I thought of:
> before joining a normal YIELD cycle, under DEBUG run each task once:: print 'starting task #' - then on yield() return, print 'STACK USE by task #'. Continue through each task once and then start normal operation.
> #ifdef DEBUG: print "task # exit" when any task does a return exit versus a yield to identify a malformed task.
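
The 'STACK USE by task #' report could be produced with a fill-pattern scan. Below is a minimal sketch of that technique only, not the library's code: the pattern value and the function names are assumptions, and it relies on the stack growing downward (as it does on ARM), so the untouched words sit at the low end of the array.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed fill value; the pattern the library actually uses may differ.
static const uint32_t FILL_PATTERN = 0xFFFFFFFFu;

// Fill a task's stack array with the pattern before the task first runs.
void stack_fill(uint32_t *stack, size_t words) {
    for (size_t i = 0; i < words; ++i)
        stack[i] = FILL_PATTERN;
}

// Count the words from index 0 upward that still hold the pattern; everything
// above the first overwritten word is treated as used (high-water mark).
size_t stack_used_words(const uint32_t *stack, size_t words) {
    size_t unused = 0;
    while (unused < words && stack[unused] == FILL_PATTERN)
        ++unused;
    return words - unused;
}
```

The count is a lower bound on the true depth if the task happened to push a value equal to the pattern at its deepest point.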
 
Very much agree on staying away from pseudo-priority. I like the new yield(), which is where all of the speed improvement comes from, but prefer using Arduino setup/loop as 0th task, which is consistent with all RTOS I've used.

If you add these lines to task_create(), you can build the linked list as you go and avoid the loop in os_start().

if (num > 0) // if not 0th task
    os.frame[num-1].next = &os.frame[num]; // prev frame's 'next' = this frame
os.frame[num].next = &os.frame[0]; // this (last) frame's 'next' = 0th frame
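
To see that these lines keep the frames linked as a ring after every call, here is a self-contained sketch that can be compiled on a PC. The struct layout, `MAX_TASKS`, `num_tasks`, and the argument-less `task_create()` are simplified assumptions for illustration; the library's real definitions carry more fields and parameters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the library's frame structure.
struct stack_frame_t {
    stack_frame_t *next;  // next frame in the round-robin ring
};

static const size_t MAX_TASKS = 8;  // assumed limit for this sketch

// Static storage duration, so all members start out zeroed.
static struct {
    stack_frame_t frame[MAX_TASKS];
    size_t        num_tasks;
} os;

// Register one task and keep the frames linked as a ring.
// Returns the index of the new task.
size_t task_create() {
    assert(os.num_tasks < MAX_TASKS);
    size_t num = os.num_tasks++;
    if (num > 0)                                  // if not 0th task
        os.frame[num - 1].next = &os.frame[num];  // prev frame's 'next' = this frame
    os.frame[num].next = &os.frame[0];            // this (last) frame's 'next' = 0th frame
    return num;
}
```

After one call, frame 0 points to itself; after three calls the chain is 0 -> 1 -> 2 -> 0, so no separate linking pass is needed at start.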

gcc inits globals to 0, so init_stack() resolves to just one line (two if you use memset() to zero explicitly):

memset( &os, 0, sizeof(os) );
os.memory_fill_pattern = memory_fill;
 
I got the teensy LC working, really didn't have to change much except the context switch code. You can download the library here.
 
Glad the LC worked out nicely!

RE my comments above - I looked at the code - indeed none of that falls out naturally to keep it clean and efficient with the reliance on calling the common yield() - any diversion from that path with a task deferring to a 'priority' task breaks when that task calls yield(). And as far as I read multi-placing a task breaks because there are two separate structures in parallel arrays used independently to navigate the switching. That might be structurally changeable - but would be a question for another day.

For my timing testing last I looked at it I used the debug() pin 12 toggle in yield() to count cycle rate on a second Teensy rather than putting that in each task. It shows good performance - maybe that is how you were using it @duff with a counting device? With nothing else going on the time spent in the PJRC yield() with check serialEvent() for USB and all Serial#'s really adds significant overhead.
 
I got the teensy LC working, really didn't have to change much except the context switch code. You can download the library here.

Very nice. I've got it working on 3.2 and LC. Hope you don't mind if I ask a few questions.

For the KINETISL, you call task_swap(), and the ASM code assumes that the two arguments (current frame and next frame pointers) are in registers r0 and r1. For KINETISK, you do the swap entirely in ASM, including statements to move the current frame and next frame pointers into r0 and r1. Is it guaranteed that two pointer-type arguments will be in r0 and r1 when you call a function? Is that true regardless of whether or not the function defines the arguments as volatile?

Can you explain your choice to use PSP rather than MSP for tasks?

For LC, is there any reason you don't save the current frame in two steps rather than three? I tried the code below and it seems to be equivalent to the 3-step process in beta 2.0

"STMIA r0!,{r3-r7}" "\n\t" // Save r3,r4-r7 to r0! (currentframe->sp)
"MOV r2,r8" "\n\t" // move r8 into r2
"MOV r3,r9" "\n\t" // move r9 into r3
"MOV r4,sl" "\n\t" // move sl (r10) into r4
"MOV r5,fp" "\n\t" // move fp (r11) into r5
"MOV r6,ip" "\n\t" // move ip (r12) into r6
"MOV r7,lr" "\n\t" // move lr into r7
"STMIA r0!,{r2-r7}" "\n\t" // Save r8-r12,lr to r0! (currentframe->r8)
 