Use of yield() in Teensy Cores and Libraries

joepasquariello

The recent thread "Thoughts on Handling Complexity" led to a discussion of cooperative multi-tasking and the use of yield(). In Arduino, yield() is an empty function defined as "weak" to allow replacement with a function that does a cooperative task switch. Here is the comment from the Arduino core:

/* Empty yield() hook. This function is intended to be used by library writers
* to build libraries/sketches that support cooperative threads. It's defined as
* a weak symbol and can be redefined to implement a cooperative scheduler.
*/

The calls to yield() in the latest version of the Teensy cores are listed below. All are made from within while loops, waiting either for a hardware operation to complete, or a timeout to expire. If using a cooperative RTOS, calling any of these functions from within a task will result in that task yielding the CPU if the function must wait for the operation to complete. If there is no need to wait, such as being able to put some number of bytes into a TX buffer, the function will return without yielding.
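
As a rough illustration of the wait-loop pattern described above (not the actual core source), here is a host-side sketch. The names `waitWithYield`, `fakeMillis`, and `yieldCalls` are hypothetical, and `millis()` and `yield()` are simple stand-ins placed in a `demo` namespace so the example compiles off-target:

```cpp
#include <cstdint>

namespace demo {

// Host-side stand-ins (assumptions for illustration, not the Teensy core).
static uint32_t fakeMillis = 0;  // simulated millisecond clock
static int yieldCalls = 0;       // how many times the caller gave up the CPU

uint32_t millis() { return fakeMillis; }

// In a real sketch this would be the overridden weak yield(), e.g. a task switch.
// Here it only counts the call and lets simulated time advance by 1 ms.
void yield() { ++yieldCalls; ++fakeMillis; }

// Wait until done() returns true or timeoutMs has elapsed, yielding while waiting.
// Returns true if the operation completed, false on timeout.
template <typename Pred>
bool waitWithYield(Pred done, uint32_t timeoutMs) {
  const uint32_t start = millis();
  while (!done()) {
    if (millis() - start >= timeoutMs) return false;  // unsigned math is wrap-safe
    yield();
  }
  return true;  // no yield at all if the operation was already complete
}

}  // namespace demo
```

If the condition is already true on entry, the function returns without yielding, which matches the "no need to wait" case above.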

I think this is the right model to maintain, so the only change I would suggest to the cores is to add yield() to analogRead() in Teensy4/analog.c, to be consistent with Teensy3/analog.c.

The Teensyduino libraries contain many uses of yield(), too many to discuss in one thread, but I’m interested in SPI and I2C, so I’ll focus on those for now. The SPI library transfer functions currently do not call yield(), and I think they could. For I2C, the Teensy4 file WireIMXRT does call yield() while waiting. WireKinetis has timeouts but does not call yield(), and I think it should.

Code:
Teensy (none)

Teensy3

    analog.c (2)
        analogRead - wait for conversion complete
    main.cpp (1)
        main - after each execution of loop()
    pins_teensy.c (1)
        delay - wait for delay complete
    serial1.c, serial2.c (4)
        serial_end - wait for TX complete
        serial_putchar - wait for buffer to have room (without hardware FIFO)
        serial_write - wait for buffer to have room (with hardware FIFO)
        serial_flush - wait for TX complete
    serial3.c, serial4.c, serial5.c, serial6.c, serial6_lpuart.c (3)
        serial_end - wait for TX complete
        serial_putchar - wait for buffer to have room (no hardware FIFO)
        serial_flush - wait for TX complete
    Stream.cpp
        timedRead - wait for timeout (if no data available)
        timedPeek - wait for timeout (if no data available)
    usb_flightsim/joystick/keyboard/midi/mtp/rawhid/seremu/serial/serial2/serial3
        (generally wait for send complete or timeout on read)

Teensy4 (same as Teensy3 with the exception of analogRead)

    delay.c (1)
        delay - wait for delay complete
    HardwareSerial.cpp (3, one class for HardwareSerial1-8)
        end - wait for TX complete
        flush - wait for TX complete
        write_9_bit - wait for buffer to have room
    main.cpp
        main - after each execution of loop()
    Stream.cpp
        timedRead - wait for timeout (if no data available)
        timedPeek - wait for timeout (if no data available)
    usb_flightsim/joystick/keyboard/midi/mtp/rawhid/seremu/serial/serial2/serial3
        (generally wait for send complete or timeout on read)
 
Normally, for any hardware operation that takes a long time to complete,
for example on slow interfaces such as UART, I2C, or SPI,
you use both transmit and receive interrupts, and sometimes DMA if available.

For example, say we want to transmit something over UART.
When the program starts, you allocate a buffer of, for example, 1024 bytes
(or less, depending on how much memory is available).
It could also be allocated dynamically when there is a need for it,
depending on the requirements of the specific 'product'.

1. You need to send something.
2. The buffer is filled with the data.
3. Transmit is activated, if not already running, which begins sending the first data from the buffer.
4. The transmit interrupt is activated.
(Sometimes it can be activated before the first data is sent, but the buffer needs to hold the data so the interrupt routine has something to send.)
The interrupt routine then automatically reads the buffer when the previous data has been sent,
and transmits the rest of the data in the background.
When it has finished, or a transmit error has occurred (especially for I2C),
it can set a flag so that the main program can see whether the transmit was successful.
5. The main program runs and just checks the flags above; no wait is required.
The main program can also fill the buffer (safest is to disable interrupts while doing so, to ensure 'atomic' access).
A wait would only occur if the buffer is full and the main loop then needs to wait for it to empty
(this state can be avoided by using a bigger buffer).
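
A minimal sketch of the kind of software FIFO the steps above rely on, written as portable C++ so it can be tried off-target. The class name `ByteFifo` and its methods are hypothetical; on real hardware `push()` would be called from the main program and `pop()` from the transmit interrupt, and the UART register accesses are left out:

```cpp
#include <cstddef>
#include <cstdint>

// Single-producer / single-consumer byte FIFO (ring buffer).
// One slot is kept free to tell "full" from "empty", so capacity is N - 1.
template <size_t N>
class ByteFifo {
 public:
  // Main-program side: queue one byte. Returns false if the buffer is full,
  // which is the only case where the caller would have to wait (or yield).
  bool push(uint8_t b) {
    const size_t next = (head_ + 1) % N;
    if (next == tail_) return false;
    buf_[head_] = b;
    head_ = next;
    return true;
  }

  // Interrupt side: fetch the next byte to send. Returns false when empty,
  // at which point the TX interrupt would be disabled and a flag set.
  bool pop(uint8_t &b) {
    if (head_ == tail_) return false;
    b = buf_[tail_];
    tail_ = (tail_ + 1) % N;
    return true;
  }

  bool empty() const { return head_ == tail_; }
  size_t size() const { return (head_ + N - tail_) % N; }

 private:
  volatile size_t head_ = 0;  // written only by the producer
  volatile size_t tail_ = 0;  // written only by the consumer
  uint8_t buf_[N];
};
```

Because each index is written by only one side, this particular layout can work without disabling interrupts as long as index updates are atomic on the target; that is an assumption to verify for the chip in use.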

A more advanced alternative to the peripheral-specific interrupts is DMA.
Transmits, especially for UART, can then be done without any CPU interaction,
and the interrupt normally occurs only at the end of the DMA transfer.


The ADC also has an interrupt system, so no waits are needed for the conversion to complete.
You normally wouldn't use interrupts for a simple ADC read,
but when doing continuous readings you could use ADC interrupts,
again together with a buffer,
to free the main loop from unnecessary waits.

1. Have a buffer to store, for example, 16 ADC readings.
2. Activate ADC interrupts.
3. Start an ADC read.
4. The ADC interrupt routine takes the newly read value and puts it into the buffer,
then starts a new ADC read unless the buffer is full. Two flags can be set: read complete and buffer full.
5. The main loop only reads the flags and processes the data; still no unnecessary waits are required.
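
The steps above can be sketched roughly as follows. This is a host-side simulation, not Teensy ADC code: `onAdcComplete()` stands in for the conversion-complete interrupt routine, `drain()` for the main-loop side, and all names in the hypothetical `adcdemo` namespace are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>

namespace adcdemo {

constexpr size_t kSamples = 16;       // buffer size from step 1
volatile uint16_t samples[kSamples];  // filled by the "interrupt"
volatile size_t count = 0;            // number of stored readings
volatile bool readComplete = false;   // at least one new reading is available
volatile bool bufferFull = false;     // no room left, conversions are stopped

// Would be called from the ADC interrupt with the new conversion result.
// Returns true if another conversion should be started.
bool onAdcComplete(uint16_t value) {
  if (count < kSamples) {
    samples[count] = value;
    count = count + 1;
  }
  readComplete = true;
  bufferFull = (count == kSamples);
  return !bufferFull;
}

// Main-loop side: copy out the readings and reset the flags.
// On real hardware, interrupts should be briefly disabled around this.
size_t drain(uint16_t *out) {
  const size_t n = count;
  for (size_t i = 0; i < n; ++i) out[i] = samples[i];
  count = 0;
  readComplete = false;
  bufferFull = false;
  return n;
}

}  // namespace adcdemo
```

The main loop would poll `readComplete` or `bufferFull`, call `drain()`, and restart conversions if they were stopped.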


How the main loop then takes care of multiple tasks is another chapter.

Note: when I talk about a buffer, I mean a software FIFO buffer.
 
Thanks, @Manicksan. You're right that DMA and interrupts provide other ways to avoid waiting. The question I'm asking is where calls to yield() would be useful if someone is using a cooperative RTOS. Functions related to using peripherals with DMA or completion interrupts would not have calls to yield().
 
Have you seen TeensyThreads?
It uses a real task-switching threads.yield() function.

That means you could write a "read ADC" task.
But then I thought about it once more:
no, this would still wait for the ADC conversion to complete.

Then, by having a yield() call inside analogRead's wait loop, while (!(ADCx_HS & ADC_HS_COCO0)) { yield(); },
as you want,
we can override it with

Code:
void yield() {
    threads.yield();
}
if using TeensyThreads


and here is the task
Code:
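#include <TeensyThreads.h>

volatile int doRead = 0;    // set to 1 elsewhere to request a reading
volatile int adcValue = 0;  // latest ADC result

void readADC_task() {
    while (1) {
        if (doRead) {
            doRead = 0;
            adcValue = analogRead(14); // A0
        }
        threads.yield();  // let other threads run
    }
}

void setup() {
    threads.addThread(readADC_task);
}

void loop() {
    // placeholder: request a reading here by setting doRead = 1
}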
 
Yes, that's exactly right. TeensyThreads is time-sliced preemptive, but if you set the time-slice to be very large, and always call threads.yield() before the time-slice expires, then it is cooperative. With yield() defined as you show, calls to functions in the Cores and Libraries will yield the CPU when they are waiting for something to happen.
 
I've seen that delayMicroseconds() doesn't call yield(), not even once.
I've made a version based on delay() that calls yield(), but I lowered the internal resolution to keep it working, i.e. the loop runs for 10+ microseconds at 24 MHz (2+ microseconds could work, but it's a close call).
I can use F_CPU to find the CPU speed and adapt.

Is there a minimum CPU speed for the Teensy 4.0 and 4.1? Those don't have a compile-time F_CPU.
 
@AlainD
Delays should be avoided.
They can be used on some rare occasions, such as bit-banging timing-sensitive stuff.
In other cases, use a state machine together with hardware timers,
which avoids delays completely.

The exception is threads.delay(), which actually allows other stuff to run.



"Yield the CPU":
does that mean what I think it means,
i.e. halting the CPU?
 
These conversations about yield() are so difficult because yield() is a weak symbol that's meant to be overridden when someone wants to change from the simple event callbacks we have now to something else, like a cooperative system or a preemptive RTOS. What yield() will actually do isn't a fixed known quantity.
 
I think all of the current uses of yield() in the Teensy cores are consistent with the purpose of yield() as stated in the comment in the Arduino core, to perform a cooperative task switch. If we only look for places to add calls to yield() that are also consistent with that purpose, I don't think we will introduce any new difficulty for EventResponder. If someone is using TeensyThreads in a preemptive mode, then yield() would not be calling threads.yield(). This is only relevant for cooperative task switching.
 
Teensy 4 is quite new,
and given the pandemic
there have not been enough resources
to fix things that are missing,
such as the different calls to yield().

@joepasquariello
The main loop actually does have a call to yield().
But I can now agree that it would be nice
if the calls to yield() were the same as in Teensy 3.x.

It's at least better than just staying in the loops and 'twiddling your thumbs'.
 
Yes, delayMicroseconds() should be used for very short or very precise delays, otherwise use delay().

Sometimes a delay of 1 ms is too long and 100-200 microseconds would be enough, without the need for a very precise delay. In that case I prefer to have a few calls to yield().
 
A state machine is very powerful, but if the goal is to take 3-5 readings to be able to get a median of 3 or 5, it's often overkill.
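
For the median step itself, a small sketch assuming the readings have already been collected (in a sketch they would come from analogRead(), with short delays in between). The helper name `medianOf` is hypothetical, and it assumes std::array and std::nth_element are available in the toolchain:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Return the median of an odd number of readings (e.g. 3 or 5).
// The array is taken by value, so the caller's copy is left untouched.
template <size_t N>
uint16_t medianOf(std::array<uint16_t, N> readings) {
  static_assert(N % 2 == 1, "use an odd number of readings");
  // Partially sorts so that the middle element is the one a full sort would put there.
  std::nth_element(readings.begin(), readings.begin() + N / 2, readings.end());
  return readings[N / 2];
}
```

A single outlier among three readings, or up to two among five, is discarded this way.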
 
I've cleaned up my delayMicrosecondsWithYield and added some extra test functions.
Unfortunately, things like F_CPU_ACTUAL and ARM_DWT_CYCCNT are considered non-public, so those are removed and only the original function for Teensy LC is left.
The following code was written for the Teensy LC, but also runs on 3.6 and 4.1 (and probably the others as well).
It seems even more accurate than the library delayMicroseconds() on the LC and for longer periods. For short delays (a few microseconds) the overhead of micros() is too high.

Code:
inline void delayMicrosecondsWithYieldLC(uint32_t usec1)
{
  const uint32_t start = micros();  // a call to micros() takes about 36 cycles, i.e. about 1.5 usec at 24 MHz
  // Reserve (in usec) kept for the final busy-wait: yield() is not called once fewer than
  // uSecReserve microseconds are left (about 24x that many cycles at 24 MHz).
  #if defined(F_CPU) && (F_CPU >= 48000000)
    const uint32_t uSecReserve = 15u;
  #else
    const uint32_t uSecReserve = 25u;
  #endif

  // This function is usually accurate to within 3 usec when running faster than 24 MHz, except when a yield() call takes very long.
  // For very short durations the calls to micros() add extra delay, especially at 24 MHz.
  // F_CPU_ACTUAL is a Teensy 4.0/4.1 variable, not a #define, and is for internal use (not public); F_CPU is defined on Teensy 4.1.

  if (usec1 > 0)
  {
    #if defined(F_CPU) && (F_CPU <= 24000000)  // at 24 MHz the first call to micros() takes more than 1 usec
      --usec1;
    #endif
    // Yield for as long as more than uSecReserve microseconds remain.
    while (micros() - start + uSecReserve < usec1)
    {
      yield();
    }
    // Busy-wait for the remainder.
    while (micros() - start < usec1) { }
  }
}

void Test1delay(unsigned int teller2)
{
  unsigned int microseconds;
  unsigned int totalmicros;
  unsigned int totalmicros2;
  
  microseconds = micros();
  delayMicroseconds(teller2);
  totalmicros = micros() - microseconds;
  microseconds = micros();
  delayMicrosecondsWithYieldLC(teller2);
  totalmicros2 = micros() - microseconds;
  if ((totalmicros2 < teller2) || (totalmicros2 > teller2 + 2u) )  // || (totalmicros2< (totalmicros - 3u))|| (totalmicros2> (totalmicros + 3u))
  {      
    Serial.print(teller2);
    Serial.print(':');
    Serial.print(totalmicros);
    Serial.print('_');
    Serial.print(totalmicros2);
    Serial.print(' ');
  };
};

void TestdelayMicrosecondsWithYield(void)
{
#ifndef F_CPU
  Serial.print('A');
  Serial.print(0);
#else
  Serial.print('_');
  Serial.print(F_CPU);
#endif
  Serial.print(' ');


  unsigned int microseconds;
  unsigned int totalmicros;

  microseconds = micros();
  delay(1);
  totalmicros = micros() - microseconds;
  Serial.print(totalmicros);
  Serial.print(' ');
  
  Test1delay(1000);

  for (unsigned int teller2 = 1; (teller2 <= 19); teller2 = teller2 + 1)
  {
    Test1delay(teller2);
  };

  for (unsigned int teller2 = 20; (teller2 <= 2999); teller2 = teller2 + 17)
  {
    Test1delay(teller2);
  };
  Serial.println(' ');
};

// the setup routine runs once when you press reset:
void setup() {
  // initialize serial communication at 9600 bits per second:
  Serial.begin(9600);
  pinMode(LED_BUILTIN, OUTPUT);
  delay(500); // Delay 500 ms
}

// the loop routine runs over and over again forever:
void loop() {
  digitalWriteFast(LED_BUILTIN, HIGH);
  delayMicrosecondsWithYieldLC(500);  // use the function name defined above
  digitalWriteFast(LED_BUILTIN, !digitalRead(LED_BUILTIN));
  TestdelayMicrosecondsWithYield();
  delayMicrosecondsWithYieldLC(500000);
}
 