Teensy 3.x multithreading library first release

@ftrias - not done anything beyond initial test yet that worked as noted. Added yield() to threads for them to exit after being done some simple task - in minor change to the example\print.ino.

I noticed I skewed the run/exec time of the threads shared with loop() - and it took some added code to discover how/why as it was odd behavior of the non-linear switching operation.

>> Wondering if there would be an easy way to add '#ifdef DEBUG' instrumentation that could give thread operation feedback?
like :
Code:
class ThreadInfo {
  public:
    uint32_t cyclesStart=0;  // On T_4 the CycCnt is always active - on T_3.x it currently is not - unless Audio starts it AFAIK
    uint32_t cyclesAccum=0;

When a thread is started: thread_X.cyclesStart = ARM_DWT_CYCCNT

When a thread is swapped for time or yield: thread_X.cyclesAccum += ARM_DWT_CYCCNT - thread_X.cyclesStart

Then once per second or so a print of cycles/time spent per thread would show inactive or long running threads by getting and zeroing the cyclesAccum per thread? Along the lines of what the Audio Library shows with CPU_USE.


>> Not followed the code develop or seen how time is allowed for loop() to run?

>> There is 'thread 0, which is always running' - is that just the default task switch thread?

>> Would be interesting to know how much time is spent in in loop() from the above - as it was only counting loop()/sec that told of the off thread behavior I created.


Perhaps there are better or additional ways to get feedback that might give a feel for where the time goes when something isn't clicking right? The read of the ARM_DWT_CYCCNT is a fast 'memory' read so would add minimal over head '#ifdef DEBUG'.
 
@ftrias - not done anything beyond initial test yet that worked as noted. Added yield() to threads for them to exit after being done some simple task - in minor change to the example\print.ino.

I noticed I skewed the run/exec time of the threads shared with loop() - and it took some added code to discover how/why as it was odd behavior of the non-linear switching operation.

>> Wondering if there would be an easy way to add '#ifdef DEBUG' instrumentation that could give thread operation feedback?
like :
Code:
class ThreadInfo {
  public:
    uint32_t cyclesStart=0;  // On T_4 the CycCnt is always active - on T_3.x it currently is not - unless Audio starts it AFAIK
    uint32_t cyclesAccum=0;

When a thread is started: thread_X.cyclesStart = ARM_DWT_CYCCNT

When a thread is swapped for time or yield: thread_X.cyclesAccum += ARM_DWT_CYCCNT - thread_X.cyclesStart

Then once per second or so a print of cycles/time spent per thread would show inactive or long running threads by getting and zeroing the cyclesAccum per thread? Along the lines of what the Audio Library shows with CPU_USE.


>> Not followed the code develop or seen how time is allowed for loop() to run?

>> There is 'thread 0, which is always running' - is that just the default task switch thread?

>> Would be interesting to know how much time is spent in in loop() from the above - as it was only counting loop()/sec that told of the off thread behavior I created.


Perhaps there are better or additional ways to get feedback that might give a feel for where the time goes when something isn't clicking right? The read of the ARM_DWT_CYCCNT is a fast 'memory' read so would add minimal over head '#ifdef DEBUG'.

This is an excellent idea. I committed a change along your suggestion: https://github.com/ftrias/TeensyThreads/commit/ef3ef46441ea79d4849a674325685c7cb531905e

If you uncomment '#define DEBUG' in TeensyThreads.h, you get a new function 'unsigned long Threads::getCyclesUsed(int id)'. You pass the thread id returned by addThread as in:

Code:
int id = threads.addThread(func, 1);
delay(1000);
Serial.println(threads.getCyclesUsed(id));
 
This is an excellent idea. I committed a change along your suggestion: https://github.com/ftrias/TeensyThreads/commit/ef3ef46441ea79d4849a674325685c7cb531905e

If you uncomment '#define DEBUG' in TeensyThreads.h, you get a new function 'unsigned long Threads::getCyclesUsed(int id)'. You pass the thread id returned by addThread as in:

Code:
int id = threads.addThread(func, 1);
delay(1000);
Serial.println(threads.getCyclesUsed(id));

Awesome! It seemed like it would be nice.

Note: for T_3.x's the cycle counter is not always running so DEBUG should start it ( and T_LC doesn't have one ):
Code:
  ARM_DEMCR |= ARM_DEMCR_TRCENA; // Assure Cycle Counter active
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

Note: That Accum doesn't get reset - and that adds multiple calls in sketch with stop()/start(). So I saw this as a way to make it easy on the user and less intrusive in debug code:

I get this output for initialized threads:
Code:
th#0 use=296021495	th#1 use=298854709	th#2 use=1816594	th#3 use=2

Clearly this is a hack as I hardcoded 8? And just plopped a global "uint32_t threadCycles[ 8 ];", but it works with getCyclesUsedAll() prototyped in the class in .h and in .cpp added this code
Code:
uint32_t threadCycles[ 8 ];
unsigned long *Threads::getCyclesUsedAll(void) {
  stop();
  for (int i = 0; i < MAX_THREADS; i++) {
    if (threadp[i]) { // initialized thread?
      if (threadp[current_thread]->flags == RUNNING) { // is it running?
        threadCycles[i] = threadp[i]->cyclesAccum;
        threadp[i]->cyclesAccum = 0;
        if ( threadCycles[i] <2 ) threadCycles[i] = 2; // show 2 - not run
      }
      else { // not running
        threadCycles[i] = 1;
      }
    }
    else { // not initialized
      threadCycles[i] = 0;
    }
  }
  start();
  return threadCycles;
}

Above output works for sketch to run this once per second from loop():
Code:
void thread_Activity() {
	uint32_t *threadCycles;
	threadCycles = threads.getCyclesUsedAll();
	for (int i = 0; i < 8; i++) {
		if ( threadCycles[i] > 0 )
			Serial.printf( "th#%u use=%u\t", i, threadCycles[i]);
	}
	Serial.println();
}

Problem - I was trying to add this as a THREAD to just run as a way to add a thread and it runs ONCE and then never prints again? I just edited to print each time and it stops after the loop starts?


Output - note the trailing "_TH" when the thread version prints - then it stops when loop is entered?
Can you see if I have something BAD in the sketch - or is it triggering something stopping the thread from execution once loop() printing starts?
Code:
T:\tCode\tThreads\PrintExPlus\PrintExPlus.ino Aug 20 2019 13:45:11
[B]th#0 use=200723099	th#1 use=35862	th#2 use=223	th#3 use=2	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
[/B]3160419	45376531	8039	48544989
th#0 use=285902887	th#1 use=288282873	th#2 use=1687992	th#3 use=739886826	
3274150	47041664	8334	50324148
th#0 use=296265333	th#1 use=298898904	th#2 use=1750150	th#3 use=2	
3274123	47036032	8333	50318488
th#0 use=296229789	th#1 use=298863042	th#2 use=1749938	th#3 use=2

Here is the edited example\print.ino edited and run on Teensy 4:
Code:
#include <TeensyThreads.h>

volatile uint32_t countL = 0;
volatile uint32_t count = 0;
volatile uint32_t countB = 0;

void thread_func(int inc) {
	while (1) {
		count += inc;
		if ( 0 &&  !(count % 1667) )
			threads.yield();
	}
}

void thread_funcB(int inc) {
	while (1) {
		countB += inc;
		threads.yield();
	}
}


void thread_Activity() {
	uint32_t *threadCycles;
	threadCycles = threads.getCyclesUsedAll();
	for (int i = 0; i < 8; i++) {
		if ( threadCycles[i] > 0 )
			Serial.printf( "th#%u use=%u\t", i, threadCycles[i]);
	}
	Serial.println();
}

void TH_thread_Activity() {
	static uint32_t nextShow = 0;
	static uint32_t cyclesShow = F_CPU_ACTUAL;
	volatile static uint32_t cnt = 0;
	nextShow = ARM_DWT_CYCCNT + cyclesShow;
	while (1)
	{
		cnt++;
		//if ( ARM_DWT_CYCCNT - nextShow > cyclesShow ) 
		{
			//if ( cnt > 10 ) {
			cnt = 0;
			nextShow = ARM_DWT_CYCCNT + cyclesShow;
			uint32_t *threadCycles;
			threadCycles = threads.getCyclesUsedAll();
			for (int i = 0; i < 8; i++) {
				if ( threadCycles[i] > 0 )
					Serial.printf( "th#%u use=%u\t", i, threadCycles[i]);
			}
			Serial.println( "  <TH_");
		}
		threads.yield();
	}
}

void setup() {
	Serial.begin(115200);
	while (!Serial && millis() < 4000 );
	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
	threads.setMicroTimer( 10 );
	threads.addThread(thread_func, 1);
	threads.addThread(thread_funcB, 1);
	threads.addThread(TH_thread_Activity);
}

elapsedMillis tWait = 0;
void loop() {
	countL++;
	if ( tWait >= 1000 ) {
		tWait = 0;
		Serial.print(countL);
		Serial.print("\t");
		Serial.print(count);
		Serial.print("\t");
		Serial.print(countB);
		Serial.print("\t");
		Serial.println(countB + count + countL);
		count = 0;
		countB = 0;
		countL = 0;
[B]		// thread_Activity(); [COLOR="#FF0000"]// ABOVE output was with this enabled - accts the same when commented - TH_thread_Activity() stops running?[/COLOR][/B]
	}
	//threads.delay(1000);
}

As posted the output does this with commented _Activity() line:
Code:
T:\tCode\tThreads\PrintExPlus\PrintExPlus.ino Aug 20 2019 13:52:15
th#0 use=206723925	th#1 use=35863	th#2 use=222	th#3 use=2	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_
3125290	44923620	7956	48056866
3272113	47052364	8333	50332810
3271932	47058087	8334	50338353
 
@ftrias - I haven't gotten back to this - just saw your profiling post … made me return to this as it does similar for TeensyThreads

still wondering why the task TH_thread_Activity() to dump the fails to run after loop() starts? But works when doing the same from loop() as thread_Activity() - which shows htat thread is not getting execution time?

Also the zero of the accumulated values and ALL() in one version to capture the thread run times.
 
@defragster

Was this what you were expecting to see:
Code:
C:\Users\Merli\AppData\Local\Temp\arduino_modified_sketch_349377\Print.ino Aug 24 2019 00:41:47
1	0	0	1
th#0 use=2	th#1 use=2	th#2 use=2	
3572205	47058752	8334	50639291
th#0 use=1156543385	th#1 use=298940578	th#2 use=1750151	
3571121	47061363	8333	50640817
th#0 use=296296484	th#1 use=298954699	th#2 use=1799930	
3571363	47061362	8333	50641058

If it is I just didn't put TH_thread_Activity in its own thread. Loop is thread zero already and you just need to call it as a function. so the whole sketch looks like this:
Code:
#include <TeensyThreads.h>

volatile uint32_t countL = 0;
volatile uint32_t count = 0;
volatile uint32_t countB = 0;

void thread_func(int inc) {
  while (1) {
    count += inc;
    if ( 0 &&  !(count % 1667) )
      threads.yield();
  }
}

void thread_funcB(int inc) {
  while (1) {
    countB += inc;
    threads.yield();
  }
}


void thread_Activity() {
  uint32_t *threadCycles;
  threadCycles = threads.getCyclesUsedAll();
  for (int i = 0; i < 8; i++) {
    if ( threadCycles[i] > 0 )
      Serial.printf( "th#%u use=%u\t", i, threadCycles[i]);
  }
  Serial.println();
}

void TH_thread_Activity() {
  static uint32_t nextShow = 0;
  static uint32_t cyclesShow = F_CPU_ACTUAL;
  volatile static uint32_t cnt = 0;
  nextShow = ARM_DWT_CYCCNT + cyclesShow;
  while (1)
  {
    cnt++;
    //if ( ARM_DWT_CYCCNT - nextShow > cyclesShow ) 
    {
      //if ( cnt > 10 ) {
      cnt = 0;
      nextShow = ARM_DWT_CYCCNT + cyclesShow;
      uint32_t *threadCycles;
      threadCycles = threads.getCyclesUsedAll();
      for (int i = 0; i < 8; i++) {
        if ( threadCycles[i] > 0 )
          Serial.printf( "th#%u use=%u\t\t", i, threadCycles[i]);
      }
      Serial.println( "  <TH_");
    }
    threads.yield();
  }
}

void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  threads.setMicroTimer( 10 );
  threads.addThread(thread_func, 1);
  threads.addThread(thread_funcB, 1);
  //threads.addThread(TH_thread_Activity);
}

elapsedMillis tWait = 0;
void loop() {
  countL++;
  if ( tWait >= 1000 ) {
    tWait = 0;
    Serial.print(countL);
    Serial.print("\t");
    Serial.print(count);
    Serial.print("\t");
    Serial.print(countB);
    Serial.print("\t");
    Serial.println(countB + count + countL);
    count = 0;
    countB = 0;
    countL = 0;
    thread_Activity(); // ABOVE output was with this enabled - accts the same when commented - TH_thread_Activity() stops running?
  }
  //threads.delay(1000);
}
 
@defragster

Was this what you were expecting to see:

...

I wanted there to be an active running thread printout it out - not in loop but lines like this from TH_thread_Activity():

Code:
th#0 use=1	th#1 use=1	th#2 use=1	th#3 use=1	  <TH_

The comments in TH_thread_Activity() where trials to get it to wait and not print each time - and then dropping them showed that thread dying as loop() starts it seems based on posted output.

After those efforts I quit for review here in case I was doing something wrong … got some more ideas to try ...
 
Problem - I was trying to add this as a THREAD to just run as a way to add a thread and it runs ONCE and then never prints again? I just edited to print each time and it stops after the loop starts?

Sounds to me that Serial.print is not thread save.
(I would be surprised if there are NO global variables in Serial/Stream/Print)
 
@ftrias - let me know if you like the post #403 threads.getCyclesUsedAll() version clearing ACCUM data better and can put it in [my hack is not clean with #8 hardcoded] - it seems @mjs513 did ... but he's been merging my code for over a year now :)

@mjs513 - I see you answered post #401 question assumed but couldn't decipher in the code ::
Loop is thread zero already

QUESTION: Is loop() called on the same schedule as other threads or between each 'other' thread call?

In the end I don't have ANY idea what was FIXED - but it is running as desired now! The three things THOUGHT to be problems were put back and it runs fine, dual thread Serial.print() and then loop() th#0 and th#1 each taking about 50% of CPU cycles ????

I began to think it was using Serial.print in loop() and another thread … so I switched to sprint( showBuf) in TH_thread_Activity(), I got a valid thread buffer to show in loop(). And finally got it working like that - not sure of the real diff. I added two more threads and different task chores. And code below prints the sprint( showBuf ) in both threads - where func() name follows the string.

MAYBE it was using a series of Serial.print() in the TH_() versus the sprintf to buffer and a single prints that is much faster to Serial? Though loop() uses a series as well?

This is the desired output - using sprint( showBuf ) from TH_ AND loop() as labeled at line end - where the showBuf ACCUM data is calculated each F_CPU_ACTUAL cycles only in TH_thread_Activity():
Code:
T:\tCode\tThreads\PrintExPlus\PrintExPlus.ino Aug 24 2019 01:01:38
th#0 use=204340286 [0.340567]	th#1 use=1311 [0.000002]	th#2 use=1425 [0.000002]	th#3 use=2 [0.000000]	th#4 use=2 [0.000000]	th#5 use=2 [0.000000]	[B]TH_thread_Activity[/B]

th#0 use=204340286 [0.340567]	th#1 use=1311 [0.000002]	th#2 use=1425 [0.000002]	th#3 use=2 [0.000000]	th#4 use=2 [0.000000]	th#5 use=2 [0.000000]	[B]loop[/B]
11	8743896	138792	277584	555168	8882699
th#0 use=29168864 [0.048615]	th#1 use=178458391 [0.297431]	th#2 use=197037369 [0.328396]	th#3 use=31832225 [0.053054]	th#4 use=29436150 [0.049060]	th#5 use=30812309 [0.051354]	[B]TH_thread_Activity[/B]

th#0 use=29168864 [0.048615]	th#1 use=178458391 [0.297431]	th#2 use=197037369 [0.328396]	th#3 use=31832225 [0.053054]	th#4 use=29436150 [0.049060]	th#5 use=30812309 [0.051354]	loop
10	8743770	138790	555164	1110328	8882570
th#0 use=29169040 [0.048615]	th#1 use=178454480 [0.297424]	th#2 use=197038536 [0.328398]	th#3 use=31834063 [0.053057]	th#4 use=29438027 [0.049063]	th#5 use=30810788 [0.051351]	[B]TH_thread_Activity[/B]

th#0 use=29169040 [0.048615]	th#1 use=178454480 [0.297424]	th#2 use=197038536 [0.328398]	th#3 use=31834063 [0.053057]	th#4 use=29438027 [0.049063]	th#5 use=30810788 [0.051351]	loop
10	8743896	138792	832748	1665496	8882698
// ...

This is the working code where the noted edits made to support threads.getCyclesUsedAll() are in place:
Code:
#include <TeensyThreads.h>

volatile uint32_t countL = 0;
volatile uint32_t count = 0;
volatile uint32_t countB = 0;
volatile uint32_t countC = 0;
volatile uint32_t countD = 0;
volatile uint32_t showNow = 0;

void thread_func(int inc) {
	while (1) {
		count += inc;
		if ( !(count % 63) )
			threads.yield();
	}
}

void thread_funcB(int inc) {
	while (1) {
		countB += inc;
		delayMicroseconds(2);
		threads.yield();
	}
}

void thread_funcC(int inc) {
	while (1) {
		countC += inc;
		countC += inc;
		threads.yield();
	}
}

void thread_funcD(int inc) {
	while (1) {
		countD += inc;
		countD += inc;
		countD += inc;
		countD += inc;
		threads.yield();
	}
}

char showBuf[200];
void TH_thread_Activity() {
	uint32_t *threadCycles;
	int jj;
	static uint32_t nextShow = 0;
	static uint32_t cyclesShow = F_CPU_ACTUAL;
	while (1) {
		if ( nextShow - ARM_DWT_CYCCNT > cyclesShow ) {
			nextShow = ARM_DWT_CYCCNT + cyclesShow;
			jj = 0;
			threadCycles = threads.getCyclesUsedAll();
			for (int i = 0; i < 8; i++) {
				if ( threadCycles[i] > 0 ) {
					jj += sprintf( &showBuf[jj], "th#%d use=%lu [%f]\t", i, threadCycles[i], threadCycles[i] / (float)cyclesShow);
				}
			}
			Serial.print( showBuf );
			Serial.println( __func__ );
			showNow = 2;
		}
		threads.yield();
	}
}

void setup() {
	Serial.begin(115200);
	while (!Serial && millis() < 4000 );
	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
	threads.setMicroTimer( 100 );
	showBuf[0] = 0;
	threads.addThread(thread_func, 1);
	threads.addThread(thread_funcB, 1);
	threads.addThread(TH_thread_Activity);
	threads.addThread(thread_funcC, 1);
	threads.addThread(thread_funcD, 1);
}

elapsedMillis tWait = 0;
void loop() {
	countL++;
	if ( 2 == showNow ) {
		Serial.println( );
		Serial.print( showBuf );
		Serial.println( __func__ );
		showNow = 0;
	}
	if ( tWait >= 1000 ) {
		tWait = 0;
		Serial.print(countL);
		Serial.print("\t");
		Serial.print(count);
		Serial.print("\t");
		Serial.print(countB);
		Serial.print("\t");
		Serial.print(countC);
		Serial.print("\t");
		Serial.print(countD);
		Serial.print("\t");
		Serial.println(countB + count + countL);
		count = 0;
		countB = 0;
		countL = 0;
	}
	threads.delay(100);
}
 
Sounds to me that Serial.print is not thread save.
(I would be surprised if there are NO global variables in Serial/Stream/Print)

Just saw this post in the mix … it is working with last post - Not sure I cleaned up anything I did wrong by accident - but I did for sure 'reduce' the chance for Serial.print collision - and altered the timing if that is what it was. Though two threads are now printing … will see if it is running after some Zzzz's.

Very odd because it would ALWAYS never run the Activity() thread after loop thread did it's first print. I saw the loop was entering up to the 1 sec timeout and then printing - then no more Activity() thread. In the process I made things WORSE before they seemed to go right where it was going dead needing Button to upload - but before all threads but the Activity thread kept being active ????

May have been a combo of abuse - I had Micros Timing very fast at threads.setMicroTimer( 10 ) and loop() had no .yield() - now bigger slices as well at threads.setMicroTimer( 100 ); and loop() only 10 Hz not some millions.

This is the kind of feedback the 'Cycles Per Thread' can help to correct errors where linear logic can fail on time slicing not knowing where the problem is?

If would be nice perhaps if the thread name could optionally be passed - the compiler gives a pointer to string. That could also help in debugging:
Code:
void thread_funcC(int inc) {
        [B]thread.pNameSet( __func__);[/B]
	while (1) {
		countC += inc;
		countC += inc;
		threads.yield();
	}
}

It could start NULL and then be set as above with thread.pNameSet( __func__); for storage. Then be shown with thread.nameShow( threadId );

Though not sure how threadId is known? On dual core ESP32 there is a function that returns what CORE# the function is running on?

@ftrias would it make sense to do something like this so it could be used at compile time?:
Code:
  #define DEF_MAX_THREADS 8
  static const int MAX_THREADS = DEF_MAX_THREADS;

// for use like abused in the posted unsigned long *Threads::getCyclesUsedAll(void) {
uint32_t threadCycles[ DEF_MAX_THREADS ];
 
@ftrias - I haven't gotten back to this - just saw your profiling post … made me return to this as it does similar for TeensyThreads

still wondering why the task TH_thread_Activity() to dump the fails to run after loop() starts? But works when doing the same from loop() as thread_Activity() - which shows htat thread is not getting execution time?

Also the zero of the accumulated values and ALL() in one version to capture the thread run times.

I think it would still be useful to see per-thread CPU usage, but profiling is probably more useful.

FYI, I just created a new TeensyGProf that's even better than TeensyProf. See https://github.com/ftrias/TeensyGProf

On the forum: https://forum.pjrc.com/threads/57333-Profiling-using-gprof-on-Teensy-4?p=213512#post213512
 
I think it would still be useful to see per-thread CPU usage, but profiling is probably more useful.

FYI, I just created a new TeensyGProf that's even better than TeensyProf. See https://github.com/ftrias/TeensyGProf

On the forum: https://forum.pjrc.com/threads/57333-Profiling-using-gprof-on-Teensy-4?p=213512#post213512

Won't the profiling at 1ms intervals will miss a lot on 600 MHz T4? It does a great deal in 600,000 cycles :) It can cycle loop() 5 to 20 Million times in a second.

And when it comes to THREADS watching each thread switch timing is exact to the CPU CYCLE per thread and will show when a thread is BROKEN either ALL ON or ALL OFF.

For TThreads - the per thread seems critical as noted when going from linear coding to Time Slicing it is very easy to get lost/broken - of course then the usage report may get lost too ...
 
Won't the profiling at 1ms intervals will miss a lot on 600 MHz T4? It does a great deal in 600,000 cycles :) It can cycle loop() 5 to 20 Million times in a second.

And when it comes to THREADS watching each thread switch timing is exact to the CPU CYCLE per thread and will show when a thread is BROKEN either ALL ON or ALL OFF.

For TThreads - the per thread seems critical as noted when going from linear coding to Time Slicing it is very easy to get lost/broken - of course then the usage report may get lost too ...

I suppose we could test more frequently, but it may not make a difference. gprof is a sampling profiler so it relies on statistical distributions. If you run it for 1 minutes (60000 samples), it will catch the biggest offenders at the risk of missing out on small fry. For more details on why this works see http://wwwcdf.pd.infn.it/localdoc/gprof.pdf
 
I suppose we could test more frequently, but it may not make a difference. gprof is a sampling profiler so it relies on statistical distributions. If you run it for 1 minutes (60000 samples), it will catch the biggest offenders at the risk of missing out on small fry. For more details on why this works see http://wwwcdf.pd.infn.it/localdoc/gprof.pdf

No doubt it will be statistically relevant and helpful in most cases. I didn't get as far as opening the source to see how it gathers the samples - but I should. But in the case of TThreads and forced time slices this info tells a great deal of general info more than random samples - especially where each thread has a known task set - and when trying to understand where it has gone wrong on the upper level.
Code:
thread_func is th#1:
thread_funcB is th#2:
TH_thread_Activity is th#3:
thread_funcC is th#4:
thread_funcD is th#5:

th#0 use=207934324 [0.346557]	th#1 use=1303 [0.000002]	th#2 use=1417 [0.000002]	th#3 use=2 [0.000000]	th#4 use=2 [0.000000]	th#5 use=2 [0.000000]	loop

Since I added the sprintf() on the th#3 there won't be contention for Serial.print with th#0/loop that does the print.

Just did a quick test and for sure there is contention when a subthread prints the loop() thread goes mute on Serial.print. Need to scroll back and find the note @mjs513 referred to - which is why I came here ...
 
TeensyThreads task switcher Overhead measure: 16.8% ... 100M of the 600M cycles with my current sketch and 6 total threads.

I added a DEBUG Accum of CycCnt's complementing the thread timing - that is when a thread is off the clock I put the Engine on the clock until the next thread is clocked in:
Code:
[B][COLOR="#FF0000"]th_Eng use=[U]101003188[/U] [0.168339][/COLOR][/B]	th#0 use=28416306 [0.047361]	th#1 use=175960630 [0.293268]	th#2 use=194081934 [0.323470]	th#3 use=30918336 [0.051531]
	th#4 use=28799936 [0.048000]	th#5 use=30123054 [0.050205]	[B]thSum=589303384 [0.982172][/B]	loop

Most of the tasks at hand quickly TH.yield() so the switching is much faster than normal time slice sharing no matter the clock switch rate so that 17% overhead is worse than full or longer slices would be. But the overhead on switching does look significant.

In the TH_thread_Activity() I created also started counting the SUM of measured clock cycles for each task. That comes out to 98.2% so there is no double counting. I just overlapped the stop/start CYC_CNT on the one end and it jumped to 99.44% from 98.2% so the number of task switches and the couple of ticks involved in the tracking account for that 'loss' it seems.

Just added count of threadEngineSwitches showing there are 827,466 - with each switch taking 123.67 cycles per switch:
Code:
[B][COLOR="#FF0000"]th_Eng use=102330882 [0.170551] w/SW#=827466 [Cyc/=123.67][/COLOR][/B]	th#0 use=28750668 [0.047918]	th#1 use=176506911 [0.294178]	th#2 use=195159196 [0.325265]	th#3 use=32442995 [0.054072]	th#4 use=29560783 [0.049268]	th#5 use=31115592 [0.051859]	thSum=595867027 [0.993112]	loop

I took out the PRIORITY thread SEARCH CODE loop and the overhead drops over 66% from 17% to 5.2% - which yields 14+% more task switches per second:
Code:
[COLOR="#FF0000"][B]th_Eng use=31468513 [0.052448] w/SW#=947844 [Cyc/=33.20][/B][/COLOR]	th#0 use=32741061 [0.054568]	th#1 use=202109210 [0.336849]	th#2 use=223228089 [0.372047]	th#3 use=37719772 [0.062866]	th#4 use=34320755 [0.057201]	th#5 use=35571489 [0.059286]	thSum=597158889 [0.995265]	loop

Here is the .h and .cpp file as used including my current sketch with it in use in TH_thread_Activity():
Code:
[ATTACH]17312._xfImport[/ATTACH]
 
<< IGNORE - RUNNING is not what assumed>.

@ftrias / @mjs513 / ... ::
I just edited the PRIORITY test loop and I get this - much improved at 7% over 17%:
Code:
th_Eng use=[B][COLOR="#FF0000"]40984570 [0.068308] w/SW#=932166 [Cyc/=43.97][/COLOR][/B]	th#0 use=32500477 [0.054167]	th#1 use=199464888 [0.332441]	th#2 use=220106337 [0.366844]	th#3 use=36990595 [0.061651]	th#4 use=33562497 [0.055937]	th#5 use=35459571 [0.059099]	thSum=599068935 [0.998448]	loop

Is this ::
Code:
  if (threadp[current_thread]->flags == RUNNING) { // is it running?
    if (threadp[current_thread]->priority) {
      priority_thread = current_thread;
      currentCount = threadp[current_thread]->ticks; // priority
      threadp[current_thread]->priority = 0; [COLOR="#FF0000"]// SHOULD THIS be  -=1; ????[/COLOR]
    }
  }

Given that it only cares about the CURRENT_THREAD ---- Assuming current_thread is initialized :) - - - IS this the functional equivalent of this? ::
Code:
  for (int i = 0; i < MAX_THREADS; i++) {
    if (threadp[i]) { // initialized thread?
      [COLOR="#FF0000"]if (threadp[current_thread]->flags == RUNNING) { // is it running?[/COLOR]
        if (threadp[i]->priority) {
          current_thread = i;
          priority_thread = i;
          currentCount = threadp[i]->ticks; // priority
          threadp[i]->priority = 0;
          break;
        }
      }
    }
  }

Of course - asking this seems redundant :: if (threadp[current_thread]->flags == RUNNING) { // is it running?
 
Last edited:
Post #416 didn't fix the time spent in the thread scheduler where p#415 showed it consuming ~17% of Teensy 4 execution time.

Question: to make the scheduler more efficient would it be possible to maintain a simpler list of 'next thread' and 'Priority thread' to get it started? This would require anything that alters the 'list' to update/maintain it by adding or removing the threads when their state was altered. The commands that could do that would be:
>> Kill, Suspend, Restart, addThread. setState

Also noted in p#416 was when priority was set to 1 or higher, current code always goes to ZERO taking it off the priority list - should it just decrement by one so if 2 or higher it would continue as a priority until is reduces to Zero?
Code:
        if (threadp[i]->priority) {
//…
          threadp[i]->priority = 0; //  should this be :: threadp[i]->priority -= 1;
 
Post #416 didn't fix the time spent in the thread scheduler where p#415 showed it consuming ~17% of Teensy 4 execution time.

Question: to make the scheduler more efficient would it be possible to maintain a simpler list of 'next thread' and 'Priority thread' to get it started? This would require anything that alters the 'list' to update/maintain it by adding or removing the threads when their state was altered. The commands that could do that would be:
>> Kill, Suspend, Restart, addThread. setState

Also noted in p#416 was when priority was set to 1 or higher, current code always goes to ZERO taking it off the priority list - should it just decrement by one so if 2 or higher it would continue as a priority until is reduces to Zero?
Code:
        if (threadp[i]->priority) {
//…
          threadp[i]->priority = 0; //  should this be :: threadp[i]->priority -= 1;

Looking at it now, it does seem like the priority thread support is not implemented very efficiently. If I recall, I added it at a later time. It was really just a way to give a boost to threads that had been suspended via a lock. Maybe it's best to just get rid of priorities. It doesn't make any practical difference and the current code is pretty inefficient.
 
I have question about EEPROM Write action and thread switching. I know than eeprom write action can be critique and can't be interrupt during proceed. What's append if multi thread switch context during those operation ? Using mutex not seem a good solution for me because their don't avoid switching context. And like EEPROM write cycle is low level operation that can't be interrupt, switching context can crash it. Perhaps one thread Stop() before and thread Start() after can avoid this ? Is-it a good solution, or is there another ?
 
Teensy 4 timeslice 5 ms? bug or feature?

With scope I notice Teensy 3* has thread timeslice of 10 ms, but Teensy 4 has 5 ms timeslice?

I think the problem is that the T4 GPT isr will fire twice unless you have asm volatile ("dsb"); at the end of the isr.

Code:
static void __attribute((naked, noinline)) gpt1_isr() {
  GPT1_SR |= GPT_SR_OF1;  // clear set bit
  __asm volatile("dsb");
  __asm volatile("b context_switch");
}
 
Teensy 4 timeslice 5 ms? bug or feature?

With scope I notice Teensy 3* has thread timeslice of 10 ms, but Teensy 4 has 5 ms timeslice?

I think the problem is that the T4 GPT isr will fire twice unless you have asm volatile ("dsb"); at the end of the isr.

Code:
static void __attribute((naked, noinline)) gpt1_isr() {
  GPT1_SR |= GPT_SR_OF1;  // clear set bit
  __asm volatile("dsb");
  __asm volatile("b context_switch");
}

Great catch! I have committed your suggested change to github.
 
I have gone through the complete thread to see if anyone has used this library with the Encoder library but I saw nothing. So my question is, does anyone think that it is possible to use the two libraries together. I have one rotary encoder with 2048 steps/rev and one linear scale to read. Then the results will end up on a 7.0" 4D display. Similar to one used by someone on the tread.
I am thinking of using a linked list to pass results forwards and backwards between the display and the other functions.
 
Need some assistance here....

I'm building a CNC drill press to automatically position a table under the drill bit so I can drill PCB's automatically and accurately. I have 3 steppers X-Y-Z. All is working just fine but I was hoping to have a clock at the top of my display that displays real time. I created a ProcessTime function for a thread. I only have 1 thread and this ProcessTime is called. ProcessTime basicall gets millis() and converts to hh:mm:ss, i have a delay in ProcessTime so it only updates the display time every second.

Threading works, except during the very brief time update, my stepper routine seems to get paused (you can hear and see the stepper stop). I've tried different delays in ProcessTime, but nothing.

Is threading supposed to create a totally independent thread where it's time will not affect time in the main program?

Just started playing with this, so i'm definitely a novice with threading.

UPDATE here's a snippet of my code

requires
1) a stepper
2) the A4988 drive board
3) ili9341 display
4) teensy 3.2


Code:
#include <Arduino.h>
#include <ILI9341_t3.h>           // fast display driver lib
#include "TeensyThreads.h"
#include <font_Arial.h>             // custom fonts that ships with ILI9341_t3.h

#define FONT_BODY Arial_24
#define TFT_DC 9       // DC pin on LCD
#define TFT_CS 10      // chip select pin on LCD
#define LED_PIN A9     // lcd pin to control brightness
#define SD_CS A1        // chop select  pin on sd card
#define T_CS  A7
#define T_IRQ  A8

#define STEP_DELAY  1100
#define PULSE_DELAY  1

#define ZMOTOR_PIN1 8
#define ZMOTOR_PIN2 A3
#define ZMOTOR_PIN3 A2
#define ZMOTOR_PIN4 A0

#define XMOTOR_STEP A5
#define XMOTOR_DIR A6
#define YMOTOR_STEP 1
#define YMOTOR_DIR 0
#define YMOTOR_ENABLE 2


int h, m, s;
char str[12];
volatile unsigned long curtime;
volatile unsigned long JogTime;

ILI9341_t3 Display = ILI9341_t3(TFT_CS, TFT_DC, 240, 320);

void setup() {

  Serial.begin(38400);
  threads.addThread(ProcessTime);

  pinMode(SD_CS , OUTPUT);
  pinMode(LED_PIN, OUTPUT);
  pinMode(TFT_CS, OUTPUT);

  pinMode(XMOTOR_STEP, OUTPUT);
  pinMode(XMOTOR_DIR, OUTPUT);

  Display.begin();
  delay(100);

  Display.fillScreen(0);
  Display.setRotation(3);

  analogWrite(LED_PIN, 255);

  curtime = millis();

  // when steppers are running you can hear them pause for like a ms ever second

  MoveX(3000);
  MoveX(-3000);

}

void loop() {

}

void ProcessTime() {

  while (1) {
    threads.delay(1000);
    JogTime = (millis() - curtime) / 1000;
    h = (JogTime / 3600);
    m = (JogTime - (3600 * h)) / 60;
    s = (JogTime - (3600 * h) - (m * 60));
    sprintf(str, "%02d:%02d:%02d", h, m, s);
    Display.setFont(FONT_BODY);
    Display.setTextColor(50000, 0);
    Display.fillRect(100, 50, 200, 50, 0);
    Display.setCursor(100, 50);
    Display.print(str);
    threads.yield();
  }
}


int MoveX (int Steps) {

  int i = 0;

  if (Steps > 0) {

    digitalWrite(XMOTOR_DIR, HIGH);

    for (i = 0; i < Steps; i++) {

      digitalWrite(XMOTOR_STEP, HIGH);
      delayMicroseconds(100);
      digitalWrite(XMOTOR_STEP, LOW);
      delayMicroseconds(2500);


    }
  }

  else if (Steps < 0) {
    Steps = -Steps;

    digitalWrite(XMOTOR_DIR, LOW);

    for (i = 0; i < Steps; i++) {

      digitalWrite(XMOTOR_STEP, HIGH);
      delayMicroseconds(100);
      digitalWrite(XMOTOR_STEP, LOW);
      delayMicroseconds(1500);

    }
  }
  return i;
}

Thanks in advance
 
Last edited:
Need some assistance here....

I'm building a CNC drill press to automatically position a table under the drill bit so I can drill PCB's automatically and accurately. I have 3 steppers X-Y-Z. All is working just fine but I was hoping to have a clock at the top of my display that displays real time. I created a ProcessTime function for a thread. I only have 1 thread and this ProcessTime is called. ProcessTime basicall gets millis() and converts to hh:mm:ss, i have a delay in ProcessTime so it only updates the display time every second.

Threading works, except during the very brief time update, my stepper routine seems to get paused (you can hear and see the stepper stop). I've tried different delays in ProcessTime, but nothing.

Is threading supposed to create a totally independent thread where it's time will not affect time in the main program?

Just started playing with this, so i'm definitely a novice with threading.

UPDATE here's a snippet of my code

requires
1) a stepper
2) the A4988 drive board
3) ili9341 display
4) teensy 3.2


Code:
#include <Arduino.h>
#include <ILI9341_t3.h>           // fast display driver lib
#include "TeensyThreads.h"
#include <font_Arial.h>             // custom fonts that ships with ILI9341_t3.h

#define FONT_BODY Arial_24
#define TFT_DC 9       // DC pin on LCD
#define TFT_CS 10      // chip select pin on LCD
#define LED_PIN A9     // lcd pin to control brightness
#define SD_CS A1        // chop select  pin on sd card
#define T_CS  A7
#define T_IRQ  A8

#define STEP_DELAY  1100
#define PULSE_DELAY  1

#define ZMOTOR_PIN1 8
#define ZMOTOR_PIN2 A3
#define ZMOTOR_PIN3 A2
#define ZMOTOR_PIN4 A0

#define XMOTOR_STEP A5
#define XMOTOR_DIR A6
#define YMOTOR_STEP 1
#define YMOTOR_DIR 0
#define YMOTOR_ENABLE 2


int h, m, s;
char str[12];
volatile unsigned long curtime;
volatile unsigned long JogTime;

ILI9341_t3 Display = ILI9341_t3(TFT_CS, TFT_DC, 240, 320);

void setup() {

  Serial.begin(38400);
  threads.addThread(ProcessTime);

  pinMode(SD_CS , OUTPUT);
  pinMode(LED_PIN, OUTPUT);
  pinMode(TFT_CS, OUTPUT);

  pinMode(XMOTOR_STEP, OUTPUT);
  pinMode(XMOTOR_DIR, OUTPUT);

  Display.begin();
  delay(100);

  Display.fillScreen(0);
  Display.setRotation(3);

  analogWrite(LED_PIN, 255);

  curtime = millis();

  // when steppers are running you can hear them pause for like a ms ever second

  MoveX(3000);
  MoveX(-3000);

}

void loop() {

}

void ProcessTime() {

  while (1) {
    threads.delay(1000);
    JogTime = (millis() - curtime) / 1000;
    h = (JogTime / 3600);
    m = (JogTime - (3600 * h)) / 60;
    s = (JogTime - (3600 * h) - (m * 60));
    sprintf(str, "%02d:%02d:%02d", h, m, s);
    Display.setFont(FONT_BODY);
    Display.setTextColor(50000, 0);
    Display.fillRect(100, 50, 200, 50, 0);
    Display.setCursor(100, 50);
    Display.print(str);
    threads.yield();
  }
}


int MoveX (int Steps) {

  int i = 0;

  if (Steps > 0) {

    digitalWrite(XMOTOR_DIR, HIGH);

    for (i = 0; i < Steps; i++) {

      digitalWrite(XMOTOR_STEP, HIGH);
      delayMicroseconds(100);
      digitalWrite(XMOTOR_STEP, LOW);
      delayMicroseconds(2500);


    }
  }

  else if (Steps < 0) {
    Steps = -Steps;

    digitalWrite(XMOTOR_DIR, LOW);

    for (i = 0; i < Steps; i++) {

      digitalWrite(XMOTOR_STEP, HIGH);
      delayMicroseconds(100);
      digitalWrite(XMOTOR_STEP, LOW);
      delayMicroseconds(1500);

    }
  }
  return i;
}

Thanks in advance

Use intervalTimer
 
Back
Top