TeensyTimerTool

I should read more about c++ and use it..
Yes, but beware, looking at the STL code with unprotected eyes may cause serious injury as they say :))
 
By the way it really is a great lib you put together. One question I have is if there is a way for it to know if a timer is already in use by another library so the user knows it has to be fixed?
Thanks!

Unfortunately I didn't find a way to check if a timer module is in use. If you have some idea how to identify a used timer I could try to implement it. But at least I tried to only touch those timer modules which are actually requested. So, if you don't use a channel from TMR1 it will leave the module alone.

There is a (currently undocumented) error callback which will inform users if they run out of timer channels. If you for example do something like

Code:
Timer t1(TMR1), t2(TMR1), t3(TMR1), t4(TMR1), t5(TMR1);

The error function will be called on t5 with a corresponding error code. But this feature is not yet fully functional.
 
Wow, the c++'atomic' header is a good find - I should read more about c++ and use it.. :)

It uses ldrex/strex , too (see the disassembly) - but is easier to use.
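For anyone following along, here is a minimal sketch of what the 'atomic' header gives you for a counter shared between loop() and an ISR. The names loopCount, countOnce and takeCount are made up for illustration, and the relaxed memory order is an assumption that is only reasonable for a simple statistics counter like this one.

```cpp
#include <atomic>
#include <cstdint>

// Counter shared between the main loop and an interrupt handler.
std::atomic<uint32_t> loopCount{0};

// Main-loop side: one indivisible read-modify-write, so a reset done by
// the interrupt handler cannot be overwritten by a stale value.
inline void countOnce() {
    loopCount.fetch_add(1, std::memory_order_relaxed);
}

// Interrupt side: fetch the current count and zero it in a single step.
inline uint32_t takeCount() {
    return loopCount.exchange(0, std::memory_order_relaxed);
}
```

loop() would call countOnce() and the timer callback would call takeCount(). How the compiler implements these calls on a given core is best checked in the disassembly, as mentioned above.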

Interesting. The "ldrex/strex" was used in micros() because we needed to protect the read of TWO DWORD values. Test code made likely some millions of repeat calls, and without it the millisecond timer _isr() showed conflict; when it was added the same test ran with no signs of trouble - but that was in a simple context. The micros() only reads ARM_CYCCNT and two values, then does math with millis() values - no _isr() or interrupt effort to compute that.

Going back to the "ldrex/strex" in the sample with TeensyTimerTool code ( versus the manual timer code used prior ) - and the count was changing properly as opposed to ever increasing - not sure what that code was doing?

However { with dsb in _isr() } it was not always working properly and on occasion the non atomic increment came away on top of a non-zeroed value with TMR1, but never seen when using GPT1 timer? { Perhaps the GPT1 since us does not slip happens to be in sync such that the _isr() clears the inc++ ** SEE BELOW }
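One way the doubled count could arise, sketched as a host-side simulation (not Teensy code): a plain loopCnt++ is a load, an add and a store, and if the ISR zeroes the variable between the load and the store, the store puts the old value back. The struct and function names are hypothetical, and that this is what happens on the hardware is an assumption.

```cpp
#include <cstdint>

struct Counters {
    uint32_t loopCnt = 0;     // incremented by the main loop
    uint32_t loopCntIsr = 0;  // snapshot taken by the interrupt handler
};

// What the timer callback does: snapshot the count, then reset it.
inline void simulatedIsr(Counters& c) {
    c.loopCntIsr = c.loopCnt;
    c.loopCnt = 0;
}

// loopCnt++ split into its three steps, with the interrupt optionally
// firing between the load and the store.
inline void incrementWithOptionalIsr(Counters& c, bool isrFiresMidway) {
    uint32_t tmp = c.loopCnt;           // load
    if (isrFiresMidway) simulatedIsr(c);
    tmp = tmp + 1;                      // add
    c.loopCnt = tmp;                    // store: overwrites the ISR's zero
}
```

After 100 normal increments and one interrupted increment, the ISR has seen 100 but loopCnt holds 101 instead of 1, so the next interval reports roughly double the usual count.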

Also as noted before the GPT1 timer hits a static microsecond last digit when printed, where both TMR1 and TCK versions slip with an increasing displayed microseconds.

This is with a 25,000 microsecond timer as before. GPT1 shows 135K counts per interval and TMR1 shows 140K counts (but exhibits doubled non-atomic counting). The TCK counter without "ldrex/strex" runs fine at 118K with the 1us slip about each 7 seconds.

This is mostly right - but shows a flaw in ldrex/strex as used? As noted ATOMIC adds overhead - more than raw ldrex/strex it seems - and the ldrex/strex when it works is good for more than a single atomic variable when needed.

**Writing note above I added a counter to the ldrex/strex loop to see how often it detects _isr() interruption. I always WANTED to do that during the micros() coding but never did - but never saw results in testing that suggested it was needed.
>> INDEED - using the GPT1 timer with TTT the _isr never shows signs of interrupt during the 'loopCnt++;', so it is not testing the ldrex/strex here.
>> However the TMR1 code regularly has an _isr() during the loopCnt++ in the ldrex/strex do/while. About 20 of those in 10 seconds with 3 of those resulting in doubled counts.

This output shows that - three work and two do not:
Code:
136315	8750038	loopstrex=1
136308	8775039
136315	8800039	loopstrex=1
	272624	8825040
136311	8850039
136315	8875039	loopstrex=1
	272625	8900040
136311	8925039
136315	8950039
136316	8975039
136316	9000039
136315	9025040	loopstrex=1
136308	9050040
136315	9075040
136316	9100040
136316	9125040
136315	9150040	loopstrex=1
136308	9175040

Using this code:
Code:
#include "arm_math.h"	// micros() synchronization
uint32_t callback_safe_update;
volatile uint32_t loopCnt;
#include "TeensyTimerTool.h"
using namespace TeensyTimerTool;
volatile uint32_t loopCntIsr = 0;
uint32_t loopstrex = 0;

void isr()
{
	loopCntIsr = loopCnt;
	//Serial.print(".");
	loopCnt = 0;
	asm volatile("dsb"); // xxx

}

Timer t1 (TMR1);
//Timer t1 (GPT1); // stable us
//Timer t1 (TCK);

void setup()
{
	t1.beginPeriodic(isr, 25000);
}
void loop()
{
#if 0
	loopCnt++;
#else
	do {
		__LDREXW(&callback_safe_update);
		loopCnt++;
		loopstrex++;
	} while ( __STREXW(1, &callback_safe_update));
#endif
	if ( loopCntIsr ) {
		if ( loopCntIsr >= 160000 ) Serial.printf("\t");
		Serial.printf("%d\t%d", loopCntIsr, micros());
		if ( loopstrex > loopCntIsr ) Serial.printf("\tloopstrex=%d",loopstrex-loopCntIsr);
		Serial.println();
		loopstrex=0;
		loopCntIsr = 0;
	}
}
 
I FIGURED IT OUT! I considered this doing the micros() edit - and it just came back to me as I put a log in the woodstove ...

When doing micros() I didn't have any side effects reading two DWORDS of the sys_tick status - but in crafting that code it occurred to me that if the ldrex/strex loop cycled to completion - found an interrupt had fired - and then repeated, it could have unwanted side effects if writes occurred in that block, so all calcs are done outside.
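The "only read inside the loop, calculate outside" idea can be sketched on a PC like this. Note this is NOT ldrex/strex: as a portable stand-in, the simulated interrupt bumps a generation counter and the reader repeats when it changed. All names here (TickState, readConsistent, the betweenReads hook) are hypothetical.

```cpp
#include <cstdint>
#include <functional>

struct TickState {
    uint32_t generation = 0;  // bumped by every simulated interrupt
    uint32_t millis = 0;      // updated by the simulated interrupt
    uint32_t cycles = 0;      // cycle count captured at that interrupt
};

struct Snapshot {
    uint32_t millis;
    uint32_t cycles;
    uint32_t retries;
};

// Simulated tick interrupt: updates both values together.
inline void simulatedTickIsr(TickState& s, uint32_t cyclesNow) {
    s.millis += 1;
    s.cycles = cyclesNow;
    s.generation += 1;
}

// Copy both values and repeat the copy if an interrupt ran in between.
// The loop body writes nothing to shared state, so repeating it is harmless.
inline Snapshot readConsistent(TickState& s,
                               const std::function<void()>& betweenReads) {
    Snapshot out{0, 0, 0};
    uint32_t before;
    do {
        before = s.generation;
        out.millis = s.millis;
        betweenReads();  // test hook: lets a simulated interrupt fire mid-read
        out.cycles = s.cycles;
        if (before != s.generation) ++out.retries;
    } while (before != s.generation);
    return out;
}
```

Any math on the snapshot happens after readConsistent() returns, so a repeated pass never applies a calculation twice.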

Looking at the loopCnt++ - that doesn't have apparent side effects on a 'volatile' : Other than the read of volatile loopCnt may have been "completed". The compiler assured it was read freshly - but that is then it seems superseded/followed by the ldrex/strex sequence coded as an atomic block.

Code:
That's the important part I suppose - the ldrex/strex blocks work - BUT any direct interaction with an _isr() and shared vars [even volatile] requires critical thought - as with any _isr() data.  
The ldrex/strex block generally has low/minimal overhead UNLESS interrupted ( check out micros() it runs in average 38 or less cycles thanks to Paul saving 2-4 cycles in TD 1.49 ).

Similar code (prior post?) with volatile std::atomic<uint32_t> loopCnt; runs at only 104K versus 133K here when using even this complex ldrex/strex block.

And as noted that 'atomic' covers only one variable - not a series of operations.  BTW: ldrex/strex is supported on M4/T_3.x but NOT M0/T_LC

I did another WWW scan on the ldrex/strex - all it does is repeat if an interrupt occurs during the do{__LDREXW//...}while(__STREXW(...)); - application of that concept is up to the user - single core or multicore or ...

The current _isr detect block looks like this and seems to be working - WITHOUT "dsb". I added the extra if() in the block to assure the interrupt was caused from 'this' sketch, not from sys_tick or other.
Code:
	loopstrex = 0;
	do {
		__LDREXW(&callback_safe_update);
		if ( loopstrex != 0 && loopCntIsr > 1 ) { // repeat entry and the _isr that fired was ours
			Serial.printf("\t\t[[loopCnt=%lu__%lu]]", loopCnt, loopCntIsr);
			loopCnt = 1;
		}
		else
			loopCnt++;
		loopstrex++;
	} while ( __STREXW(1, &callback_safe_update));

Here is a bad string of them - sketch can run 2 to 30 seconds without triggering the ldrex/strex. The double count from loopCnt not being zeroed is prevented - and no 'dsb' is in use. The count is stable.
Code:
133880	61725265
133879	61750265
133879	61775265
133879	61800265		[[loopCnt=0__133879]]
133879	61825269	loopstrex=2		[[loopCnt=1__133853]]
133853	61850269	loopstrex=2
133853	61875265
133879	61900266
133879	61925266
133879	61950266
133879	61975266		[[loopCnt=0__133879]]
133879	62000269	loopstrex=2		[[loopCnt=1__133853]]
133853	62025269	loopstrex=2
133853	62050266
133879	62075266
133879	62100266
133879	62125266
133879	62150266		[[loopCnt=0__133879]]
133879	62175270	loopstrex=2
133853	62200267		[[loopCnt=133879__133878]]
133878	62225272	loopstrex=2
133845	62250267
133879	62275267
 
Cool, just did some tests on your code as well. To be honest I don't understand how you use the ldrex/strex (but that is probably my ignorance) From what I understood from stackexchange you'd use it like this? (VER 3)

Code:
#include "arm_math.h" // micros() synchronization
uint32_t callback_safe_update;
volatile uint32_t loopCnt;
uint32_t newValue;

#include "TeensyTimerTool.h"
using namespace TeensyTimerTool;
volatile uint32_t loopCntIsr = 0;
uint32_t loopstrex = 0;

void isr()
{
    loopCntIsr = loopCnt;
    //Serial.print(".");
    loopCnt = 0;
    asm volatile("dsb"); // xxx
}

//Timer t1(TMR1);
Timer t1 (GPT1); // stable us
//Timer t1 (TCK);

void setup()
{
    t1.beginPeriodic(isr, 25000);
}
void loop()
{
#define VER 3

#if (VER == 1)
    loopCnt++;
#elif (VER == 2)
    do
    {
        __LDREXW(&callback_safe_update);
        loopCnt++;
        loopstrex++;
    } while (__STREXW(1, &callback_safe_update));

#elif (VER == 3)
    do
    {
         newValue = __LDREXW(&loopCnt) + 1;
    } while (__STREXW(newValue, &loopCnt));

#endif

    if (loopCntIsr)
    {
        if (loopCntIsr >= 160000)
            Serial.printf("\t");
        Serial.printf("%d\t%d", loopCntIsr, micros());
        if (loopstrex > loopCntIsr)
            Serial.printf("\tloopstrex=%d", loopstrex - loopCntIsr);
        Serial.println();
        loopstrex = 0;
        loopCntIsr = 0;
    }
}

Which works, but again, I have no experience with this.
 
Regarding the "creeping" count with TMR and TCK

  • The GPT uses the 24 MHz clock, which gives exactly 600'000 ticks per 25 ms.
  • The TMR uses 150 MHz with a 1/128 prescaler -> 25'000 * (150.0 / 128.0) = 29296.875 ticks, so the 25 ms cannot be hit exactly.
  • TCK relies on checking the cycle counter during yield(), so the accuracy will depend on what is going on in the background. It is very unlikely that the tick() function will be hit at exactly the right (sub-microsecond) time.
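The arithmetic above as a small host-side helper, in case someone wants to check other periods. The function names are made up; the clock and prescaler values are the ones quoted in the list.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Ideal (possibly fractional) number of timer ticks in one period.
inline double idealTicks(double periodUs, double clockMHz, double prescaler) {
    assert(prescaler > 0.0);
    return periodUs * clockMHz / prescaler;
}

// True if the period is a whole number of ticks, i.e. the timer can hit it
// without rounding.
inline bool isExactPeriod(double periodUs, double clockMHz, double prescaler) {
    const double t = idealTicks(periodUs, clockMHz, prescaler);
    return t == std::floor(t);
}
```

idealTicks(25000, 24, 1) gives 600000 for the GPT, idealTicks(25000, 150, 128) gives 29296.875 for the TMR, and a 32000 us period comes out as a whole 37500 TMR ticks.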
 
Checked it with 32 ms, which should give an even tick count. It still creeps a little bit. I found that the calculated reload ticks are off by one, fixed it, now it gives exactly the right frequency. I'll update the git repo this evening.

Line 16 of TMRchannel.h should read (note the -1):
Code:
  uint16_t reload = t > 0xFFFF ? 0xFFFF : (uint16_t)t - 1;

Thanks for spotting this
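Why the -1 matters, as a generic model rather than the actual TMR hardware: a counter that starts at 0, matches when it equals the compare value and restarts on the next tick has a period of compare + 1 ticks. That the TMR channel behaves exactly like this model is an assumption, and ticksPerPeriod is a made-up name.

```cpp
#include <cstdint>

// Count the input ticks of one full period for a counter that starts at 0,
// matches when it equals `compare`, and wraps to 0 on the following tick.
inline uint32_t ticksPerPeriod(uint16_t compare) {
    uint32_t ticks = 0;
    uint16_t counter = 0;
    do {
        ++ticks;
        counter = (counter == compare) ? static_cast<uint16_t>(0)
                                       : static_cast<uint16_t>(counter + 1);
    } while (counter != 0);
    return ticks;
}
```

With the 32 ms example, loading 37500 would give a period of 37501 ticks (slow creep), while loading 37500 - 1 gives exactly 37500.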
 
I figured CLOCK res could be a factor - but I saw it and had to make note of it in case there was some off by 1.

As far as:
Code:
    do
    {
         newValue = __LDREXW(&loopCnt) + 1;
    } while (__STREXW(newValue, &loopCnt));

There is a lot of mystery/variety about it and diff notes can be found - in my reading during beta (and again this thread) it was clear the only net effect was: Enclosed 'do' loop code repeats when any interrupt fires during the execution of the 'do' loop.
> It does not actually watch a given variable for any practical end a reference last year said - it can be a dummy - so that is what is in place with micros().
> Likely there are some context notes in Beta T4 thread where FrankB and defragster came across it and FrankB provided a link to a simple example that was IIRC clear or clear enough to make it work with minor effort as used and tested for micros().
> As noted never saw reason to doubt the implementation because in testing at 600 MHz calling micros() taking under 40 cycles every sequential value for microseconds was tested and recorded in various ways (for some many days likely) to always increment by one across the 1,000 systicks per second interrupt with two calls regularly returning the same us value, and on the _isr for systick as long as it caught somewhere in the 60 us it would return then next value - then have 999 us to gain ground for the next call slowed by the _isr.
> The puzzling thing is how the ATOMIC coding can use the same underlying ' do{__LDREXW//...}while(__STREXW(...)); ' and result in the loss of so many counts (25%) per second with what should be less code than the p#55 code. The answer is because it works - so it probably properly encloses the 'volatile read' of the variable - but even that would only add time in the cases where an interrupt occurs.
 
The c++ atomic adds code to the reads, too. In the ISR there are some "dmb".
You might want to take a look at the disassembly (can't post it now, I'm away from my arduino workplace at home)

I'm not sure if they are really needed - But the gcc folks for sure had a reason to do it this way.
Shortest execution time for dmb is 0(zero) cycles. It depends on the queue.

edit: sorry, not reads - I think it was resetting the count to 0
 
For dsb: I now read again (completely forgot about that - it was several months ago) that it does not help in any case. It just helps because it adds execution time. An ARM employee wrote this.

Better is to read the interrupt flag again, after resetting it.
Edit: this way, it is guaranteed that the flag got reset and will not trigger again the same interrupt.
Then - an additional dsb prevents the issue mentioned in the errata I posted above.
 
I wonder how ldrex/strex will work on imxrt 1170 - shouldn't it detect writes by the 2nd core, too? Disabling the interrupts will not work anymore on the shared memory.
 
Read that multicore is the main use case for it. Wondering if in that case fiddling around on machine instruction level is a good idea? Probably better to use proven higher level approaches like atomic.h or some threading?
 
The c++ atomic adds code to the reads, too. In the ISR there are some "dmb".
You might want to take a look at the disassembly (can't post it now, I'm away from my arduino workplace at home)

I'm not sure if they are really needed - But the gcc folks for sure had a reason to do it this way.
Shortest execution time for dmb is 0(zero) cycles. It depends on the queue.

edit: sorry, not reads - I think it was resetting the count to 0

Good point Frank - atomic on the variable could affect ANY use of that variable that 'compiler thinks' needs to be protected - rather than just where it is known to be needed, as above where it is coded in a single spot. Even so that seems 'expensive'!

Taking out the ' do{__LDREXW//...}while(__STREXW(...)); ' {snippet below change "#if 0 to #if 1"} only raises the count to 150K like this where detected errors are shown shifted right - this one doubles, then triples:
Code:
149943	446151905
149943	446176906
	299886	446201907
	449825	446226907
149938	446251906
149943	446276906

Versus the 133K using the protection block with gratuitous printing - it goes up to 135K counts with just this minimal code and the other debug loop print removed. The #if takes out the protection block:
Code:
void isr()
{
	loopCntIsr = loopCnt; // This var is vulnerable - iff the isr() can fire before it is tested and used in loop()
	loopCnt = 0;
	//asm volatile("dsb"); // xxx
}
void loop()
{
#if 0
	loopCnt++;
#else
	loopstrex = 0;
	do {
		__LDREXW(&callback_safe_update);
		if ( loopstrex != 0 && loopCntIsr > 1 ) { // repeat entry and the _isr that fired was ours
			loopCnt = 1;
		}
		else
			loopCnt++;
		loopstrex++;
	} while ( __STREXW(1, &callback_safe_update));
#endif
	if ( loopCntIsr>1 ) {
		Serial.println();
		if ( loopCntIsr >= 160000 ) Serial.printf("\t");
		Serial.printf("%lu\t%lu", loopCntIsr, micros());
		loopCntIsr = 1;
	}
}

@luni - this is the horror that can result from 'meaningless' testing (re: p#7 and #9) :) So in this case there was value.
 
Read that multicore is the main use case for it. Wondering if in that case fiddling around on machine instruction level is a good idea? Probably better to use proven higher level approaches like atomic.h or some threading?

+1. Yes, for sure.
Execution time might rise a little bit. But better that than bugs and increased jitter.
One of the main reasons for using Microcontrollers without OS is the low, fixed response time (good real time behavior) .
 
Read that multicore is the main use case for it. Wondering if in that case fiddling around on machine instruction level is a good idea? Probably better to use proven higher level approaches like atomic.h or some threading?

That instruction was added ~7 years back - on all M4's {and Frank noted that Atomic uses it} - so it pre-dates multicore on these MCU's. How synchronization will work across the 1170 asymmetric cores will be a puzzle - maybe they'll share some lines or common interrupts for signaling - but having all interrupts tied together would be onerous.

@luni: Thread search Teensy-4-0-First-Beta-Test for 'strex' and the Jan/17,18,19/2019 four post comments with this from Frank in p#988 Jan 18, 2019.

I found the instruction the 17th and Frank read into it and gave this example with comment based on his reading and then I read more WWW notes and put it to work:
Tim:
played a few minutes with ldrex/strex:
Code:
#include "arm_math.h"
#include "core_cmInstr.h"

void setup() { delay(1000); }

void loop() {
static uint32_t a = 0;
static uint32_t b = 0;
uint32_t c,d, dummy;
 do {
  __LDREXW(&dummy);
  c = a;
  d = b;
  if (c==10) {delay(2);a++;}//<- interrupt happens most likely here   
 } while ( __STREXW(0, &dummy));
 Serial.printf("c: %d, d: %d\n",c,d);
 delay(500);
 a++;b++;
}
It just detects interrupts :) so.. if an interrupt is detected, it repeats the loop.
simple.

Looks like post #1106 had me migrate my lame +/-1us hack to use 'current' code - and that post claims I ran 100,000,000 test calls in 31 seconds against a 100 us clock. Then I started real testing :) I don't recall when Paul flipped the switch from low res micros() to that version that gave cycle count resolved to 1 us. It was before first 1062 beta hardware, by the end of March 2019. Paul did as noted speed it up 5% using his 'ASM' blackbelt along with limiting round up error for TD 1.49 - but that was the calc code done after the values were safely/atomically read.
 
Yup, on a single core, it's just a interrupt-detection-tool. The "dummy" variable was a simple way to use it this way.
 
@luni - Great stuff.

Wondering a couple of things: (Note: I am mainly looking at T4)

This morning I was hacking on the Teensy_I2C_Sniffer sketch (different thread) and wondered if it would work better if the timer ran at 2 MHz... He was/is using the Timer1 library. I made a hacked up version of the constructor which allowed me to pass in a floating point for number of microseconds, so passed in 0.5 and was able to get the timer to run at that speed.

Wonder if it makes sense to add that capability here?

Knowing when different timers are in use.

With the IntervalTimer code on T4 which uses PIT timers, we check: if (channel->TCTRL == 0) break;
To know that the timer has not yet been initialized and assumed free.
But I see you are not yet doing PIT timers.
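A host-side sketch of that "control register still zero means free" check, using a fake register struct instead of the real PIT registers. FakePitChannel and findFreeChannel are invented for illustration, and the check only catches channels that were configured through that register.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for one PIT channel's registers; only the field used here.
struct FakePitChannel {
    uint32_t TCTRL = 0;  // 0 is taken to mean "never initialized"
};

// Return the index of the first channel whose control register is still 0,
// or -1 if every channel looks like it is in use.
template <std::size_t N>
int findFreeChannel(const std::array<FakePitChannel, N>& channels) {
    for (std::size_t i = 0; i < N; ++i) {
        if (channels[i].TCTRL == 0) return static_cast<int>(i);
    }
    return -1;
}
```

Whether a similar "untouched register" test exists for the TMR or GPT modules is an open question here, not something this sketch answers.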

I assume you already know it, but on T4, the Quad Timer is already used now in a few places.
a) PWM - Several of the channels are used for different IO pins, if the user tries to do PWM on those pins:
Extracted from the PWM source table...
Code:
	{2, M(1, 0), 0, 1},  // QuadTimer1_0  10  // B0_00
	{2, M(1, 2), 0, 1},  // QuadTimer1_2  11  // B0_02
	{2, M(1, 1), 0, 1},  // QuadTimer1_1  12  // B0_01
	{2, M(2, 0), 0, 1},  // QuadTimer2_0  13  // B0_03
	{2, M(3, 2), 0, 1},  // QuadTimer3_2  14  // AD_B1_02
	{2, M(3, 3), 0, 1},  // QuadTimer3_3  15  // AD_B1_03
	{2, M(3, 1), 0, 1},  // QuadTimer3_1  18  // AD_B1_01
	{2, M(3, 0), 0, 1},  // QuadTimer3_0  19  // AD_B1_00
Note: I know I looked earlier and I believe one or two more pins will also be setup this way for 4.1 board.

b) PulsePosition
c) ADC - currently may use QT4 (one or two channels) if the user wants a timed ADC .
...

Again great stuff!
 
@luni - Great stuff.
Thanks!

This morning I was hacking on the Teensy_I2C_Sniffer sketch (different thread) and wondered if it would work better if the timer ran at 2 MHz... He was/is using the Timer1 library. I made a hacked up version of the constructor which allowed me to pass in a floating point for number of microseconds, so passed in 0.5 and was able to get the timer to run at that speed.
Wonder if it makes sense to add that capability here?

Originally I thought enabling more than 1MHz would be a good idea, but did you ever look at the load this will generate? Last time I checked I got this result (4 PITs)
Code:
f:100.0 kHz Load:  41.6  (w/o interrupts: 6500010 with interrupts 11125203)
f: 50.0 kHz Load:  17.3  (w/o interrupts: 6500010 with interrupts 7858722)
f: 25.0 kHz Load:   8.5  (w/o interrupts: 6500010 with interrupts 7105767)
f: 12.5 kHz Load:   4.3  (w/o interrupts: 6500010 with interrupts 6789252)
f:  6.2 kHz Load:   2.2  (w/o interrupts: 6500010 with interrupts 6644030)
f:  3.1 kHz Load:   1.1  (w/o interrupts: 6500010 with interrupts 6571456)
f:  1.6 kHz Load:   0.6  (w/o interrupts: 6500010 with interrupts 6536124)
https://forum.pjrc.com/threads/57959-Teensy-4-IntervalTimer-Max-Speed?p=218577&viewfull=1#post218577
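The Load column appears to match the share of the measured cycles that went to interrupts. That formula is inferred from the numbers in the table; the benchmark in the linked post may compute it differently. loadPercent is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// Percentage of the measured cycles spent in interrupts, given the cycle
// count of the same workload without and with the timers running.
inline double loadPercent(uint32_t cyclesWithout, uint32_t cyclesWith) {
    assert(cyclesWith > 0 && cyclesWith >= cyclesWithout);
    return 100.0 * (static_cast<double>(cyclesWith) - cyclesWithout) / cyclesWith;
}
```

For the 100 kHz row, loadPercent(6500010, 11125203) lands at about 41.6, and the 25 kHz row at about 8.5.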

I think the underlying problem is the same for all timers in the T4, so I thought enabling that will call for trouble... But I'll have a look at your code later today maybe you found the trick to accelerate it which I'm looking for so long...

I assume you already know it, but on T4, the Quad Timer is already used now in a few places.
Yes I know, same problem when I did TeensyDelay (which uses the FTMs) a couple of years ago. Therefore, this time I tried to make it easier for the user to choose which resource to use.
Thanks for the PWM info, I was looking for that since the corresponding table from the PJRC homepage is not yet updated with T4 info. A general table of pre allocated resources would be very useful indeed. The library documentation on gitHub still misses a lot of required information but the whole thing is quite new...

Anyway, the performance of the TCK timers, which don't use hardware resources, is pretty amazing with a T4. They don't suffer from bus sync wait time and don't run from an interrupt context.
 
Got a 'file not found' compile error which was corrected by changing /src/Teensy/TMR/TMR.h : line 3

Code:
#include "TmrChannel.h"

to

Code:
#include "TMRchannel.h"
 
Sorry, the usual issue with filenames not case sensitive in Windows.... I'll fix that later today. Let me know if there are more of those bugs :)
 
For dsb: I now read again (completely forgot about that - it was several months ago) that it does not help in any case. It just helps because it adds execution time. An ARM employee wrote this.

Better is to read the interrupt flag again, after resetting it.
Edit: this way, it is guaranteed that the flag got reset and will not trigger again the same interrupt.
Then - an additional dsb prevents the issue mentioned in the errata I posted above.

@Frank: Do you think the last line in the code below is enough for " to read the interrupt flag again, after resetting it."? (Don't want to introduce a dummy, where the compiler complains about not using it). It works, but one never knows with this esoteric stuff.

Code:
if (callback != nullptr && regs->CSCTRL & TMR_CSCTRL_TCF1)
{
    regs->CSCTRL &= ~TMR_CSCTRL_TCF1;
    callback();
    regs->CSCTRL = regs->CSCTRL; //<----------
}
 ....
asm volatile("dsb");
 
That's read AND write.. hm, more than needed.
I think the compiler shouldn't complain with
Code:
...
callback();
regs->CSCTRL;
}
...
as long it is volatile. A warning would be wrong, because the volatile says, it has to be read (and exactly at this place), no matter if it is used or not.
But I have not tested this.
@MichaelMeissner is the expert.
 
Doesn't that get optimized away? Even when volatile? I'll have a look at the assembly...
 