IntervalTimer on Teensy 4.0 versus 3.6

Status
Not open for further replies.
@luni - re #2:
Making the timer faster is one thing, but you should really look at the load this generates. Currently the T4 already gets overloaded by 4 PITs just flipping pins at about 300 kHz. So, making it faster without solving the load issue might generate more issues than it solves. Further up in this thread I showed measurements with the TMRs, since they run at 150 MHz. One TMR overloads the T4 (it can't even blink the LED) at about 4 MHz; I didn't do experiments with more timers running in parallel so far.
Code I first see above p#7 doesn't have the "dsb": can you locate the code of interest and assure it has that and retest?

back to short off topic - clock speed … as noted by KurtE - try 200 not 300 MHz - Using 200 MHz: if (div_ipg > 3) div_ipg = 3;
Code:
Avg CycCnt for 32000 is 3.009719
CC@0=1 CC@2=27307 CC@3=1 CC@4=2158 CC@7=2 CC@8=539 CC@9=1 CC@10=539 CC@12=539 CC@18=912 CC@29=1

My test now gets all the printed content the DMA code produces, and improves access from 4 clocks to 3 clocks as written. Most of the 32000 samples are done in 2 clocks: CC@2=27307 - only 10% more than at 150 MHz, which showed CC@2=24688
 
Code I first see above p#7 doesn't have the "dsb": can you locate the code of interest and assure it has that and retest?
Sure, did it, but nothing changed, which makes sense: the ISR code is long enough that the bus has already synced by the time it reaches the dsb, so the dsb should return quite quickly.

I pulled the current core, revived my old load test and gave it a try:
Code:
#include "Arduino.h"

IntervalTimer t1, t2, t3, t4;

template<unsigned pin>
void test() // dummy function
{
    digitalWriteFast(pin,!digitalReadFast(pin));   
    asm volatile ("dsb") ; 
}

volatile int dummy;
constexpr unsigned loops = 1'000'000;

// count processor cycles needed for a loop
unsigned speedTest(unsigned loops)
{
    uint32_t start = ARM_DWT_CYCCNT;
    for (unsigned i = 0; i < loops; i++)
    {
        dummy++;
    }
    uint32_t end = ARM_DWT_CYCCNT;

    return end - start;
}

void setup()
{
    while (!Serial);
    for(unsigned i = 0; i < 14; i++) pinMode(i, OUTPUT);
    
    // required for T3.6
    ARM_DEMCR |= ARM_DEMCR_TRCENA;
    ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

    // Measure cycles required for loop without any interrupts
    noInterrupts();
    uint32_t withoutInts = speedTest(loops);
    interrupts();

    for (int T = 2; T < 1001; T *= 2)
    {        
        t1.begin(test<0>, T);
        t2.begin(test<1>, T);
        t3.begin(test<2>, T);
        t4.begin(test<3>, T);

        uint32_t withInts = speedTest(loops);

        float load = 100.0f * (1.0f - (float)withoutInts / (float)withInts);
        Serial.printf("f:%5.1f kHz Load: %5.1f%%", 1000.0f / T, load);
        Serial.printf("  (w/o interrupts: %d with interrupts %d)\n", withoutInts, withInts);
        Serial.flush();
    }
}

void loop()
{
    digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
    delay(500);
}

Output: (4 Interval timers, each toggling a pin)
Code:
f:500.0 kHz Load:  96.9%  (w/o interrupts: 6500008 with interrupts 208184499)
f:250.0 kHz Load:  69.0%  (w/o interrupts: 6500008 with interrupts 20968603)
f:125.0 kHz Load:  34.5%  (w/o interrupts: 6500008 with interrupts 9928551)
f: 62.5 kHz Load:  17.2%  (w/o interrupts: 6500008 with interrupts 7853884)
f: 31.2 kHz Load:   8.6%  (w/o interrupts: 6500008 with interrupts 7112793)
f: 15.6 kHz Load:   4.3%  (w/o interrupts: 6500008 with interrupts 6791815)
f:  7.8 kHz Load:   2.2%  (w/o interrupts: 6500008 with interrupts 6647413)
f:  3.9 kHz Load:   1.1%  (w/o interrupts: 6500008 with interrupts 6571528)
f:  2.0 kHz Load:   0.6%  (w/o interrupts: 6500008 with interrupts 6536233)

What do I need to change to test the improved performance?
 
First look - p#27 code fails to print with: if (div_ipg > 3) div_ipg = 3;

Restoring the divisor of 4 gets similar results to the posting. ...
 
Using ipg=4 it also failed to print a couple of times. Perhaps because the timers were left running. I edited the code to .end() them after each test pass. I also wondered if the template code affected anything - it does, as the counts indicate - but they vary from run to run ???

Also added a dummy untimed loop (loops/10) to let the timers wake up, because the results run to run are not consistent - it didn't help. And printed the numbers as UNSIGNED since they overflowed and went negative with %d.

Not sure of the definition of 'Load' used? It is cycles used for loop count of 1M. That test doesn't call the test<#>() functions at all - let alone the same number of times as the freq is increased. With the timer _isr()'s running it is doing a lot more work, so it is expected it would use more cycles. And the longer it runs, the more work it does resolving the _isr()'s?

A more valid direction might be to make an array for each speed and have the no-interrupt test call the test()'s the expected number of times. Also maybe have each _isr() track the number of calls it resolves and compare those to the non-interrupt case in some fashion. That is, missed calls would be a measure of loading that prevented a certain number of timer calls from being serviced.

Code:
T:\tCode\FORUM\InterValTimerTestJan\InterValTimerTestJan.ino Jan 27 2020 13:09:19
f:500.0 kHz Load:  99.0%  (w/o interrupts <T>: 6500004 with interrupts 672307233)
f:250.0 kHz Load:  81.4%  (w/o interrupts <T>: 6500004 with interrupts 34936597)
f:125.0 kHz Load:  41.6%  (w/o interrupts <T>: 6500004 with interrupts 11130565)
f: 62.5 kHz Load:  20.9%  (w/o interrupts <T>: 6500004 with interrupts 8217585)
f: 31.2 kHz Load:  10.4%  (w/o interrupts <T>: 6500004 with interrupts 7250791)
f: 15.6 kHz Load:   5.2%  (w/o interrupts <T>: 6500004 with interrupts 6859319)
f:  7.8 kHz Load:   2.6%  (w/o interrupts <T>: 6500004 with interrupts 6676486)
f:  3.9 kHz Load:   1.3%  (w/o interrupts <T>: 6500004 with interrupts 6587744)
f:  2.0 kHz Load:   0.7%  (w/o interrupts <T>: 6500004 with interrupts 6542932)


f:500.0 kHz Load:  98.5%  (w/o interrupts ABCD: 6500004 with interrupts 419589535)
f:250.0 kHz Load:  85.7%  (w/o interrupts ABCD: 6500004 with interrupts 45313804)
f:125.0 kHz Load:  42.8%  (w/o interrupts ABCD: 6500004 with interrupts 11362460)
f: 62.5 kHz Load:  21.2%  (w/o interrupts ABCD: 6500004 with interrupts 8245308)
f: 31.2 kHz Load:  10.7%  (w/o interrupts ABCD: 6500004 with interrupts 7280391)
f: 15.6 kHz Load:   5.3%  (w/o interrupts ABCD: 6500004 with interrupts 6863924)
f:  7.8 kHz Load:   2.7%  (w/o interrupts ABCD: 6500004 with interrupts 6677375)
f:  3.9 kHz Load:   1.3%  (w/o interrupts ABCD: 6500004 with interrupts 6588756)
f:  2.0 kHz Load:   0.7%  (w/o interrupts ABCD: 6500004 with interrupts 6543358)

Here is where I left the code:
Code:
#include "Arduino.h"

IntervalTimer t1, t2, t3, t4;

template<unsigned pin>
void test() // dummy function
{
	digitalWriteFast(pin, !digitalReadFast(pin));
	asm volatile ("dsb") ;
}

volatile int dummy;
constexpr unsigned loops = 1'000'000;

void testA() // dummy function
{
	digitalWriteFast(0, !digitalReadFast(0));
	asm volatile ("dsb") ;
}
void testB() // dummy function
{
	digitalWriteFast(1, !digitalReadFast(1));
	asm volatile ("dsb") ;
}
void testC() // dummy function
{
	digitalWriteFast(2, !digitalReadFast(2));
	asm volatile ("dsb") ;
}
void testD() // dummy function
{
	digitalWriteFast(3, !digitalReadFast(3));
	asm volatile ("dsb") ;
}

// count processor cycles needed for a loop
unsigned speedTest(unsigned loops)
{
	for (unsigned i = 0; i < loops/10; i++)
	{
		dummy++;
	}
	uint32_t start = ARM_DWT_CYCCNT;
	for (unsigned i = 0; i < loops; i++)
	{
		dummy++;
	}
	uint32_t end = ARM_DWT_CYCCNT;

	return end - start;
}

void setup()
{
	while (!Serial && millis() < 4000 );
	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
	for (unsigned i = 0; i < 14; i++) pinMode(i, OUTPUT);

	// required for T3.6
	ARM_DEMCR |= ARM_DEMCR_TRCENA;
	ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

	// Measure cycles required for loop without any interrupts
	noInterrupts();
	uint32_t withoutInts = speedTest(loops);
	interrupts();

	for (int T = 2; T < 1001; T *= 2)
	{
		t1.begin(test<0>, T);
		t2.begin(test<1>, T);
		t3.begin(test<2>, T);
		t4.begin(test<3>, T);

		
		uint32_t withInts = speedTest(loops);
		t1.end();
		t2.end();
		t3.end();
		t4.end();

		float load = 100.0f * (1.0f - (float)withoutInts / (float)withInts);
		Serial.printf("f:%5.1f kHz Load: %5.1f%%", 1000.0f / T, load);
		Serial.printf("  (w/o interrupts <T>: %lu with interrupts %lu)\n", withoutInts, withInts);
		Serial.flush();
	}
	Serial.println();
	Serial.println();

	for (int T = 2; T < 1001; T *= 2)
	{
		t1.begin(testA, T);
		t2.begin(testB, T);
		t3.begin(testC, T);
		t4.begin(testD, T);

		uint32_t withInts = speedTest(loops);
		t1.end();
		t2.end();
		t3.end();
		t4.end();

		float load = 100.0f * (1.0f - (float)withoutInts / (float)withInts);
		Serial.printf("f:%5.1f kHz Load: %5.1f%%", 1000.0f / T, load);
		Serial.printf("  (w/o interrupts ABCD: %lu with interrupts %lu)\n", withoutInts, withInts);
		Serial.flush();
	}
}

void loop()
{
	digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
	delay(500);
}
 
Using ipg=4 it also failed to print a couple of times. Perhaps because the timers were left running

Since at 500 kHz the processor is pretty much overloaded, it might take a long time until it gets the 1M loops done. Can you try starting with e.g. T = 4 (250 kHz) to avoid this?

I also wondered if the template code affected anything - it does as the counts indicate - but they vary from run to run ???
The template does exactly the same as you did with your testA ... testD. Templates are a compile-time-only thing; I just used one to save typing :)

Not sure of the definition of 'Load' used? It is cycles used for loop count of 1M
It is the ratio, in percent, of the time the loop needs without the timers running to the time it takes with the timers running. I.e., if the timer code didn't eat up any time you'd get 0% load; if the loop takes twice as long as without the timers running you'd get 50%. If the timers eat up all processor cycles you'd get 100% load (of course it would be stuck in this case). Basically the load value shows how busy the processor is handling the 4 interval timers. If it approaches 100% you can't do anything else anymore.

The results you get with your code are basically the same? This test isn't about getting accurate load numbers. I just did it to check if it makes sense to enable users to increase the interrupt frequency to 2 MHz or faster when the processor is already having a hard time at 500 kHz (4 timers only flipping pins).

I'd like to compare the performance of the current implementation to the 150MHz or this IPG ratio thing but I need some pointers what to change in the core for testing.
 
@Paul @luni @Frank B

Interval Timer - playing around... Not trying to answer the question of IF we should convert the IntervalTimer and GPT timers from 24 MHz to bus speed, but instead to detect IF the configuration is bus speed.

Sort of like how the SPI library is set up: if you do an SPI.beginTransaction(SPISettings(baud... We are not trying to decide which clock option is best; instead we simply look at what clock configuration the user has in their sketch and configure using that...

So for example if I change the IntervalTimer.begin to something like:
Code:
	bool begin(void (*funct)(), float microseconds) {
		uint32_t pid_clock_mhz = (CCM_CSCMR1 & CCM_CSCMR1_PERCLK_CLK_SEL)? 24 : (F_BUS_ACTUAL / 1000000);
		uint32_t max_period = UINT32_MAX / pid_clock_mhz;
		if (microseconds == 0 || microseconds > max_period) return false;
		uint32_t cycles = (float)pid_clock_mhz * microseconds - 0.5;
		Serial.printf("IntervalTImer::begin %f %u %u\n", microseconds, pid_clock_mhz, cycles);
		if (cycles < 17) return false;
		return beginCycles(funct, cycles);
	}
It looks at the configuration to see if we are using the 24 MHz clock or not and then computes the other stuff...

I tried the simple test sketch, with and without the slight mod, and it appears to work:

That is, my setup now has:
Code:
void setup() {
  // Set PIT and GPT to BUS speed 
 CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL;
And now the PIT and GPT run at 150 MHz...

Without Mod, my debug output: IntervalTImer::begin 0.750000 24 17
With Mod: IntervalTImer::begin 0.750000 150 112

Does it make sense to clean this up and put in a PR that does the right thing when the user changes to use BUS speed?
 
@luni: This file : T:\arduino-1.8.9\hardware\teensy\avr\cores\teensy4\clockspeed.c - about line 148 for the divisor change. It is expected this might allow this I/O {digitalWriteFast(1, !digitalReadFast(1));} to complete faster, if I read the prior post right.

… had to run - just back ...

The non-interrupt code takes 6,500,000 cycles to run the loops. Any _isr() triggers save/call/restore overhead and executes its code in the process - those cycles add overhead time (== cycles) to: for (unsigned i = 0; i < loops; i++) dummy++;
So there isn't a fixed baseline but a moving target, since the longer it takes … the longer it will take - unless each _isr() exited after a fixed number of cycles. The baseline takes about 0.010833 seconds to run at 600 MHz, so for a given freq an expected number of _isr()'s will fire. If each _isr() was passed that number for its freq and did timer.end() when done, that would allow the extended time of speedTest() to be fairly measured against the baseline 6.5M cycles.

If there is no 'problem' in the processor, that should draw a line through the points, with a finite number of cycles added per timer freq - a 'unit' of work to process each additional _isr(). The 'problem' would come when the _isr() system gets too busy to service each _isr() and make any progress on the task at hand. The T4 can process in excess of 10M interrupts / second with minimal code under ~40 cycles and not be swamped given 600M cycles per second. In this case, with 4 _isr()'s firing only 1 million times each, it really should have 50% CPU time left for the task at hand. But the moving target noted above means it would have to run twice as long to account for the cycles used by the added interrupt code. That is where having each _isr() self-.end() after it fires the expected number of times would allow a concrete measure of the time added to perform the base task - and should result in consistent numbers across repeated runs.

Indeed the template is just a param added to the function call - that is different than a typical _isr(), which gets no params or context on entry. So the test[ABCD] functions as written should run in fewer cycles, as the indicator pin is hardcoded as a constant per function, not a stack/param-based 'pin'.
 
Thank you all very much. I'm very grateful for the thoughtful responses from folks who obviously know a ton of detail here.

If this gets improved, great. If not, I'll remember the effort you all put in. If you ever need help with analog, mixed signal, or sensors, please let me know.

Thanks,
Greg
 
Does it make sense to clean this up and put in a PR that does the right thing when the user changes to use BUS speed?

As far as I'm concerned your suggestion does make a lot of sense.

I changed the GPT setup code in TeensyTimerTool:
Code:
if (isPeriodic)
    {
        //double tmp = micros * (24.0 / 1.0);   
        uint32_t pid_clock_mhz = (CCM_CSCMR1 & CCM_CSCMR1_PERCLK_CLK_SEL) ? 24 : (F_BUS_ACTUAL / 1000000);           
        double tmp = micros * (pid_clock_mhz / 1.0);

        uint32_t reload = tmp > 0xFFFF'FFFF ? 0xFFFF'FFFF : (uint32_t)tmp;
        regs->SR = 0x3F;         // clear all interrupt flags
        regs->IR = GPT_IR_OF1IE; // enable OF1 interrupt
        regs->OCR1 = reload - 1; // set overflow value
        regs->CR |= GPT_CR_EN;   // enable timer
    }

which works as advertised. I will update the lib accordingly.

I assume
Code:
CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL;
changes the clock for GPT and PIT, right? Looks like I really have to read into those clock setting chapters; I've tried to avoid that so far...
 
And here the load for two GPTs toggling pins. Done for the 24Mhz clock
Code:
f:500.0 kHz Load:  26.9%  (w/o interrupts: 6500006 with interrupts 8895336)
f:250.0 kHz Load:  13.5%  (w/o interrupts: 6500006 with interrupts 7512452)
f:125.0 kHz Load:   6.7%  (w/o interrupts: 6500006 with interrupts 6967867)
f: 62.5 kHz Load:   3.4%  (w/o interrupts: 6500006 with interrupts 6730651)
f: 31.2 kHz Load:   1.7%  (w/o interrupts: 6500006 with interrupts 6612408)
f: 15.6 kHz Load:   0.9%  (w/o interrupts: 6500006 with interrupts 6556394)
f:  7.8 kHz Load:   0.4%  (w/o interrupts: 6500006 with interrupts 6528373)
f:  3.9 kHz Load:   0.2%  (w/o interrupts: 6500006 with interrupts 6514522)
f:  2.0 kHz Load:   0.1%  (w/o interrupts: 6500006 with interrupts 6507724)

And here for the 150MHz clock
Code:
f:500.0 kHz Load:  18.8%  (w/o interrupts: 6500006 with interrupts 8001073)
f:250.0 kHz Load:   9.3%  (w/o interrupts: 6500006 with interrupts 7165189)
f:125.0 kHz Load:   4.8%  (w/o interrupts: 6500006 with interrupts 6827204)
f: 62.5 kHz Load:   2.3%  (w/o interrupts: 6500006 with interrupts 6656056)
f: 31.2 kHz Load:   1.2%  (w/o interrupts: 6500006 with interrupts 6577284)
f: 15.6 kHz Load:   0.6%  (w/o interrupts: 6500006 with interrupts 6539266)
f:  7.8 kHz Load:   0.3%  (w/o interrupts: 6500006 with interrupts 6519921)
f:  3.9 kHz Load:   0.2%  (w/o interrupts: 6500006 with interrupts 6510379)
f:  2.0 kHz Load:   0.1%  (w/o interrupts: 6500006 with interrupts 6505643)

So, regardless of the allowed minimal timer period, changing the underlying clock to 150 MHz significantly reduces the load the timers generate. I assume this is due to the shorter time it now takes to sync both busses. This is good news for my stepper library (TeensyStep). It might finally get performance comparable to the T3.6.
 
The non-interrupt code takes 6,500,000 cycles to run the loops. Any _isr() triggers save/call/restore overhead and executes its code in the process - those cycles add overhead time (== cycles) to: for (unsigned i = 0; i < loops; i++) dummy++;
So there isn't a fixed baseline but a moving target, since the longer it takes … the longer it will take.

I assume you are questioning the load calculation? I'm simply measuring the time difference something (here a loop) takes with and without the timers running in the background. I then relate this to the time it takes with the timers.

So:
Code:
load = (t_w - t_wo) / t_w   
load = 1 - t_wo / t_w

A few sanity checks for the definition:

Code:
t_w = t_wo   => load = 0.0  OK
t_w = 2*t_wo => load = 0.5  OK
t_w = 4*t_wo => load = 0.75 OK
t_w = inf    => load = 1.0 OK
So, I don't see any obvious problem here.


The T4 can process in excess of 10M interrupts / second with minimal code under ~40 cycles and not be swamped given 600M cycles per second.
Really? As soon as you need to read/write some registers (and you will need to at least reset the interrupt flag) the processor needs to wait to synchronize the busses, which can take very long (especially if the peripheral clock is much lower than F_BUS; e.g. see https://community.arm.com/developer...-dsb-isb-on-cortex-m3-m4-m7-single-core-parts ). This of course is a pity because it holds back the otherwise very fast processor. Increasing the peripheral clock like Kurt did reduces the sync time and the interrupts get more efficient (at least this is how I understand it).
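For reference, the kind of handler this is talking about looks something like the following sketch (PIT channel 0, using the Teensy 4 core's register names; a sketch only, not the core's actual handler):

```cpp
// Sketch of a PIT channel-0 ISR on Teensy 4.x. The TFLG write crosses the
// slow peripheral bus (24 MHz by default), which is where the sync stalls
// come from; the trailing dsb keeps the flag-clearing write from being
// reordered past the interrupt return, which would retrigger the ISR.
void pitChannel0Isr()
{
    PIT_TFLG0 = 1;           // write-1-to-clear the timer interrupt flag
    // ... user callback work here ...
    asm volatile("dsb");     // let the peripheral write complete before exit
}
```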

Indeed the Template is just a param added to the function call - that is different than typical _isr() where it gets no params or context on entry. So the Test[ABCD] as written should run in fewer cycles as the indicator pin is const hardcoded per function, not stack/param based 'pin'.
No, the template does not add a param; the compiler generates 4 functions from the template, one for each pin. It is exactly the same as writing the 4 functions manually. I compiled your code containing both the template and the testA... functions, and extracted the code for two of the functions from the .lst file:

Here is the generated code for testA:
Code:
00000184 <testA()>:
     184:	mov.w	r3, #1107296256	; 0x42000000
     188:	ldr	r2, [r3, #8]
     18a:	lsls	r2, r2, #28
     18c:	bmi.n	19a <testA()+0x16>
     18e:	movs	r2, #8
     190:	str.w	r2, [r3, #132]	; 0x84
     194:	dsb	sy
     198:	bx	lr
     19a:	movs	r2, #8
     19c:	str.w	r2, [r3, #136]	; 0x88
     1a0:	dsb	sy
     1a4:	bx	lr
     1a6:	nop

And here for test<0>:
Code:
0000007c <void test<0u>()>:		
      7c:	mov.w	r3, #1107296256	; 0x42000000
      80:	ldr	r2, [r3, #8]
      82:	lsls	r2, r2, #28
      84:	bmi.n	92 <void test<0u>()+0x16>				
      86:	movs	r2, #8
      88:	str.w	r2, [r3, #132]	; 0x84
      8c:	dsb	sy
      90:	bx	lr
      92:	movs	r2, #8
      94:	str.w	r2, [r3, #136]	; 0x88
      98:	dsb	sy
      9c:	bx	lr
      9e:	nop

I can't spot any difference.
 
@luni
Thanks for the implementation details on templates - indeed, the same - so I manually wrote the four functions the compiler would have generated. I added more code to the template _isr() 'one time' and had to add the alternate code 4 times for the A, B, C and D tests … only to see them run in the same time … I have removed that code.

The 10+M was for pin interrupts - so the bus sync stuff must be different - just looking for a way to have it show it to me …

Playing with the code some to precalc the number of _isr() hits expected based on the part of a second the no-interrupts takes to run.

Seems I got the fRatio and iCnt right. Having the _isr()'s .end() their timer at the expected number of counts shows no _isr()'s 'Left' to fire ... though with many more numbers, and indeed something is up behind the scenes.

I get this when stopping the _isr()'s from running longer than expected - it looks like it adds 293 cycles per interrupt - where there were 5435 _isr()'s for each of the 4 timers when they were given .end():
Code:
f:500.0 kHz Load:  49.5% iCnt=  5435  (w/o isr: 6500004 with _isr 12878568) d=6378564 Left#=0 Over=5
f:250.0 kHz Load:  41.7% iCnt=  2717  (w/o isr: 6500004 with _isr 11143746) d=4643742 Left#=0 Over=1

Indeed that shows Load % tracks as calculated ...
 
As soon as you need to read/write some registers (and you will need to at least reset the interrupt flag) the processor needs to wait to synchronize the busses, which can take very long (especially if the peripheral clock is much lower than F_BUS; e.g. see https://community.arm.com/developer...-dsb-isb-on-cortex-m3-m4-m7-single-core-parts ). This of course is a pity because it holds back the otherwise very fast processor. Increasing the peripheral clock like Kurt did reduces the sync time and the interrupts get more efficient (at least this is how I understand it).
This is how I understand it, too.
What I'm not sure about is how this slows down everything. I might have to re-read everything.
There is a write queue. When we write, we write to the queue, and I don't see how this slows down everything, as access to the queue should be fast (i.e. 600 MHz).
The case where we get trouble and wait states is when the queue is full. Is that correct?
So... the easy way around that is not to do several "slow" read/writes in a row (i.e., do not fill the queue), and instead try to use the time for more useful things (interleaving):

Write
Other code
Other code
Other code
...
Write
Other..

Or am I completely wrong? Might be and quite possible. I'll try to find a doc about the timing.. I want to understand the underlying mechanics in detail.

Edit: we shouldn't do a read right after a write either, if I'm right, because the read has to wait until the data got written. A kind of worst case.
 
Read after write: I'd like to know if there is a difference for OCRAM, i.e. memory that is not "strongly ordered".
In theory, the CPU could just return the value written right before, without accessing the bus (because the value is still in the queue and could be used). Only if we take DMA out of the equation, of course.
 
... so, in some cases it might be faster not to use "volatile", and use (something like) a GCC memory barrier (NOT dsb) instead. (???) A volatile seems easy here, but I doubt it is the best way... it tells GCC to write the data ASAP. This rule is a little too strong. Again, this is guessing, and I might be wrong.
This leads to the question of how good GCC is and whether it knows about the RAM/peripheral difference - and if not, whether we can tell GCC about that somehow, so that it can use this knowledge for optimizations regarding the order of instructions.

Disclaimer: All this has to be verified - I'm not sure about it...
 
Does DMAMEM have to be volatile? We have to invalidate the cache before using DMA anyway.

Edit: it is not volatile. I thought, it was. Sorry.
 
A quick Update: I went ahead and created a Pull Request for this: https://github.com/PaulStoffregen/cores/pull/425

Again, I did not change the defaults for the PIT and GPT timers. I only put in the code as mentioned, which looks at the register and hopefully works either way...

The actual place that currently sets the GPT and PIT timers to use 24 MHz is in startup.c:
Code:
	// Configure clocks
	// TODO: make sure all affected peripherals are turned off!
	// PIT & GPT timers to run from 24 MHz clock (independent of CPU speed)
	CCM_CSCMR1 = (CCM_CSCMR1 & ~CCM_CSCMR1_PERCLK_PODF(0x3F)) | CCM_CSCMR1_PERCLK_CLK_SEL;
	// UARTs run from 24 MHz clock (works if PLL3 off or bypassed)
	CCM_CSCDR1 = (CCM_CSCDR1 & ~CCM_CSCDR1_UART_CLK_PODF(0x3F)) | CCM_CSCDR1_UART_CLK_SEL;
And if it actually does reduce overhead, I wonder if we should also look at the UART one as well?
 
Folks,
Just to clean up a loose end, I did try "__asm volatile ("dsb");" and even "isb" as some folks have suggested, but it made absolutely no difference in terms of net execution rate and no observable difference in jitter in the test code originally posted at the beginning of this thread.

Thanks!
 

As soon as you need to read/write some registers (and you will need to at least reset the interrupt flag) the processor needs to wait to synchronize the busses, which can take very long (especially if the peripheral clock is much lower than F_BUS; e.g. see https://community.arm.com/developer...-dsb-isb-on-cortex-m3-m4-m7-single-core-parts ). This of course is a pity because it holds back the otherwise very fast processor. Increasing the peripheral clock like Kurt did reduces the sync time and the interrupts get more efficient (at least this is how I understand it).
...


Thanks for mentioning that @luni - and @Frank for actually reading it and pointing it out.

That explains the added lag - if there isn't a pause and the buses are not kept in sync, it sounds like that is when _isr()'s double - and clearing the interrupt status takes time.
 
Folks,
Just to clean up a loose end, I did try "__asm volatile ("dsb");" and even "isb" as some folks have suggested, but it made absolutely no difference in terms of net execution rate and no observable difference in jitter in the test code originally posted at the beginning of this thread.

Thanks!

Oops, sorry that nobody is talking about your original problem anymore :)

Here is some code which works with periods below 1 µs. To make it run you need to pull the core files with pull request #425 from GitHub. This includes both of Kurt's changes, i.e. the change of the minimal allowable period and the detection of the actually used timer clock. To enable the 150 MHz clock you add CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL; to setup().

Code:
#include "Arduino.h"
// IntervalTimer Max Frequency Test
// Uses float arguments that are converted to allowable int multiples of F_BUS
// G. Kovacs 1/23/20
//
// Tested using Teensyduino Version 1.49
// For Teensy 4.0 minimum IntervalTimer period = 1.6 us *regardless*, from 600 MHz to 1.008 GHz (overclock).
// These results are independent of compiler optimization settings but are presented here for "Fastest" (not default).

IntervalTimer sampleRate;

const int outPin = 14;           //Output pin
const float samplePeriod = 1.0; //Interrupt period in microseconds. May be a float - is internally converted to int.
boolean outputState = HIGH;

FASTRUN void ISR()
{
    digitalWriteFast(outPin, outputState);
    outputState = !outputState;
}

void setup()
{
    CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL;  //<-------- Switch Clock to 150MHz
    while (!Serial) {};

    pinMode(outPin, OUTPUT);

    bool OK = sampleRate.begin(ISR, samplePeriod); // Check if period is large enough
    Serial.println(OK ? "Timer OK" : "Timer Error");
}

void loop()
{
    while (1) {} // Do nothing...
}


1MHz timer.png
 
Alternatively you can use TeensyTimerTool (https://github.com/luni64/TeensyTimerTool). The master branch already includes the changes from Kurt, i.e. it can handle the 150 MHz clock and accepts sub-microsecond periods. Well, it would, but currently the timer period is an unsigned, so you are limited to 1 MHz (I'll extend that to float in the next days).

Below, an example showing how to use the GPT1 timer for your purpose.

Code:
#include "TeensyTimerTool.h"
using namespace TeensyTimerTool;

Timer sampleRate(GPT1);

constexpr int outPin = 14;         //Output pin
constexpr int samplePeriod = 1;    //Interrupt period in microseconds. Currently only unsigned


FASTRUN void ISR()
{
    digitalWriteFast(outPin, !digitalReadFast(outPin));    
}

void setup()
{
    CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL; //<-------- Switch Clock to 150MHz    

    pinMode(outPin, OUTPUT);
    sampleRate.beginPeriodic(ISR, samplePeriod); 
}

void loop()
{    
}

Or if you prefer terse code:
Code:
#include "TeensyTimerTool.h"
using namespace TeensyTimerTool;

Timer sampleRate(GPT1);

constexpr int outPin = 14;           //Output pin
constexpr unsigned samplePeriod = 1; //Interrupt period in microseconds. Currently only unsigned

void setup()
{
    CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL; //<-------- Switch Clock to 150MHz
    pinMode(outPin, OUTPUT);
   
    sampleRate.beginPeriodic([] { digitalWriteFast(outPin, !digitalReadFast(outPin)); }, samplePeriod);
}

void loop() {}

Output is a clean 1 MHz signal in both cases. If you are interested in how the lambda expression in the second example works, have a look at the readme: https://github.com/luni64/TeensyTimerTool#lambda-expressions-and-callbacks-with-context
 
Thank you all.

I appreciate all of the input and was also interested in all of the side discussions. The current state of this seems to be that there is no significant improvement possible (for now).

I'm limiting my work to what would be available to a general user (i.e., student) right out of the box with a current install of Teensyduino.

The point of the thread was to ask why the Teensy 4.0 is actually slower than the 3.6 in terms of using IntervalTimer and, while it definitely is with the current code, I have gone back to the 3.6 without really understanding if there is a path for improvement (for the masses) or not.

As a side-note, with IntervalTimer and careful de-scrambling of jumbled port pins, I can do jitter-free, parallel I/O at better than 1 MSPS. With a custom version of the Teensy 3.6 layout that brings out two 16-bit ports (1/2 of two 32-bit ports, really) that is still great at 1 MSPS. With code-driven ("nop-tweaked") code, it is possible to hit 20 MSPS but with occasional jitter. Those seem to be the boundaries, and the Teensy 4.0 provides much faster calculations, but much slower parallel I/O due to lack of port availability. Here's to the "Teensy 4.6..." :)
 
As for the PIT timer, we do have other options. We currently have them tied to the 24 MHz OSC clock as set up in startup.c:
Code:
	// PIT & GPT timers to run from 24 MHz clock (independent of CPU speed)
	CCM_CSCMR1 = (CCM_CSCMR1 & ~CCM_CSCMR1_PERCLK_PODF(0x3F)) | CCM_CSCMR1_PERCLK_CLK_SEL;
We do have the option of instead feeding them the same system clock that feeds the ADC and XBAR...
Not sure how much work that would be to do and what other ramifications that might have.

How do we select a different clock source for the PIT?
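From earlier in the thread, the selector is the PERCLK mux in CCM_CSCMR1 (register and macro names as used in the Teensy 4 core; divider handling as in startup.c):

```cpp
// Default (startup.c): PIT & GPT run from the 24 MHz oscillator.
CCM_CSCMR1 = (CCM_CSCMR1 & ~CCM_CSCMR1_PERCLK_PODF(0x3F)) | CCM_CSCMR1_PERCLK_CLK_SEL;

// Alternative: clear the mux bit to feed them the IPG (bus) clock,
// 150 MHz at 600 MHz CPU speed.
CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL;
```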
 