GPT1/2 interrupt latency and startup for both counters

SteveW

I've continued to play with using the GPT timers for frequency counting and am starting to get some working code together. I hope to publish it here as a mini-library when it's more fully formed, to help those doing the same thing, if my limited understanding of C++ classes permits :). This is more of a discussion than a real problem I guess; I'm just interested to get outside thoughts. Anyway, the results from using one counter to time the counts of the other using interrupts are interesting, at best impressive, and I'd be interested in comments from those more expert! I've coded counter initialisation routines that both set up the counters and install a choice of interrupts for rollover and compare1 events. For compare1 there is a freeze option that freezes the count of the other counter in the first line of the compare1 ISR code. So for single-measurement frequency or period counts, the GPT compare1 event freezes counting after a set gate time (or after a set cycle count for period counting). I'm testing with a 10 MHz PWM Teensy output jumpered to both GPT external count inputs so far.

Results: If I configure GPT1 as the gate time counter, with its compare1 value set at 10,000,000 (1 second gate time), and count the frequency input on GPT2, I usually get a result of 10,000,000 counts, very occasionally with a +1 error. The error is a fixed additive offset: I get the same error with either 1 sec or 100 sec gated counts. Incredible accuracy really. But ... this is with GPT1 started first as shown in the code below; GPT1 stops the GPT2 count in the "freeze" ISR as described. If I reverse the counter start order using startGPTCounter(GPT_BOTH_GPT2FIRST), so that GPT2 starts first, the error is consistently +8 counts.

What's happening, given the difference? I think the compare1 interrupt latency I'm getting on the ARM is a little more than I expected, though still somewhere under 1 µs (no more than about 8 periods at 10 MHz). I think that's being compensated by starting GPT1 first with the C code {GPT1_CR |= GPT_CR_EN; GPT2_CR |= GPT_CR_EN;}. There appears to be a delay of a significant fraction of a µs between the two starts when this method is used, and I think this almost exactly cancels the interrupt latency that delays the GPT2 freeze. I'd be interested if anyone agrees with this thinking. Obviously the two delays might vary at different clock frequencies and processor loads, so I'm wary of relying on the apparent compensation. I guess in a real system raising the interrupt priority might help things to be more consistent.
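
One way to test this idea is to treat the two measurements as two equations. The sketch below is only a model of the hypothesis above, not a measured fact: it assumes the excess count equals (freeze latency - start skew) when the gate timer is enabled first, and (freeze latency + start skew) when the counting timer is enabled first. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Both delays are expressed in periods of the input signal (counts).
struct StartStopDelays {
    double freezeLatencyCounts;  // time from compare event to the counting timer being frozen
    double startSkewCounts;      // time between the two enable writes
};

// errGateFirst  : excess counts when the gate timer (GPT1 here) is enabled first
// errGateSecond : excess counts when the counting timer (GPT2 here) is enabled first
StartStopDelays solveDelays(double errGateFirst, double errGateSecond) {
    assert(errGateSecond >= errGateFirst);  // the model needs a non-negative skew
    StartStopDelays d;
    d.freezeLatencyCounts = (errGateSecond + errGateFirst) / 2.0;
    d.startSkewCounts     = (errGateSecond - errGateFirst) / 2.0;
    return d;
}

// Convert a delay in input periods to microseconds for a given input frequency.
double countsToMicroseconds(double counts, double inputHz) {
    return counts / inputHz * 1e6;
}
```

With the figures reported above (roughly 0 and +8 counts at 10 MHz), solveDelays(0, 8) gives 4 counts for each delay, about 0.4 µs apiece under this model.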

I know nothing about ARM assembler but wonder if inline assembler to replace my C, {GPT1_CR |= GPT_CR_EN; GPT2_CR |= GPT_CR_EN;}, might tighten up the "simultaneous" counter start? I'm not sure whether a plain GPTn_CR = x; is faster when compiled, but that risks bugs if it changes other bits unintentionally. I suppose the other way at this problem is to start GPT2 earlier, read it while it is counting, read it again on count completion, and subtract the startup count afterwards, but I'm thinking there are errors in doing it that way too. I suppose both counters could be read while running. But it complicates the code and I'm not sure it would end up being more accurate. I may do this later anyway to code some continuous count routines. I wish the ARM had a simple simultaneous startup option for both counters; it would all be pretty easy if the hardware supported that. Anyway, if anyone can point me at some inline assembler to achieve fast counter starts/stops I'd be very interested to try it out. I've pasted relevant code fragments below.
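
The read-and-subtract alternative mentioned above relies on unsigned wraparound arithmetic. A minimal sketch, with a hypothetical function name; on the Teensy the two arguments would presumably be reads of GPT2_CNT taken at the start and end of the gate:

```cpp
#include <cstdint>

// Counts elapsed between two reads of a free-running 32-bit counter.
// Unsigned subtraction is modulo 2^32, so one rollover between the
// two reads is handled without any special case.
uint32_t elapsedCounts(uint32_t startCount, uint32_t endCount) {
    return endCount - startCount;
}
```

This removes the dependence on when the counter was enabled, but each read still has to cross the peripheral bus, so the timing of the two reads becomes the new source of error.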

Steve

#define GPT1_ROLLOVER_ISR GPT1RollISR // ISR names defined
#define GPT2_ROLLOVER_ISR GPT2RollISR
#define GPT1_COMPARE1_ISR GPT1Compare1ISR
#define GPT2_COMPARE1_ISR GPT2Compare1ISR
#define GPT1_COMPARE1FREEZE_ISR GPT1CompareFreezeISR
#define GPT2_COMPARE1FREEZE_ISR GPT2CompareFreezeISR
#define GPT1 1
#define GPT2 2
#define GPT_BOTH 3
#define GPT_BOTH_GPT2FIRST 4
#define GPT_CLOCK_SOURCE 3
#define GATE_CLOCK_FREQ 10000000
#define NO_INTERRUPT 0 // Counter ISR setup control variable
#define ROLLOVER 1 // ISR setup
#define COMPARE_RESET 2 // ISR setup
#define COMPARE_FREEZE 3 // ISR setup

void initGPTCounter(uint8_t GPTn, uint8_t clkSource, uint32_t compare1=0, uint8_t interrupt=NO_INTERRUPT);
// Sets up a GPT timer. Parameters: 1/2=GPT1/2, clkSource for timer 1..3, compare1 = count up comparison value, interrupt parameter
// controls which ISR is setup: NO_INTERRUPT 0, ROLLOVER 1, COMPARE_RESET 2, COMPARE_FREEZE 3. Defaults for compare 1 and interrupt

// Initiate single frequency count task:

case FREQ_COUNT_SINGLE:
{
taskClearCounts();
float gateTime = 10; // *** Test value for gating in seconds
uint32_t gateCount = (uint32_t)((float)GATE_CLOCK_FREQ*gateTime); // ***** THIS CODE OUT FOR NON_TEST SETUP
initGPTCounter(GPT2,GPT_CLOCK_SOURCE,0,NO_INTERRUPT); // Sets up GPT2 (the counting timer) with clock input on pin 14, clock source options 1..3.
initGPTCounter(GPT1,GPT_CLOCK_SOURCE,gateCount,COMPARE_FREEZE); // Sets up GPT1 (the gate timer) with clock input on pin 25, params: GPTn, clkSource, compare1 value, interrupt type
startGPTCounter(GPT_BOTH);
// **** This code starts GPT1 first (GPT_BOTH==3)produces minimal error +1 count, error is +8 counts if GPT2 started first ***
break;
}

void startGPTCounter(uint8_t GPTn)
{
if (GPTn == GPT_BOTH) {GPT1_CR |= GPT_CR_EN; GPT2_CR |= GPT_CR_EN; return;} // Enable both GPT 1 and 2 near-simultaneously, GPT1 first
if (GPTn == GPT_BOTH_GPT2FIRST) {GPT2_CR |= GPT_CR_EN; GPT1_CR |= GPT_CR_EN; return;} // GPT 2 enabled first
if (GPTn == GPT1) {GPT1_CR |= GPT_CR_EN; return; } // Enable only GPT1. Counter will reset to start from zero assuming GPT_CR_ENMOD is set by init()
if (GPTn == GPT2) {GPT2_CR |= GPT_CR_EN; return;} // Enable only GPT2
}

// Compare freeze ISR for the GPT1 compare 1 counter interrupt

void GPT1CompareFreezeISR()
{
GPT2_CR &= ~GPT_CR_EN; // Freeze the other GPT counter (GPT2), 64 bit gated counts can then be read from the static counter with minimal error
GPT1_CR &= ~GPT_CR_EN; // Freeze GPT1
GPT1_SR |= GPT_SR_OF1; // Clear SR reg compare 1 flag
GPT1Flags |= GPT_SR_OF1; // Interrupt flag variable is global volatile bool type, set the compare 1 flag
while (GPT1_SR & GPT_SR_OF1);
asm volatile ("dsb"); // Prevents ISR firing twice
}
 
You can try something like
{
uint32_t gpt1=GPT1_CR | GPT_CR_EN;
uint32_t gpt2=GPT2_CR | GPT_CR_EN;
asm("":::"memory");
GPT1_CR = gpt1;
GPT2_CR = gpt2;
asm("dsb":::"memory");
}

If that does not work, we can run it through godbolt and see exactly what happens.
Note that the bus to the timers does not run at CPU speed, so you can expect a difference of some cycles.
 
Frank, thanks! That's all very useful. The asm isn't as nasty as I thought it might be. Thanks so much for the explanation about bus speed; I couldn't figure out why things were so slow on a 600 MHz processor. I'll run the code with the asm method later today and report any difference.

Steve
 
I tried it with godbolt: https://godbolt.org/z/dbbefx5K7
Looks like the compiler translates the code above optimally (with -O2):
Code:
foo():
        ldr     r0, .L2
        ldr     r1, .L2+4
        ldr     r2, [r0]
        ldr     r3, [r1]
        orr     r2, r2, #1
        orr     r3, r3, #1
        str     r2, [r0]
        str     r3, [r1]
        dsb
        bx      lr
.L2:
        .word   1075757056
        .word   1075773440

The two str lines are the writes to the GPT registers.

I *think* this goes over the IPG bus, which is clocked at 600 MHz / 4 = 150 MHz.
Or it might be 24 MHz... I'm not that sure about it.
 
It's possible to overclock the IPG bus. When I tried it in the past, it was not 100% stable in every case, but it leads to a pretty amazing speedup.
 
OK, that's really useful. So I doubt it's possible to start these timers any closer together then, with consecutive store instructions being the best I can hope for. I guess the bus speed controlling the timers may be the issue. I will run things with the asm code later just to confirm. I'm using a default Teensyduino installation to compile with here. Even if I can't get them closer, it's helpful to know this going forwards. I can also look at how that compensation effect varies in future. Some sort of calibration could perhaps be built into the system. Overclocking is interesting, if maybe not ideal :)

Steve
 
https://github.com/PaulStoffregen/c...d942c55a2278c6cd72b/teensy4/clockspeed.c#L148

You could try changing both "4"s to 3 or even 2 (600 / 3 = 200 MHz or 600 / 2 = 300 MHz). But, as I said, there may be issues.
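
To put numbers on those divider choices, here is a small sketch. The function names are hypothetical; the 600 MHz core clock and the dividers 4, 3 and 2 are the values mentioned above, and whether the timers really see this clock is the same open question as in the earlier post.

```cpp
// Bus clock obtained by dividing the core clock by an integer divider.
double ipgClockHz(double coreHz, unsigned divider) {
    return coreHz / divider;
}

// Length of one bus cycle in nanoseconds for the same settings.
double ipgPeriodNs(double coreHz, unsigned divider) {
    return 1e9 / ipgClockHz(coreHz, divider);
}
```

For a 600 MHz core this works out to one bus cycle of roughly 6.7 ns at divider 4, 5 ns at divider 3 and 3.3 ns at divider 2.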

Might play! At what point can I change the bus speed with my code, first thing in setup?

OK, so I've now run tests using the suggested assembler. One issue I see with this approach is that we read the register in C before the modify. That matters not a jot before startup, but using the same pattern for the bit clear that freezes the counter might slow the ISR down a little. I wonder how I might better code a read-modify-write in assembler to perform a bit clear in the GPT2_CR register.
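
One possible way to shorten the freeze path, offered as an untested idea rather than a measured improvement, is to compute the "stopped" control value before the gate starts, so that the ISR performs a single store with no read of the register. The sketch below uses an ordinary variable standing in for GPT2_CR so that it compiles off-target; fakeGpt2Cr, kEnableBit, armFreeze, startCounter and freezeNow are hypothetical names, and the approach assumes no other bit of the control register changes while the gate is open.

```cpp
#include <cstdint>

static volatile uint32_t fakeGpt2Cr = 0;     // stand-in for GPT2_CR (assumption)
static const uint32_t kEnableBit = 1u << 0;  // EN is bit 0, matching the "orr ..., #1" in the compiler output
static uint32_t stoppedValue = 0;            // control value with EN cleared, cached ahead of time

// Call once after the timer is configured and before it is started.
void armFreeze() {
    stoppedValue = fakeGpt2Cr & ~kEnableBit;  // the only read of the register
}

// Start the counter: a single store of the cached value with EN set.
void startCounter() {
    fakeGpt2Cr = stoppedValue | kEnableBit;
}

// Intended for the first line of the freeze ISR: a single store, no read.
void freezeNow() {
    fakeGpt2Cr = stoppedValue;
}
```

The trade-off is that any change made to the control register after armFreeze() would be silently overwritten by the store in freezeNow().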

Anyway, testing the modified code quoted below, with the bit set and clear done as a store after a C read of the register, does suggest a very slight difference in startup timing for the two counters. Counts over 10 sec at 10 MHz coded in C alone were 10000000 (with the occasional +1) with GPT1 started first and 10000008 with GPT2 started first just now. With the assembler pasted in for both start and stop, results were a consistent 10000002 and 10000006 respectively. So it appears to me that GPT2, enabled second, may be starting slightly faster (1 count) with the assembler, despite the similar compiled code. This would seem to increase the error, as the start and stop delays no longer balance out. At least the error is reduced when doing things the wrong way round. I can't seem to speed up the stop code, at least as the ISR is now coded, which would be the ideal. I get the same results with or without the ISR mod below.

Whatever happens here, the Teensy is likely to deliver very good performance for longer gate times, as these errors are a fixed offset that does not grow with gate time. As it is, the hardware seems capable of managing a +1 count error, so for high accuracy a 100 sec gate time could give about 0.01 Hz resolution at 10 MHz (1 count / 100 sec). It would be interesting if I could speed up that timer stop though, or quantify the error for consistency across frequencies and apply compensation if possible. I will play some more when we have siggen input to the clock pins sorted, though that needs some hardware added for input protection.
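
Because the count error is a fixed offset, its effect on the reported frequency shrinks in proportion to the gate time. A minimal sketch of that arithmetic, assuming a perfect reference clock; the function name is hypothetical:

```cpp
#include <cassert>

// Frequency error in Hz produced by a fixed count error over a given gate time.
double countErrorToHz(double countError, double gateSeconds) {
    assert(gateSeconds > 0.0);
    return countError / gateSeconds;
}
```

For example, a +8 count offset is 0.8 Hz over a 10 sec gate, while a +1 count offset over a 100 sec gate is 0.01 Hz.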

Code mods with asm:

void startGPTCounter(uint8_t GPTn)
{
//if (GPTn == GPT_BOTH) {GPT1_CR |= GPT_CR_EN; GPT2_CR |= GPT_CR_EN; return;} // Enable both GPT 1 and 2 near-simultaneously, GPT1 first
// NEW CODE:

if (GPTn == GPT_BOTH)
{
uint32_t gpt1=GPT1_CR | GPT_CR_EN;
uint32_t gpt2=GPT2_CR | GPT_CR_EN;
asm("":::"memory");
GPT1_CR = gpt1;
GPT2_CR = gpt2;
asm("dsb":::"memory");
}


//if (GPTn == GPT_BOTH_GPT2FIRST) {GPT2_CR |= GPT_CR_EN; GPT1_CR |= GPT_CR_EN; return;} // GPT 2 enabled first
// NEW:
if (GPTn == GPT_BOTH_GPT2FIRST)
{
uint32_t gpt1=GPT1_CR | GPT_CR_EN;
uint32_t gpt2=GPT2_CR | GPT_CR_EN;
asm("":::"memory");
GPT2_CR = gpt2;
GPT1_CR = gpt1;
asm("dsb":::"memory");
}

if (GPTn == GPT1) {GPT1_CR |= GPT_CR_EN; return; } // Enable only GPT1. Counter will reset to start from zero assuming GPT_CR_ENMOD is set by init()
if (GPTn == GPT2) {GPT2_CR |= GPT_CR_EN; return;} // Enable only GPT2
}

// NEW ISR CODE WITH asm BITCLEAR:

void GPT1CompareFreezeISR()
{
//GPT2_CR &= ~GPT_CR_EN; // Freeze the other GPT counter (GPT2), 64 bit gated counts can then be read from the static counter with minimal error

{
uint32_t gpt2=GPT2_CR & ~GPT_CR_EN; //IS THIS LINE OF C SLOWING THINGS MORE THAN THE IDEAL?
asm("":::"memory");
GPT2_CR = gpt2;
asm("dsb":::"memory");
}

GPT1_CR &= ~GPT_CR_EN; // Freeze GPT1
GPT1_SR |= GPT_SR_OF1; // Clear SR reg compare 1 flag
GPT1Flags |= GPT_SR_OF1; // Interrupt flag variable is global volatile bool type, set the compare 1 flag
while (GPT1_SR & GPT_SR_OF1);
asm volatile ("dsb"); // Prevents ISR firing twice
}

Steve
 
Might play! At what point can I change the bus speed with my code, first thing in setup?
Whenever you call set_arm_clock. Or try copying the lines from set_arm_clock (regarding the IPG) into your program.

The other things, and the "assembler".
Note, it is not really assembler.

a)
This thing: asm("":::"memory"); does not add any assembler instruction to your program.
It is an instruction to the compiler that means: whatever you (the compiler) have planned to do with memory, do it before this point.
The other thing, asm("dsb":::"memory"), is a combination of a real assembler instruction and that compiler instruction.
"dsb": well, it's a bit complicated. The CPU has some pipelines. One of them is a pipeline (a write buffer) that gets filled with data that has to be written to memory or to a device.
This makes a lot of sense when a slower bus is involved. The CPU just fills the pipeline at its own high speed without having to wait for the slower bus. Another part of the CPU then does the writes while the CPU continues to work on other instructions.
The "dsb" means: write all the pipeline contents NOW, and do nothing else until that is done. Don't load data and block the bus, just store.
That also means the following instructions have to wait. (You don't get anything for free, as usual.)

b) The simple-looking a |= 1 consists of three steps: 1. load, 2. modify, 3. store.
The goal was to put the stores of both lines as close together as possible.
Before, it was:
a |= 1;
b |= 1;
So we had 1. load, 2. modify, 3. store, 4. load, 5. modify, 6. store.

With the reordering
temp1 = a | 1;
temp2 = b | 1;
- compiler memory barrier -
a = temp1;
b = temp2;
it becomes: 1. load, 2. load, 3. modify, 4. modify (or load, modify, load, modify; it does not matter much), 5. store, 6. store.
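
The two access orders described above can be made visible off-target with a small logging stand-in for the registers. This is only an illustration: LoggedReg, accessLog, enableSequential and enableGrouped are hypothetical names, and on real hardware the order comes from the compiler honouring volatile accesses, not from a class like this.

```cpp
#include <cstdint>
#include <string>

// Records every register access so the order can be inspected afterwards.
static std::string accessLog;

struct LoggedReg {
    char name;
    uint32_t value;

    uint32_t load() {
        accessLog += "L";
        accessLog += name;
        accessLog += ' ';
        return value;
    }
    void store(uint32_t v) {
        accessLog += "S";
        accessLog += name;
        accessLog += ' ';
        value = v;
    }
};

// Equivalent of "a |= 1; b |= 1;": load, modify, store, then load, modify, store.
void enableSequential(LoggedReg &a, LoggedReg &b) {
    a.store(a.load() | 1u);
    b.store(b.load() | 1u);
}

// Equivalent of the temp1/temp2 version: both loads first, stores back to back.
void enableGrouped(LoggedReg &a, LoggedReg &b) {
    uint32_t ta = a.load() | 1u;
    uint32_t tb = b.load() | 1u;
    a.store(ta);
    b.store(tb);
}
```

After enableSequential the log reads "La Sa Lb Sb ", and after enableGrouped it reads "La Lb Sa Sb ", so only the second form leaves nothing between the two stores.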


Now, when you write this:
{
uint32_t gpt2=GPT2_CR & ~GPT_CR_EN; //IS THIS LINE OF C SLOWING THINGS MORE THAN THE IDEAL?
asm("":::"memory");
GPT2_CR = gpt2;
As you can see, it is not useful: it does exactly the same as if it were written in one line, GPT2_CR = GPT2_CR & ~GPT_CR_EN. Exactly the same means no positive or negative effect, so there is no need to add the middle "memory" barrier. The "dsb" then pushes all pipeline contents out to the memory interfaces/buses, which might be helpful here, or not (I don't know).
 
You can use the godbolt link I gave you, and modify the foo() function on the left to see the effect of different ways of writing the code. Even if you don't know assembler, you get an idea of what is happening if you look at the assembler output. ldr is load, str is store, and "r0", "r1" etc. are CPU registers.
You will also see that the middle asm("":::"memory") seems to have no influence. This is because foo() is very short and the GPT registers are volatile. It might look a little different when the code lines are embedded in a larger function, which is why I added it.

(Indeed, we had some occasions in the Teensyduino core where that barrier was needed: some code did not work or was slower than necessary.)

I hope my English was halfway understandable, despite all the typos.
 
Totally understandable, really clear, and thanks so much for taking the time to write that excellent tutorial. Any asm instructions I've pasted in to date have been copied without full understanding, really. It's moved my understanding forwards and I will keep re-reading it to get the fullest understanding. I had been thinking in terms of simpler processors and had no idea pipelining was going on, nor what the :::"memory" line meant before this. I now understand why delays occur when stuffing data to a slower bus via a pipeline, and that those delays can vary depending on what the processor is doing at the time. I will indeed have a look at godbolt. There are days of follow-up work to get the full benefit of what you have taught here! I hope to pay back a little with some library code which will incorporate the principles conveyed to me here regarding the GPT timers, though I need to continue developing the code for a good while yet. I started out wanting in particular to be able to measure a 32.768 kHz crystal frequency to within 0.001 Hz, and it seems the Teensy is going to achieve that once I code the period routine. The resulting counter should be incredibly flexible in terms of setup and in many ways equal to or better than commercial instruments. I'll move on to make two input circuits next so I can start to look at how the counters perform over a wider frequency range.
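
As a rough check on that 0.001 Hz target, here is a sketch of the resolution of period counting, where reference ticks are counted over a whole number of input cycles. It assumes the only error is one reference tick and that the reference is exact; the 10 MHz reference matches GATE_CLOCK_FREQ in the code earlier in the thread, and the function names are hypothetical.

```cpp
#include <cassert>

// Resolution in Hz of a period measurement: with M = referenceHz * gateSeconds
// ticks counted, a one-tick error changes the result by inputHz / M.
double periodCountResolutionHz(double inputHz, double referenceHz, double gateSeconds) {
    assert(referenceHz > 0.0 && gateSeconds > 0.0);
    return inputHz / (referenceHz * gateSeconds);
}

// Measurement time needed to reach a target resolution with period counting.
double gateSecondsForResolution(double inputHz, double referenceHz, double targetHz) {
    assert(referenceHz > 0.0 && targetHz > 0.0);
    return inputHz / (referenceHz * targetHz);
}
```

Under these assumptions a 32.768 kHz input needs a measurement of about 3.3 sec to resolve 0.001 Hz, whereas direct counting with a one-count error would need a 1000 sec gate for the same resolution.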

Steve
 