Teensy 4 Timed Interrupts faster than one microsecond?

Status
Not open for further replies.

n4tlf

Member
Hello all, first time posting here. I have received a lot of good info here reading past posts. Now, I'm confused...

I am trying to build an old-time floppy disk controller for aligning drives, starting with eight-inch Shugart SA-800s. I have the mechanics control all sorted out on both an Arduino Nano and a Teensy 4.1. I can even erase or write all zeros or all ones onto a track with both the Nano and the 4.1, using analogWrite at either 250 kHz or 500 kHz on the 4.1.

What I need now is a way of controlling a "write" pin at a minimum of a 2 or 4 MHz rate (using interrupts?), and a "read" pin at the same rates, so I can control write or read pulses and their widths, depending on a zero or one data bit. I've looked at the timers, PWM, etc... but am confused by several posts regarding timer interrupts, latency, etc. For reading, I could also measure the time between data/clock bit edges, but that may be affected by any noise picked up.

I've been writing code (assembler or C) since the late 70s (8080 CP/M, PCs, Arduinos, BBB, etc...), but am slightly more of a hardware hacker.

Technically, for reading single-sided disks (FM mode), there are 200 ns-wide clock pulses every 8 microseconds, followed by a 200 ns data pulse 4 microseconds later if the data bit is a one, or no data pulse if it is a zero. Writing to the floppy follows a similar pattern, with the data pulse width being wider, 500 ns or so. Temporarily, I have a 74LS123 expanding the read pulse width to slightly over 500 ns.
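To make that timing concrete, here's a small host-side sketch (my own illustration, not the poster's code and not floppy-tested) that turns a stream of data bits into the expected pulse times from the description above: a clock pulse at the start of each 8 µs bit cell, plus a data pulse 4 µs later for a one.

```cpp
#include <vector>

// Model of the FM pulse train described above. Each 8 us bit cell starts
// with a clock pulse; a '1' adds a data pulse 4 us later, a '0' adds none.
// Returns the pulse (leading-edge) times in nanoseconds.
std::vector<long> fmPulseTimes(const std::vector<int>& bits) {
    std::vector<long> times;
    long cellStart = 0;
    for (int b : bits) {
        times.push_back(cellStart);               // clock pulse at cell start
        if (b) times.push_back(cellStart + 4000); // data pulse for a '1'
        cellStart += 8000;                        // next 8 us bit cell
    }
    return times;
}
```

So a one cell produces pulses 4 µs apart, while a zero cell leaves an 8 µs gap between clock pulses, which is what the read side has to distinguish.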

There were comments here that I've seen about issues with the T4 timers early on, and latency in reading input pins at a very fast rate (with interrupts?). I chose the Teensy 4 due to its fast clock, thinking I could do additional data processing between data-acquisition interrupts, AND it has a LOT of RAM (potentially holding a full eight-inch disk image at 241 kB). I have several Teensy 3.2, 3.5, 3.6, 4.0, and 4.1 available. I almost used a 3.5, as it's 5V tolerant on inputs. Therefore, I can change Teensy device easily.

Any ideas or recommendations on what I should look at to get sub-microsecond interrupts working, or other methods?
Thank you for any guidance!
Terry. N4TLF
 
Could I get a little more of a hint please?

I am getting 0.5 microsecond interrupts, but shorter than that doesn’t work. Using the TeensyTimerTool.
Thank you.
Terry
 
I haven't used interrupts much as I wanted to go as fast as possible. I am sampling at a specific rate, so I poll on ARM_DWT_CYCCNT, which increments every CPU clock cycle. It takes about 4 or 5 clock cycles to sample and compare in a loop, and I could do what I needed in about 25 clocks, so I can sample at 20 MHz (600 MHz / 30 clocks). Writing to a GPIO takes 1 or 2 clocks, but reading a GPIO takes 8 clocks (I'm reading, so that is included in the 25 clocks).
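The polling pattern described above might look roughly like this (my own untested sketch; the pin number and 30-cycle period are placeholders). ARM_DWT_CYCCNT is the Cortex-M7 cycle counter exposed by the Teensy core; the signed subtraction makes the compare wrap-safe.

```cpp
// Sketch of cycle-counter-paced sampling on a Teensy 4 (untested).
// At 600 MHz, a 30-cycle period gives a 20 MHz sample rate.
void sampleLoop() {
    const uint32_t period = 30;                        // CPU cycles per sample
    uint32_t next = ARM_DWT_CYCCNT + period;
    while (true) {
        while ((int32_t)(ARM_DWT_CYCCNT - next) < 0) {} // spin until due
        next += period;                                 // schedule next sample
        uint32_t bit = digitalReadFast(3);              // GPIO read: ~8 clocks
        // ... store/process the sample within the remaining cycle budget ...
        (void)bit;
    }
}
```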

Are the interrupts from a timer, or are you using the data pulse as the interrupt? You could have an interrupt from the data pulse and measure the time between pulses - if 4 usec, it is a 1, if longer, it was a 0.
 
For writing a pulse, in the timer interrupt callback, I would set the pin high, delayNanoseconds (500), set the pin low, start the timer for the next interrupt.
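That suggestion might be sketched like this with the TeensyTimerTool's OneShotTimer (untested; pin 2 and the 4 µs spacing are placeholders, and the unit of trigger() is assumed to be microseconds):

```cpp
#include "TeensyTimerTool.h"
using namespace TeensyTimerTool;

OneShotTimer writeTimer;

void writePulse() {
    digitalWriteFast(2, HIGH);
    delayNanoseconds(500);     // ~500 ns write pulse width
    digitalWriteFast(2, LOW);
    writeTimer.trigger(4.0f);  // re-arm for the next bit cell (assumed us)
}

void setup() {
    pinMode(2, OUTPUT);
    writeTimer.begin(writePulse);
    writeTimer.trigger(4.0f);  // start the chain
}

void loop() {}
```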
 
I think it would be useful to see if you can leverage the SPI hardware built into the Teensy. Looking at Wikipedia, it seems that FM encoding could be done by encoding two bits for each data bit and running the clock at 2X the desired speed. I believe some of the Teensys have FIFOs, which would also relax the required timing.
For reading the disk, I guess the SPI would need to be in slave mode, and you would need to recover the 2X clock from the data stream.
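The 2-bits-per-bit expansion suggested above is simple to show (my illustration, runnable on the host): each data bit becomes a clock bit (always 1) followed by the data bit, so one byte becomes a 16-bit word that could be shifted out over SPI at twice the bit-cell rate.

```cpp
#include <cstdint>

// Expand one data byte into a 16-bit FM word, MSB first.
// Each data bit is preceded by a clock bit of 1.
uint16_t fmEncode(uint8_t data) {
    uint16_t out = 0;
    for (int i = 7; i >= 0; --i) {
        out <<= 2;
        out |= 0b10;                // clock bit, always 1
        out |= (data >> i) & 1;     // then the data bit
    }
    return out;
}
```

All-ones data gives a solid 0xFFFF (pulses everywhere), while all-zeros gives the alternating 0xAAAA clock-only pattern.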
 
I used output compares back in the early 90's on old Motorola (NXP used to be Freescale and before that, Motorola) HC11s, creating one shot and PWM outputs, but they can be configured for pretty much any pulse output you want.

I'm assuming the iMX implementation is similar in concept, but likely way more powerful.

But if you want tight timing, done in hardware, output compares are built for this. No libraries exist for this (it would be nice if they did), so you are stuck learning the ins and outs of the timer & OC registers.

From my memory (which might be a bit fuzzy, though I knew this stuff forwards and backwards back then): basically, you select the timer source (clock), a divisor, and the max number you want it to count to**. This creates a clock that counts at a rate and to a number you decide. This counter then lives in a count register you can read if you want.

** The max count register might be a 3rd compare, whose action is to reset the counter.

Then you set the compare register with a value, and an action to do when the values match. Hardware then constantly compares the set value with the counting register. When they match, it does the action (set pin high, low or toggle, for example). Great for creating a square wave.

If you need to create a pulse, then just setup a second compare, to act on the same pin this time to set it low.

Now you have a pulse generator in hardware. It can be one-shot, or continuous. For continuous, the clock counter counts up to the max value, then rolls over, the compares stay where they are and should do it all day long, until you adjust.

If you wanted a one shot pulse, then setup an interrupt to fire once your 2nd compare happens, reset the registers and go back to what you were doing.
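As a very rough sketch of the recipe above on the i.MX RT GPT (untested): register names follow the RT1060 reference manual and Teensy's imxrt.h, but the exact macro spellings (GPT_CR_OM1, CCM_CCGR1_GPT, ...) are from memory and should be checked against the headers before use.

```cpp
// Toggle the GPT1 OC1 output pad on every compare match, entirely in
// hardware. In restart mode (FRR = 0), the compare-1 match also resets
// the counter, matching the "max count as a 3rd compare" note above.
void gptCompareSquareWave() {
    CCM_CCGR1 |= CCM_CCGR1_GPT(CCM_CCGR_ON); // gate the GPT1 clock on
    GPT1_CR = 0;                             // disable while configuring
    GPT1_PR = 0;                             // prescaler: divide by 1
    GPT1_OCR1 = 1000;                        // compare 1: match at count 1000
    GPT1_CR = GPT_CR_CLKSRC(1)               // clock from the peripheral bus
            | GPT_CR_OM1(3)                  // toggle the OC1 pad on match
            | GPT_CR_EN;                     // start counting (restart mode)
}
```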

Now the iMX docs are fairly fat; never thought I'd see the day when they would grow to over 3000 pages. So I've peeked at them, but haven't had a need I couldn't pull off in other ways, yet. Perhaps start with a quick read here: https://www.nxp.com/docs/en/reference-manual/GPTRM.pdf and see if you can apply that to the T4?
 
Thanks everyone for the replies. I’m somewhat familiar with counting clock pulses, comparing the count to a known threshold. But, I’m much more comfortable with timed interrupts, and getting other work done either within them, or around them. I guess that’s the hardware hacker in me!

I’m playing with callback interrupts using the TeensyTimerTool on a 4.0 right now, to see how it works. I can get 1/2 microsecond callbacks OK, with a rather limited number of instructions executing within the interrupt function. Does anyone know if the TeensyTimerTool is set up for a default rate of 150MHz on the bus, or the slower rate? If it’s the slower rate I can probably eke out a few more instructions within the ISR.

I would also like to find a way to get down to 1/4 microsecond interrupt speed if possible. I know, more time lost in overhead... I will check into the other options mentioned above as well: using the SPI, or counting clock cycles. I haven’t messed with assembly/machine code since my 8080 days, but that may be another option.
Thanks again!
Terry
 
Does anyone know if the teensytimertool is setup for a default rate of 150MHz on the bus, or the slower rate?

By default the TimerTool runs the GPT and PIT timers at 24 MHz. You can easily switch them to 150 MHz in the config file. Configuration is described here: https://github.com/luni64/TeensyTimerTool/wiki/Configuration.

You generally can't get much faster than a few MHz interrupt frequency on that processor since the required resetting of the interrupt flag needs to be synced between the involved internal busses which can take quite some time. This will improve @150MHz since the frequencies of the busses are closer then. There are a couple of posts about this in the forum (search for IntervalTimer Efficiency etc.). On the positive side this means that you can do a lot of calculations or other stuff in the ISR for free, since it needs to wait for the sync of the flag at the exit anyway :).

Please be aware that this waiting eats up a significant amount of clock cycles which leaves not much for the foreground tasks. E.g. try to run 4 InterruptTimers at the same time. Last time I checked, they stalled the processor at < 500kHz (IIRC).


I’m somewhat familiar with counting clock pulses, comparing the count to a known threshold. But, I’m much more comfortable with timed interrupts, and getting other work done either within them, or around them.
The TimerTool also contains 20 TCK timers which are implemented in software by counting clock pulses via the ARM cycle counter. You can use them with the same callback interface as the other timers, which might help. They are good to about a 6 MHz call rate in theory (i.e. using an empty loop()). But, being software timers, that of course depends much on what is going on in the foreground...


Another thing: The TimerTool uses a std::function callback mechanism by default. This will add some overhead on the calling side. I don't expect that to be a lot compared to the syncing time, but if you want to try, you can simply switch it to standard void(*f)(void) callbacks in the config file.

Hope that helps.
 
I decided to get the logic analyzer out and measure how long the timer interrupts take in the TeensyTimerTool library with a Teensy 4.0. I varied both CPU speed and Timer clock speed (I used the GPT timer). As expected, at the higher CPU speeds, the Timer clock speed becomes a bigger factor in how long the interrupt takes. At 600 MHz CPU speed, the fastest frequency is about 2 MHz using the 24 MHz timer clock and 4 MHz using the 150 MHz timer clock.

Timer Clock | 150 MHz | 450 MHz | 600 MHz | 816 MHz
24 MHz      |   700   |   480   |   440   |    -
150 MHz     |   560   |   270   |   240   |   180

Interrupt time in nsec, CPU Speed (across top) vs Timer Clock Speed (down left)

I measured the time by continuously toggling an output in the main loop. When an interrupt occurs, the pulse stops. The interrupt time can be measured by how long the pulse stops (see picture below). The only thing done in the timer callback was to toggle another output high and low - about 4-8 clock cycles, so pretty negligible.

logic.png

As a check, I also decreased the timer period until the toggling output stopped completely, indicating that the CPU is starved. These times were very similar to what was measured above, maybe slightly longer.

Here is the code used to create these measurements:

Code:
#include "TeensyTimerTool.h"

using namespace TeensyTimerTool;

PeriodicTimer t1 (GPT2);

void setup() {
  pinMode (12, OUTPUT);
  pinMode (20, OUTPUT);
  t1.begin (callback, 1.000);
}

void callback()
{
  digitalWriteFast(20, 1);
  asm ("dsb");
  digitalWriteFast(20, 0);
}

void loop() {

  while (true) {
    digitalWriteFast(12, 1);
    asm ("dsb");
    digitalWriteFast(12, 0);
    asm ("dsb");
  }

}
 
Nice measurement. Fits well with my experiences. A few remarks:

  • I might be wrong, but, I think that the "150MHz" timer clock speed is not fixed but actually F_CPU / 4.
  • You do not need to sprinkle asm("dsb") in your ISRs (or other code). This is already handled by the TimerTool.
  • I'd be interested if the std::function callbacks add any additional cost. Do you mind doing a quick check with //#define PLAIN_VANILLA_CALLBACKS uncommented in the config file? 600/150 MHz would be the most interesting case.

For the fun of it I took your data and looked at it a bit differently:

As I understand, the bus synchronization penalty comes from the fact that the faster ARM bus needs to wait for the slower Peripheral bus if it needs to read/write to the peripheral. I therefore corrected the "150MHz" to F_BUS/4, calculated a "mismatch" = F_CPU / F_TIMER for all your pairs and translated your absolute times to processor cycles. Here a plot of the interrupt time in cycles against the mismatch:

Screenshot 2021-03-07 085814.jpg

The larger the mismatch the more cycles the processor needs to wait for the sync. Which explains the surprisingly bad performance of timer interrupts on the IMXRT1060. Especially, for the default 24MHz / 600MHz pair (mismatch 25). But, please feel free to correct me if I interpreted something wrong here.
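Redoing that arithmetic as a quick check (my own code, using LAtimes' table values: 440 ns at 600 MHz CPU / 24 MHz timer, 240 ns at 600 MHz / 150 MHz):

```cpp
// Convert a measured interrupt time to CPU cycles, and compute the
// CPU-to-timer clock "mismatch" used in the plot above.
long cycles(long timeNs, long fCpuMHz) {
    return timeNs * fCpuMHz / 1000;   // ns * cycles-per-ns
}

long mismatch(long fCpuMHz, long fTimerMHz) {
    return fCpuMHz / fTimerMHz;
}
```

So the default 24 MHz / 600 MHz pair costs 264 cycles per interrupt at a mismatch of 25, versus 144 cycles at a mismatch of 4 for the 150 MHz timer clock.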
 
  • I might be wrong, but, I think that the "150MHz" timer clock speed is not fixed but actually F_CPU / 4.

The "150 MHz" is actually F_BUS for the GPT. I printed out F_BUS_ACTUAL and it is 75.6 for 150, 150 for 450 and 600, and 204 for 816. So that should make the points in the graph more linear.

  • You do not need to sprinkle asm("dsb") in your ISRs (or other code). This is already handled by the the TimerTool.

My logic analyzer (a Teensy 3.6) was running at 80 MHz, so at the higher CPU speeds it had trouble seeing the pulses. I had a delayNanoseconds in sometimes, and without the dsb, it would start the delay before the output actually changed so I still couldn't see it. The dsb seemed to help make it more consistent.

  • I'd be interested if the std::function callbacks add any additional cost. Do you mind doing a quick check with //#define PLAIN_VANILLA_CALLBACKS uncommented in the config file? 600/150Mhz would be the most interesting case

I'll switch the callback type and see what it does. I also want to look at the TMR (QUAD) timer, which looks like it always runs at 150 MHz.
 
Thank you all for working on this!
I've been experimenting with timer interrupts and a scope as well. The above posts help me understand what I've been seeing. luni and LAtimes information is especially enlightening.

I need to digest how all this affects my plans. I'm looking at how much time (CPU instructions) fits within the ISR, and how much time the ISR overhead takes. 2 MHz may be barely fast enough, if I use one-shots to lengthen the clock and data pulses sent and/or received. Yep, mostly a hardware guy, using chips to help solve a timing issue.

Again all the work above has helped greatly to figure out why such an awesome chip has such timing limitations. I will post more here as I find stuff as well.
Terry, n4tlf
 
I'm looking at how much time (cpu instructions) fit within the isr, and how much time the isr overhead takes.

You need to take into account that you can utilize the time needed for syncing the interrupt flag. Here's an example using the code from LAtimes, replacing the callback with this:

Code:
void callback()
{    
    for (int i = 0; i < 100; i++)
    {
        digitalToggleFast(3);
    }
}

The code just toggles pin 3 100 times. I measured the ISR time for various numbers of toggles from 1 to 200. Here's the result:

Screenshot 2021-03-07 191100.jpg

It clearly shows that you can, for example, toggle a pin up to some 20 times in the ISR without affecting the required time. This is quite different from other processors, where you usually try to keep the ISR as short as possible. Here you can do quite a lot in the ISR without wasting additional time. => in case you have something to calculate in the ISR, not all is lost :)

Edit: Measured at F_CPU = 650 MHz, Timer Clock = F_BUS = 162 MHz.
 
I made some more measurements. I increased the logic analyzer to 120 MHz and removed all of the dsb's, so the measurements are more consistent now. Note that clock cycles are more precise at lower CPU speeds, since the higher CPU speeds are much larger than the logic analyzer sample rate.

I stumbled across the 1 msec timer tick interrupt, so I measured its time just for fun.

Main items (tested on Teensy 4.0):
1. PLAIN_VANILLA_CALLBACKS saves about 2-3 clock cycles.
2. TMR1 interrupt is shorter than GPT1 interrupt by about 20 clock cycles at higher CPU speeds. So if you need speed, try TMR1.
3. The 1 msec timer tick interrupt takes about 30-40 clock cycles.
4. The dsb command added about 5-6 clock cycles. It was hard to measure at the higher CPU speeds.

Here are my tables of interrupt times and interrupt clock cycles. My graph is a slightly different look at how CPU speed affects the number of clock cycles.

Screenshot 2021-03-07 120253.png
 
1. PLAIN_VANILLA_CALLBACKS saves about 2-3 clock cycles.
That's great. I was afraid that the std::function interface is quite expensive which it obviously isn't :)

2. TMR1 interrupt is shorter than GPT1 interrupt by about 20 clock cycles at higher CPU speeds. So if you need speed, try TMR1.
This is weird. The TMR has a combined IRQ for all 4 channels, therefore it needs to cycle through all 4 channels at each ISR. The GPT has its own IRQ and the ISR is very short. Should be the other way round....
Here the TMR ISR: https://github.com/luni64/TeensyTim...03f9a81684da45b72855/src/Teensy/TMR/TMR.h#L65 and here the GPT ISR: https://github.com/luni64/TeensyTim...03f9a81684da45b72855/src/Teensy/GPT/GPT.h#L58

I use "DSB" in the TMR ISR but I use the reread trick (reading necessarily waits until the busses are synced) in the GPT ISR. Might be interesting to change that to DSB as well...


4. The dsb command added about 5-6 clock cycles. It was hard to measure at the higher CPU speeds.
Yes, the used GPIOs are tightly coupled to the ARM bus. There is not much to do for DSB after a simple write to the GPIOs.
 
This is weird. The TMR has a combined IRQ for all 4 channels, therefore it needs to cycle through all 4 channels at each ISR. The GPT has its own IRQ and the ISR is very short. Should be the other way round....

I went back and verified my results just to make sure. I looked at the timing in the interrupt routine before the callback and after the callback.

GPT Pre-callback time: 108 nsec
Post-callback time: 108 nsec

TMR Pre-callback time: 133 nsec
Post-callback time: 42 nsec

So the pre-callback time is longer for TMR than GPT, as expected.

I use "DSB" in the TMR ISR but I use the reread trick (reading necessarily waits until the busses are synced) in the GPT ISR. Might be interesting to change that to DSB as well...

I changed the GPT ISR to use DSB at the end and it sped things up. It changed the post-callback timing from 108 to 42 nsec.

So with DSB, the GPT timing is about 25 nsec faster than TMR at 600 MHz and 58 nsec faster than GPT without DSB.
 
Thanks a lot, updated the repository accordingly.


BTW:
...and 58 nsec faster than GPT without DSB

Doesn't work without DSB (or other syncing code) for short ISRs, because the ISR will be re-entered (the interrupt flag is not yet cleared when the ISR is left).

Example: this works with DSB but not without
Code:
PeriodicTimer t1(GPT1);

void setup()
{
    pinMode(LED_BUILTIN, OUTPUT);
    t1.begin([] { digitalToggleFast(LED_BUILTIN); }, 250ms);
}
void loop()
{
}
 
Thank you guys for this work! I pretty much verified the above. Using the GPT, I can have reliable interrupts down to around 180-200 nsec, with almost nothing inside the ISR other than setting an output pin and a 10-iteration for loop incrementing a variable (as a test). I need to see if I can accomplish the rest of the programming, probably using 250 ns interrupts. It was GREAT following you fleshing this out!
Terry
 