Nanosecond Resolution Interrupts on Teensy 4.0

Status
Not open for further replies.

charlescarver

New member
I have an optical transmitter that encodes information into the position of pulses within a time duration (pulse position modulation). I'm trying to see if it's possible to accurately and consistently (i.e., with nanosecond resolution) capture the rising edge of these pulses with a Teensy 4.0. To accomplish this, I enable the board's cycle counter and attach an interrupt to the input pin. I then store the timing information in a circular buffer, which is processed in loop() when the processor is available:

Code:
#define SIG 3

float clock_speed = 600*1e6;
const int buffer_size = 32;
// Shared between the ISR and loop(), so these are declared volatile
volatile uint32_t buffer[buffer_size];
volatile uint32_t last = 0;
volatile int write_index = 0;
int read_index = 0;

void setup() {
	// Enable the DWT cycle counter
	ARM_DEMCR |= ARM_DEMCR_TRCENA;
	ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;
	attachInterrupt(digitalPinToInterrupt(SIG), pulseReceived, RISING);
}

// ISR: store the number of CPU cycles since the previous rising edge
void pulseReceived() {
	uint32_t current = ARM_DWT_CYCCNT;
	if (last != 0) {
		buffer[write_index % buffer_size] = current - last;
		write_index += 1;
	}
	last = current;
}

void loop() {
	// Print any intervals not yet reported, converted from cycles to ns
	while (read_index < write_index) {
		Serial.print(read_index);
		Serial.print(",");
		Serial.println(buffer[read_index % buffer_size] * 1/(clock_speed) * 1e9);
		read_index += 1;
	}
}

To test the accuracy/stability, I generate 200 external pulses that are around 1560ns apart. Here are the results compared to my oscilloscope:

             Teensy 4.0    Oscilloscope
Max (ns)     1621          1559.810
Min (ns)     1512          1559.308
Mean (ns)    1559.577      1559.693
STDV (ns)    9.234         0.114
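
For reference, the Max/Min/Mean/STDV rows above can be accumulated on the fly without storing all 200 intervals. Below is a minimal sketch using Welford's running-variance method, written in plain C++ so it can be tried on a PC as well as inside a sketch. The `RunningStats` name and its members are made up for illustration and are not from the original post, and it assumes the sample (n - 1) standard deviation is the one wanted:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not from the original post): running min/max/mean/stddev
// of pulse intervals using Welford's algorithm, so no sample history is needed.
struct RunningStats {
    uint32_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;       // sum of squared deviations from the running mean
    double minVal = 0.0;
    double maxVal = 0.0;

    void add(double x) {
        if (count == 0) { minVal = x; maxVal = x; }
        if (x < minVal) minVal = x;
        if (x > maxVal) maxVal = x;
        count += 1;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // Sample standard deviation (divides by n - 1); 0 until two samples exist.
    double stddev() const {
        return count > 1 ? std::sqrt(m2 / (count - 1)) : 0.0;
    }
};
```

In loop(), each interval converted to nanoseconds would be passed to `add()`, and the four values printed once enough pulses have been seen.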

The results aren't horrible, but now I want to optimize it to be as stable/accurate as possible to see if this is a feasible route. I have a couple ideas but wanted to see what else I should consider:

1. Increase Interrupt Priority

It looks like it should be possible by using NVIC_SET_PRIORITY, although I'm not sure what port to use. The guides I've seen (here and here) recommend doing something like:

Code:
NVIC_SET_PRIORITY(IRQ_PORTA, 0);

but IRQ_PORTx is not defined for the Teensy 4.0. I looked in the IRQ_NUMBER_t enum for the 4.0 (imxrt.h) and saw a few potential candidates (IRQ_GPIO1_INT0 looks relevant?) but have no clue which port corresponds to which pin. It looks like there's a macro, digitalPinToPort (pins_arduino.h), which might help identify it.

2. Use a General Purpose Timer in Free-Running Mode

According to the data sheet for the CPU:

Each GPT is a 32-bit “free-running” or “set and forget” mode timer with programmable prescaler and compare and capture register. A timer counter value can be captured using an external event and can be configured to trigger a capture event on either the leading or trailing edges of an input pulse.

Would GPT1 or GPT2 be sufficient to capture the counter value of a rising edge with nanosecond resolution?

3. Increase the Clock Frequency

When I overclocked the CPU, the standard deviation of the pulses was lower (which doesn't seem surprising). Doing this in conjunction with other methods might make the timing more stable?

4. Try Something Else!

I'm very interested in learning about other things I can try. I'm not set on using the Teensy 4.0, but it's an interesting challenge and I want to see how far I can take it.
 
What happens after 32 samples and buffer is exhausted?

This "float clock_speed = 600*1e6;" could be "float clock_speed = F_CPU_ACTUAL;" and it would get the speed at runtime.

Just a note - on the T_4.0 the Cycle Counter is started early in power up to feed the micros() clock calculation, so that isn't needed.
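
As a rough illustration of the conversion being discussed, here is a small sketch in plain C++. The function names are hypothetical; on a Teensy 4.0 the frequency argument would be F_CPU_ACTUAL as suggested above. It also shows why subtracting two cycle counter readings still works when the 32-bit counter rolls over:

```cpp
#include <cstdint>

// Cycles elapsed between two reads of a free-running 32-bit cycle counter.
// Unsigned subtraction wraps modulo 2^32, so this stays correct across one
// counter rollover (about 7.2 s at 600 MHz).
inline uint32_t elapsedCycles(uint32_t previous, uint32_t current) {
    return current - previous;
}

// Convert a cycle count to nanoseconds for a given CPU frequency in Hz.
// Uses double so the result is not limited by float's 24-bit mantissa.
inline double cyclesToNs(uint32_t cycles, uint32_t cpuHz) {
    return static_cast<double>(cycles) * 1e9 / static_cast<double>(cpuHz);
}
```

At 600 MHz one cycle is about 1.667 ns, so every interval measured this way is a multiple of that step; 936 cycles, for example, comes out as 1560 ns.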

On the 1062, I/O was switched to the fast GPIO, so the ports moved to :: // Use fast GPIO6, GPIO7, GPIO8, GPIO9

Reading the pjrc.com/teensy/schematic.html will show the port behind the pins, but those and other things are in the imxrt.h file.

Interesting that it averages out - there is jitter in the _isr response. Using the FAST GPIO also adds overhead to detecting which pin on which port triggered the interrupt. KurtE posted a note about that: there are fewer vectors for the ports, so each _isr has to drill down another layer to find the _isr() to call.
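
To illustrate the extra layer mentioned above: when many pins share one interrupt vector, the shared handler has to inspect the pending bits and look up a per-pin callback before the user's ISR runs. The following is only a plain C++ model of that idea, not the actual Teensy core code; `PinHandler`, `handlers` and `dispatchShared` are hypothetical names.

```cpp
#include <cstdint>

using PinHandler = void (*)();

// Hypothetical per-bit callback table for one 32-bit GPIO port.
static PinHandler handlers[32] = {nullptr};

// Model of a shared port ISR: walk the pending-status bits, call the handler
// registered for each set bit, and return how many handlers were called.
// Every loop iteration is work done before the user's code sees the edge.
inline int dispatchShared(uint32_t pending) {
    int called = 0;
    for (int bit = 0; bit < 32 && pending != 0; ++bit) {
        uint32_t mask = 1u << bit;
        if (pending & mask) {
            pending &= ~mask;              // mark this bit as serviced
            if (handlers[bit] != nullptr) {
                handlers[bit]();
                ++called;
            }
        }
    }
    return called;
}
```

The time spent in this lookup depends on which bits are pending, which is one plausible source of the variable delay between the edge and the point where the cycle counter is read.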
 
What happens after 32 samples and buffer is exhausted?

The buffer is cyclical so new samples should be added to the beginning after filling up. Ideally the read index will be faster than the write index.
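
One caveat with the wrap-around indexing: if the interrupt ever gets more than buffer_size samples ahead of loop(), unread samples are overwritten without any indication. A minimal sketch of a variant that detects this is shown below, in plain C++. `IntervalRing` and its members are hypothetical names, and it assumes a single writer (the ISR) and a single reader (loop()). It also uses unsigned indices, since a signed int index that keeps counting up would eventually overflow.

```cpp
#include <cstdint>

// Hypothetical single-producer/single-consumer ring of interval samples.
// Indices count up forever and are reduced modulo kSize on access, like the
// original sketch; their difference is the number of unread samples.
struct IntervalRing {
    static constexpr uint32_t kSize = 32;   // power of two, so index wrap stays aligned
    volatile uint32_t data[kSize];
    volatile uint32_t writeIndex = 0;       // advanced by the ISR only
    volatile uint32_t readIndex = 0;        // advanced by loop() only
    volatile uint32_t dropped = 0;          // samples rejected because the ring was full

    // Called from the ISR. Returns false (and counts a drop) when full.
    bool push(uint32_t value) {
        if (writeIndex - readIndex >= kSize) { dropped = dropped + 1; return false; }
        data[writeIndex % kSize] = value;
        writeIndex = writeIndex + 1;
        return true;
    }

    // Called from loop(). Returns false when there is nothing to read.
    bool pop(uint32_t &out) {
        if (readIndex == writeIndex) return false;
        out = data[readIndex % kSize];
        readIndex = readIndex + 1;
        return true;
    }
};
```

A non-zero `dropped` count after a test run would mean the Serial printing in loop() is not keeping up with the pulse rate.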

This "float clock_speed = 600*1e6;" could be "float clock_speed = F_CPU_ACTUAL;" and it would get the speed at runtime

Just a note - on the T_4.0 the Cycle Counter is started early in power up to feed the micros() clock calculation, so that isn't needed.

Those are good tips, thank you for that.

On the 1062, I/O was switched to the fast GPIO, so the ports moved to :: // Use fast GPIO6, GPIO7, GPIO8, GPIO9

Reading the pjrc.com/teensy/schematic.html will show the port behind the pins, but those and other things are in the imxrt.h file.

I want to make sure I'm reading the schematic right. Currently I'm interrupting pin 3, which corresponds to the EMC_05 port? But looking at IRQ_NUMBER_t in imxrt.h I don't see any corresponding values.

Interesting that it averages out - there is jitter in the _isr response. Using the FAST GPIO also adds overhead to detecting which pin on which port triggered the interrupt. KurtE posted a note about that: there are fewer vectors for the ports, so each _isr has to drill down another layer to find the _isr() to call.

I was also surprised it averaged out, but I guess it makes sense given the number of samples I tested. Is there any way to disable the FAST GPIO or otherwise reduce the jitter?
 
Indeed it will wrap - I missed the :: [XXX_index % buffer_size]

That matches with " 3 EMC_05 4.5 " posted here :: Teensy-4-0-First-Beta-Test

The schematic also indicates : EMC_05 | G5 for pin 3. Another doc shows before going fast GPIO it would be :: GPIO4_IO5 - not sure beyond that just now.

Paul made this post on prior Teensy jitter - all code on T4 is FASTRUN by default - adjusting priorities might help.
 
Sorry, I will throw out a few things here, but not sure anything will help here or not.

First here is a sketch I was playing around with earlier. If I remember correctly I am not sure if it reduced the interrupt latency much if at all, but shows a few different registers...

Code:
#define IRQ_PIN 0
#define ECHO_PIN 1
#define TRIGGER_PIN 2

#define IRQ2_PIN 3
#define ECHO2_PIN 4
#define TRIGGER2_PIN 5

#define CORE_PIN0_PINREG_SLOW  GPIO1_PSR
#define readFastIRQPin() ((CORE_PIN0_PINREG_SLOW & CORE_PIN0_BITMASK) ? 1 : 0)
uint32_t cycles_per_second = 100;  //
void setup() {
  while (!Serial && millis() < 5000) ;
  Serial.begin(115200);
  pinMode(IRQ_PIN, INPUT);
  pinMode(ECHO_PIN, OUTPUT);
  pinMode(TRIGGER_PIN, OUTPUT);
  digitalWrite(TRIGGER_PIN, LOW);
  Serial.printf("Test IRQ timing:\n    Pins IRQ:%d ECHO:%d TRIGGER: %d\n", IRQ_PIN, ECHO_PIN, TRIGGER_PIN);
  pinMode(IRQ2_PIN, INPUT);
  pinMode(ECHO2_PIN, OUTPUT);
  pinMode(TRIGGER2_PIN, OUTPUT);
  digitalWrite(TRIGGER2_PIN, LOW);
  Serial.printf("    Normal pins: IRQ:%d ECHO:%d TRIGGER: %d\n", IRQ2_PIN, ECHO2_PIN, TRIGGER2_PIN);
  delay(500);

  //---------------------------------------------
  // First lets setup pin 0 to slow mode and direct ISR...
  CCM_CCGR1 |= CCM_CCGR1_GPIO1(CCM_CCGR_ON);
  attachInterruptVector(IRQ_GPIO1_0_15, &pin_isr);
  NVIC_ENABLE_IRQ(IRQ_GPIO1_0_15);
  Serial.println("After Attach"); Serial.flush();
  // I think this will have GPIO1 handle its pin 3
  IOMUXC_GPR_GPR26 = 0xFFFFFFFF;
  Serial.println("After set IOMUXC"); Serial.flush();
  GPIO1_ICR1 = 0x00; // set to 0
  Serial.println("After set ICR1"); Serial.flush();
  GPIO1_GDIR &= ~0x08;  // Make sure set as input in GPIO1
  GPIO1_EDGE_SEL = 0x08; // select any-edge detection for bit 3
  GPIO1_ISR = 0xffff;    // clear pending flags for bits 0-15 (write 1 to clear)
  Serial.println("After set ISR1"); Serial.flush();
  GPIO1_IMR = 0x08;
  Serial.println("After set GPIO"); Serial.flush();

  //---------------------------------------------
  // Next setup pin 3 to use attach interrupt in normal mode
  attachInterrupt(IRQ2_PIN, &pin2_isr, CHANGE);
}

volatile uint32_t irq_count = 0;

void pin_isr(void) {
  digitalWriteFast(ECHO_PIN, digitalReadFast(IRQ_PIN));
  irq_count++;
  GPIO1_ISR = 0x08; // clear the IRQ
  asm("dsb");
}

void pin2_isr(void) {
  digitalWriteFast(ECHO2_PIN, digitalReadFast(IRQ2_PIN));
  irq_count++;
}

void loop() {
  // put your main code here, to run repeatedly:
  Serial.printf("Enter cycles per second default(%d):", cycles_per_second);
  while (!Serial.available()) ;
  uint32_t cps = 0;
  int ch;
  while ((ch = Serial.read()) != -1) {
    if ((ch >= '0') && (ch <= '9')) cps = cps * 10 + ch - '0';
  }
  if (cps) {
    cycles_per_second = cps;
  }
  uint32_t delay_per_cycle = 1000000 / (cycles_per_second * 2);

  irq_count = 0;
  elapsedMicros em = 0;
  for (uint32_t i = 0; i < cycles_per_second; i++) {
    digitalWriteFast(TRIGGER_PIN, HIGH);
    delayMicroseconds(delay_per_cycle);
    digitalWriteFast(TRIGGER_PIN, LOW);
    delayMicroseconds(delay_per_cycle);
  }

  uint32_t delta_time = em;
  Serial.printf("\nDirect IRQs processed:  %u dt: %d calc:%d\n", irq_count,
                delta_time, delay_per_cycle * 2 * cycles_per_second);

  // Now do normal way

  irq_count = 0;
  em = 0;
  for (uint32_t i = 0; i < cycles_per_second; i++) {
    digitalWriteFast(TRIGGER2_PIN, HIGH);
    delayMicroseconds(delay_per_cycle);
    digitalWriteFast(TRIGGER2_PIN, LOW);
    delayMicroseconds(delay_per_cycle);
  }

  delta_time = em;
  Serial.printf("Normal IRQs processed:  %u dt: %d calc:%d\n", irq_count,
                delta_time, delay_per_cycle * 2 * cycles_per_second);

}

Some of this test was a bust, in that I did not get a specific pin's interrupt to happen. I may play with it again at some point to see whether it is possible.

I see some issues in the code above, but I think it was just trying things out and left things in an inconsistent state. But to explain a little:

Suppose you wish to change one (or more pins) back from GPIO6 to GPIO1... Some of the steps needed would be:
You need to enable talking to GPIO1's registers else system will fault:
Code:
  CCM_CCGR1 |= CCM_CCGR1_GPIO1(CCM_CCGR_ON);

You need to tell the system to use GPIO1 instead of GPIO6 for those pins. Currently the line:
Code:
  IOMUXC_GPR_GPR26 = 0xFFFFFFFF;
This says to map all 32 possible GPIO pins from GPIO1 to GPIO6 (which is what I currently have above). To map our Pin 0 (GPIO 1.3 or 6.3) back to GPIO1, the above line should be:
Code:
  IOMUXC_GPR_GPR26 = 0xFFFFFFF7;
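
The 0xFFFFFFF7 value is just "all ones except bit 3". A small sketch of that arithmetic in plain C++ follows. The helper names are hypothetical, and the bit meaning (1 routes the pad to GPIO6, 0 routes it to GPIO1) is taken from the description above:

```cpp
#include <cstdint>

// Assumption taken from the post above: in IOMUXC_GPR_GPR26, a 1 bit routes
// that pad to fast GPIO6 and a 0 bit routes it to regular GPIO1.
// Given the set of bits that should go back to GPIO1, build the register value.
constexpr uint32_t gpr26ValueFor(uint32_t gpio1Bits) {
    return ~gpio1Bits;
}

// Bit mask for a single bit position within a 32-bit GPIO port.
constexpr uint32_t bitMask(unsigned bit) {
    return 1u << bit;
}

static_assert(gpr26ValueFor(bitMask(3)) == 0xFFFFFFF7u, "bit 3 routed to GPIO1");
```

The same `bitMask(3)` value is the 0x08 used for GPIO1_IMR, GPIO1_EDGE_SEL and the GPIO1_ISR clear in the test sketch.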

-----

Some of the other things I would look at include, how is the IO pin you are interested in configured. Again if you look at our logical Pin 0 you may want to look at the register:
IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B0_03 which is on page 652, to see things like Pull up, pull down, Speed, strength, slew rate, ... Not sure if any of these impact your timings and the like, but something to look at.


-----

And now for something completely different: again, I have not studied your requirements, but if you need that level of precision, I wonder whether grabbing a timer value in an ISR is the correct approach. Should you instead be using a timer with something like input capture? That is, have the timer capture the exact cycle at which it detects the IO pin changing value; then in your ISR you can simply grab that value from the timer's register(s). Again, I have not fully looked at the different timers to see if they can give you the resolution you require.

But you might look at some of the libraries that I believe do this, for example PulsePosition. Note: I don't think Paul has picked up the changes to support it on T4. I believe @mjs513 has the most up to date T4 version up at: https://github.com/mjs513/PulsePosition

Again not sure if any of this helps or not.
 
First off, thank you both for the quick and useful replies.

Indeed it will wrap - I missed the :: [XXX_index % buffer_size]

That matches with " 3 EMC_05 4.5 " posted here :: Teensy-4-0-First-Beta-Test

The schematic also indicates : EMC_05 | G5 for pin 3. Another doc shows before going fast GPIO it would be :: GPIO4_IO5 - not sure beyond that just now.

Paul made this post on prior Teensy jitter - all code on T4 is FASTRUN by default - adjusting priorities might help.

Looking at the doc again, pin 3 also corresponds to PWM4_B2 which might be IRQ_FLEXPWM4_2? Either way, I think I identified some pins with known IRQs (more on that below).

Just saw this today : Measuring Interrupt Latency

That is from this NXP page - not sure if you need to be registered to see it : nxp.com … i.mx-rt1060-crossover-processor-with-arm-cortex-m7-core:i.MX-RT1060

It will need some translation to use PJRC #defines to some degree to the T_4 1062 … but you have a scope. If you can develop something from it, it might be telling or useful.

If the needed translation isn't clear please post ...

Thank you for the link, and I am able to see the doc without being registered. I'll take a look and see if I can quantify the latency a bit better.

Some of the other things I would look at include, how is the IO pin you are interested in configured. Again if you look at our logical Pin 0 you may want to look at the register:
IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B0_03 which is on page 652, to see things like Pull up, pull down, Speed, strength, slew rate, ... Not sure if any of these impact your timings and the like, but something to look at.

Good advice - which doc are you referencing?

And now for something completely different: again, I have not studied your requirements, but if you need that level of precision, I wonder whether grabbing a timer value in an ISR is the correct approach. Should you instead be using a timer with something like input capture? That is, have the timer capture the exact cycle at which it detects the IO pin changing value; then in your ISR you can simply grab that value from the timer's register(s). Again, I have not fully looked at the different timers to see if they can give you the resolution you require.

But you might look at some of the libraries that I believe do this, for example PulsePosition. Note: I don't think Paul has picked up the changes to support it on T4. I believe @mjs513 has the most up to date T4 version up at: https://github.com/mjs513/PulsePosition

Again not sure if any of this helps or not.

I tested the PulsePosition lib for the T4, and although it seems stable, the resolution is only 20ns which unfortunately doesn't fit my requirements. It also seems that it takes too long to read from the 16 channels and reset for more pulses. That being said, I used the code to grab the IRQ_NUMBER_t values for pins 6, 9, 10, 11, 12, 13, 14, 15, 18, and 19. I tested my original code on pin 6 with the highest priority interrupt and an overclocked CPU (to 1.008 GHz), and here are my new stability results:

Teensy 4.0Oscilloscope
Max (ns)2560.242559.95
Min (ns)2558.232559.33
Mean (ns)2559.592559.62
STDV (ns)0.490.13

This was for 200 pulses about 2560ns apart. The jitter seems significantly reduced, I'm guessing from the combination of pin 6 characteristics (the speed, slew rate, but I still need to confirm), GHz clock, and highest priority interrupt.

I'm going to run more tests to look at the limits of this method. In the meantime, if you guys think of anything else I should try for more reliable timing, please let me know!
 
Good advice - which doc are you referencing?
The datasheet which you can download from: https://www.pjrc.com/teensy/datasheets.html

I tested the PulsePosition lib for the T4, and although it seems stable, the resolution is only 20ns which unfortunately doesn't fit my requirements. It also seems that it takes too long to read from the 16 channels and reset for more pulses. That being said, I used the code to grab the IRQ_NUMBER_t values for pins 6, 9, 10, 11, 12, 13, 14, 15, 18, and 19. I tested my original code on pin 6 with the highest priority interrupt and an overclocked CPU (to 1.008 GHz), and here are my new stability results:
The question might be, is the 20ns a hard rule for all of the timers, or potentially just what the timers were configured for?

Again I don't know, but threw it out there as a potential thing to look at...
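
One way to reason about that question is to compute the tick length for a given timer clock and prescaler. The sketch below is plain C++ arithmetic only; `tickNs` is a hypothetical helper, and the 150 MHz and 200 MHz figures are assumed peripheral clock values for illustration, not measured ones:

```cpp
#include <cstdint>

// Duration of one timer tick in nanoseconds for a given input clock (Hz)
// and integer prescaler. Illustration only; not tied to a specific timer.
constexpr double tickNs(double clockHz, uint32_t prescaler) {
    return 1e9 * static_cast<double>(prescaler) / clockHz;
}
```

Under those assumptions, a 150 MHz clock divided by 4 gives ticks of about 26.7 ns, the same clock undivided gives about 6.7 ns, and 200 MHz undivided gives 5 ns. So a lower prescaler would improve on a divided-down configuration, but a timer fed from a clock in that range would still be coarser than the CPU cycle counter.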
 
@charlescarver - What cooling are you using for reliable operation at 1 GHz? ( 1.48 actually at 996MHz unless the boards file edited ) Without good/proper heat removal it can toast the MCU.

That latency doc notes F_bus goes to 200 IIRC at 800 MHz - that will help going up from 150 at 600 MHz. Testing at 816 probably better for long term MCU life if going to OC.
 
@charlescarver

If you are using the PulsePosition library on the T4 you may want to take a look at the PulsePositionIMXRT.cpp file. There are several timing parameters that are hardcoded in the library. I extracted them below for you.

Code:
// Timing parameters, in microseconds.

// The shortest time allowed between any 2 rising edges.  This should be at
// least double TX_PULSE_WIDTH.
#define TX_MINIMUM_SIGNAL   300.0

// The longest time allowed between any 2 rising edges for a normal signal.
#define TX_MAXIMUM_SIGNAL  2500.0

// The default signal to send if nothing has been written.
#define TX_DEFAULT_SIGNAL  1500.0

// When transmitting with a single pin, the minimum space signal that marks
// the end of a frame.  Single wire receivers recognize the end of a frame
// by looking for a gap longer than the maximum data size.  When viewing the
// waveform on an oscilloscope, set the trigger "holdoff" time to slightly
// less than TX_MINIMUM_SPACE, for the most reliable display.  This parameter
// is not used when transmitting with 2 pins.
#define TX_MINIMUM_SPACE   5000.0

// The minimum total frame size.  Some servo motors or other devices may not
// work with pulses that repeat more often than 50 Hz.  To allow transmission
// as fast as possible, set this to the same as TX_MINIMUM_SIGNAL.
#define TX_MINIMUM_FRAME  20000.0

// The length of all transmitted pulses.  This must be longer than the worst
// case interrupt latency, which depends on how long any other library may
// disable interrupts.  This must also be no more than half TX_MINIMUM_SIGNAL.
// Most libraries disable interrupts for no more than a few microseconds.
// The OneWire library is a notable exception, so this may need to be lengthened
// if a library that imposes unusual interrupt latency is in use.
#define TX_PULSE_WIDTH      100.0

// When receiving, any time between rising edges longer than this will be
// treated as the end-of-frame marker.
#define RX_MINIMUM_SPACE   3500.0

// convert from microseconds to I/O clock ticks
#define CLOCKS_PER_MICROSECOND (150./4)  // pcs 8+2
#define TX_MINIMUM_SPACE_CLOCKS   (uint32_t)(TX_MINIMUM_SPACE * CLOCKS_PER_MICROSECOND)
#define TX_MINIMUM_FRAME_CLOCKS   (uint32_t)(TX_MINIMUM_FRAME * CLOCKS_PER_MICROSECOND)
#define TX_PULSE_WIDTH_CLOCKS     (uint32_t)(TX_PULSE_WIDTH * CLOCKS_PER_MICROSECOND)
#define TX_DEFAULT_SIGNAL_CLOCKS  (uint32_t)(TX_DEFAULT_SIGNAL * CLOCKS_PER_MICROSECOND)
#define RX_MINIMUM_SPACE_CLOCKS   (uint32_t)(RX_MINIMUM_SPACE * CLOCKS_PER_MICROSECOND)
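
To make the macro arithmetic above concrete, here is the same conversion written as a small function in plain C++. `usToClocks` and `kClocksPerMicrosecond` are hypothetical names; the 150 / 4 factor is copied from the listing:

```cpp
#include <cstdint>

// Same conversion as the *_CLOCKS macros above, written as a function.
// 150 / 4 = 37.5 timer clocks per microsecond (value copied from the listing).
constexpr double kClocksPerMicrosecond = 150.0 / 4;

// Convert microseconds to timer clocks; the cast truncates toward zero,
// matching the (uint32_t) cast in the macros.
constexpr uint32_t usToClocks(double microseconds) {
    return static_cast<uint32_t>(microseconds * kClocksPerMicrosecond);
}
```

With these values, TX_PULSE_WIDTH (100 us) becomes 3750 clocks and RX_MINIMUM_SPACE (3500 us) becomes 131250 clocks.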
 