T4 interrupt latency and etc

DrM

I am trying to implement a state machine in an ISR, and I am finding some interesting timing problems.

Here is a synopsis of how it works, but you can skip this for the question.
The external parts are an S11639-01 (CCD sensor) and an MCP33131D (ADC), which connects to the T4 by SPI. The CCD has an output, TRG, that is intended to serve as the convert trigger for an ADC. TRG runs continuously with a small offset from the master clock. Exposure begins when the T4 sets the ST pin high, and ends 48 clocks after ST is set low. At 88 clocks after ST goes low, the "video" signal is clocked to the output pin. The ADC should start sampling then.

The following code snippet is from the top of the ISR; it simply sets a pin on entry to the ISR. "sparePin" is pin 2 on the T4.
1710530795973.png



Here is a scope trace of what that looks like. The blue trace is the TRG signal, connected to pin 7 on the T4. The ISR is attached to pin 7, rising edge. Notice the 120 nsec delay from the trigger until the pin is asserted in the ISR. The second image shows that the timing is constant.

So, that is a little surprising, it seems like it takes 72 instruction cycles to get to the ISR? On a bare metal platform, one would expect just a few instruction cycles to get to the ISR, certainly under 10. Is there a way to speed this up?

1710530951935.png

1710531685739.png
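
For reference, the cycle count quoted above follows directly from the clock rate. Here is a minimal sketch of that arithmetic, assuming the T4 is running at 600 MHz. The helper names `ns_to_cycles` and `cycles_to_ns` are made up for illustration and are not part of the Teensy core.

```cpp
#include <cassert>

// Hypothetical helpers, not part of any Teensy API.

// Convert a duration in nanoseconds to CPU cycles at a given clock rate.
double ns_to_cycles(double nanoseconds, double cpu_hz) {
    assert(cpu_hz > 0.0);
    return nanoseconds * 1e-9 * cpu_hz;
}

// Convert a CPU cycle count to nanoseconds at a given clock rate.
double cycles_to_ns(double cycles, double cpu_hz) {
    assert(cpu_hz > 0.0);
    return cycles / cpu_hz * 1e9;
}
```

With these, `ns_to_cycles(120.0, 600e6)` comes out at about 72 cycles, and the "under 10" cycles hoped for would be a response of roughly 17 nsec or less.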



Okay next issue.

Here we are looking at the time to get to one of the cases in the switch statement. Here is the code snippet, followed by the scope trace. The time to get to the case is now 150 nsecs; the extra 25 nsecs corresponds to about 15 instructions, which is okay. So that is not a problem.

1710531835258.png


1710532070101.png



And finally, here is the ADC readout: first the code and then the traces. Notice there is about 150 nsecs before the transfer begins and another 100 nsecs afterwards. That extra 250 nsecs is a serious limitation for a 16-bit readout that should take only 533 nsecs at 30 MHz. So, is there any way to speed that up? I can move the 150 nsecs inside the "convert" time so that the transfer actually starts promptly, but it would still be helpful to reduce the 100 nsecs after.

1710532658538.png




1710532590846.png
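
To put numbers on the readout budget described above, here is a small sketch of the arithmetic. It assumes a 16-bit transfer at a 30 MHz SPI clock with fixed software overhead before and after; the function names are hypothetical and only for illustration.

```cpp
#include <cassert>

// Hypothetical helper: time on the wire for one SPI transfer, in nanoseconds.
double spi_transfer_ns(int bits, double sck_hz) {
    assert(bits > 0 && sck_hz > 0.0);
    return bits / sck_hz * 1e9;
}

// Fraction of the total readout time that is actually spent clocking bits,
// given fixed software overhead before and after the transfer.
double spi_efficiency(double transfer_ns, double pre_ns, double post_ns) {
    assert(transfer_ns > 0.0 && pre_ns >= 0.0 && post_ns >= 0.0);
    return transfer_ns / (transfer_ns + pre_ns + post_ns);
}
```

Sixteen bits at 30 MHz is about 533 nsecs on the wire. With 150 nsecs before and 100 nsecs after, only about 68% of the readout time is spent transferring data; hiding the 150 nsecs inside the convert time would raise that to about 84%.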




Thank you
 
On a bare metal platform, one would expect just a few instruction cycles to get to the isr,
No, it has to save all the registers it uses to stack which is many cycles. When the interrupt occurs it has no knowledge of which registers are live or not so it has to save all the ones it will clobber, not just the callee-save set like a normal function call.
 
Remember there's also the time to de-glitch the input and process the vectored interrupt, not every part of the chip is clocked at the full 600MHz either. Anyway 72 cycles is only 120ns at 600MHz, which is very fast for an ISR response. Beyond that DMA is the way to go perhaps?
 
10 nsecs is a decent time to first instruction in an ISR, a few cycles for register saves after that or a single instruction context swap in some processors.

Anyway it turns out the ARM7 has 37 registers. So that is the 72 cycles for an unspecific context switch.

And anyway, as it turns out, I solved the problem. I am ignoring the TRG signal, instead triggering on the earlier clock edge, and hand coding the convert signal. That lets me make more optimal use of the time available. So, that's that.
 
So, that is a little surprising, it seems like it takes 72 instruction cycles to get to the ISR?

If you're using attachInterrupt(), which first runs dispatch code to call your function, then 72 cycles sounds about right.

But that's a bit of guesswork, since I can't see the portion of your code which sets up the interrupt. Details matter when you're talking about the number of cycles needed.
 
What you should do for this sort of question is compose a small but complete program anyone can copy into Arduino IDE which demonstrates the behavior you're seeing. Then share that program as text, not images, so anyone here can indeed run it on their Teensy and see the same behavior.

Here's an example:

Code:
void myfunction() {
  digitalWriteFast(4, HIGH);
  delayMicroseconds(10);
  digitalWriteFast(4, LOW);
}

void setup() {
  pinMode(3, INPUT_PULLUP);
  attachInterrupt(3, myfunction, RISING);
  pinMode(4, OUTPUT);
  analogWrite(2, 128); // short pins 2 and 3
}

void loop() {
}

When I run this on a Teensy 4.1, I'm seeing even slightly longer time, about 170ns.

1710540155638.png



But the main point I want to emphasize is you can and should show a complete program as copyable text for this sort of situation, so anyone can quickly and easily REPRODUCE the issue.

This is especially important because small details can make quite a difference. As you described the problem, nobody can even SEE the specific way you configured the interrupt, much less REPRODUCE it.
 
@PaulStoffregen Fair point, and I did intend but forgot to include the attachInterrupt(). I was in a hurry over unrelated matters.

Anyway, your code example demonstrates everything there is to the question, that is indeed how I am setting up the interrupt. Is there another way provided in an API? Or is the alternative hand coding it at the register level?
 
Yes, there is a way. You can let attachInterrupt do most of the setup, then use attachInterruptVector to cause your function to run directly rather than the generic one. Inside your function, you'll need to clear the interrupt state for the pin you used. You can look at the generic code inside interrupt.c for a few hints. Also see the GPIO chapter in the reference manual for the interrupt status register you need to clear. If you don't clear that bit, your program will keep running the interrupt function over and over as an infinite loop.

If you try and get stuck, please show a complete program which can be copied into Arduino IDE.
 
@PaulStoffregen Hi Paul

So looking at interrupt.c, it seems like irq_anyport() is the generic handler, and the source of the interrupt is cleared in the line gpio[ISR_INDEX] = status (the usual "write the bit to clear the bit" kind of thing), and the line attachInterruptVector(IRQ_GPIO6789, &irq_gpio6789); connects it to the interrupt.

Did I guess that correctly?

In other words, I need to call this to load the ISR:

C:
attachInterrupt( pin, myISR, RISING );
attachInterruptVector(IRQ_GPIO6789, &myISR);

And at the beginning of my ISR, I need to do this, if I want to simply clear anything and everything that might be pending in that GPIO register (which would be perfectly okay in this instance. There is another pin that triggers frames, but it should be ignored in the state where this is happening anyway).

C:
    uint32_t status = gpio[ISR_INDEX] & gpio[IMR_INDEX];
    if (status) {
        gpio[ISR_INDEX] = status;   // write 1s to clear the pending flags
    }

And it seems like that could shave off maybe something like n x 20 ops or so, with n depending on which pin it is on and what else is pending.
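
To make the "write the bit to clear the bit" behavior concrete, here is a host-side model that can be run on a PC. It is only a simulation written for illustration, not the i.MX RT register definitions: the `GpioModel` struct and its member names are hypothetical, and the real hardware has more state than this.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a GPIO interrupt status register with
// write-1-to-clear (W1C) semantics. Hypothetical, for illustration only.
struct GpioModel {
    uint32_t isr = 0;  // pending flags, one bit per pin
    uint32_t imr = 0;  // interrupt mask (1 = that pin's interrupt is enabled)

    // Hardware side: an edge on a pin latches its pending flag.
    void latch_edge(unsigned bit) { assert(bit < 32); isr |= (1u << bit); }

    // Software side: writing 1s clears those flags, writing 0s leaves them.
    void write_isr(uint32_t value) { isr &= ~value; }

    // The interrupt request stays asserted while any enabled flag is set.
    bool irq_asserted() const { return (isr & imr) != 0; }
};

// Mirrors the snippet above: clear every enabled pending flag and
// return which ones were set.
uint32_t clear_pending(GpioModel &gpio) {
    uint32_t status = gpio.isr & gpio.imr;
    if (status) {
        gpio.write_isr(status);
    }
    return status;
}
```

In this model, `irq_asserted()` stays true until `clear_pending()` has run, which is the infinite-loop hazard Paul describes if the ISR never clears its bit. Flags for pins that are not enabled in the mask are left untouched.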
 
@PaulStoffregen P/S - further thought on the previous. I think I need to restore the previous interrupt handler when I am done, so that external triggers will be recognized again. That would mean doing something like this before I leave the ISR, yes?

C:
attachInterruptVector(IRQ_GPIO6789, &irq_gpio6789);

But would it be necessary to do this (I imagine the answer is no, we're in interrupt state and it is going to be re-enabled anyway)?

NVIC_ENABLE_IRQ(IRQ_GPIO6789);
 
Okay, with that ado, and pending comments, I'll give it a try, on labrat hardware, and see what happens.
 
@PaulStoffregen Hi Paul, it works! Thank you. I have one question though: how do we restore the generic interrupt handler? Trying to load it with attachInterruptVector() gives an undefined reference. Another strategy, actually the one I might favor, would be to save the current address from the table and then restore it when I am done with the "directconnect" ISR.

The program is attached. It benchmarks a few digital I/O functions. The latency tests are at the bottom. I know I should trim it to a simple minimal example, but it is actually pretty simple.

Here are the service routines that I use to load the isr into the jump table, clear the interrupt state register, and my attempt to restore the generic isr (commented out at present).

C:
#include "Arduino.h"
#include "imxrt.h"
#include "pins_arduino.h"

#define IMR_INDEX   5
#define ISR_INDEX   6

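volatile uint32_t *gpio;   // base of the GPIO register block for the interrupt pin
//void irq_gpio6789(void);

// Remember the pin's GPIO registers and point the shared GPIO6-9 vector
// straight at the user function, bypassing the generic dispatcher.
inline void directconnect( uint8_t pin, void (*function)(void) ) {
  gpio = portOutputRegister(pin);
  attachInterruptVector(IRQ_GPIO6789, function);
}

// Clear every enabled pending flag (write 1 to clear); call this first in the ISR.
inline void directclear( ) {
  uint32_t status = gpio[ISR_INDEX] & gpio[IMR_INDEX];
  if (status) {
    gpio[ISR_INDEX] = status;
  }
}

// Intended to restore the generic handler; disabled because the linker
// cannot resolve irq_gpio6789 from the sketch.
inline void directrestore() {
  //attachInterruptVector(IRQ_GPIO6789, &irq_gpio6789);
}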


Here is the ISR:

C:
void timing_const_fast_isr() {

  directclear();
 
  cpucycles = elapsed_cycles();
  digitalWriteFast(OUTPIN,LOW);

  cpuavg += cpucycles;
  if ( cpucycles > cpumax) cpumax = cpucycles;
 
  //  Serial.println("isr");
}

And here is the code that connects the isr and runs it.
C:
    pinMode(INPIN,INPUT);
    pinMode(OUTPIN,OUTPUT);
    digitalWriteFast( OUTPIN, LOW );

    // attachInterrupt() configures the pin; directconnect() then swaps in the direct vector
    attachInterrupt(digitalPinToInterrupt(INPIN), timing_const_fast_isr, RISING);
    directconnect( digitalPinToInterrupt(INPIN), timing_const_fast_isr );

    cpuavg = 0;
    cpumax = 0;
    for (int n = 0; n < NKNTS; n++ ) {

      cpucycles = 0;
      elapsed_cycles_start();
      digitalWriteFast(OUTPIN,HIGH);

      delayMicroseconds(1);
      if (!cpucycles) {
        Serial.print("isr did not run ");
        Serial.println(n);
        break;
      }
    }

    directrestore();
    detachInterrupt( digitalPinToInterrupt(INPIN));


In case it is of interest, here is the output, first without the "directconnect" and then with directconnect. Notice the speed up is a solid 50% and the jitter is now 0.

C:
#define CYCLECLOCK_OVERHEAD_CYCLES 5

#define LATENCY_CONST_MEASUERED_CYCLES 104
#define LATENCY_CONST_MEASUERED_NANOSECS 173.3

#define LATENCY_CONST_MEASUERED_CYCLES_MAX 131
#define LATENCY_CONST_MEASUERED_NANOSECS_MAX 218.3

#define LATENCY_DIRECT_CONST_MEASUERED_CYCLES 51
#define LATENCY_DIRECT_CONST_MEASUERED_NANOSECS 85.0

#define LATENCY_DIRECT_CONST_MEASUERED_CYCLES_MAX 51
#define LATENCY_DIRECT_CONST_MEASUERED_NANOSECS_MAX 85.0
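
The "solid 50%" and "jitter is now 0" figures can be checked from the numbers above. This is a small sketch of that arithmetic, assuming the unsuffixed cycle count is the average latency and the _MAX value is the worst case; the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helpers for summarizing the measurements above.

// Fractional reduction in latency going from `before` to `after` cycles.
double speedup_fraction(double before, double after) {
    assert(before > 0.0);
    return 1.0 - after / before;
}

// Worst-case jitter as a fraction of the average latency.
double jitter_fraction(double avg_cycles, double max_cycles) {
    assert(avg_cycles > 0.0 && max_cycles >= avg_cycles);
    return (max_cycles - avg_cycles) / avg_cycles;
}
```

Going from 104 to 51 cycles is a reduction of about 51%. The generic path shows about 26% jitter (131 versus 104 cycles), while the direct path, with 51 cycles for both average and maximum, shows none.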


So, the remaining question is, how do we gracefully restore the generic handler? I realize I could simply call attachInterrupt(), but it seems like I should be able to either save whatever is in the register to begin with, and then reload it, or explicitly load the generic ISR with attachInterruptVector(), if only the linker could find it.
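
The save-and-reload idea can be sketched on a PC with a stand-in table. Everything here is a model: `vector_table`, `swap_handler`, and the two placeholder handlers are hypothetical names, and the table size and slot number are arbitrary. On the Teensy 4 core, I believe attachInterruptVector() simply stores the function pointer into a RAM vector table (`_VectorsRam[irq + 16]` in imxrt.h), so reading that entry before overwriting it would be the analogue, but treat that as an assumption to verify against the core source.

```cpp
#include <cassert>
#include <cstddef>

using Handler = void (*)();

// Stand-in for a RAM vector table; size and slot index are arbitrary here.
constexpr std::size_t kTableSize = 8;
Handler vector_table[kTableSize] = {};

// Install `replacement` in `slot` and return whatever was there before,
// so the caller can put it back later.
Handler swap_handler(std::size_t slot, Handler replacement) {
    assert(slot < kTableSize);
    Handler previous = vector_table[slot];
    vector_table[slot] = replacement;
    return previous;
}

// Placeholder handlers for the demonstration.
void generic_handler() {}
void direct_handler() {}
```

Usage would be `Handler saved = swap_handler(slot, direct_handler);` when connecting, and `swap_handler(slot, saved);` when done. This avoids needing the generic handler's symbol at link time, since its address is read back from the table.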
 

Attachments

  • Controller_Benchmark_240312.ino
    18.4 KB
@PaulStoffregen

Yes, it was the attachInterrupt() that I was concerned about.

In interrupt.c, at line 89 it invokes NVIC_ENABLE_IRQ(IRQ_GPIO6789); Does that mean that interrupts become enabled again while I am still in my ISR?

I thought about reattaching the same pin of course, and then detach it (I am detaching it to stop the state engine), but that line invoking NVIC_ENABLE_IRQ gave me some concern about what would actually happen if I did that.

P/S I forgot to mention, in the actual application, I am doing the detach inside of the ISR/state engine.
 
In interrupt.c, at line 89 it invokes NVIC_ENABLE_IRQ(IRQ_GPIO6789); Does that mean that interrupts become enabled again while I am still in my ISR?

No. NVIC uses priority levels. As long as you don't reconfigure the priority, you'll never get a recursive interrupt. Upon entry to the ISR, the NVIC knows you're now running at that interrupt's priority level. It won't allow more interrupts of the same or lower priority, even if they are otherwise enabled.

On very old 8 bit architectures, often a single global interrupt enable bit would be automatically cleared by hardware and then set again when returning to main program. NVIC is far more sophisticated. The enable bits are never changed automatically by entry and exit of interrupt state, because NVIC uses separate priority level state.
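
The priority rule described above can be written as a one-line predicate. This is a deliberately simplified model for illustration: it assumes the Cortex-M convention that a lower number means higher priority, and it leaves out priority grouping, faults, and the PRIMASK/BASEPRI masks. The names `can_preempt` and `kThreadMode` are hypothetical.

```cpp
#include <cassert>

// Simplified model of Cortex-M preemption. Lower numeric value means
// higher priority. Sub-priorities, faults, and mask registers are ignored.
constexpr int kThreadMode = 256;  // sentinel: no handler active, anything may run

// Returns true if a pending, enabled interrupt may preempt what is running.
bool can_preempt(int pending_priority, int current_priority) {
    assert(pending_priority >= 0 && pending_priority < 256);
    return pending_priority < current_priority;
}
```

In this model an interrupt at the same priority as the running handler never preempts it, so re-enabling the IRQ from inside its own ISR cannot cause recursion; the pending request simply waits until the handler returns.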

To really learn the finer details of NVIC and pretty much everything Cortex-M, this book is the best source.


In theory it's also all covered in ARM's reference materials, but those are extremely difficult to read for learning. Joseph Yiu's book is well worth the price for how much more approachable it is to read.
 
@PaulStoffregen
Hi Paul, two things perhaps of interest.

First, here is my API for the direct attach. I think it would be nice to have something like this available in the general API. If so, the directDetach could be simplified.

C:
volatile uint32_t *directgpio;
volatile uint32_t directmask = 0;
void (*directfunction)(void) = NULL;   // remembers the attached ISR (was undeclared)

inline void directAttach( uint8_t pin, void (*function)(void), int mode ) {
  directgpio = portOutputRegister(pin);
  directmask = digitalPinToBitMask(pin);
  directfunction = function;
  attachInterrupt(pin, function, mode);
  attachInterruptVector(IRQ_GPIO6789, function);
}

inline uint32_t directClear( ) {
  uint32_t status = directgpio[ISR_INDEX] & directgpio[IMR_INDEX];
  if (status) {
    directgpio[ISR_INDEX] = status;
  }
  return status & directmask;
}

void directnoop() {
}

inline void directDetach(uint8_t pin) {
  attachInterrupt(pin,directnoop,RISING);
  detachInterrupt(pin);
}

Here is an example ISR:

C:
void timing_direct_isr() {
  if (directClear()) {
    cpucycles = elapsed_cycles();
    digitalWriteFast(outputpin,LOW);

    cpuavg += cpucycles;
    if ( cpucycles > cpumax) cpumax = cpucycles;
  }
}

Here is the performance comparison. The extra cycles (56 versus 51 for the earlier version) comprise about two cycles to return the masked status and about four to check it in the ISR. But the jitter is still 0. That's the important part for realtime (else it is a different kind of realtime). I write these out as macros because I plan to use them in a header file to generate parameters for timing the state machine.

C:
#define CYCLECLOCK_OVERHEAD_CYCLES 5
#define LATENCY_DIRECT_MEASUERED_CYCLES 56
#define LATENCY_DIRECT_MEASUERED_NANOSECS 93.3
#define LATENCY_DIRECT_MEASUERED_CYCLES_MAX 56
#define LATENCY_DIRECT_MEASUERED_NANOSECS_MAX 93.3

#define LATENCY_MEASUERED_CYCLES 107
#define LATENCY_MEASUERED_NANOSECS 178.3
#define LATENCY_MEASUERED_CYCLES_MAX 129
#define LATENCY_MEASUERED_NANOSECS_MAX 215.0
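
As an illustration of writing measurements out as macros, here is a minimal host-side sketch that formats a cycles/nanoseconds pair in the same style as the listing above. The function name `latency_macros` is hypothetical, and the 600 MHz clock is passed in as a parameter rather than assumed. On the Teensy the same formatting could be sent to Serial instead of returned as a string.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: format a cycles/nanoseconds macro pair.
// `prefix` is the macro name stem, e.g. "LATENCY_DIRECT_MEASUERED".
std::string latency_macros(const std::string &prefix, unsigned cycles, double cpu_hz) {
    assert(cpu_hz > 0.0);
    const double nanosecs = cycles / cpu_hz * 1e9;
    char buffer[256];
    // snprintf truncates rather than overflowing if the prefix is very long.
    std::snprintf(buffer, sizeof buffer,
                  "#define %s_CYCLES %u\n#define %s_NANOSECS %.1f\n",
                  prefix.c_str(), cycles, prefix.c_str(), nanosecs);
    return std::string(buffer);
}
```

Keeping the nanosecond value derived from the cycle count, rather than typed in separately, means the two macros cannot drift apart if the clock rate changes.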


The second thing, perhaps of interest: all of the above goes awry if, instead of loading the user ISR, I load something that calls the user ISR (in that version I save its address from the attach function). (Apologies for not including the code; I just ripped it out and am not particularly eager to reconstruct it.) That one extra layer of call costs another 30 or so cycles but, more importantly, it produces the 25% jitter that we saw with the generic API. And it seems like the rest of the generic API should otherwise produce constant timing when only one pin is involved. So, it seems like calling a function from inside an ISR is what messes up the timing.
 
@DrM :

Quoting from Paul's <post #7>,

"What you should do for this sort of question is compose a small but complete program anyone can copy into Arduino IDE which demonstrates the behavior you're seeing. Then share that program as text, not images, so anyone here can indeed run it on their Teensy and see the same behavior."

I would love to play/experiment with this. However, since only selected snippets of code have been posted/included, alas, my implementation would be, at best, a guess at accurately reproducing what you've done. Any chance of getting a complete sketch that we can load up to play with ??

Mark J Culross
KD5RXT
 
@kd5rxt-mark I attached the complete code a few messages back. But here is the current code, which now includes the API described two messages back. This is the API I plan to use in my own work. Please see the attachment.

Start the sketch and then in the serial monitor, type "fast latency", and it will run the new fast ISR test for you. The command "latency" will run the test of the conventional ISR API. The command "help" will tell you what else it can do.

I would like to run this on a few models of Arduino and compare the timings for all of the tests included in the sketch (omitting the last where it is not supported on the hardware). But alas, I only have Teensys.
 

Attachments

  • Controller_Benchmark_240312.ino
    18.8 KB
P/S I plan to put this up on my GitHub, I just haven't gotten around to it. I need to find some time to generate the readme.

Again the idea is "time your Arduino". I tried to stay with the API as far as possible and not go to bit fiddlin'. The fast ISR is the exception.
 