Input capture on teensy 4.1 and registers access time

Mloum · May 25, 2021

Hello everyone,

I am an assistant professor at the University of Angers (France). We are currently trying to use the teensy 4.1 to counts single photons or more precisely to count and tag-time the electrical pulses emanating from single-photon counter (i.e. Avalanche PhotoDiode).

Just for information, from this raw data, we get access to the Brownian motion of nanoparticles in solution and we can measure their size and shape (via Fluctuation Correlation Spectroscopy or Dynamic Light scattering).

Obviously, electronics for time-tagging does exist (and, for the initiated, we do have a TCSPC card) but are quite expensive (a few thousand euros). Open source projects based on FPGA do exist (like this one that we have tested in the past) but FPGA are... hard to use (at least for us now).

The idea is to use a fast micro-controller (the teensy 4.1) to time tag the pulses, thoroughly test it and then create an open-source project (and also publish an article because, apparently, that's why I am paid for...).

Our aim is to time-tag the pulses with a least a temporal precision of, let's say 50ns, with a count rate as high as possible (at least 1 million pulses per second).

For this goal, we are using the input captures of the 32 bits General Purpose Timer (GPT).

We are using double buffering to transfer the capture register to RAM, then from RAM to PSRAM, and then from PSRAM to a fast SDCard. So far so good except for the count rate that can't exceed 500kHz without getting wrong capture times.

Consequently, we have investigated and found an unexpected bottleneck.

We have measured on the oscilloscope the time taken for the capture interrupt in order to take place (see code below) and found it was 1µs (which explains the problem at 500kHz).

By increasing the clock speed sent to the GPT (PERCLCK_CLK_ROOT) from 24Mhz to 300Mhz we have reduced the interrupt total time to around 500ns. Even the minimal interrupt function (only resetting the interrupt flag) takes 150ns.

In other words, accessing the registers for writing or reading takes a lot of cycles and we did not plan it that while crafting all our memory transfers map in order to measure at the higher rate possible.

Now, time for some code.

Here is the setup of the GPT timer (number 2) in order to perform input capture :

Code:

void setupTimerGPT2(void)
{
  // IOMUX Configuration in order to have physical access to capture 1 et 2 pins of GPT2
  //GPT capture 1
  IOMUXC_GPT2_IPP_IND_CAPIN1_SELECT_INPUT = 1;  // remap GPIO_AD_B1_03_ALT8 GPT2 Capture1 (Channel 1)
  IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_03 = 8; // GPT2 Capture1 configuration ALT8 Pin 15
  IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B1_03 = 0x13000; //Pulldown & Hyst

  //GPT capture 2
  IOMUXC_GPT2_IPP_IND_CAPIN2_SELECT_INPUT = 1;  // remap GPIO_AD_B1_04_ALT8 sur GPT2 Capture2 (Channel 2)
  IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_04 = 8; // GPT2 Capture2 configuration ALT8 Pin 40
  IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B1_04 = 0x13000; //Pulldown & Hyst


  //Configuration du bus d'horloge
  // #define CCM_CSCMR1_PERCLK_CLK_SEL   ((uint32_t)(1<<6))
  
  // Change the Clock Controller Module in order to use PERCLK_CLK_ROOT for the counter and not OSC@24MHz (default)
  CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL; //

  // Change the prescaler between AHB_CLK_ROOT (typically 600MhZ) and PERCLK_CLK_ROOT (default is 4 -> 150MHz)
  CCM_CBCDR = CCM_CBCDR_IPG_PODF(1);	// NB I can't get 0 (that is to say no prescaler) to work
   
  // Set the CCM Clock Gating Register
  CCM_CCGR0 |= CCM_CCGR0_GPT2_BUS(CCM_CCGR_ON) |
               CCM_CCGR0_GPT2_SERIAL(CCM_CCGR_ON);  // enable clock
  

  //Clear GPT2 registers, namely  CR, PR and SR   
  GPT2_CR = 0;
  GPT2_PR = 0; //No prescaler.

  // "Clear" bit flags (ROV, IF1 and IF2) writing one in them
  GPT2_SR = GPT_SR_ROV |            //Clear bit ROV
            GPT_SR_IF1 |            //Clear bit IF1
            GPT_SR_IF2;             //Clear bit IF2
             

  //CR  register of GPT2 (Control Register)
  GPT2_CR = GPT_CR_EN |             //EN = 1 activate TIMER GPT2
            GPT_CR_FRR |            //Free run mode
            GPT_CR_CLKSRC(1) |      //Clock source is Peripheral Clock
            GPT_CR_IM1(1) |         //Capture activated on channel 1 on rising edge only
            GPT_CR_IM2(1);          //Capture activated on channel 2 on rising edge only


  //IR register of GPT2 (Interruptions)
  GPT2_IR = GPT_IR_ROVIE |          //Interruption on overflow of the 32bits counter
            GPT_IR_IF1IE |          //Interruption on Channel 1 capture 
            GPT_IR_IF2IE;           //Interruption on Channel 2 capture 
}

Here is the code to start the capture :

Code:

void Start_Capture(void)
{
  // Clear the interrupt flags 
  GPT2_SR = GPT_SR_ROV |            
            GPT_SR_IF1 |            
            GPT_SR_IF2;             

  // Custom variable initialization (not very relevant for this post)
  TimeTagPtr1 = 0;
  TimeTagPtr2 = 0;
  TimeTagNb1 = 0;
  TimeTagNb2 = 0;
  OverflowCount = 0;
  
  //Double buffer management
  halfBuffer1 = false;
  halfBuffer2 = false;


  //enable  IRQ_GPT2 interruption
  NVIC_ENABLE_IRQ(IRQ_GPT2);
}

And here is the code executed during the interrupt capture :

Code:

void GPT2capture() {

    // For timing on the oscilloscope  
    //digitalWriteFast(DEBUG_BLINK_PIN, HIGH);

    //Test the origin of interruption
    
    if (GPT2_SR & GPT_SR_ROV) {
      //32 bits counter overflow 
      GPT2_SR |= GPT_SR_ROV;    //Reset ROV flag
      // Tag the event "overflow" with the special time 0xFFFFFFFF
      PsRamBuffer1[TimeTagPtr1++] = 0xFFFFFFFF;
      PsRamBuffer2[TimeTagPtr2++] = 0xFFFFFFFF;
      // buffer wrap up
      if (TimeTagPtr1 == buffersize) TimeTagPtr1 = 0;
      if (TimeTagPtr2 == buffersize) TimeTagPtr2 = 0;
    }
    
    if (GPT2_SR & GPT_SR_IF1) {
      //capture onchannel 1
      GPT2_SR |= GPT_SR_IF1;   //reset IF1 flag 
      PsRamBuffer1[TimeTagPtr1++] = GPT2_ICR1;   //read and store capture register
      if (TimeTagPtr1 == buffersize) TimeTagPtr1 = 0;
      TimeTagNb1++;
    }
    
    if (GPT2_SR & GPT_SR_IF2) {
      //capture onchannel 2
      GPT2_SR |= GPT_SR_IF2;   //reset IF2 flag 
      PsRamBuffer2[TimeTagPtr2++] = GPT2_ICR2;   //read and store capture register
      if (TimeTagPtr2 == buffersize) TimeTagPtr2 = 0;
      TimeTagNb2++;
    }
    
    asm volatile ("dsb");   // wait for clear  memory barrier

    // For timing on the oscilloscope
    //digitalWriteFast(DEBUG_BLINK_PIN, LOW);
  }

And here is the "minimalistic" interruption I was mentioning earlier and takes 150ns :

Code:

void GPT2capture() 
{

GPT2_SR |= GPT_SR_IF1;	//Reset the interruption flag (assuming that only event are detected on channel 1)
asm volatile ("dsb");
}

As a side note, and this could be a clue, when the interruption is only :

Code:

void GPT2capture() 
{

digitalWriteFast(DEBUG_BLINK_PIN, HIGH);
asm volatile ("dsb");
digitalWriteFast(DEBUG_BLINK_PIN, LOW);
}

I get a periodic signal at "only" 23Mhz (instead of the 150Mhz that digitalWriteFast toggling can attain)

Finally one specific question :
- Could you tell me how to reduce the access time to the registers? Or, conversely, could you explain to me why it can't be reduced?

mborgerson · May 25, 2021

Do you need continuous sampling? If not, could you use an intermittent sampling scheme where you collect data for 10 milliseconds, stop and write to SD card, collect another 10milliseconds, stop and write to SD card, etc. etc..?

If intermittent sampling is acceptable, you might be able to avoid the timer interrupts and other issues by blocking interrupts, going into a tight loop which waits for rising edges, captures the count in the 600MHz CPU clock counter and stores it in an array, the loops back to wait for the next rising edge.

I'll see if I can come up with some example code to verify this algorithm.

mborgerson · May 25, 2021

Here is a sample sketch showing intermittent capture of input pulse timing.

Code:

// Fast_Stamp_4
// Sketch to test fast intermittent time-stamp collection
// M. Borgerson   May 25, 2010

//================
const int LEDpin = 13;
const int inpin1 = 5;
#define  LEDON digitalWriteFast(LEDpin, HIGH);
#define  LEDOFF digitalWriteFast(LEDpin, LOW);
#define  LEDTOGGLE digitalToggleFast(LEDpin);
//*******************************************************

// use a sample size that is multiple of 512 bytes for efficient SD storage
#define MAXSAMPLES 10240
uint32_t timestamps1[MAXSAMPLES];
uint16_t samplecount;

void setup() {
  //Initialize the digital pin 13 as an output for LED.
  pinMode(LEDpin, OUTPUT);
  pinMode(inpin1, INPUT);
  delay(500);
  Serial.begin(9600);
  Serial.println("\n\nFast Time Stamp Test.");
  Serial.println("Press 's' to collect and display sample,  'h for histogram.");

}


//MAIN LOOP routine
//=================
void loop() {
  char ch = 0;

  if (Serial.available()) {
    ch = Serial.read();
      LEDON
      CollectSample(MAXSAMPLES);
      AdjustSamples(MAXSAMPLES);
      LEDOFF
    if (ch == 's') {
      ShowSamples(200);
    }
    if (ch == 'h') {
      ShowHisto(MAXSAMPLES);
    }
    Serial.println("Press 's' to collect and display sample,  'h for histogram.");
  }

}

// Collect time stamps for rising edges with interrupts off
void  CollectSample(uint32_t numsamples) {
  uint16_t scount = 0;
  while (digitalReadFast(inpin1)); // wait for pin low to start
  noInterrupts();
  do {
    while (!digitalReadFast(inpin1)); // wait for rising edge
    timestamps1[scount] = ARM_DWT_CYCCNT; // capture CPU cycle count
    scount++;
    while (digitalReadFast(inpin1)); // wait for pin low to loop
  } while (scount < MAXSAMPLES);
  interrupts();
}


// Convert time stamps to CPU clock cycles since first sample.
// This handles roll over in the ARM CPU cycle counter
void AdjustSamples(uint32_t numsamples) {
  uint32_t i, firstvalue;
  firstvalue = timestamps1[0];
  for (i = 0; i < numsamples; i++) {
    timestamps1[i] = timestamps1[i] - firstvalue;
  }

}

// Show the timestamps for the first numsamples from array
void ShowSamples(uint32_t numsamples) {
  uint16_t i;
  Serial.println("CPU counts between rising edges");
  for (i = 0; i < numsamples-1; i++) {
    if ((i % 10) == 0)Serial.printf("\n% 4u: ", i);
    Serial.printf("%6lu ", timestamps1[i+1]- timestamps1[i]);
  }
  Serial.println();
}

//  For complete data collection, the samples would be written to SD.
//  For now, just show a histogram to evaluate stability and resolution.
//  I assume that the maximum input interval is not greater than 32767
//  CPU clock cycles or 18.3 milliseconds
#define HISTOSIZE 32678
static uint16_t histocounts[HISTOSIZE];   // 128KByte histogram buffer
void ShowHisto(uint32_t numsamples) {
  uint32_t i, dcount;
  uint16_t outliers = 0;
  uint16_t firstval, lastval;
  // first set all histocounts to zero
  for (i = 0; i < HISTOSIZE; i++) histocounts[i] = 0;
  // Now build the histogram data
  for (i = 1; i < numsamples; i++) {
    dcount = timestamps1[i] - timestamps1[i - 1]; // find difference in time stamps
    if (dcount < HISTOSIZE) {
      histocounts[dcount]++;
    } else {
      outliers++;
    }
  }
  // now display histogram.  Skip zero values in histogram data
  firstval = 0;  lastval = 0;
  for (i = 0; i < HISTOSIZE;  i++) {
    if ((firstval == 0)  && (histocounts[i] != 0)) firstval = i;
    if (histocounts[i] != 0) lastval = i;
  }
  // display firstval and lastval
  Serial.printf("\nhFirst valid count at: %u    Last valid count at %u\n", firstval,lastval);

  Serial.println("Histogram clock intervals and samples.");
  for (i = firstval; i <= lastval; i++) { // show only non-zero counts
    if(histocounts[i]) Serial.printf("%u   %u\n", i, histocounts[i]);
  }
  Serial.printf("Outliers:  %u\n", outliers);
  Serial.println();
}

The time stamp resolution appears to be about 9 cycles of the 600MHz clock, or about 15 nanoSeconds. It will capture at least 2 million samples/second (the limit of my inexpensive signal generator);

Here is some sample output:

Code:

Press 's' to collect and display sample,  'h for histogram.
CPU counts between rising edges

   0:    502    509    518    509    509    518    509    509    518    509 
  10:    518    509    509    518    509    509    509    518    509    509 
  20:    518    509    509    518    509    509    518    509    509    518 
  30:    509    518    509    509    518    509    509    518    509    509 
  40:    518    509    509    518    509    509    518    509    509    518 
  50:    509    509    518    509    509    518    509    509    518    509 
  60:    509    518    509    509    518    509    509    518    509    509 
  70:    518    509    509    518    509    509    518    509    509    518 
  80:    509    518    509    509    518    509    509    518    509    509 
  90:    518    509    509    518    509    509    518    509    509    518 
 100:    509    509    518    509    509    518    509    509    518    509 
 110:    509    518    509    509    518    509    509    518    509    509 
 120:    518    509    509    518    509    509    518    509    509    518 
 130:    509    509    518    509    509    518    509    509    518    509 
 140:    518    509    509    518    509    509    518    509    509    518 
 150:    509    509    518    509    509    518    509    509    518    509 
 160:    509    518    509    509    518    509    509    518    509    509 
 170:    518    509    509    518    509    509    518    509    509    518 
 180:    509    509    518    509    509    518    509    509    518    509 
 190:    509    518    509    509    518    509    509    518    509 
Press 's' to collect and display sample,  'h for histogram.
Press 's' to collect and display sample,  'h for histogram.
CPU counts between rising edges

   0:    510    518    509    509    518    509    509    518    509    509 
  10:    518    509    518    509    509    518    509    509    518    509 
  20:    509    518    509    509    518    509    509    518    509    509 
  30:    518    509    518    509    509    518    509    509    518    509 
  40:    509    518    509    509    518    509    509    518    509    518 
  50:    509    509    518    509    509    518    509    509    518    509 
  60:    509    518    509    518    509    509    518    509    509    518 
  70:    509    509    518    509    509    518    509    509    518    509 
  80:    518    509    509    518    509    509    518    509    509    518 
  90:    509    509    518    509    518    509    509    518    509    509 
 100:    518    509    509    518    509    509    518    509    518    509 
 110:    509    518    509    509    518    509    509    518    509    509 
 120:    518    509    509    518    509    518    509    509    518    509 
 130:    509    518    509    509    518    509    509    518    509    518 
 140:    509    509    518    509    509    518    509    509    518    509 
 150:    509    518    509    518    509    509    518    509    509    518 
 160:    509    509    518    509    509    518    509    509    518    509 
 170:    518    509    509    518    509    509    518    509    509    518 
 180:    509    509    518    509    509    518    509    518    509    509 
 190:    518    509    509    518    509    509    518    509    509 
Press 's' to collect and display sample,  'h for histogram.

 First valid count at: 509    Last valid count at 518
Histogram clock intervals and samples.
509   6575
510   1
518   3663
Outliers:  0
Press 's' to collect and display sample,  'h for histogram.

Of course, this is collecting only one channel of data. Collecting more channels will increase the time in the fast loop and and decrease the timing resolution.

It may be possible to combine this fast loop algorithm with the input capture of the 300MHz clock to get acceptable resolution without the issues that successive checking of multiple channels in the fast loop will present.

I think that the problems in the original code with high input rates are caused by the SD card writes. The SD routines do block interrupts for a short time (1 to 2 microseconds) with each write. It takes that long for the SD driver to set up the DMA transfers that accomplish the actual data transfer.

mborgerson · May 27, 2021

I wrote a test program to use the input capture capability on GPT2 and to save the data to SD. The program saves separate files for channels 1 and 2. It seems to work well with one channel at about 900KHz and the other up to about 1.7MHz pulse rates. I also added PWM pulse outputs from pins 5 and 6 for a bit of self-test capability if you jumper the pulse outputs to the input pins.

I found that running the GPT clock above 150MHz resulted in a lot of bad data. I suspect that running the GPT clock faster than the 150MHz bus clock for the timer registers causes the problems. I also found that counting pulses with longer duty cycles caused some problems, so I set up the PWM outputs for narrow pulses.

Here's the test program:

Code:

/**************************************************************************************
   Sketch to test fast  time-stamp collection and storage
   This version uses data from GPT input capture with Teensy 4.1
   and requires the installation of PSRAM to hold the large buffers
   needed for writing to SD Card with data rates up to a combined
   4 MBytes/second.
   The data values saved are 16-bit unsigned integers which represent the difference
   in GPT clock counts  between the current capture and the previous capture.
   This method has two advantages:
   1.  The subtraction of the unsigned 32-bit capture values automatically corrects for
       roll over in the 32-bit counter.
   2.  Storing the 16-bit differences cuts storage time and bandwidth in half.
 
   The scheme also has a disadvantage:  if the time between captures exceeds 65535 clocks
   you will have bad data.  However 65K counts of a 150MHz clock is 0.43 milliSeconds,
   which is not an issue with the high repetition rates I'm using for testing.
   If this limit is a problem, it can be increased by slowing the GPT clock--at the
   expense of reduced timing resolution.
   If average pules rates are known to be slower than about 250KHz, you could adjust
   the program to save the full 32-bit time differences, but that doubles the required
   storage for the buffers.
 
   If you are a Matlab user, you can generate an array of time stamps in microseconds
   with a few lines of Matlab code:
 
    %  starting with vector tds1 read from file
    tsusec = tds1 / 150;  % convert count difference to microseconds
    timestamp1 = cumsum(tsusec);  % convert to monotonically increasing time stamps


// M. Borgerson   May 27, 2010
***************************************************************************************/
#include <SD.h>
#include <TimeLib.h>

//================
const int LEDpin = 13;
const int inpin1 = 15;  // pins that can be connected to GPT capture inputs
const int inpin2 = 40;


const int pwmpin1 = 5; // FLEXPWM2_1_A
const int pwmpin2 = 6;// FLEXPWM2_2_A

File file1, file2;

#define  LEDON digitalWriteFast(LEDpin, HIGH);
#define  LEDOFF digitalWriteFast(LEDpin, LOW);
#define  LEDTOGGLE digitalToggleFast(LEDpin);
//*******************************************************

// Use a sample size that is multiple of 512 bytes for efficient SD storage.
// When writing to SD Card, buffers should hold at least  200milliseconds of data.
// For high sample rates, that means that buffers must be in PSRAM.
#define MAXSAMPLES 512000   // one half second storage at 1 million samples/second
uint16_t timestamps1A[MAXSAMPLES]EXTMEM;
uint16_t timestamps1B[MAXSAMPLES]EXTMEM;
uint16_t timestamps2A[MAXSAMPLES]EXTMEM;
uint16_t timestamps2B[MAXSAMPLES]EXTMEM;


// set up pointers to buffers for quick change inside IRQ handler
volatile uint16_t *bpt1[2]  = {&timestamps1A[0], &timestamps1B[0]};
volatile uint16_t *bpt2[2]  = {&timestamps2A[0], &timestamps2B[0]};

// These are the pointers used in the IRQ handler to store data
volatile uint16_t *bptr1 = bpt1[0];
volatile uint16_t *bptr2 = bpt2[0];

volatile uint16_t *outptr1 = NULL;
volatile uint16_t *outptr2 = NULL;

volatile uint32_t lastcapt1, lastcapt2;
volatile uint32_t samplecount1, samplecount2;
volatile uint32_t totalsamples1, totalsamples2;

// set pwm test output frequencies
uint32_t fc1 = 720000;
uint32_t fc2 = 900000;

uint32_t unixtime;

void setup() {
  //Initialize the digital pin 13 as an output for LED.

  pinMode(LEDpin, OUTPUT);
  pinMode(inpin1, INPUT);

  delay(500);
  Serial.begin(9600);
  Serial.println("\n\nFast Time Stamp output Test.");
  Serial.println("Press 's' to collect, 'q' to stop,  'h' for histograms.");


  if (!StartSDCard()) {
    Serial.println("Can not initialize SD card!");
    while (1) {
      LEDON
      delay(50);
      LEDOFF
      delay(50);
    }
  }
  setupTimerGPT2();
  SetPWM(720000, 900000);
}


//MAIN LOOP routine
//=================
void loop() {
  char ch = 0;
  char *fname;
  if (Serial.available()) {
    ch = Serial.read();

    switch (ch) {
      case 'c': // sample without writing to SD
        Serial.println("\nCollecting, but not saving.");
        totalsamples1 = 0; totalsamples2 = 0;
        // open the data files
        StartSampling();
        break;
      case 's':
        Serial.println("\nSaving data to SD card.");
        totalsamples1 = 0; totalsamples2 = 0;
        // open the data files
        fname =  GetFileName("TStamp1_", "TS1");
        file1 = SD.open(fname, FILE_WRITE);
        fname =  GetFileName("TStamp2_", "TS1");
        file2 = SD.open(fname, FILE_WRITE);
        StartSampling();
        break;
      case 'v':
        ShowSamples(100, timestamps1A);
        ShowSamples(100, timestamps2A);
        break;
      case 'h':
        ShowHisto(MAXSAMPLES, timestamps1A);
        ShowHisto(MAXSAMPLES, timestamps2A);
        break;
      case 'q':
        file1.close();
        file2.close();
        StopSampling();
        break;
      case 'a':
        SetPWM(900000, 1000000);
        break;
      case 'b':
        SetPWM(600000, 980000);
        break;
      case 'd':
        SD.sdfs.ls( LS_SIZE | LS_DATE |  LS_R);
        Serial.println();
        break;
    }
    Serial.println("Press 's' to collect, 'q' to stop,  'h' for histograms.");
  }
  CheckBuffers();

}

void CheckBuffers(void) {
  if (outptr1 != NULL) { // save buffer 1
    totalsamples1 +=  MAXSAMPLES;
    // write to SD here
    if (file1) {
      file1.write((const void *)outptr1, MAXSAMPLES * 2);
    }
    outptr1 = NULL;
  }
  if (outptr2 != NULL) { // save buffer 2
    totalsamples2 +=  MAXSAMPLES;
    // write to SD here
    if (file2) {
      file2.write((const void *)outptr2, MAXSAMPLES * 2);
    }
    outptr2 = NULL;
  }
}

void StartSampling(void) {
  bptr1 = bpt1[0];
  bptr2 = bpt2[0];

  outptr1 = NULL;
  outptr2 = NULL;

  samplecount1 = 0;
  samplecount2 = 0;
  lastcapt1 = 0;  // needed to maintain time coherence between
  lastcapt2 = 0;  // channels 1 and 2
  GPT2_CR |= GPT_CR_EN;  // Reset count and enable counter
  NVIC_ENABLE_IRQ(IRQ_GPT2);
}

void StopSampling(void) {
  GPT2_CR &=  ~GPT_CR_EN;  // Reset count and enable counter
  NVIC_DISABLE_IRQ(IRQ_GPT2);
  Serial.println("Sampling halted.");
  Serial.printf("Samples Saved:   Ch1: %lu      Ch2:  %lu\n", totalsamples1, totalsamples2);
}

char *GetFileName(const char *fbase, const char *ext) {
  static char filename[64];
  time_t nn;
  nn = now();
  uint8_t mo = month(nn);
  uint8_t dd = day(nn);
  uint8_t hh = hour(nn);
  uint8_t mn = minute(nn);
  sprintf(filename, "%s_%02d%02d%02d%02d.%3s", fbase, mo, dd, hh, mn, ext);
  return &filename[0];
}



// GPT IRQ Handler
void GPT_Handler(void) {
  static uint16_t bnum1, bnum2;
  static uint32_t capt1, capt2;
  uint32_t statreg;


  statreg = GPT2_SR;
  if (statreg & GPT_SR_IF1) { //capture on channel 1
    LEDON
    capt1 = GPT2_ICR1;   //read  counts
    GPT2_SR = GPT_SR_IF1;   //reset IF1 flag
    *bptr1++ = capt1 - lastcapt1;
    lastcapt1 = capt1;
    samplecount1++;
    if (samplecount1 == MAXSAMPLES) { // mark buffer ready and switch
      outptr1 = bpt1[bnum1];  // mark old buffer as full
      bnum1 = (bnum1 ^ 0x01);  // switch beteween buffers 0 and 1
      bptr1 =  bpt1[bnum1];
      samplecount1 = 0;
    }
    LEDOFF
  }

  if (statreg & GPT_SR_IF2) { //capture on channel 2
    LEDON
    capt2 = GPT2_ICR2;   //read  counts
    GPT2_SR = GPT_SR_IF2;   //reset IF2 flag
    *bptr2++ = capt2 - lastcapt2;
    lastcapt2 = capt2;
    samplecount2++;
    if (samplecount2 == MAXSAMPLES) { // mark buffer ready and switch
      outptr2 = bpt2[bnum2];  // mark old buffer as full
      bnum2 = (bnum2 ^ 0x01);
      bptr2 =  bpt2[bnum2];
      samplecount2 = 0;
    }

  }
  // asm("DSB"); // Adding this can extend IRQ from 360 to 1350 nanoseconds!
  // perhaps because it triggers a cache flush in EXTMEM
  LEDOFF
}

//  For complete data collection, the samples would be written to SD.
//  For now, just show a histogram to evaluate stability and resolution.
//  I assume that the maximum input interval is not greater than 32767
//  GPT clock cycles or 0.218 milliseconds
#define HISTOSIZE 32678
static uint32_t histocounts[HISTOSIZE];   // 128KByte histogram buffer
void ShowHisto(uint32_t numsamples, uint16_t tsvals[]) {
  uint32_t i, maxidx, dcount;
  uint32_t countmax;
  uint16_t outliers = 0;
  uint16_t firstval, lastval;
  double frmax, pdmax;
  // first set all histocounts to zero
  for (i = 0; i < HISTOSIZE; i++) histocounts[i] = 0;
  // Now build the histogram data
  countmax = 0;
  maxidx = 0;
  for (i = 1; i < numsamples; i++) {
    dcount = tsvals[i]; // read difference in time stamps
    if (dcount < HISTOSIZE) {
      histocounts[dcount]++;
      if (histocounts[dcount] > countmax) {
        countmax = histocounts[dcount];
        maxidx = dcount;
      }
    } else {
      // Serial.printf("Outlier: %lu\n", dcount);
      outliers++;
    }
  }
  // now display histogram.  Skip zero values in histogram data
  firstval = 0;  lastval = 0;
  for (i = 0; i < HISTOSIZE;  i++) {
    if ((firstval == 0)  && (histocounts[i] != 0)) firstval = i;
    if (histocounts[i] != 0) lastval = i;
  }
  // display firstval and lastval
  //Serial.printf("\nhFirst valid count at: %u    Last valid count at %u\n", firstval, lastval);

  Serial.println("Histogram clock intervals and samples.");
  for (i = firstval; i <= lastval; i++) { // show only non-zero counts
    if (histocounts[i]) Serial.printf("%u   %u\n", i, histocounts[i]);
  }
  // now show the frequency and period at maxidx count
  pdmax = (double)maxidx / 150e6;
  frmax = 1.0 / pdmax;
  Serial.printf( "Frequency at max: %8.2f KHz   Period at max:  %5.1f nSec\n\n",
                 frmax / 1000.0, pdmax * 1.0e9);
  //Serial.printf("Outliers:  %u\n", outliers);
  Serial.println();
}


// Show the timestamp intervals for the first numsamples from array
void ShowSamples(uint32_t numsamples, uint16_t tsvals[]) {
  uint16_t i;
  Serial.println("CPU counts between rising edges");
  for (i = 0; i < numsamples - 1; i++) {
    if ((i % 10) == 0)Serial.printf("\n% 4u: ", i);
    Serial.printf("%6lu ", tsvals[i]);
  }
  Serial.println();
}

// Set the output PWM frequency for pins 5 and 6 for self test
// The outputs are set for short positive pulses.  
// The frequencies will not be exact, especially when the
// requested values are above a few hundred KHz.
void SetPWM(uint32_t f1, uint32_t f2) {
  fc1 = f1; fc2 = f2;  // Copy value to globals for later use
  analogWriteResolution(8);
  analogWriteFrequency(pwmpin1, f1);
  analogWriteFrequency(pwmpin2, f2);
  analogWrite(pwmpin1, 4);  //low duty cycle
  analogWrite(pwmpin2, 4);
  Serial.printf("\nSet F1 to %lu and F2 to %lu\n", fc1, fc2);
}

// Set up GPT 2 for input capture on pins 14 and 40 --  Code from Mloum
void setupTimerGPT2(void) {
  // IOMUX Configuration in order to have physical access to capture 1 et 2 pins of GPT2
  //GPT capture 1
  IOMUXC_GPT2_IPP_IND_CAPIN1_SELECT_INPUT = 1;  // remap GPIO_AD_B1_03_ALT8 GPT2 Capture1 (Channel 1)
  IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_03 = 8; // GPT2 Capture1 configuration ALT8 Pin 15
  IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B1_03 = 0x13000; //Pulldown & Hyst

  //GPT capture 2
  IOMUXC_GPT2_IPP_IND_CAPIN2_SELECT_INPUT = 1;  // remap GPIO_AD_B1_04_ALT8 sur GPT2 Capture2 (Channel 2)
  IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_04 = 8; // GPT2 Capture2 configuration ALT8 Pin 40
  IOMUXC_SW_PAD_CTL_PAD_GPIO_AD_B1_04 = 0x13000; //Pulldown & Hyst

  //Configuration du bus d'horloge
#define CCM_CSCMR1_PERCLK_CLK_SEL   ((uint32_t)(1<<6))

  // Change the Clock Controller Module in order to use PERCLK_CLK_ROOT for the counter and not OSC@24MHz (default)
  CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL; //

  // Change the prescaler between AHB_CLK_ROOT (typically 600MhZ) and PERCLK_CLK_ROOT (default is 4 -> 150MHz)
  //  CCM_CBCDR = CCM_CBCDR_IPG_PODF(1);  // NB I can't get 0 (that is to say no prescaler) to work
  CCM_CBCDR = CCM_CBCDR_IPG_PODF(3);  // Clocks faster than this (150mHz) cause intermitent errors.
  // Note that 150MHz clock yields 6.67 nanoSecond resolution

  // Set the CCM Clock Gating Register
  CCM_CCGR0 |= CCM_CCGR0_GPT2_BUS(CCM_CCGR_ON) |
               CCM_CCGR0_GPT2_SERIAL(CCM_CCGR_ON);  // enable clock

  //Clear GPT2 registers, namely  CR, PR and SR
  GPT2_CR = 0;
  GPT2_PR = 0; //No prescaler.

  // "Clear" bit flags (ROV, IF1 and IF2) writing one in them
  GPT2_SR = GPT_SR_ROV |            //Clear bit ROV
            GPT_SR_IF1 |            //Clear bit IF1
            GPT_SR_IF2;             //Clear bit IF2

  //CR  register of GPT2 (Control Register)  Timer is set up, but not enabled
  GPT2_CR = GPT_CR_ENMOD |          //Set count to zero at enable
            GPT_CR_FRR |            //Free run mode
            GPT_CR_CLKSRC(1) |      //Clock source is Peripheral Clock
            GPT_CR_IM1(1) |         //Capture activated on channel 1 on rising edge only
            GPT_CR_IM2(1);          //Capture activated on channel 2 on rising edge only

  // clear the interrupt flags
  GPT2_SR = GPT_SR_ROV |
            GPT_SR_IF1 |
            GPT_SR_IF2;

  //IR register of GPT2 (Interruptions)
  GPT2_IR = GPT_IR_IF1IE |          //Interruption on Channel 1 capture
            GPT_IR_IF2IE;           //Interruption on Channel 2 capture

  // set up IRQ_GPT2 interrupts
  attachInterruptVector(IRQ_GPT2, &GPT_Handler );
  NVIC_SET_PRIORITY(IRQ_GPT2, 32); // high priority
  // IRQ is enabled by StartSampling
}

// initialize teensy time from  RTC for SD directories then start the SD Card.
bool StartSDCard() {
  setSyncProvider(getTeensy3Time);
  if (timeStatus() != timeSet) {
    Serial.println("Unable to sync with the RTC");
  } else {
    Serial.println("RTC has set the system time");
    unixtime = now();
  }

  if (!SD.begin(BUILTIN_SDCARD)) {
    Serial.println("\nSD File initialization failed.\n");
    return false;
  } else  Serial.println("initialization done.");
  // set date time callback function for file dates
  SdFile::dateTimeCallback(dateTime);
  return true;
}

/*
   User provided date time callback function.
   See SdFile::dateTimeCallback() for usage.
*/
void dateTime(uint16_t* date, uint16_t* time) {
  // use the year(), month() day() etc. functions from timelib

  // return date using FAT_DATE macro to format fields
  *date = FAT_DATE(year(), month(), day());

  // return time using FAT_TIME macro to format fields
  *time = FAT_TIME(hour(), minute(), second());
}

/*****************************************************************************
   Read the Teensy RTC and return a time_t (Unix Seconds) value

 ******************************************************************************/
time_t getTeensy3Time() {
  return Teensy3Clock.get();
}

mborgerson · May 27, 2021

Finally one specific question :
- Could you tell me how to reduce the access time to the registers? Or, conversely, could you explain to me why it can't be reduced?

Access time to the registers is slower than normal DTCM memory access because the peripheral bus runs at only 150MHz, whereas the internal bus to DTCM memory runs at the 600MHz CPU clock rate.

I think that adding the DSB instruction forces the processor to halt the internal 3-instruction pipeline, which slows things down even more.

In your sample code, you had a DSB instruction at the end of your interrupt handler. If you are writing to the QSPI PSRAM in the handler, that DSB forces the system to wait until the EXTMEM cache is fully written before exiting the interrupt handler. That can greatly increase the duration of the interrupt handler.

The DSB at the end of the handler is really needed only if your main program is going to read the EXTMEM data very quickly after the return from the interrupt. If you are writing to one buffer in the interrupt and reading from another to write to SD in the main program, you should never be reading the data written in the interrupt until long after it was written---when the file write gets to the end of the buffer.

jonr · May 27, 2021

A tight loop reading and testing GPIO6 takes about 33 nsec per edge - meeting the 50 ns temporal precision target. This does tie up the processor such that nothing else gets done.

Mloum · May 28, 2021

Thank you mborgerson for your very detailed answers.

I have right now to meet a deadline for something else but we will definitely look this into details a give some feedback soon.

mborgerson · May 28, 2021

jonr said:
A tight loop reading and testing GPIO6 takes about 33 nsec per edge - meeting the 50 ns temporal precision target. This does tie up the processor such that nothing else gets done.

I ran a test of the following collection loop with an input signal of about 2.36MHz:

Code:

// Collect time stamps for rising edges
void  CollectSample(uint32_t numsamples) {
  uint16_t scount = 0;

  while (digitalReadFast(inpin1)); // wait for pin low to start

  do {   
    while (!digitalReadFast(inpin1)); // wait for rising edge
    timestamps1[scount] = ARM_DWT_CYCCNT; // capture CPU cycle count
    scount++;
    while (digitalReadFast(inpin1)); // wait for pin low to loop
 
  } while (scount < MAXSAMPLES);

}

The resulting data gave the following histogram for 10240 samples:

Code:

Histogram clock intervals and samples.
244   1
247   5694
250   1
251   2
252   1
256   4539
257   1

I think that the 9 clock cycle difference indicates the time to make an extra loop when the test just misses an incoming rising edge. That 9 clock cycles is about 14.4 nanoseconds. If you were timing your loop with a scope using a signal bit, your loop would be longer due to the time needed to set the signal bit.

I tried looking at the assembly listing to count instructions, but the listing was nearly unintelligible with the standard 'fastest' optimization due to instruction reordering, dual issue, etc.etc. With 'debug' optimization the listing was a bit clearer---but still beyond my skills with ARM assembly. With that optimization, the histogram indicated about 13 clock cycles of spread, or
20.8 nanoseconds loop time waiting for the rising edge.

When I try the same clock as one input to my 2-channel GPT code, I get the following histogram:

Code:

Histogram clock intervals and samples.
167   511999
Frequency at max:   898.20 KHz   Period at max:  1113.3 nSec

Histogram clock intervals and samples.
62   1
63   116320
64   387783
65   122
66   2
127   3449
128   4321
130   1
Frequency at max:  2343.75 KHz   Period at max:  426.7 nSec

Note that the input for the 2.3MHz signal has a significant number of values at twice the sample period. I think that this means the combination of short period, two-channel collection, and SD storage slows things enough that the software can't keep up and misses a pulse about 3% of the time. Note also that there is no spread or missed pulses on the 898KHz signal on channel 1. The lack of spread is probably because the signal is generated with a PWM output on the T4.1 and the PWM and GPT clocks are derived from the same source.

If I slow the signal generator down to 1.7MHz, the histogram cleans up nicely:

Code:

Histogram clock intervals and samples.
167   511999
Frequency at max:   898.20 KHz   Period at max:  1113.3 nSec

Histogram clock intervals and samples.
86   2
87   330467
88   181529
89   1
Frequency at max:  1724.14 KHz   Period at max:  580.0 nSec

The uncertainty of the 1.7MHz signal is now down to 1 GPT clock cycle, or 6.67 nanoseconds. The reference manual says that there is an intrinsic uncertainty of 1 count in the capture data, and this experiment seems to confirm that.

I think that the speed limit for this method is determined by the interrupt handler latency and the duration of the ISR handler in the worst case---when two pulses arrive within a few nanoseconds of each other and both are processed in the same call to the ISR. It takes longer to process the input, and in the worst case there is additional time required to switch between storage buffer pointers and zero the sample count.

Input capture on teensy 4.1 and registers access time

Mloum

New member

mborgerson

Well-known member

mborgerson

Well-known member

mborgerson

Well-known member

mborgerson

Well-known member

jonr

Well-known member

Mloum

New member

mborgerson

Well-known member