Question about low level timing

Promnius · Jul 24, 2020

I am running a Teensy 3.2 at 96 MHz, with the compiler set to 'fastest'. While I am able to make it run fast enough for my purposes, I am a little curious about WHY it behaves the way it does, and whether there is a better way to control the number of cycles a task takes, short of writing everything in assembly. Just to be clear before someone suggests I grab my Teensy 4.0, it is running just fine on the 3.2.

I have defined an empty yield function and disabled all interrupts. I've also increased the pin drive strength so I can see crisp transitions on the scope. If I then toggle a pin using digitalWriteFast, it transitions every ~10 ns (makes sense so far, this is a single CPU clock cycle). If I put it in a loop, obviously I get a lot of delay as the CPU executes the branch. Interestingly, for short loops, say <10 iterations, the 'fastest' optimization must write it out longhand anyway: I see the size of the code increase as I increase the number of iterations, but the timing still shows no delays. Cross some threshold, though, and it compiles the loop with a branch.

For my application, I am reading out a linear CCD using an external ADC with parallel data, so I don't need a waveform indefinitely, and copying and pasting a few thousand times to force the compiler to put everything into program memory is acceptable. This works: I can construct a 48 MHz output signal on a Teensy 3.2, which lasts only as long as the Teensy has enough program memory to keep going.

I am sure there is a more elegant way to do this using PWM channels and Timers, but triggering interrupts on pin changes is too slow, and with the need to have multiple synchronized, offset, non-square clock signals, and to read a port also synchronized to this, just bitbanging out the whole waveform doesn't seem too awful. I don't need to do anything else while reading the CCD, and I don't have any use for the extra program space. It certainly works fine.

This means I can also now count clock cycles using 2 digital write commands and an oscilloscope to time the delay between the two (for more than 10 clock cycles it becomes important to note that the ideal period is actually closer to 10.4 ns). When I start adding other things in between the two write functions, I start to get unexpected results. Operations that I would expect to take a single clock cycle take 3. This includes __asm__ __volatile__ ("nop\n\t");, as well as digitalReadFast(). Note that I'm not assigning the result of the read to a variable, just reading it.
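The conversion between scope readings and cycle counts can be written down as a small helper. This is only a sketch that assumes the nominal 96 MHz core clock from the post; the names kCpuHz, cycles_to_ns and ns_to_cycles are made up for illustration and are not part of the Teensy core.

```cpp
#include <cstdint>

// Core clock from the post; change this for other boards or clock settings.
constexpr double kCpuHz = 96e6;

// Duration of a given number of CPU cycles, in nanoseconds.
constexpr double cycles_to_ns(uint32_t cycles, double f_cpu_hz = kCpuHz) {
    return cycles * 1e9 / f_cpu_hz;
}

// Nearest whole number of CPU cycles for an interval read off the scope.
constexpr uint32_t ns_to_cycles(double ns, double f_cpu_hz = kCpuHz) {
    return static_cast<uint32_t>(ns * f_cpu_hz / 1e9 + 0.5);
}
```

With these, 48 cycles come out as 500 ns, and a gap measured as roughly 42 ns maps to 4 cycles rather than the 4.2 that a 10 ns-per-cycle estimate would suggest.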

The next strange thing is that if I add TWO of these commands, ie, 2 nops or 2 digitalReadFasts, the second one takes a single clock cycle, ie the entire code block has 4 clock cycles stuck in between the two write pin transitions.

It's almost like the processor needs time to swap over from doing one thing to doing another?

I was able to write the code I needed, just by measuring everything on an oscilloscope and adding nops by trial and error until I got the waveform timing I wanted, but I was a little curious why I wasn't able to set up my timings analytically, why I can't predict the number of clock cycles that a nop will take, and why I can't get the compiler to compile code down to a single clock cycle when it clearly can execute in a single clock cycle under different circumstances.

Is it possible that the Teensy is hyperthreading (or something similar), and some instructions can pipeline better than others, depending on the order? I'm looking to go fast, but I'm also looking to get predictable timing.

I've attached simplified code that exhibits the same behavior. I've used the pin numbers that were in my circuit, but I would imagine that any could be used.

Code:

// Teensy 3.2 Bitbang Timing Test

// for maximum speed GPIO operation: compile in fastest, define void yield(), disable all interrupts, use digitalWriteFast (and static pin numbers), and set slew rates and drive strengths to fast.
// Also, for very short for loops the compiler will type it out longhand anyway, but for longer loops, i.e., toggle a pin 1000 times, it's actually better to type it
// out 1000 times, or you lose a couple of clock cycles every time the loop branches back. Obviously, this cannot continue forever as there is limited memory. And it looks kind of funny.
// also, manually index into arrays, where possible. using a variable for the index runs slower too.
// this gets single clock cycle output control, which for a Teensy 3.2 running at 100 MHz (ok, 96 MHz) is roughly 10 ns per pin toggle.

#define pinOUTPUT0 16
#define pinOUTPUT1 21
#define pinINPUT 1

#define pinC0 15
#define pinC1 22
#define pinC2 23
#define pinC3 9
#define pinC4 10
#define pinC5 13
#define pinC6 11
#define pinC7 12

void setup() {
  pinMode(pinOUTPUT0, OUTPUT);
  pinMode(pinOUTPUT1, OUTPUT);
  pinMode(pinINPUT, INPUT);
  
  // not sure if this is necessary or not, I intend to use port reads later on.
  pinMode(pinC0, INPUT);
  pinMode(pinC1, INPUT);
  pinMode(pinC2, INPUT);
  pinMode(pinC3, INPUT);
  pinMode(pinC4, INPUT);
  pinMode(pinC5, INPUT);
  pinMode(pinC6, INPUT);
  pinMode(pinC7, INPUT);

  CORE_PIN16_CONFIG = CORE_PIN16_CONFIG | 0x00000040; // high strength
  CORE_PIN16_CONFIG = CORE_PIN16_CONFIG & 0xFFFFFFFB; // fast slew
  CORE_PIN21_CONFIG = CORE_PIN21_CONFIG | 0x00000040; // high strength
  CORE_PIN21_CONFIG = CORE_PIN21_CONFIG & 0xFFFFFFFB; // fast slew

  digitalWriteFast(pinOUTPUT0, LOW); // establishing initial states
  digitalWriteFast(pinOUTPUT1, HIGH);

  delay(1000);
}

// DEFINING THIS FUNCTION IS CRITICAL TO MAXIMUM SPEED OPERATION!!! IT DOESN'T EVEN HAVE TO GET CALLED ANYWHERE.
void yield () {} //Get rid of the hidden function that checks for serial input and such.

int x = 0; // variables to play with timing, seeing what takes more or less time to execute.
int myInts[1024];

void loop() {

  cli(); // disable interrupts
  //for (int i=0; i < 20; i++){ // 'fastest' will compile this in-line for a lowish loop number (<10)
    digitalWriteFast(pinOUTPUT1, LOW); // 1 clock cycle
    digitalWriteFast(pinOUTPUT1, HIGH); // 1 clock cycle

    // ---------- uncomment one of these instructions, or an adjacent pair, to see the timings mentioned in the comments -----------
    
    //x = digitalReadFast(pinC0); // 7 clock cycles
    //x = digitalReadFast(pinC0); // 2 more clock cycles, ie, 9 clock cycles to do 2 reads to a variable
    
    //digitalReadFast(pinC0); // 3 clock cycles
    //digitalReadFast(pinC0); // 1 more clock cycle, ie, 4 clock cycles to do 2 reads
    // UNLIKE the NOPs below, this does not cause the digitalWriteFast to execute any slower.
    
    //x = GPIOC_PDIR; // 5 clock cycles
    //x = GPIOC_PDIR; // 1 more clock cycle, ie, 6 clock cycles to do 2 port reads to a variable
    
    //myInts[15] = GPIOC_PDIR; // 5 clock cycles
    //myInts[16] = GPIOC_PDIR; // 4 more clock cycles, ie, 9 clock cycles to do 2 port reads to an array with fixed indexes
    // note that it only takes 6 clock cycles total (same as port reads to a variable) if the same index is used for both reads.
    
    //myInts[x] = GPIOC_PDIR; // 11 clock cycles
    //myInts[x] = GPIOC_PDIR; // 0 additional clock cycles? It's possible the compiler optimized this out.
    // I wouldn't think that's allowed though- the world may have changed, so if I request a read I would think
    // that has to be left in place.

    //x=x+1; // 8 clock cycles
    //x=x+1; // 0 additional clock cycles? the compiler must be reducing these to a '+2' instruction, which takes the same
    // amount of time because to the micro an addition operation is an addition operation.
    //x++ also takes the same amount of time. No surprise there.
    
    //__asm__ __volatile__ ("nop\n\t"); // 3 clock cycles?!
    //__asm__ __volatile__ ("nop\n\t"); // 1 more clock cycle, ie, 4 clock cycles to do 2 nops
    // But it also slows down the following digital writes, such that the pulse is active for 2 clock cycles?
    
    digitalWriteFast(pinOUTPUT0, HIGH); // 1 clock cycle
    digitalWriteFast(pinOUTPUT0, LOW); // 1 clock cycle
  //}
  sei(); // re-enable interrupts.

  delay(1000);
  if(x==0){__asm__ __volatile__ ("nop\n\t");} // force the usage of the variable x so it doesn't get optimized out.
}

Here are a couple of waveforms I've picked up. Please pardon my poor grounding and only 50 MHz scope, I am sure the Teensy is outputting a much cleaner signal. Still, it is far more than sufficient to see the timing.

First, 4 consecutive writes (a single pulse on two separate channels). Note there are no wasted clock cycles.

Second, I slip a nop in between the two pulses. Note the 3 extra clock cycles it adds.

Third, I add a second nop. Note there are only 4 clock cycles in between the pulses. ALSO note the second pulse now has a 'dead' clock cycle . . . I was not able to produce this with any other code (i.e., 2 digitalReadFasts did not cause this).

Promnius · Jul 24, 2020

Hmmm, not sure why those last 2 pictures are upside down. Sorry!

Nominal Animal · Jul 25, 2020

The effects you are seeing in instruction timing (documented for the Cortex-M4 in a PDF) are a result of instruction pipelining within the Cortex-M4 core itself, i.e., at the lowest hardware level of the microcontroller implementation. Internally, the core is divided into stages, and some instructions can immediately follow a different instruction in the pipeline (in a different stage), and so take fewer cycles to execute than they would in other situations.

Teensy 3.2 has a Cortex-M4 core, and it has a cycle counter (DWT_CYCCNT) one can use to track the number of cycles an operation or a sequence of operations actually took.
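A minimal sketch of the bookkeeping around such a cycle counter is below. The two helpers are plain arithmetic and compile anywhere; their names are hypothetical. The register names in the trailing comment (ARM_DEMCR, ARM_DWT_CTRL, ARM_DWT_CYCCNT and their bit macros) are what the Teensy 3.x core headers are assumed to provide, so check them against your installed core before relying on them.

```cpp
#include <cstdint>

// Elapsed cycles between two snapshots of a free-running 32-bit cycle counter.
// Unsigned subtraction gives the right answer even if the counter wrapped
// around once between the two reads.
constexpr uint32_t elapsed_cycles(uint32_t start, uint32_t end) {
    return end - start;
}

// Remove the fixed cost of taking the two snapshots themselves. 'overhead' is
// whatever elapsed_cycles() reports with nothing between the two reads.
constexpr uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead) {
    return (elapsed_cycles(start, end) > overhead)
               ? (elapsed_cycles(start, end) - overhead)
               : 0u;
}

// Intended use on the Teensy (register names are assumptions, see above):
//   ARM_DEMCR    |= ARM_DEMCR_TRCENA;        // enable the DWT unit
//   ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;  // start the cycle counter
//   uint32_t t0 = ARM_DWT_CYCCNT;
//   /* code under test */
//   uint32_t t1 = ARM_DWT_CYCCNT;
//   uint32_t n  = net_cycles(t0, t1, overhead);
```

Measuring the overhead once with an empty section and subtracting it keeps the counter reads themselves out of the result, which matters when the code under test is only a few cycles long.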

Instead of a loop, would you consider trying a version that records the samples to RAM explicitly, using the preprocessor for unwinding the loop?

At the core, the idea is that you split the number of iterations to a sum of products with small terms (say, 2 - 10) – try to keep the number of terms in each product less than eight. For example, if you wanted 1023 iterations, which is just shy of 1024 (which would be easy, 4*4*4*4*4), you could use 1023=1008+7+3+5=4*4*3*3*7+7+3+5 and

Code:

#define  REPEAT_3(expr) expr; expr; expr
#define  REPEAT_4(expr) expr; expr; expr; expr
#define  REPEAT_5(expr) expr; expr; expr; expr; expr
#define  REPEAT_7(expr) expr; expr; expr; expr; expr; expr; expr
#define  REPEAT_1023(expr) REPEAT_4(REPEAT_4(REPEAT_3(REPEAT_3(REPEAT_7(expr))))); REPEAT_7(expr); REPEAT_3(expr); REPEAT_5(expr)

so that recording 1023 samples from the ADC would be

Code:

#define  DIAGNOSTIC_PIN  pin-to-toggle
#define  SAMPLING_MUX  adc-mux-setting
#define  SAMPLING_AREF  adc-aref-setting
#define  ADC_PRESCALER  adc-prescaler-setting

static uint16_t  adc_buffer[1023];
static uint16_t *adc_next;

static inline __attribute__((always_inline)) void adc_sample(void)
{
    uint8_t lo, hi;
 
    // Start the conversion (ADCSRA, ADCL and ADCH are AVR-style ADC registers)
    ADCSRA = (1<<ADEN) | ADC_PRESCALER | (1<<ADSC);

    digitalWriteFast(DIAGNOSTIC_PIN, 1);

    // Wait for conversion to complete
    while (ADCSRA & (1<<ADSC));

    // Record conversion, less significant byte first
    lo = ADCL;
    hi = ADCH;

    digitalWriteFast(DIAGNOSTIC_PIN, 0);

    // Store conversion result
    *(adc_next++) = lo | ((uint16_t)hi << 8);
}

static inline void adc_1023_samples(void)
{
    // Initialize the sample buffer
    adc_next = adc_buffer;

    // Enable ADC
    ADCSRA = (1<<ADEN) | ADC_PRESCALER;

    // Configure mux, reference, and high-speed mode
    ADCSRB = (1<<ADHSM) | (SAMPLING_MUX & 0x20);
    ADMUX = SAMPLING_AREF | (SAMPLING_MUX & 0x1F);

    // Disable interrupts
    cli();

    REPEAT_1023(adc_sample());

    // Enable interrupts
    sei();
}
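The factorization behind REPEAT_1023 can be checked on a desktop compiler before trusting it on hardware. The sketch below repeats the macros from the post and counts how many times the expression is expanded; count_1023 is a hypothetical helper name, and the check assumes a C++14 or later compiler.

```cpp
#define REPEAT_3(expr) expr; expr; expr
#define REPEAT_4(expr) expr; expr; expr; expr
#define REPEAT_5(expr) expr; expr; expr; expr; expr
#define REPEAT_7(expr) expr; expr; expr; expr; expr; expr; expr
#define REPEAT_1023(expr) REPEAT_4(REPEAT_4(REPEAT_3(REPEAT_3(REPEAT_7(expr))))); REPEAT_7(expr); REPEAT_3(expr); REPEAT_5(expr)

// Expands the increment once per repetition, so the return value is the
// number of copies the preprocessor actually produced.
constexpr int count_1023() {
    int n = 0;
    REPEAT_1023(n++);
    return n;
}

// 4*4*3*3*7 = 1008, plus 7 + 3 + 5 = 1023; fails to compile if the macro drifts.
static_assert(count_1023() == 1023, "REPEAT_1023 must expand exactly 1023 times");
```

The same pattern works for any other iteration count: write the new REPEAT_N macro, change the expected value, and let the compiler confirm the arithmetic.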

As to the instruction timings, I would personally definitely write adc_sample() in extended inline assembly, replacing the while loop with NOPs and output pin changes (as needed to produce the non-square signals synced to the ADC sampling). To ensure each sample was converted, I'd use a global 32-bit variable, say adc_completion, initialized to zero before each 1023-sample run, with an adc_completion |= ADCSRA & (1<<ADSC); just before reading ADCL and ADCH. ADSC reads as one while a conversion is still in progress, so if the sampling run went as planned, adc_completion will still be zero. If it is nonzero, then one or more of the samples were read before their conversion had finished.

PaulStoffregen · Jul 25, 2020

Flash memory access time may also be a factor. Teensy 3.2 has only a tiny 256 byte cache between the flash and the M4 processor (which also does some limited buffering & prefetching). Since the flash runs at only 24 MHz, a cache miss can cause a 3 cycle stall. But the flash is also 64 bits wide, and most instructions are 16 bits, so it tends to fill the prefetch buffer & cache at approximately the speed most code executes.
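The numbers in that paragraph can be worked through as a rough model. This is a simplification that uses only the figures quoted above (96 MHz core, 24 MHz flash, 64-bit flash reads, 16-bit instructions) and ignores 32-bit instructions, branches and data accesses; the constant and function names are made up for the example.

```cpp
#include <cstdint>

// Figures quoted in the post above.
constexpr uint32_t kCpuHz = 96000000u;
constexpr uint32_t kFlashHz = 24000000u;
constexpr uint32_t kFlashWidthBits = 64u;
constexpr uint32_t kThumbInstrBits = 16u;  // most instructions; some are 32 bits

// CPU cycles that pass during one flash access.
constexpr uint32_t cycles_per_flash_read() { return kCpuHz / kFlashHz; }

// Extra cycles the core waits when it needs that data immediately.
constexpr uint32_t worst_case_stall_cycles() { return cycles_per_flash_read() - 1u; }

// 16-bit instructions delivered by a single flash read.
constexpr uint32_t instrs_per_flash_read() { return kFlashWidthBits / kThumbInstrBits; }

static_assert(cycles_per_flash_read() == 4u, "one flash read spans 4 CPU cycles");
static_assert(worst_case_stall_cycles() == 3u, "matches the 3 cycle stall quoted above");
static_assert(instrs_per_flash_read() == 4u, "one read delivers four 16-bit instructions");
```

Four instructions arriving every four cycles is why straight-line 16-bit code can run from flash at about one instruction per cycle, while a miss on a branch target can still cost the full stall.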

Use FASTRUN on the function to put it into RAM. The RAM runs at the full CPU speed. But only the first half of the RAM (addresses below 0x20000000) is single cycle access for code fetches. By default FASTRUN will put your code there, unless of course other things have already allocated all that memory.

The processor still has a 3 stage pipeline and access to peripherals still goes over bus bridges, and often what seems like a single cycle operation is actually compiled into 2 or more instructions (usually because "large" constants are needed). So there are still a lot of complex reasons why things might take more cycles. But you can at least eliminate the flash memory latency with FASTRUN.
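As a rough illustration of where FASTRUN goes, here is a sketch of a small unrolled capture routine. On a Teensy the macro comes from the core headers; the empty fallback definition exists only so the sketch also builds on a desktop compiler. The names fake_port, samples, capture4 and demo are hypothetical, with fake_port standing in for a real input register such as GPIOC_PDIR.

```cpp
#include <cassert>
#include <cstdint>

// On Teensy the core headers define FASTRUN; this empty fallback is only
// for building the sketch on a desktop compiler.
#ifndef FASTRUN
#define FASTRUN
#endif

static volatile uint32_t fake_port = 0;  // hypothetical stand-in for a GPIO input register
static uint8_t samples[4];

// The attribute goes in front of the function definition. On the Teensy this
// asks the linker to place the function in RAM instead of flash.
FASTRUN void capture4(void) {
    samples[0] = static_cast<uint8_t>(fake_port);
    samples[1] = static_cast<uint8_t>(fake_port);
    samples[2] = static_cast<uint8_t>(fake_port);
    samples[3] = static_cast<uint8_t>(fake_port);
}

// Desktop sanity check; on hardware the oscilloscope or the cycle counter
// is the real test.
static void demo(void) {
    fake_port = 0x1A5u;
    capture4();
    assert(samples[0] == 0xA5u && samples[3] == 0xA5u);
}
```

Whether FASTRUN changes the measured timing depends on whether the code was stalling on flash in the first place, so it is worth measuring before and after.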

Promnius · Jul 25, 2020

Ok, wonderful! Thank you both for your answers, I have a much better idea of what I'm dealing with now. I've added FASTRUN but like you guessed, it made no difference in my simple test programs (other than to increase the global RAM usage slightly, as expected). I must not be outrunning the cache. Still planning on using this trick more in the future though! It also sounds like my approach of timing everything manually through trial and error was the right way to go- but that nested #define is beautiful! You can bet that's what my code will look like shortly! Thanks again!
