I am running a Teensy 3.2 at 96 MHz, with the compiler set to 'fastest'. While I am able to make it run fast enough for my purposes, I am curious about WHY it behaves the way it does, and whether there is a better way to control the number of cycles a task takes, short of writing everything in assembly. Just to be clear before someone suggests I grab my Teensy 4.0: it is running just fine on the 3.2.
I have defined an empty yield() function and disabled all interrupts. I've also increased the pin drive strength so the transitions are crisp enough to time on the scope. If I then toggle a pin using digitalWriteFast, it transitions every ~10 ns (makes sense so far, since this is a single CPU clock cycle). If I put the writes in a loop, I obviously get extra delay while the CPU executes the branch. (Interestingly, for short loops, say <10 iterations, the 'fastest' optimization must write it out longhand anyway: the code size grows as I increase the iteration count, but the timing still shows no delays. Cross some threshold, though, and it compiles to a branch.) For my application, I am reading out a linear CCD using an external ADC with parallel data, so I don't need the waveform to run indefinitely, and copying and pasting the writes a few thousand times to force the compiler to put everything into program memory is acceptable. This works: I can construct a 48 MHz output signal on a Teensy 3.2, which lasts only as long as the Teensy has enough program memory to keep going.
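As an alternative to pasting the writes by hand, the repetition could in principle be generated at compile time. The following is only a sketch under assumptions: Unroll and CountingToggle are hypothetical names, not part of Teensyduino, and whether the compiler really turns this into straight-line code with no branches depends on the optimizer and would still need to be confirmed on the scope.

```cpp
// Compile-time repetition sketch: Unroll<N>::run(f) is written as N calls of f
// with no loop counter. The recursion splits N in half at each level, so even
// a few thousand repetitions only nest about a dozen templates deep.
template <unsigned N>
struct Unroll {
    template <typename F>
    static inline void run(F& f) {
        Unroll<N / 2>::run(f);      // first half of the calls
        Unroll<N - N / 2>::run(f);  // remaining calls
    }
};

// Base case: exactly one call.
template <>
struct Unroll<1> {
    template <typename F>
    static inline void run(F& f) { f(); }
};

// Base case: nothing to do.
template <>
struct Unroll<0> {
    template <typename F>
    static inline void run(F&) {}
};

// Hypothetical stand-in for the real pin toggle so the sketch can be checked
// on a PC. On the Teensy, operator() would instead hold the pair
// digitalWriteFast(pin, LOW); digitalWriteFast(pin, HIGH);
struct CountingToggle {
    unsigned calls = 0;
    void operator()() { ++calls; }
};
```

Usage would look like CountingToggle t; Unroll<1000>::run(t); which leaves t.calls at 1000. The code size cost is the same as pasting the writes by hand; only the source gets shorter.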
I am sure there is a more elegant way to do this using PWM channels and timers, but triggering interrupts on pin changes is too slow, and given the need for multiple synchronized, offset, non-square clock signals, plus a port read synchronized to them, just bitbanging out the whole waveform doesn't seem too awful. I don't need to do anything else while reading the CCD, and I don't have any use for the extra program space. It certainly works fine.
This means I can also count clock cycles using two digitalWriteFast calls and an oscilloscope to time the delay between them (for more than 10 clock cycles it becomes important to note that the actual period at 96 MHz is closer to 10.4 ns). When I start adding other things in between the two writes, I get unexpected results. Operations that I would expect to take a single clock cycle take 3. This includes __asm__ __volatile__ ("nop\n\t"); as well as digitalReadFast(). Note that I'm not assigning the result of the read to a variable, just reading it.
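To make the 10.4 ns figure concrete, here is the arithmetic as a small helper for converting between scope readings and cycle counts. It assumes the core really is clocked at exactly 96 MHz; the names kCpuHz, cyclesToNs and nsToCycles are made up for this sketch.

```cpp
#include <cstdint>

// Assumed core clock: 96 MHz, as stated for the Teensy 3.2 in this post.
constexpr double kCpuHz = 96.0e6;

// Duration of `cycles` CPU clock cycles in nanoseconds.
// One cycle is 1e9 / 96e6 = 10.4167 ns, not 10 ns.
constexpr double cyclesToNs(std::uint32_t cycles) {
    return cycles * 1.0e9 / kCpuHz;
}

// Nearest whole number of cycles for an interval measured on the scope (in ns).
constexpr std::uint32_t nsToCycles(double ns) {
    return static_cast<std::uint32_t>(ns * kCpuHz / 1.0e9 + 0.5);
}
```

For example, 20 cycles last about 208.3 ns, so dividing a 208 ns scope reading by a rounded 10 ns would suggest 21 cycles, while nsToCycles(208.0) gives 20.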
The next strange thing is that if I add TWO of these commands, i.e., 2 nops or 2 digitalReadFasts, the second one takes a single clock cycle, so the entire code block puts 4 clock cycles in between the two write pin transitions.
It's almost like the processor needs time to swap over from doing one thing to doing another?
I was able to write the code I needed just by measuring everything on an oscilloscope and adding nops by trial and error until I got the waveform timing I wanted, but I am curious why I wasn't able to set up my timings analytically, why I can't predict the number of clock cycles a nop will take, and why I can't get the compiler to compile code down to a single clock cycle when it clearly can execute in a single clock cycle under different circumstances.
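Besides the oscilloscope, a free-running cycle counter could be used to cross-check these numbers in software. The sketch below only shows the bookkeeping, and it rests on assumptions: that the Cortex-M4 cycle counter is enabled and readable from the Teensy core (I believe it is exposed as ARM_DWT_CYCCNT, but treat that name as an assumption), and that reading the counter itself costs a fixed number of cycles that can be measured once and subtracted. elapsedCycles and netCycles are hypothetical helper names.

```cpp
#include <cstdint>

// Cycles elapsed between two readings of a free-running 32-bit cycle counter.
// Unsigned subtraction wraps modulo 2^32, so the result is still correct if
// the counter rolled over once between the two readings.
inline std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
    return end - start;
}

// Cost of the code under test: the raw difference minus the overhead of taking
// the two readings, where `overhead` is the difference measured with nothing
// in between the readings.
inline std::uint32_t netCycles(std::uint32_t start, std::uint32_t end,
                               std::uint32_t overhead) {
    return elapsedCycles(start, end) - overhead;
}
```

On the board, start and end would be two reads of the counter taken around the code being timed. Since the reads are themselves instructions, they may shift the timing of the code around them, so the scope remains the reference.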
Is it possible that the Teensy is hyperthreading (or something similar), and some instructions can pipeline better than others, depending on the order? I'm looking to go fast, but I'm also looking to get predictable timing.
I've attached simplified code that exhibits the same behavior. I've used the pin numbers that were in my circuit, but I would imagine that any could be used.
Here are a few waveforms I've captured. Please pardon my poor grounding and my 50 MHz scope; I am sure the Teensy is outputting a much cleaner signal. Still, it is more than sufficient to see the timing.
First, 4 consecutive writes (a single pulse on two separate channels). Note there are no wasted clock cycles.
Second, I slip a nop in between the two pulses. Note the 3 extra clock cycles it adds.
Third, I add a second nop. Note that there are now only 4 clock cycles in between the pulses. ALSO note the second pulse now has a 'dead' clock cycle... I was not able to produce this with any other code (i.e., 2 digitalReadFasts did not cause this).
Code:
// Teensy 3.2 Bitbang Timing Test
// For maximum-speed GPIO operation: compile with 'fastest', define void yield(), disable all interrupts, use digitalWriteFast (with static pin numbers), and set fast slew rate and high drive strength.
// Also, for very short for loops the compiler will unroll them anyway, but for longer loops, e.g. toggling a pin 1000 times, it's actually better to type it
// out 1000 times, or you lose a couple of clock cycles every time the loop branches back. Obviously, this cannot continue forever as there is limited memory. And it looks kind of funny.
// Also, index into arrays with constants where possible. Using a variable for the index runs slower too.
// This gets single-clock-cycle output control, which for a Teensy 3.2 running at 96 MHz is roughly 10.4 ns per pin toggle.
#define pinOUTPUT0 16
#define pinOUTPUT1 21
#define pinINPUT 1
#define pinC0 15
#define pinC1 22
#define pinC2 23
#define pinC3 9
#define pinC4 10
#define pinC5 13
#define pinC6 11
#define pinC7 12
void setup() {
  pinMode(pinOUTPUT0, OUTPUT);
  pinMode(pinOUTPUT1, OUTPUT);
  pinMode(pinINPUT, INPUT);
  // not sure if this is necessary or not, I intend to use port reads later on.
  pinMode(pinC0, INPUT);
  pinMode(pinC1, INPUT);
  pinMode(pinC2, INPUT);
  pinMode(pinC3, INPUT);
  pinMode(pinC4, INPUT);
  pinMode(pinC5, INPUT);
  pinMode(pinC6, INPUT);
  pinMode(pinC7, INPUT);
  CORE_PIN16_CONFIG = CORE_PIN16_CONFIG | 0x00000040; // high strength
  CORE_PIN16_CONFIG = CORE_PIN16_CONFIG & 0xFFFFFFFB; // fast slew
  CORE_PIN21_CONFIG = CORE_PIN21_CONFIG | 0x00000040; // high strength
  CORE_PIN21_CONFIG = CORE_PIN21_CONFIG & 0xFFFFFFFB; // fast slew
  digitalWriteFast(pinOUTPUT0, LOW); // establishing initial states
  digitalWriteFast(pinOUTPUT1, HIGH);
  delay(1000);
}
// DEFINING THIS FUNCTION IS CRITICAL TO MAXIMUM SPEED OPERATION!!! IT DOESN'T EVEN HAVE TO GET CALLED ANYWHERE.
void yield () {} //Get rid of the hidden function that checks for serial input and such.
int x = 0; // variables to play with timing, seeing what takes more or less time to execute.
int myInts[1024];
void loop() {
  cli(); // disable interrupts
  //for (int i=0; i < 20; i++){ // 'fastest' will compile this in-line for a lowish loop count (<10)
  digitalWriteFast(pinOUTPUT1, LOW); // 1 clock cycle
  digitalWriteFast(pinOUTPUT1, HIGH); // 1 clock cycle
  // ---------- uncomment one of these instructions, or an adjacent pair, to see the timings mentioned in the comments -----------
  //x = digitalReadFast(pinC0); // 7 clock cycles
  //x = digitalReadFast(pinC0); // 2 more clock cycles, i.e., 9 clock cycles to do 2 reads to a variable
  //digitalReadFast(pinC0); // 3 clock cycles
  //digitalReadFast(pinC0); // 1 more clock cycle, i.e., 4 clock cycles to do 2 reads
  // UNLIKE the NOPs below, this does not cause the digitalWriteFast to execute any slower.
  //x = GPIOC_PDIR; // 5 clock cycles
  //x = GPIOC_PDIR; // 1 more clock cycle, i.e., 6 clock cycles to do 2 port reads to a variable
  //myInts[15] = GPIOC_PDIR; // 5 clock cycles
  //myInts[16] = GPIOC_PDIR; // 4 more clock cycles, i.e., 9 clock cycles to do 2 port reads to an array with fixed indexes
  // note that it only takes 6 clock cycles total (same as port reads to a variable) if the same index is used for both reads.
  //myInts[x] = GPIOC_PDIR; // 11 clock cycles
  //myInts[x] = GPIOC_PDIR; // 0 additional clock cycles? It's possible the compiler optimized this out.
  // I wouldn't think that's allowed though: the world may have changed, so if I request a read I would think
  // that has to be left in place.
  //x=x+1; // 8 clock cycles
  //x=x+1; // 0 additional clock cycles? The compiler must be reducing these to a '+2' instruction, which takes the same
  // amount of time because to the micro an addition operation is an addition operation.
  //x++ also takes the same amount of time. No surprise there.
  //__asm__ __volatile__ ("nop\n\t"); // 3 clock cycles?!
  //__asm__ __volatile__ ("nop\n\t"); // 1 more clock cycle, i.e., 4 clock cycles to do 2 nops
  // But it also slows down the following digital writes, such that the pulse is active for 2 clock cycles?
  digitalWriteFast(pinOUTPUT0, HIGH); // 1 clock cycle
  digitalWriteFast(pinOUTPUT0, LOW); // 1 clock cycle
  //}
  sei(); // re-enable interrupts.
  delay(1000);
  if(x==0){__asm__ __volatile__ ("nop\n\t");} // force the usage of the variable x so it doesn't get optimized out.
}
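Since the sketch stores whole GPIOC_PDIR reads in an array during the timing-critical section, the 8-bit ADC samples could be extracted afterwards, when timing no longer matters. This is a minimal sketch, assuming the eight data lines (pinC0 to pinC7 above) land on bits 0 to 7 of port C in order; unpackSamples is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>

// Keep only the low 8 bits of each stored 32-bit port read.
// Assumption: ADC data bit k is wired to port C bit k, for k = 0..7.
inline void unpackSamples(const std::uint32_t* raw, std::uint8_t* out,
                          std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(raw[i] & 0xFFu);
    }
}
```

Doing the masking after the readout keeps the capture itself down to one port read and one store per sample.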