STRANGE GPIO speed performance with digitalWriteFast() (ring oscillator example)

Status
Not open for further replies.

hstone

Member
I am very interested in maximum GPIO speed for teensy 3 series.
I am aware from previous posts, that digitalWriteFast() can be extremely fast on teensy 3

I came across an interesting comparison of GPIO speeds of various MCUs, in this site:
http://fab.cba.mit.edu/classes/865.15/people/sam.calisch/9/index.html
It shows that even using digitalReadFast()/digitalWriteFast() on the Teensy, the maximum speed is ONLY 266 kHz (??)
In fact the "optimized" Teensy code is MUCH SLOWER than all the other boards (STM32, LPC) & even Arduino !!!

Here is the code:

#define PIN_OUT 2
#define PIN_IN 3

void setup() {
// put your setup code here, to run once:
pinMode(2, OUTPUT);
pinMode(3 , INPUT);
digitalWriteFast(2, LOW);
}

void loop() {
digitalWriteFast(2, !digitalReadFast(3));
}

I confirmed the same results with my teensy 3.1, which is really WEIRD !!

So, could anybody explain this SLOW performance ??
 
Have you ever thought it could be a problem with the library support? Heh. Nothing is stopping you from accessing the port registers directly instead of using digitalWriteFast(), which could potentially be the problem you're having. Not saying it is, but nothing is faster than direct port manipulation.
 
The problem is not the Teensy's speed, it's the Arduino environment. After each pass through loop() it does a lot of other things, like checking the serial ports for incoming bytes to trigger the serialEvent callbacks, and so on.

To get around this "handbrake", you should write your own loop inside loop(), so that the outer loop never repeats:

Code:
void loop() {
    while(1) {
        digitalWriteFast(2, !digitalReadFast(3));
    }
}

This should speed up things considerably
 
My guess is it simply is that the yield function of the Teensy is being called every time you exit loop. The yield function checks to see if there is any Serial data available on all of the Serial ports

How are you timing this? What is on pin 3?

Try something simple? like:
Code:
void setup() {
  // put your setup code here, to run once:
  pinMode(13, OUTPUT);
}

void loop() {
  while(1) {
    digitalWriteFast(13, !digitalReadFast(13));
  }
}
On my T3.6 the logic analyzer shows each loop taking about 0.1 us for the read, the write and the loop overhead.

Edit: on a T3.2 compiled at 120 MHz, with pins 2 and 3 connected to each other, the combined loop averaged maybe 0.15 us.
 
What that does is reading and writing to toggle the pin.
There is a faster way, just use the toggle register.
 
What does this do?

Code:
#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) )) // {GPIOC_PTOR=32;} 
void setup() {
  // put your setup code here, to run once:
  pinMode(13, OUTPUT);
}

void yield() {} // remove default serialEvent() checks

void loop() {
  while(1) {
    qBlink();
  }
}

Alternative {courtesy of FrankB} would be:
Code:
#define qBlink() {GPIOC_PTOR=32;} // (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))
 

With this:
Code:
#define PIN_OUT 2
#define PIN_IN  3

void setup() {
  // put your setup code here, to run once:
  pinMode(2, OUTPUT);
  pinMode(3 , INPUT);
  digitalWriteFast(2, LOW);
}

void loop() {
while (1) {
    digitalWriteFast(2, !digitalReadFast(3));
}
}

I get 3.0 MHz on a Teensy 3.2.

Without the while, all serial ports are checked in every iteration. That means, on a Teensy 3.6, six !! serial ports are checked in every iteration, plus the USB serial.
I don't know who invented this "great" feature, but it's the Arduino way..

You can disable this "great feature" if you add a line
Code:
void yield(void) {}

to your sketch. This way, you don't need the while(1) inside loop(). But it disables the serialEvent feature of Arduino (which is not needed anyway, just use Serial.available()..)
yield() is a perfect example of today's lazy programming style.. I wonder how many power plants would not be needed worldwide if software were a bit more optimized. Arduino is really a good example of inefficient code here.


Edit:
This:
Code:
while(1) {GPIOC_PTOR=32;}
toggles pin 13 and gives 16 MHz. Not too bad. 6 cycles.
(20 MHz with 120 MHz F_CPU)
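
As a rough cross-check of the figures quoted in this thread, a measured output frequency can be converted into CPU cycles per full output period (F_CPU divided by the output frequency). This is only an illustrative sketch: cycles_per_period is a made-up helper, not part of Teensyduino, and the 266 kHz figure is assumed to have been measured at 96 MHz.

```cpp
#include <stdint.h>

// Hypothetical helper (not a Teensyduino function): CPU cycles spent per
// full output period (one HIGH phase plus one LOW phase), rounded to nearest.
constexpr uint32_t cycles_per_period(uint32_t f_cpu_hz, uint32_t f_out_hz) {
  return (f_cpu_hz + f_out_hz / 2) / f_out_hz;
}

// Figures reported in this thread:
// while(1) {GPIOC_PTOR=32;} at 96 MHz gives 16 MHz -> 6 cycles per period
constexpr uint32_t kPtorLoop96 = cycles_per_period(96000000, 16000000);
// the same loop at 120 MHz gives 20 MHz -> also 6 cycles per period
constexpr uint32_t kPtorLoop120 = cycles_per_period(120000000, 20000000);
// digitalReadFast()/digitalWriteFast() inside while(1), 3.0 MHz -> 32 cycles
constexpr uint32_t kReadWriteLoop = cycles_per_period(96000000, 3000000);
// the same pair in a plain loop(), 266 kHz (96 MHz assumed) -> about 361 cycles
constexpr uint32_t kPlainLoop = cycles_per_period(96000000, 266000);
```

Each pass toggles the pin once, so one period is two passes: about 3 cycles per pass for the PTOR loop, about 16 for the read/write pair inside while(1), and roughly 180 when every pass goes through loop() and yield(). For the ring oscillator, part of those 16 cycles is presumably the delay between writing the output pin and the new level showing up on the input pin, not just instructions.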
 
I thought the serialEvent#() was a nice way to localize all the serial read processing in a uniform place.

But you can use "void yield(){}" and then just call serialEvent#() yourself for the active ports that need scanning. The local yield() can also be edited to call only the ones needed, which gets the best of both on every loop() exit or delay().
 
I wonder how this compares (once it compiles) with pins wired from 14 to 13 - I made a FreqMeasure Teensy sketch but don't have time to hook it up - and no LA - it will be slower based on isr() overhead:

Code:
#define qBlink() {GPIOC_PTOR=32;} 
void setup() {
  pinMode(13, OUTPUT);
  pinMode(14, INPUT);
  attachInterrupt( 14, loopisr, CHANGE );
  qBlink();
}

void yield() {} // remove default serialEvent() checks

void loop() {} 

FASTRUN void loopisr() {
    qBlink();
}
 
Dear all, thanks for replying!

@Theremingenieur Yeap, i did not know loop() had such an overhead ...

Using registers is always faster, sure, but I recall Paul saying digitalWriteFast() could be as fast as ~50 MHz
(https://forum.pjrc.com/threads/24573-Speed-of-digitalRead-and-digitalWrite-with-Teensy3-0), if pin numbers are constants.
This is the speed of register access, so I prefer using digitalWriteFast() for more readable code.

So Frank B's 3 MHz (not using the toggle register, just digitalWriteFast() & digitalReadFast()) still seems a bit slow to me...
So, if digitalWriteFast() is so fast (up to 50 MHz), perhaps digitalReadFast() is a bit slower? (causing a performance drop to 3 MHz?)

thanks in advance
 
correction:

digitalWriteFast() is SINGLE CYCLE, that is 96 MHz (Teensy @ 96 MHz).
Toggling, which requires 2 digitalWriteFast() calls, is ~48 MHz.
So with Frank B's 3 MHz result, the performance goes from 96 MHz to 3 MHz (just by adding negation & digitalReadFast()).
 
To get maximum the yield() has to go away - that more than doubles the overhead when the rest of the code is a simple set of instructions.

The bus speed is part of the limit as the clock limits the port access. That blocks the write until the bus cycle is up - I'm not sure if the port read also blocks until the next cycle?

Off hand I'm not sure what F_CPU and F_BUS you are sitting at - but the I/O runs at bus speed - after the CPU sets it up.

Calling this will spell it out (if I copied the right code):
Code:
void CPUspecs() {
  Serial.println();
#if defined(__MK20DX128__)
  Serial.println( "CPU is T_3.0");
#elif defined(__MK20DX256__)
  Serial.println( "CPU is T_3.1/3.2");
#elif defined(__MKL26Z64__)
  Serial.println( "CPU is T_LC");
#elif defined(__MK64FX512__)
  Serial.println( "CPU is T_3.5");
#elif defined(__MK66FX1M0__)
  Serial.println( "CPU is T_3.6");
#endif
  Serial.print( "F_CPU =");   Serial.println( F_CPU );
  Serial.print( "ARDUINO =");   Serial.println( ARDUINO );
  Serial.print( "F_PLL =");   Serial.println( F_PLL );
  Serial.print( "F_BUS =");   Serial.println( F_BUS );
  Serial.print( "F_MEM =");   Serial.println( F_MEM );
  Serial.print( "NVIC_NUM_INTERRUPTS =");   Serial.println( NVIC_NUM_INTERRUPTS );
  Serial.print( "DMA_NUM_CHANNELS =");   Serial.println( DMA_NUM_CHANNELS );
  Serial.print( "CORE_NUM_TOTAL_PINS =");   Serial.println( CORE_NUM_TOTAL_PINS );
  Serial.print( "CORE_NUM_DIGITAL =");   Serial.println( CORE_NUM_DIGITAL );
  Serial.print( "CORE_NUM_INTERRUPT =");   Serial.println( CORE_NUM_INTERRUPT );
  Serial.print( "CORE_NUM_ANALOG =");   Serial.println( CORE_NUM_ANALOG );
  Serial.print( "CORE_NUM_PWM =");   Serial.println( CORE_NUM_PWM );

}
 
@ defragster
I use a Teensy 3.1 @ 96 MHz, just like Frank B (Teensy 3.2 @ 96 MHz)

CPU is T_3.1/3.2
F_CPU =96000000
ARDUINO =10801
F_PLL =96000000
F_BUS =48000000
F_MEM =24000000
NVIC_NUM_INTERRUPTS =95
DMA_NUM_CHANNELS =16
CORE_NUM_TOTAL_PINS =34
CORE_NUM_DIGITAL =34
CORE_NUM_INTERRUPT =34
CORE_NUM_ANALOG =21
CORE_NUM_PWM =12
 
Paul uses different code again. Repeating the statements eliminates the jumps back to the start of the loop, of course, and therefore gives higher speeds.
Yes, digitalWriteFast() is as fast as register access (unfortunately we have no "digitalToggleFast()", but in this special case it would not be faster than consecutive digitalWriteFast(0); digitalWriteFast(1);)
 
It is easy to see what code is generated. Track down where the temporary files are stored. In my case that was in /tmp/arduino_build_xxx where xxx is some number. Stashed in there will be the .elf file. Find where the gcc tools are located and then use "$PATH/arm-none-eabi-objdump -S name.ino.elf|less" to dump the code.

For example:

Code:
void loop() {
  while(1) {GPIOC_PTOR=32;}
     484:       4a01            ldr     r2, [pc, #4]    ; (48c <loop+0x8>)
     486:       2320            movs    r3, #32
     488:       6013            str     r3, [r2, #0]
     48a:       e7fd            b.n     488 <loop+0x4>
     48c:       400ff08c        .word   0x400ff08c

A two-instruction loop is about as tight as it gets.
 
digitalWriteFast() is SINGLE CYCLE

Sadly, it's much more complicated.

The more accurate statement is it's a single STR instruction, which depends on two registers being initialized with specific constants.

Typically, a single STR instruction takes 2 cycles. But Cortex-M4 has a hardware optimization for multiple similar STR instructions, where subsequent ones can (usually) be done with only a single cycle.

There's also a peripheral bridge in play here, between the CPU & switched bus matrix. Usually I don't think much about this bridge, other than the added latency which rarely matters. But for these types of extreme optimizations, it's important to know the chip's internal structure.

Of course, the STR instruction depends on 2 registers being loaded with the address and data. Often the compiler will move these outside any loop, but not always.

Another issue to consider is pinMode will turn on the slew rate limiting feature. As you approach higher frequencies, this could become a problem....
 
Here's a quick test, running on a Teensy 3.6 running at 180 MHz CPU speed.

You'll notice I put 8 digitalWriteFast() calls in a row. This should take advantage of the hardware's optimization for faster access on successive STR instructions.

Code:
void setup() {
  pinMode(14, OUTPUT);
  CORE_PIN14_CONFIG = PORT_PCR_MUX(1); // no slew rate limit
  noInterrupts();  // look pretty for the oscilloscope trigger
}

void loop() {
  while (1) {
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
  }
}

Here's what I see on my oscilloscope.

file.png

Looks like each digitalWriteFast really is taking only a single cycle, since two of them occur in ~11ns. But the loop & other overhead takes about 22ns. In practice, you can usually only get these incredibly fast speeds in bursts.
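
The scope readings above can be sanity-checked with a little cycle arithmetic. This is just a sketch with made-up helper names (cycles_to_ps and square_wave_hz are not Teensyduino functions). The 4-cycle loop overhead is an inference from the ~22 ns reading, not something measured separately.

```cpp
#include <stdint.h>

// Hypothetical helper: duration of `cycles` CPU clock cycles in picoseconds
// (integer division, so the result is truncated).
constexpr uint64_t cycles_to_ps(uint32_t cycles, uint32_t f_cpu_hz) {
  return (static_cast<uint64_t>(cycles) * 1000000000000ULL) / f_cpu_hz;
}

// Hypothetical helper: frequency of a square wave whose full period
// (HIGH plus LOW) lasts `cycles_per_period` CPU cycles.
constexpr uint32_t square_wave_hz(uint32_t f_cpu_hz, uint32_t cycles_per_period) {
  return f_cpu_hz / cycles_per_period;
}

constexpr uint32_t kFCpu = 180000000;  // Teensy 3.6 at 180 MHz, as in the test above

// Two single-cycle writes: 11111 ps, matching the ~11 ns seen on the scope.
constexpr uint64_t kTwoWritesPs = cycles_to_ps(2, kFCpu);
// ASSUMPTION: ~22 ns of loop overhead corresponds to 4 cycles (22222 ps).
constexpr uint64_t kOverheadPs = cycles_to_ps(4, kFCpu);
// Within a burst, one HIGH cycle plus one LOW cycle gives 90 MHz.
constexpr uint32_t kBurstHz = square_wave_hz(kFCpu, 2);
// Averaged over a whole pass: 4 pulses in 8 + 4 = 12 cycles gives 60 MHz.
constexpr uint32_t kAveragePulseHz = 4 * (kFCpu / 12);
```

Under these assumptions the 90 MHz figure only holds inside each burst of eight writes; averaged over the whole loop, the pulse rate is closer to 60 MHz.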

The reason this 90 MHz waveform looks like a sine wave is (probably) that my oscilloscope has only 200 MHz bandwidth. I'm using passive probes that claim to have plenty of bandwidth, but I must admit they have longer ground leads because I'm a bit lazy, so I replaced the normal ones with convenient Pomona minigrabber clips that don't come off when moving parts around. The ground lead inductance causes fast edges to be seen with massive overshoot & undershoot. High bandwidth probing is tough to do well....

I'd be really curious to see what this waveform really looks like, if anyone has a high bandwidth scope and probes, and the time to go to all the trouble of setting up a measurement without ground lead inductance. Honestly, even if I had the 500 MHz or 1 GHz version of this scope (which is *far* outside of PJRC's test equipment budget), I probably wouldn't spend the time to set up proper probing. But this 200 MHz BW limited, casually probed waveform can show you the actual speed, even if the image isn't the nice digital square wave you'd hope to see.
 
Might also be worth mentioning many of the common logic chips, like 74HC series, probably can't handle short 5.5 ns pulse width from successive digitalWriteFast(). Wire lengths and ground loop areas that normally don't matter with slower, slew-rate limited digital signals become much more important at these speeds.
 
Since I'm sure the question will come up.... here's how it looks with Teensy 3.6 overclocked to 240 MHz. Same code as message #18 above, just running at 240 MHz CPU speed.

file.png

My scope has a tough time triggering properly on this weird waveform. You might notice I had to move the trigger level up to nearly the crest of the waveform, which I'm sure isn't actually 5V. This really shows the limitation of trying to use a 200 MHz bandwidth scope to measure a 120 MHz digital signal. My guess is the pin probably doesn't really have enough output bandwidth....
 
I'm wondering for which "real-world" application, pin-toggle "as fast as possible" is useful ? And if yes, why it has to be done in software, not hardware (for example SPI, I2S.., external oscillator..)

Edit:
For me, it's a pretty useless kind of benchmark..
 
Here's one more try to see the 120 MHz waveform. Same setup as #20. This time, I took the spring clip off the scope probe and touched the probe point directly to pin 14. Then I used the lead of a resistor to short from the GND pin next to pin 13 to the ground shield of the scope probe, which makes a *huge* improvement, much less overshoot & undershoot.

file.png

Of course, my scope is still only 200 MHz bandwidth. Not much I can do about that!

In the unlikely event anyone from Keysight ever reads this thread and wants to show off what their top-end scopes can do... just email me directly (paul at pjrc dot com) and I'll be happy to send you a Teensy 3.6 programmed with this code. Not that I can afford the upgrade to higher bandwidth, but I would be really curious to see what it looks like.
 
@UhClem : Thanks for the tip!

@Frank B (#21):
Actually I am not really interested in pin toggling, but more on how fast I can read GPIO (e.g. how speedy digitalReadFast() is).
I just thought that the ring oscillator would be a quick example to test GPIO speed across various platforms.

@Paul
This indeed showcases how speedy digitalWriteFast() can be. But my question was whether digitalReadFast() is just as fast & why, then, Frank B's #7 result is only 3 MHz :)
 
It's the same speed for reads: one or two cycles. But that's pretty useless if you don't do anything with the result. A synthetic benchmark that does not do exactly what you need is not a very useful way to get an idea of the speed.
Again, what is the point of doing this in software?
SPI, I2S, DMA..
 
@Frank B, Yeap you are right.
I need to read a few 8-bit values (from 8 GPIO pins) as fast as possible (e.g. 24 MHz), save them in memory & process them :)
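
For that kind of 8-bit capture, one common approach is to wire all eight signals to the low bits of a single port and read the port's input register once per sample, instead of calling digitalReadFast() eight times. The sketch below is a minimal illustration under assumptions: capture_port_bytes and cycles_per_sample are hypothetical names, and on a Teensy 3.x the pointer would presumably be &GPIOD_PDIR (port D bits 0-7; check which pins those are on your board).

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: CPU cycles available per sample at a given sample rate.
constexpr uint32_t cycles_per_sample(uint32_t f_cpu_hz, uint32_t sample_hz) {
  return f_cpu_hz / sample_hz;
}

// Hypothetical helper: copy the low 8 bits of a GPIO input register into
// buf, n times, back to back. `port_input` is ASSUMED to point at a port
// data input register, e.g. &GPIOD_PDIR on Teensy 3.x.
inline void capture_port_bytes(const volatile uint32_t* port_input,
                               uint8_t* buf, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    // One 32-bit register read per sample; keep only bits 0-7.
    buf[i] = static_cast<uint8_t>(*port_input & 0xFFu);
  }
}

// At 96 MHz, a 24 MHz sample rate leaves only 4 CPU cycles per sample.
constexpr uint32_t kCycleBudget = cycles_per_sample(96000000, 24000000);
```

With a budget of about 4 cycles per sample, a plain loop like this is unlikely to reach 24 MHz on its own. Unrolling might get closer, and a timer-triggered DMA transfer is probably the more realistic route, as already suggested above.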
 