STRANGE GPIO speed performance with digitalWriteFast() (ring oscillator example)

Here's one more try to see the 120 MHz waveform. Same setup as #20. This time, I took the spring clip off the scope probe and touched the probe point directly to pin 14. Then I used the lead of a resistor to short from the GND pin next to pin 13 to the ground shield of the scope probe, which makes a *huge* improvement, much less overshoot & undershoot.

View attachment 9579

Of course, my scope is still only 200 MHz bandwidth. Not much I can do about that!

Paul, looking at your waveforms here, as compared to post #18, it's "quite obvious" you've greatly reduced the ground bounce by using the short stub ground lead. For reference, take a look at the following YouTube video, from 8:40 to 9:20.
https://www.youtube.com/watch?v=zodpCuxwn_o

Even without going to a faster scope, you can separate the true signal response from the artifact caused by ground bounce simply by making your pulses longer, with a delay in between them. With the longer ground lead, the ground bounce will show up as an obvious superposition on the signal.

EDIT: also, if you approximate the front end of a 200-MHz scope as a basic RC low-pass filter, then the bandwidth is F = 1/(2*pi*RC), where RC is the time constant Tau [0..63%]. That gives Tau ~ 0.8 ns in this case, and the rise time would be a bit longer. So, you should be seeing something on this order.
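
As a quick sanity check of those numbers, here is the arithmetic in runnable form, assuming the usual single-pole relationship where the 10-90% rise time is about 2.2 times the time constant:

Code:
#include <stdio.h>

int main(void) {
  const double PI = 3.14159265358979;
  double bw = 200e6;                    // scope bandwidth, Hz
  double tau = 1.0 / (2.0 * PI * bw);   // time constant: ~0.8 ns
  double t_rise = 2.2 * tau;            // 10-90% rise time of a single pole: ~1.75 ns
  printf("tau = %.2f ns, rise time = %.2f ns\n", tau * 1e9, t_rise * 1e9);
  return 0;
}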
 
Here's a quick test on a Teensy 3.6 running at 180 MHz CPU speed.

You'll notice I put 8 digitalWriteFast() calls in a row. This should take advantage of the hardware's optimization for faster access on successive STR instructions.

Code:
void setup() {
  pinMode(14, OUTPUT);
  CORE_PIN14_CONFIG = PORT_PCR_MUX(1); // no slew rate limit
  noInterrupts();  // look pretty for the oscilloscope trigger
}

void loop() {
  while (1) {
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
    digitalWriteFast(14, HIGH);
    digitalWriteFast(14, LOW);
  }
}

Here's what I see on my oscilloscope.

View attachment 9576

Looks like each digitalWriteFast() really is taking only a single cycle: at 180 MHz one cycle is about 5.6 ns, and two writes occur in ~11 ns. The loop and other overhead take about 22 ns. In practice, you can usually only get these incredibly fast speeds in bursts.

Yes, this is the obvious way to see where the delays are. It shows that the pulses are very fast and the while-loop overhead is quite short. You could also, of course, show the overhead of the main loop() function by keeping the same sequence of 8 fast writes but removing the while statement.
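
For example, a minimal variant of Paul's sketch above (not measured here, just a sketch of the idea) would let loop() itself provide the repetition, so the gap between bursts also includes the loop() dispatch overhead:

Code:
void setup() {
  pinMode(14, OUTPUT);
  CORE_PIN14_CONFIG = PORT_PCR_MUX(1); // no slew rate limit
  noInterrupts();  // keep the trace clean for the oscilloscope trigger
}

void loop() {
  // Same 8 back-to-back writes, but without the inner while(1).
  digitalWriteFast(14, HIGH);
  digitalWriteFast(14, LOW);
  digitalWriteFast(14, HIGH);
  digitalWriteFast(14, LOW);
  digitalWriteFast(14, HIGH);
  digitalWriteFast(14, LOW);
  digitalWriteFast(14, HIGH);
  digitalWriteFast(14, LOW);
}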

Personally, I find these speeds phenomenal compared to the other microcontrollers I've been using for some years. :)
---------

One additional thing. The guy doing the measurements on the page originally cited didn't seem to have a very good grasp of proper scope probe grounding [i.e., properly short ground leads], as mentioned in the other posts.
http://fab.cba.mit.edu/classes/865.15/people/sam.calisch/9/index.html

Of interest, however, his traces do show a certain amount of overshoot on the leading edges of some of the signals. I would be interested to see what longer pulses [e.g., 1 µs] look like on the Teensy with the present setups, overshoot or not.

If there is overshoot, a good way to deal with such ringing is "source-series termination": try the same measurement with a 22 or 47 ohm series resistor tied directly to the I/O pin, with a very short lead between the pin and the resistor body. Try a metal-film resistor.
 
This thread made me wonder how MY scope would compare so I just had to try a test.

As always, the first issue is how to actually probe the signal without having the scope probe's ground lead ruin your measurement. I found an interesting article on a simple probe technique that suggests you can get a good, very-high-bandwidth probe without introducing significant ground inductance AND without spending a ton of money.

See Probing High-Speed Digital Designs at http://www.sigcon.com/Pubs/straight/probes.htm for some background and ideas.

I built the suggested 21:1 probe and attached it to Teensy 3.6 pin 14 and ground, as Paul did above.

HighSpeedProbe.jpg

I then ran the four-pulse test program from #18 above. My Agilent MSO-X 3024A 200 MHz scope is similar to Paul's, but I wanted to see what my new probe would do. This is what I saw when running the Teensy 3.6 at 180 MHz. Overclocking to 240 MHz produced similar-looking results, just with the expected faster pulse times.


t36_180mhz_4pz.png

The trace was much "smoother" than I expected, so I decided to dig into the back room and resurrect an old HP 500 MHz scope for a quick comparison. The HP 54503A is a really old dog but is kinda fun to use, AND its timebase will go down to 200 ps per division!! That is a 10x faster horizontal sweep than my newer scope will go. It has been left in the dust by newer scopes, but my budget does not allow....... Its inputs have not been calibrated in some time, so we should take the risetime response with a little grain of salt, but the measurement with the Teensy 3.6 overclocked to 240 MHz is interesting.

HP54503.jpg

Perhaps someone with a "real" scope will add to this adventure.
 
Teensy 180 MHz GPIO performance

I'm wondering for which "real-world" application toggling a pin "as fast as possible" is actually useful. And if it is, why does it have to be done in software rather than in hardware (for example SPI, I2S, an external oscillator...)?

Edit:
For me, it's a pretty useless kind of benchmark..

Hi Frank,

I am doing this on the Raspberry Pi in order to use the pins as a bus to communicate with other devices faster than the various serial protocols allow. On the Pi 3, for example, I can achieve approximately 60 MHz via GPIO by setting the 32-bit register directly. However, I am currently dedicating one of the cores to this, and I would like to free up the full Pi for other purposes, so I would like to offload this functionality to a 180 MHz Teensy. Have you been able to benchmark the maximum toggle speed on a 180 MHz Teensy? If so, I would appreciate a post of your results, as I'd like to see whether the speeds are sufficient for my needs.

Many thanks in advance!

Anthony
 
Have you been able to benchmark the maximum toggle speed on a 180 MHz Teensy?

The answer is yes. Yes, we have been able to benchmark this.

Please read the prior messages in this thread. You'll discover they are indeed benchmarking the maximum pin toggle speed, at 180 MHz and overclocked to 240 MHz.

I recommend reading carefully about some of the issues with how the compiler implements loops and allocates registers. All benchmarking has caveats, and in this case those are pretty big ones. There's a lot of good info in this thread. Please, read it.
 
In addition to the caveats introduced by the compiler, another factor to consider is the impact of the K66 flash cache on performance. When code is fetched from flash, a delay of 1 or 2 clock cycles is introduced, based on what I have been able to discover. The flash cache mitigates this by caching code and allowing it to be fetched from the faster cache memory.

This means that the very first iteration of any time-sensitive code, such as bit banging, will be slower. Subsequent iterations will run much faster because there will be no delays fetching instructions from flash. So any of the above tests with back-to-back I/O writes, although they look good inside the loop, do not run as fast the first time through.

I can demonstrate this repeatedly with a bit of code I am using to capture data from multiple 24-bit A/D converters simultaneously. The serial bandwidth is less than 20 MHz, so the code is 'tuned' to use repeated direct port writes to insert dummy wait states. But it is implemented without any looping and boils down to the following repeated code block:

Code:
  GPIOC_PDOR = 0x0001;   // serial clock high (repeated writes act as wait states)
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0000;   // serial clock low
  GPIOC_PDOR = 0x0000;
  GPIOC_PDOR = 0x0000;
  d = GPIOD_PDIR;        // read the A/D data bits from port D
  rdgs[i] = d;           // store the reading
  i++;

This compiles to 24 repeated blocks like this:

Code:
  str    r7, [r6, #0]
  str    r7, [r6, #0]
  str    r7, [r6, #0]
  str    r7, [r6, #0]
  str.w  lr, [r6]
  str.w  lr, [r6]
  str.w  lr, [r6]
  ldr    r2, [r3, #0]
  str.w  r2, [r9, #4]

The compiler is smart enough to use an immediate offset into the data buffer for the pointer contained in r9, so it doesn't have to increment the pointer, saving clock cycles. Anyway, this is as fast as you can get, and each instruction should take one clock cycle to execute. The net result should be 9 clock cycles per bit, or 20 MHz, which is the target for my test.

But every time the code executes, the first iteration looks terrible. The top trace is the A/D ready signal, running at 128 kHz. The bottom trace is the serial clock, which is generated by the above code.

View attachment 13639

Execution should take a little over 1.2 µs. However, it is taking 3.1 µs. In addition, the clock periods, although not symmetrical, should at least be even, but they vary a bit. The pattern is consistent and occurs on every first iteration after reset.

The second, and subsequent iterations look quite a bit different:

View attachment 13640

Now the execution time is down to 1.74 µs and the clock periods are quite a bit more regular. This is still about 540 ns longer than it should be. Closer inspection reveals that the second cycle is 14 clocks wide, the third cycle is 12 clocks wide, and all of the other cycles are 13 clocks wide. Based on the disassembly, all should be 9 clocks wide. But the difference between the first and subsequent iterations is still quite significant.


In researching this issue, I came across a couple of posts here about the FASTRUN compiler directive : https://forum.pjrc.com/threads/27690-IntervalTimer-is-not-precise?p=64142&viewfull=1#post64142.

When I used this directive on my code, the code was forced to run from SRAM instead of flash memory. I verified this by looking at the assembly listing. This solved the problem with the first iteration, although the code still takes an average of 4 cycles longer than it should for clocking each bit. But at least the overall execution time is predictable.

So, the important thing to remember through all of this is to take your benchmarking with a grain of salt. Simple tests that run in tight loops can lead you to believe the code executes faster than it does when it is not running out of the flash cache or SRAM. Use the FASTRUN directive to force your critical code to run from RAM so that it will always execute in a predictable period of time. There are numerous other techniques for dealing with problems implementing fast and predictable I/O. But before going down that optimization path, be certain that you are getting a consistent benchmark by using the FASTRUN directive.
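
For reference, FASTRUN is applied as a simple function attribute, so the change is minimal. Here is a sketch of the idea (the function name and body are placeholders, not the actual A/D code from above):

Code:
// FASTRUN places the function in RAM instead of flash, so the first pass
// is not slowed down by flash fetch delays / cache misses.
FASTRUN void clockOutBits(void) {
  GPIOC_PDOR = 0x0001;   // serial clock high
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0001;
  GPIOC_PDOR = 0x0000;   // serial clock low
  GPIOC_PDOR = 0x0000;
  GPIOC_PDOR = 0x0000;
}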
 
Update

Just to update this thread, here are the results showing slew-rate-limiting on and off, measured with a 1 GHz Agilent MSO-X 4104A, a 1 GHz active probe (N2765A), and careful lead placement.

No news here, but cleaner waveforms.

Teensy 3.6 Slew Limited.jpeg

Teensy 3.6 Max Slew.jpeg


Some have written "beware" of turning slew-rate-limiting off. It depends...

If you use cheap plastic plug-boards and flying leads, leave it at the default. If you do impedance-controlled PCBs and put the Teensy on those, consider turning slew-rate-limiting off. :)
 
To see how fast code like this, which generates clocks and grabs data, can run, use a minimalist approach.

Code:
//Minimalist pin output clock speed test for Teensy 3.6, overclocked at 240 MHz
//G. Kovacs, 7/6/19
//Use bit-banging to make a clock for an external ADC and port reads to capture output (later, need to unscramble the port pin mapping).

const int clockPin = 25;
uint32_t sample;

void setup() {
  pinMode(clockPin, OUTPUT);
  CORE_PIN25_CONFIG = PORT_PCR_MUX(1); // No slew-rate-limiting on pin 25
}

void loop() {
  for (int i = 0; i < 16384; i++) {
    digitalWriteFast(clockPin, HIGH);
    sample = GPIOB_PDIR;
    digitalWriteFast(clockPin, LOW);
    sample = GPIOC_PDIR;
  }
}

Note: LTO = Link-Time Optimization (the toolchain optimizes across the whole compiled program at link time, for example removing unused code).

Trying every single compiler option methodically, it was clear that there was a trade-off between duty cycle (important for many fast ADCs) and sample rate.

Here are two useful examples:

Compiler (default) "Faster," 240 MHz overclock, Teensy 3.6.
Faster.jpeg

Compiler "Faster with LTO"
Faster with LTO.jpeg

The fastest with a decent (close to 50%) duty cycle? Many options gave the same result, at 17.2 MSPS. These options are:
Faster
Fast
Fastest
Fastest + pure-code
Smallest Code

So... no difference.

Turning on LTO invariably leads to a duty cycle closer to 20% (not great), with the winner being "Fast with LTO," coming in at a nice, but not really usable (due to duty cycle) 26.9 MSPS...
Fast with LTO.jpeg


So it is quite noteworthy that, in this particular case (raw bit-banging and port-read speed), the compiler option names do not translate literally into speed: the option labeled "Fastest" is not actually the fastest.

Caveats, of course, are that this simplistic example is not putting the samples into an array, as one would in practice. Also, one can easily write code to generate "inline" versions of the acquisition code that literally hard-code the array index instead of incrementing in a loop. In my experience, this leads to insane (many minutes) compile times with flaky (variable, sometimes jittery timing) results. I have yet to see this approach really improve things.
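
For illustration, here is a rough sketch of what such a hard-coded unroll looks like (the buffer and function names are assumed for the example, reusing clockPin from the sketch above, and only the first few blocks are shown):

Code:
// Hypothetical capture buffer; each block hard-codes its own array index,
// so there is no loop counter to increment or compare.
uint32_t samples[16384];

void captureUnrolled() {
  digitalWriteFast(clockPin, HIGH);
  samples[0] = GPIOB_PDIR;
  digitalWriteFast(clockPin, LOW);

  digitalWriteFast(clockPin, HIGH);
  samples[1] = GPIOB_PDIR;
  digitalWriteFast(clockPin, LOW);

  // ... repeated (typically machine-generated) out to samples[16383] ...
}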

Normally, one would use a toggled flip-flop to generate a clean, 50% duty-cycle clock in cases where the duty cycle is not ideal, such as the fastest cases here. Of course, that divides the sample rate in half. Without using more complicated techniques to multiply the rate back up, one is pretty much stuck with the duty cycles that are within spec for the ADC chosen.

If you are trying to do something like this, consider playing with compiler optimization. Incidentally, FASTRUN does not always make things run faster. For example, in the winning example above (Fast with LTO), it makes no difference whether void loop() is defined with FASTRUN or not, presumably because it is already optimal.
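
For reference, that test just means adding the attribute to the definition; a minimal sketch of the same loop, forced into RAM (reusing clockPin and sample from the sketch above):

Code:
// Same minimalist test as above, but with loop() placed in RAM via FASTRUN.
// In this case it made no measurable difference.
FASTRUN void loop() {
  for (int i = 0; i < 16384; i++) {
    digitalWriteFast(clockPin, HIGH);
    sample = GPIOB_PDIR;
    digitalWriteFast(clockPin, LOW);
    sample = GPIOC_PDIR;
  }
}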



I hope this is useful to some of you out there.
 
Ok, one last thing (I can't help it, I'm a nerd). Here is an easy way to "pad" the timing in the logic HIGH phase to get a better duty cycle. As shown below, the code has two "nop" instructions inserted to get the sample rate to 20.1 MSPS (not bad) and a duty cycle of about 40% (not great, but good enough for some ADCs).

Code:
//Minimalist pin output clock speed test for Teensy 3.6, overclocked at 240 MHz
//G. Kovacs, 7/6/19
//Use bit-banging to make a clock for an external ADC and port reads to capture output (later, need to unscramble the port pin mapping).

const int clockPin = 25;
uint32_t sample;

void setup() {
  pinMode(clockPin, OUTPUT);
  CORE_PIN25_CONFIG = PORT_PCR_MUX(1); // No slew-rate-limiting on pin 25
}

void loop() {
  for (int i = 0; i < 16384; i++) {
    digitalWriteFast(clockPin, HIGH);
    sample = GPIOB_PDIR;
    __asm__ __volatile__ ("nop\n\t");
    __asm__ __volatile__ ("nop\n\t");
    digitalWriteFast(clockPin, LOW);
    sample = GPIOC_PDIR;
  }
}


Fast with LTO nop Padded.jpeg


Not too bad at all.
 
I am happy to run any tests you should require. That's why I offered to help on the Teensy 4 (but was told no). I could also loan you an older-generation Tek 1 GHz digital scope with FET probes if needed.
 
I think most of that "off time" is spent doing the compare and loop increment. What if you count the loop down to zero? "Compare to 16384" probably takes more cycles than "loop if non-zero", so you might be able to increase the frequency and remove one of the NOPs.
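
Something along these lines, as a sketch reusing clockPin and sample from the earlier code (assuming the compiler then emits a subtract-and-branch-if-non-zero instead of a compare against 16384; worth verifying in the disassembly):

Code:
void loop() {
  for (int i = 16384; i != 0; i--) {    // count down so the loop test is just "non-zero?"
    digitalWriteFast(clockPin, HIGH);
    sample = GPIOB_PDIR;
    __asm__ __volatile__ ("nop\n\t");   // possibly only one nop needed now
    digitalWriteFast(clockPin, LOW);
    sample = GPIOC_PDIR;
  }
}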

For clocking an ADC, you could unroll the loop for each byte, or even the entire SPI block.
 