A set of scope-tested 100-500 ns delay macros

edsuom

New member
After some frustration with timing jitter from a short loop like that in delayMicroseconds, I put a set of hundred-nanosecond delay macros together today for bit-banging a second SPI port on my Teensy 3.2 and thought I'd share. Dedicated to the public domain.

The sketch below is what I used to empirically determine the combinations of NOP3, NOP4, and NOP6 to use for each macro. I tried to get each one to conform as closely as possible to its target in the 100-500 ns range, measured as the negative pulse width from a high setting to a low setting and back to high, erring on the side of a pulse slightly too long rather than too short. I kept my eye on the positive pulse width, too, though that was intermittently longer (perhaps up to 20%) due to the beginning of the sketch loop.

You can stack these up to get longer delays, although I wouldn't be surprised if there was some non-linearity involved. And if your desired delays get long enough, you're just getting into the territory of delayMicroseconds.
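For instance, a hypothetical double-length pause built by chaining the existing macros might look like this (purely a sketch, not scope-verified, so the chaining overhead would need checking):

Code:
// Two P5 pauses back to back -- roughly 1 microsecond at the P5 setting,
// assuming stacking stays close to linear. Not scope-verified.
#define P10 do { P5; P5; } while (0)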

Of course, the delay of digitalWriteFast is inherently included in the delay of each macro. That's probably what you'd want this delay to be used with anyhow.

I'm sure all the macro expansion bloats the compiled code somewhat, especially with the longer delays, but I doubt it adds up to much. Even with an F_CPU of 96 MHz and a PAUSE of P5, this sketch only occupies 5% of program storage space.

Code:
// Empirically determined by Ed Suominen with an oscilloscope and a good deal of
// pressing Ctrl+U in the Arduino window. No guarantees expressed or implied. Dedicated
// to the public domain.

#define pinNum 13
void setup() {
  pinMode(pinNum, OUTPUT);
}

#define NOP3 "nop\n\t""nop\n\t""nop\n\t"
#define NOP4 "nop\n\t""nop\n\t""nop\n\t""nop\n\t"
#define NOP6 "nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t"

// P1-5 are 100-500 ns pauses, tested with an oscilloscope (2 second
// display persistence) and a Teensy 3.2 compiling with
// Teensyduino/Arduino 1.8.1, "faster" setting
#if F_CPU == 96000000
#define P1 __asm__(NOP4 NOP4)
#define P2 __asm__(NOP6 NOP6 NOP6)
#define P3 __asm__(NOP6 NOP6 NOP6 NOP6 NOP3)
#define P4 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP4)
#define P5 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP3)

#elif F_CPU == 72000000
#define P1 __asm__(NOP6)
#define P2 __asm__(NOP6 NOP6)
#define P3 __asm__(NOP6 NOP6 NOP6 NOP3)
#define P4 __asm__(NOP6 NOP6 NOP6 NOP6 NOP4)
#define P5 __asm__(NOP6 NOP6 NOP6 NOP6 NOP4 NOP4 NOP3)

#elif F_CPU == 48000000
#define P1 __asm__(NOP4)
#define P2 __asm__(NOP6 NOP3)
#define P3 __asm__(NOP6 NOP4 NOP3)
#define P4 __asm__(NOP6 NOP6 NOP6)
#define P5 __asm__(NOP6 NOP6 NOP4 NOP4 NOP3)

#endif

#define PAUSE P5

void loop() {
  noInterrupts();
  digitalWriteFast(pinNum, HIGH); // 1
  PAUSE;
  digitalWriteFast(pinNum, LOW);
  PAUSE;
  digitalWriteFast(pinNum, HIGH); // 2
  PAUSE;
  digitalWriteFast(pinNum, LOW);
  PAUSE;
  digitalWriteFast(pinNum, HIGH); // 3
  PAUSE;
  digitalWriteFast(pinNum, LOW);
  PAUSE;
  interrupts();
}
 
One should mention that interrupts should be disabled if a maximum delay time is desired.

Not sure what you intended to say there. Interrupts will increase the delay time, so disabling them will result in minimum delay, not maximum.
 
Ah, exact makes sense. As to the thread being a couple of months old -- yeah, but posts on here continue to be useful for years!
 
Look for nop in this file :: ...\hardware\teensy\avr\cores\teensy\core_pins.h
like >> asm volatile("nop\n");
Hi defragster, I have no such file because I don't have a Teensy board installed. But I will try to find the answer by iteration: I have a scope for doing the measurements.

New question: how does the void loop() function work? By that I mean, how do the digitalWriteFast(pinNum, LOW); and digitalWriteFast(pinNum, HIGH); commands actually work to make the different #defines of P1 through P4 show up on the output pin?
 
Hello, newbie here. I find this forum a goldmine for Teensy code ideas. Just a quick note on the nano-delays. Attached are pictures of pin 13 without and with slew rate limiting ( PORTC_PCR5 &= ~(0x04); ), using a spring-tip probe (very little loop) and a Tek 100 MHz scope. Works well. The first positive pulse is a bit wider.

Attachments: tek00004.png, tek00005.png
 
How can these macros be adapted to a Teensy 3.6 running at 180 MHz? I need accurate nanosecond delays (10 ns would be fine) for clock-phase synchronization with an external CPU...

Any hint?
 
You would need a good oscilloscope. Then just run the code, measure the pulse width, and adjust (add more NOPs) until it's correct.
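If it helps as a starting point before you can put a scope on it, here is a purely hypothetical 180 MHz branch scaled linearly from the 96 MHz table above (roughly 1.9x the NOP counts). These counts are arithmetic guesses, not measurements, so they still need trimming against a scope:

Code:
// Hypothetical starting point for a Teensy 3.6 at F_CPU == 180000000, scaled
// ~1.9x from the scope-tested 96 MHz values. NOT verified -- adjust on a scope.
#elif F_CPU == 180000000
#define P1 __asm__(NOP6 NOP6 NOP3)
#define P2 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP4)
#define P3 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP3)
#define P4 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP4 NOP3)
#define P5 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP4)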

Very unfortunately, for the moment, I don't own an oscilloscope (that being said, I'll follow your advice and keep enough money back to buy a good one like the Rigol 1054Z).
 

The best I could get was 20 ns on a Teensy 3.6, abandoning loop() - and there is some drift. The Teensy 4 is too fast for my scope and this test (also, I think I have to use a different port). Uncomment the sections below to run the various tests.

Code:
#define LED 13
#define on 32
#define off 0
//Teensy 3.6 240MHz, Fastest with LTO

void setup() {
  pinMode(LED, OUTPUT); 
  //DDRB |=B00100000; 
  noInterrupts();

  while (true)
  {
    //Period - pinMode mean V - DDRB mean V
    // 20ns - 100mv - 680mv
    //digitalWriteFast(LED, HIGH); 
    //digitalWriteFast(LED, LOW);

    //75ns - 1.36V - 176mv
    //digitalWrite(LED, HIGH); 
    //digitalWrite(LED, LOW);

    //482ns - 2.56V - 2.4V
    PORTB |= (on); 
    PORTB &= (off);

    //46ns - 520mv - 840mv
    //digitalWrite(LED, HIGH);
    //digitalWriteFast(LED, LOW); 
  }  

}

void loop() {
  //stay out of Malibu, Lebowski!
}
 
The Teensy 4.1 is so heckin' fast you will need much longer noops!

By experimentation with a scope, I have found that

Code:
void noop() {
  for (uint32_t i=0; i<59; i++) __asm__("nop\n\t");
}

is equal to about 250 nanoseconds.
 
I wonder if the optimizer turns the above code into 59 NOPs rather than emitting loop code wrapping a single NOP. I have found that when a loop does something that always necessarily comes out the same way, the optimizer will figure it out and emit the result rather than my code (leading to a benchmark saying that a 10,000,000-iteration loop took zero nanoseconds).
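For what it's worth, GCC treats a basic asm statement with no operands as implicitly volatile, so the nop itself can't be deleted; the open question is whether the loop around it gets unrolled. A hedged way to pin the loop's shape down in source, at the cost of a memory access per pass, is to make the counter volatile -- a sketch:

Code:
// Sketch only: the volatile counter forces a real load/decrement/store on every
// pass, so the loop is kept as a loop rather than being folded or (in practice)
// unrolled. The extra memory traffic lengthens each iteration, so re-measure.
void noop_v() {
  for (volatile uint32_t i = 0; i < 59; i++) __asm__ volatile ("nop\n\t");
}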

Is there a way to get the assembler output out of the Arduino toolchain?
 

The toolchain generates *.lst files by default. They are copied to the build folder. Here is a link to the user wiki with some detailed information: https://github.com/TeensyUser/doc/wiki/GCC#analyzing-compiler-output. Please note that the stock objdump generates, shall we say, suboptimal output for the T4.x processors. The output of current versions of objdump is much better. More information about this can also be found in the linked pages.
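If you would rather run a newer objdump by hand than rely on the bundled one, a one-liner along these lines does it (the .elf path is just a placeholder for whatever your build folder produced; -d and -C are the standard GNU objdump flags for disassembly and C++ name demangling):

Code:
arm-none-eabi-objdump -d -C /path/to/build/sketch.ino.elf > sketch.lst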
 
Pilot, you are correct that compiler optimizations (loop unrolls, etc) can make it hard to idle for a certain amount of time.

On modern chips the CPU may also use pipelining and other optimizations that can be quite unpredictable.

That's why, to be certain, I just looked at an oscilloscope and adjusted the loop until I got a signal that worked :) That said, I'm sure there is a more sensible way to use timers and get predictable clock signals without just idling.
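One of those more sensible routes on Teensy, if what you actually need is a free-running clock rather than a one-off pause, is to let the PWM hardware generate it. A minimal sketch using the stock Teensyduino analogWriteFrequency()/analogWrite() calls (the pin number and frequency here are only placeholders):

Code:
// Hardware-generated clock: no idling and no compiler-dependent timing.
void setup() {
  analogWriteFrequency(5, 1000000);  // PWM pin 5 at ~1 MHz (placeholder values)
  analogWrite(5, 128);               // ~50% duty cycle at the default 8-bit resolution
}

void loop() {}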
 
Asm is the way to go, really - updates to the compiler or selecting a different optimization level will screw up the hand-selected C code approach.

Having said that, if the clock speed is alterable, then a better approach is using a hardware timer and code that understands the various processor clock settings, so it can set the timer appropriately whatever the processor clock rate.

Or you can time your delay loop at start-up using a known delay (if there is one!) and calibrate without having to know anything about processor architectural details or clock speed.
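That start-up calibration might look something like this rough sketch, using micros() as the known reference (the names and iteration count are only illustrative, and the result is only as good as the spin loop's repeatability):

Code:
// Rough calibration sketch: time a large number of spin iterations against
// micros(), then derive how many iterations a requested delay needs.
static uint32_t ns_per_iter = 1;   // filled in by calibrate()

static void spin(uint32_t n) {
  while (n--) __asm__ volatile ("nop");
}

void calibrate() {
  const uint32_t iters = 100000;
  uint32_t t0 = micros();
  spin(iters);
  uint32_t elapsed_us = micros() - t0;
  ns_per_iter = (elapsed_us * 1000UL) / iters;   // nanoseconds per loop pass
  if (ns_per_iter == 0) ns_per_iter = 1;         // guard against later divide-by-zero
}

void delay_ns_calibrated(uint32_t ns) {
  spin(ns / ns_per_iter);
}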

Even so, this may not work reliably on a processor with an instruction cache...
 
The point of using assembly instead of something else is to deal with situations where timing is critical. The WS281x driver for 8-bit Arduinos uses inline assembly, where the cycles consumed by each instruction are hand-counted so that they add up to something within the requirements. As I recall, they use NOPs as well.

For example, suppose you have to read a pin at certain intervals, and you know how much time your instructions inside a loop take to run, but what about the loop itself? It takes time to increment or decrement the counter, compare the values, and branch-if-not-whatever. If you put some timing-critical code in a function (which isn't inline), how long does it take to call and return from that? These are down to low-level compiler implementation details. Will it push the argument on the stack, or carry it in a register? Next year, when there is a compiler update, will it switch from one way to the other? If you specify that it's to use a register variable, what if the optimizer says "nah I don't feel like it" and pushes it on the stack instead? If they update the optimizer, will it shave 10 nanoseconds off the execution time, causing your code to land at exactly the wrong tick?

In the case of delayNanoseconds(), it's a while-loop, but you have to spend time executing the compare and branch instructions at the end of the loop, and these are not factored into the loop time. So you are delaying for 5,000 nanoseconds, but coming out 5,020 nanoseconds (or so) later.

On the other hand, if you use raw inline assembly, you don't have to worry about what the compiler will do. You still have to worry about the processor's internal optimizations, but that whole layer of question marks goes away. On this platform, since there is no OS with five hundred background tasks, it's fairly simple. If you're not running timers or anything else that deals with interrupts, the question "how long does this take to run" is always deterministic. If you run the same test a thousand times, it's always the same answer.
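To make the hand-counted idea mentioned above concrete with a small (hypothetical) illustration: on a 16 MHz 8-bit AVR each cycle is 62.5 ns, and NOPs are one cycle each, so a fixed pause can be written and annotated directly in cycles, where it stays put no matter what the compiler does around it:

Code:
// Hand-counted pause on a 16 MHz AVR: 1 cycle = 62.5 ns. Cycle counts per the
// AVR instruction set; the total is arithmetic, not a scope measurement.
__asm__ volatile (
  "nop\n\t"   // 1 cycle = 62.5 ns
  "nop\n\t"   // 1 cycle = 62.5 ns
  "nop\n\t"   // 1 cycle = 62.5 ns
);            // 3 cycles ~ 187.5 ns, independent of compiler version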
 