Thread: A set of scope-tested 100-500 ns delay macros

  1. #1

    After some frustration with timing jitter from a short loop like that in delayMicroseconds, I put a set of hundred-nanosecond delay macros together today for bit-banging a second SPI port on my Teensy 3.2 and thought I'd share. Dedicated to the public domain.

    The sketch below is what I used to empirically determine the combinations of NOP3, NOP4, and NOP6 to use for each macro in each case. I tried to get them all to conform as closely as possible to a 100-500 ns negative pulse width, from a high setting to a low setting and then back high again, erring on the side of a pulse too long rather than too short. I kept my eye on the positive pulse width, too, though that was intermittently longer (perhaps up to 20%) due to the beginning of the sketch loop.

    You can stack these up to get longer delays, although I wouldn't be surprised if some non-linearity were involved. And once your desired delays get long enough, you're in delayMicroseconds territory anyway.

    Of course, the delay of digitalWriteFast is inherently included in the delay of each macro. That's probably what you'd be using these delays with anyhow.

    I'm sure all the macro expansion bloats the compiled code somewhat, especially with longer delays, but I doubt if it adds up to much. Even with an F_CPU of 96MHz and a PAUSE of P5, I'm only seeing this sketch occupy 5% of program storage space.
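    For anyone adapting the macros to another clock speed, a rough starting point before scope tuning is the number of core cycles that covers the target pulse width. This is only a minimal sketch: cyclesForNs is a hypothetical helper (not part of Teensyduino), and it assumes one NOP costs roughly one core cycle, which the measurements suggest is only approximately true.

```cpp
#include <cstdint>

// Hypothetical helper (not a Teensyduino function): number of core clock
// cycles needed to cover at least `ns` nanoseconds at `f_cpu` Hz, rounded up.
// Assumption: one NOP takes about one cycle; the scope has the final say.
constexpr uint32_t cyclesForNs(uint32_t ns, uint32_t f_cpu) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(ns) * f_cpu + 999999999ULL) / 1000000000ULL);
}

// 100 ns at 96 MHz is 9.6 cycles, which rounds up to 10.
static_assert(cyclesForNs(100, 96000000u) == 10, "100 ns @ 96 MHz");
// 500 ns at 96 MHz is exactly 48 cycles.
static_assert(cyclesForNs(500, 96000000u) == 48, "500 ns @ 96 MHz");
```

    Compared with the sketch below, P1 at 96 MHz uses 8 NOPs against 10 computed cycles (the difference presumably being digitalWriteFast itself), while P5 uses 49 NOPs against 48 cycles, so the relationship is not strictly linear and the computed figure is only a first guess.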

    Code:
    // Empirically determined by Ed Suominen with an oscilloscope and a good deal of
    // pressing Ctrl+U in the Arduino window. No guarantees expressed or implied. Dedicated
    // to the public domain.
    
    #define pinNum 13
    void setup() {
      pinMode(pinNum, OUTPUT);
    }
    
    #define NOP3 "nop\n\t""nop\n\t""nop\n\t"
    #define NOP4 "nop\n\t""nop\n\t""nop\n\t""nop\n\t"
    #define NOP6 "nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t"
    
    // P1-5 are 100-500 ns pauses, tested with an oscilloscope (2 second
    // display persistence) and a Teensy 3.2 compiling with
    // Teensyduino/Arduino 1.8.1, "faster" setting
    #if F_CPU == 96000000
    #define P1 __asm__(NOP4 NOP4)
    #define P2 __asm__(NOP6 NOP6 NOP6)
    #define P3 __asm__(NOP6 NOP6 NOP6 NOP6 NOP3)
    #define P4 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP4)
    #define P5 __asm__(NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP6 NOP4 NOP3)
    
    #elif F_CPU == 72000000
    #define P1 __asm__(NOP6)
    #define P2 __asm__(NOP6 NOP6)
    #define P3 __asm__(NOP6 NOP6 NOP6 NOP3)
    #define P4 __asm__(NOP6 NOP6 NOP6 NOP6 NOP4)
    #define P5 __asm__(NOP6 NOP6 NOP6 NOP6 NOP4 NOP4 NOP3)
    
    #elif F_CPU == 48000000
    #define P1 __asm__(NOP4)
    #define P2 __asm__(NOP6 NOP3)
    #define P3 __asm__(NOP6 NOP4 NOP3)
    #define P4 __asm__(NOP6 NOP6 NOP6)
    #define P5 __asm__(NOP6 NOP6 NOP4 NOP4 NOP3)
    
    #endif
    
    #define PAUSE P5
    
    void loop() {
      noInterrupts();
      digitalWriteFast(pinNum, HIGH); // 1
      PAUSE;
      digitalWriteFast(pinNum, LOW);
      PAUSE;
      digitalWriteFast(pinNum, HIGH); // 2
      PAUSE;
      digitalWriteFast(pinNum, LOW);
      PAUSE;
      digitalWriteFast(pinNum, HIGH); // 3
      PAUSE;
      digitalWriteFast(pinNum, LOW);
      PAUSE;
      interrupts();
    }
    Last edited by edsuom; 03-24-2017 at 01:09 AM.

  2. #2
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    22,767
    I've added a link to this thread, on this page.

    https://www.pjrc.com/teensy/td_timing_delay.html

    Hopefully it will help others to find these macros.

    Maybe in time others will be done for Teensy LC, 3.5, and 3.6 at their many clock speeds?

  3. #3
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,928
    One should mention that interrupts should be disabled if a maximum delay time is desired.

  4. #4
    Quote Originally Posted by Frank B View Post
    One should mention that interrupts should be disabled if a maximum delay time is desired.
    Not sure what you intended to say there. Interrupts will increase the delay time, so disabling them will result in minimum delay, not maximum.

  5. #5
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,928
    I think I intended to write "exact". ;-) Anyway, this post is a couple of months old.

  6. #6
    Ah, exact makes sense. As to being a couple of months old -- yeah, but posts on here continue to be useful for years!

  7. #7
    Junior Member
    Join Date
    Sep 2017
    Posts
    3
    Hi, I'm using a Teensy 3.6 with a clock rate of 180 MHz. What should my setup be for P1 to P5?

  8. #8
    Junior Member
    Join Date
    Dec 2018
    Posts
    5
    Hi, what would be the __asm__ or NOP instructions for an Arduino UNO (F_CPU == 16000000)?

  9. #9
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    12,412
    Quote Originally Posted by brice3010 View Post
    Hi, what would be the __asm__ or NOP instructions for an Arduino UNO (F_CPU == 16000000)?
    Look for nop in this file :: ...\hardware\teensy\avr\cores\teensy\core_pins.h
    like >> asm volatile("nop\n");

  10. #10
    Junior Member
    Join Date
    Dec 2018
    Posts
    5
    Quote Originally Posted by defragster View Post
    Look for nop in this file :: ...\hardware\teensy\avr\cores\teensy\core_pins.h
    like >> asm volatile("nop\n");
    Hi defragster, I have no such file because I have no teensy board installed. But I will try to find the answer by iteration: I have a scope for doing the measurements.

    New question: How does the void loop() function work? By that I mean how do these commands digitalWriteFast(pinNum, LOW); and digitalWriteFast(pinNum, HIGH); actually work to get the different #define's of P1 through P4 to be shown on the output pin?

  11. #11
    Junior Member
    Join Date
    Dec 2018
    Location
    Ohio
    Posts
    2
    Hello, newbie here. Find this forum a goldmine for Teensy code ideas. Just a quick note on the nano-delays. Attached are pictures of pin 13 w/o and w/ slew rate limiting ( PORTC_PCR5 &= ~(0x04); ) using a spring-tip probe (very little loop) and a Tek 100 MHz scope. Works well. First positive pulse is a bit wider.

    [Attached scope captures: tek00004.png, tek00005.png]

  12. #12
    Member
    Join Date
    Mar 2019
    Location
    Bordeaux / France
    Posts
    69
    How can these macros be adapted to a Teensy 3.6 running at 180 MHz? I need accurate nanosecond delays (10 ns resolution would be fine) for clock-phase synchronization with an external CPU...

    Any hints?

  13. #13
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    22,767
    Quote Originally Posted by Tactif CIE View Post
    How can these macros be adapted to a Teensy 3.6 running at 180MHZ ?
    ....
    Any hint ?
    You would need a good oscilloscope. Then just run the code, measure the pulse width, and adjust (add more NOPs) until it's correct.
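    The measure-and-adjust step can be turned into a little arithmetic so that each scope reading suggests how many NOPs to add next. The following is a minimal sketch, not a tested recipe: nopsToAdd is a hypothetical helper, the 88 ns reading in the comment is made up for illustration, and it assumes each extra NOP adds about one core cycle.

```cpp
#include <cstdint>

// Hypothetical helper: given a pulse width measured on the scope and the
// target width (both in ns), estimate how many NOPs to add so the pulse is
// at least as long as the target. Returns 0 if the pulse is already long
// enough. Assumption: one NOP adds roughly one cycle at `f_cpu` Hz.
constexpr uint32_t nopsToAdd(uint32_t measured_ns, uint32_t target_ns,
                             uint32_t f_cpu) {
  return measured_ns >= target_ns
             ? 0u
             : static_cast<uint32_t>(
                   (static_cast<uint64_t>(target_ns - measured_ns) * f_cpu +
                    999999999ULL) /
                   1000000000ULL);
}

// Illustrative only: a made-up 88 ns reading against a 100 ns target at
// 180 MHz is 12 ns short, about 2.16 cycles, so try 3 more NOPs and re-measure.
static_assert(nopsToAdd(88, 100, 180000000u) == 3, "12 ns short @ 180 MHz");
```

    At 180 MHz one cycle is about 5.6 ns, so adjustments of this kind can only land within a cycle or so of the target; the scope still decides when it is "correct".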

  14. #14
    Member
    Join Date
    Mar 2019
    Location
    Bordeaux / France
    Posts
    69
    Quote Originally Posted by PaulStoffregen View Post
    You would need a good oscilloscope. Then just run the code, measure the pulse width, and adjust (add more NOPs) until it's correct.
    Unfortunately, I don't own an oscilloscope at the moment (that said, I'll follow your advice, save up, and buy a good one like the Rigol 1054Z).

  15. #15
    Junior Member
    Join Date
    Sep 2019
    Posts
    1
    Quote Originally Posted by Tactif CIE View Post
    How can these macros be adapted to a Teensy 3.6 running at 180MHZ ? I need accurate nanoseconds delays (10ns would be fine) for clock phases synchronization with an external CPU...

    Any hint ?
    Best I could get was 20ns on a Teensy 3.6, abandoning loop() - and there is some drift. Teensy 4 is too fast for my scope and this test (also, I think I have to use a different port). Uncomment the sections below to run the various tests.

    Code:
    #define LED 13
    #define on 32
    #define off 0
    //Teensy 3.6 240MHz, Fastest with LTO
    
    void setup() {
      pinMode(LED, OUTPUT); 
      //DDRB |=B00100000; 
      noInterrupts();
    
      while (true)
      {
        //Period - pinMode mean V - DDRB mean V
        // 20ns - 100mv - 680mv
        //digitalWriteFast(LED, HIGH); 
        //digitalWriteFast(LED, LOW);
    
        //75ns - 1.36V - 176mv
        //digitalWrite(LED, HIGH); 
        //digitalWrite(LED, LOW);
    
        //482ns - 2.56V - 2.4V
        PORTB |= (on); 
        PORTB &= (off);
    
        //46ns - 520mv - 840mv
        //digitalWrite(LED, HIGH);
        //digitalWriteFast(LED, LOW); 
      }  
    
    }
    
    void loop() {
      //stay out of Malibu, Lebowski!
    }

  16. #16
    Junior Member
    Join Date
    Jan 2020
    Location
    New York City
    Posts
    15
    The Teensy 4.1 is so heckin' fast you will need much longer noops!

    By experimentation with a scope, I have found that

    Code:
    void noop() {
      for (uint32_t i=0; i<59; i++) __asm__("nop\n\t");
    }
    is equal to about 250 nanoseconds.

  17. #17
    Senior Member
    Join Date
    Jul 2020
    Posts
    174
    I wonder if the optimizer turns the above code into 59 NOPs rather than emitting loop code wrapping a single NOP. I have found that when a loop does something that always necessarily comes out the same way, the optimizer will figure it out and emit the result rather than my code (leading to a benchmark saying that a 10,000,000 loop took zero nanoseconds.)
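    One way to sidestep the question of whether the loop gets unrolled is to make the NOP count a compile-time constant, so there is no loop for the optimizer to keep or remove. This is a minimal sketch assuming GCC or Clang (for the always_inline attribute and inline asm syntax); nops is a hypothetical name, and the timing of the result would still need checking on a scope.

```cpp
// Hypothetical helper: nops<N>() expands to exactly N NOP instructions with
// no loop counter, compare, or branch. Each instantiation emits one NOP and
// then inlines the next smaller one.
template <unsigned N>
inline __attribute__((always_inline)) void nops() {
  __asm__ volatile("nop");
  nops<N - 1>();
}

// Recursion terminator: zero NOPs.
template <>
inline __attribute__((always_inline)) void nops<0>() {}
```

    Usage would be something like nops<59>(); in place of the loop. The trade-off is code size, since every call site carries all N instructions.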

    Is there a way to get the assembler output out of the Arduino toolchain?

  18. #18
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,167
    Is there a way to get the assembler output out of the Arduino toolchain?
    The toolchain generates *.lst files by default. They are copied to the build folder. Here is a link to the user wiki with some detailed information: https://github.com/TeensyUser/doc/wi...ompiler-output. Please note that the stock objdump generates, say, suboptimal output for the T4.x processors. The output of current versions of objdump is much better. More information about this can also be found in the linked pages.

  19. #19
    Junior Member
    Join Date
    Jan 2020
    Location
    New York City
    Posts
    15
    Pilot, you are correct that compiler optimizations (loop unrolls, etc) can make it hard to idle for a certain amount of time.

    On modern chips the CPU may also use pipelining and other optimizations that can be quite unpredictable.

    That's why, to be certain, I just looked at an oscilloscope and adjusted the loop until I got a signal that worked. That said, I'm sure there is a more sensible way to use timers and get predictable clock signals without just idling.

  20. #20
    Senior Member
    Join Date
    Jul 2020
    Posts
    406
    asm is the way to go really - updates to the compiler or selecting a different optimization level will screw up
    the hand-selected C code approach.

    Having said that if the clock speed is alterable then a better approach is using a hardware timer and code
    that understands the various processor clock settings so it can set the timer appropriately whatever the processor
    clock rate.

    Or you can time your delay loop at start-up using a known delay (if there is one!) and calibrate without having
    to know anything about processor architectural details or clock speed.

    Even so this may not work reliably for a processor with an instruction cache...
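    The start-up calibration idea can be sketched roughly as follows. This is a host-side illustration, not Teensy code: it uses std::chrono::steady_clock as the "known" reference (on a board you would substitute whatever trusted timer is available), and spin, calibrateSpinsPerMicrosecond, and delayApproxNs are all hypothetical names.

```cpp
#include <chrono>
#include <cstdint>

// Busy loop whose per-iteration cost is what we want to measure. The empty
// volatile asm statement keeps the compiler from deleting the loop.
inline void spin(uint32_t iterations) {
  for (uint32_t i = 0; i < iterations; ++i) {
    __asm__ volatile("");
  }
}

// Times a large number of iterations against the reference clock and returns
// the measured loop iterations per microsecond (never less than 1).
inline uint32_t calibrateSpinsPerMicrosecond() {
  using clock = std::chrono::steady_clock;
  const uint32_t kProbe = 1000000;  // iterations used for the measurement
  const auto start = clock::now();
  spin(kProbe);
  const auto elapsed_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start)
          .count();
  if (elapsed_ns <= 0) return kProbe;  // reference clock too coarse to tell
  const uint64_t per_us =
      static_cast<uint64_t>(kProbe) * 1000u / static_cast<uint64_t>(elapsed_ns);
  return per_us == 0 ? 1u : static_cast<uint32_t>(per_us);
}

// Approximate delay using the calibration result; short delays round down.
inline void delayApproxNs(uint32_t ns, uint32_t spins_per_us) {
  spin(static_cast<uint32_t>(static_cast<uint64_t>(ns) * spins_per_us / 1000u));
}
```

    The calibration averages over a million iterations, so it hides exactly the effects mentioned above: the first pass through a cold cache may run slower than the average, and an interrupt during the probe will skew the result.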

  21. #21
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,167
    What's wrong with delayNanoseconds()? https://github.com/PaulStoffregen/co...e_pins.h#L1804
    If you need it for T3.x it should be sufficient to replace F_CPU_ACTUAL by F_CPU in the linked code.

  22. #22
    Senior Member
    Join Date
    Jul 2020
    Posts
    174
    The point of using assembly instead of something else is to deal with situations where timing is critical. The WS281x driver for 8-bit Arduinos uses inline assembly, where the cycles consumed by each instruction are hand-counted so that they add up to something within the requirements. As I recall, they use NOPs as well.

    For example, suppose you have to read a pin at certain intervals, and you know how much time your instructions inside a loop take to run, but what about the loop itself? It takes time to increment or decrement the counter, compare the values, and branch-if-not-whatever. If you put some timing-critical code in a function (which isn't inline), how long does it take to call and return from that? These are down to low-level compiler implementation details. Will it push the argument on the stack, or carry it in a register? Next year, when there is a compiler update, will it switch from one way to the other? If you specify that it's to use a register variable, what if the optimizer says "nah I don't feel like it" and pushes it on the stack instead? If they update the optimizer, will it shave 10 nanoseconds off the execution time, causing your code to land at exactly the wrong tick?

    In the case of delayNanoseconds(), it's a while-loop, but you have to spend time executing the compare and branch instructions at the end of the loop, and these are not factored into the loop time. So you are delaying for 5,000 nanoseconds, but coming out 5,020 nanoseconds (or so) later.
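    The overshoot can be illustrated with a small host-side model of a polling wait. To be clear, this is not the Teensy core's source: FakeCycleCounter and waitCycles are hypothetical names, and the counter is simulated so that every read costs a fixed number of cycles, standing in for the compare-and-branch overhead of each pass through the loop.

```cpp
#include <cstdint>

// Stand-in for a free-running hardware cycle counter. Each read advances it
// by `step` cycles, modelling the cost of one pass through the polling loop.
struct FakeCycleCounter {
  uint32_t value;
  uint32_t step;
  uint32_t read() {
    value += step;
    return value;
  }
};

// Polls the counter until at least `cycles` have elapsed since the first
// read, then returns how many cycles actually elapsed.
inline uint32_t waitCycles(FakeCycleCounter& counter, uint32_t cycles) {
  const uint32_t begin = counter.read();
  uint32_t now = begin;
  while (now - begin < cycles) {  // unsigned subtraction survives wraparound
    now = counter.read();
  }
  return now - begin;
}
```

    With a counter starting near the 32-bit limit and a step of 4, asking for 50 cycles returns after 52: the wait is never short, but it can run long by up to one polling step, plus whatever the call and return cost on real hardware.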

    On the other hand, if you use raw inline assembly, you don't have to worry about what the compiler will do. You still have to worry about the processor's internal optimizations, but that whole layer of question marks goes away. On this platform, since there is no OS with five hundred background tasks, it's fairly simple. If you're not running timers or anything else that deals with interrupts, the question "how long does this take to run" is always deterministic. If you run the same test a thousand times, it's always the same answer.
