Sub Micro Second Pulses

jonr

I needed sub-microsecond square wave pulses, so I wrote this. It looks fine on a scope. Comments are appreciated.

Code:
// Routine to delay for a specified number of nanoseconds
// NOTE:  minimum pulse width is ~700 ns, accuracy is ~ -0/+40 ns
// NOTE:  you can't trust this code:
//        compiler or library changes will change timing overhead
//        CPU speed will affect timing

// Jon Zeeff  V1.1
// Public Domain
// Written for teensy 3.1

#define LED_PIN 13

void setup() {
  delay(1000);
  pinMode(LED_PIN, OUTPUT);
  Serial.println("hello");
}

void loop() {

  //Setup_Nano_Delay(4000000000);
  Setup_Nano_Delay(700);

  Serial.println("start");
  delay(10);  // allow start message to go out
  noInterrupts();
  digitalWriteFast(LED_PIN, 1);
  Nano_Delay();
  digitalWriteFast(LED_PIN, 0);
  interrupts();
  Serial.println("stop\n");

  delay(2000);
}

// delay for a given number of nanoseconds
// less sensitive to interrupts and DMA
// max delay is 4 seconds

constexpr double   CLOCK_RATE = 96.00000E6;     // MCU clock rate - measure it for best accuracy
constexpr unsigned NANO_OVERHEAD = 470;         // overhead - adjust as needed
constexpr unsigned NANO_JITTER = 18;            // adjusts for jitter prevention - leave at 18

// prepare before, so less delay later
static uint32_t nano_ticks;

void Setup_Nano_Delay(uint32_t nanos)
{
  // set up cycle counter
  ARM_DEMCR |= ARM_DEMCR_TRCENA;
  ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;

  // improve teensy 3.1 clock accuracy
  OSC0_CR = 0x2;

  // we can't do less than this
  if (nanos < NANO_OVERHEAD)
     nanos = NANO_OVERHEAD;
   
  // how many cycles to wait
  nano_ticks = ((nanos - NANO_OVERHEAD) / (1.0E9 / CLOCK_RATE)) + .5;
  
  if (nano_ticks < NANO_JITTER)
     nano_ticks = NANO_JITTER;
          
} // Setup_Nano_Delay()

// Do the delay specified above.
// You may want to disable interrupts before and after

FASTRUN void Nano_Delay(void)
{
  uint32_t start_time = ARM_DWT_CYCCNT;
  uint32_t loop_ticks = nano_ticks - NANO_JITTER;

  // loop until time is almost up
  while ((ARM_DWT_CYCCNT - start_time) < loop_ticks) {
     // could do other things here
  }

  if (NANO_JITTER) {   // compile time option

    unsigned r;                   // for debugging ("register" is deprecated in modern C++)
    
    // delay for the remainder using single instructions
    switch (r = (nano_ticks - (ARM_DWT_CYCCNT - start_time))) {
      case 18: __asm__ volatile("nop" "\n\t");
      case 17: __asm__ volatile("nop" "\n\t");
      case 16: __asm__ volatile("nop" "\n\t");
      case 15: __asm__ volatile("nop" "\n\t");
      case 14: __asm__ volatile("nop" "\n\t");
      case 13: __asm__ volatile("nop" "\n\t");
      case 12: __asm__ volatile("nop" "\n\t");
      case 11: __asm__ volatile("nop" "\n\t");
      case 10: __asm__ volatile("nop" "\n\t");
      case 9: __asm__ volatile("nop" "\n\t");
      case 8: __asm__ volatile("nop" "\n\t");
      case 7: __asm__ volatile("nop" "\n\t");
      case 6: __asm__ volatile("nop" "\n\t");
      case 5: __asm__ volatile("nop" "\n\t");
      case 4: __asm__ volatile("nop" "\n\t");
      case 3: __asm__ volatile("nop" "\n\t");
      case 2: __asm__ volatile("nop" "\n\t");
      case 1: __asm__ volatile("nop" "\n\t");
      default:
           break;
    }  // switch()
  
  } // if
 
}  // Nano_Delay()
 
You could try to use one of the timers and do it "in hardware". That would be more reliable, and you could use interrupts or DMA without affecting the timing.
 
Since I had some questions via private message: the basic idea is to start a hardware timer (here, the CPU cycle counter), wait until it has almost run out, and then run NOPs for the remaining time. The timer alone is too coarse, because each pass through the polling loop takes several cycles.
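
For anyone checking the numbers, here is a minimal host-side sketch of the conversion that Setup_Nano_Delay() performs. The helper name nanos_to_ticks is hypothetical and not part of the posted code; its default arguments simply copy CLOCK_RATE, NANO_OVERHEAD and NANO_JITTER from the listing above.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the arithmetic in Setup_Nano_Delay().
// overhead_ns and min_ticks play the roles of NANO_OVERHEAD and NANO_JITTER.
std::uint32_t nanos_to_ticks(std::uint32_t nanos,
                             double clock_hz = 96.0e6,
                             std::uint32_t overhead_ns = 470,
                             std::uint32_t min_ticks = 18)
{
  // Requests shorter than the fixed overhead are clamped to it.
  if (nanos < overhead_ns)
    nanos = overhead_ns;

  // One CPU cycle lasts 1e9 / clock_hz nanoseconds (about 10.4 ns at 96 MHz).
  const double ns_per_tick = 1.0e9 / clock_hz;

  // Round to the nearest whole cycle.
  std::uint32_t ticks =
      static_cast<std::uint32_t>((nanos - overhead_ns) / ns_per_tick + 0.5);

  // Never go below the number of cycles reserved for the NOP ladder.
  if (ticks < min_ticks)
    ticks = min_ticks;
  return ticks;
}
```

With the defaults, a 700 ns request leaves 230 ns after the overhead, which is about 22 cycles at 96 MHz, so only 4 cycles are spent polling and the remaining 18 are left to the NOP switch.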
 
Exactly what I needed! You just saved me several hours. Thanks!
 
Thanks guys, my application was DShot for RC motor control. But in the end I had to resort to inline code with null loops and NOPs for 16 MHz devices.
 
When I wrote a driver that needed 400-nanosecond-wide pulses, I generated a waveform in RAM and used DMA to transfer it to the PWM generator. The 'scope demonstrated that it was quite accurate. It doesn't take much RAM to do this, and other than issuing instructions to the DMA controller, the CPU is free to do its own thing (can process interrupts, and so on.) The DMA controller can be programmed to transfer multiple chained blocks, and you can even do a ring buffer if you want.
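
A rough sketch of the "waveform in RAM" part, written as plain C++ so it can be tried on a PC. The function name build_pulse_train and the one-byte-per-sample layout are assumptions for illustration, not the driver described above, and the DMA setup itself is hardware specific and not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: build a buffer of output states, one entry per sample
// period, that a DMA channel could later stream to a GPIO or PWM register.
// sample_ns must be non-zero.
std::vector<std::uint8_t> build_pulse_train(std::uint32_t pulse_ns,
                                            std::uint32_t gap_ns,
                                            std::size_t pulse_count,
                                            std::uint32_t sample_ns)
{
  // Round each duration to the nearest whole number of samples.
  const std::size_t high_samples = (pulse_ns + sample_ns / 2) / sample_ns;
  const std::size_t low_samples = (gap_ns + sample_ns / 2) / sample_ns;

  std::vector<std::uint8_t> wave;
  wave.reserve(pulse_count * (high_samples + low_samples));
  for (std::size_t i = 0; i < pulse_count; ++i) {
    wave.insert(wave.end(), high_samples, std::uint8_t{1});  // pulse
    wave.insert(wave.end(), low_samples, std::uint8_t{0});   // gap
  }
  return wave;
}
```

For example, build_pulse_train(400, 800, 2, 50) yields 48 bytes: 8 high samples and 16 low samples, twice. That is the sense in which it doesn't take much RAM.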
 
That depends greatly on the microcontroller, bus, and signaling protocol. If there are data + clock lines, then you can have as many bits per second as states (low or high) per second. On the other hand, if there is no dedicated clock line, then the data line has to be self-clocking, and that usually means one bit takes more than one high/low state to transfer.

The DMA on any modern Teensy can support 400-nanosecond-wide pulses (as evidenced by OctoWS2811) and I would guess it can get all the way down to double-digit nanoseconds. For precise numbers, you would have to look at the particular model's CPU manual or datasheet, which can be found here: https://www.pjrc.com/teensy/datasheets.html

One good example of this is the OctoWS2811 library, which can drive anywhere from 1 to 8 NeoPixel arrays. Up to 8 PWM pins are driven by separate DMA channels. The total throughput is 20,000,000 states/second. Because the WS281x protocol requires three states to transmit one bit, that gives you 6.66 million bits/second, minus latch time. (At the end of every frame there's a continuous low signal, 50 microseconds I think, which means "latch." That causes the shift registers inside the LEDs to dump their contents to their PWM generators, resulting in the display of whatever color was sent.)
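
To make the "three states per bit" point concrete, here is a small sketch that assumes each bit is split into three equal time slots: a 0 bit is high for the first slot only, a 1 bit is high for the first two. The function names are hypothetical and this is not OctoWS2811's actual code.

```cpp
#include <array>
#include <cstdint>

// Encode one byte, most significant bit first, as three output states per bit:
// a 0 bit becomes high, low, low; a 1 bit becomes high, high, low.
std::array<std::uint8_t, 24> encode_ws281x_byte(std::uint8_t value)
{
  std::array<std::uint8_t, 24> states{};
  for (int bit = 0; bit < 8; ++bit) {
    const bool one = ((value >> (7 - bit)) & 1u) != 0;
    states[3 * bit + 0] = 1;            // every bit starts high
    states[3 * bit + 1] = one ? 1 : 0;  // middle slot carries the data
    states[3 * bit + 2] = 0;            // every bit ends low
  }
  return states;
}

// Payload bit rate when each bit costs states_per_bit output states.
constexpr double payload_bits_per_second(double states_per_second,
                                         int states_per_bit)
{
  return states_per_second / states_per_bit;
}
```

Plugging in the figures quoted above, payload_bits_per_second(20.0e6, 3) comes to roughly 6.67 million bits/second before latch time is subtracted.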

On the more extreme end, if the DMA and PWM controllers can support 50-nanosecond pulses, you would be getting 20 megabits/sec on a single pin, which exceeds the 12 megabits/sec of full-speed USB. (This doesn't take into account framing and error correction, so the usable rate is a little lower than that. NeoPixels have no error correction, so their protocol is more straightforward.)
 
What is the maximum frequency it will support on a Teensy 3.5?

To answer your question, I ran this code on a Teensy 3.5 and connected my oscilloscope.

Code:
void setup() {
  pinMode(13, OUTPUT);
  PORTC_PCR5 &= ~0x04; // disable slew rate limit
  noInterrupts();
  while (1) {
    digitalWriteFast(13, HIGH);
    digitalWriteFast(13, LOW);
    digitalWriteFast(13, HIGH);
    digitalWriteFast(13, LOW);
    digitalWriteFast(13, HIGH);
    digitalWriteFast(13, LOW);
  }
}

void loop() {
}

[Image: file.png, oscilloscope capture]

As you can see in the scope's measurements, the burst of 3 pulses is at 60 MHz. But the loop overhead causes a substantial delay between bursts, giving a 40 MHz overall pulse rate for this particular example.

Hopefully the loop overhead illustrates that even though you can get very fast pulses using digitalWriteFast(), as a practical matter the surrounding code matters quite a lot for any real application. This example also completely disables interrupts, but in normal use cases they also come into play.

Or I guess I could have just answered: 60 MHz. That is the maximum!
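
The two rates quoted above also tell you how much time the loop branch costs. A small worked sketch, where the helper names are hypothetical and the 120 MHz CPU clock is an assumption (the default speed for a Teensy 3.5) rather than something read off the scope:

```cpp
// Time taken by a number of pulses at a given pulse rate, in nanoseconds.
constexpr double pulses_to_ns(int pulses, double rate_hz)
{
  return pulses * 1.0e9 / rate_hz;
}

// Extra time per loop pass: time at the overall rate minus time at burst rate.
constexpr double loop_overhead_ns(int pulses_per_loop, double burst_hz,
                                  double overall_hz)
{
  return pulses_to_ns(pulses_per_loop, overall_hz) -
         pulses_to_ns(pulses_per_loop, burst_hz);
}

// Convert a duration to CPU cycles; clock_hz is an assumed CPU frequency.
constexpr double ns_to_cycles(double ns, double clock_hz)
{
  return ns * clock_hz / 1.0e9;
}
```

Three pulses at 60 MHz take 50 ns, while three pulses at the 40 MHz overall rate take 75 ns, so loop_overhead_ns(3, 60.0e6, 40.0e6) is 25 ns per pass. At an assumed 120 MHz that is about 3 CPU cycles for the jump back to the top of the loop.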
 
When I wrote a driver that needed 400-nanosecond-wide pulses, I generated a waveform in RAM and used DMA to transfer it to the PWM generator. The 'scope demonstrated that it was quite accurate. It doesn't take much RAM to do this, and other than issuing instructions to the DMA controller, the CPU is free to do its own thing (can process interrupts, and so on.) The DMA controller can be programmed to transfer multiple chained blocks, and you can even do a ring buffer if you want.

Hi, I think I am trying to do something similar to what you describe here. I am using a Teensy 4.1. I would like to generate signals that are around 100 ns wide for on/off keying at 10 MHz. Is it possible to update the PWM module this quickly using DMA? Would you be able to share your driver code? Thanks!
 