Simple High Accuracy Timer advice.

Hi guys. I have a finished, working project on an Arduino Nano: a high-accuracy pulse delay unit (Nano? High accuracy? I hear you say). The remit was to accept a 'start' pulse and output a 'completed' pulse at the end of a predetermined time period, and to get as close to 20ns accuracy as possible. Nothing more! The functional side was a piece of cake of course; the accuracy side was never going to be achievable with the Arduino's 16MHz clock, though I got pretty damned near to it.

I set up a process to take in the required preset time over USB from the IDE, calculate how that translates into clock ticks, and then into full timer loops plus a residual count. I then adjusted those figures to guard against a very small residual, which would leave the final partial loop ending very close to the previous overflow. Knowing that the counter keeps running while other interrupt housekeeping might be taking place, I worried that this could occasionally allow a small number of cycles to be lost as the interrupts were reconfigured at the end of the final full loop. So if the residual clock count was below a certain very low minimum, I reduced the full loop count by one, added half a full loop to the residual, and added a half-count loop at the start, which shifts the final residual to the middle of a loop.
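
For illustration, the splitting works out something like this (a minimal sketch with my own names and an assumed threshold, not the actual code):

Code:
// Convert a requested delay into full 16-bit overflow loops plus a residual,
// and push a tiny residual to the middle of a loop so it never lands right
// after an overflow. The threshold value is an assumption for illustration.
const uint32_t F_CPU_HZ       = 16000000UL;  // Nano clock
const uint32_t TICKS_PER_LOOP = 65536UL;     // one 16-bit timer overflow
const uint32_t MIN_RESIDUAL   = 64;          // "too close to the overflow" limit

void splitDelay(uint32_t delay_us, uint32_t &startCount,
                uint32_t &fullLoops, uint32_t &residual) {
  uint64_t ticks = (uint64_t)delay_us * (F_CPU_HZ / 1000000UL);  // total clock ticks
  fullLoops  = (uint32_t)(ticks / TICKS_PER_LOOP);
  residual   = (uint32_t)(ticks % TICKS_PER_LOOP);
  startCount = 0;
  if (residual < MIN_RESIDUAL && fullLoops > 0) {
    fullLoops -= 1;                   // drop one full loop...
    startCount = TICKS_PER_LOOP / 2;  // ...run a half loop at the start instead...
    residual  += TICKS_PER_LOOP / 2;  // ...so the residual now ends mid-loop
  }
}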

Of course, this was either going to be a very rare event, or probably not necessary at all, but it worked well with real figures, since all the calculations were done before the count was triggered. The timer count was then started when the pulse was received, and the overflows were captured to another variable so that the active interrupts could be changed and set up for the different stages as required, without worrying about lost counts during that process. As I said, that may not be necessary, but I didn't know enough about interrupts to know whether it was a genuine worry or not.

This gave excellent results when measured with an ultra-high-accuracy dedicated card in the professional seismic recorder for which the unit is meant to act as a test bed. With its 16MHz clock the Arduino has a resolution of 62.5ns (one clock period), which I got close to. Drift is another problem of course, but it proved to be not as bad as many Arduino users said they had experienced.

So basically the unit was close to the required spec. I am now twitchy to take things to another level of accuracy, and I am going to set up the same functionality on a Teensy 4.0 with its close-to-1ns capability. I'm a Teensy virgin! The programming is no problem, but through working with the Arduino I learned that there are often "peculiarities" in how you should approach this on a particular board. I wondered if anyone could point out whether there is anything unique or odd about the Teensy 4.0's timer setup which might save me some time in discovering it?
 
The internal processor runs at 600MHz - I'm not sure the I/O uses that clock, more likely it's using the 150MHz peripheral clock; that seems to be the resolution I saw using GPIO in tight loops. Still, that's 6.7ns resolution...

Note that the internals of a fast processor are typically clocked faster than the I/O circuitry - driving 600MHz edges to an external pin is expensive in power and an EMI nightmare in practice unless it's something like an LVDS pair.
 
I tried the simplest brute force approach just now.

Code:
void setup() {
  analogWriteFrequency(2, 100000);  // create test pulses
  analogWrite(2, 64);
  pinMode(4, INPUT);  // short pin 2 to pin 4
  pinMode(6, OUTPUT);
  digitalWriteFast(6, LOW);
  noInterrupts();
}

void loop() {
  while (digitalReadFast(4) == LOW) ;   // wait for the input rising edge on pin 4
  //asm("nop");
  digitalWriteFast(6, HIGH);            // output pulse starts here
  delayNanoseconds(50);                 // 50 ns output pulse
  digitalWriteFast(6, LOW);
  while (digitalReadFast(4) == HIGH) ;  // wait for the input to go low again
}

Sadly, it seems the time between the rising edge being detected and the rising edge being output is a little over 40ns.

[oscilloscope screenshot: input edge to output edge delay]

There's probably some way with the timers, but there you'd be working with 150 MHz clock speed since they all run from the peripheral clock.
 
Silly, but it never occurred to me to mention that the delay has to be settable anywhere from 1ms up to 1sec, and it should be precisely what is demanded. A few preset values is not going to cut the mustard, which rules out a lot of innovative approaches I think. I am actually more of a hardware engineer (electronics design) than a programmer, so the idea of hard-wiring something simple would have been great for me, but most hardware approaches are not accurate enough, given their reliance on component values, and don't give accurate, continuous selection between the upper and lower limits. I love the twisted pair reflection method though! Highly creative thinking there. At least with the digital approach you can specify to the absolute limit of the system, and it then becomes very maths friendly.

The 150MHz clock is exactly the sort of info I need; I was totally unaware of that aspect. Yes, I would have found it eventually, but knowing at the outset saves a bit of time. 6.7nsec resolution would be fine in this context I think, and a real step forward from where I was. The propagation delays in the hardware are fine too: as long as they come out as a fixed number of clock cycles, which they should, they can be trimmed out as a constant in the software.

Thanks to all of you for the thoughts. That starts me off with a few bits of info I didn't have before. I'll keep this thread posted with progress. And if anyone else has any other ideas I would still be interested in hearing them.
 
So 5 ns max error after 1 ms is 5 ppm error, and 5 ns after 1s is 5 ppb. The crystal on the Teensy has a 15ppm specification. If you measure the crystal speed against a high-resolution time reference you can compensate the error down to, in my tests, around 50ppb, as long as the room temperature stays constant, the time since calibration is short, and the researchers stay still without bringing any hot coffee close. The processor oscillator makes fairly large jumps on shutdown and startup, so calibration should be performed after the oscillator and crystal have had time to settle.
 
Yes mlu, a procedure for deriving a compensation factor was something I also wrote into the code. It was calculated just as you say, by measuring a fixed reference time and inserting its actual value as measured by the high-res seismic unit. The unit stored that compensation factor until another calibration was performed, and used it to pre-correct all future delay values it was asked to simulate.
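
In code terms it boiled down to something like this simplified sketch (my own names, not the actual listing): the stored factor scales every requested delay before it becomes a tick count.

Code:
// Hypothetical illustration of the pre-correction step described above.
// 'measuredTime' is what the high-res reference says the unit actually
// produced when asked for 'nominalTime' (same units for both).
double calFactor = 1.0;   // stored after each calibration run

void calibrate(double nominalTime, double measuredTime) {
  // < 1 if the crystal runs slow (the measured delay comes out long)
  calFactor = nominalTime / measuredTime;
}

uint32_t correctedTicks(uint32_t requestedTicks) {
  // Scale the requested delay so the slightly-off crystal lands on time.
  return (uint32_t)((double)requestedTicks * calFactor + 0.5);
}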

The inaccuracies of all types are always going to be under the influence of temperature of course. I never got round to any attempts to stabilise the ambient temperature around the crystal. The unit had only this one simple task to perform so had no off and on heavy duty loads to drive for example which would have meant increased dissipation. It was built into a simple diecast aluminium box which acted as a heatsink to the outside for the internal environment. It was always used in pretty much the same location in a controlled workshop so temperature stayed fairly constant. It really did perform well overall for so simple a build. And it was ultra-cheap!
 
You'll probably need to use one of the GPT timers, since they're 32 bits. Most of the other timers are 16 bits, which isn't enough if you want single-cycle precision at the peripheral clock (150 MHz) all the way out to 1 second.

The GPT timers are documented in chapter 52 of the reference manual, starting on page 2945.

Since the minimum delay is 1ms (your original message gave me the impression you wanted a fixed 20ns delay), you should have plenty of time for an interrupt. So I'd recommend configuring the timer as a free-running counter with the full 32-bit range and using the input capture feature to detect the input pulse. Then in the interrupt, you read the captured timer count where the input change was detected, add your delay amount (modulo 32 bits, just like the counter rolls over when it reaches the max) and configure the output compare to change the output signal when the timer reaches that count. As long as the interrupt is serviced within 1ms, you'll set up the output compare before the timer reaches that count, and the output will be changed automatically by the timer at the precise moment.

The input capture probably does delay a couple of cycles to sync to the peripheral clock (to avoid flip-flop metastability), so you might need to adjust the delay number by a couple of cycles to compensate.
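
Here's a rough sketch of the capture-then-compare arithmetic, assuming GPT1 and the register/macro names from the i.MX RT1062 reference manual as exposed by the Teensy core headers; clock gating, pin muxing and the rest of the GPT configuration are omitted, so treat it as the shape of the idea rather than working code.

Code:
// Sketch only: GPT1 free-running at 150 MHz, input capture channel 1,
// output compare channel 1. Register/macro names assumed from imxrt.h.
volatile uint32_t delayTicks = 150000;  // e.g. 1 ms at 150 MHz

void gpt1_isr() {
  if (GPT1_SR & GPT_SR_IF1) {           // input capture event?
    uint32_t captured = GPT1_ICR1;      // counter value latched at the input edge
    GPT1_OCR1 = captured + delayTicks;  // 32-bit wrap matches the counter rollover
    GPT1_SR = GPT_SR_IF1;               // clear the capture flag
    // From here the compare unit drives the output pin at captured + delayTicks,
    // with no software in the timing path.
  }
}

// In setup(), after configuring GPT1 itself (omitted here):
//   attachInterruptVector(IRQ_GPT1, gpt1_isr);
//   NVIC_ENABLE_IRQ(IRQ_GPT1);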
 
Just for fun, and because I'm curious, I gave it a quick try the software-only way, using a busy loop that polls the ARM cycle counter to implement the 1ms delay.

Here's what my oscilloscope sees for the output waveform, on a 20ns/div scale delayed exactly 1ms after the input signal (or as exactly as my oscilloscope's time base is). This was captured with the rendering set to 10 seconds persistence, so we can see where the outputs are all falling without needing to record a video.

[oscilloscope screenshot: output edges at 20ns/div with 10 seconds persistence]

So it looks like there's about a 60ns extra delay from the software (the rightward shift from exactly the center of the screen), which you could compensate for by just subtracting 36 from the delay number (36 cycles at 600 MHz is 60ns). Then all the outputs fall within about 18ns. That's with interrupts disabled. I don't know exactly what causes this variability, but it could be some combination of the M7's 6-stage pipeline and how the input circuitry captures and synchronizes the incoming signal to the chip's clocks.

That's not as tight as the GPT input capture & output compare might do, perhaps just 1 or 2 cycles of the 150 MHz peripheral clock. But it's close, using only very simple code rather than a deep dive into configuring the timer hardware.

Code:
void setup() {
  analogWriteFrequency(2, 4);     // create 4 Hz test pulses
  analogWrite(2, 64);
  pinMode(4, INPUT);              // short pin 2 to pin 4
  pinMode(6, OUTPUT);
  digitalWriteFast(6, LOW);
  noInterrupts();
}

void loop() {
  while (digitalReadFast(4) == LOW) ;      // wait for the input rising edge
  uint32_t n = ARM_DWT_CYCCNT;             // snapshot the CPU cycle counter
  while (1) {
    uint32_t elapsed = ARM_DWT_CYCCNT - n;
    if (elapsed >= 600000) break;          // 600,000 cycles at 600 MHz = 1 ms
  }
  digitalWriteFast(6, HIGH);               // 500 ns output pulse
  delayNanoseconds(500);
  digitalWriteFast(6, LOW);
  while (digitalReadFast(4) == HIGH) ;     // wait for the input to go low again
}
 
Yes, 32-bit timers would be the way to go. In the Arduino setup there was no capability to deal with more than 16 bits, so I had to write arithmetic functions to jig this up to cope with the range I needed, and take real care with rounding errors in the arithmetic. It was interesting! It was pretty raw and not optimised in any way, as it was only needed to work out my three loop counts (single start count, number of full loops, single end count) from the requested delay, once at the outset and whenever the delay was changed. For something like a "simulated shot" delay performed at most once every 4 secs, that was all it needed to be.

As I said, this process is up and working fully on the Arduino, so I'm familiar with my general interrupt approach. The first job is to get to grips with how the T_4.0 handles those same steps and where things differ. My guess is that the process will remain pretty much the same, as there is no complication from other tasks or interrupts clouding the issue. I was starting with no interrupts enabled, then enabling only the one I needed for the particular stage of the delay, and steering the action with logical variables set by each interrupt's routine to reset the active interrupts as required and move on to the next step. As every delay setting went through the same three basic phases, any run should generate the same latency, which as we have said can be factored out. I think I may even be able to simplify the code I had to produce an output pulse of specified length on completion. Your test setup is a big help in seeing the similarities in the new environment; it shows how good the results can be with only the simplest approach. Thanks for the code, it is a fantastic starting point for me.
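
Roughly, the three-phase flow I have in mind looks something like this (a simplified sketch, not the real code; names are mine):

Code:
// Each timer interrupt advances the phase; the main logic re-arms only the
// interrupt needed for the next stage and fires the 'completed' pulse at DONE.
enum Phase { IDLE, START_HALF_LOOP, FULL_LOOPS, RESIDUAL, DONE };
volatile Phase phase = IDLE;
volatile uint32_t loopsRemaining = 0;

void onTimerEvent() {               // called from the overflow/compare ISR
  switch (phase) {
    case START_HALF_LOOP:           // the initial half loop has finished
      phase = FULL_LOOPS;
      break;
    case FULL_LOOPS:                // one full overflow has elapsed
      if (--loopsRemaining == 0) phase = RESIDUAL;
      break;
    case RESIDUAL:                  // the final partial count has been reached
      phase = DONE;
      break;
    default:
      break;
  }
}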
 
I am certainly not an expert on these things, but I have done a deep (theoretical) dive on MCU-based timing before, and collected some (other people's) ideas and points along the way. Some points from my (admittedly poor) memory to check/consider:

  • Some MCUs/CPUs have small timing quirks depending on which order, and when, certain instructions are run; even simpler architectures can have these. It may be useful to try a different order of instructions, or different instructions, to reach a seemingly identical end result. (Even the address alignment of instructions might matter in some cases.)
  • In one case, apparently the only way to reach single-instruction/single-clock precision in pin-change timing was to finish the last few moments with a selection of different NOP sequences; the timers and/or interrupts had too much other logic-level variability, or were too slow, and an assembly loop couldn't be made to work at single-cycle resolution. So: trigger a little while before the correct time, read the counter (or otherwise know exactly how many cycles are still left to wait), use a loop for the coarse wait, then select the correct final delay chain, where every chain ends with the desired pin flip (see the sketch after this list).
  • Small counters are not a problem if their timings are predictable/tight, there are enough of them, and they can be chained, e.g. connecting one timer's compare output (or terminal count etc.) to another timer's count input (or equivalent). I was planning a variant of this myself on AVRs where I was running out of timers, and one of the solutions was to chain an 8-bit timer to a 16-bit timer. On the (second-to-?)last count of the 16-bit timer it would trigger a process that disables interrupts, loads the "remainder" value into the 8-bit counter's compare register, and waits for that last round of the 8-bit counter to finish (which produces the first part of the actual output pulse), listening for that state change on another input; once it is seen, do some cleanup and prepare for the next loooong wait.
  • This next one is something I was thinking about myself; no idea if it works well or not, as I haven't tried it yet. I was planning to do sub-clock-level timing by using a fast, low-jitter analog comparator, comparing against an MCU-adjustable voltage level (e.g. a heavily filtered PWM output, an I2C DAC, an MCU analog output, whatever), and a slightly slowed-down but more predictable ramp-up of the timer's output edge, using a known/fixed tiny-valued RC filter or similar. As the output slews at a reasonably well-defined rate, the analog "trip point" level defines a sub-clock-resolution delay. Not perfect, but way better than a single clock period. This turned into a rather desperate hunt for finding out how much timing jitter there is on the digital outputs of MCUs; if that is too high, this analog comparator trick will be useless. (In any case, the jitter is likely well below one clock period, but how far below..) The idea grew from the limitation of the first MCU/design choices running the timers at e.g. 8MHz max when the desired resolution would have needed something like 100MHz. It later evolved to have a separate sync of the MCU output with the (much better quality) master clock before feeding the analog comparator.. etc. etc. heading down the rabbit hole..
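
A toy version of the final-delay-chain idea from the second bullet might look like this (entirely hypothetical, AVR-flavoured, just to show the shape of it):

Code:
#include <avr/io.h>

// After the coarse wait, burn the last few cycles with a chain of single-cycle
// NOPs selected by the remaining count, so the pin flip lands on a fixed edge.
static inline void flipPinAfter(uint8_t cyclesLeft) {
  switch (cyclesLeft) {            // deliberate fall-through: 3 -> 2 -> 1 NOPs
    case 3: asm volatile("nop");
    case 2: asm volatile("nop");
    case 1: asm volatile("nop");
    default: break;
  }
  PORTB |= _BV(PORTB0);            // the edge we actually care about (compiles to an SBI)
}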

My designs/ideas did have the benefit that the MCUs were to be driven by an ultra-stable (OCXO magic etc.) and precise (DDS magic) clock, and the timer stuff was needed "only" for getting different (accurate) periods, and more accurate phases to adjust away delay differences in other circuitry, etc. Gotta hate them Christmas tree LEDs blinking out of sync...
 