Teensy 3.6: speed of DigitalReadFast?

Status
Not open for further replies.

XFer

Well-known member
Hello,

how fast is digitalReadFast on a Teensy 3.6? And, how is it related to CPU speed (like, is it 33% faster at cpu clock=240 MHz than at cpu clock=180 MHz?)

Thanks!

Fernando
 
how fast is digitalReadFast on a Teensy 3.6?

It depends on surrounding code. It compiles to a single ST (store) instruction which uses 2 registers: one for the address of the I/O register and one for the data to write.

How long it takes to load values into those 2 registers can vary, because ARM doesn't have a single instruction that loads a full 32 bit immediate value into a register. So it might do a LD instruction (addressed relative to the program counter), or it might use 1 or 2 other instructions and tricks to get the values into those registers.

If an extra LD instruction is used, it might cause a cache miss, adding 4 extra cycles access time. Likewise, if the code itself isn't yet in the flash controller's 256 byte cache or the memory controller's 8K cache, flash wait states come into play. But the flash is very wide (can't recall if it's 128 or 256 bits) so those don't apply to every fetch. If you really want maximum speed, use FASTRUN so your code runs from the RAM which is all single cycle access.

The compiler optimizer will attempt to pre-load those registers. But again, the results can vary quite a lot depending on other details of your code. If the compiler decides using the registers in other ways would make your code run faster overall, it will consider registers holding constants to be expendable. It can always just put the constant back into the register when needed. Or it can do this outside of one loop, but inside another. The compiler makes a lot of very complex decisions about how to best utilize the registers.

Even the speed of the ST (store to memory) instruction can vary. Normally it takes 2 cycles. But if you do 2 or more "related" load or store instructions in an unbroken sequence, the ARM hardware will apply a special pipeline optimization to do all but the first using only 1 cycle. So if you have 10 digitalWriteFast() lines in a row, there's a very strong possibility the compiler will pre-load the registers in other code and then they all execute in only 11 cycles.

If you have 10 of them with other stuff mixed between that puts a lot of "register pressure" on the compiler's optimizer and you're running from flash with cache misses, those 10 digitalWriteFast() could take 100+ cycles.
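The best-case figure above (2 cycles for the first store, 1 for each one that follows) can be written out as a small calculation. This is only a sketch of that arithmetic; the function names are made up for illustration, and it assumes the address and data registers are already loaded and there are no cache misses.

```cpp
#include <cstdint>

// Best-case cycle count for n back-to-back stores, using the figures quoted
// above: 2 cycles for the first store, 1 cycle for each store after it.
// Hypothetical helper, not part of any Teensy library.
constexpr uint32_t bestCaseStoreCycles(uint32_t n) {
    return n == 0 ? 0 : 2 + (n - 1);
}

// Time in nanoseconds for a given cycle count at a CPU clock given in MHz.
constexpr double cyclesToNanoseconds(uint32_t cycles, double cpuMHz) {
    return cycles * 1000.0 / cpuMHz;
}
```

For 10 stores this gives the 11 cycles mentioned above, which is roughly 61 ns at 180 MHz. The 100+ cycle case is not modelled, since it depends on register pressure and cache behaviour.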


And, how is it related to CPU speed (like, is it 33% faster at cpu clock=240 MHz than at cpu clock=180 MHz?)

Yes, it's definitely related to the CPU speed. Overclocking makes it run faster.

However, something to consider is how fast the pin's voltage can actually change. By default, pinMode() configures the pin with slew rate limiting. Normally this is a good thing, because it greatly reduces radio emissions and other nasty high-speed signal quality problems if you use long unshielded wires without impedance matching (the norm for most projects). With slew rate limit, you can actually get the code to run faster than the actual voltage at the pin can change. To really use this speed, you need to write the pin config register to put it in fast mode... and be warned, such extremely fast signals do cause RF noise problems if not handled very well. Actually measuring such signals can be quite tricky. Even with a very good oscilloscope, careful attention to probes and ground wire lengths can be critical for a good measurement.
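As a rough illustration of "write the pin config register to put it in fast mode", here is the bit manipulation on its own, modelled on a plain integer so it can be checked off-target. The bit positions are my reading of the Kinetis pin control register layout (the Teensy core headers name them PORT_PCR_SRE, PORT_PCR_DSE and PORT_PCR_MUX); the constant and function names below are hypothetical, and the default value is an assumption about what pinMode() writes for an output.

```cpp
#include <cstdint>

// Bit fields of a Kinetis PORTx_PCRn pin control register (assumed layout).
constexpr uint32_t kSlewRateEnable = 1u << 2;  // SRE: 1 = slow (limited) slew rate
constexpr uint32_t kDriveStrength  = 1u << 6;  // DSE: 1 = high drive strength
constexpr uint32_t pcrMux(uint32_t n) { return (n & 7u) << 8; }  // MUX(1) = GPIO

// Assumed value written by pinMode(pin, OUTPUT): GPIO, high drive, slew limited.
constexpr uint32_t kDefaultOutputConfig = kSlewRateEnable | kDriveStrength | pcrMux(1);

// Clear only the slew-rate bit and leave every other field untouched.
constexpr uint32_t fastSlew(uint32_t pcr) {
    return pcr & ~kSlewRateEnable;
}
```

On the board this would presumably be applied after pinMode(), along the lines of `CORE_PIN13_CONFIG = fastSlew(CORE_PIN13_CONFIG);`, assuming CORE_PIN13_CONFIG is the core's alias for that pin's config register. Check the reference manual before relying on the bit positions.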

Here's a thread with some actual test results and optimization tips.

https://forum.pjrc.com/threads/4187...lator-example)?p=132363&viewfull=1#post132363
 
Here is a program that tests digitalWrite vs digitalWriteFast vs direct register access. You can edit it to test digitalRead and change your clock speed to get your answers. I found digitalWriteFast to be the same speed as direct register access.

Code:
/* LED Blink, Teensyduino Tutorial #1
   http://www.pjrc.com/teensy/tutorial.html
 
   This example code is in the public domain.
*/

// Teensy 2.0 has the LED on pin 11
// Teensy++ 2.0 has the LED on pin 6
// Teensy 3.x / Teensy LC have the LED on pin 13
const int ledPin = 13;
elapsedMicros time1;
float time3, time4, time2;


// the setup() method runs once, when the sketch starts

void setup() {
  // initialize the digital pin as an output.
  pinMode(ledPin, OUTPUT);
  Serial.begin(9600);
  
}

// the loop() method runs over and over again,
// as long as the board has power

void loop() {
long i;

  time1 = 0;
  for( i = 0; i < 10000000L; ++i ){
    digitalWrite(ledPin, HIGH);   // set the LED on
    digitalWrite(ledPin, LOW);   // set the LED off
  }
  time3 = time1;

  time1 = 0;
  for( i = 0; i < 10000000L; ++i ){
  // use the internal register name for bit set, bit clear
     GPIOC_PSOR =  ( 1 << 5 );
     GPIOC_PCOR =  ( 1 << 5 );
  }
  time4 = time1;

  // just for fun toggle the LED
  // changed to test digitalWriteFast
  time1 = 0;
  for( i = 0; i < 10000000L; ++i ){
    //  GPIOC_PTOR =  ( 1 << 5 );
    digitalWriteFast(ledPin, HIGH);   // set the LED on
    digitalWriteFast(ledPin, LOW);   // set the LED off

  }
  time2 = time1;

  Serial.println( "Teensy 3.6 running at 120 mhz");
  Serial.print( "Digital Write Test ");  Serial.print( time3/1000000.0,3 ); Serial.println(" seconds");
  Serial.print( " Set Clear Test    ");  Serial.print( time4/1000000.0,3 ); Serial.println(" seconds");
  Serial.print( "Digital Write Fast ");  Serial.print( time2/1000000.0,3 ); Serial.println(" seconds");
  Serial.println();
  
  delay(1000);                  
  
}
 
Follow up: I wrote this (very simple) benchmark

Code:
/*
 * Teensy 3.6 - DigitalRead benchmark
 *
 * Start 20181120 FC
 * Last edit 20181120 FC
 * 
 * TODO:
*/

// System include files
#include <stdint.h>


/**********************************************/
/***                 Read Pins              ***/
/**********************************************/
// This way, data pins are all from Port D
#define D9  5 
#define D8  21
#define D7  20
#define D6  6 
#define D5  8 
#define D4  7 
#define D3  14
#define D2  2 

#define N_CYCLES_TEST1 200000000UL
#define N_CYCLES_TEST2 200000000UL
#define N_CYCLES_TEST3 2000UL
#define DATA_BUFFER_SIZE 204800UL

// Various
#define SERIAL_SPEED 115200

volatile uint8_t dataBuffer[DATA_BUFFER_SIZE];
volatile uint8_t portData;


void yield()
{	// We override yield() to disable serial events check, gaining a bit of CPU cycles

}

FASTRUN void test1()
{
	for (uint32_t iCycle = 0; iCycle < N_CYCLES_TEST1; iCycle++)
	{
		portData = digitalReadFast(D2);
	}
}

FASTRUN void test2()
{
	for (uint32_t iCycle = 0; iCycle < N_CYCLES_TEST2; iCycle++)
	{
		portData = (GPIOD_PDIR & 0xFF);	// Read all pins of Port D at a time
	}
}

FASTRUN void test3()
{
	for (uint32_t iCycle = 0; iCycle < N_CYCLES_TEST3; iCycle++)
	{
		for (uint32_t i = 0; i < DATA_BUFFER_SIZE; i++)
		{
			dataBuffer[i] = (GPIOD_PDIR & 0xFF);	// Read all pins of Port D at a time
		}
	}
}


/**********************************************************/
/***                    Program Setup                   ***/
/**********************************************************/

void setup() 
{
  uint32_t startms, elapsedms;
  float fNanos;

  pinMode(D2, INPUT);
  pinMode(D3, INPUT);
  pinMode(D4, INPUT);
  pinMode(D5, INPUT);
  pinMode(D6, INPUT);
  pinMode(D7, INPUT);
  pinMode(D8, INPUT);
  pinMode(D9, INPUT);

  memset ((void *)dataBuffer, 128, DATA_BUFFER_SIZE);
  Serial.begin(SERIAL_SPEED);
  while(!Serial);
  
  Serial.println("\nTeensy DigitalRead benchmark\n");

  Serial.println("Benchmarking digitalReadFast... please wait...\n");

  startms = millis();
  test1();
  elapsedms = millis() - startms;
  fNanos = 1000000.0f * elapsedms / N_CYCLES_TEST1;

  Serial.print("DigitalReadFast took "); Serial.print(fNanos, 1); Serial.println(" ns\n");

  Serial.println("Benchmarking 8-bit port read... please wait...\n");

  startms = millis();
  test2();
  elapsedms = millis() - startms;
  fNanos = 1000000.0f * elapsedms / N_CYCLES_TEST2;

  Serial.print("Whole 8-bit port read took "); Serial.print(fNanos, 1); Serial.println(" ns\n");

  Serial.println("Benchmarking 8-bit port read and store in buffer... please wait...\n");

  startms = millis();
  test3();
  elapsedms = millis() - startms;
  fNanos = 1000000.0f * elapsedms / (N_CYCLES_TEST3 * DATA_BUFFER_SIZE);

  Serial.print("Whole 8-bit port read and store in buffer took "); Serial.print(fNanos, 1); Serial.println(" ns\n");
}

/**********************************************************/
/***                    Program Loop                    ***/
/**********************************************************/

void loop() {	
  delay(10000);
}

Impressive results:

Code:
T3.6 @ 180 MHz Fastest + Pure Code , Arduino 1.8.7 + TD 1.44
============================================================
Teensy DigitalRead benchmark

Benchmarking digitalReadFast... please wait...

DigitalReadFast took 44.5 ns

Benchmarking 8-bit port read... please wait...

Whole 8-bit port read took 44.5 ns

Benchmarking 8-bit port read and store in buffer... please wait...

Whole 8-bit port read and store in buffer took 55.6 ns



T3.6 @ 240 MHz Fastest + Pure Code , Arduino 1.8.7 + TD 1.44
============================================================
Teensy DigitalRead benchmark

Benchmarking digitalReadFast... please wait...

DigitalReadFast took 33.4 ns

Benchmarking 8-bit port read... please wait...

Whole 8-bit port read took 33.4 ns

Benchmarking 8-bit port read and store in buffer... please wait...

Whole 8-bit port read and store in buffer took 41.7 ns

Teensy 3.6 is a speed demon. ;)

Thanks Paul!
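To relate these timings back to the CPU clock question, the measured nanoseconds per loop iteration can be converted into CPU cycles. This is just the arithmetic, with a hypothetical helper name; note the figure covers the whole loop iteration, not only the read.

```cpp
// Convert a measured time per loop iteration into CPU cycles.
// cycles = ns * MHz / 1000, because 1 MHz is one cycle per 1000 ns.
constexpr double nanosecondsToCycles(double ns, double cpuMHz) {
    return ns * cpuMHz / 1000.0;
}

// Plugging in the results above:
//   44.5 ns at 180 MHz and 33.4 ns at 240 MHz are both about 8 cycles.
//   55.6 ns at 180 MHz and 41.7 ns at 240 MHz are both about 10 cycles.
```

The cycle counts are the same at both clock speeds, so in this benchmark (running from RAM with FASTRUN) the time per iteration scales directly with the CPU clock.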
 
You do nothing with the values, at least in test1 and test2. Actually using the values would cost several times more CPU time, so the read itself is negligible.
Also, the surrounding loop (for....) takes a lot of time in this context (more than the read).

You can trust us when we say digitalReadFast is fast, and that the basic operation needs one cycle if used with constants. That is all the information needed(?)
I know many people do this kind of benchmark. Nobody has been able to explain to me why. There is no practical relevance if the read value is not used meaningfully.

A digitalWriteFast in a loop produces an (uncontrollable) square wave, at least. Not much more reasonable, but at least there is an output.
 
I do use the values in my full code (videocamera interfacing); what I cared about was how fast digitalReadFast is per se.
For me, the most important test is #3 (videocamera buffering), and the speed I measured with this benchmark is perfectly consistent with the results I'm seeing in the real world (storing pixels from the camera up to PCLK = 6 MHz), which is impressive.
 
Maybe, but the read takes only a very small fraction of the time. Even the loop code takes more time :)

If you want to optimize it, you can unroll the loop. Or do 4 fast reads into different variables, shift the values into a uint32_t and save 3 memory writes this way. Maybe this is a bit faster, too. And if your array is uint32_t, it will be aligned automatically. Don't use uint8_t for this kind of loop.
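A minimal sketch of the "4 reads, 1 write" idea, written so it compiles off-target. The names pack4 and captureWords are made up for illustration, and the read is passed in as a callable because the port register is not available outside the Teensy; on the board it would wrap something like (GPIOD_PDIR & 0xFF). Whether it is actually faster would need measuring.

```cpp
#include <cstdint>

// Pack four 8-bit samples into one 32-bit word, first sample in the low byte.
constexpr uint32_t pack4(uint8_t s0, uint8_t s1, uint8_t s2, uint8_t s3) {
    return  static_cast<uint32_t>(s0)
         | (static_cast<uint32_t>(s1) << 8)
         | (static_cast<uint32_t>(s2) << 16)
         | (static_cast<uint32_t>(s3) << 24);
}

// Fill a word buffer from a sample source: four reads, then one 32-bit store.
// ReadFn is any callable returning the current 8-bit sample.
template <typename ReadFn>
void captureWords(uint32_t *buffer, uint32_t wordCount, ReadFn readSample) {
    for (uint32_t i = 0; i < wordCount; ++i) {
        const uint8_t s0 = readSample();
        const uint8_t s1 = readSample();
        const uint8_t s2 = readSample();
        const uint8_t s3 = readSample();
        buffer[i] = pack4(s0, s1, s2, s3);
    }
}
```

With the first sample in the low byte, the buffer has the same byte order in memory as a uint8_t array on a little-endian core such as the Cortex-M4, so code that later walks it byte by byte should see the samples in capture order.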

Edit: And last but not least, you should disable interrupts, since they add unpredictable "jitter". I bet there are USB interrupts and systicks during your loop. Of course, millis()/micros() will not work anymore. Use the ARM cycle counter instead.
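For timing with the cycle counter instead of millis(), the portable part is the arithmetic on two counter readings, sketched below with hypothetical helper names. The register names in the trailing comment are how I recall them from the Teensy core headers, so treat them as an assumption and check them before use.

```cpp
#include <cstdint>

// Elapsed cycles between two readings of a free-running 32-bit counter.
// Unsigned subtraction wraps modulo 2^32, so the result is still correct
// if the counter overflowed once between the two readings.
constexpr uint32_t elapsedCycles(uint32_t start, uint32_t end) {
    return end - start;
}

// Average nanoseconds per iteration of a timed loop at a CPU clock in MHz.
constexpr double nsPerIteration(uint32_t cycles, uint32_t iterations, double cpuMHz) {
    return (cycles * 1000.0 / cpuMHz) / iterations;
}

// On the Teensy 3.6 the readings would come from the DWT cycle counter,
// roughly like this (assumed register names):
//   ARM_DEMCR |= ARM_DEMCR_TRCENA;           // enable the trace unit
//   ARM_DWT_CTRL |= ARM_DWT_CTRL_CYCCNTENA;  // start the cycle counter
//   noInterrupts();
//   uint32_t start = ARM_DWT_CYCCNT;
//   /* code under test */
//   uint32_t end = ARM_DWT_CYCCNT;
//   interrupts();
```

A 32-bit counter wraps after about 23.9 seconds at 180 MHz, so a timed section has to be shorter than that for a single subtraction to be valid.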
 
See Frank, I know you mean well, but believe me, my full code is already quite tuned. :)
I just needed to verify how fast those specific calls were. They are PLENTY fast, so I'm very happy. :)
 