Teensy 4.1 : why my own GPIO functions are 10 times faster? ITCM vs. external flash?

tjaekel

I tried to measure the maximum speed at which I can toggle a GPIO pin (full speed, endless loop).
The strange thing is:
  • if I use the LIB functions, it is 10 times SLOWER
  • if I implement my own, similar function, it is more than 10 times FASTER

Here is the code I use to test (just the loop; configure the pin as output first):

Code:
/* helper function to set GPIO Output register before configuring mode */

void GPIO_setOutValue(uint8_t pin, uint8_t val)
{
	const struct digital_pin_bitband_and_config_table_struct *p;
	uint32_t mask;

	if (pin >= CORE_NUM_DIGITAL) return;
	p = digital_pin_to_info_PGM + pin;
	mask = p->mask;
	// pin is configured for output mode
	if (val) {
		*(p->reg + 0x21) = mask; // set register
	} else {
		*(p->reg + 0x22) = mask; // clear register
	}
}

void GPIO_testSpeed(void) {
#if 1
  /* this is 10x faster! assumption: this code runs from ITCM */
  while (1) {
    GPIO_setOutValue(32, HIGH);
    GPIO_setOutValue(32, LOW);
  }
#else
  /* this is 10x slower! assumption: the function sits in external flash (and is not cached or not running at full speed) */
  while (1) {
    digitalWrite(32, HIGH);
    digitalWrite(32, LOW);
  }
#endif
}

The results I get, with nominal 600 MHz MCU core clock speed:
Code:
LIB code        My code
-------------------------
11.11 MHz       149.7 MHz

BTW: when overclocking the MCU, e.g. to 800 MHz, I get a 201 MHz GPIO toggling frequency.
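
As a rough sanity check, the toggle frequencies can be converted into core clock cycles per output period (one HIGH write plus one LOW write). This is only a minimal sketch of the arithmetic; the helper names are made up for illustration. With the numbers above it gives about 4 cycles per period for my function and about 54 cycles per period for the LIB function, and the 800 MHz figure also works out to about 4 cycles.

```cpp
#include <cstdio>

// Core clock cycles spent per full output period
// (one HIGH write plus one LOW write plus the loop branch).
double cyclesPerPeriod(double coreHz, double toggleHz) {
  return coreHz / toggleHz;
}

// Hypothetical helper: prints the cycle budget for the figures quoted in this post.
void printCycleBudget() {
  std::printf("my code,  600 MHz core: %.1f cycles\n", cyclesPerPeriod(600e6, 149.7e6));
  std::printf("LIB code, 600 MHz core: %.1f cycles\n", cyclesPerPeriod(600e6, 11.11e6));
  std::printf("my code,  800 MHz core: %.1f cycles\n", cyclesPerPeriod(800e6, 201e6));
}
```

If the cycle count per period stays the same when the core clock changes, the loop is probably limited by the core itself rather than by a fixed-speed external memory.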

This is a huge difference! (The GPIO function in my own project file is far faster.)

My assumption
The only explanation I can come up with:
The function "digitalWrite()" used to toggle a GPIO pin comes from the LIB.
My function "GPIO_setOutValue()", part of my project (a source file in my sketch), is very similar and not really different in terms of the instructions executed.
But the difference could be:
  • all the LIB functions sit in external flash and are executed from there
  • my sketch code, i.e. the code in my own source files, is copied from flash to ITCM and then executed from ITCM (no latency)

I have no idea how to confirm this (OK, I could look at the generated *.MAP or *.LST file).
I know the MCU has just a very tiny internal flash, mainly used for the bootloader (and I assume the internal flash is never overwritten). All the code sits in an external flash memory device: the bootloader executes from there or loads some pieces of code into the internal ITCM (which runs at the fastest speed).
I think I found a statement like "the code of your sketch is loaded into ITCM", but it could just as well mean "the code called from the sketch, in LIBs, is still located in the slow external flash memory".

But which code is loaded into ITCM, and which is instead executed via the very slow external interface (to the external flash ROM)? No idea.
How can I control which code gets loaded into ITCM and executed from there (including code from a LIB)?
(I see different code addresses in the generated files, so part of the code is internal (copied) and another part is still external (fetched via the slow interface).)
Or, worst case: the ICache is not enabled or not configured (MPU) for code executed from external flash memory. (How can I confirm that the ICache is enabled for external code locations?)

Any idea how to control the use of internal ITCM vs. external flash memory (e.g. using __attribute__(()))?
Which code coming from a LIB is still external?
Are the ICache and DCache enabled (configured via the MPU) for external code/data locations?

Why this dramatic speed difference when toggling a GPIO pin with two similar functions?
 
When you call your function the compiler is unrolling and inlining because the parameters are known (and it's all part of the same compilation module).
When you call the library function it isn't unrolled and has to take branches/calculate addresses based on the parameters. This is why there are digitalRead/WriteFast functions defined in a header file, for use with constant pin numbers.
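
To illustrate the point about inlining, here is a minimal host-side sketch. Everything in it is hypothetical: FakePort, pinTable, writeInline and writeOutOfLine are stand-ins using plain RAM, not the Teensy core's real registers or functions. The idea is that a static inline function called with constant arguments can be reduced by the optimizer to a single store, while a function the compiler cannot see into has to do the range check, table lookup and branch at run time.

```cpp
#include <cstdint>

// Hypothetical stand-in for a GPIO port's set/clear registers (plain RAM, not hardware).
struct FakePort {
  volatile uint32_t set;
  volatile uint32_t clear;
};

static FakePort fakePortA, fakePortB;

struct PinEntry {
  FakePort *port;
  uint32_t mask;
};

// Hypothetical pin table, playing the role of digital_pin_to_info_PGM.
static const PinEntry pinTable[] = {
  {&fakePortA, 1u << 0},
  {&fakePortA, 1u << 1},
  {&fakePortB, 1u << 0},
};
static const uint8_t kNumPins = sizeof(pinTable) / sizeof(pinTable[0]);

// Header-style function: when pin and val are compile-time constants, the
// optimizer can remove the range check, the lookup and the branch.
static inline void writeInline(uint8_t pin, uint8_t val) {
  if (pin >= kNumPins) return;
  const PinEntry *p = &pinTable[pin];
  if (val) p->port->set = p->mask;
  else     p->port->clear = p->mask;
}

// Same logic, but forced out of line (GCC/Clang attribute) to mimic a library
// function in another compile unit: every call does the full work.
__attribute__((noinline)) void writeOutOfLine(uint8_t pin, uint8_t val) {
  if (pin >= kNumPins) return;
  const PinEntry *p = &pinTable[pin];
  if (val) p->port->set = p->mask;
  else     p->port->clear = p->mask;
}
```

Both functions behave identically; only the amount of work left at run time differs, which is what the generated assembly would show.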
 
As far as the code below shows, the post #1 code is much faster than digitalWrite with a variable pin and about as fast as digitalWriteFast with a constant pin; with a variable pin it takes only about 1.5 times longer (times in ms for 10M loop iterations):
Code:
setOutValue 66  <<10M 
setOutValue 100 << Var Random Pin 31 10M
Write 700  <<10M 
WriteFast 67  <<10M

Code:
/* helper function to set GPIO Output register before configuring mode */
int jj=30;

void GPIO_setOutValue(uint8_t pin, uint8_t val) {
  const struct digital_pin_bitband_and_config_table_struct *p;
  uint32_t mask;
  if (pin >= CORE_NUM_DIGITAL) return;
  p = digital_pin_to_info_PGM + pin;
  mask = p->mask;
  // pin is configured for output mode
  if (val) {
    *(p->reg + 0x21) = mask;  // set register
  } else {
    *(p->reg + 0x22) = mask;  // clear register
  }
}

elapsedMillis aT;
void GPIO_testSpeed(void) {
  /* this is 10x faster! assuming, this code runs on ITCM */
  int ii;
  uint32_t iT;
  ii = 0;
  aT = 0;
  while (ii++ < 10000000) {
    GPIO_setOutValue(32, HIGH);
    GPIO_setOutValue(32, LOW);
  }
  iT = aT;
  Serial.printf("setOutValue %lu  <<10M \n", iT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < 10000000) {
    GPIO_setOutValue(jj, HIGH);
    GPIO_setOutValue(jj, LOW);
  }
  iT = aT;
  Serial.printf("setOutValue %lu << Var Random Pin %d 10M\n", iT, jj);
  delay(400);

  /* this is 10x slower! assuming the function sits on external flash (and is not cached or running full speed) */
  ii = 0;
  aT = 0;
  while (ii++ < 10000000) {
    digitalWrite(jj, HIGH);
    digitalWrite(jj, LOW);
  }
  iT = aT;
  Serial.printf("Write %lu  <<10M \n", iT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < 10000000) {
    digitalWriteFast(32, HIGH);
    digitalWriteFast(32, LOW);
  }
  iT = aT;
  Serial.printf("WriteFast %lu  <<10M \n", iT);
  delay(400);
}

void setup() {
  pinMode(32, OUTPUT);
  jj=30;
  while (jj == 30)
    jj = random(5) + 30;
  pinMode(jj, OUTPUT);
  pinMode(13, OUTPUT);
  digitalWriteFast(13, HIGH);
  Serial.begin(22);
  while (!Serial)
    ;
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  delay(400);
  digitalWriteFast(13, LOW);
  GPIO_testSpeed();
}

void loop() {
  // put your main code here, to run repeatedly:
}

<edit> Forgot to mention all user code by default runs from RAM1 ITCM after copy from flash on reset.

<edit> using above code with this as loop() will repeat the test and the pin does change with the same results:
Code:
void loop() {
  jj=30;
  while (jj == 30)
    jj = random(5) + 30;
  pinMode(jj, OUTPUT);
  digitalWriteFast(13, HIGH);
  delay(4000);
  digitalWriteFast(13, LOW);
  GPIO_testSpeed();
}
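
Regarding the note above that user code runs from ITCM by default: placement can also be requested per function. This is a minimal sketch, assuming the Teensy 4.x core macros FASTRUN (copy the function to ITCM) and FLASHMEM (leave it in flash). The function names are hypothetical, and the empty fallback definitions exist only so the snippet also compiles off-target.

```cpp
#include <cstdint>

// Pull in the Teensy core macros when building for the board.
#if defined(__has_include)
#  if __has_include(<Arduino.h>)
#    include <Arduino.h>
#  endif
#endif

// Fallbacks for a host build (assumption: harmless no-ops off-target).
#ifndef FASTRUN
#define FASTRUN
#endif
#ifndef FLASHMEM
#define FLASHMEM
#endif

// Hypothetical time-critical function: request execution from ITCM.
FASTRUN uint32_t hotPath(uint32_t x) {
  return x * 2u + 1u;
}

// Hypothetical rarely used function: keep it in flash to save ITCM space.
FLASHMEM uint32_t coldSetup(uint32_t x) {
  return x + 100u;
}
```

The resulting addresses in the .lst or .map file should show which region each function actually ended up in.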
 
Playing with a new machine (some loader anomaly). Edited the code above to include digitalWriteFast with a random pin.
Uniform output function, and timing in micros rather than millis.
It shows digitalWriteFast() to be the same or a better solution, whether the pin number is constant or random:
Code:
	TIMING IN MICROS()
us=66670 << setOutValue const PIN  32 itr=10000000 
us=100005 << setOutValue random PIN  31 itr=10000000 
us=683367 << digitalWrite random PIN  31 itr=10000000 
us=66670 << digitalWriteFast const PIN  32 itr=10000000 
us=66670 << digitalWriteFast random PIN  31 itr=10000000 

	TIMING IN MICROS()
us=66670 << setOutValue const PIN  32 itr=10000000 
us=100005 << setOutValue random PIN  33 itr=10000000 
us=683367 << digitalWrite random PIN  33 itr=10000000 
us=66670 << digitalWriteFast const PIN  32 itr=10000000 
us=66670 << digitalWriteFast random PIN  33 itr=10000000

Code:
/* helper function to set GPIO Output register before configuring mode */
int jj = 30;

void GPIO_setOutValue(uint8_t pin, uint8_t val) {
  const struct digital_pin_bitband_and_config_table_struct *p;
  uint32_t mask;
  if (pin >= CORE_NUM_DIGITAL) return;
  p = digital_pin_to_info_PGM + pin;
  mask = p->mask;
  // pin is configured for output mode
  if (val) {
    *(p->reg + 0x21) = mask;  // set register
  } else {
    *(p->reg + 0x22) = mask;  // clear register
  }
}

void showTime( const char * ss, uint32_t time, int pin, uint32_t cnt ) {
  // cnt is uint32_t, so print it with %lu like time
  Serial.printf("us=%lu << %s %d itr=%lu \n", time, ss, pin, cnt );
}

elapsedMicros aT;
#define LCNT 10000000
void GPIO_testSpeed(void) {
  Serial.printf("\tTIMING IN MICROS()\n");
  int ii;
  uint32_t iT;

  ii = 0;
  aT = 0;
  while (ii++ < LCNT) {
    GPIO_setOutValue(32, HIGH);
    GPIO_setOutValue(32, LOW);
  }
  iT = aT;
  showTime( "setOutValue const PIN ", iT, 32, LCNT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < LCNT) {
    GPIO_setOutValue(jj, HIGH);
    GPIO_setOutValue(jj, LOW);
  }
  iT = aT;
  showTime( "setOutValue random PIN ", iT, jj, LCNT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < LCNT) {
    digitalWrite(jj, HIGH);
    digitalWrite(jj, LOW);
  }
  iT = aT;
  showTime( "digitalWrite random PIN ", iT, jj, LCNT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < LCNT) {
    digitalWriteFast(32, HIGH);
    digitalWriteFast(32, LOW);
  }
  iT = aT;
  showTime( "digitalWriteFast const PIN ", iT, 32, LCNT);
  delay(400);

  ii = 0;
  aT = 0;
  while (ii++ < LCNT) {
    digitalWriteFast(jj, HIGH);
    digitalWriteFast(jj, LOW);
  }
  iT = aT;
  showTime( "digitalWriteFast random PIN ", iT, jj, LCNT);
  delay(400);
  Serial.printf("\n");
}

void setup() {
  pinMode(32, OUTPUT);
  pinMode(13, OUTPUT);
  digitalWriteFast(13, HIGH);
  Serial.begin(22);
  while (!Serial)
    ;
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  digitalWriteFast(13, LOW);
}

void loop() {
  jj = 30;
  while (jj == 30)
    jj = random(5) + 30;
  pinMode(jj, OUTPUT);
  GPIO_testSpeed();
  digitalWriteFast(13, HIGH);
  delay(4000);
  digitalWriteFast(13, LOW);
}
 
As Luni mentioned, compare against digitalWriteFast().

Regarding "why" for digitalWrite(), the very best but most tedious way is to look at the assembly code the compiler generated. In Arduino IDE, turn on verbose output during compile. Then look in the huge pile of commands for the temporary folder pathname where Arduino IDE actually compiled your code. Inside that directory you should find a .lst file.

Teensyduino 1.57 and older don't create the .lst file, because for some large code cases (Defragster came up with an excellent test case) the old compiler would spend 1 minute or more to create the .lst file. Use version 1.58 or later to get the .lst file automatically.

Now for guesswork based on some experience, I would imagine 2 things contribute to most of the speed difference.

1: The library functions are in another file (or "compile unit") so the compiler can't apply many optimizations. Some of this may not apply if you compile with LTO, which you can find in the Tools > Optimization menu if using 1.59 beta (in Arduino 2.1.0 Boards Manager, choose 0.59.2 to get the beta).

2: The library functions have some extra work to emulate compatibility with code written for AVR, specifically use of digitalWrite() to control the pullup resistor while in input mode. For years David Mellis (then the Arduino lead developer) resisted all requests to have INPUT_PULLUP, so quite a lot of legacy code still exists to this day, and boards like Teensy which strive for excellent compatibility have to emulate that AVR hardware quirk.
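
The second point can be sketched with a simplified host-side model. This is not the library's code: ModelPin, compatWrite and outputOnlyWrite are hypothetical names, and the real register layout is not modelled. It only shows the extra mode check an AVR-compatible write needs on every call, compared with an output-only write like the one in post #1.

```cpp
#include <cstdint>

// Hypothetical, simplified model of a single pin (not the real register layout).
struct ModelPin {
  bool isOutput = false;       // direction bit
  bool level = false;          // output latch
  bool pullupEnabled = false;  // pad pull-up setting
  unsigned registerReads = 0;  // counts reads of the direction register
};

// AVR-compatible behaviour: the pin mode is checked on every call.
void compatWrite(ModelPin &pin, bool val) {
  ++pin.registerReads;         // the direction register has to be read first
  if (pin.isOutput) {
    pin.level = val;           // output mode: drive the pin
  } else {
    pin.pullupEnabled = val;   // input mode: HIGH turns the pull-up on, LOW turns it off
  }
}

// Output-only behaviour, like GPIO_setOutValue() in post #1: no mode check.
void outputOnlyWrite(ModelPin &pin, bool val) {
  pin.level = val;
}
```

On real hardware the additional register read and branch may cost noticeably more than the write itself, but that is an assumption to verify in the .lst file.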
 
As Luni mentioned, compare against digitalWriteFast().
...

And hopefully demonstrated in the p#5 code.

Interesting that it was (mis)understood that digitalWriteFast only works with a constant pin number?

The code above suggests that a fixed or variable pin number runs at the same speed (with µs resolution and 10M iterations), whereas the seemingly equivalent version in p#1 slows down when given a variable pin number.
 
I can confirm: using digitalWriteFast() works. It reaches the same speed (a tiny bit faster, because the pin is written directly, with no IFs and checks...).

Why the speed was so different, I still have no clue:
I've checked the *.LST file: both functions seem to be in ITCM, my function at address 0x0059C and digitalWrite at address 0x26274 (so it is also in ITCM).
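
A small helper makes this address check explicit. It is only a sketch: the address windows are my understanding of the Teensy 4.1 memory map (ITCM at 0x00000000, DTCM at 0x20000000, on-chip RAM2 at 0x20200000, external flash from 0x60000000) and should be checked against the linker script. Region and regionOf are hypothetical names.

```cpp
#include <cstdint>

enum class Region { ITCM, DTCM, OCRAM, Flash, Other };

// Classify an address from the .lst/.map file (assumed address windows, see above).
Region regionOf(uint32_t addr) {
  if (addr < 0x00080000u) return Region::ITCM;                         // up to 512 KB at 0x00000000
  if (addr >= 0x20000000u && addr < 0x20080000u) return Region::DTCM;  // tightly coupled data RAM
  if (addr >= 0x20200000u && addr < 0x20280000u) return Region::OCRAM; // on-chip RAM2
  if (addr >= 0x60000000u && addr < 0x70000000u) return Region::Flash; // external flash window
  return Region::Other;
}
```

Under these assumptions both 0x0059C and 0x26274 classify as ITCM, while code executed in place from flash would show addresses starting at 0x60000000.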

Anyway, using digitalWriteFast() works fine (and fast as expected).
Thank you.
 
Another factor may be interrupt-safety - the standard library functions can be called from main code and an ISR for the same pin and all works as expected (on a range of different microcontroller architectures). To get this to work involves adding critical sections to the code for writing a pin.
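
A host-side sketch of why this matters, under stated assumptions: ModelPort and both functions are hypothetical and only simulate the order of events. A read-modify-write of a shared output register can lose an update if an ISR changes the register between the read and the write-back, which is what a critical section prevents. A write to a dedicated set or clear register, as used in the code in post #1, is a single store and does not have that window.

```cpp
#include <cstdint>

// Hypothetical model of a port with a data register and set/clear semantics.
struct ModelPort {
  uint32_t dr = 0;
  void writeSet(uint32_t mask)   { dr |= mask; }   // models a hardware "set bits" register
  void writeClear(uint32_t mask) { dr &= ~mask; }  // models a hardware "clear bits" register
};

// Main code sets bit 0 by read-modify-write; a simulated ISR sets bit 1 in between.
uint32_t interruptedReadModifyWrite() {
  ModelPort port;
  uint32_t snapshot = port.dr;      // main: read
  port.dr |= 1u << 1;               // "ISR" runs here and sets bit 1
  port.dr = snapshot | (1u << 0);   // main: writes back a stale value, bit 1 is lost
  return port.dr;
}

// Same two updates through the set register: each is one store, nothing is lost.
uint32_t interruptedSetRegisterWrite() {
  ModelPort port;
  port.writeSet(1u << 1);           // "ISR"
  port.writeSet(1u << 0);           // main
  return port.dr;
}
```

In this model the first function ends with only bit 0 set, and the second ends with both bits set.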
 