Fast 8 bit parallel I/O for T4.0

markkimball

Active member
For various reasons I've been looking at ways to get reasonably fast parallel I/O for 8 bits on a Teensy 4.0. None of the T4.0's GPIO ports that are used for digital I/O expose 8 contiguous bits, but I did find that GPIO6 has two contiguous groups of 4 bits or more. Getting 8 bits still requires a small amount of bit twiddling and shifts, but those operations are pretty fast. I wrote some code to test the idea out and it seems to work OK.

I wrote it so it wouldn't be too difficult to turn the core functions into a library. It is simple enough that I have included it here:

Code:
/* 8 bit parallel I/O functions.
 *  For Teensy4.0
 *  The read/write routines directly access the GPIO6 data register for fast
 *  reads and writes.
 *  
 *  GPIO6 is the best choice because:
 *  pin #'s 21, 20, 23, 22, 16 and 17 (in MSB to LSB order, GPIO6 bit#'s 27-22) are contiguous;
 *  and pin#'s 15, 14, 18 and 19 are contiguous in the same order, bit#'s 19-16.
 *  So one register read could fetch 2 separate nybbles that can be combined to make an 8-bit word.
 *  If we choose bit #'s 27-24 for the upper nybble, we need to right-shift them by 20 bits;
 *  and for the lower 4 bits (#'s 19-16) we need to right-shift them by 16 bits.
 *  the T4's processor only supports 32-bit Read/Write accesses so we're stuck with both shift operations.
 * 
 *  if raw_reg = GPIO6_DR, we can do something like this to read an 8-bit input:
 *  raw_reg &= 0x0f0f0000;
 *   value = ((raw_reg >> 20) | (raw_reg >> 16)) & 0xff;
 * 
 *  outputting an 8 bit value will require some masking to avoid changing the state of other pins associated with GPIO6.
 * 
 * 
 *  The bit ordering can be found in core_pins.h, in the section that defines CORE_PINX_BIT,
 *  where X = 0...39.
 *  If we use an external multiplexer to place either the upper 8 or lower 8 bits from a 16-bit ADC on these pins, we can
 *  then combine them to acquire a 16-bit value.
 *  OK this is not optimal compared to directly fetching 16 bits, but the Teensy4.0 board design doesn't permit that.
 * 
 *   NOTE:  we still will use pinMode() to configure the input/output pins.
 */

 #include <Arduino.h>
 #include <core_pins.h>

#define _pickbits 0x0f0f0000
#define ledpin 13

const int EightPins[8] = {21,20,23,22,15,14,18,19};

// I'm using a somewhat-Arduino-like naming scheme for these.

uint8_t digitalRead8Bits(void)
{
	uint32_t rawregvalue;
	
	rawregvalue = GPIO6_DR & _pickbits;
	return (uint8_t) ((rawregvalue >> 20) + (rawregvalue >> 16)) & 0xff;
}


// For a bit more efficiency I'm using the XOR function to change the output bits.
// This avoids the need to explicitly protect other register bits.
//  Since the bootup logic states are all LOWs this should work even if digitalWrite8Bits() is called before pinMode8Bits() is called.
volatile uint32_t _last = 0;

void digitalWrite8Bits(uint8_t value)
{
  uint32_t writebits,newbits;
  
  newbits =  (( (uint32_t) value & 0xf0) << 20 ) + (((uint32_t) value & 0x0f)<<16); // we want to update _last with this value later
  writebits = newbits ^ _last;
  GPIO6_DR ^= writebits;
  _last = newbits;
}

void pinMode8Bits(int mode)
{
 int i;
 for(i = 0; i < 8; i++)
  pinMode(EightPins[i], mode);
}

void setup() {
   pinMode8Bits(OUTPUT);
   digitalWrite8Bits(LOW);
 } // end setup()


// We test this with a simple "walking one" routine
void loop() {
  int i,j = 1;

  for(i = 0; i < 8; i++)
  {
   digitalWrite8Bits(j);
   j <<= 1;
   delay(500);
 }

}

I'm a get'er done kind of c/c++ programmer so any suggestions on how to improve the code are welcome.
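
For anyone who wants to check the nybble packing off-target first, here is a minimal host-side sketch of the shift-and-mask math from the header comment. It assumes plain C++11 and nothing Teensy-specific; the names kPickBits, packByte, unpackByte and checkRoundTrip are made up for illustration and are not part of the sketch above.

```cpp
#include <assert.h>
#include <stdint.h>

// GPIO6 bits 27-24 carry the upper nybble and bits 19-16 the lower one.
// Same value as _pickbits in the sketch; the name is hypothetical.
constexpr uint32_t kPickBits = 0x0f0f0000u;

// Spread an 8-bit value across the two 4-bit groups of the register word.
constexpr uint32_t packByte(uint8_t value) {
  return ((static_cast<uint32_t>(value) & 0xf0u) << 20) |
         ((static_cast<uint32_t>(value) & 0x0fu) << 16);
}

// Collapse a raw 32-bit register word back into an 8-bit value.
constexpr uint8_t unpackByte(uint32_t raw) {
  return static_cast<uint8_t>(
      (((raw & kPickBits) >> 20) | ((raw & kPickBits) >> 16)) & 0xffu);
}

// Compile-time spot check: 0xA5 lands in bits 27-24 and 19-16.
static_assert(packByte(0xA5) == 0x0A050000u, "unexpected bit layout");

// Run-time round trip over all 256 values, with every unrelated
// register bit set to 1 to show that the mask really isolates our pins.
inline void checkRoundTrip() {
  for (unsigned v = 0; v < 256; ++v) {
    uint32_t raw = packByte(static_cast<uint8_t>(v)) | ~kPickBits;
    assert(unpackByte(raw) == v);
  }
}
```

Calling checkRoundTrip() from a small main() on a PC exercises every byte value without any hardware attached.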
 
Code:
	return (uint8_t) ((rawregvalue >> 20) + (rawregvalue >> 16)) & 0x0f;
Shouldn't that be & 0xff ?

Pete
 
I should add that el_supremo must have seen the un-edited version of my first post (I caught the error before he posted his comment). So latecomers won't see my error.....I think....

Maybe it's best to leave our misteaks <deliberate misspelling> alone and take our lumps :). At least that way everyone can be in on the conversation.
 
I think it is best to fix them in the original post, so when someone new comes along later and takes that code from the OP, they have a working version of it rather than potentially dealing with an error for a while themselves, before spending even more time trawling through a forum thread looking for and applying updates.
 
Benchmarking my routines gave me a read rate a bit over 70MBytes/second and about half that for writes. Quite a bit slower than I had expected for writing. Delving into the digitalWriteFast code I found that it uses a different set of GPIO registers, the DR_SET and DR_CLEAR registers. I modified my 8-bit write routine to use them instead and achieved almost 170MBytes/second.

The new digitalWrite8Bits routine looks like this:

Code:
// This faster version of digitalWrite8Bits() uses the GPIO SET and CLEAR registers, similar to how digitalWriteFast() works.
// Despite the fact that I'm writing to two registers instead of one, this approach benchmarks much faster.
void digitalWrite8Bits(uint8_t value)
{
  uint32_t writebits;

  writebits = (( (uint32_t) value & 0xf0) << 20 ) + (((uint32_t) value & 0x0f)<<16);
  GPIO6_DR_SET = writebits;
  GPIO6_DR_CLEAR = writebits ^ _pickbits;
}

BTW, _pickbits is 0x0f0f0000, as defined in my OP.

The disadvantage of this approach is that the 0-->1 transitions occur some nanoseconds before the 1-->0 transitions do. If this is an issue, it would be necessary to use an external 8-wide latch clocked by a ninth Teensy pin. The transition order can also easily be swapped by swapping the order of the two register writes. You would not use a transparent latch; it would have to be an edge-triggered type - but anyone who cares about this kind of thing would know that anyway :).
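
To make the skew concrete, here is a small host-side sketch of which byte is briefly visible on the pins between the two register writes. It is only a model of the behaviour described above, not something measured on hardware, and the names transientByte and checkTransients are hypothetical.

```cpp
#include <assert.h>
#include <stdint.h>

// Byte visible on the pins between the two register writes.
// setFirst == true  : SET is written before CLEAR, so rising bits change
//                     first and the transient value is (old | new).
// setFirst == false : CLEAR is written before SET, so falling bits change
//                     first and the transient value is (old & new).
constexpr uint8_t transientByte(uint8_t oldValue, uint8_t newValue,
                                bool setFirst) {
  return setFirst ? static_cast<uint8_t>(oldValue | newValue)
                  : static_cast<uint8_t>(oldValue & newValue);
}

inline void checkTransients() {
  // Going from 0x0F to 0xF0 passes through 0xFF when SET comes first...
  assert(transientByte(0x0F, 0xF0, true) == 0xFF);
  // ...and through 0x00 when CLEAR comes first.
  assert(transientByte(0x0F, 0xF0, false) == 0x00);
  // Writing the same value again produces no glitch either way.
  assert(transientByte(0x55, 0x55, true) == 0x55);
  assert(transientByte(0x55, 0x55, false) == 0x55);
}
```

A receiver that samples during that window sees the transient value, which is why an edge-triggered latch strobed after both writes hides it.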

I don't think there's much hope for speeding up the read rate, since it's necessary to read the GPIO data register, and that is what slowed down my original write function. Still, 71MBytes/second isn't too bad.
Mark
 
Hi,

You could also remember the pins states and then use the GPIO6_DR_TOGGLE register instead to prevent a delay between the high to low and low to high transitions.

All the details for the available registers can be found in the IMXRT1060 manual available here, specifically chap 12.
 
RE: the toggle register -- good idea! that would take care of the precedence issue. It actually will be more like my original approach, but would use a different register. So it should be a trivial thing to do.

I downloaded the manual long ago. Big beast, it is.
 
Writing to the TOGGLE register benchmarks between the versions that modify DR and SET/CLEAR , somewhere around 80MBytes/second. Writing to SET/CLEAR is a bit more than twice as fast, according to my testing.

These raw numbers represent what the Teensy4.0 can do when it's running in a tight loop (no overclocking) with no additional data processing. So a real-world application that needs to run other code can't achieve these kinds of sample rates. That said, the 600MHz system clock is fast enough for the processor to do some concurrent data processing.
 
Hi,

I find your results quite surprising. I would think that the TOGGLE, SET and CLEAR registers have the same access time... In my (admittedly very dirty) test, I find that the write time is the same for all 3 registers: it takes two cycles when all values are hardcoded. So that is around 300MHz when the T4 is clocked at 600MHz. But of course, this kind of testing is mostly useless in practice as many other factors will certainly play a more important role in practical applications. Note also that Teensyduino moves the digital pins to the 'fast' ports GPIO6-9 at boot.
 
I modified my code to benchmark all three approaches and it reports a different result when modifying the TOGGLE register. I'm not sure why -- I didn't change the code other than rename the various versions of digitalWrite8Bitsxx, where xx is either _DR, _TOGGLE or _SET_CLEAR.

Anyway, my most-recent benchmarking produced the following result:


WRITE rate to GPIO6_DR, Mbytes/sec: 50.00
WRITE rate to GPIO6_TOGGLE, Mbytes/sec: 250.00
WRITE rate to GPIO6_SET_CLEAR, Mbytes/sec: 166.67
data READ rate, Mbytes/sec: 71.43


I have included the code below. It occurs to me that digitalWrite8Bits_DR() may run more slowly because it uses a read-modify-write sequence in the form of:
GPIO6_DR ^= writebits. It would be faster to just write the new value straight to GPIO6_DR (in a different form, of course, since we would no longer XOR against the register).
BUT this has the great disadvantage of smashing all the other bits that are used to set pin states. Given that the other two forms preserve the other bits
and run much faster, it makes no sense to use digitalWrite8Bits_DR().

Code:
/* Code to benchmark my 8 bit parallel I/O functions.
 *  For Teensy4.0
 *  The read/write routines directly access the GPIO6 data registers for fast
 *  reads and writes.
 *  
 *  GPIO6 is the best choice because:
 *  pin #'s 21, 20, 23, 22, 16 and 17 (in MSB to LSB order, GPIO6 bit#'s 27-22) are contiguous;
 *  and pin#'s 15, 14, 18 and 19 are contiguous in the same order, bit#'s 19-16.
 *  So one register read could fetch 2 separate nybbles that can be combined to make an 8-bit word.
 *  If we choose bit #'s 27-24 for the upper nybble, we need to right-shift them by 20 bits;
 *  and for the lower 4 bits (#'s 19-16) we need to right-shift them by 16 bits.
 *  the T4's processor only supports 32-bit Read/Write accesses so we're stuck with both shift operations.
 * 
 *  if raw_reg = GPIO6_DR & 0x0f0f0000, we can do something like this to read an 8-bit input:
 *  value = ((raw_reg >> 20) | (raw_reg >> 16)) & 0xff
  * outputting an 8 bit value will require some masking to avoid changing the state of other pins associated with GPIO6.
 * 
 * 
 *  The bit ordering can be found in core_pins.h, in the section that defines CORE_PINX_BIT,
 *  where X = 0...39.
 * 
 *   NOTE: pinMode() is used inside pinMode8Bits() to configure the input/output pins.  Calling any of the digitalWrite8Bits_..() routines before
 *   setting the 8-bit mode to OUTPUT should work the same as in digitalWrite and pinMode.
 *   
 * 
 */

// #include <Arduino.h>
 #include <core_pins.h>

#define _pickbits 0x0f0f0000  // no trailing semicolon, so the macro is safe inside larger expressions
#define ledpin 13

const int EightPins[8] = {21,20,23,22,15,14,18,19};

// I'm using a somewhat-Arduino-like naming scheme for these.

uint8_t digitalRead8Bits(void)
{
	uint32_t rawregvalue;
	
	rawregvalue = GPIO6_DR & _pickbits;
  return (uint8_t) ((rawregvalue >> 20) + (rawregvalue >> 16));
}

void pinMode8Bits(int mode)
{
 int i;
 for(i = 0; i < 8; i++)
  pinMode(EightPins[i], mode);
}

uint32_t _last_DR = 0;

// This version writes to DR.  _last_DR remembers the last value written, but GPIO6_DR ^= ... is still a read-modify-write of the register.
void digitalWrite8Bits_DR(uint8_t value)
{
  uint32_t writebits,newbits;
  
  newbits =  (( (uint32_t) value & 0xf0) << 20 ) + (((uint32_t) value & 0x0f)<<16); // we want to update _last with this value later
  writebits = newbits ^ _last_DR;
  GPIO6_DR ^= writebits;
  _last_DR = newbits;
}

// A version that uses the toggle register.  Suggestion by vindar on the PJRC forum.
uint32_t _last_TOGGLE = 0;

void digitalWrite8Bits_TOGGLE(uint8_t value)
{
  uint32_t writebits,newbits;
  
  newbits =  (( (uint32_t) value & 0xf0) << 20 ) + (((uint32_t) value & 0x0f)<<16); // we want to update _last with this value later
  writebits = newbits ^ _last_TOGGLE;
  GPIO6_DR_TOGGLE = writebits;
  _last_TOGGLE = newbits;
}

// This faster version of digitalWrite8Bits() uses the GPIO SET and CLEAR registers, similar to how digitalWriteFast() works.
// Despite the fact that I'm writing to two registers instead of one, this approach benchmarks much faster.
void digitalWrite8Bits_SET_CLEAR(uint8_t value)
{
  uint32_t writebits;

  writebits = (( (uint32_t) value & 0xf0) << 20 ) + (((uint32_t) value & 0x0f)<<16); // same as above
  GPIO6_DR_SET = writebits;
  GPIO6_DR_CLEAR = writebits ^ _pickbits;
}

void setup() {
  float time0;
  int i;
  uint16_t temp,temp1,temp2,temp3,temp4,temp5,temp6,temp7,temp8,temp9;  // this is to keep the compiler from optimizing my benchmark code too much.  I hope...
  
  Serial.begin(115200);
  while(!Serial);
  
   pinMode8Bits(OUTPUT);

   // Benchmark for modifying GPIO6_DR
   time0 = (float) millis();
   
   for(i = 0; i < 100000; i++) // one hundred-thousand loops
   {
   digitalWrite8Bits_DR(0); // ten calls/loop = one million calls
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
}
  time0 = (float) millis() - time0; // This is how long it took to execute one million calls in milliseconds.
  digitalWrite8Bits_DR(0); // This is done so subsequent calls to digitalWrite8Bits_TOGGLE work right (probably doesn't matter in the context of benchmarking)
  Serial.print("WRITE rate to GPIO6_DR, Mbytes/sec: ");
  Serial.println(1000/time0);

   // Benchmark for modifying GPIO6_TOGGLE
   time0 = (float) millis();
   
   for(i = 0; i < 100000; i++) // one hundred-thousand loops
   {
   digitalWrite8Bits_TOGGLE(0); // ten calls/loop = one million calls
   digitalWrite8Bits_TOGGLE(0xff);
   digitalWrite8Bits_TOGGLE(0);
   digitalWrite8Bits_TOGGLE(0xff);
   digitalWrite8Bits_TOGGLE(0);
   digitalWrite8Bits_TOGGLE(0xff);
   digitalWrite8Bits_TOGGLE(0);
   digitalWrite8Bits_TOGGLE(0xff);
   digitalWrite8Bits_TOGGLE(0);
   digitalWrite8Bits_TOGGLE(0xff);
}
  time0 = (float) millis() - time0; // This is how long it took to execute one million calls in milliseconds.
  digitalWrite8Bits_TOGGLE(0);
  Serial.print("WRITE rate to GPIO6_TOGGLE, Mbytes/sec: ");
  Serial.println(1000/time0);

  // Benchmark for modifying GPIO6_SET and _CLEAR

   time0 = (float) millis();
   
   for(i = 0; i < 100000; i++) // one hundred-thousand loops
   {
   digitalWrite8Bits_SET_CLEAR(0); // ten calls/loop = one million calls
   digitalWrite8Bits_SET_CLEAR(0xff);
   digitalWrite8Bits_SET_CLEAR(0);
   digitalWrite8Bits_SET_CLEAR(0xff);
   digitalWrite8Bits_SET_CLEAR(0);
   digitalWrite8Bits_SET_CLEAR(0xff);
   digitalWrite8Bits_SET_CLEAR(0);
   digitalWrite8Bits_SET_CLEAR(0xff);
   digitalWrite8Bits_SET_CLEAR(0);
   digitalWrite8Bits_SET_CLEAR(0xff);
}
  time0 = (float) millis() - time0; // This is how long it took to execute one million calls in milliseconds.
  digitalWrite8Bits_SET_CLEAR(0);
  Serial.print("WRITE rate to GPIO6_SET_CLEAR, Mbytes/sec: ");
  Serial.println(1000/time0);

  // now see how long it takes to read 8 bits

  time0 = (float) millis();
  for(i =0; i < 100000; i++)
  {
    temp = digitalRead8Bits();
    temp1 = digitalRead8Bits();
    temp2 = digitalRead8Bits();
    temp3 = digitalRead8Bits();
    temp4 = digitalRead8Bits();
    temp5 = digitalRead8Bits();
    temp6 = digitalRead8Bits();
    temp7 = digitalRead8Bits();
    temp8 = digitalRead8Bits();
    temp9 = digitalRead8Bits();
  }
  time0 = (float) millis() - time0;
  Serial.print("data READ rate, Mbytes/sec: ");
  Serial.println(1000/time0);
 } // end setup()



void loop() {
  while(1);

/*
 //  A walking ones test to verify correct pin vs bit ordering.
  int i,j = 1;
  unsigned char readback;

  for(i = 0; i < 8; i++)
  {
   digitalWrite8Bits_TOGGLE(j); // any of the three write variants can be used here
   readback = digitalRead8Bits();
   Serial.println(readback, HEX);
   j <<= 1;
   delay(500);
   }
 */
}
 
BTW, the version writing to SET and CLEAR runs slower because it performs two register writes, compared to the single write to TOGGLE. So the benchmark results are consistent with this.
 
I noticed some peculiar variations in the benchmark results that were resolved by using micros() instead of millis() to get the execution time (along with a few minor code changes to account for the changeover from milliseconds to microseconds). The improvement probably is due to the relatively poor time resolution of millis() in this situation. Now writing to the TOGGLE register benchmarks to a consistent 300MBytes/second. Using the TOGGLE register for fast writes appears to be the best option at this point, particularly if you don't want the parallel output code to impose much of a processor load.
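
The micros() version of the benchmark is not shown in the thread, so the following is only a sketch of the unit conversion it needs, written as portable C++ so it can be checked on a PC. The function names are hypothetical. It also illustrates the resolution problem: at roughly 300MBytes/second, one million writes take about 3.3 ms, and millis() can only report whole milliseconds, so 1000/time0 can only land on values such as 250.00 (4 ms) or 333.33 (3 ms).

```cpp
#include <assert.h>
#include <stdint.h>

// Bytes per microsecond is numerically the same as megabytes
// (10^6 bytes) per second, so no extra scale factor is needed.
inline float mbytesPerSecFromMicros(uint32_t bytesMoved,
                                    uint32_t elapsedMicros) {
  return static_cast<float>(bytesMoved) / static_cast<float>(elapsedMicros);
}

inline void checkRates() {
  // One million bytes in exactly 4 ms gives the 250.00 figure.
  assert(mbytesPerSecFromMicros(1000000u, 4000u) == 250.0f);
  // The same transfer timed as 3334 us resolves to roughly 300.
  float r = mbytesPerSecFromMicros(1000000u, 3334u);
  assert(r > 299.0f && r < 301.0f);
}

// On the Teensy the call site would presumably look like this
// (untested, shown as a comment because micros() is not available here):
//   uint32_t t0 = micros();
//   ... one million calls to the write routine ...
//   uint32_t dt = micros() - t0;
//   Serial.println(mbytesPerSecFromMicros(1000000, dt));
```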

One caveat: independently setting or clearing any of the eight pins used by the 8-bit write function (with digitalWrite(), for example) may cause problems with subsequent calls to digitalWrite8Bits_TOGGLE(), because its saved copy of the last value no longer matches the actual pin state.
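
Here is a host-side sketch of that failure mode, using a plain struct as a stand-in for the GPIO6 data register. FakePort, ShadowWriter and resync() are all hypothetical names; resync() is one possible remedy (re-reading the register, at the cost of a register read), not something from the code above.

```cpp
#include <assert.h>
#include <stdint.h>

constexpr uint32_t kPickBits = 0x0f0f0000u;  // same value as _pickbits

constexpr uint32_t packByte(uint8_t v) {
  return ((static_cast<uint32_t>(v) & 0xf0u) << 20) |
         ((static_cast<uint32_t>(v) & 0x0fu) << 16);
}

// Stand-in for the GPIO6 output data register (host-only model).
struct FakePort {
  uint32_t dr = 0;
  void toggle(uint32_t mask) { dr ^= mask; }  // models a DR_TOGGLE write
  void set(uint32_t mask) { dr |= mask; }     // models a DR_SET write
};

// Toggle-based writer that trusts a software copy of the last value.
struct ShadowWriter {
  uint32_t last = 0;
  void write(FakePort &port, uint8_t value) {
    uint32_t next = packByte(value);
    port.toggle(next ^ last);  // flip only the bits believed to differ
    last = next;
  }
  // Re-read the port so the copy matches the pins again.
  void resync(const FakePort &port) { last = port.dr & kPickBits; }
};

inline void checkShadowDrift() {
  FakePort port;
  ShadowWriter w;
  w.write(port, 0x00);
  port.set(1u << 16);   // other code drives GPIO6 bit 16 (pin 19) high
  w.write(port, 0xff);  // the copy still says 0x00, so all 8 bits flip
  assert((port.dr & kPickBits) == 0x0f0e0000u);  // bit 0 ended up LOW
  w.resync(port);
  w.write(port, 0xff);  // now only the one wrong bit is toggled
  assert((port.dr & kPickBits) == kPickBits);
}
```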
 
@markkimball: using uint32_t time0; and ARM_DWT_CYCCNT for timing should have even less error than micros():
Code:
   // Benchmark for modifying GPIO6_DR
   time0 = ARM_DWT_CYCCNT;
   
   for(i = 0; i < 100000; i++) // one hundred-thousand loops
   {
   digitalWrite8Bits_DR(0); // ten calls/loop = one million calls
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
   digitalWrite8Bits_DR(0);
   digitalWrite8Bits_DR(0xff);
}
  time0 = ARM_DWT_CYCCNT - time0; // This is how long it took to execute one million calls, in CPU cycles.
  digitalWrite8Bits_DR(0); // This is done so subsequent calls to digitalWrite8Bits_TOGGLE work right (probably doesn't matter in the context of benchmarking)
  Serial.print("WRITE rate to GPIO6_DR, Mbytes/sec: ");
  Serial.println((float)F_CPU_ACTUAL/time0);

Code:
WRITE rate to GPIO6_DR, Mbytes/sec: 47.24
WRITE rate to GPIO6_TOGGLE, Mbytes/sec: 299.89
WRITE rate to GPIO6_SET_CLEAR, Mbytes/sec: 149.97
data READ rate, Mbytes/sec: 74.99
 
Thanks for the tidbit, @defragster. I wasn't aware of ARM_DWT_CYCCNT.

Glad to point it out. It is of course the MCU's cycle clock counter and at 600 MHz has 600 times more resolution than micros() - and on T_4.x it is used to resolve micros() between millis().

It is generally very useful even for the shortest set of instructions/elapsed time and takes about 3 clock cycles to read, versus the 35-40 cycles it takes to resolve micros() when used in a tight spot like _isr() timing.

It also works on the T_3.x family, but there ARM_DWT_CYCCNT currently isn't started at reset. On the T_4.x family it is, because it is used in resolving micros().

Though at 600 MHz it does wrap in (2^32)/600,000,000 seconds (a little over 7 seconds) when using uint32_t's. @luni has made a lib with a 64 bit aware version.
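
A small host-side sketch of the wrap arithmetic, assuming two samples of the counter taken less than one wrap period apart. The function names are hypothetical; the point is only that unsigned 32-bit subtraction already gives the right elapsed count across a single wrap.

```cpp
#include <assert.h>
#include <stdint.h>

// Elapsed cycles between two samples of a free-running 32-bit counter.
// Unsigned subtraction is modulo 2^32, so one wrap between the samples
// is handled automatically; two or more wraps cannot be detected.
constexpr uint32_t elapsedCycles(uint32_t start, uint32_t end) {
  return end - start;
}

// Seconds until a 32-bit cycle counter wraps at a given clock frequency.
constexpr double wrapSeconds(double cpuHz) {
  return 4294967296.0 / cpuHz;  // 2^32 cycles
}

inline void checkWrap() {
  // Counter wrapped between the samples: 0x10 before and 0x10 after zero.
  assert(elapsedCycles(0xFFFFFFF0u, 0x00000010u) == 0x20u);
  // No wrap: plain difference.
  assert(elapsedCycles(1000u, 1600u) == 600u);
  // At 600 MHz the wrap period is a bit over 7 seconds.
  assert(wrapSeconds(600e6) > 7.15 && wrapSeconds(600e6) < 7.16);
}
```

Anything timed with a single uint32_t difference therefore has to finish within that wrap period.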
 
OK, based on the register in question I'd been wondering why it was updating just every microsecond. But I see that isn't the case at all. Good to know when using it to convert to time.

Does it update even after noInterrupts() is called?
 
Once enabled the ARM_DWT_CYCCNT ticks on each clock cycle of the MCU without pause.

In doing the micros() based offset to millis() code - or other testing - I did something like.

uint32_t anArray[10]

then in some fashion - loop or unrolled using index 'ii' - anArray[ii]=ARM_DWT_CYCCNT

You'll see indications of about 3 clock ticks of 600 MHz when printing the diff of : anArray[ii+1] - anArray[ii]
 
Alright! This will be useful for profiling code sections. I have some concerns with regard to execution speed with an ISR inside something I'm working on; this looks like a way to check it out.

Many thanks for the information.
 