Reading multiple GPIO pins on the Teensy 4.0 "atomically"

Status
Not open for further replies.

rt54321

Member
I plan on using a 16 bit parallel ADC with the Teensy4.0. Currently, I only know how to read each of the 16 pins individually:

// ADC bit 0 (LSB), bit 1, bit 2, bit 3
ADC_result = digitalRead(PIN_C2) + (digitalRead(PIN_C3) << 1) + (digitalRead(PIN_C4) << 2) + (digitalRead(PIN_D2) << 3) ...etc.

Is there a way to "group" GPIO pins together, so that I can read all of the 16 inputs "at the same time"?
 
The short answer is no.

The longer answer is probably not, at least not as 16 contiguous IO pins on a single IO port.

If you look at the GPIO section of the processor reference manual (11.5), it says:
There are eight 32-bit GPIO registers. All registers are accessible from the IP interface.
Only 32-bit access is supported.

I have no idea what your pins like PIN_C2 and PIN_C3 are, but I assume you have some #defines for them somewhere.

The table of which Teensy pin is connected to which IMXRT pin can be found in several places, including: https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=193716&viewfull=1#post193716

You could check whether any of the GPIO ports has 16 actual IO pins brought out on the Teensy. If so, you could configure all of them as inputs and then read them through the DR register. I am guessing that no single GPIO port has 16 pins brought out, but I have not counted them up...

But that would be where I would start looking.
 
There does happen to be one port with 16 IO pins available: pins 0-1 and pins 14-27 are all on GPIO6, so it could be possible. You will have to do some bit shuffling to get them in the right order for a 16-bit variable, but they are there if you want that option.
 
I wonder whether the bit shuffling will end up any faster than reading the pins sequentially. However, you should use digitalReadFast instead of digitalRead, and it might be faster to OR the results together instead of adding.

Code:
uint16_t value =  (digitalReadFast(P1) << 0) | (digitalReadFast(P2) << 1) | (digitalReadFast(P3) << 2) ....

As long as the pin numbers are compile time constants this should be very fast.
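One pitfall when combining bits this way: in C and C++ the + operator binds tighter than <<, so an expression without parentheses around each shifted term does not put the bits where you expect. Here is a minimal host-side sketch of the difference. The function names are made up for illustration, and plain integers stand in for the digitalReadFast results.

```cpp
#include <cstdint>

// How the compiler groups   b0 + b1 << 1 + b2 << 2   (written out explicitly).
constexpr uint32_t combine_unparenthesized(uint32_t b0, uint32_t b1, uint32_t b2) {
    return ((b0 + b1) << (1 + b2)) << 2;
}

// Intended meaning: each bit is shifted to its own position, then ORed in.
constexpr uint32_t combine_parenthesized(uint32_t b0, uint32_t b1, uint32_t b2) {
    return (b0 << 0) | (b1 << 1) | (b2 << 2);
}
```

With only bit 0 set, the parenthesized version returns 1 while the unparenthesized grouping returns 8.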
 
I just measured the speed of this snippet with a logic analyzer on pin 11 and got about 250 ns on a T4 at 600 MHz.

Code:
digitalWriteFast(11, HIGH);
    
    volatile uint16_t value =
        digitalReadFast(2) << 0 |
        digitalReadFast(3) << 1 |
        digitalReadFast(5) << 2 |
        digitalReadFast(1) << 3 |
        digitalReadFast(7) << 4 |
        digitalReadFast(12) << 5 |
        digitalReadFast(15) << 6 |
        digitalReadFast(14) << 7 |
        digitalReadFast(4) << 8 |
        digitalReadFast(6) << 9 |
        digitalReadFast(9) << 10 |
        digitalReadFast(8) << 11 |
        digitalReadFast(7) << 12 |
        digitalReadFast(10) << 13 |
        digitalReadFast(16) << 14 |
        digitalReadFast(17) << 15;

digitalWriteFast(11, LOW);
 
Sorry, this one won't help, but the IMXRT1062 can read or write multiple pins at the same time using the FlexIO subsystem.

In particular, you can configure it for 1, 4, 8, 16, or 32 bits, but there are a few rubs!

They work on FlexIO pins and they have to be contiguous FlexIO pins on the same FlexIO controller. The problem is that the longest run of contiguous FlexIO pins on any of the three controllers is, I believe, 6 pins.

And so far I have not tried this part of the FlexIO subsystem. I have thought about trying to drive some simple display to see how well it worked, but I have not seen many recent ones that allow 4 bit writes.

The interesting thing will be to see which pins we bring out on the larger T4 (T4.1), and whether there will be more contiguous pins. If so, I will probably experiment when the first betas come out.

Again, sorry, I know this probably won't help you with your current stuff, but I thought I would mention it.
 
16 bits at once is going to suck, no doubt about it. For two 8-bit buses on T4 I made it work by reading a 32 bit register (GPIO6_PSR and GPIO7_PSR) and then scrambling it with a macro I can drop into my code that reorders the pins I'm using into a normal byte. This would never be fast if you had 8 (or 16!!) bits scattered all over the place, but GPIO6 in particular does have something like 6 sequential bits brought out to pins on the T4, so you only need a couple bit shifts.

I might do this in my code to read a byte:
Code:
read_byte=MACRO_INVTRANSPOSE_DATAWORD(DATA_BUS_INPUTREG);  // This is the actual read from GPIO bus.
where DATA_BUS_INPUTREG has already been defined elsewhere (as GPIO6_PSR)

That macro looks like this:
Code:
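// Note: orig32b is expanded twice, so passing a volatile register such as GPIO6_PSR reads it twice.
#define MACRO_INVTRANSPOSE_DATAWORD(orig32b) ((byte)((((orig32b) >> 22) << 2) | (((orig32b) & 0b11000000000000000000) >> 18)))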

I couldn't tell you if it's the MOST efficient way to do this, but it seems to do the trick.
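For what it's worth, here is a sketch of the same reordering as a function instead of a macro. The name invtranspose_dataword is made up, and the bit positions (18-19 and 22-27) are taken from the macro above. Because the register value is passed by value, the port is read only once per call, so all eight bits come from the same sample.

```cpp
#include <cstdint>

// Hypothetical function version of the macro (name is an assumption).
// Register bits 22-27 become byte bits 2-7, register bits 18-19 become bits 0-1.
constexpr uint8_t invtranspose_dataword(uint32_t raw) {
    return static_cast<uint8_t>((((raw >> 22) & 0x3Fu) << 2) | ((raw >> 18) & 0x03u));
}
```

On the Teensy it would be called as invtranspose_dataword(DATA_BUS_INPUTREG).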

I'd pay $50 a piece no doubt about it for T4.1s that bring out two separate GPIO buses each with 8 or more sequential bits brought out to pins.
 
Hi DukeBlue219,

I am with you on this one. I am attempting digital video acquisition from an analog camera's ITU-R BT.656 port, and need 11 bits (8-bit data, clk, hsync, and vsync).

The clock runs at 28.5MHz so I get 17nsec max to read in the data. I really want to be able to do one read to get all 11 bits, then with minimal logic sort out the data and store to SD card.

Like you I'd love a Teensy that let me read in sequential bits ...

Regards, Tony Barry
 
I'd pay $50 a piece no doubt about it for T4.1s that bring out two separate GPIO buses each with 8 or more sequential bits brought out to pins.

This thread is the place to comment about the pinout for Teensy 4.1.

https://forum.pjrc.com/threads/58028-Pins-to-bring-out-on-a-hypothetical-larger-Teensy4

At this point, it's looking pretty much certain T4.1 will have at least 16 contiguous bits (16 to 31) from GPIO7 (AD_B1_xx), all on outside pins, not bottom side pads.
 
If you are doing intermittent data collection, you can read and store GPIO6 very quickly and then do the slower process of repacking the bits later.

Looks like it would take 5 mask/shift/OR operations to repack bits 2,3,12,13,16,17,18,19,22,23,24,25,26,27,30,31 into a 16 bit word. But maybe there is a faster way?

Edit: yes, looks like it could be done in 3 mask/shift/OR operations. Move bits 2,3 to 14,15 and bits 30,31 to 20,21. I haven't looked at instructions like "Bit Field Insert". Maybe just three BFIs?
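To make the three-operation version concrete, here is a minimal sketch in plain C++ that can be checked on a PC. The function name repack_gpio6 is made up, and the raw GPIO6 value is passed in as a parameter so nothing here touches hardware. The resulting bit order follows the register layout, so the ADC lines would have to be wired to match (that part is an assumption about the wiring, not something tested here).

```cpp
#include <cstdint>

// Hypothetical helper: pack GPIO6 bits 2,3,12,13,16-19,22-27,30,31 into 16 bits.
// After the two moves, bits 12-27 form one contiguous field.
constexpr uint16_t repack_gpio6(uint32_t raw) {
    return static_cast<uint16_t>(
        ((raw & 0x0FCF3000u)             // bits 12-13, 16-19, 22-27 stay in place
         | ((raw & 0x0000000Cu) << 12)   // bits 2,3   -> 14,15
         | ((raw & 0xC0000000u) >> 10))  // bits 30,31 -> 20,21
        >> 12);                          // field starts at bit 12
}
```

Register bit 12 ends up as result bit 0, bits 2 and 3 as result bits 2 and 3, and bits 30 and 31 as result bits 8 and 9.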
 
Thank you for this info Paul, it is much appreciated.

Do you know where there is a collated tutorial on "speedy IO for the Teensy 4" ? Presently I am finding that all info is on the forum, widely disseminated. The responsivity of the forum is very gratifying, but repeated questions all asking the same thing will wear down the present generous responders - and I would prefer to not do that :).

I do appreciate that the platform is moving very quickly towards its stable configuration, and it's not appropriate to have tutorials until the platform is stable. This should not be taken as any kind of criticism; the Teensy 4 work is very exciting and has great opportunities ahead.

Regards, Tony Barry
 
I have come up with a way to do this using the longest IO runs I could find...

I used 4 different runs in GPIO6, GPIO7, and GPIO9 and shifted them together. For reference here were my notes.
Code:
#define CORE_PIN10_PORTREG	GPIO7_DR
#define CORE_PIN11_PORTREG	GPIO7_DR
#define CORE_PIN12_PORTREG	GPIO7_DR

#define CORE_PIN10_BIT		0
#define CORE_PIN12_BIT		1
#define CORE_PIN11_BIT		2

#define CORE_PIN2_PORTREG	GPIO9_DR
#define CORE_PIN3_PORTREG	GPIO9_DR
#define CORE_PIN4_PORTREG	GPIO9_DR

#define CORE_PIN2_BIT		4
#define CORE_PIN3_BIT		5
#define CORE_PIN4_BIT		6

#define CORE_PIN14_PORTREG	GPIO6_DR
#define CORE_PIN15_PORTREG	GPIO6_DR
#define CORE_PIN18_PORTREG	GPIO6_DR
#define CORE_PIN19_PORTREG	GPIO6_DR

#define CORE_PIN19_BIT		16
#define CORE_PIN18_BIT		17
#define CORE_PIN14_BIT		18
#define CORE_PIN15_BIT		19

#define CORE_PIN16_PORTREG	GPIO6_DR
#define CORE_PIN17_PORTREG	GPIO6_DR
#define CORE_PIN20_PORTREG	GPIO6_DR
#define CORE_PIN21_PORTREG	GPIO6_DR
#define CORE_PIN22_PORTREG	GPIO6_DR
#define CORE_PIN23_PORTREG	GPIO6_DR

#define CORE_PIN17_BIT		22
#define CORE_PIN16_BIT		23
#define CORE_PIN22_BIT		24
#define CORE_PIN23_BIT		25
#define CORE_PIN20_BIT		26
#define CORE_PIN21_BIT		27

As you can see, I organized each pin by its offset in the IOMUX region.

So you can use pins [10, 12, 11 | 2, 3, 4 | 19, 18, 14, 15 | 17, 16, 22, 23, 20, 21] (as they appear on the solder mask) to do a 16 bit parallel read.
Here is my sketch code:
Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)
#define IMXRT_GPIO9_DIRECT  (*(volatile uint32_t *)0x4200C000)
#define IMXRT_GPIO7_DIRECT  (*(volatile uint32_t *)0x42004000)

// this can be done with a mask and 2 less shifts
//#define IO_BLOCK_A ((IMXRT_GPIO7_DIRECT << 29) >> 29)
//#define IO_BLOCK_B (((IMXRT_GPIO9_DIRECT >> 4) << 29) >> 26)
//#define IO_BLOCK_C (((IMXRT_GPIO6_DIRECT >> 16) << 28) >> 22)
//#define IO_BLOCK_D (((IMXRT_GPIO6_DIRECT >> 22) << 26) >> 16)

#define IO_BLOCK_A (IMXRT_GPIO7_DIRECT & 0b00000000000000000000000000000111)
#define IO_BLOCK_B ((IMXRT_GPIO9_DIRECT & 0b00000000000000000000000001110000) >> 1)
#define IO_BLOCK_C ((IMXRT_GPIO6_DIRECT & 0b00000000000011110000000000000000) >> 10)
#define IO_BLOCK_D ((IMXRT_GPIO6_DIRECT & 0b00001111110000000000000000000000) >> 12)

void setup() {
  pinMode(2, INPUT);
  pinMode(3, INPUT);
  pinMode(4, INPUT);
  pinMode(10, INPUT);
  pinMode(11, INPUT);
  pinMode(12, INPUT);
  pinMode(17, INPUT);
  pinMode(16, INPUT);
  pinMode(22, INPUT);
  pinMode(23, INPUT);
  pinMode(20, INPUT);
  pinMode(21, INPUT);
  pinMode(19, INPUT);
  pinMode(18, INPUT);
  pinMode(14, INPUT);
  pinMode(15, INPUT);

  Serial.begin(115200);
}

void loop() {
  uint16_t data = IO_BLOCK_A | IO_BLOCK_B | IO_BLOCK_C | IO_BLOCK_D;
  Serial.println(data, HEX);
}

I was able to get somewhere between 40 and 80 ns by doing this. It took me a weekend to figure out a couple of months back.
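One thing to watch with the macros above: IO_BLOCK_C and IO_BLOCK_D each read GPIO6 separately, so those two groups of bits are sampled at slightly different moments. A possible variation, sketched below with a made-up function name pack16, takes one snapshot of each register and applies the same masks and shifts. It is written so it can be checked on a PC, with the register values passed in as plain integers.

```cpp
#include <cstdint>

// Hypothetical helper (not from the sketch above): combine one snapshot of
// each port into the 16-bit word, same masks and shifts as IO_BLOCK_A..D.
constexpr uint16_t pack16(uint32_t gpio7, uint32_t gpio9, uint32_t gpio6) {
    return static_cast<uint16_t>(
          (gpio7 & 0x00000007u)             // GPIO7 bits 0-2   -> result bits 0-2
        | ((gpio9 & 0x00000070u) >> 1)      // GPIO9 bits 4-6   -> result bits 3-5
        | ((gpio6 & 0x000F0000u) >> 10)     // GPIO6 bits 16-19 -> result bits 6-9
        | ((gpio6 & 0x0FC00000u) >> 12));   // GPIO6 bits 22-27 -> result bits 10-15
}
```

In the sketch it would be called as pack16(IMXRT_GPIO7_DIRECT, IMXRT_GPIO9_DIRECT, IMXRT_GPIO6_DIRECT), which reads each register exactly once. The three ports are still read one after another, so this is not a truly simultaneous 16-bit sample.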
 
They work on FlexIO pins and they have to be contiguous FlexIO pins on the same FlexIO controller.
Do you think it would be possible to "fake" contiguous pins by using an XBAR configuration that copies your desired pins onto an unused (unpopulated) set of GPIO?

I think it would be really useful to be able to use FlexIO for this kind of thing. I wonder if using XBAR would make it work.

EDIT: it looks like this library is doing something of the sort https://github.com/mjs513/Teensy-4.x-Quad-Encoder-Library
 
I checked and the exact timing on the OR combined method is ~50ns reads.
So I tried using BFI and for some reason it is 9ns slower than using my OR combined method.

Code:
  data = GPIO7_DR & 0b00000000000000000000000000000111;
  asm volatile("bfi %0, %1, 3, 3" : "+r"(data) : "r"(GPIO9_DR >> 4));
  asm volatile("bfi %0, %1, 6, 4" : "+r"(data) : "r"(GPIO6_DR >> 16));
  asm volatile("bfi %0, %1, 10, 6" : "+r"(data) : "r"(GPIO6_DR >> 22));

As you can see, doing BFI should reduce the number of operations by one-per span (we replace a MASK,SHIFT,OR with a SHIFT,BFI). This should save time but for some reason it causes almost 10ns more time to be taken.
Note that I am using "-O3" so GCC may be optimizing my bitwise math further than I can with well crafted C code.
 
I checked and the exact timing on the OR combined method is ~50ns reads.
So I tried using BFI and for some reason it is 9ns slower than using my OR combined method.
...
As you can see, doing BFI should reduce the number of operations by one-per span (we replace a MASK,SHIFT,OR with a SHIFT,BFI). This should save time but for some reason it causes almost 10ns more time to be taken.
Note that I am using "-O3" so GCC may be optimizing my bitwise math further than I can with well crafted C code.

It might be that the prior C code, as arranged by the compiler, allowed the CPU to execute instructions in parallel.
 
It would be somewhat faster to use only a single read of GPIO6 (vs. reading 7, 9, and 6 twice). It also looks to be much faster without "data" being volatile. Something like this for pins 0-1 and 14-27 on GPIO6, which takes only 8 instructions (vs. 22).

Code:
uint32_t test2(){
  register uint32_t data  = IMXRT_GPIO6_DIRECT >> 2;
  asm volatile("bfi %0, %1, 12, 2" : "+r"(data) : "r"(data));
  asm volatile("bfi %0, %1, 18, 2" : "+r"(data) : "r"(data >> 28));
  return (data >> 10) & 0xffff;
}

I'm curious, what are the fastest parallel ADC reads anyone has achieved?
 
Interesting - modified to allow some parallel operation, I confirmed that this is 7% faster than the code in #18:
Code:
inline uint32_t test3()
{
  register uint32_t data  = IMXRT_GPIO6_DIRECT;
  register uint32_t data2  = data >> 30;
  register uint32_t data3  = data >> 2;
  asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
  asm volatile("bfi %0, %1, 14, 2" : "+r"(data) : "r"(data3));
  return (data >> 12) & 0xffff;
}
 
I wrote some rough code and it looks like using the above code for getting the 10 bits from GPIO6 shaves 3ns off of the SHIFT,OR method.
That brings us to an avg of 50.0ns per 16bit read.

I will try to write some code using XBAR and test if that can outperform what we're doing with multiple GPIO banks.
 
As written (for all 16 bits from GPIO6), I measure < 25 ns with a for loop around it, at 600 MHz.

Caution: the code has never been tested for correctness.
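One way to check the bit mapping without hardware is a small software model of the BFI instruction (BFI copies the low `width` bits of the source into the destination starting at bit `lsb`). The sketch below mirrors the steps of test3() with the register value passed in as a parameter. The names bfi and test3_model are made up, and this only checks the bit arithmetic, not the inline assembly or the timing.

```cpp
#include <cstdint>

// Software model of ARM BFI: insert the low `width` bits of src into dst
// at bit position `lsb`. Assumes 1 <= width < 32 and lsb + width <= 32.
constexpr uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width) {
    return (dst & ~(((1u << width) - 1u) << lsb))
         | ((src & ((1u << width) - 1u)) << lsb);
}

// Same steps as test3(): move bits 30,31 to 20,21 and bits 2,3 to 14,15,
// then take the 16-bit field that starts at bit 12.
constexpr uint32_t test3_model(uint32_t raw) {
    return (bfi(bfi(raw, raw >> 30, 20, 2), raw >> 2, 14, 2) >> 12) & 0xffffu;
}
```

Feeding it one set bit at a time shows where each GPIO6 bit lands in the result, for example bit 12 goes to result bit 0 and bit 31 to result bit 9.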
 
Just a wrapper to get a time:
Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)

void setup()
{
  Serial.begin(115200);
  delay(2000);
}


inline uint32_t test3()
{
  register uint32_t data  = IMXRT_GPIO6_DIRECT;
  register uint32_t data2  = data >> 30;
  register uint32_t data3  = data >> 2;
  asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
  asm volatile("bfi %0, %1, 14, 2" : "+r"(data) : "r"(data3));
  return (data >> 12) & 0xffff;
}

void loop()
{
  long unsigned stime;
  unsigned total = 0;  // start from zero so the printed sum is well defined

  stime = micros();
  for (register int i = 0; i < 1000000; ++i )
    total += test3();
  Serial.printf("result = %lu %u\n", micros() - stime, total);

  delay(10000);
}
 