Simultaneously reading 8 GPIO pins

Status
Not open for further replies.

RandoRkt

Member
Hello everyone,

I am trying to read and store eight (8) GPIO pins simultaneously. The ultimate goal is to complete this operation in the fewest CPU clock cycles possible. One clock cycle would be ideal, but is most likely infeasible. I can't exceed ~10 instruction clocks, since I am allotted only 18 clock pulses in total for the loop I am writing, and I need some wiggle room for the other operation, which is to write a pin high and then low again after the 8 bits are stored. This was easier on AVR microcontrollers using the PINx register, but the ARM microcontroller doesn't seem to support it. And since I am limited in timing, I can't use the built-in functions like digitalRead(), digitalWrite(), etc., since each one consumes ~50 clock cycles according to post #3 on an Arduino forum thread: https://forum.arduino.cc/t/how-many-clock-cycles-does-digitalread-write-take/467153/2

I have searched these forums extensively and came across some information, such as using IOMUX to make the pins act as GPIO by selecting ALT5. I also found that GPIOx_GDIR sets the pin directions (a 0 bit makes the pin an input, a 1 bit an output), and that GPIOx_DR will then read the whole port. I am aware that a port read returns 32 bits, so I will need to mask off the unused bits and store only the 8 I care about.
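
For reducing a 32-bit port read to 8 bits, as described above, a minimal host-side sketch in plain C++ may help. It assumes the eight pins of interest sit on eight consecutive bits of the port word, starting at a bit position `shift` that would have to be looked up in the pin mapping. The helper name `extract_byte` and the sample value are hypothetical, not part of the Teensy core.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pull 8 consecutive bits out of a 32-bit port word.
// 'shift' is the bit position of the lowest pin of interest (an assumption;
// the real position depends on which pins are wired).
constexpr uint8_t extract_byte(uint32_t port_word, unsigned shift) {
    return static_cast<uint8_t>((port_word >> shift) & 0xFFu);
}

// Compile-time check with a made-up port value: bits 16..23 hold 0xA5.
static_assert(extract_byte(0x12A53400u, 16) == 0xA5, "bits 16..23");
```

If the pins are not on consecutive bits, a single shift and mask is not enough and the bits have to be gathered one by one.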

I also need some way to write a pin high and low on command preferably in a single clock pulse of the 600MHz CPU clock if possible, so once again, the digitalWrite() and I suspect even the digitalWriteFast() are not fast enough.

A few of the relevant forums/sources I have found so far:
https://forum.pjrc.com/threads/64702-GPIO-ports-and-related-control-register
https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test
https://forum.pjrc.com/threads/6149...-other-board-with-2-cores-to-solve-my-problem
The IMXRT1060 datasheet

I am new to programming microcontrollers with C/Arduino-C/Teensyduino; I have only been trained to use assembly for microcontroller programming, and I am very unsure how to program this loop to read and store the 8 parallel pins. I would use a standalone MCU, but adding a memory peripheral and interfacing with it in assembly sounds like more effort than it's worth. And so I have landed on the Teensy 4.0.

Any feedback on how to program this beyond just being told to read the datasheet would be greatly appreciated since the datasheet is pretty convoluted in my opinion.

And thank you Paul, KurtE and TRNPep for your previous posts which have guided me up to this point.
~RRkt
 
Each Teensy 4.x port can read all of its pins with a single read of 32 bits - where the pins will be 'distributed'.

Looking at the PJRC posted schematic or other info will show what pins are on which port and in what order.

Then reading that port all pin values can be seen/recorded.

See this post and related :: Does-T4-0-have-PORTS-like-read-8-pins-at-once

That links to Teensy-4-1-Storing-the-value-of-18-pins-input-quickly

concept is the same - only the pin mapping changes as noted in the schematic or the linked @KurtE info
 
This T4.1 code reads 12 usable bits of input and synchronizes reads to an external clock signal:

Code:
    for (register unsigned i = 0; i < FFTSize; ++i) {
      register uint32_t data;

      do {    // wait for clock low (could also store this falling edge data)
      } while (clock_bit(GPIO6));

      do {    // wait for rising edge
        data = GPIO6;
      } while (!clock_bit(data));

      input[scan][i] = data >> 16;    // store the data

    } // for samples
 
Thank you both for the replies so far.
Here is the pseudocode I have thought up. Obviously, the code is not real, because I am still unsure how the majority of the built in functions/commands work such as the IOMUX ALT5 stuff to make the pins behave as GPIO.

Code:
MasterClock = PinX //External clock
SlaveClock = PinY //External clock
SerialClock = PinZ //Coded clock used to drive a peripheral

[Declare IOMUX somehow] //FIXME: Figure out how to use the IOMUX command to set port 'x' as GPIOx
GPIOx_GDIR = 0x0000; //Set port 'x' to be input (0 bits in GDIR are inputs)

while(MasterClock == high){
     while(SlaveClock == high){
          SerialClock = high; //FIXME: How to write a single pin high without using digitalWrite(pin)
          SomeBuffer = GPIOx_DR; //FIXME: What type of array/matrix to use to store 8-bits
          SerialClock = low; //FIXME: How to write a single pin low without using digitalWrite(pin)
          break; //break out of the SlaveClock loop and wait for the SlaveClock to go high again
     }
}

The problem I see that I might encounter is if I use GPIOx_DR to poll the whole port, it might disrupt the logic level of the pin being used to feed into a peripheral as its clock. Additionally, I don't know how I can set a single pin high at the beginning of the "SlaveClock" loop, then set it back to low at the end of the loop while leaving all other pins unaffected.
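
On both worries above: reading a port register only samples the pins, and the i.MX RT GPIO block has separate DR_SET and DR_CLEAR registers, where writing a mask changes only the bits that are 1 in the mask. The following is a plain C++ model of that behaviour that runs on a PC, not code for the hardware. `PortModel` and `pulse_leaves_others_alone` are hypothetical names made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ model of one GPIO port's output data register (not real hardware).
struct PortModel {
    uint32_t dr = 0;                              // output latch
    void set(uint32_t mask)   { dr |= mask; }     // models a write to DR_SET
    void clear(uint32_t mask) { dr &= ~mask; }    // models a write to DR_CLEAR
    uint32_t read() const     { return dr; }      // reading has no side effects
};

// Pulse one bit high then low and report whether all other bits kept
// their previous values throughout.
inline bool pulse_leaves_others_alone(PortModel& port, unsigned bit) {
    const uint32_t mask = 1u << bit;
    const uint32_t before = port.read();
    port.set(mask);
    const bool was_high = (port.read() & mask) != 0;
    port.clear(mask);
    return was_high && port.read() == (before & ~mask);
}
```

On the real chip each set or clear would be a single store to the corresponding register, with no read-modify-write in software.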
 
Paul has written digitalWriteFast(pin, HIGH/LOW) so that, with a constant pin number, it resolves to the fastest raw write as inline code on output pins.

Setting all desired input pins with pinMode() ensures they are configured correctly for reading.

The tight read loop should be reliable.
 
Has Paul calculated/measured the instruction cycles needed to execute the digitalWriteFast() command? I am very limited with how many CPU clock cycles can execute in the main loop.

Thank you
 
Easy to do using ARM_DWT_CYCCNT which is already running on T_4.x.

Code:
	uint32_t start;
	uint32_t end;

	start = ARM_DWT_CYCCNT;
	for (unsigned i = 0; i < 100; i++)
	{
		// dummy++;
	}
	end = ARM_DWT_CYCCNT;

	// end - start; // Cycles used plus ~3 for reading the CYCCNT
 
Only one variable is needed:
Code:
  uint32_t count = 0 ;
  count -= ARM_DWT_CYCCNT;
  ... code to time ...
  count += ARM_DWT_CYCCNT;
Using the -= allows discontinuous time periods to be summed together easily if you want to time only certain parts of the code.

Code:
  uint32_t count = 0 ;
  count -= ARM_DWT_CYCCNT;
  ... code to time ...
  count += ARM_DWT_CYCCNT;

  .. code not to be timed ...

  count -= ARM_DWT_CYCCNT;
  ... code to time ...
  count += ARM_DWT_CYCCNT;
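
Why the -= / += trick still works when the counter wraps can be tried on a PC. This sketch replaces ARM_DWT_CYCCNT with an ordinary variable, `fake_cyccnt`, that is advanced by hand; that variable and `sum_two_regions` are hypothetical stand-ins, and the point is only the unsigned 32-bit arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for ARM_DWT_CYCCNT so the arithmetic can be tried on a PC.
// The value is set by hand here; on hardware the counter free-runs.
static uint32_t fake_cyccnt = 0;

// Accumulate two separate timed regions into one total using -= / +=.
inline uint32_t sum_two_regions(uint32_t start1, uint32_t len1,
                                uint32_t gap, uint32_t len2) {
    uint32_t count = 0;
    fake_cyccnt = start1;
    count -= fake_cyccnt;      // region 1 begins
    fake_cyccnt += len1;
    count += fake_cyccnt;      // region 1 ends
    fake_cyccnt += gap;        // untimed code
    count -= fake_cyccnt;      // region 2 begins
    fake_cyccnt += len2;
    count += fake_cyccnt;      // region 2 ends
    return count;              // len1 + len2, modulo 2^32
}
```

Because every step is modulo 2^32, the total comes out right even if the counter overflows in the middle of a timed region, as long as the summed time itself stays below 2^32 cycles.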
 
If you really only have about 10-18 clock cycles, assembly language is probably the only way. I wrote a logic analyzer that reads and stores 32 bits in 8 clock cycles, but it is in assembly. On the Teensy 4, a GPIO read takes 8 clock cycles and a write takes 1 clock cycle. Luckily with the parallel processing, if done optimally, it can read GPIO while simultaneously writing the previous value to memory. So reading, writing, then toggling an output high and low might be possible in 10-12 clocks. This is without a loop - I have 1024 reads in a row for the logic analyzer. Adding a loop would add at least 3 clocks.

Writing it in C may be almost as fast if you are very, very careful.
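
A rough tally of the figures quoted in this thread, written as compile-time arithmetic. It treats the 18 "clock pulses" from the first post as CPU cycles and takes the 10-12 cycle body and 3 cycle loop overhead at face value. Those are assumptions, not measurements, and the constant names are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the thread (taken as given, not re-measured).
constexpr unsigned kBudgetCycles  = 18;  // cycles allotted per loop pass
constexpr unsigned kBodyCyclesMin = 10;  // read + store + pin high/low, low estimate
constexpr unsigned kBodyCyclesMax = 12;  // same, high estimate
constexpr unsigned kLoopOverhead  = 3;   // "at least 3 clocks" for a loop

constexpr unsigned total_min() { return kBodyCyclesMin + kLoopOverhead; }
constexpr unsigned total_max() { return kBodyCyclesMax + kLoopOverhead; }

// Nanoseconds for a cycle count at a given core clock in MHz.
constexpr double cycles_to_ns(unsigned cycles, unsigned mhz) {
    return cycles * 1000.0 / mhz;
}

// The estimate fits the budget only if the loop overhead really is 3 cycles.
static_assert(total_max() <= kBudgetCycles, "body plus loop overhead must fit");
```

With these numbers the loop would take 13 to 15 cycles out of 18, which leaves little margin if the loop overhead turns out to be more than the minimum.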
 
Thank you everyone for your help so far, and especially defragster and MarkT for your help on recording CPU cycles.

However, I am very lost right now trying to set up a port of GPIO pins and then read 8 of the pins, preferably simultaneously, so as to use the fewest CPU cycles possible. I also need to store those values in a buffer of some sort, perhaps an array.

If someone has some example code that does this type of port manipulation, I would be eternally grateful for your help.

Thank you,
~RRkt
 
If the pins to be used fit in a port then reading that returns a 32 bit value of ALL pins on that port.

The useful pins could be extracted and saved in a byte at read time, if there are cycles to spare (which you may not have), or the 32-bit word could be saved in an array to be parsed later, since each of the desired pins has a fixed location in that value. Or, if the pins are all in the high or low 16 bits, it might save space to store just those 16 bits.

Hopefully links above demonstrate port reading - if not there are other posts.

Spelling out what pins and port are of interest might allow better help.
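
The store-now, parse-later option can be sketched in plain C++ that runs on a PC. The bit positions in `kPinBit` are invented for illustration; the real ones depend on which pins are chosen and have to come from the schematic or pin tables. `gather_byte` and `parse_samples` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wiring: which port bit carries each of the 8 data lines,
// listed from data bit 0 to data bit 7. These positions are made up.
constexpr unsigned kPinBit[8] = {16, 17, 18, 19, 22, 23, 24, 25};

// Gather the 8 scattered port bits of one raw sample into a byte.
inline uint8_t gather_byte(uint32_t raw) {
    uint8_t out = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        out |= static_cast<uint8_t>(((raw >> kPinBit[i]) & 1u) << i);
    }
    return out;
}

// Post-process a buffer of raw 32-bit samples after capture has finished.
inline void parse_samples(const uint32_t* raw, uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = gather_byte(raw[i]);
}
```

The capture loop then only has to store the raw word, and the slower bit gathering happens once the time-critical part is over.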
 
I'm flexible on which pins/ports to use, since there is no conflict with anything. I am just not at all knowledgeable about the T4.x microcontroller, and I have no idea how to read or store a whole port at once. At this point I just need it spelled out how to read the port, like the DDRx, PORTx, and PINx registers on the AVR microcontrollers, but in a form that the ARM-based T4.0 understands.
 
Here's a short sketch that reads 1024 32-bit values from Port 6 in an efficient manner:

Code:
#define BUFFER_SIZE 1024

uint32_t buffer[BUFFER_SIZE];

void setup() {
  uint32_t *buffer_ptr = &buffer[0];
  uint32_t *end_of_buffer = &buffer[BUFFER_SIZE];

  while (buffer_ptr < end_of_buffer) {
    *(buffer_ptr) = GPIO6_PSR;
    ++buffer_ptr;
  }
}

void loop() {  
}
 
Here's a short sketch that reads 1024 32-bit values from Port 6 in an efficient manner:
Talking about efficiency
A for loop can be about 100 times faster than a while as shown next
Code:
#define BUFFER_SIZE 1024

uint32_t buffer[BUFFER_SIZE];

void setup() {
  uint32_t *buffer_ptr = &buffer[0];
  uint32_t *end_of_buffer = &buffer[BUFFER_SIZE];

while(!Serial);

uint32_t to=ARM_DWT_CYCCNT;
  while (buffer_ptr < end_of_buffer) {
    *(buffer_ptr) = GPIO6_PSR;
    ++buffer_ptr;
  }
Serial.print("while ");Serial.println(ARM_DWT_CYCCNT-to);

to=ARM_DWT_CYCCNT;
  for (; buffer_ptr < end_of_buffer; )  *buffer_ptr++ = GPIO6_PSR;
Serial.print("for "); Serial.println(ARM_DWT_CYCCNT-to);
}

void loop() {  
}

Serial output
Code:
while 9358
for 87
T4.1, 600MHz, faster
reason? I guess unwrapping in for loop by compiler
 
Time difference for serial printing 6 characters rather than 4? You might want to read CYCCNT immediately after the loops end...
 
Was wondering how 1024 port reads could complete in 87 cycles??? Didn't catch the dual print on one line. So was playing with the code.

Also, the OP's reference to 50 clocks per read does NOT apply at Teensy speed. digitalReadFast (with bit shifting and ||'ing) is under 10 cycles per read, including loop overhead. Even the looped read of pin numbers from an array with bit shifting and or'ing is under 28 cycles per read, with the overhead of 2 loops.

Time difference for serial printing 6 characters rather than 4? You might want to read CYCCNT immediately after the loops end...

Swapping the prints around and adding alternate cases - it looks like the Compiler is optimizing away the "for" case?
Code:
T:\tCode\FORUM\GPIOreadPort\GPIOreadPort.ino Jul 29 2021 00:43:32
9232 >> while 
1 >>for 
9255 >>for ii * 
9251 >>for [ii] 
222777 >>for pin Read 
76398 >>for pin ReadFast

Code:
#define BUFFER_SIZE 1024

// https://forum.pjrc.com/threads/67751-Simultaneously-reading-8-GPIO-pins?p=284737&viewfull=1#post284737

uint32_t buffer[BUFFER_SIZE];

void setup() {
  uint32_t *buffer_ptr = &buffer[0];
  uint32_t *end_of_buffer = &buffer[BUFFER_SIZE];

  while (!Serial);
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  int ii;

  if ( CrashReport ) Serial.print ( CrashReport );
  uint32_t to = ARM_DWT_CYCCNT;
  while (buffer_ptr < end_of_buffer) {
    *(buffer_ptr) = GPIO6_PSR;
    ++buffer_ptr;
  }
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >> while ");

  to = ARM_DWT_CYCCNT;
  for (; buffer_ptr < end_of_buffer; )  *buffer_ptr++ = GPIO6_PSR;
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for ");

  buffer_ptr = &buffer[0];
  to = ARM_DWT_CYCCNT;
  for (ii = 0; ii < BUFFER_SIZE; ii++ )  *buffer_ptr++ = GPIO6_PSR;
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for ii * ");

  to = ARM_DWT_CYCCNT;
  for (ii = 0; ii < BUFFER_SIZE; ii++ )  buffer[ii] = GPIO6_PSR;
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for [ii] ");

  int myPins[8] = {2, 4, 6, 8, 10, 12, 14, 16};
  for (ii = 0; ii < 8; ii++)  pinMode( myPins[ii], INPUT );
  byte myB = 0;
  to = ARM_DWT_CYCCNT;
  for (buffer_ptr = &buffer[0]; buffer_ptr < end_of_buffer; )  {
    for (ii = 0; ii < 8; ii++)  myB = (myB << 1) || digitalRead( myPins[ii] );
    *buffer_ptr++ = myB;
  }
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for pin Read ");

  to = ARM_DWT_CYCCNT;
  for (buffer_ptr = &buffer[0]; buffer_ptr < end_of_buffer; )  {
    myB = digitalReadFast( 2 ) << 7 || digitalReadFast( 4 ) << 6 || digitalReadFast( 6 ) << 5 || digitalReadFast( 8 ) << 4 || \
          digitalReadFast( 10 ) << 3 || digitalReadFast( 12 ) << 2 || digitalReadFast( 14 ) << 1 || digitalReadFast( 16 );
    *buffer_ptr++ = myB;
  }
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for pin ReadFast ");

}

void loop() {
}
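
One thing to watch in the two pin-read loops of the sketch above: they combine the bits with the logical OR operator `||`, which yields only 0 or 1 and short-circuits, so the stored value is not a packed byte and some reads may be skipped when an earlier operand is already nonzero. The posted timings for those two cases may therefore not reflect eight full reads. A PC-side comparison, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Pack 8 pin levels (pins[0] becomes the MSB) using bitwise OR.
inline uint8_t pack_bitwise(const bool pins[8]) {
    uint8_t b = 0;
    for (int i = 0; i < 8; ++i) {
        b = static_cast<uint8_t>((b << 1) | (pins[i] ? 1 : 0));
    }
    return b;
}

// Same loop with logical OR: every step collapses to 0 or 1.
inline uint8_t pack_logical(const bool pins[8]) {
    uint8_t b = 0;
    for (int i = 0; i < 8; ++i) {
        b = static_cast<uint8_t>((b << 1) || pins[i]);
    }
    return b;
}
```

For the pin pattern 1,0,1,0,0,1,0,1 the bitwise version gives 0xA5, while the logical version gives 1.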
 
Changing the first for loop in the post #17 code (the one that printed 1) as follows gives a clock count similar to while():

Code:
T:\tCode\FORUM\GPIOreadPort\GPIOreadPort.ino Jul 29 2021 00:55:33
9232 >> while 
9252 >>for 
9251 >>for ii * 
9253 >>for [ii] 
222778 >>for pin Read 
76394 >>for pin ReadFast

Add initialization :: buffer_ptr = &buffer[0]
Code:
  to = ARM_DWT_CYCCNT;
  for (buffer_ptr = &buffer[0]; buffer_ptr < end_of_buffer; )  *buffer_ptr++ = GPIO6_PSR;
  Serial.print(ARM_DWT_CYCCNT - to); Serial.println(" >>for ");
 
100 times faster

Careful ... your second loop does nothing, because buffer_ptr hasn't been reset.

Code:
#include <Arduino.h>

#define BUFFER_SIZE 1024

uint32_t buffer [BUFFER_SIZE];

void setup () {
    while(!Serial) {}

    auto buffer_ptr = buffer, end_of_buffer = buffer + BUFFER_SIZE;
    int t0, t1;

    t0 = ARM_DWT_CYCCNT;
    while (buffer_ptr < end_of_buffer)
        *buffer_ptr++ = GPIO6_PSR;
    t1 = ARM_DWT_CYCCNT;

    Serial.printf("while %d\n", t1-t0);
    //arm_dcache_flush_delete(buffer, sizeof buffer);
    buffer_ptr = buffer;

    t0 = ARM_DWT_CYCCNT;
    for (; buffer_ptr < end_of_buffer; )
        *buffer_ptr++ = GPIO6_PSR;
    t1 = ARM_DWT_CYCCNT;

    Serial.printf("for %d\n", t1-t0);
}

void loop () {}

while 9232
for 9234

PS. With C++11, you can also use this notation (same cycle count):

Code:
    t0 = ARM_DWT_CYCCNT;
    for (auto& e : buffer)
        e = GPIO6_PSR;
    t1 = ARM_DWT_CYCCNT;
 
Indeed @jcw that is the case.

Also note post #17 code adds CrashReport - that didn't catch anything.

But when one of those loops was first copied with changes [ and not resetting buffer_ptr :( ] it went storming off into memory causing an immediate RESET of Teensy :(

No fault, just: reset ... after reset ...
 
I was interested how efficient the compiler implements a range based for loop so I added one to the code above:
Code:
 for (auto& slot : buffer) slot = GPIO6_PSR;

Which also gives the same result:

Code:
while 9234
for 9235
range 9236

Looking at the generated assembly (only the loops shown)
Code:
While loop
aa:	6881      	ldr	r1, [r0, #8]
ac:	f843 1b04 	str.w	r1, [r3], #4
b0:	42a3      	cmp	r3, r4
b2:	d1fa      	bne.n	aa <setup+0x2e>

Range based
ea:	6881      	ldr	r1, [r0, #8]
ec:	f843 1b04 	str.w	r1, [r3], #4
f0:	42a3      	cmp	r3, r4
f2:	d1fa      	bne.n	ea <setup+0x6e>

The manually written while loop and the range-based loop generate exactly the same assembly. IMHO, as so often, there is no need to do the compiler's work; a simple

Code:
uint32_t buffer [1024];

void setup () {
      for (auto& slot : buffer) slot = GPIO6_PSR;
}

void loop () {}

seems to do the job quite nicely.

Edit: Sorry, cross post with jcw's answer
 
I agree that it was a bad (too quickly developed) test program (I should have known better)
BUT
I had an issue with a parallel ILI9341
that was too slow with a while loop and that I could speed up with a for loop.
Now, trying to reproduce it, I realize that with while() I may have introduced a side effect that slowed the display down by a factor of 100
(filling the display with a single colour needed 1.6 seconds!!)
 
Careful ... on first iteration the w-loop runs N times, but on the next iterations it won't, because w is now zero.

Yes, that was the problem. w is declared uint16_t so instead of 240 there were 32768 iterations (or a factor of 136)
 
At the risk of drawing out a long discussion further ...

Nope, that's not exactly what's going on. The first inner while runs w times, and then always 65535 (i.e. "(uint16_t) 0 - 1").
If w were declared as uint8_t, the nested while loops would still run 255 times, which is not the same as the for loops.
In short: you'll be better off by clearly stating intent. The compiler will optimise ... no point trying to outsmart it.

PS. My previous comment was wrong: w will be 65535 every time it comes out of its while loop, not zero.
PPS. Hrm, I was assuming unsigned. Probably incorrectly so. Oh well, enough yak-shaving, onwards :)
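
The display code under discussion is not shown in this thread, so the following is only a guess at its shape: a `while (w--)` countdown on a uint16_t that is not reloaded before it is used again. The helper `run_countdown` is hypothetical and runs on a PC.

```cpp
#include <cassert>
#include <cstdint>

// Run `while (w--)` once and return how many times the body executed.
// w is passed by reference so the caller can see what is left in it.
inline uint32_t run_countdown(uint16_t& w) {
    uint32_t iterations = 0;
    while (w--) ++iterations;
    return iterations;
}
```

Starting from w = 240, the first call runs 240 times and leaves w at 65535, because the post-decrement still happens on the failing test. A second call without reloading w then runs 65535 times, which matches the correction in the post above.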
 