Teensy 4.0 PORT manipulation to read/write data to multiple pins

Status
Not open for further replies.

MakerBR

Member
I have noticed that a lot of people have been asking for port manipulation example code, and I was searching for the same. After reading some documentation I came up with this program to read/write multiple pins simultaneously at higher speed.
This is my first attempt at creating something like this. Let me know what you think. :)
Here is the GitHub link: https://github.com/ManojBR105/Teensy-4.0_digital_parallel_read-write
Please see the readme file for more information.
I have used only pins 0-23 and, to the best of my knowledge, optimized the code to take as little time and as few operations as possible.
 

Attachments

  • Teensy_4.0_digital_parallel_read_write.ino
    5.9 KB · Views: 222
I encourage you to review this approach and see how much faster it is.

Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)

// rearrange 16 GPIO6 pin inputs into 16 consecutive bits
// move two two bit fields from the lower word into gaps in the upper word

inline uint16_t test5()
{
  register uint32_t data  = IMXRT_GPIO6_DIRECT;
  register uint32_t data2  = data >> 2;
  register uint32_t data3  = data >> 12;
  asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
  asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
  return (data >> 16);
}
 

I noticed that you have used GPIO6 for all 16 bits, but some of those pins are on the back side of the board, which makes them inconvenient to use.
I also mentioned in the readme file that the speed of the 16-bit and 24-bit operations can be improved by using the pins on the back side of the board.

So I compared my approach to yours in the following way:
Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)

const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;

// using backside pins for faster data read
inline uint16_t read_16fast() {
  uint32_t data = GPIO6_DR;
  return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
}


// rearrange 16 GPIO6 pin inputs into 16 consecutive bits
// move two two bit fields from the lower word into gaps in the upper word
const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};

inline uint16_t test5()
{
  register uint32_t data  = IMXRT_GPIO6_DIRECT;
  register uint32_t data2  = data >> 2;
  register uint32_t data3  = data >> 12;
  asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
  asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
  return (data >> 16);
}
void setup() {
  // put your setup code here, to run once:
  Serial.begin(38400);
  for (int i = 0; i < 16; i++)
    pinMode(pins[i], INPUT_PULLDOWN);
}

void loop() {
  uint16_t data;
  uint32_t start, stop;  // micros() returns a 32-bit value; uint16_t would truncate it
  start = micros();
  for (int i = 0; i < 1000; i++)
    data = test5();
  stop = micros();
  Serial.print("time taken by test5() for 1000 reads: ");
  Serial.print(stop - start);
  Serial.print("  || data : ");
  Serial.println(data);
  start = micros();
  for (int i = 0; i < 1000; i++)
    data = read_16fast();
  stop = micros();
  Serial.print("time taken by read_16fast() for 1000 reads: ");
  Serial.print(stop - start);
  Serial.print("  || data : ");
  Serial.println(data);
  delay(5000);
}

This measurement approach might not be very accurate ;), but it turns out mine takes about 14 ns per read and yours about 21 ns.
You can see the attachment for the logged data (or test it on your own).

But I noticed that you are using assembly code, which should be faster than mine.
However, the time the microcontroller takes to fetch the data from GPIO6_DR is very high compared to those bit operations, so the operations themselves are insignificant in the total time consumed.
 

Attachments

  • log_speed_data.txt
    3 KB · Views: 108
That code is hitting some magical dual-execute sweet spot! In fact it is faster than the T_4.1 version, which has 16 pins presented contiguously and can just read and right-shift by 16?

Timing converted to cycle counts. Two T_4.1 bits, as cycled, don't align with their placement, but the pins are cycled PULLUP/PULLDOWN in some fashion to show how the two T_4.0 example results compare:
Code:
T:\tCode\TLAjr\BitIO_T4\BitIO_T4.ino Jul 14 2020 23:23:10

      time taken by test5() for 1000 reads: 13034  || 	data : 110101010101010
time taken by read_16fast() for 1000 reads: 8536  || 	data : 110101010101010
time taken by read_16_T41() for 1000 reads: 9036  || 	data : 110101010011010

      time taken by test5() for 1000 reads: 13020  || 	data : 100100010001000
time taken by read_16fast() for 1000 reads: 8536  || 	data : 100100010001000
time taken by read_16_T41() for 1000 reads: 9038  || 	data : 100100010011000
      time taken by test5() for 1000 reads: 13020  || 	data : 1000100010001
time taken by read_16fast() for 1000 reads: 8536  || 	data : 1000100010001
time taken by read_16_T41() for 1000 reads: 9038  || 	data : 1000100000001
      time taken by test5() for 1000 reads: 13020  || 	data : 10001000100010
time taken by read_16fast() for 1000 reads: 8534  || 	data : 10001000100010
time taken by read_16_T41() for 1000 reads: 9036  || 	data : 10001000000010
      time taken by test5() for 1000 reads: 13020  || 	data : 1000010001000100
time taken by read_16fast() for 1000 reads: 8533  || 	data : 1000010001000100
time taken by read_16_T41() for 1000 reads: 9038  || 	data : 1000010001100100
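
As a rough cross-check, the cycle counts in this log can be converted back to time per read. The sketch below is plain C++ that also runs on a PC. It assumes the Teensy is running at its default 600 MHz core clock, which the thread does not state, and `cycles_to_ns_per_read` is a hypothetical helper name, not something from the posted code.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: Teensy 4.x running at its default 600 MHz core clock.
const double kCpuHz = 600e6;

// Hypothetical helper: convert a cycle-counter delta covering `reads`
// calls into nanoseconds per call (loop overhead is included).
inline double cycles_to_ns_per_read(uint32_t cycles, uint32_t reads) {
  assert(reads > 0);
  return (static_cast<double>(cycles) / reads) * (1e9 / kCpuHz);
}
```

Under that assumption, 8536 cycles per 1000 reads works out to roughly 14.2 ns per read and 13020 cycles to roughly 21.7 ns, which is in line with the 14 ns and 21 ns figures obtained with micros().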

Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)

const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;

// using backside pins for faster data read
inline uint16_t read_16fast() {
	uint32_t data = GPIO6_DR;
	return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
}

inline uint16_t read_16_T41() {
	//uint32_t data = GPIO6_DR;
	return (GPIO6_DR >> 16);
}

// rearrange 16 GPIO6 pin inputs into 16 consecutive bits
// move two two bit fields from the lower word into gaps in the upper word
const int pins[22] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19, 40, 41, 39, 38, 26, 27};

inline uint16_t test5()
{
	register uint32_t data  = IMXRT_GPIO6_DIRECT;
	register uint32_t data2  = data >> 2;
	register uint32_t data3  = data >> 12;
	asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
	asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
	return (data >> 16);
}
void setup() {
	// put your setup code here, to run once:
	while (!Serial && millis() < 4000 );
	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
	for (int i = 0; i < 22; i++)
		if ( (i) % 2 )
			pinMode(pins[i], INPUT_PULLDOWN);
		else
			pinMode(pins[i], INPUT_PULLUP);
	Serial.println();
}
uint32_t cnt;
void loop() {
	uint16_t data;
	uint32_t start, stop;
	start = ARM_DWT_CYCCNT;
	for (int i = 0; i < 1000; i++)
		data = test5();
	stop = ARM_DWT_CYCCNT;
	Serial.print("      time taken by test5() for 1000 reads: ");
	Serial.print(stop - start);
	Serial.print("  || \tdata : ");
	Serial.println(data, BIN);
	start = ARM_DWT_CYCCNT;
	for (int i = 0; i < 1000; i++)
		data = read_16fast();
	stop = ARM_DWT_CYCCNT;
	Serial.print("time taken by read_16fast() for 1000 reads: ");
	Serial.print(stop - start);
	Serial.print("  || \tdata : ");
	Serial.println(data, BIN);
	start = ARM_DWT_CYCCNT;
	for (int i = 0; i < 1000; i++)
		data = read_16_T41();
	stop = ARM_DWT_CYCCNT;
	Serial.print("time taken by read_16_T41() for 1000 reads: ");
	Serial.print(stop - start);
	Serial.print("  || \tdata : ");
	Serial.println(data, BIN);

	for (int i = 0; i < 22; i++)
		if ( (cnt + i) % 4 )
			pinMode(pins[i], INPUT_PULLDOWN);
		else
			pinMode(pins[i], INPUT_PULLUP);
	delay(50);
	if ( !(cnt % 4) ) {
		delay(5000);
		Serial.println();
	}
	cnt++;
}
 
Woo, that looks somewhat tricky.

Here I ran another test, without printing the data. For some reason, returning GPIO6_DR directly is slower than copying it into a local variable and returning that.

It may make sense if the local variable is created in RAM and that path somehow executes faster than returning directly. I am not sure what happens when GPIO6_DR is returned directly, whether it uses registers or something else. I may be totally wrong.

Although the difference is very small, it is consistent.
Code:
time taken by read_16_T41_MOD() for 1000 reads: 8520
----time taken by read_16_T41() for 1000 reads: 8533
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8531
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8532
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8531
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8530
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8531
time taken by read_16_T41_MOD() for 1000 reads: 8510
----time taken by read_16_T41() for 1000 reads: 8532

Code:
inline uint16_t read_16_T41() {
  //uint32_t data = GPIO6_DR;
  return (GPIO6_DR >> 16);
}

inline uint16_t read_16_T41_MOD() {
  uint32_t data = GPIO6_DR;
  return (data >> 16);
}
const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};


void setup() {
  // put your setup code here, to run once:
  Serial.begin(38400);
  for (int i = 0; i < 16; i++)
    pinMode(pins[i], INPUT_PULLDOWN);
}

void loop() {
  uint16_t data;
  uint32_t  start, stop;
  start = ARM_DWT_CYCCNT;
  for (int i = 0; i < 1000; i++)
    data = read_16_T41_MOD();
  stop = ARM_DWT_CYCCNT;
  Serial.print("time taken by read_16_T41_MOD() for 1000 reads: ");
  Serial.println(stop - start);
  start = ARM_DWT_CYCCNT;
  for (int i = 0; i < 1000; i++)
    data = read_16_T41();
  stop = ARM_DWT_CYCCNT;
  Serial.print("----time taken by read_16_T41() for 1000 reads: ");
  Serial.println(stop - start);
  delay(100);
}

Maybe someone who knows more about the architecture of the IC can explain this.
 

Can someone explain how these two take the same amount of time? I am confused.
Code:
const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;


inline uint32_t read() {
  uint32_t data = GPIO6_DR;
  return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
}

const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};

void setup() {
  // put your setup code here, to run once:
  Serial.begin(38400);
  for (int i = 0; i < 16; i++)
    pinMode(pins[i], INPUT_PULLDOWN);
}

void loop() {
  uint32_t data, start, stop, time;
  start = ARM_DWT_CYCCNT;
  data = GPIO6_DR;
  stop = ARM_DWT_CYCCNT;
  time = stop - start;
  start = ARM_DWT_CYCCNT;
  data = read();
  stop = ARM_DWT_CYCCNT;
  Serial.print("Time taken with operations    : ");
  Serial.println(stop - start);  // second measurement: read() with masking
  Serial.print("Time taken without operations : ");
  Serial.println(time);          // first measurement: bare GPIO6_DR read
  delay(100);
}

Code:
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
Time taken without operations : 9
Time taken with operations    : 9
 
As noted, it seemed as if 'magic' must be involved. For a true understanding you would probably have to look at the generated assembly to see whether the 1062's dual execute has a chance to do all the manipulations of one in the same time it does the simple transfer of the other.

That's why the binary pin read 'data' was changed { up & down } and dumped: it seemed as if it could not actually be referencing the real pin data and doing those manipulations in that time.
 
You need to make some use of returned data so the compiler doesn't optimize it away. Then you will see that the use of assembly/bfi is faster.
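
One common way to do that, sketched below in plain C++ that also runs on a PC, is to fold every result into a running total and store it in a volatile variable, so each call has an observable effect. `fake_port`, `read_16fast_sim`, `sink` and `timed_loop` are hypothetical names used only for illustration; on the Teensy the read would come from GPIO6_DR instead of `fake_port`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for GPIO6_DR so the sketch runs off-target (assumption).
volatile uint32_t fake_port = 0xC0DE300C;

// Same mask-and-shift rearrangement as read_16fast() in the thread.
inline uint16_t read_16fast_sim() {
  uint32_t data = fake_port;
  return static_cast<uint16_t>(((data >> 16) & 0xcfcf) |
                               ((data << 2) & 0x0030) |
                               (data & 0x3000));
}

// A write to a volatile object is a side effect the compiler must keep.
volatile uint32_t sink = 0;

// Run `n` reads and fold every result into the total, so none of the
// calls or their bit manipulation can be discarded as dead code.
inline uint32_t timed_loop(uint32_t n) {
  assert(n > 0);
  uint32_t acc = 0;
  for (uint32_t i = 0; i < n; i++)
    acc += read_16fast_sim();
  sink = acc;
  return acc;
}
```

In the earlier benchmark loops only the last value of `data` is ever used, so the compiler is free to skip the masking for the other 999 iterations. Accumulating the results is one way to close that loophole before comparing cycle counts.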
 
Based on those suggestions I tried this code. I changed the pin data before each cycle and used the return value.
But it still executes at the same speed.
I am not familiar with the ARM architecture. Is it capable of doing that operation so fast?
Code:
#define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)

const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;
const int pins[16] = {19, 18, 14, 15, 1, 0, 17, 16, 22, 23, 20, 21, 24, 25, 26, 27}; //{27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};


int count = 0;
bool state = 0;

inline uint32_t read() {
  uint32_t data = GPIO6_DR;
  return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
}

inline uint32_t test5()
{
  register uint32_t data  = IMXRT_GPIO6_DIRECT;
  register uint32_t data2  = data >> 2;
  register uint32_t data3  = data >> 12;
  asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
  asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
  return (data >> 16);
}

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  for (int i = 0; i < 16; i++)
    pinMode(pins[i], INPUT_PULLDOWN);
  pinMode(pins[count], INPUT_PULLUP);
  delay(100);
}

void loop() {
  uint32_t data, start, stop;
  start = ARM_DWT_CYCCNT;
  data = read();  //9 clocks
  //data = test5();   //12 clocks
  stop = ARM_DWT_CYCCNT;
  Serial.print("Time : ");
  Serial.print(stop - start);
  Serial.print("  || data : ");
  data |= 0x10000;
  Serial.print(data , BIN);
  Serial.print("  || Count :");
  Serial.println(count++);
  if (count > 15) {
    count = 0;
    state = !state;
  }
  if (!state)
    pinMode(pins[count], INPUT_PULLUP);
  else
    pinMode(pins[count], INPUT_PULLDOWN);
  delay(100);
}

Code:
Time : 9  || data : 11111111111111110  || Count :0
Time : 9  || data : 11111111111111100  || Count :1
Time : 9  || data : 11111111111111000  || Count :2
Time : 9  || data : 11111111111110000  || Count :3
Time : 9  || data : 11111111111100000  || Count :4
Time : 9  || data : 11111111111000000  || Count :5
Time : 9  || data : 11111111110000000  || Count :6
Time : 9  || data : 11111111100000000  || Count :7
Time : 9  || data : 11111111000000000  || Count :8
Time : 9  || data : 11111110000000000  || Count :9
Time : 9  || data : 11111100000000000  || Count :10
Time : 9  || data : 11111000000000000  || Count :11
Time : 9  || data : 11110000000000000  || Count :12
Time : 9  || data : 11100000000000000  || Count :13
Time : 9  || data : 11000000000000000  || Count :14
Time : 9  || data : 10000000000000000  || Count :15
Time : 9  || data : 10000000000000001  || Count :0
Time : 9  || data : 10000000000000011  || Count :1
Time : 9  || data : 10000000000000111  || Count :2
Time : 9  || data : 10000000000001111  || Count :3
Time : 9  || data : 10000000000011111  || Count :4
Time : 9  || data : 10000000000111111  || Count :5
Time : 9  || data : 10000000001111111  || Count :6
Time : 9  || data : 10000000011111111  || Count :7
Time : 9  || data : 10000000111111111  || Count :8
Time : 9  || data : 10000001111111111  || Count :9
Time : 9  || data : 10000011111111111  || Count :10
Time : 9  || data : 10000111111111111  || Count :11
Time : 9  || data : 10001111111111111  || Count :12
Time : 9  || data : 10011111111111111  || Count :13
Time : 9  || data : 10111111111111111  || Count :14
Time : 9  || data : 11111111111111111  || Count :15
 
@fragster I am having a hard time visualizing what is going on in the simpler code by @jonr. On the pin diagram card that comes with the Teensy 4.0 there is no pin listed as GPIO6. What are these pins called on the card?
Also, what is meant by "// move two two bit fields from the lower word into gaps in the upper word"? Can you draw a picture showing what is happening here? Why are there gaps in the upper word?
Why 2 bits at a time?
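
A minimal sketch of the rearrangement, in plain C++ that also runs on a PC, may help here. Going by the masks in the posted code, the 16 pins used sit on GPIO6 bits 16-19, 22-27 and 30-31 of the upper word plus bits 2-3 and 12-13 of the lower word. Bits 20-21 and 28-29 are not among the pins used, which leaves two 2-bit gaps in the upper word. The `bfi` function below is a hypothetical software emulation of the ARM instruction of the same name, and `rearrange_bfi` / `rearrange_mask` are illustrative names for the two approaches in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ emulation of ARM `bfi Rd, Rn, #lsb, #width`:
// copy the low `width` bits of `src` into `dst` starting at bit `lsb`.
inline uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width) {
  assert(width >= 1 && lsb + width <= 32);
  uint32_t mask = (width == 32 ? 0xFFFFFFFFu : ((1u << width) - 1u)) << lsb;
  return (dst & ~mask) | ((src << lsb) & mask);
}

// test5() from the thread, step by step.
inline uint16_t rearrange_bfi(uint32_t port) {
  uint32_t data = port;
  data = bfi(data, port >> 2, 20, 2);   // port bits 2-3   -> gap at bits 20-21
  data = bfi(data, port >> 12, 28, 2);  // port bits 12-13 -> gap at bits 28-29
  return static_cast<uint16_t>(data >> 16);  // upper word is now 16 contiguous bits
}

// read_16fast() from the thread: the same result using masks and shifts.
inline uint16_t rearrange_mask(uint32_t port) {
  return static_cast<uint16_t>(((port >> 16) & 0xcfcf) |  // bits already in place
                               ((port << 2) & 0x0030) |   // port bits 2-3 -> result bits 4-5
                               (port & 0x3000));          // port bits 12-13 stay at 12-13
}
```

For example, a port value of 0x00000004 (only bit 2 high) comes out as 0x0010 from both functions: the bit has been moved into the first gap. Judging by the logged output, result bit i corresponds to pins[i] in the last test program.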
 
@MakerBR Could you please help out a newbie and sprinkle some comments in your code above (from July 15th) to explain what is going on?
Thanks. There are a lot of unfamiliar terms in there.
 