
Thread: Teensy 4.0 PORT manipulation to read/write data to multiple pins

  1. #1
    Junior Member
    Join Date
    Jul 2020
    Posts
    7

    Teensy 4.0 PORT manipulation to read/write data to multiple pins

    I have noticed that a lot of people have been asking for port manipulation example code, and I was searching for the same. After reading some documentation, I came up with this program to read/write multiple pins simultaneously at higher speed.
    This is my first attempt at creating something like this. Let me know what you think.
    Here is the GitHub link: https://github.com/ManojBR105/Teensy...lel_read-write
    Please see the readme file for more information.
    I have used only pins 0-23 and, to the best of my knowledge, optimized the code to use as little time and as few operations as possible.

  2. #2
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    846
    I encourage you to review this approach and see how much faster it is.

    Code:
    #define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)
    
    // rearrange 16 GPIO6 pin inputs into 16 consecutive bits
    // move two two bit fields from the lower word into gaps in the upper word
    
    inline uint16_t test5()
    {
      register uint32_t data  = IMXRT_GPIO6_DIRECT;
      register uint32_t data2  = data >> 2;
      register uint32_t data3  = data >> 12;
      asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
      asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
      return (data >> 16);
    }
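
    For readers who do not know the Thumb-2 bfi (bit field insert) instruction, here is a rough portable model of what the snippet above computes. It is only a sketch: it takes the register value as an argument instead of reading hardware, so it can be compiled on a PC, and the names bfi and packGpio6 are made up for this illustration.

```cpp
#include <cstdint>

// Portable model of the ARM "bfi Rd, Rn, #lsb, #width" instruction:
// copy the low `width` bits of src into dst starting at bit `lsb`,
// leaving every other bit of dst unchanged.
// Hypothetical helper for illustration; valid for 0 < width < 32.
constexpr uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width) {
  const uint32_t field = ((1u << width) - 1u) << lsb;  // mask of the target field
  return (dst & ~field) | ((src << lsb) & field);
}

// Same steps as test5(), but on a plain value rather than the GPIO register.
constexpr uint16_t packGpio6(uint32_t data) {
  const uint32_t data2 = data >> 2;    // bits 3:2 moved down to bits 1:0
  const uint32_t data3 = data >> 12;   // bits 13:12 moved down to bits 1:0
  data = bfi(data, data2, 20, 2);      // fill the gap at bits 21:20
  data = bfi(data, data3, 28, 2);      // fill the gap at bits 29:28
  return static_cast<uint16_t>(data >> 16);  // upper half is now 16 packed bits
}
```

    For example, an input with only bits 3:2 set (0x0000000C) comes out as 0x0030, and an input with only bits 13:12 set (0x00003000) comes out as 0x3000.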

  3. #3
    Junior Member
    Join Date
    Jul 2020
    Posts
    7
    Quote Originally Posted by jonr View Post
    I encourage you to review this and how it is much faster.


    I noticed that you have used GPIO6 for all 16 bits, but some of those pins are on the back side of the board, which makes them inconvenient to use.
    I also mentioned in the readme file that the speed of the 16-bit and 24-bit operations can be improved by using the pins on the back side of the board.

    So I compared my approach to yours in the following way:
    Code:
    #define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)
    
    const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
    const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
    const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;
    
    //using backside pins for faster data read
    inline uint16_t read_16fast() {
      uint32_t data = GPIO6_DR;
      return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
    }
    
    
    // rearrange 16 GPIO6 pin inputs into 16 consecutive bits
    // move two two bit fields from the lower word into gaps in the upper word
    const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};
    
    inline uint16_t test5()
    {
      register uint32_t data  = IMXRT_GPIO6_DIRECT;
      register uint32_t data2  = data >> 2;
      register uint32_t data3  = data >> 12;
      asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
      asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
      return (data >> 16);
    }
    void setup() {
      // put your setup code here, to run once:
      Serial.begin(38400);
      for (int i = 0; i < 16; i++)
        pinMode(pins[i], INPUT_PULLDOWN);
    }
    
    void loop() {
      uint16_t data;
      uint32_t start, stop;  // micros() returns a 32-bit count; 16-bit variables would truncate it
      start = micros();
      for (int i = 0; i < 1000; i++)
        data = test5();
      stop = micros();
      Serial.print("time taken by test5() for 1000 reads: ");
      Serial.print(stop - start);
      Serial.print("  || data : ");
      Serial.println(data);
      start = micros();
      for (int i = 0; i < 1000; i++)
        data = read_16fast();
      stop = micros();
      Serial.print("time taken by read_16fast() for 1000 reads: ");
      Serial.print(stop - start);
      Serial.print("  || data : ");
      Serial.println(data);
      delay(5000);
    }
    This approach might not be accurate, but it turns out mine takes 14 ns per read and yours 21 ns.
    You can see the attachment for the logged data (or test it on your own).

    I noticed that you are using assembly code, which should be faster than mine.
    However, the time the microcontroller takes to get the data from GPIO6_DR is very high compared to those operations, so the operations themselves are insignificant with respect to the total time consumed.
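
    Since the two versions are being compared on speed, it is worth confirming that they return the same 16 bits. Below is a rough host-side check, assuming the register value is passed in as an argument; packMask, packBfi, and agreeOnWalkingBits are illustrative names, and the bfi instructions are modelled with plain C++ rather than inline assembly.

```cpp
#include <cstdint>

// Shift-and-mask packing, as in read_16fast(), on a plain value.
constexpr uint16_t packMask(uint32_t data) {
  return static_cast<uint16_t>(((data >> 16) & 0xCFCFu)   // 12 bits already in place
                             | ((data << 2) & 0x0030u)    // bits 3:2   -> result 5:4
                             | (data & 0x3000u));         // bits 13:12 -> result 13:12
}

// Model of the two bfi instructions in test5(), written without assembly.
constexpr uint16_t packBfi(uint32_t data) {
  const uint32_t lo = (data >> 2) & 0x3u;    // field moved by "bfi ..., 20, 2"
  const uint32_t hi = (data >> 12) & 0x3u;   // field moved by "bfi ..., 28, 2"
  data = (data & ~(0x3u << 20)) | (lo << 20);
  data = (data & ~(0x3u << 28)) | (hi << 28);
  return static_cast<uint16_t>(data >> 16);
}

// Compare both versions on every single-bit pattern and its complement.
constexpr bool agreeOnWalkingBits() {
  for (unsigned bit = 0; bit < 32; ++bit) {
    const uint32_t oneHot = 1u << bit;
    if (packMask(oneHot) != packBfi(oneHot)) return false;
    if (packMask(~oneHot) != packBfi(~oneHot)) return false;
  }
  return true;
}
```

    The walking-bit patterns are not an exhaustive proof, but each result bit in both versions is copied from a single input bit, so agreement on them is good evidence that the two functions are interchangeable.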

  4. #4
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    13,763
    That code is hitting some magical dual-execute sweet spot! In fact, it is faster than a T_4.1 simply reading the 16 pins it presents and right-shifting by 16?

    Timing converted to cycle counts. The two T_4.1 bits, as cycled, don't align with the placement, but the pins are cycled PULLUP/PULLDOWN in some fashion to show how the two T_4.0 example results compare:
    Code:
    T:\tCode\TLAjr\BitIO_T4\BitIO_T4.ino Jul 14 2020 23:23:10
    
          time taken by test5() for 1000 reads: 13034  || 	data : 110101010101010
    time taken by read_16fast() for 1000 reads: 8536  || 	data : 110101010101010
    time taken by read_16_T41() for 1000 reads: 9036  || 	data : 110101010011010
    
          time taken by test5() for 1000 reads: 13020  || 	data : 100100010001000
    time taken by read_16fast() for 1000 reads: 8536  || 	data : 100100010001000
    time taken by read_16_T41() for 1000 reads: 9038  || 	data : 100100010011000
          time taken by test5() for 1000 reads: 13020  || 	data : 1000100010001
    time taken by read_16fast() for 1000 reads: 8536  || 	data : 1000100010001
    time taken by read_16_T41() for 1000 reads: 9038  || 	data : 1000100000001
          time taken by test5() for 1000 reads: 13020  || 	data : 10001000100010
    time taken by read_16fast() for 1000 reads: 8534  || 	data : 10001000100010
    time taken by read_16_T41() for 1000 reads: 9036  || 	data : 10001000000010
          time taken by test5() for 1000 reads: 13020  || 	data : 1000010001000100
    time taken by read_16fast() for 1000 reads: 8533  || 	data : 1000010001000100
    time taken by read_16_T41() for 1000 reads: 9038  || 	data : 1000010001100100
    Code:
    #define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)
    
    const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
    const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
    const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;
    
    //using backside pins for faster data read
    inline uint16_t read_16fast() {
    	uint32_t data = GPIO6_DR;
    	return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
    }
    
    inline uint16_t read_16_T41() {
    	//uint32_t data = GPIO6_DR;
    	return (GPIO6_DR >> 16);
    }
    
    // rearrange 16 GPIO6 pin inputs into 16 consecutive bits
    // move two two bit fields from the lower word into gaps in the upper word
    const int pins[22] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19, 40, 41, 39, 38, 26, 27};
    
    inline uint16_t test5()
    {
    	register uint32_t data  = IMXRT_GPIO6_DIRECT;
    	register uint32_t data2  = data >> 2;
    	register uint32_t data3  = data >> 12;
    	asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
    	asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
    	return (data >> 16);
    }
    void setup() {
    	// put your setup code here, to run once:
    	while (!Serial && millis() < 4000 );
    	Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
    	for (int i = 0; i < 22; i++)
    		if ( (i) % 2 )
    			pinMode(pins[i], INPUT_PULLDOWN);
    		else
    			pinMode(pins[i], INPUT_PULLUP);
    	Serial.println();
    }
    uint32_t cnt;
    void loop() {
    	uint16_t data;
    	uint32_t start, stop;
    	start = ARM_DWT_CYCCNT;
    	for (int i = 0; i < 1000; i++)
    		data = test5();
    	stop = ARM_DWT_CYCCNT;
    	Serial.print("      time taken by test5() for 1000 reads: ");
    	Serial.print(stop - start);
    	Serial.print("  || \tdata : ");
    	Serial.println(data, BIN);
    	start = ARM_DWT_CYCCNT;
    	for (int i = 0; i < 1000; i++)
    		data = read_16fast();
    	stop = ARM_DWT_CYCCNT;
    	Serial.print("time taken by read_16fast() for 1000 reads: ");
    	Serial.print(stop - start);
    	Serial.print("  || \tdata : ");
    	Serial.println(data, BIN);
    	start = ARM_DWT_CYCCNT;
    	for (int i = 0; i < 1000; i++)
    		data = read_16_T41();
    	stop = ARM_DWT_CYCCNT;
    	Serial.print("time taken by read_16_T41() for 1000 reads: ");
    	Serial.print(stop - start);
    	Serial.print("  || \tdata : ");
    	Serial.println(data, BIN);
    
    	for (int i = 0; i < 22; i++)
    		if ( (cnt + i) % 4 )
    			pinMode(pins[i], INPUT_PULLDOWN);
    		else
    			pinMode(pins[i], INPUT_PULLUP);
    	delay(50);
    	if ( !(cnt % 4) ) {
    		delay(5000);
    		Serial.println();
    	}
    	cnt++;
    }
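
    To relate the cycle counts in the log above to nanosecond figures, divide by the number of reads and by the CPU clock. Here is a small sketch of that arithmetic. It assumes the Teensy 4.0 default clock of 600 MHz, which the log itself does not state, and nsPerRead is a hypothetical helper.

```cpp
#include <cstdint>

// Convert a DWT cycle count for `reads` iterations into nanoseconds per read.
// cpuHz is an assumption: 600 MHz is the Teensy 4.0 default, but the actual
// clock of the logged run is not given in the thread.
constexpr double nsPerRead(uint32_t cycles, uint32_t reads, double cpuHz = 600e6) {
  return (static_cast<double>(cycles) / reads) * (1e9 / cpuHz);
}
```

    Under that assumption, 8536 cycles per 1000 reads works out to roughly 14.2 ns per read and 13034 cycles to roughly 21.7 ns. Both figures include the loop overhead.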

  5. #5
    Junior Member
    Join Date
    Jul 2020
    Posts
    7
    Woo, that looks somewhat tricky.

    Here I ran another test; I didn't print the data. For some reason, returning GPIO6_DR directly is slower than copying it into a local variable and returning that.

    It may make sense, because local variables that are initialized dynamically are created in RAM, so that path might execute faster than returning directly. I am not sure what happens when GPIO6_DR is returned directly, whether it uses registers or something else. I may be totally wrong.

    Although the difference is very small, it is consistent.
    Code:
    time taken by read_16_T41_MOD() for 1000 reads: 8520
    ----time taken by read_16_T41() for 1000 reads: 8533
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8531
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8532
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8531
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8530
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8531
    time taken by read_16_T41_MOD() for 1000 reads: 8510
    ----time taken by read_16_T41() for 1000 reads: 8532
    Code:
    inline uint16_t read_16_T41() {
      //uint32_t data = GPIO6_DR;
      return (GPIO6_DR >> 16);
    }
    
    inline uint16_t read_16_T41_MOD() {
      uint32_t data = GPIO6_DR;
      return (data >> 16);
    }
    const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};
    
    
    void setup() {
      // put your setup code here, to run once:
      Serial.begin(38400);
      for (int i = 0; i < 16; i++)
        pinMode(pins[i], INPUT_PULLDOWN);
    }
    
    void loop() {
      uint16_t data;
      uint32_t  start, stop;
      start = ARM_DWT_CYCCNT;
      for (int i = 0; i < 1000; i++)
        data = read_16_T41_MOD();
      stop = ARM_DWT_CYCCNT;
      Serial.print("time taken by read_16_T41_MOD() for 1000 reads: ");
      Serial.println(stop - start);
      start = ARM_DWT_CYCCNT;
      for (int i = 0; i < 1000; i++)
        data = read_16_T41();
      stop = ARM_DWT_CYCCNT;
      Serial.print("----time taken by read_16_T41() for 1000 reads: ");
      Serial.println(stop - start);
      delay(100);
    }
    Maybe someone who knows more about the architecture of the IC can explain this.

  6. #6
    Junior Member
    Join Date
    Jul 2020
    Posts
    7
    Quote Originally Posted by defragster View Post
    That code is hitting some magical dual-execute sweet spot! In fact, it is faster than a T_4.1 simply reading the 16 pins it presents and right-shifting by 16?

    Timing converted to cycle counts. The two T_4.1 bits, as cycled, don't align with the placement, but the pins are cycled PULLUP/PULLDOWN in some fashion to show how the two T_4.0 example results compare:

    Can someone explain how these two take the same amount of time? I am confused.
    Code:
    const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
    const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
    const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;
    
    
    inline uint32_t read() {
      uint32_t data = GPIO6_DR;
      return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
    }
    
    const int pins[16] = {27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};
    
    void setup() {
      // put your setup code here, to run once:
      Serial.begin(38400);
      for (int i = 0; i < 16; i++)
        pinMode(pins[i], INPUT_PULLDOWN);
    }
    
    void loop() {
      uint32_t data, start, stop, time;
      start = ARM_DWT_CYCCNT;
      data = GPIO6_DR;
      stop = ARM_DWT_CYCCNT;
      time = stop - start;
      start = ARM_DWT_CYCCNT;
      data = read();
      stop = ARM_DWT_CYCCNT;
      Serial.print("Time taken with operations    : ");
      Serial.println(stop - start);  // read(): register load plus shift/mask
      Serial.print("Time taken without operations : ");
      Serial.println(time);          // bare GPIO6_DR load
      delay(100);
    }
    Code:
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9
    Time taken without operations : 9
    Time taken with operations    : 9

  7. #7
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    13,763
    Quote Originally Posted by MakerBR View Post
    Can someone explain how these two take the same amount of time? I am confused.
    ...
    As noted, it seemed as if 'magic' must be involved ... for a true understanding, one would perhaps have to look at the generated assembly to see whether the 1062's dual execute has a chance to do all the manipulations of one version in the same time it does the simple transfer of the other.

    That's why the binary pin read 'data' was changed { Up & Down } and dumped: it did not seem possible that the code was actually reading the real pin data and doing those manipulations in that time.

  8. #8
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    846
    You need to make some use of returned data so the compiler doesn't optimize it away. Then you will see that the use of assembly/bfi is faster.
    Last edited by jonr; 07-15-2020 at 08:24 PM.
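
    The advice above can be sketched as follows. In the earlier timing loops only the last value of data is printed, so an optimizing compiler may skip the shift-and-mask work for the other iterations, while asm volatile statements have to stay. That is one possible reason the plain C version looked free. A common remedy is to fold every result into a volatile variable. This is a host-side sketch only: fakeGpio, sink, readPacked, and runReads are names invented for the example, and on a Teensy the fake register would be replaced by the real GPIO6 read.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a hardware register: volatile forces a real load on each access.
volatile uint32_t fakeGpio = 0xA5A50000u;

// Writes to a volatile object count as observable behaviour, so the compiler
// cannot discard the computation that feeds this variable.
volatile uint32_t sink = 0;

inline uint16_t readPacked() {
  const uint32_t data = fakeGpio;             // one load per call
  return static_cast<uint16_t>(data >> 16);   // keep the upper half
}

// Run `reads` reads and add every result into an accumulator so that no
// iteration's result is dead code.
inline uint32_t runReads(uint32_t reads) {
  assert(reads > 0);
  uint32_t acc = 0;
  for (uint32_t i = 0; i < reads; ++i) {
    acc += readPacked();
  }
  sink = acc;   // the final volatile store keeps the whole loop alive
  return acc;
}
```

    On the board, the cycle counter would be read before and after the call to runReads, and the same accumulate-and-store pattern would be applied to each function being compared so that the overhead is equal.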

  9. #9
    Junior Member
    Join Date
    Jul 2020
    Posts
    7
    Quote Originally Posted by defragster View Post
    As noted, it seemed as if 'magic' must be involved ... for a true understanding, one would perhaps have to look at the generated assembly to see whether the 1062's dual execute has a chance to do all the manipulations of one version in the same time it does the simple transfer of the other.

    That's why the binary pin read 'data' was changed { Up & Down } and dumped: it did not seem possible that the code was actually reading the real pin data and doing those manipulations in that time.
    Quote Originally Posted by jonr View Post
    You need to make some use of returned data so the compiler doesn't optimize it away. Then you will see that the use of assembly/bfi is faster.
    Based on those suggestions I tried this code. I changed the data before each cycle and used the return value.
    But it is still executing at the same speed.
    I am not familiar with the ARM architecture. Is it capable of doing that operation so fast?
    Code:
    #define IMXRT_GPIO6_DIRECT  (*(volatile uint32_t *)0x42000000)
    
    const uint16_t mask1 = 0xcfcf;// 1100 1111 1100 1111;
    const uint16_t mask2 = 0x0030;// 0000 0000 0011 0000;
    const uint16_t mask3 = 0x3000;// 0011 0000 0000 0000;
    const int pins[16] = {19, 18, 14, 15, 1, 0, 17, 16, 22, 23, 20, 21, 24, 25, 26, 27}; //{27, 26, 25, 24, 21, 20, 23, 22, 16, 17, 0, 01, 15, 14, 18, 19};
    
    
    int count = 0;
    bool state = 0;
    
    inline uint32_t read() {
      uint32_t data = GPIO6_DR;
      return ((data >> 16) & mask1) | ((data << 2) & mask2) | (data & mask3);
    }
    
    inline uint32_t test5()
    {
      register uint32_t data  = IMXRT_GPIO6_DIRECT;
      register uint32_t data2  = data >> 2;
      register uint32_t data3  = data >> 12;
      asm volatile("bfi %0, %1, 20, 2" : "+r"(data) : "r"(data2));
      asm volatile("bfi %0, %1, 28, 2" : "+r"(data) : "r"(data3));
      return (data >> 16);
    }
    
    void setup() {
      // put your setup code here, to run once:
      Serial.begin(115200);
      for (int i = 0; i < 16; i++)
        pinMode(pins[i], INPUT_PULLDOWN);
      pinMode(pins[count], INPUT_PULLUP);
      delay(100);
    }
    
    void loop() {
      uint32_t data, start, stop;
      start = ARM_DWT_CYCCNT;
      data = read();  //9 clocks
      //data = test5();   //12 clocks
      stop = ARM_DWT_CYCCNT;
      Serial.print("Time : ");
      Serial.print(stop - start);
      Serial.print("  || data : ");
      data |= 0x10000;
      Serial.print(data , BIN);
      Serial.print("  || Count :");
      Serial.println(count++);
      if (count > 15) {
        count = 0;
        state = !state;
      }
      if (!state)
        pinMode(pins[count], INPUT_PULLUP);
      else
        pinMode(pins[count], INPUT_PULLDOWN);
      delay(100);
    }
    Code:
    Time : 9  || data : 11111111111111110  || Count :0
    Time : 9  || data : 11111111111111100  || Count :1
    Time : 9  || data : 11111111111111000  || Count :2
    Time : 9  || data : 11111111111110000  || Count :3
    Time : 9  || data : 11111111111100000  || Count :4
    Time : 9  || data : 11111111111000000  || Count :5
    Time : 9  || data : 11111111110000000  || Count :6
    Time : 9  || data : 11111111100000000  || Count :7
    Time : 9  || data : 11111111000000000  || Count :8
    Time : 9  || data : 11111110000000000  || Count :9
    Time : 9  || data : 11111100000000000  || Count :10
    Time : 9  || data : 11111000000000000  || Count :11
    Time : 9  || data : 11110000000000000  || Count :12
    Time : 9  || data : 11100000000000000  || Count :13
    Time : 9  || data : 11000000000000000  || Count :14
    Time : 9  || data : 10000000000000000  || Count :15
    Time : 9  || data : 10000000000000001  || Count :0
    Time : 9  || data : 10000000000000011  || Count :1
    Time : 9  || data : 10000000000000111  || Count :2
    Time : 9  || data : 10000000000001111  || Count :3
    Time : 9  || data : 10000000000011111  || Count :4
    Time : 9  || data : 10000000000111111  || Count :5
    Time : 9  || data : 10000000001111111  || Count :6
    Time : 9  || data : 10000000011111111  || Count :7
    Time : 9  || data : 10000000111111111  || Count :8
    Time : 9  || data : 10000001111111111  || Count :9
    Time : 9  || data : 10000011111111111  || Count :10
    Time : 9  || data : 10000111111111111  || Count :11
    Time : 9  || data : 10001111111111111  || Count :12
    Time : 9  || data : 10011111111111111  || Count :13
    Time : 9  || data : 10111111111111111  || Count :14
    Time : 9  || data : 11111111111111111  || Count :15
    Last edited by MakerBR; 07-15-2020 at 10:19 PM. Reason: few changes in code and aligned logged data

  10. #10
    @defragster I am having a hard time visualizing what is going on in the simpler code by @jonr. On the pin diagram card that comes with the Teensy 4.0 there is no pin listed as GPIO6. What are these pins called on the card?
    Also, what is meant by "// move two two bit fields from the lower word into gaps in the upper word"? Can you draw a picture showing what is happening here? Why are there GAPS in the upper word?
    Why 2 bits at a time?
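
    As a partial picture of the "gaps", the sketch below feeds single-bit values through the shift-and-mask packing posted earlier in the thread and reports which GPIO6 bit ends up in each of the 16 result bits. It runs on a PC, not on the board, and pack16 and sourceBitFor are hypothetical names. The interpretation is an assumption based on the code in this thread: result bits 4, 5, 12, and 13 are fed from the lower word, which suggests that GPIO6 bits 20, 21, 28, and 29 are not available as pins on the Teensy 4.0 and are therefore filled in, two bits at a time, from GPIO6 bits 2-3 and 12-13.

```cpp
#include <cstdint>

// Shift-and-mask packing from the thread, applied to a plain value.
constexpr uint16_t pack16(uint32_t data) {
  return static_cast<uint16_t>(((data >> 16) & 0xCFCFu)
                             | ((data << 2) & 0x0030u)
                             | (data & 0x3000u));
}

// Return the GPIO6 bit (0..31) that feeds result bit `outBit`, or -1 if none.
constexpr int sourceBitFor(unsigned outBit) {
  for (unsigned in = 0; in < 32; ++in) {
    if (pack16(1u << in) & (1u << outBit)) return static_cast<int>(in);
  }
  return -1;
}
```

    Calling sourceBitFor for result bits 0 to 15 gives GPIO6 bits 16, 17, 18, 19, 2, 3, 22, 23, 24, 25, 26, 27, 12, 13, 30, 31. If the pins array in post #9 is listed in result-bit order, as its log suggests, then result bit 4 corresponds to board pin 1 and result bit 5 to board pin 0.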

  11. #11
    @MakerBR Could you please help out a newbie and sprinkle some comments in your code above (from July 15th) to explain what is going on?
    Thanks. There are a lot of unfamiliar terms in there.

  12. #12
    Senior Member
    Join Date
    Dec 2013
    Location
    East Stroudsburg PA.
    Posts
    302
    Quote Originally Posted by Deane View Post
    On the pin diagram card that comes with the Teensy 4.0 there is no pin listed as GPIO6. What are these pins called on the card?
    You really need to look into the Teensy 4.x MCU datasheets. Check Post #4 by PaulStoffregen in this thread, which has a lot of information: Port and toggle question T4
