Can I accelerate GPIO updates in a simple loop?

Status
Not open for further replies.

was-ja

Hello,

I am pushing several (14) integer words into GPIO6 consecutively in a loop. Right now one GPIO6 update takes two clock cycles, and I would like to do it in one. The data in these words is organized so that each port bit changes only twice during the 14 stores, so each GPIO pin runs at 30-70 MHz, but I would like to control the phase as accurately as possible. AFAIK, loading the data into a register takes one clock tick, and the GPIO6 update takes the next one. Please advise whether there is any way to do this twice as fast, preferably with 28 integer words.

The complete code is below.

Of course I have already played with the optimization settings, but that does not help much.

Thank you

Code:
// #pragma GCC push_options
// #pragma GCC optimize ("Ofast")


#include <Arduino.h>


void RunRealExcitationOne()
{ unsigned int  A0=0x22a20002;
  unsigned int  A1=0x22860002;
  unsigned int  A2=0x2a840002;
  unsigned int  A3=0x0a940002;
  unsigned int  A4=0x0a150002;
  unsigned int  A5=0x18150002;
  unsigned int  A6=0x18510002;
  unsigned int  A7=0x11510002;
  unsigned int  A8=0x11490002;
  unsigned int  A9=0x15480002;
  unsigned int A10=0x05680002;
  unsigned int A11=0x052a0002;
  unsigned int A12=0x242a0002;
  unsigned int A13=0x24a20002;
  unsigned char i=255;
  do
  { GPIO6_DR=A0;
    GPIO6_DR=A1;
    GPIO6_DR=A2;
    GPIO6_DR=A3;
    GPIO6_DR=A4;
    GPIO6_DR=A5;
    GPIO6_DR=A6;
    GPIO6_DR=A7;
    GPIO6_DR=A8;
    GPIO6_DR=A9;
    GPIO6_DR=A10;
    GPIO6_DR=A11;
    GPIO6_DR=A12;
    GPIO6_DR=A13;
    i--;
  } while(i);
}


void setup()
{
  while(!Serial); // wait for serial port to connect
  if (CrashReport) {
    Serial.print(CrashReport);
    delay(5000);
  }
  Serial.println("We are starting...");

  pinMode(23, OUTPUT);
  pinMode(22, OUTPUT);
  pinMode(21, OUTPUT);
  pinMode(20, OUTPUT);
  pinMode(19, OUTPUT);
  pinMode(18, OUTPUT);
  pinMode(17, OUTPUT);
  pinMode(16, OUTPUT);
  pinMode(15, OUTPUT);
  pinMode(14, OUTPUT);
  pinMode(41, OUTPUT);
  pinMode(40, OUTPUT);
  pinMode(39, OUTPUT);
  pinMode(38, OUTPUT);
}


void loop()
{ unsigned long t1, t2;
  t1=ARM_DWT_CYCCNT;
  for(unsigned char i=0; i<128; i++)
     RunRealExcitationOne();
  t2=ARM_DWT_CYCCNT;
  Serial.printf("Times: %lu\n", (t2-t1)/255/128); // cycles per pass of the 14-store loop
  delay(1000);
}

// #pragma GCC pop_options
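A note on reading the printed number: the sketch above divides the cycle delta by 255 and 128 only, so it reports cycles per pass through the 14-store loop body rather than cycles per single store. Below is a minimal sketch of the full normalization, assuming the loop counts from the code above (128 calls, 255 passes per call, 14 stores per pass). `cyclesPerWrite` is a hypothetical helper name, not part of the Teensy core.

```cpp
#include <cstdint>

// Hypothetical helper: average CPU cycles spent per GPIO store.
// t1 and t2 are two readings of a free-running 32-bit cycle counter
// (such as ARM_DWT_CYCCNT). Unsigned subtraction stays correct across
// a single wrap-around of the counter.
constexpr uint32_t cyclesPerWrite(uint32_t t1, uint32_t t2,
                                  uint32_t calls,
                                  uint32_t passesPerCall,
                                  uint32_t writesPerPass)
{
    // Integer division: the result is rounded down to whole cycles.
    return (t2 - t1) / (calls * passesPerCall * writesPerPass);
}
```

In the sketch above this could be called as `cyclesPerWrite(t1, t2, 128, 255, 14)`. The result also includes the loop and call overhead, so it is an upper estimate of the cost of a single store.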
 
I understand that I need to place A0...A13 into registers; in that case the main loop would take one clock cycle per data transfer.

However, I was unable to achieve this by putting "register" before "unsigned int".

I also tried rearranging the code to use only 4 or 7 of the variables A0...A13, in case the compiler does not have enough registers for all 14.

My further questions:

1. Can the Teensy 4.x processor load data into one register and store data from another register into GPIO6_DR simultaneously, in one clock cycle?
2. How can I force the compiler to place A0...A13 into registers?
3. Can I get the assembly code from the Arduino environment?
4. Are there any simple examples or tutorials on how to write small assembly parts and incorporate them into *.cpp for this processor?

Thank you!
 
Which Teensy version are you using? The Teensy 4.1 can output 16 bits on the upper half (bits 16-31) of GPIO6 in one go.
 
Thank you very much for the reply, manicksan!

Yes, I am using a Teensy 4.1 and want to update only bits 16-31. Please advise me how to do it in one clock cycle. I tried unsigned short, but it seems to be slower.
 

Yes, I have seen that before, but it is 7 times slower than a 32-bit write, if we are talking about:

Code:
a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A0)<<16);

and a direct operation like the one you proposed:

GPIO6_DR = (GPIO6_DR & 0x00FF) | (data << 16);
should be modified to
GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
but it cannot be compiled: the compiler gives an error about using GPIO6_DR in bit operations.
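To make the difference between the two masks concrete, here is a host-side sketch in which an ordinary value stands in for GPIO6_DR. The names `withUpperHalf` and `withUpperHalfBadMask` are hypothetical. The sketch also widens `data` before the shift: a 16-bit value is promoted to a signed `int` first, and shifting a value with bit 15 set into the sign bit is not well defined before C++20.

```cpp
#include <cstdint>

// Returns `reg` with bits 16-31 replaced by `data`; bits 0-15 are kept.
constexpr uint32_t withUpperHalf(uint32_t reg, uint16_t data)
{
    return (reg & 0x0000FFFFu) | (static_cast<uint32_t>(data) << 16);
}

// The mistyped mask, for comparison: 0x00FF keeps only bits 0-7,
// so bits 8-15 of the register are cleared as a side effect.
constexpr uint32_t withUpperHalfBadMask(uint32_t reg, uint16_t data)
{
    return (reg & 0x00FFu) | (static_cast<uint32_t>(data) << 16);
}
```

For a register value of 0x12345678 and data 0xABCD, the first function gives 0xABCD5678, while the second gives 0xABCD0078.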
 
should be modified to: GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
Yes, of course. I must have been tired when writing that. (8*4bits=32)


but it cannot be compiled: the compiler gives an error about using GPIO6_DR in bit operations.

I don't get any compile errors when doing it like this:
Code:
void setup() {
  GPIO6_GDIR |= 0xFFFF0000; // set bits 16-31 to outputs
}

void loop() {
  uint16_t data = 0x55;
  GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
  data = 0xAA;
  GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
}

Probing it with my scope, I get 30 ns between the first and the last write in the loop, which would give you around 33 MHz data transfer if you did it in one go
(though I have read somewhere that it can be even faster),
then 40 ns from the end of the loop back to the beginning.

The assembly output (which looks efficient) is:
Code:
GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
      90:	f04f 4284 	mov.w	r2, #1107296256	; 0x42000000
      94:	6813      	ldr	r3, [r2, #0]
      96:	b29b      	uxth	r3, r3
      98:	f443 03aa 	orr.w	r3, r3, #5570560	; 0x550000
      9c:	6013      	str	r3, [r2, #0]
data = 0xAA;
GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (data << 16);
      9e:	6813      	ldr	r3, [r2, #0]
      a0:	b29b      	uxth	r3, r3
      a2:	f443 032a 	orr.w	r3, r3, #11141120	; 0xaa0000
      a6:	6013      	str	r3, [r2, #0]
      a8:	4770      	bx	lr
 
Page 958 of the datasheet
says that the maximum speed for GPIO6 is 200 MHz,
but that needs additional configuration to be possible.
 
I really cannot understand why my previous version with "unsigned short" did not compile. When I use uint16_t as you suggested, everything compiles, but it is still slow.

Below I compare my version with 32-bit writes and yours with 16-bit writes: mine takes 2 clock cycles per write, and the 16-bit write takes 7 times more, i.e. 14 clock cycles.
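A plausible explanation for the gap (not confirmed in this thread) is that the 16-bit version reads GPIO6_DR back before every store, while the 32-bit version only stores. If the lower 16 bits of the port do not change during the burst, which is an assumption, the register can be read once and the 14 full words built before the timed loop. A sketch with the hypothetical names `makeWord` and `buildWords`:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 14;  // number of patterns per burst

// Combine the preserved low half with one 16-bit output pattern.
constexpr uint32_t makeWord(uint32_t lowHalf, uint16_t pattern)
{
    return (lowHalf & 0x0000FFFFu) | (static_cast<uint32_t>(pattern) << 16);
}

// Fill `out` with ready-to-store 32-bit words. `regSnapshot` is a single
// read of the port register taken before the burst starts.
inline void buildWords(uint32_t regSnapshot,
                       const uint16_t (&patterns)[kWords],
                       uint32_t (&out)[kWords])
{
    for (std::size_t i = 0; i < kWords; ++i) {
        out[i] = makeWord(regSnapshot, patterns[i]);
    }
}
```

With a snapshot of 0x00000002 and the pattern 0x22a2 this yields 0x22a20002, the same constant as A0 in the 32-bit version, so the timed loop reduces to plain stores.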

You can compare them; I attached both versions below:

Code:
#pragma GCC push_options
#pragma GCC optimize ("Ofast")


#include <Arduino.h>

void RunRealExcitationX()
{ register unsigned short  A0=0x22a2;
  register unsigned short  A1=0x2286;
  register unsigned short  A2=0x2a84;
  register unsigned short  A3=0x0a94;
  register unsigned short  A4=0x0a15;
  register unsigned short  A5=0x1815;
  register unsigned short  A6=0x1851;
  register unsigned short  A7=0x1151;
  register unsigned short  A8=0x1149;
  register unsigned short  A9=0x1548;
  register unsigned short A10=0x0568;
  register unsigned short A11=0x052a;
  register unsigned short A12=0x242a;
  register unsigned short A13=0x24a2;
  unsigned char i=0;
  do
  {
#if 0
    unsigned int a;
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A0)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A0)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A1)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A2)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A3)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A4)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A5)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A6)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A7)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A8)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int) A9)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int)A10)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int)A11)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int)A12)<<16);
    a = (GPIO6_DR) & 0x0000ffff; GPIO6_DR = a | (((unsigned int)A13)<<16);
#else
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A0) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A1) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A2) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A3) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A4) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A5) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A6) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A7) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A8) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int) A9) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int)A10) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int)A11) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int)A12) << 16);
    GPIO6_DR = (GPIO6_DR & 0x0000ffff) | (((unsigned int)A13) << 16);
#endif
    i--;
  } while(i);
}


void RunRealExcitationY()
{ register unsigned int  A0=0x22a20002;
  register unsigned int  A1=0x22860002;
  register unsigned int  A2=0x2a840002;
  register unsigned int  A3=0x0a940002;
  register unsigned int  A4=0x0a150002;
  register unsigned int  A5=0x18150002;
  register unsigned int  A6=0x18510002;
  register unsigned int  A7=0x11510002;
  register unsigned int  A8=0x11490002;
  register unsigned int  A9=0x15480002;
  register unsigned int A10=0x05680002;
  register unsigned int A11=0x052a0002;
  register unsigned int A12=0x242a0002;
  register unsigned int A13=0x24a20002;
  unsigned char i=0;
  do
  { GPIO6_DR=A0;
    GPIO6_DR=A1;
    GPIO6_DR=A2;
    GPIO6_DR=A3;
    GPIO6_DR=A4;
    GPIO6_DR=A5;
    GPIO6_DR=A6;
    GPIO6_DR=A7;
    GPIO6_DR=A8;
    GPIO6_DR=A9;
    GPIO6_DR=A10;
    GPIO6_DR=A11;
    GPIO6_DR=A12;
    GPIO6_DR=A13;
    i--;
  } while(i);
}


void setup()
{
  while(!Serial); // wait for serial port to connect
  if (CrashReport) {
    Serial.print(CrashReport);
    delay(5000);
  }
  Serial.println("We are starting...");

  pinMode(23, OUTPUT);
  pinMode(22, OUTPUT);
  pinMode(21, OUTPUT);
  pinMode(20, OUTPUT);
  pinMode(19, OUTPUT);
  pinMode(18, OUTPUT);
  pinMode(17, OUTPUT);
  pinMode(16, OUTPUT);
  pinMode(15, OUTPUT);
  pinMode(14, OUTPUT);
  pinMode(41, OUTPUT);
  pinMode(40, OUTPUT);
  pinMode(39, OUTPUT);
  pinMode(38, OUTPUT);
}


void loop()
{ unsigned long t1, t2;
  t1=ARM_DWT_CYCCNT;
  for(unsigned char i=0; i<128; i++)
    RunRealExcitationX();
  t2=ARM_DWT_CYCCNT;
  Serial.printf("16 bits write: %lu %lu\n", t2-t1, (t2-t1)/256/128); // i starts at 0, so each call makes 256 passes
  t1=ARM_DWT_CYCCNT;
  for(unsigned char i=0; i<128; i++)
    RunRealExcitationY();
  t2=ARM_DWT_CYCCNT;
  Serial.printf("32 bits write: %lu %lu\n", t2-t1, (t2-t1)/256/128); // i starts at 0, so each call makes 256 passes
  delay(1000);
}

#pragma GCC pop_options
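One detail that affects the per-write figures: a `do { ... i--; } while(i);` loop over an `unsigned char` runs 255 times when `i` starts at 255, but 256 times when it starts at 0, because the first decrement wraps around to 255. The sketch below counts the passes on the host. `countPasses` is a hypothetical helper used only for illustration.

```cpp
#include <cstdint>

// Counts how many times the body of `do { ...; i--; } while (i);` runs
// for a given 8-bit starting value.
constexpr unsigned countPasses(uint8_t start)
{
    uint8_t i = start;
    unsigned passes = 0;
    do {
        ++passes;
        i--;  // 0 wraps around to 255 for an unsigned 8-bit type
    } while (i);
    return passes;
}
```

Both functions in the listing above start with `i=0`, so each call makes 256 passes over the 14 stores.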
 
Page 958 of the datasheet
says that the maximum speed for GPIO6 is 200 MHz,
but that needs additional configuration to be possible.

Thank you very much for pointing that out. Yes, sure, I know that it is not possible to drive GPIO very fast, but my data sequence generates a ca. 30 MHz on/off sequence on each pin, and all pins have slightly different phases. I would like to achieve the maximum possible accuracy of this phase for each channel. I clearly understand that this task is usually solved with an FPGA, and I have done it with an FPGA before, but on Teensy 4.1 hardware I have already achieved 300 MHz granularity for the phase, and I am looking for a solution with 600 MHz accuracy.
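To put numbers on the granularity mentioned above: the time step between two consecutive port updates is the number of CPU cycles per store divided by the core clock. Below is a small sketch, assuming a 600 MHz core clock (implied by the figures in this thread, where two cycles per store give 300 MHz). `stepPicoseconds` is a hypothetical helper.

```cpp
#include <cstdint>

// Time between consecutive GPIO stores, in picoseconds, rounded down.
constexpr uint64_t stepPicoseconds(uint64_t coreClockHz,
                                   uint64_t cyclesPerStore)
{
    return cyclesPerStore * 1000000000000ULL / coreClockHz;
}
```

At two cycles per store the phase step is about 3333 ps (3.3 ns); at one cycle per store it would be about 1666 ps (1.7 ns).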
 
Out of curiosity, which FPGA did you use?

Mainly Altera: Cyclone and Stratix of different sizes. For example, on small Cyclones it is doable to make something similar with some additional delays on latches at 0.3 ns, so 10 times better than directly on the Teensy 4.1, but it is not very straightforward to route it, implement the design in the FPGA and feed it the appropriate configuration. Here it is simpler for sure.
 