Parallel GPIO on Teensy 3.0

nateimig · Jul 15, 2013

Hello, I was wondering if there is a faster alternative to writing to 16 parallel GPIO pins like this:

Code:

((cl & 0x01)) ? digitalWriteFast(PIN0, HIGH) : digitalWriteFast(PIN0, LOW);
((cl & 0x02)) ? digitalWriteFast(PIN1, HIGH) : digitalWriteFast(PIN1, LOW);
((cl & 0x04)) ? digitalWriteFast(PIN2, HIGH) : digitalWriteFast(PIN2, LOW);
((cl & 0x08)) ? digitalWriteFast(PIN3, HIGH) : digitalWriteFast(PIN3, LOW);
((cl & 0x10)) ? digitalWriteFast(PIN4, HIGH) : digitalWriteFast(PIN4, LOW);
((cl & 0x20)) ? digitalWriteFast(PIN5, HIGH) : digitalWriteFast(PIN5, LOW);
((cl & 0x40)) ? digitalWriteFast(PIN6, HIGH) : digitalWriteFast(PIN6, LOW);
((cl & 0x80)) ? digitalWriteFast(PIN7, HIGH) : digitalWriteFast(PIN7, LOW);
((ch & 0x01)) ? digitalWriteFast(PIN8, HIGH) : digitalWriteFast(PIN8, LOW);
((ch & 0x02)) ? digitalWriteFast(PIN9, HIGH) : digitalWriteFast(PIN9, LOW);
((ch & 0x04)) ? digitalWriteFast(PIN10, HIGH) : digitalWriteFast(PIN10, LOW);
((ch & 0x08)) ? digitalWriteFast(PIN11, HIGH) : digitalWriteFast(PIN11, LOW);
((ch & 0x10)) ? digitalWriteFast(PIN12, HIGH) : digitalWriteFast(PIN12, LOW);
((ch & 0x20)) ? digitalWriteFast(PIN13, HIGH) : digitalWriteFast(PIN13, LOW);
((ch & 0x40)) ? digitalWriteFast(PIN14, HIGH) : digitalWriteFast(PIN14, LOW);
((ch & 0x80)) ? digitalWriteFast(PIN15, HIGH) : digitalWriteFast(PIN15, LOW);

This piece of code drives the 16-bit parallel data bus of a TFT LCD display, and lately I've been running into slow screen refresh rates that have forced me to scale back some small projects that could be really cool, such as this:

Real-Time Audio Spectrogram

I think what I need to do is to look more into DMA from the datasheet & do more research, but this is probably way over my head.

Please help!

stevech · Jul 15, 2013

It is possible to set or clear selected bits in one line of code from a hardware viewpoint. I don't see that implemented in the library code though.

Constantin · Jul 15, 2013

nateimig said:
Hello, I was wondering if there is a faster alternative to writing to 16 parallel GPIO pins like this:

I wonder if the K20 manual would yield any information on how to address this issue. IIRC, the AVR has a deep-seated ability to allow all PortA, PortB, etc. pins to be set at once using a byte being written to a particular register. I wonder if the same thing can't be done on the Teensy.

stevech · Jul 15, 2013

K20 yes.
And the library pins.c in teensy3, digitalWrite()

Any microprocessor has a way to alter all the bits in its GPIO registers in one operation. They differ a bit in how this is done, and in how "atomic" access is assured so that an interrupt doesn't interfere while you do a read/modify/write. On AVRs, you just disable interrupts briefly. On Cortex, there are special ways to do this so that interrupts don't need to be turned off, though they can be, for simplicity.

Same thing for reading selected bits from a GPIO port, in one operation.

With the Arduino legacy, I guess this just wasn't done often.
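One mechanism that fits the "no need to disable interrupts" description is the K20's per-port set and clear registers (GPIOx_PSOR and GPIOx_PCOR), which change only the bits written as 1. Whether this is the mechanism meant above is an assumption. The sketch below only computes the two words to write; the helper name splitMaskedWrite is hypothetical, and the register writes are shown in a comment because they only exist on the target.

```cpp
#include <cstdint>

// Words to write to a port's set register and clear register so that
// only the pins selected by `mask` end up equal to `value`.
struct SetClear {
  uint32_t setBits;    // write to GPIOx_PSOR
  uint32_t clearBits;  // write to GPIOx_PCOR
};

// Hypothetical helper, not part of any Teensy library.
constexpr SetClear splitMaskedWrite(uint32_t value, uint32_t mask) {
  return SetClear{ value & mask, ~value & mask };
}

// On a Teensy 3.0 this might be used as (untested on hardware):
//   SetClear sc = splitMaskedWrite(data, 0x0FF);
//   GPIOC_PSOR = sc.setBits;    // raises the 1 bits
//   GPIOC_PCOR = sc.clearBits;  // lowers the 0 bits
```

Note that this takes two writes, so the selected pins do not all change in the same bus cycle; pins outside the mask are never touched, which is what makes it safe against interrupts.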

PaulStoffregen · Jul 16, 2013

The 8 pins used by OctoWS2811 are a native port. You can write all 8 in a single bus cycle.

nateimig · Jul 16, 2013

Thank you for your responses. I have a few questions for Paul, but first, correct me if I'm wrong: in your OctoWS2811 library you have a pair of PWM signals triggering the series of DMA events, which include setting the 8 bits, correcting the 8-bit SPI frame, and clearing all 8 bits; each SPI frame is pre-configured in a buffer from all the LED colors of each video frame.

Q1: Can I trigger the DMA events from software?
Q2: Can I use around 2 (8-bit) frames controlling the LCD in 8-bit mode instead of 16-bit mode?
Q3: Where did you learn this?

Thanks so much!

PaulStoffregen · Jul 17, 2013

nateimig said:
Q1: Can I trigger the DMA events from software?

Yes. Well, maybe, depending on what you mean by "event".

Certainly you can set up a DMA transfer by writing 32 bytes to the 8 TCD registers. The last one has a bit for manual start. Otherwise, the major loop executes only when an event comes from the DMAMUX, which gets the event from one of the peripherals you've configured to send a DMA event (usually instead of an interrupt).

Somewhere on this forum is an old thread about memcpy speed. I posted a DMA-based memcpy example. I'm sure you can find it with a little searching.

Q2: Can I use around 2 (8-bit) frames controlling the LCD in 8-bit mode instead of 16-bit mode?

Probably, again depending on the details. There are 5 native 32-bit ports, called A through E, but none has all 32 or even 16 bits on this chip. I believe one of them has 12 bits, so that's probably the most pins that can be written in a single bus cycle.

Q3: Where did you learn this?

It's all in the (huge) reference manual. I must confess, the DMA is the part I learned last. I still haven't used many of its amazing capabilities. Also, this chip has some published errata, where a couple of those features don't actually work, so if you really get into experimenting, check Freescale's errata for this chip.

The DMA engine is quite complex. Setting up DMA involves writing at least 8 registers. Starting a DMA transfer involves many cycles to copy the TCD from registers into the DMA engine, so the DMA does have some overhead per transfer. Once the transfer has begun, the minor loop speed is pretty incredible, especially if you've configured for 32 bit bus operation.

For an LCD, the DMA engine probably doesn't make a lot of sense. It's complex to use and difficult to troubleshoot. For just doing 16-bit operations, the overhead of setting up the transfer is much greater than simply using code to move the bytes around.

nateimig · Aug 7, 2013

Sorry for the late post. I accidentally tore off the USB connector and had to perform open-heart surgery on the bottom of the expansion board to solder a USB cable to the bottom of the Teensy, using Paul's instructions from his post http://forum.pjrc.com/threads/19336-How-to-repair-a-broken-off-Teensy-3-0-USB-connector. Anyway, since a DMA transfer doesn't seem that practical for my application of just drawing real-time graphics, I decided to write the parallel data to two ports to see if that would speed up the UTFT demo sketch.

These are the Teensy 3.0 pin numbers for the available bits of each port:

PortA[4:5, 12:13] = {33, 24, 3, 4}
PortB[0:3, 16:19] = {16, 17, 19, 18, 0, 1, 32, 25}
PortC[0:11] = {15, 22, 23, 9, 10, 13, 11, 12, 28, 27, 29, 30}
PortD[0:7] = {2, 14, 7, 8, 6, 20, 21, 5}
PortE[0:1] = {31, 26}

Or, the other way around, the port bit for each pin:

PIN0 = B16
PIN1 = B17
PIN2 = D0
PIN3 = A12
PIN4 = A13
PIN5 = D7
PIN6 = D4
PIN7 = D2
PIN8 = D3
PIN9 = C3
PIN10 = C4
PIN11 = C6
PIN12 = C7
PIN13 = C5
PIN14 = D1
PIN15 = C0
PIN16 = B0
PIN17 = B1
PIN18 = B3
PIN19 = B2
PIN20 = D5
PIN21 = D6
PIN22 = C1
PIN23 = C2
PIN24 = A5
PIN25 = B19
PIN26 = E1
PIN27 = C9
PIN28 = C8
PIN29 = C10
PIN30 = C11
PIN31 = E0
PIN32 = B18
PIN33 = A4
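The pin list above can also be kept as a lookup table in code. This is only a sketch: the names PortBit, kPinMap and pinMask are hypothetical, and the entries are copied from the list above rather than from the Teensy core files.

```cpp
#include <cstdint>

// Port letter and bit position for one Teensy 3.0 pin.
struct PortBit {
  char port;    // 'A' through 'E'
  uint8_t bit;  // bit position within that port's 32-bit registers
};

// Index = Teensy 3.0 pin number (0..33), transcribed from the list above.
constexpr PortBit kPinMap[34] = {
  {'B', 16}, {'B', 17}, {'D', 0},  {'A', 12}, {'A', 13}, {'D', 7},
  {'D', 4},  {'D', 2},  {'D', 3},  {'C', 3},  {'C', 4},  {'C', 6},
  {'C', 7},  {'C', 5},  {'D', 1},  {'C', 0},  {'B', 0},  {'B', 1},
  {'B', 3},  {'B', 2},  {'D', 5},  {'D', 6},  {'C', 1},  {'C', 2},
  {'A', 5},  {'B', 19}, {'E', 1},  {'C', 9},  {'C', 8},  {'C', 10},
  {'C', 11}, {'E', 0},  {'B', 18}, {'A', 4},
};

// Bit mask for a pin inside its port register (hypothetical helper).
constexpr uint32_t pinMask(uint8_t pin) {
  return 1u << kPinMap[pin].bit;
}
```

For example, pin 13 is bit 5 of port C, so pinMask(13) is 0x20.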

I decided to make the following connections to the TFT display:

DB0 = C0
DB1 = C1
DB2 = C2
DB3 = C3
DB4 = C4
DB5 = C5
DB6 = C6
DB7 = C7
DB8 = C8
DB9 = C9
DB10 = C10
DB11 = C11
DB12 = D2
DB13 = D3
DB14 = D4
DB15 = D5

RS = D6

Then this is the short version of the code to control the data write to the IO pins:

Code:

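void UTFT::LCD_Writ_Bus(char VH, char VL, byte mode){
	// DB0-DB7 = VL and DB8-DB11 = low nibble of VH, on C0-C11
	GPIOC_PDOR = ((uint16_t)(VH & 0x0F) << 8) |((uint16_t) VL);
	// DB12-DB15 = high nibble of VH, shifted onto D2-D5; RS on D6
	GPIOD_PDOR = (((uint16_t)(VH >> 2)) & (0b00111100))|(RS << 6);
	pulse_low(pinWR);	// strobe WR so the display latches the word
}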
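The bit shuffling in that function can be checked on a PC. The sketch below takes the same wiring (DB0-DB11 on C0-C11, DB12-DB15 on D2-D5, RS on D6) and computes the two port words from one 16-bit value; the function names portCWord and portDWord are hypothetical and no hardware registers are touched.

```cpp
#include <cstdint>

// DB0..DB11 map straight onto port C bits 0..11.
constexpr uint32_t portCWord(uint16_t pixel) {
  return pixel & 0x0FFFu;
}

// DB12..DB15 land on port D bits 2..5, and RS on port D bit 6.
constexpr uint32_t portDWord(uint16_t pixel, bool rs) {
  return ((static_cast<uint32_t>(pixel) >> 12) << 2) | (rs ? (1u << 6) : 0u);
}

// On the Teensy the results would go to GPIOC_PDOR and GPIOD_PDOR.
// Writing the whole PDOR also drives every other output on that port
// (for example D0, D1 and D7) to 0, so this assumes those pins are unused.
```

For the value 0xABCD, port C receives 0x0BCD and port D receives 0x28 (or 0x68 with RS high).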

Now I know this isn't really a controlled test, but before, when I was using digitalWriteFast() on each I/O pin, the UTFT demo took ~29,500 ms to finish, although in the demo video it takes 35 s. I found later that a stray debug statement had been left uncommented and was printing to the terminal, so this comparison is somewhat compromised.
Old Demo Video

With the improved PORT writing to I/Os the same UTFT demo took 27,367 ms to finish, here is the video:

This actually finishes the UTFT demo faster than the Due does in these videos:
Due UTFT Video 1 UTFT Demo Time = 29,505 ms
Due UTFT Video 2 UTFT Demo Time = ~32,000 ms

nateimig · Aug 7, 2013

Here is an updated version of Real-Time Spectrogram & the code.

Code:

#define ARM_MATH_CM4
#include <UTFT.h>
#include "arm_math.h"

#define N 128    // 128
#define M 8    // 24
int16_t data[2*N];
q15_t   fftbuf[2*N];
q31_t   *fftmag = (q31_t*)fftbuf;
uint8_t Plot[M][N];
extern uint8_t SmallFont[];

/* Audio Sample Interrupt Instance */
int AudioPin = A3, n = 0;
IntervalTimer AudioSample;

void AudioSampleISR(void){ 
  data[2*n] = analogRead(AudioPin) - 150;      // Real Part
  data[2*n+1] = 0;                             // Imag Part  
  n++;
  if(n > N-1){ 
    calcFFT();
    for(int j = M - 1; j > 0; j--)                                  // Shift data 1 time series
      for(int i = 0; i < N; i++) Plot[j][i] = Plot[j-1][i];
    displayFFT();
    n = 0;
  } 
}

/* TFT Display Instance */
UTFT myGLCD(ITDB32WD, 21, 25, 3, 4);

int START = 390;
void displayFFT(){
  uint8_t r,g,b;

  myGLCD.setColor(VGA_WHITE);
  myGLCD.fillRect(0, START, 230 - 3*M, 60);
  //myGLCD.fillScr(VGA_WHITE);
  //myGLCD.setColor(VGA_BLACK);
  for(int i = 0; i < N/2; i++){            // Format Current Short Time Spectrum
    Plot[0][i] = constrain(abs(data[2*i]/1),0, 240); //round(constrain(abs(data[i]/5),0, 240));
    gradient(Plot[0][i], r, g, b);
    myGLCD.setColor(r,g,b);
    // myGLCD.fillRect(230, START - 3*i, 232, START - 3*i + 2);
    myGLCD.fillRect(230 - 3*(M -1), START - 3*i, 230 - 3*(M -1) + 2, START - 3*i + 2);
    //myGLCD.drawLine(239 - Plot[0][i], 360 - 2*i, 240, 360 - 2*i);
    
    myGLCD.setColor(0,0,0);
    myGLCD.fillRect(230 - 3*M - constrain(Plot[0][i]/2, 0, 150), START - 3*i, 230 - 3*M, START - 3*i + 2);
  }
  
  for(int j = 1; j < M; j++){                // Plot the STFT Series [1 : M]
    for(int i = 0; i < N/2; i++){            
      gradient(Plot[j][i],r,g,b);
      myGLCD.setColor(r, g, b);
      myGLCD.fillRect(230 - 3*(M -1) + 3*j, START - (3*i),230 - 3*(M -1) + 3*j + 2, START - 3*i+2);
    }
  }
}

void gradient(uint8_t val, uint8_t& r,uint8_t& g,uint8_t& b){
  float d = 150.0;
  float C = 15.97/d;
  r = 0; g = (C*(val - d))*(C*(val - d)); b = 0xFF;
  r = 0;      g = 0xFF - (val - 0);   b = 0xFF;
}
arm_cfft_radix2_instance_q15 fft_inst;  /* CFFT Structure instance */
uint16_t ifftFlag = 0; 
uint16_t doBitReverse = 1;

void calcFFT(){
  arm_cfft_radix2_q15(&fft_inst, data);      // Process Data & create spectrum data  
  
  
  // Convert each value to magnitude (note: this reads fftbuf, while the FFT above ran in place on data)
  q15_t * pSrc = fftbuf;
  q31_t * pDst = fftmag;
  for( n=0; n < N; n++ ){
    q15_t real, imag;
    q31_t acc0, acc1;
    real = *pSrc++;
    imag = *pSrc++;
    acc0 = __SMUAD(real, real);
    acc1 = __SMUAD(imag, imag);
    arm_sqrt_q31((q31_t) (((q63_t) acc0 + acc1) >> 17), pDst++);
  }
  
} 

void setup() {
  pinMode(AudioPin, INPUT);
  Serial.begin(9600);
  
  for(int j = 0; j < M; j++)
    for(int i = 0; i < N; i++)
      Plot[j][i] = 0;
  analogReadRes(12);

  myGLCD.InitLCD(PORTRAIT);
  myGLCD.setFont(SmallFont);
  myGLCD.fillScr(VGA_WHITE);
  
  for(int i = 0; i < 120; i++){
    uint8_t r,g,b;
    gradient(i,r,g,b);
    myGLCD.setColor(r, g, b);
    myGLCD.fillRect(238 - 2*i, 20, 239 - 2*i+1,25);
  }
  
  arm_cfft_radix2_init_q15(&fft_inst, N, ifftFlag, doBitReverse); // Initialize CFFT Instance
  AudioSample.begin(AudioSampleISR, 100);    // Set Period to 100 us or 10 kHz
}

void loop() { }

nateimig · Aug 7, 2013

Then here is the TFT drawing the Mandelbrot set with the following code:

Code:

#include <UTFT.h>

#define REAL_CONSTANT2  -0.7
#define IMG_CONSTANT2   0.27015
#define ITERATION 50

float Zm = 0.5; 
extern uint8_t SmallFont[];

// UTFT(Model, RS, WR, CS,RST[, ALE]) 
UTFT myGLCD(ITDB32WD, 21, 25, 3, 4);

void setup(){
  Serial.begin(9600);
  randomSeed(analogRead(0));
  
  // Setup the LCD
  myGLCD.InitLCD();
  myGLCD.setFont(SmallFont);
}

void loop(){
  myGLCD.fillScr(VGA_BLACK);
  Mandelbrot(200, 200, 100, 100, Zm);
  Zm += 2.0F;
  //Zoom = 1.5*Zoom;
  //Zoom += (10 - Zoom)*(Zoom > 1000);
  for(int t = 0; t < 2; delay(1000), t++); 
}

void Julia2(uint16_t size_x, uint16_t size_y, uint16_t offset_x, uint16_t offset_y, uint16_t zoom){
  float tmp1, tmp2;
  float num_real, num_img;
  float radius;
  uint8_t i;
  uint16_t x,y;
  for (y = 0; y < size_y; y++){
    for (x = 0; x < size_x; x++){
      num_real = y - offset_y;
      num_real = num_real / zoom;
      num_img = x - offset_x;
      num_img = num_img / zoom;
      i=0;
      radius = 0;
      while ((i < ITERATION - 1) && (radius < 4)){
          tmp1 = num_real * num_real;
          tmp2 = num_img * num_img;
          num_img = 2*num_real*num_img + IMG_CONSTANT2;
          num_real = tmp1 - tmp2 + REAL_CONSTANT2;
          radius = tmp1 + tmp2;
          i++;
        }
      myGLCD.setColor(i*12, i*38, i*18);
      myGLCD.drawPixel(x,y);
    }
  }
}

void Mandelbrot(uint16_t size_x, uint16_t size_y, uint16_t offset_x, uint16_t offset_y, float zoom){
  float tmp1, tmp2;
  float num_real, num_img, Pr, Pi;
  float moveX = -1.108, moveY = 0.230;
  float radius;
  uint8_t i;
  uint16_t x,y;
  for(y = 0; y < size_y; y++){
    for(x = 0; x < size_x; x++){
      Pr = (x - offset_x);
      Pr = Pr/(0.5*zoom*size_x) + moveX;
      Pi = (y - offset_y);
      Pi = Pi/(0.5*zoom*size_y) + moveY;
      num_real = 0;
      num_img = 0;
      i = 0;
      radius = 0;
      while ((i < ITERATION - 1) && (radius < 4)){
          tmp1 = num_real * num_real;
          tmp2 = num_img * num_img;
          num_img = 2*num_real*num_img + Pi;
          num_real = tmp1 - tmp2 + Pr;
          radius = tmp1 + tmp2;
          i++;
        }
      myGLCD.setColor(i*10, i*10, i*10);
      myGLCD.drawPixel(x, y);
    }
  }
}

PaulStoffregen · Aug 7, 2013

If you use the first 8 bits on port C and D

PortC[0:7] = {15, 22, 23, 9, 10, 13, 11, 12} // other 4 pins not used
PortD[0:7] = {2, 14, 7, 8, 6, 20, 21, 5}

Then you can cast the registers to 8 bits. The syntax is ugly, but it costs nothing in CPU time. This only causes the compiler to use different instructions to write the data. The ARM supports 8-bit I/O to those 32-bit registers, where the other 24 bits are preserved, so this will prevent the other pins from changing.

The code might look something like this.

Code:

void UTFT::LCD_Writ_Bus(char VH, char VL, byte mode){
        *(volatile uint8_t *)(&GPIOC_PDOR) = VL;
        *(volatile uint8_t *)(&GPIOD_PDOR) = VH;
        // use digitalWriteFast for RS....
	pulse_low(pinWR);
}
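The effect of that cast can be tried on a PC with an ordinary variable standing in for the register. This is a sketch under two assumptions: a little-endian machine (as the Cortex-M4 on the Teensy is), so the byte at the register's address is bits 0-7, and a hypothetical helper name, writeLowByte.

```cpp
#include <cstdint>

// Store one byte at the address of a 32-bit register. The compiler emits
// a byte-wide store, so bits 8..31 of the target are left as they were.
inline void writeLowByte(volatile uint32_t &reg, uint8_t value) {
  *reinterpret_cast<volatile uint8_t *>(&reg) = value;
}

// On a Teensy 3.0 the equivalent calls would be:
//   writeLowByte(GPIOC_PDOR, VL);
//   writeLowByte(GPIOD_PDOR, VH);
```

Starting from 0xABCD1234, writing the byte 0x56 leaves 0xABCD1256: only the low eight bits change.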

Headroom · Aug 7, 2013

It would be interesting to see what speed can be achieved for the Mandelbrot calculation when using integer math instead of floating point. Fractint comes to mind.
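As a rough illustration of the integer-math idea, here is a fixed-point version of the escape-time loop. It is a sketch, not a measured speedup: the Q8.24 format, the helper names (toFix, mulFix, mandelIterFixed) and the assumption that both parts of c stay within about -2..2 are all choices made for this example, and the escape test is done before the update rather than after it as in the code above.

```cpp
#include <cstdint>

// Q8.24 fixed point: 24 fractional bits, values roughly in -128..128.
using fix_t = int32_t;
constexpr int kFracBits = 24;

// Convert a floating-point constant to fixed point.
constexpr fix_t toFix(double v) {
  return static_cast<fix_t>(v * (1 << kFracBits));
}

// Multiply two fixed-point values using a 64-bit intermediate product.
constexpr fix_t mulFix(fix_t a, fix_t b) {
  return static_cast<fix_t>((static_cast<int64_t>(a) * b) >> kFracBits);
}

// Number of iterations of z = z*z + c before |z|^2 reaches 4,
// capped at maxIter. Assumes |cr| and |ci| are at most about 2.
uint8_t mandelIterFixed(fix_t cr, fix_t ci, uint8_t maxIter) {
  fix_t zr = 0, zi = 0;
  uint8_t i = 0;
  while (i < maxIter) {
    fix_t zr2 = mulFix(zr, zr);
    fix_t zi2 = mulFix(zi, zi);
    if (zr2 + zi2 >= toFix(4.0)) break;  // escaped
    zi = 2 * mulFix(zr, zi) + ci;        // imaginary part of z*z + c
    zr = zr2 - zi2 + cr;                 // real part of z*z + c
    i++;
  }
  return i;
}
```

In the drawing loop, Pr and Pi would be converted once per pixel with toFix (or stepped in fixed point), and the returned count used for the color exactly as before.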

slomobile · Feb 13, 2016

PaulStoffregen said:
If you use the first 8 bits on port C and D

PortC[0:7] = {15, 22, 23, 9, 10, 13, 11, 12} // other 4 pins not used
PortD[0:7] = {2, 14, 7, 8, 6, 20, 21, 5}

Then you can cast the registers to 8 bits. The syntax is ugly, but it costs nothing in CPU time. This only causes the compiler to use different instructions to write the data. The ARM supports 8-bit I/O to those 32-bit registers, where the other 24 bits are preserved, so this will prevent the other pins from changing.

The code might look something like this.

Code:

void UTFT::LCD_Writ_Bus(char VH, char VL, byte mode){
        *(volatile uint8_t *)(&GPIOC_PDOR) = VL;
        *(volatile uint8_t *)(&GPIOD_PDOR) = VH;
        // use digitalWriteFast for RS....
        pulse_low(pinWR);
}

I know this is an old thread, but I was hoping to use similar code to interface with an old motor control chip that has a combined parallel address/data bus. I also need to read from the bus, though, and I'm not sure it's OK to just reverse that syntax, like this:

Code:

void UTFT::LCD_Read_Bus(char VH, char VL, byte mode){
        VL = *(volatile uint8_t *)(&GPIOC_PDOR) ;
        VH = *(volatile uint8_t *)(&GPIOD_PDOR) ;        
}

I also wasn't sure what to do about the LED on pin 13. Do I need to remove it and its resistor? Is that sufficient? In case I borrow some code that wants to blink the LED, how can I keep it from interfering with my bus transfers?
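Two things stand out in the read version above. On the K20, GPIOx_PDOR is the output latch; pin levels are read from the separate input register GPIOx_PDIR. Also, VH and VL are passed by value, so assigning to them inside the function never reaches the caller. A minimal sketch of a byte-wide read follows, assuming the bus pins have already been switched to inputs, a little-endian core, and hypothetical helper names (readLowByte, readBus16). It takes the registers as parameters so it can be tried on a PC with plain variables.

```cpp
#include <cstdint>

// Read bits 0..7 of a 32-bit register with a byte-wide load.
inline uint8_t readLowByte(const volatile uint32_t &reg) {
  return *reinterpret_cast<const volatile uint8_t *>(&reg);
}

// Combine the low bytes of two input registers into one 16-bit word:
// first register = low byte (VL), second register = high byte (VH).
inline uint16_t readBus16(const volatile uint32_t &lowPort,
                          const volatile uint32_t &highPort) {
  uint8_t lo = readLowByte(lowPort);
  uint8_t hi = readLowByte(highPort);
  return static_cast<uint16_t>((hi << 8) | lo);
}

// On a Teensy 3.0 the call would be (untested on hardware):
//   uint16_t word = readBus16(GPIOC_PDIR, GPIOD_PDIR);
```

The result is returned rather than written into by-value parameters, which is what lets the caller actually see the data.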
