Teensy 3.2: what is the fastest possible digital pin read method?

Status
Not open for further replies.
For the moment the main purpose of this code is to act as a simple logic analyser demo. The goal is to read in Game Boy pixel data from the LCD connection. The clock is around 4 MHz. When I view my dump of the data it just seems 'wrong'. I think it's a problem with the sample rate.

So, my question is:

How fast can we go when reading in the digital pins?

Code:
  int  pins[] = {9,10,11,12,15};
  int cloop = 0;
  byte buffer[51200];

  void setup()
  {
    Serial.begin(9600);
    for(int i=0;i<5;i++)
      pinMode(pins[i],INPUT);
  }
  
  void loop()
  {
    if(cloop == 51200) {
      cloop = 0;
      for(int i=0;i<(51200);i++){
        Serial.print(buffer[i],DEC);
        Serial.print(",");
      }
    }
    buffer[cloop] = GPIOC_PDIR & 0xFF;
    cloop++;
  }
 
Could you provide more information?
What do you mean by 'wrong'? Would you mind giving us some 10-20 samples (in hex format)?
What is your T3.2 clock speed?
Teensy may simply be too fast; some additional delays (dummy operations) in the loop may slow down access to the port.
 
Teensy may simply be too fast
Quite the opposite: that code is extremely slow. In my brief test it can do 0.25 MHz.

Reading the port register is the fastest way. However, using loop() has a huge overhead, since it implicitly calls yield() in each iteration. You may also want to disable interrupts while you are capturing data, or you will have holes in your capture data when interrupts occur (e.g. timer interrupt every millisecond, USB interrupts).

Using the cloop global variable results in an optimization barrier, causing the compiler to generate very bad code. This version can capture at >11 MHz in my quick test.

Code:
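const size_t capture_count = 51200;
byte buffer[capture_count];  // capture buffer, as in the original sketch

// Tight capture loop: the local index can stay in a register,
// unlike the global cloop counter.
FASTRUN void capture() {
    for(size_t i = 0; i < capture_count; i++) {
        buffer[i] = GPIOC_PDIR & 0xFF;
    }
}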

void loop() {
    capture();
    for(size_t i = 0; i < capture_count; i++){
        Serial.print(buffer[i], DEC);
        Serial.print(",");
    }
}

FASTRUN places the function in RAM, which is considerably faster than running from flash.
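
The advice above about disabling interrupts during the capture could look roughly like the sketch below. This is only a sketch under assumptions: `readPort()` and `captureNoInterrupts()` are hypothetical names, and the `#else` branch contains host stand-ins that exist only so the code compiles off-target. On the Teensy, `noInterrupts()`, `interrupts()` and `FASTRUN` come from the Arduino core.

```cpp
#include <stddef.h>
#include <stdint.h>

#ifdef ARDUINO
#include <Arduino.h>
// Low byte of port C, same read as "GPIOC_PDIR & 0xFF" above.
static inline uint8_t readPort() { return GPIOC_PDIR & 0xFF; }
#else
// Host stand-ins (assumptions, for off-target compilation only).
#define FASTRUN
static uint8_t fakePort = 0;
static inline uint8_t readPort() { return fakePort++; }
static inline void noInterrupts() {}
static inline void interrupts() {}
#endif

// Fill 'dst' with 'count' back-to-back port samples while interrupts are
// masked, so timer and USB interrupts cannot leave holes in the capture.
FASTRUN void captureNoInterrupts(uint8_t* dst, size_t count) {
    noInterrupts();
    for (size_t i = 0; i < count; i++) {
        dst[i] = readPort();
    }
    interrupts();
}
```

Keep in mind that the millisecond timer and USB are stalled for the whole capture, so the masked section should stay short. At the >11 MHz quoted above, 51200 samples take a little under 5 ms.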
 
Interesting ... is FASTRUN a C/C++ feature, or is it Teensy-specific?

It is an instruction to the compiler to put the code into RAM (more precisely, into the fastrun section, which is defined in the linker .ld file).
Code:
#define FASTRUN __attribute__ ((section(".fastrun")))
 
FASTRUN places the function in RAM, which is considerably faster than running from flash.
But only if highly overclocked. At 96 MHz there seems to be little speed increase (though that may depend on the predictability of the code).
 
You, Seriously, Rock!

Thank you!

and thank you to everyone for taking a look at my question.


The data looks pretty decent now. I haven't confirmed if there are any holes in the data yet but for now I am very happy. I updated the demo link accordingly.

Thank you again :D
 
It might be possible to use DMA to transfer GPIO register to memory, or circular DMA (triggered by timer).

T3.2 @ 120 MHz, 1024 32-bit words:
FASTRUN capture(): 69 us (60 us without FASTRUN)
DMA: 44 us
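
To put those timings in terms of sample rate, the arithmetic can be sketched as below. `samplesPerSecond()` and `printRates()` are hypothetical helper names; the only inputs are the figures quoted above.

```cpp
#include <stdio.h>

// Samples per second for 'samples' transfers completed in 'elapsedMicros'.
constexpr double samplesPerSecond(double samples, double elapsedMicros) {
    return samples / (elapsedMicros * 1e-6);
}

// Print the rates implied by the measurements quoted above
// (1024 transfers on a T3.2 at 120 MHz).
void printRates() {
    printf("FASTRUN loop: %.1f MS/s\n", samplesPerSecond(1024, 69) / 1e6);
    printf("DMA:          %.1f MS/s\n", samplesPerSecond(1024, 44) / 1e6);
}
```

That works out to roughly 14.8 MS/s for the loop and 23.3 MS/s for DMA, both comfortably above a 4 MHz clock.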
 
ooo very cool :D

Would you mind showing how you went about doing this?
 


Thanks, I will have a read through.

http://www.electronoob.com/gbasm/lcd/ - I updated my demo, clicking on the green rectangle on right hand side of screen will render next captured frame... It's ugly. :(

It will be interesting to see how using DMA will improve it.

Thanks again!
 
If you need a higher capture rate, you must run DMA in continuous transfer mode. DMA can theoretically be triggered via timers or pin changes, but that has pretty high overhead / latency and won't really get you a higher capture rate than the port read loop.

Here is some code for DMA. It uses single byte transfers from the port register, so effectively does the same as using "GPIOC_PDIR & 0xFF".

Code:
#include <DMAChannel.h>

const size_t capture_count = 10000;

uint8_t buffer[capture_count] __attribute__ ((aligned (16)));
DMAChannel dma;

void setup() {
    Serial.begin(9600);
    delay(2000);
}

void setupDMA() {
    dma.TCD->SADDR = &GPIOC_PDIR ;
    dma.TCD->SOFF = 0;
    dma.TCD->ATTR_SRC = 0; // 1 byte/transfer source
    dma.TCD->SLAST = 0;
    dma.TCD->DADDR = buffer;
    dma.TCD->DOFF = 1;     // 1 byte destination increment
    dma.TCD->ATTR_DST = 0; // 1 bytes/transfer dest
    dma.TCD->NBYTES = capture_count;
    dma.TCD->DLASTSGA = 0;
    dma.TCD->BITER = 1;
    dma.TCD->CITER = 1;

    dma.disableOnCompletion();
}

void loop() {
    setupDMA();

    dma.enable();
    dma.triggerManual();

    // wait for 'capture_count' samples to be captured
    while(!dma.complete()) ;

    // process buffer
}
 
Code:
#include <DMAChannel.h>

const size_t capture_count = 60000;
uint8_t buffer[capture_count] __attribute__ ((aligned (16)));
DMAChannel dma;

void setup() {
    Serial.begin(9600);
    int  pins[] = {9,10,11,12,15};
    for(int i=0;i<5;i++)
      pinMode(pins[i],INPUT);
}


void loop() {
  while(1) {
    dma.TCD->SADDR = &GPIOC_PDIR ;
    dma.TCD->SOFF = 0;
    dma.TCD->ATTR_SRC = 0; // 1 byte/transfer source
    dma.TCD->SLAST = 0;
    dma.TCD->DADDR = buffer;
    dma.TCD->DOFF = 1;     // 1 byte destination increment
    dma.TCD->ATTR_DST = 0; // 1 bytes/transfer dest
    dma.TCD->NBYTES = capture_count;
    dma.TCD->DLASTSGA = 0;
    dma.TCD->BITER = 1;
    dma.TCD->CITER = 1;
    dma.disableOnCompletion();
    dma.enable();
    dma.triggerManual();

    // wait for 'capture_count' samples to be captured
    while(!dma.complete()) ;
    // process buffer
    Serial.write(buffer,capture_count);
  }
}

Thanks a lot for the DMA example code that really helped me out as I was quite lost!

I've since tried it out and I think the problem is actually the time spent writing to the Serial port, which causes chunks of screen data to be missed; the portion of LCD data that is captured seems to have decent integrity otherwise.

I'm not sure what to do at this point.

Certainly open to more suggestions if there are any.
 
I have been watching this thread and learning a bit about DMA which looks interesting.

But before I could give any suggestions, I think I would need to understand the problem better. You say that the data comes in at about 4 MHz, but what I am not sure I heard is how much data there is. Is it continuous or does it come in bursts? If the data is continuous at 4 MHz, I think the USB throughput is far more of an issue than how fast you can read the IO port.

Your call to Serial.write() will surely block, as there is no way you have 60000 bytes available in the output buffer, so it will hard loop waiting for space to become available in the output queue.
All during this time you will not be reading the IO port.

If it were me here are some of the things I would investigate and experiment with.

a) If I am only interested in one IO pin, I would probably pack the data going out over USB. That is, if I pack 8 samples per byte, instead of sending 4 MB of data per second, I would only send 0.5 MB per second.

b) I would probably look at other USB output methods. For example, I believe the Saleae logic analyzer uses a form of USB output called bulk transfer (http://support.saleae.com/hc/en-us/...-interfering-with-other-attached-USB-devices-). I am not sure if that maps to anything we can do with a Teensy 3.2, but I would probably experiment with Raw HID (http://www.pjrc.com/teensy/rawhid.html), which sends 64 bytes at a time. I would also see whether configuring it for a larger size helps.

c) Assuming I stayed with DMA reading of the IO port, I would look into setting up multiple DMA buffers. I would size the DMA transfers to match my Raw HID transfer size (maybe times 8 if I did a). When one DMA transfer is done, the system keeps sampling into the second DMA buffer while I read the data out of the first, pack it into a buffer, and call the Raw HID send to start the output over USB. Repeat when the second DMA buffer has been filled, telling the first to start reading again.

d) Maybe look again at how the DMA reads are working. If it is reading something like 23M samples per second but your data is only 4 MB/s, maybe you can somehow slow down the DMA reads, or compress the data more on the output side.

e) If c) cannot keep up, but the data is of some reasonable size, I would try to add more buffering and compression, let it fill as much memory as I could, and hope I reach the end of the sampling before I run out of memory while the USB output catches up.

I am not sure if any of this helps, but that is what I would look into.

Good Luck
Kurt
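
Suggestion a) could be sketched along these lines. This is a minimal illustration, not tested on hardware: `packBit()` is a hypothetical helper, and it only applies when a single pin of the captured port byte matters.

```cpp
#include <stddef.h>
#include <stdint.h>

// Pack bit number 'bit' (0..7) of each captured port byte into 'out',
// eight samples per output byte, oldest sample in the most significant bit.
// Trailing samples that do not fill a whole byte are dropped.
// Returns the number of bytes written to 'out'.
size_t packBit(const uint8_t* samples, size_t count, unsigned bit, uint8_t* out) {
    size_t written = 0;
    for (size_t i = 0; i + 8 <= count; i += 8) {
        uint8_t packed = 0;
        for (size_t j = 0; j < 8; j++) {
            packed = (uint8_t)((packed << 1) | ((samples[i + j] >> bit) & 1u));
        }
        out[written++] = packed;
    }
    return written;
}
```

Running this on a finished buffer rather than inside the capture loop keeps the sampling itself untouched, at the cost of extra CPU time between captures.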
 
The following pins are required: 9, 10, 11, 12 and 15, and the time needed to shift bits around or mask them is too expensive :( The goal is to read the LCD data from a Game Boy continuously and stream it out.

I had tried to do some compression on the data while capturing without DMA, and it took up too many cycles and made it impossible to capture anything at all. In fact, even just discarding duplicate bytes was enough to be a bottleneck.
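
One way to keep that cost out of the sampling path is to capture raw data first and collapse duplicates afterwards, on the finished buffer. The sketch below shows a simple run-length encoding; `rleEncode()` is a hypothetical helper, and whether it actually shrinks Game Boy LCD data enough has not been measured here.

```cpp
#include <stddef.h>
#include <stdint.h>

// Run-length encode 'in' as (value, runLength) byte pairs; runs are capped
// at 255 so the length fits in one byte.
// Returns the number of bytes written to 'out', or 0 if the output does not
// fit in 'outCap' bytes (an empty input also returns 0).
size_t rleEncode(const uint8_t* in, size_t count, uint8_t* out, size_t outCap) {
    size_t w = 0;
    size_t i = 0;
    while (i < count) {
        uint8_t value = in[i];
        size_t run = 1;
        while (i + run < count && in[i + run] == value && run < 255) {
            run++;
        }
        if (w + 2 > outCap) {
            return 0;  // not enough room for another pair
        }
        out[w++] = value;
        out[w++] = (uint8_t)run;
        i += run;
    }
    return w;
}
```

Note that data with few repeats grows to twice its size in this format, so it only pays off when long runs dominate.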


I tried your idea of reading while writing to serial, but I was unable to make any improvements to the captured data, code below:

Code:
#include <DMAChannel.h>

const size_t capture_count = 25000;

uint8_t buffer1[capture_count] __attribute__ ((aligned (16))) = {};
uint8_t buffer0[capture_count] __attribute__ ((aligned (16))) = {};

DMAChannel dma;

void setup() {
    Serial.begin(9600);
    int  pins[] = {9,10,11,12,15};
    for(int i=0;i<5;i++)
      pinMode(pins[i],INPUT);
}


void loop() {
  bool which = 0;
  delay(2000);
  
  while(1) {
    which = !which;
    dma.TCD->SADDR = &GPIOC_PDIR ;
    dma.TCD->SOFF = 0;
    dma.TCD->ATTR_SRC = 0; // 1 byte/transfer source
    dma.TCD->SLAST = 0;
    
    if (which)
     dma.TCD->DADDR = buffer1;
    else
     dma.TCD->DADDR = buffer0;

    dma.TCD->DOFF = 1;     // 1 byte destination increment
    dma.TCD->ATTR_DST = 0; // 1 bytes/transfer dest
    dma.TCD->NBYTES = capture_count;
    dma.TCD->DLASTSGA = 0;
    dma.TCD->BITER = 1;
    dma.TCD->CITER = 1;
    dma.disableOnCompletion();
    dma.enable();
    dma.triggerManual();

    if (!which)
     Serial.write(buffer1,capture_count);
    else
     Serial.write(buffer0,capture_count);
    while(!dma.complete());
  }
}

Thanks for your thoughts, maybe someone can offer a better way to handle 2 buffers for the DMA transfer.
 
Don't bother with trying different USB things. USB HID is limited to 64 kB/s. USB serial can max out the connection and can get close to 1 MB/s.

At 4 MHz, you are effectively capturing 4 MB/s. Unless there is some massive redundancy / sparseness in the captured data that is trivial to filter out, there is no way to do a continuous real-time capture.

What you could do, is use a Teensy 3.5 or 3.6 and dump the data to an SD card. It has a real hardware SD controller that would be fast enough.
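
The numbers in this post can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the figures from this thread (one byte per sample at 4 MHz, about 1 MB/s for USB serial, 64 kB/s for HID, and the 51200-byte buffer from the first post).

```cpp
#include <stdio.h>

// Seconds of capture a RAM buffer holds at one byte per sample.
constexpr double bufferSeconds(double bufferBytes, double sampleRateHz) {
    return bufferBytes / sampleRateHz;
}

// How many times faster the data arrives than the link can drain it.
constexpr double overloadFactor(double sampleRateHz, double linkBytesPerSecond) {
    return sampleRateHz / linkBytesPerSecond;
}

// Print the budget for the 4 MHz capture discussed above.
void printBudget() {
    printf("51200-byte buffer at 4 MHz lasts %.1f ms\n",
           bufferSeconds(51200, 4e6) * 1e3);
    printf("USB serial (~1 MB/s) is %.1fx too slow\n", overloadFactor(4e6, 1e6));
    printf("USB HID (64 kB/s) is %.1fx too slow\n", overloadFactor(4e6, 64e3));
}
```

So the buffer covers about 12.8 ms of signal, and even the best-case USB serial rate is a factor of four short of a continuous stream.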
 
Sorry to dredge up an older thread - but I have a very similar question. I am trying to do some high speed (10MHz) GPIO to receive data from a parallel ADC on a T3.6 (thread here: https://forum.pjrc.com/threads/4156...l-transfers-from-a-10MSPS-ADC-on-a-Teensy-3-6 ) but in order to make sure my samples are valid I would like to read the port on the rising edge of an external clock. It seems like the DMA code by tni would be a great way to achieve this, but I would need to modify it to perform a DMA transfer from GPIOC_PDIR to the buffer on the rising edge of an interrupt. I haven't found good info on how to set up the DMA channels to do this - any pointers?

Thanks!
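
This does not answer the DMA trigger question. As a software fallback, and only when the free-running capture samples well above twice the clock rate (as with the >11 MHz loop against a 4 MHz clock earlier in the thread), the clock line can be captured along with the data and the valid samples picked out afterwards. The sketch assumes the clock is wired to one bit of the captured port byte; `samplesAtRisingEdges()` and `clockMask` are hypothetical names.

```cpp
#include <stddef.h>
#include <stdint.h>

// Copy into 'out' every raw sample where the clock bit ('clockMask') goes
// from low in the previous sample to high in the current one.
// 'out' must have room for at least count / 2 bytes.
// Returns the number of samples written.
size_t samplesAtRisingEdges(const uint8_t* raw, size_t count,
                            uint8_t clockMask, uint8_t* out) {
    size_t written = 0;
    for (size_t i = 1; i < count; i++) {
        bool wasLow = (raw[i - 1] & clockMask) == 0;
        bool isHigh = (raw[i] & clockMask) != 0;
        if (wasLow && isHigh) {
            out[written++] = raw[i];
        }
    }
    return written;
}
```

For a 10 MHz source this kind of oversampling is unlikely to be reachable, which is why a hardware-triggered transfer is the more natural fit there.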
 
Did you solve this one? I have a similar problem...
 