Fast Digital IO on Teensy 3.6: Ports or Bits?

Status
Not open for further replies.

StanfordEE

Well-known member
I apologize for the length of this posting. I'm kind of doing it for the record, so if nobody reads it, fair enough. I'm still a happy nerd.

The point here is to make a case for fast I/O. Those of you who have written essentially that "pursuing the fastest X" is pointless, or that "parallel I/O is useless, use serial," I beg you not to quickly slam this post, as it is neither naïve nor filled with hubris. This is an effort to make the case that the Teensy 3.6, which is an amazing bit of compute power, is a classic example of the von Neumann bottleneck (throughput limited by I/O bandwidth) for applications like mine: educational test instruments like spectrum analyzers, oscilloscopes, etc. This limitation is almost entirely due to the very long-in-the-tooth Arduino "pinout," which was not designed for the kind of global success the IDE and hardware approachability has created. Hats off to Massimo Banzi, and in particular to the PJRC team. Now we need to see how far we can go and keep the easy/quick/familiar/approachable while cranking the performance. My point is that performance is not just about how fast it clocks, and that seems to be the direction future Teensy boards are heading.

My specific project is an educational FFT analyzer based on the Teensy 3.6 running as fast as possible. The goal is a high-resolution ADC at 1 Ms/s, for a maximum (Nyquist) signal frequency of 500 kHz, but the basic version uses the built-in ADCs of the Teensy 3.6.

Starting with a floating-point FFT and (always) overclocking at 240MHz, it was clear that the Teensy 3.6's computing performance is more than capable for this application. Using the on-board ADC, as far as I can tell regardless of using an outside ADC library (e.g., Pedvide's excellent one), the maximum frequency of an interrupt-driven sampling routine using the IntervalTimer approach is 100 ks/s. I can push it a bit further, but not much, so for keeping the math simple, let's leave it there. The current alpha version uses a custom pcb with a Gameduino 3 display, a quiet analog mid-rail supply (1.65V) and an optional WiFi module. This is the minimalist version, but delivers a solid 60 - 70 dB dynamic range (10 - 12 bit effective number of bits, or ENOB), has a one-channel DDS synthesized sinewave tracking generator, and has color display and touch menus for things like the number of FFT bins and the ADC bits used (to allow learning about quantization noise).

For those who care, with 64 frames of FFT result to allow time averaging (very handy), it becomes memory bound, but without this feature 8192-point transforms are quick. With all of the bells and whistles, (windowing, averaging, simultaneous display of linear and logarithmic voltage traces), I get a respectable 15 frames per second displayed, climbing to 32 with only linear y-axis. Not bad for a COGS well under $100. My final code, I hope, will be self-explanatory through abundant comments and links, and a tutorial at the bottom of the code itself (I am not trying to be some new-age, CS-correct programmer here, so please don't judge me).

Then I looked into using an external ADC, which is where all of my troubles started. The first thing I did, still using the IntervalTimer method, was to just use digitalReadFast to input one pin, shift it into the MSB position (value is 127 or zero) and flow the input into the FFT, then run. I proved that I could run the code at 1 us interrupt period, meaning 1 Ms/s, which is awesome!

I should note here that I have not yet looked into ways of generating precisely timed interrupts faster than that, but surely that is possible.

Images:
[FFT100kHz]
FFT100kHz.jpg

[FFT300kHz]
FFT300kHz.jpg

[FFT490kHz]
FFT490kHz.jpg
Anyway, the first three photos show an input signal at 100 kHz, 300 kHz and 490 kHz (here the Nyquist frequency is 500 kHz). Again, the input is a squarewave, so the Fourier series that "it's made of" has harmonics at 3X, 5X, and all odd numbers upward. The third harmonic of the 100 kHz input signal is clearly visible at 300 kHz at 1/3 the amplitude of the fundamental (expected). A close look will show the fifth harmonic, at 500 kHz, at the right of the screen at 1/5 the amplitude. Great!

You might note that the Fpk value on the display is the computed peak frequency, which matches what is put in. Also good proof that it's working ok.

There is no anti-aliasing filter here, so the harmonics above the Nyquist frequency "fold" back into the 0 to 500 kHz band. So, for the 300 kHz input, the third harmonic at 900 kHz should show up at the sample rate minus the harmonic frequency, or 100 kHz. It does. Cool.
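
For anyone who wants to predict where a given harmonic lands on the display, here is a minimal host-side sketch of the folding arithmetic. It assumes an ideal 50% duty-cycle squarewave and no anti-aliasing filter; the function names are mine, not from the analyzer code.

```cpp
#include <cassert>
#include <cmath>

// Frequency at which a tone at f_hz appears after sampling at fs_hz
// with no anti-aliasing filter (result lies in 0 .. fs_hz/2).
double foldedFrequency(double f_hz, double fs_hz) {
    assert(fs_hz > 0.0 && f_hz >= 0.0);
    double r = std::fmod(f_hz, fs_hz);         // position within one sample-rate span
    return (r > fs_hz / 2.0) ? fs_hz - r : r;  // mirror the upper half back down
}

// Relative amplitude of the n-th harmonic of an ideal 50% duty squarewave
// (fundamental = 1.0): 1/n for odd n, 0 for even n.
double squareHarmonicAmplitude(int n) {
    assert(n >= 1);
    return (n % 2 == 1) ? 1.0 / n : 0.0;
}
```

With a 1 Ms/s sample rate, foldedFrequency(900e3, 1e6) gives 100 kHz, which is the folded third harmonic of the 300 kHz input, and squareHarmonicAmplitude(3) gives the 1/3 relative amplitude seen for the 100 kHz input.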

The last image shows 490 kHz being input, and a complex batch of small peaks due to folding, jitter, etc. In normal use, there would be no frequencies put into the FFT above the Nyquist rate to prevent this, but here (again), I am using a digital input, which is a squarewave, and it is coming from an external signal generator, so it is wandering relative to my sampling interrupt routine.

So 1 Ms/s is possible. Now how to get the data in?

Well, there are many ADC's out there that are a good combination of fast enough, cheap enough and provide a reasonable ENOB so we don't compromise too much there. For those who care, the best dynamic range (in dB) for an ideal ADC with a full-scale sinewave input is 6.02*N + 1.76, where N = the number of ADC bits.
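
As a quick sanity check on bits versus dB, here is a small host-side sketch of that relationship and its inverse (ENOB from a measured dynamic range). It uses the standard ideal-quantizer form 6.02*N + 1.76 dB for a full-scale sinewave; the function names are just for illustration.

```cpp
#include <cassert>

// Ideal dynamic range in dB for an N-bit converter (full-scale sinewave input).
double idealDynamicRangeDb(int bits) {
    assert(bits > 0);
    return 6.02 * bits + 1.76;
}

// Effective number of bits implied by a measured dynamic range in dB.
double enobFromDb(double measuredDb) {
    return (measuredDb - 1.76) / 6.02;
}
```

For example, idealDynamicRangeDb(16) comes out at about 98.1 dB, and enobFromDb(70.0) is roughly 11.3 bits.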

So, how fast can it go in port versus serial mode? First, I measured how fast the port could be read, regardless of pin scrambling, with this code (noting that it did not change the results if the variables were int or long):

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp;

void setup() {
Serial.begin(115200);
}

void loop() {
startTime=micros();
for (int i = 0; i<1000000; i++)
{
temp = GPIOB_PDIR;
}
elapsedTime = micros()-startTime;
Serial.print ("Time for 1000000 port reads in microseconds: ");
Serial.println(elapsedTime);
}

The results are this (snippet), jittering between 29188 and 29189 microseconds for 1 million reads, or 29.2 ns/port read. Nice! So making a quick comparison of SPI (assuming N-bits per N-clocks) over realistic speeds (not clear from what's online how fast the hardware SPI on the Teensy 3.6 can actually go, but I'd bet less than 48 MHz), it is clear that parallel wins after - wait for it - one bit ADC's!
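
To turn the printed microsecond totals into per-read numbers, here is a minimal host-side helper. It is only arithmetic on the values reported above, assuming the 240 MHz overclock; note that the per-iteration time includes the for-loop overhead, not just the port read. The struct and function names are my own.

```cpp
#include <cassert>

struct LoopTiming {
    double nsPerIteration;  // time per loop pass in nanoseconds
    double ratePerSecond;   // loop passes per second
    double cpuCycles;       // CPU clocks per loop pass
};

// Convert a micros() total for a timed loop into per-iteration figures.
LoopTiming analyzeLoop(double elapsedMicros, long iterations, double cpuHz) {
    assert(elapsedMicros > 0.0 && iterations > 0 && cpuHz > 0.0);
    double secondsPerIteration = elapsedMicros * 1e-6 / iterations;
    return { secondsPerIteration * 1e9,
             1.0 / secondsPerIteration,
             secondsPerIteration * cpuHz };
}
```

analyzeLoop(29188, 1000000, 240e6) works out to about 29.2 ns per pass, roughly 34 million reads per second, or about 7 CPU clocks per pass.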

Image:
[von_Neumann_Bottleneck]
von_Neumann_Bottleneck.png

Further, this shows that even at 48 MHz SPI speed, the maximum sample rate for a 16-bit ADC (readily available) is 1 Ms/s. So there, folks, is a basic case for why parallel still has a role in the world. This is just the basic case, and maybe bit-bang SPI could somehow go fast? Maybe some DMA tricks? This is just to make a simple point, so no need to invoke complexity.

So if we can, in theory, read so fast, can we also make a start pulse for an ADC? Let's make pin 33 (not part of port B) the start pulse output and look with a fast oscilloscope (350 MHz)… Here is some simple code to test the idea, first with the shortest output pulse possible:

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp;
int startPulsePin = 33;

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1000000; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
temp = GPIOB_PDIR;
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1000000 port reads in microseconds: ");
Serial.println(elapsedTime);
}


So, what do I see on the oscilloscope? A nice-looking waveform with a bit of as-yet unexplained jitter, as seen in the background (I have infinite persistence on in the scope).

Image:
[Simple_PortIO_Jitter_NoInterrupts]
Simple_PortIO_Jitter_NoInterrupts.png

So what is the timing? The reported elapsed time flips each cycle between 1503 and 503 us, but there is a good reason (please see below). The sample rate (repetition rate, since we are not yet sampling anything) is 11.43 MHz (measured on oscilloscope), which corresponds to an 87.5 ns period. Duh! Interrupts are needed for the elapsed time calculation! Turn them back on, and you get jitter, but a decent time estimate of 83.39 ns, or 12 MHz. Great! Here's the jitter, FYI:

Image:
[Simple_PortIO_Jitter_Interrupts_On]
Simple_PortIO_Jitter_Interrupts_On.png

Just for grins, note that the jitter is about 4 ns and there is also some "missing pulse" jitter, which is the background straight line at 3.3V.

Image:
[Simple_PortIO_Jitter]
Simple_PortIO_Jitter.png

Parallel ADCs (again, parallel-haters, please bite your tongues) are the fastest around, and there are sub-$5 versions that hit 8 bits at speed (e.g., 10 Ms/s). Typically ADC's need a conversion start command and there is a wait period for the data to settle, even if it is only 20 ns for some of the faster ones. Here, if parallel I/O works well, you get your data into the CPU with maybe a couple of cycles to bit-bang the start command, and one or two more to read the port. That's really fast if it works. Unfortunately, on the Arduino and the Teensy 3.6, it's not that simple. First, almost every possible port has pins that are used by common libraries, shields, etc., so with most combos of hardware (or the need for the main serial port), most are unusable as whole bytes. Second, in Paul Stoffregen's kind-hearted effort to painstakingly map the ARM's ports onto wiring-possible pins, the 32-bit internal registers map onto the external physical pins in a scrambled, messy way.

In other words, to read the only port I could use (B, due to hardware/library issues), I had to work around one pin (25) not responding normally to the port commands but working fine with digitalReadFast(25). The work-around to get to speed looks like this (my basic ISR). Note that I have to read the port into a temporary variable (temp1) to avoid reading it multiple times in the line below - ensuring time synchrony.

FASTRUN void ISR(void) {

noInterrupts();
if (SampleNow == true && SampleIndex < FFTbinsM)
{
temp1 = GPIOB_PDIR; //Bits are out of order compared to published mapping. Tested by grounding each, serial port output.
vReal[SampleIndex] = (((temp1 & 0x00030000)>>12)+((temp1 & 0x00000800)>>5)+(temp1 & 0x0000000F))+(digitalReadFast(25)<<7);
SampleIndex++;
}
interrupts();
}

Well, maybe I'm missing something, but this does, in fact, work at 1Ms/s, although it's very kludgy working around the port issues, and will not extend beyond 8-bits due to the other ports being tied up.
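
To make the bit shuffling in that ISR easier to check off the board, here is a host-side sketch of the same mask-and-shift expression as a standalone function. The masks and shifts are copied from the ISR above; the function name and the use of OR instead of addition are mine (the bit fields do not overlap, so the result is the same).

```cpp
#include <cstdint>

// Pack the scattered port B bits, plus pin 25 read separately, into one byte.
constexpr uint8_t unscramblePortB(uint32_t portB, bool pin25) {
    return static_cast<uint8_t>(
        (portB & 0x0000000Fu)            // port bits 0..3   -> sample bits 0..3
      | ((portB & 0x00030000u) >> 12)    // port bits 16, 17 -> sample bits 4, 5
      | ((portB & 0x00000800u) >> 5)     // port bit 11      -> sample bit 6
      | ((pin25 ? 1u : 0u) << 7));       // pin 25           -> sample bit 7
}

// Compile-time check: all eight inputs high gives 0xFF.
static_assert(unscramblePortB(0x0003080Fu, true) == 0xFF, "all eight inputs high");
```

Feeding it one port bit at a time (for example 0x00000800 with pin 25 low gives 0x40) mirrors the "ground each pin and watch the serial output" test described in the code comment.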

So let me test this kludge in the simple non-interrupt-driven timing test shown further above, and let's pretend we are putting the read value into a double precision float, as done for the actual FFT (although the FPU is only single precision).

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp;
int startPulsePin = 33;
double vReal;

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1000000; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
temp = GPIOB_PDIR; //Bits are out of order compared to published mapping. Tested by grounding each, serial port output.
vReal = (((temp & 0x00030000)>>12)+((temp & 0x00000800)>>5)+(temp & 0x0000000F))+(digitalReadFast(25)<<7);
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1000000 port reads in microseconds: ");
Serial.println(elapsedTime);
}


We get a nice waveform on the scope at a 5.996 MHz repetition rate (not bad), with our 32 ns start pulse intact. The timing reported serially, again with interrupts disabled for this purpose, is 166.777 ns per sample, or 6 Ms/s (close enough!).

Image:
[Scope_Kludgy_PortIO_Full_Tilt]
Scope_Kludgy_PortIO_Full_Tilt.png

Now change the code a bit so that it is filling up a 1024-point array of double…

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp;
int startPulsePin = 33;
double vReal[1024];

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1024; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
temp = GPIOB_PDIR; //Bits are out of order compared to published mapping. Tested by grounding each, serial port output.
vReal[i] = (((temp & 0x00030000)>>12)+((temp & 0x00000800)>>5)+(temp & 0x0000000F))+(digitalReadFast(25)<<7);
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1024 port reads in microseconds: ");
Serial.println(elapsedTime);
}

That slows down the sample rate to 5.345 Ms/s. Still awesome. Could the timing be adjusted finely? Not unless there is a "delayNanoseconds()" option out there… :)

So, with my big, clunky and not ready to share FFT program, I put this in the ISR like this, and tested it at 1 us interrupt period (1 Ms/s):

FASTRUN void ISR(void) {
noInterrupts();
if (SampleNow == true && SampleIndex < FFTbinsM)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
temp = GPIOB_PDIR; //Bits are out of order compared to published mapping. Tested by grounding each, serial port output.
vReal[SampleIndex] = (((temp & 0x00030000)>>12)+((temp & 0x00000800)>>5)+(temp & 0x0000000F))+(digitalReadFast(25)<<7);
SampleIndex++;
}
interrupts();
}

I get a nice set of 1024 samples taken at 1 us apart, and a working FFT on the Gameduino 3 screen for a 200 kHz input squarewave to the most significant port pin (the others work too, as expected, at their respective amplitudes - I checked them all). Nice.

Image: [Scope_Using_Kludgy_PortIO]
Scope_Using_Kludgy_PortIO.png

Image: [FFT_Using_Kludgy_PortIO]
FFT_Using_Kludgy_PortIO.jpg

So what about fast pin-mode input by brute force? That way, any available pins could be used. Why not give it a try?

Here it is for 8-bits, using pins 25..32 as inputs:

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp;
int startPulsePin = 33;
double vReal[1024];

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
for (int i = 0; i<8; i++) //use pins 25..32 in this test
{
pinMode(i+25,INPUT_PULLUP); //inputs for ADC using pinmode, MSB = pin 25, LSB = pin 32
}
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1024; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
//Brute force reading of 8 pins...
vReal[i] = (digitalReadFast(25)<<7)+(digitalReadFast(26)<<6)+(digitalReadFast(27)<<5)+(digitalReadFast(28)<<4)+(digitalReadFast(29)<<3)+(digitalReadFast(30)<<2)+(digitalReadFast(31)<<1)+digitalReadFast(32);
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1024 reads in microseconds: ");
Serial.println(elapsedTime);
}

Not bad at all. Get reads into the array at a rate of 2.531 Ms/s. Try a 12-bit, brute-force version using pins 0..11 and get 1.943 Ms/s (still very nice). Here it is with a CRUDE approach to trimming the loop timing, in this case to 1.0129 Ms/s:

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp = 1; //nonzero so the delay loop's (temp*temp)/temp never divides by zero
int startPulsePin = 33;
double vReal[1024];

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
for (int i = 0; i<12; i++) //use pins 0..11 in this test
{
pinMode(i,INPUT_PULLUP); //inputs for ADC using pinmode, MSB = pin 0, LSB = pin 11
}
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1024; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
//Brute force reading of 12 pins...
vReal[i] = (digitalReadFast(0)<<11)+(digitalReadFast(1)<<10)+(digitalReadFast(2)<<9)+(digitalReadFast(3)<<8)+(digitalReadFast(4)<<7)+(digitalReadFast(5)<<6)+(digitalReadFast(6)<<5)+(digitalReadFast(7)<<4)+(digitalReadFast(8)<<3)+(digitalReadFast(9)<<2)+(digitalReadFast(10)<<1)+digitalReadFast(11);
for (int twiddle = 0; twiddle<18; twiddle++)
{
temp = (temp*temp)/temp;
}
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1024 reads in microseconds: ");
Serial.println(elapsedTime);
}

In my case the Gameduino display needs pins 8, 9, 11, 12, 13 and 2, I use pins 31 and 32 for a WiFi module (I'll move these to keep the grouping of pins neat), and it's good to keep pins 0 and 1 (main serial port) free. Other than that, any pins should be fair game for these purposes. So to fully build it out to 16 bits, I go with the following pins, as well as pin 10 as the start pulse:

24 MSbit 15
25 bit 14
26 bit 13
27 bit 12
28 bit 11
29 bit 10
30 bit 9
31 bit 8
32 bit 7
33 bit 6
34 bit 5
35 bit 4
36 bit 3
37 bit 2
38 bit 1
39 LSbit 0
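
As a cross-check of that pin-to-bit table, here is a host-side sketch that treats the table as data and assembles a 16-bit word MSB first. The names kAdcPins and assembleSample are hypothetical, and readPin stands in for whatever reads a pin. My understanding is that digitalReadFast is at its fastest when the pin number is a compile-time constant, so a loop like this may well run slower on the Teensy than a fully written-out expression; treat it as a way to verify the mapping, not as a speed recommendation.

```cpp
#include <cstdint>

// Pins from the table above, MSB first (pin 24 = bit 15 ... pin 39 = bit 0).
constexpr int kAdcPins[16] = {24, 25, 26, 27, 28, 29, 30, 31,
                              32, 33, 34, 35, 36, 37, 38, 39};

// readPin is any callable that takes a pin number and returns 0 or 1.
template <typename ReadPin>
uint16_t assembleSample(const int (&pins)[16], ReadPin readPin) {
    uint16_t value = 0;
    for (int pin : pins) {
        // Shift the bits gathered so far up by one and append the next pin.
        value = static_cast<uint16_t>((value << 1) | (readPin(pin) ? 1u : 0u));
    }
    return value;
}
```

With a stand-in reader that returns 1 only for pin 24, the result is 0x8000; with only pin 39 high it is 0x0001.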

Here is the code:

//Teensy 3.6 port reads speed test, overclocked 240 MHz
//Add in a start pulse for an external ADC.
//Brute force bit reads for 16-bit ADC, but scalable with same pins downward in number of bits.
//G. Kovacs, 1/13/19

long startTime, elapsedTime;
long temp = 1; //nonzero so the delay loop's (temp*temp)/temp never divides by zero
int startPulsePin = 10;
double vReal[1024];

void setup() {
Serial.begin(115200);
pinMode(startPulsePin,OUTPUT); //Our test start pulse for a hypothetical ADC
digitalWriteFast(startPulsePin,HIGH);
for (int i = 0; i<16; i++) //use pins 24..39 in this test
{
pinMode(i+24,INPUT_PULLUP); //inputs for ADC using pinmode, MSB = pin 24, LSB = pin 39
}
}

void loop() {
startTime=micros();
noInterrupts();
for (int i = 0; i<1024; i++)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
//Brute force reading of 16 pins...
vReal[i] = (digitalReadFast(24)<<15)+(digitalReadFast(25)<<14)+(digitalReadFast(26)<<13)+(digitalReadFast(27)<<12)+(digitalReadFast(28)<<11)+(digitalReadFast(29)<<10)+(digitalReadFast(30)<<9)+(digitalReadFast(31)<<8)+(digitalReadFast(32)<<7)+(digitalReadFast(33)<<6)+(digitalReadFast(34)<<5)+(digitalReadFast(35)<<4)+(digitalReadFast(36)<<3)+(digitalReadFast(37)<<2)+(digitalReadFast(38)<<1)+digitalReadFast(39);
for (int twiddle = 0; twiddle<14; twiddle++)
{
temp = (temp*temp)/temp;
}
}
interrupts();
elapsedTime = micros()-startTime;
Serial.print ("Time for 1024 reads in microseconds: ");
Serial.println(elapsedTime);
}

With the "twiddle delay" loop commented out, I get an update rate of 1.657 Ms/s (outstanding!), and by adjusting "twiddle" (CRUDE, I know) to 14, I can get a sample rate of 1.0044 Ms/s.
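
Rather than finding the twiddle count purely by trial and error, the amount of padding needed can be estimated first. This is a rough host-side sketch, assuming the 240 MHz clock and the free-running rates measured above; paddingCycles is a name I made up, and the real loop will still need fine adjustment on the scope.

```cpp
#include <cassert>
#include <cmath>

// CPU clocks of extra delay per sample needed to slow a loop that free-runs
// at nativeRateHz down to targetRateHz.
long paddingCycles(double nativeRateHz, double targetRateHz, double cpuHz) {
    assert(targetRateHz > 0.0 && nativeRateHz >= targetRateHz && cpuHz > 0.0);
    double extraSeconds = 1.0 / targetRateHz - 1.0 / nativeRateHz;
    return std::lround(extraSeconds * cpuHz);
}
```

For the 16-bit case, paddingCycles(1.657e6, 1.0e6, 240e6) gives about 95 clocks per sample, which suggests each pass of the 14-count twiddle loop costs somewhere around 7 clocks.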

Finally, putting this brute-force method into my FFT analyzer code's ISR like this, it gets the job done, with the input pins behaving as expected.

noInterrupts();
if (SampleNow == true && SampleIndex < FFTbinsM)
{
digitalWriteFast(startPulsePin,LOW);
digitalWriteFast(startPulsePin,HIGH);
vReal[SampleIndex] = (digitalReadFast(24)<<15)+(digitalReadFast(25)<<14)+(digitalReadFast(26)<<13)+(digitalReadFast(27)<<12)+(digitalReadFast(28)<<11)+(digitalReadFast(29)<<10)+(digitalReadFast(30)<<9)+(digitalReadFast(31)<<8)+(digitalReadFast(32)<<7)+(digitalReadFast(33)<<6)+(digitalReadFast(34)<<5)+(digitalReadFast(35)<<4)+(digitalReadFast(36)<<3)+(digitalReadFast(37)<<2)+(digitalReadFast(38)<<1)+digitalReadFast(39);
SampleIndex++;
}
interrupts();

This is good enough to use something like the AD7677 (https://www.analog.com/en/products/ad7677.html) or the LTC2393-16 (https://www.analog.com/en/products/ltc2393-16.html). These chips are in the $40 range in 1000-lot, so within reach. Such an FFT analyzer, with appropriate front-end circuits, would be a serious instrument, yet could be open-source and highly educational.

So the moral of this story? Up to 1Ms/s, with an interrupt-driven approach, is very feasible with digitalReadFast() and brute force as demonstrated here. Above that, it depends on the number of bits, but for 16 (my goal), we are limited to about 1.657 Ms/s. For 8 bits, I got 2.531 Ms/s.

Parallel would be good for going faster, and even with kludgy code to "compensate" for Arduino compatibility, it does better, and for 8 bits, all else the same, I got 5.345 Ms/s.

Of course it is the *relative* speed that matters, as you may not be filling an array of floats, for example.

It looks like port reads could hit 12 Ms/s tops, but I did not test this in a direct comparison when bogged down with reading into a float array, index incrementing, making an ADC start pulse, etc., nor with DMA or anything else. My guess is that parallel would give a factor of 5 - 10X improvement. For someone really wanting to push, that would allow maybe using a higher-end 16-bit ADC (e.g. AD7621 at 3 Ms/s, or even faster ones that can exceed 1 Gs/s).

Now, is anyone interested in making a "Teensy 3.6 Pro" version with some uncontaminated ports? If we had that, we could push it to the limit. I was really thinking that would be essential, but I showed that hitting 1 Ms/s, with safety margin, should be feasible with brute-force pin reading. If anyone wants to make the "Pro," I'll put up funds at a reasonable hourly rate, payable on delivery of working boards with assured supply (e.g., PJRC agrees to supply and to provide code support for the ports). Hey, a guy can dream, right?

Thanks for reading. I hope this is useful to at least somebody out there.
 
Neat project! Regarding the Teensy 3.6 "Pro", see this:
https://forum.pjrc.com/threads/53225-Teensy-3-6-quot-Pro-quot-Feedback

I've got some initial boards and am in the process of debugging. Not sure how long it will take - I have a bunch of other irons in the fire too. It also won't be called the Teensy 3.6 "Pro" because I'd like to avoid any mis-interpretation with PJRC, but will be a fully pinned out MK66FX1M0 with the PJRC bootloader.
 
@brtaylor, will there be support added for the extra pins? I take it most PJRC libraries would have issues with the extra pins....
 
That sounds great. Do you know the rough price-point?

Even with the PJRC bootloader, the question would be if and how the "clean" ports would be supported. Is there a plan for that?

Any chance you could estimate the port update speed? In principle it could be pretty fast...
 
I think the approach is to extend the pin numbering (https://forum.pjrc.com/threads/54114-Extended-Pin-Numbering), with a forked Teensy core, and then let the community add support for various ports and libraries from there. I see the pin numbering as a necessary first step so that we're all writing code against a common reference.
 
Roughly in the $40's or $50's.
 
Well, we at the university worked with TI (Robert Wessels) and brought our own DSP shield to life with a fork of the Arduino IDE (Energia: https://energia.nu/). The problem is that forks require a ton of maintenance or they seem to drift off after a few updates of the parent. If PJRC saw fit to incorporate this new pin numbering, with options, into the Teensy core, I think it would be even more awesome. I get the feeling PJRC wants to keep moving with new processors and higher clock speeds though, versus such things.
 