Teensy 3.5 Ultra Fast Digital Input

Not open for further replies.


Probably a question for Paul, but I am hoping others may be able to help me too!
I am trying to use the Teensy 3.5 as an in-circuit TTL tester using up to 20 digital IO pins. I have grouped the pins to use PortC0-11 and PortD0-7. I had thought that at 120 MHz it would be quick enough for my needs, but so far it works fine with TTL in slower circuits and fails in faster ones.
So I dug out my logic analyser and started doing some timing tests using an extra digital pin to indicate start/finish points.

static uint16_t PrtCDat[100] __attribute__ ((aligned (16)));
static uint16_t PrtDDat[100] __attribute__ ((aligned (16)));
void Capture1()
{
  digitalWriteFast(33, HIGH);   // marker pin high: start of capture
  for (int i=0; i<100; i++)
  {
    PrtCDat[i] = GPIOC_PDIR;
    /* PrtDDat[i] = GPIOD_PDIR;*/
  }
  digitalWriteFast(33, LOW);    // marker pin low: end of capture
}

This code takes 5.06us, but if I take out the comment markers it only takes an extra 0.82us. Why? Surely there is not that much overhead for the loop... I had been expecting the time to nearly double.

I have also tried using DMA, and if only sampling a single port I can get the time down to 4.44us, but however I try to trigger 2 DMA channels the time almost doubles to 8.62us. I also found that it was quicker to do the transfer in 1 trigger with NBYTES set to the buffer size (200) rather than setting NBYTES to (2) and BITER/CITER to (100).

Am I missing something obvious here? Does anybody know of a quicker way?

In an ideal world it would have been great to have all 20 bits available on a single port, but it seems the most I can get is 15 on Port D if I use the surface pads on the bottom.
I am not sure what you are trying to accomplish but you could unroll your loop. Write
PrtCDat[i++] = GPIOC_PDIR; 100 times.
I didn't think to try that! Just did and the time it takes to read both ports 100 times is down to 1.69us or 17ns per time :)
In the actual program I need to read them 40,000 times, so I am not sure how the compiler will handle it. A curious thing is that I am using the 'Fastest' optimization option, and the test case of 100 reads is only 448 bytes larger than a single read.
I have tried 'for', 'while' & 'do while' loops, but all are significantly slower than unrolled code. This would point to poor code generation by the compiler and the lack of a cache on the Teensy 3.5. Right, I am going to spend the next 15 minutes pasting in the editor to get my 40,000 pairs of reads!
You may want to try the pointer version
*p++ = GPIOC_PDIR;
to see if there is any difference. These new compilers are very good, so there may not be a difference.
You are assigning a 32-bit register to a 16-bit variable? Is this on purpose?
Yes, as I am only interested in the lower 16 bits of each port.
I did try using a pointer version, it basically gains me around 0.5ns when reading both ports.
The big trouble I have is that the compiler appears to be crashing with anything other than 'debug' optimizations and when using 'debug' the code uses 98% of flash memory! I can get a simple 40,000 test compiling using 'fastest' and it is about half the size, but as soon as I add any proper processing the compiler fails!
My biggest question after playing with all of this is, Why Are Loops So Slow When Compiled ???????
I think luni may have been suggesting that using native 32-bit reads and writes may be faster than 16-bit ones. The data sheet I have downloaded is very vague about pipelines in the architecture, but I would guess that, in addition to the loop needing to test the completion condition and then branch, any pipeline would be flushed on the branch and you lose any speed benefit it provides.
I did understand what he was getting at, but I had already tried that and found no difference in speed. I do have a Teensy 3.6 here which is clocked faster, and I think I read somewhere that it has a cache where the 3.5 doesn't. I may give it a try, although it's not 5V tolerant and as such will not be any good for this project!
I quickly checked the assembly output of the following code:

static unsigned PrtCDat[100];
static unsigned PrtDDat[100];

void Capture1()
{
    for (int i = 0; i < 100; i++)
    {
        PrtCDat[i] = GPIOC_PDIR;
     //   PrtDDat[i] = GPIOD_PDIR;
    }
}

The loop generates the following output (-O2). R1 holds the address of GPIOC_PDIR, R3 is the loop counter (used as a byte offset into the array).
Looks pretty optimal.

    4a6:	ldr	r2, [r1, #0]
    4a8:	str	r2, [r0, r3]
    4aa:	adds	r3, #8
    4ac:	cmp.w	r3, #400	
    4b0:	bne.n	4a6 

Here is the output if I uncomment the GPIOD line:

     4aa:	ldr	r1, [r4, #0]
     4ac:	ldr	r2, [r0, #0]
     4ae:	str	r2, [r5, r3]
     4b0:	str	r1, [r6, r3]
     4b2:	adds	r3, #8
     4b4:	cmp.w	r3, #400	
     4b8:	bne.n	4aa

Unrolling the loop will increase the speed, but it doesn't make sense to unroll all 40,000 reads. If you unroll, say, 20 iterations (each reading both ports), you will have 80 ldr/str operations plus a loop overhead of 3 operations.
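A partial unroll along those lines could be sketched roughly as below. This is only an illustration, not code from the thread: `fakePortC` and `fakePortD` are stand-ins for GPIOC_PDIR and GPIOD_PDIR so that the snippet also builds on a PC, and `REP5`, `REP20` and `captureBlocks` are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for GPIOC_PDIR / GPIOD_PDIR (assumption, so this builds off-target).
// On a Teensy, read the real port registers instead.
static volatile uint32_t fakePortC = 0;
static volatile uint32_t fakePortD = 0;

// REP20(x) pastes x twenty times in a row.
#define REP5(x)  x x x x x
#define REP20(x) REP5(x) REP5(x) REP5(x) REP5(x)

// Capture `samples` pairs of port reads, 20 pairs per loop iteration,
// i.e. 40 loads + 40 stores between two loop-overhead sequences.
// `samples` must be a multiple of 20 (40000 is).
void captureBlocks(uint16_t *c, uint16_t *d, size_t samples) {
    assert(samples % 20 == 0);
    for (size_t n = samples / 20; n != 0; n--) {
        REP20(*c++ = static_cast<uint16_t>(fakePortC);
              *d++ = static_cast<uint16_t>(fakePortD);)
    }
}
```

Calling `captureBlocks(bufC, bufD, 40000)` would then run 2,000 loop iterations instead of 40,000. Whether the sample spacing is even enough across the loop branch would still have to be checked on a logic analyser.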
PS: I'm not sure if your 16bit aligned array is legal, I think you are supposed to access memory at 32bit boundaries. tni will definitely know that....
Your "benchmark" doesn't measure anything useful. It gets compiled to:

   e:	3b01      	subs	r3, #1
  10:	6811      	ldr	r1, [r2, #0]
  12:	d1fc      	bne.n	e <Capture1()+0xe>

With both port reads:
  10:	3b01      	subs	r3, #1
  12:	6808      	ldr	r0, [r1, #0]
  14:	6810      	ldr	r0, [r2, #0]
  16:	d1fb      	bne.n	10 <Capture1()+0x10>

So only the load from the port register is performed, the store to the array is completely optimized away. GCC is very conservative with loop unrolling, you need to add something like "-funroll-loops --param max-unroll-times=20" to the compiler options.
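One way to keep the stores in such a benchmark is to make the buffer contents observable to the compiler. The sketch below uses an empty inline-asm barrier, which is GCC/Clang-specific. `fakePortC` is a stand-in for GPIOC_PDIR so the snippet builds off-target, and `keepStores` and `captureKept` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for GPIOC_PDIR (assumption, so the sketch builds on a PC).
static volatile uint32_t fakePortC = 0;

// GCC/Clang-specific compiler barrier: the optimizer must assume that the
// memory reachable through `p` is read here, so earlier stores to it
// cannot be dropped. It emits no instructions of its own.
static inline void keepStores(const void *p) {
    asm volatile("" : : "r"(p) : "memory");
}

// Fill `dst` with `count` samples of the (stand-in) port register.
void captureKept(uint16_t *dst, int count) {
    assert(dst != nullptr && count > 0);
    for (int i = 0; i < count; i++) {
        dst[i] = static_cast<uint16_t>(fakePortC);
    }
    keepStores(dst);  // place this before the "end of capture" pin toggle
}
```

Simply using the captured data later (printing it, for instance) should have the same effect. The barrier just makes the intent explicit in a timing test.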

If you need completely stable sample times, you need to use assembly code.

In the ideal case, back-to-back str / ldr instructions execute in a single clock cycle for the additional str / ldr, so the cycle count goes from 6 to 7 per iteration for the 1-port vs. 2-port version.
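As a rough cross-check, those cycle counts can be compared with the timings reported in the first post. The sketch assumes the 120 MHz clock mentioned there and ignores the pin toggles and function entry/exit; `loopTimeUs` is a made-up helper.

```cpp
// Time in microseconds for `iterations` loop passes of `cyclesPerIter` CPU
// cycles each, at a clock of `clockMHz` (MHz = cycles per microsecond).
constexpr double loopTimeUs(int cyclesPerIter, int iterations, double clockMHz) {
    return cyclesPerIter * iterations / clockMHz;
}

// 100 iterations at 120 MHz:
//   6 cycles/iteration -> 600 cycles -> 5.00 us     (measured: 5.06 us)
//   7 cycles/iteration -> 700 cycles -> about 5.83 us
//                                       (measured: 5.06 + 0.82 = 5.88 us)
static_assert(loopTimeUs(6, 100, 120.0) == 5.0,
              "600 cycles at 120 MHz take 5 us");
static_assert(loopTimeUs(7, 100, 120.0) > 5.83 &&
              loopTimeUs(7, 100, 120.0) < 5.84,
              "700 cycles at 120 MHz take about 5.83 us");
```

The small remainder in both measurements is plausibly the fixed cost of the two pin writes and the loop setup, though that has not been verified here.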


DMA is slower than direct reads with the CPU. Major loops have more overhead (the DMA controller updates TCD values and performs arbitration checks for each iteration) than minor loops.


Port B has 16 pins available. They are not contiguous, you would need to read the whole 32-bit port and filter out the unnecessary stuff later.
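Filtering the raw 32-bit Port B samples after the capture might look roughly like this. The bit layout used (PTB0-5, PTB10-11, PTB16-23) is an assumption for illustration and should be checked against the Teensy 3.5 schematic; `packPortB` is a made-up helper.

```cpp
#include <cstdint>

// Pack the 16 usable bits of a raw 32-bit Port B sample into a contiguous
// 16-bit value. Assumed layout (verify against the schematic):
//   raw bits  0..5   -> result bits 0..5
//   raw bits 10..11  -> result bits 6..7    (shift right by 4)
//   raw bits 16..23  -> result bits 8..15   (shift right by 8)
constexpr uint16_t packPortB(uint32_t raw) {
    return static_cast<uint16_t>((raw & 0x003Fu) |
                                 ((raw >> 4) & 0x00C0u) |
                                 ((raw >> 8) & 0xFF00u));
}

static_assert(packPortB(0xFFFFFFFFu) == 0xFFFF, "all usable bits set");
static_assert(packPortB(0xFF00F3C0u) == 0x0000, "unused bits are dropped");
static_assert(packPortB(1u << 10) == 0x0040, "PTB10 lands in bit 6");
```

Storing the raw 32-bit values during the capture and packing them afterwards keeps the time-critical loop down to a plain load and store.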
PS: I'm not sure if your 16bit aligned array is legal, I think you are supposed to access memory at 32bit boundaries. tni will definitely know that....
Those arrays are 16-byte aligned, not 16-bit. Cortex M4 supports unaligned access for the most part (except for the boundary between the 2 SRAM regions; there are 2 memory controllers).
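The byte-versus-bit distinction can be checked directly. A small sketch using the same attribute as the arrays in the first post (GCC/Clang syntax); `isAligned` and `captureBufferIsAligned` are made-up helpers.

```cpp
#include <cassert>
#include <cstdint>

// Same declaration style as in the first post. The argument of aligned()
// is given in bytes, so this asks for a 16-byte (128-bit) boundary.
static uint16_t PrtCDat[100] __attribute__((aligned(16)));

// True if address `p` is a multiple of `bytes`.
bool isAligned(const void *p, uintptr_t bytes) {
    assert(bytes != 0);
    return reinterpret_cast<uintptr_t>(p) % bytes == 0;
}

// The array start sits on a 16-byte boundary; its individual uint16_t
// elements only need (and get) 2-byte alignment.
bool captureBufferIsAligned() {
    return isAligned(PrtCDat, 16) && isAligned(&PrtCDat[1], 2);
}
```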
Thanks, there is always something to learn here :)

BTW: In case somebody is confused: In my code in #9 there is a stray i++ in the loop which leads to the "adds r3, #8" instead of the correct "adds r3,#4".
Thanks for all the useful info/suggestions. Funny thing is now I am having trouble getting the same response times that I had earlier when first trying unrolling the test 100 iterations loop.
I did realise Port B had 16, but this includes pins 0 & 1, which I believe are used for the Arduino serial monitor that is handy for generating debug msgs. Or have I got this wrong?
I believe Port D has 15 pins which would be enough for 14 & 16 pin TTL chips, but I would like to have done 20 pin chips.
I have looked at the Arduino temp folder but it doesn't seem to leave a copy of the asm file, only the object & hex files, so I presume I would have to run the GCC compiler manually to check the assembler.
I have even used a loop variable that decrements to 0, as in theory this would save the 'cmp' instruction, but I guess this depends on the optimiser.

I did have the code working quite well on some 74LS86's & 74LS32's but when I went to check some 74LS86's that were in the sprite generator circuit of the arcade pcb I was testing, it became obvious I was not reading the pins fast enough. From the datasheet I believe the maximum settle time for the 74LS86 is 30ns, so in reality I should really be aiming for 10-15ns sample time. Perhaps I will have to look at using a different board, but very few are available that have good barebones support and enough digital IO to also deal with LCD's & button controls.
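To put the 10-15ns target into CPU terms, the sample period can be converted into a cycle budget. The sketch below uses the 120 MHz figure from the first post. The 180 MHz figure for the Teensy 3.6 is an assumption (its usual default clock), and `cyclesPerSample` is a made-up helper.

```cpp
// Number of CPU cycles available per sample for a sample period in ns
// at a clock given in MHz (one cycle lasts 1000 / clockMHz nanoseconds).
constexpr double cyclesPerSample(double periodNs, double clockMHz) {
    return periodNs * clockMHz / 1000.0;
}

// 120 MHz (Teensy 3.5): 15 ns -> 1.8 cycles, 10 ns -> 1.2 cycles
// 180 MHz (assumed Teensy 3.6 clock): 15 ns -> 2.7 cycles, 10 ns -> 1.8 cycles
static_assert(cyclesPerSample(15.0, 120.0) > 1.79 &&
              cyclesPerSample(15.0, 120.0) < 1.81, "about 1.8 cycles");
static_assert(cyclesPerSample(10.0, 180.0) > 1.79 &&
              cyclesPerSample(10.0, 180.0) < 1.81, "about 1.8 cycles");
```

At 120 MHz that is less than two cycles per sample, compared with the 6 to 7 cycles per loop iteration quoted earlier in the thread.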
If you just want to quickly check the generated code for a construct, you can use the Compiler Explorer https://godbolt.org/
Select ARM GCC 5.4 for the compiler and "-mthumb -mcpu=cortex-m4 -O3" for the compiler switches.

It compiles and shows the assembler code while you type. It is quite interesting to watch how the code changes if you play for example with the optimizer setting...

Thanks, that's a very useful link. This morning though my Teensy 3.5 decided to die on me, after only 2 days of use! It won't ID as a Windows device and it gets very hot near the USB connector. Is this common? Never really had any issues with the 3.2's.
Anyway, I've swapped to the 3.6 and had to make a new board with level shifters. I've decided to only use 15 pins initially, so these are all PortD. I've decided to just use a single read in a loop as this should give a consistent sample time.
Re pins 0 and 1, they are used for Serial1, a hardware serial port which you can probably live without if full port width is critical. Classic 'Serial' is USB and has its own pins on the Teensy that don't have IO assignments. The need to keep clear of pins 0 and 1 is for the Uno and other ATmega328 boards using an FTDI USB converter.