Accessing I/O ports using C commands in Teensy 3.2, 4.0

Status
Not open for further replies.
Hi,

Sorry for the noob questions, but I couldn't find the answer. Can the I/O pins on Teensy 3.2 and Teensy 4.0 be accessed using C commands? Are they grouped in ports?
 
Code:
  digitalWrite(pin_number,pin_state);

E.g.
Code:
  digitalWrite(10,HIGH);

They can be grouped in ports and accessed directly but unless you really need simultaneous access, it is easier to just use digitalWrite.

[EDIT] And some pins can be analog so you would use analogRead(pin_number) to read an ADC or analogWrite(pin_number,value) to use PWM on a PWM-capable pin.

Pete
 
Thanks for your answer, but it's not what I wanted to know. I'll try to be more specific.

digitalWrite() is an Arduino function that sets the state of a specified pin. On Arduino, which is based on the ATmega, you can do the same in C:
PORTx &= ~_BV(Pxy); to set low
PORTx |= _BV(Pxy); to set high
where x is the port letter and y is the pin number.

Of course it's easier to use the digitalWrite() function, but it's also much slower. On Arduino, setting a pin state with digitalWrite() takes about 10 times longer than doing the same in C. I assume it's the same on Teensy (correct me if I'm wrong).
I need simultaneous access to a group of pins to save even more time. If all pins in a port are low, I don't have to waste time checking them separately.
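
The skip-if-all-low idea can be sketched in plain C. This is only an illustration, not code from any core library: `port_value` stands in for whatever a single port read returns, and `collect_high_pins` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given one 8-bit port read, list which bits are high.
   Returns the number of high bits; their indices are written to out[]. */
static size_t collect_high_pins(uint8_t port_value, uint8_t out[8]) {
    assert(out != NULL);
    size_t count = 0;
    if (port_value == 0) {
        return 0;               /* whole port is low: one compare, no per-pin checks */
    }
    for (uint8_t bit = 0; bit < 8; bit++) {
        if (port_value & (uint8_t)(1u << bit)) {
            out[count++] = bit; /* only now look at individual pins */
        }
    }
    return count;
}
```

The early return is where the time is saved: when nothing on the port is high, the per-pin loop never runs.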

In the project I'm currently working on, I need to check the states of 264 shift register input pins, and it has to be done in less than 10 µs. When I connected all the shift registers in a row and programmed an Arduino to check every pin separately using only digitalRead() and digitalWrite(), it took about 3.5 ms (3500 µs), which is way too long. After a few modifications (replacing all digitalWrite() and digitalRead() calls with C commands, reading whole ports instead of single pins, and reading single pins only when needed) I got the time down to approximately 40-50 µs. That's still too much, which is why I decided to use a Teensy instead of an Arduino. Unfortunately I'm totally unfamiliar with Teensy and I hope someone can help me with it.
 
Well you might want to first try out:
Code:
digitalWriteFast()
digitalReadFast()

They are special inline functions built into the Teensy core that do what you describe: they skip the Arduino overhead, like turning off interrupts for a moment and so on, and just read/write straight from/to the GPIO register!
In my experience they do their job within a dozen nanoseconds at most.

But if you want to get into the nitty-gritty of reading multiple I/O pins at once, or exploiting some of the Teensy's more advanced hidden features, the T4's reference manual (https://www.pjrc.com/teensy/IMXRT1060RM_rev2.pdf) shows all the registers you can access, giving a great deal of (if not overwhelming) control, especially once you start to understand the IOMUX. For using multiple GPIO pins simultaneously on the Teensy 4, it mostly comes down to the following 32-bit registers:
GPIOx_GDIR (direction: input/output), GPIOx_PSR (input) and GPIOx_DR (output).


In my project, for example, I use the fact that the Teensy 4.1 exposes the last 16 consecutive pins of a single group (GPIO6), so I can read/write 16 bits in a single hit. My code looks a bit like this:
Code:
// For reading:
GPIO6_GDIR &= ~0xFFFF0000;   // Set the upper 16 pins of GPIO6 as inputs (note the invert ~)
uint16_t DATA_IN = GPIO6_PSR >> 16;  // Sample and shift the bits down to a 16-bit value

// For writing (DATA_OUT is a 16-bit value):
GPIO6_GDIR |= 0xFFFF0000;  // Set as outputs
GPIO6_DR &= ~0xFFFF0000;  // Clear the upper 16 bits to 0
GPIO6_DR |= (uint32_t)DATA_OUT << 16;  // Write the new value
// Alternatively there are also GPIO6_DR_SET, GPIO6_DR_CLEAR and GPIO6_DR_TOGGLE for changing outputs; you can guess what they do



It just takes a bit of effort to work out which pin belongs to which GPIO group ^^;
Oh, and remember that Teensy by default uses the alternative registers (e.g. instead of GPIO1 it uses its alt: GPIO6).
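
For anyone wondering how the SET/CLEAR style of register differs from the read-modify-write lines in the post, here is a rough host-side sketch. Everything with a `sim_` prefix is invented for illustration: `sim_dr` is an ordinary variable playing the role of GPIO6_DR, and the two small functions only model what a write to a set or clear register is assumed to do.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated output data register (stand-in for GPIO6_DR on real hardware). */
static uint32_t sim_dr;

static void sim_dr_set(uint32_t mask)   { sim_dr |= mask; }   /* models a write to DR_SET */
static void sim_dr_clear(uint32_t mask) { sim_dr &= ~mask; }  /* models a write to DR_CLEAR */

/* Put a 16-bit value on bits 16..31 and leave bits 0..15 untouched. */
static void write_upper16(uint16_t value) {
    uint32_t ones  = (uint32_t)value << 16;   /* bits that must end up high */
    uint32_t zeros = ~ones & 0xFFFF0000u;     /* bits that must end up low */
    assert((ones | zeros) == 0xFFFF0000u);    /* the two masks cover the upper half */
    sim_dr_clear(zeros);
    sim_dr_set(ones);
}
```

Note that this is still two separate writes, so the 16 pins do not all change in the same instant.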


(PS: Yeah, the closer-to-the-metal stuff like the GPIO registers, or other very nifty features like the MANY different interrupts these beasts of chips can handle, is not well documented on the main site. I kind of wish there were more in-depth resources so you didn't have to scour the forums.)
 
Thank you all for your answers. It helped me a lot.

I'll try digitalWriteFast() and digitalReadFast()

On Teensy 3.2, using digitalWriteFast(x) with x= constant takes !one! cycle.

That sounds really promising. As far as I know, digitalRead() and digitalWrite() take about 50 cycles. That's a huge difference.

(PS: Yeah, the closer-to-the-metal stuff like the GPIO registers, or other very nifty features like the MANY different interrupts these beasts of chips can handle, is not well documented on the main site. I kind of wish there were more in-depth resources so you didn't have to scour the forums.)

Sooo true.
 
Note this is not for Teensy4.
But T3.x digitalWriteFast/digitalReadFast with pin number=constant is unbeatable :)
 
Note this is not for Teensy4.

Ohhh... But can I still use digitalReadFast() and digitalWriteFast() on Teensy 4? Since Teensy 4 is 600Mhz and Teensy 3 is only 72Mhz, I'm definitely gonna use Teensy 4.

Is it a built in function or do I need a library?
 
I don't believe that the I/O runs at 600MHz.
Bye the way the H in Hz should be uppercase.
All units derived from a human name ie Hertz, Newton, Ampere, always have their first letter capitalised.
mm, m for instance are not derived from an actual person's name so are in lower case.
 
I don't believe that the I/O runs at 600MHz.
I don't know how fast direct access to I/O ports is on Teensy 4, but digitalWriteFast() runs at 150MHz.
See here. It's two times more than 72MHz.
Bye the way the H in Hz should be uppercase.
All units derived from a human name ie Hertz, Newton, Ampere, always have their first letter capitalised.
mm, m for instance are not derived from an actual person's name so are in lower case.
I know that. I studied physics. I know how to write units but everyone can make an error. Right?
Speaking of errors. Do you know, that 'by' in 'By the way' is spelled BY not BYE?
Of course you do. Just like I said, everyone can make an error. :D
 
All units derived from a human name ie Hertz, Newton, Ampere, always have their first letter capitalised.
But only in the abbreviated form... Hz stands for hertz, N for newton, A for ampere. This distinguishes the unit from
the person (as in "the newton (N) is named after Newton").

There is one sort-of exception: "degrees Celsius" is written with a capital C. And while on the subject, the unit
of absolute temperature is the kelvin, never "degrees Kelvin", as it's a fundamental unit.

The kelvin is also an oddball in that it is commonly used as its own plural, presumably by analogy with the other
temperature units (we say "30 centigrade" or "47 Fahrenheit" all the time as short for "30 degrees centigrade", etc).
I'm sure you are supposed to say "17.5 kelvins" but I rarely hear that.
 
Ohhh... But can I still use digitalReadFast() and digitalWriteFast() on Teensy 4? Since Teensy 4 is 600Mhz and Teensy 3 is only 72Mhz, I'm definitely gonna use Teensy 4.

Is it a built in function or do I need a library?

The Fast functions work on both the Teensy 3 and Teensy 4 and are integrated in the Teensyduino core, so you don't need a library. The moment you select a Teensy 3/4 board, it'll compile.


It should be noted that the Teensy 4's GPIO runs a little slower than the core's 600 MHz: it has a separate, slower clock for the GPIO section (max 200 MHz), while the Teensy 3 GPIO clocks at the same speed as the core (72 MHz). Some things may also take a couple of cycles on the Teensy 4 (e.g. sampling the pins takes 2 cycles), and I'd wager the two aren't that far apart from each other. So the Teensy 4 should be faster for I/O, but not by a lot. It is WAY faster at handling the processing in between actions, though!
 
Note this is not for Teensy4.
But T3.x digitalWriteFast/digitalReadFast with pin number=constant is unbeatable :)

Sorry, I'm not sure why the distinction? For example, if you do something like: digitalWriteFast(10, HIGH);

It should compile down to: CORE_PIN10_PORTSET = CORE_PIN10_BITMASK;
Which for Teensy 4 is: GPIO7_DR_SET = (1<<(CORE_PIN10_BIT)) where CORE_PIN10_BIT = 0

So again setting one register to a constant value...
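
To make the "compiles down to one register write" point a bit more concrete, here is a simplified host-side model. It is not the actual Teensyduino source: the `sim_` names are invented, the two variables merely stand in for GPIO7_DR_SET and GPIO7_DR_CLEAR, and only pin 10 is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the memory-mapped registers; on real hardware these would be
   GPIO7_DR_SET and GPIO7_DR_CLEAR rather than ordinary variables. */
static volatile uint32_t sim_gpio7_dr_set;
static volatile uint32_t sim_gpio7_dr_clear;

#define SIM_PIN10_BIT     0
#define SIM_PIN10_BITMASK (1u << SIM_PIN10_BIT)

/* Simplified model of a "fast" write. When pin and val are compile-time
   constants and the call is inlined, the compiler can drop the untaken
   branches and keep a single store. */
static inline void sim_digitalWriteFast(uint8_t pin, uint8_t val) {
    assert(pin == 10);  /* this sketch models only pin 10 */
    if (pin == 10) {
        if (val) {
            sim_gpio7_dr_set = SIM_PIN10_BITMASK;
        } else {
            sim_gpio7_dr_clear = SIM_PIN10_BITMASK;
        }
    }
}
```

With a non-constant pin number the branches have to stay in the generated code, which is why the thread keeps stressing "pin number = constant".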

Again, I'm not sure there are many ways to improve on that for a single register set...

Side note: there is also digitalToggleFast, which likewise boils down to a single instruction for constant pin numbers.


As to where to find out more information:
Every version of Teensyduino installs the source code for all of this, so browsing through the cores (and libraries) installed is a good place to look.

Also, each of the Teensy boards has a page on pjrc.com... Like the T4: https://www.pjrc.com/store/teensy40.html#tech
On that page there are links where you can download the reference manual for the board's processor, which has sections that describe the registers and the like.

Hope that helps
 
Code:
 pinMode(0,OUTPUT);
 for(;;) {
  digitalWriteFast(0,1);
  digitalWriteFast(0,0);
 }
You're right. I see 150MHz on the scope.
Code:
      70:    f8c2 3084     str.w    r3, [r2, #132]    ; 0x84
                CORE_PIN0_PORTCLEAR = CORE_PIN0_BITMASK;
      74:    f8c2 3088     str.w    r3, [r2, #136]    ; 0x88
      78:    e7fa          b.n    70 <setup+0x10>
So, yes one cycle.
 
However, the branch prediction doesn't seem to work here? 600 / 3 should be 200MHz. Hm....
And if you take the prerequisites into account, it's a little more than one cycle :) - OK, on both.
Code:
      6a:    f04f 4284     mov.w    r2, #1107296256    ; 0x42000000
      6e:    2308          movs    r3, #8

Eh, no, all is good I think.
The first store needs 2 cycles, the second needs one, if I remember correctly.
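
The arithmetic behind the scope reading can be written out as a tiny helper. This is just a worked calculation, not a measurement: `toggle_freq_mhz` is a made-up name, and the 2 + 1 + 1 split assumes the branch accounts for the one remaining cycle, which nobody in the thread actually states.

```c
#include <assert.h>
#include <stdint.h>

/* Square-wave frequency of a loop that outputs one full high/low period per
   pass, given the core clock and the number of cycles each pass takes. */
static uint32_t toggle_freq_mhz(uint32_t core_mhz, uint32_t cycles_per_pass) {
    assert(cycles_per_pass > 0);
    return core_mhz / cycles_per_pass;
}
```

toggle_freq_mhz(600, 3) gives the 200 that was expected from three one-cycle instructions, while toggle_freq_mhz(600, 2 + 1 + 1) gives the 150 seen on the scope.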

BUT I don't remember who stated it's one cycle? Then that's not the case for the 3.x either... so that was wrong info. It's only true in a special sequence: if the instruction before is a "str", too. So, who said this? It's wrong in most cases. Almost all. Must be *many* years ago..


MoNoImPoSm, sorry. Someone must not have written the whole truth. All these years I thought "one cycle" was true. But... it's true only for a *very* special case. I didn't think about it all these years.
Overall, it's just wrong.

In almost all "normal" cases (= no extremely tight loop, no consecutive digitalWriteFast) a single digitalWriteFast takes 3 or 4 cycles.
 
The short & simple answer is digitalWriteFast() and digitalReadFast() are indeed the most optimal way for a single pin, if the pin number is a constant.

The name "digitalWrite" has a sort of bad reputation due to a lot of overhead in Arduino's AVR core library. Don't let that scare you. These "Fast" versions truly do result in the most optimized code possible (for single pin use). In fact, rapid use of digitalWriteFast() on Teensy 4.0 & 4.1 is too fast for the voltage to even fully change, unless you do special configuration to turn off the slew rate limit feature of the pin (which results in massive increase of high frequency noise and other bad effects if you use ordinary wires without careful RF design - slew rate limit is a really good thing for most applications).

The long and complicated answer is a lot of factors go into how many CPU cycles are actually used. As a general rule, processor designs which allow for faster overall code execution tend to result in less deterministic speed for any individual piece of code. Cortex M7 is a very complex processor, with a 6 stage pipeline, 2 integer execution units, branch prediction, instruction and data caches, etc. Usually a store instruction which actually writes to GPIO6 to GPIO9 (which are on a special low-latency I/O bus) takes 2 cycles. But the surrounding code can have an effect. It's very difficult to predict. Often M7 can even execute 2 instructions in the same clock cycle, but sometimes things can stall the pipeline and you get multiple cycles.

Another complexity is the store instruction depends on 2 registers, for the address of the GPIO register and the data to write. Very often the compiler will optimize your code to push these outside of loops. But it all depends on the surrounding code. If the compiler needs to allocate registers to other uses, it could have to add 2 or more instructions to initialize the registers just for the store instruction. The details are complicated and the surrounding code matters.

But the simple answer is digitalWriteFast() and digitalReadFast() do result in the most optimal single-pin access code, when the pin number is a constant known to the compiler, even if the actual number of cycles it will take depends on a lot of complicated factors.
 
That's true. It's the fastest way.
But thank you for mentioning the pipeline. It has 6 stages. So, worst case, there will be an additional 6-cycle lag. That means the actual write can happen (worst case) 6 cycles *after* the store.
Have I got this right?
So, for really really fast things, you have to take this into account.


Edit: No, that's not the worst case. I had not thought about other things that can stall this... bus accesses...
and, *absolutely worst*, an interrupt (<- not pipeline, but if it happens after the register loads and before the store)

The "dual issue" (= execute 2 instructions in one cycle) does not apply in the case of a store. And I think not for literal register loads either.
 
The "dual issue" (= execute 2 instructions in one cycle) does not apply in the case of a store.

This is a deep rabbit hole of "it depends".

The bus to DTCM is dual 32 bit, capable of performing two 32-bit memory accesses simultaneously. But even there, some limits may apply, perhaps even / odd word alignment?

AXI bus is 64 bits. I believe M7 may be capable of combining two store instructions to consecutive 32-bit locations, at least for "normal" memory. The rules are probably different for "strongly ordered" memory.

But going over the bridges to peripheral buses, it's probably 32 bits at a time.

I know almost nothing about the special low-latency I/O bus (address range starting at 0x42000000 for GPIO6-9 & FlexIO3), but I'd be pretty surprised if it's more than 32 bits.

I usually try not to obsess so much about these very low level details.
 
A little task for you.
Read in 16 values using digitalReadFast() on pins 1..16, shifting each as necessary to put the bit in its place in the word, and OR it into the variable that holds the 16 bits.
Time it over 1000 attempts and see how long it takes.
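
A rough host-side sketch of the two approaches being compared, for anyone who wants to try the task. Nothing here touches real hardware: `sim_psr` is an ordinary variable standing in for a GPIOx_PSR read, `sim_read_pin` is a hypothetical replacement for a single-pin read, and the mapping of 16 pins (numbered 0..15 here) onto bits 16..31 is an assumption made purely for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated input register; on hardware this would be a GPIOx_PSR read. */
static uint32_t sim_psr;

/* Hypothetical per-pin read: pin n (0..15) is assumed to sit on bit 16+n. */
static uint8_t sim_read_pin(uint8_t pin) {
    assert(pin < 16);
    return (uint8_t)((sim_psr >> (16 + pin)) & 1u);
}

/* Variant 1: sixteen single-pin reads, each shifted into place and ORed in. */
static uint16_t read_word_per_pin(void) {
    uint16_t word = 0;
    for (uint8_t pin = 0; pin < 16; pin++) {
        word |= (uint16_t)(sim_read_pin(pin) << pin);
    }
    return word;
}

/* Variant 2: one register read, then a single shift. */
static uint16_t read_word_from_register(void) {
    return (uint16_t)(sim_psr >> 16);
}
```

Both variants return the same word for the same register contents. On a board, each one could be run 1000 times between two micros() calls to see which is faster; on real pins the 16 bits in variant 1 are also sampled at slightly different moments.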
 
Let the compiler optimize that.
You can just measure which is faster: using the GPIO registers or digitalReadFast. I tend to think reading the registers is faster... and it has the plus that you read several bits at the same time, with fewer cycles in between.
But I know you can trust the compiler. Don't try to optimize too much by hand. Let GCC do its job. It's pretty good at this.
 