Hi everybody, apologies for the delay in responding. We were out of town and occupied.
Per the request for timing benchmarks, see:
Teensy 4.0/UNO R4 timing (github)
This is code that I threw together to check timing. It outputs average and maximum timings using the cycle counter, but in practice I use it with an oscilloscope. See the subdirectory "results". With a large number of iterations, one might hope that the minimum is close to the average, but admittedly that is an assumption and perhaps I should output that number as well.
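A minimal sketch of how the minimum could be tracked alongside the average and maximum. This is not the code in the repository; `CycleStats` and `elapsedCycles` are hypothetical names used only for illustration. It assumes the samples come from a free-running 32-bit cycle counter (on the Teensy 4.0 that would presumably be `ARM_DWT_CYCCNT`), read once before and once after the operation being timed.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: running min / average / max over cycle-count samples.
struct CycleStats {
    std::uint32_t minCycles = std::numeric_limits<std::uint32_t>::max();
    std::uint32_t maxCycles = 0;
    std::uint64_t sum = 0;      // 64-bit so long runs do not overflow
    std::uint32_t count = 0;

    // Fold one measurement into the running statistics.
    void add(std::uint32_t cycles) {
        if (cycles < minCycles) minCycles = cycles;
        if (cycles > maxCycles) maxCycles = cycles;
        sum += cycles;
        ++count;
    }

    // Mean of all samples so far; 0.0 if nothing has been recorded.
    double average() const {
        return count ? static_cast<double>(sum) / count : 0.0;
    }
};

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Unsigned subtraction stays correct across a single counter wraparound.
inline std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
    return end - start;
}
```

In a test loop one would read the counter, run the operation, read the counter again, and call `stats.add(elapsedCycles(t0, t1))`. Printing `minCycles` next to the average then shows directly whether the two are close.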
For the Teensy: Digital read is solid at 26 cycles (43 ns). Digital write is 12 to 29 cycles (20 to 48 ns). Interrupt latency for pins (or, specifically, the pin used in the test) averages 119 cycles with a maximum of 145 (198 to 242 ns, 25% variation). But skipping the API and connecting to the interrupt directly brings it down to 66 to 71 cycles (110 to 118 ns, 10% variation).
(The latency may seem slow, but that may be due to the larger number of registers that need to be saved. I recall that on DSPs, where we had direct control of the context switch, we would save only the minimum set of registers required for a given ISR. For an ARM saving everything, 66 cycles might be close to what one would expect.)
For comparison, the UNO R4: Digital read is 72 to 204 cycles (1.5 to 4.3 usec), write is 21 to 273 cycles (0.44 to 3.6 usec), latency for a pin interrupt is 182 to 312 cycles (3.8 to 6.5 usec), and SPI transfer16 takes 327 to 464 cycles (6.8 to 10 usec).
And for a practical application: The following is an example of a project that uses timing on the above scale for the Teensy. It implements a state machine to operate the Hamamatsu CCD:
S11639-01 with Teensy 4 (github)