how fast is digitalReadFast on a Teensy 3.6?
It depends on the surrounding code. It compiles to a single ST (store) instruction, which uses 2 registers: one for the address of the I/O register and one for the data to write.
How long it takes to load values into those 2 registers can vary, because ARM doesn't have a single instruction that loads a full 32 bit immediate value into a register. So it might do a LD instruction (using an address relative to the program counter), or it might use 1 or 2 other instructions and tricks to get the values into those registers.
If an extra LD instruction is used, it might cause a cache miss, adding 4 extra cycles of access time. Likewise, if the code itself isn't yet in the flash controller's 256 byte cache or the memory controller's 8K cache, flash wait states come into play. But the flash is very wide (can't recall if it's 128 or 256 bits), so those wait states don't apply to every fetch. If you really want maximum speed, use FASTRUN so your code runs from RAM, which is all single cycle access.
The compiler's optimizer will attempt to pre-load those registers. But again, the results can vary quite a lot depending on other details of your code. If the compiler decides that using the registers in other ways would make your code run faster overall, it will consider registers holding constants to be expendable. It can always just put the constant back into the register when needed. Or it can do this outside of one loop, but inside another. The compiler makes a lot of very complex decisions about how to best utilize the registers.
Even the speed of the ST (store to memory) instruction can vary. Normally it takes 2 cycles. But if you do 2 or more "related" load or store instructions in an unbroken sequence, the ARM hardware will apply a special pipeline optimization to do all but the first using only 1 cycle each. So if you have 10 digitalWriteFast() lines in a row, there's a very strong possibility the compiler will pre-load the registers in other code and then they all execute in only 11 cycles.
If you have 10 of them with other stuff mixed in between that puts a lot of "register pressure" on the compiler's optimizer, and you're running from flash with cache misses, those 10 digitalWriteFast() calls could take 100+ cycles.
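To make the 11 cycle figure concrete, here is a rough sketch of the arithmetic. It is only a simplified model of the rule described above (2 cycles for the first store in an unbroken run, 1 cycle for each one after it), not a measurement, and the function names are made up for illustration. It assumes the registers are already loaded and there are no cache misses.

```cpp
#include <cstdint>

// Simplified cycle model, NOT a measurement. Assumes the address and data
// registers are already loaded and nothing stalls on a cache miss.

// Unbroken run of stores: the first costs 2 cycles, each later one costs 1.
constexpr std::uint32_t burstStoreCycles(std::uint32_t stores) {
    return stores == 0 ? 0 : 2 + (stores - 1);
}

// The same stores separated by other code: every store pays the full 2 cycles.
constexpr std::uint32_t isolatedStoreCycles(std::uint32_t stores) {
    return 2 * stores;
}

// 10 back-to-back digitalWriteFast() calls under this model.
static_assert(burstStoreCycles(10) == 11, "10 pipelined stores take 11 cycles");
static_assert(isolatedStoreCycles(10) == 20, "10 isolated stores take 20 cycles");
```

Even the isolated case here is a best case for separated stores, since it still leaves out the time needed to reload registers and any flash wait states.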
And, how is it related to CPU speed (like, is it 33% faster at cpu clock=240 MHz than at cpu clock=180 MHz?)
Yes, it's definitely related to the CPU speed. Overclocking makes it run faster.
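As a rough worked example of the scaling, the conversion from cycles to time can be sketched like this. The helper name is made up for illustration, and it assumes the cycle count stays the same at both clock speeds, which is most plausible for code running from RAM and may not hold when flash wait states are involved.

```cpp
// Convert a cycle count to nanoseconds at a given CPU clock (in MHz).
// Assumption: the cycle count itself does not change with clock speed.
constexpr double cyclesToNanoseconds(double cycles, double cpuMHz) {
    return cycles * 1000.0 / cpuMHz;
}

// The 11 cycle burst of 10 stores at the two clock speeds in question.
constexpr double ns180 = cyclesToNanoseconds(11.0, 180.0);  // about 61.1 ns
constexpr double ns240 = cyclesToNanoseconds(11.0, 240.0);  // about 45.8 ns

// 240 / 180 is about 1.33: roughly 33% more stores per second,
// which means each burst takes 25% less time.
static_assert(ns180 / ns240 > 1.33 && ns180 / ns240 < 1.34, "speedup is 4/3");
```

So "33% faster" is right in terms of throughput, while the time per operation drops by a quarter.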
However, something to consider is how fast the pin's voltage can actually change. By default, pinMode() configures the pin with slew rate limiting. Normally this is a good thing, because it greatly reduces radio emissions and other nasty high-speed signal quality problems if you use long unshielded wires without impedance matching (the norm for most projects). With slew rate limiting enabled, you can actually get the code to run faster than the actual voltage at the pin can change. To really use this speed, you need to write the pin config register to put it in fast mode... and be warned, such extremely fast signals do cause RF noise problems if not handled very well. Actually measuring such signals can be quite tricky. Even with a very good oscilloscope, careful attention to probes and ground wire lengths can be critical for a good measurement.
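For the pin config register write mentioned above, here is a minimal sketch of the bit manipulation, written so it can be checked on a PC. It assumes the Kinetis pin control register layout where the slew rate enable (SRE) bit is bit 2; the constant and function names are made up for this example, so check the bit position against the chip's reference manual before relying on it.

```cpp
#include <cstdint>

// Assumed Kinetis PORTx_PCRn layout: bit 2 is SRE (1 = slew rate limited).
// The name below is hypothetical; the Teensy core headers have their own.
constexpr std::uint32_t kPcrSlewRateEnable = 1u << 2;

// Return the pin config value with slew rate limiting turned off,
// leaving every other bit (mux, drive strength, etc.) unchanged.
constexpr std::uint32_t clearSlewRateLimit(std::uint32_t pcr) {
    return pcr & ~kPcrSlewRateEnable;
}

// On a Teensy this would be applied after pinMode(), for example (untested):
//   pinMode(13, OUTPUT);
//   CORE_PIN13_CONFIG = clearSlewRateLimit(CORE_PIN13_CONFIG);
```

Doing a read-modify-write like this, rather than writing a whole new value, keeps whatever else pinMode() set up for the pin.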
Here's a thread with some actual test results and optimization tips.
https://forum.pjrc.com/threads/4187...lator-example)?p=132363&viewfull=1#post132363