This digitalReadFast test is pretty bogus. Since the result doesn't get used, half of it gets optimized away. So it results in this:
This is missing the bit masking for the pin. That never happens in real code, where the result is actually used instead of being thrown away. digitalReadFast reads the entire port register and must therefore mask out the other pins.
If I assign the result to a volatile bool in the test, it looks like this:
Code:
ldr r1, [r6, #0]
ubfx r1, r1, #5, #1
strb.w r1, [sp, #9]
The port register is read, bit 5 is extracted, and the result is stored into a bool on the stack. This takes 7x longer than the test above, although the 7x is a bit unfair, since the stack store probably accounts for about half of that time.
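To make the difference concrete, here is a rough host-side sketch of the two steps (read the whole port, then isolate one bit) and of the volatile store that stops the compiler from dropping the masking. The names `fake_port_input`, `read_pin` and `sample_pin5` are made up for illustration and a plain variable stands in for the real memory-mapped register, so this is not the actual digitalReadFast source.

```cpp
#include <cstdint>

// Hypothetical stand-in for a GPIO input register. On real hardware this
// would be a fixed memory-mapped address, not an ordinary variable.
volatile uint32_t fake_port_input = 0;

// Read the whole 32-bit port word, then isolate a single pin.
// The shift-and-mask is the part that disappears when the result is unused.
inline bool read_pin(const volatile uint32_t *port, unsigned bit) {
    return ((*port >> bit) & 1u) != 0;
}

// Benchmark-style helper: storing into a volatile bool forces the compiler
// to keep the load, the bit extraction and the store.
inline bool sample_pin5() {
    volatile bool sink = read_pin(&fake_port_input, 5);
    return sink;
}
```

In a timing loop, the assignment to `sink` is what matters; without it the optimizer is free to keep only the volatile register load.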
I have experimented with bit-banding, where the pin value can be read with a single instruction (and GCC does emit that single instruction), but the surrounding code GCC generates is worse than digitalReadFast if you don't know exactly what you are doing.
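For reference, this is roughly how a bit-band alias address is computed on a Cortex-M3/M4, using the standard peripheral region mapping (0x40000000 aliased at 0x42000000). The helper name `bitband_alias` and the register address 0x40001000 in the check are placeholders I picked for illustration, not the real port address from the test above.

```cpp
#include <cstdint>

// Standard Cortex-M3/M4 peripheral bit-band mapping.
constexpr uint32_t kPeriphBase  = 0x40000000u;  // start of bit-band region
constexpr uint32_t kPeriphAlias = 0x42000000u;  // start of alias region

// Every bit in the bit-band region gets its own 32-bit word in the alias
// region: 32 alias bytes per register byte, 4 alias bytes per bit.
constexpr uint32_t bitband_alias(uint32_t reg_addr, unsigned bit) {
    return kPeriphAlias + (reg_addr - kPeriphBase) * 32u + bit * 4u;
}

// Placeholder register address, bit 5.
static_assert(bitband_alias(0x40001000u, 5) == 0x42020014u,
              "alias address for bit 5 of the placeholder register");
```

On real hardware, a load through `reinterpret_cast<volatile uint32_t *>(bitband_alias(addr, 5))` returns 0 or 1 directly, so no masking is needed. Whether that ends up faster depends on how the compiler materializes the alias address, which is where I saw GCC do worse.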