So this is what I came up with. I don't know yet whether it works, because I can't test it.
I don't have a Teensy 4.0; I am waiting for the T4.1, which seems to have more GPIOs but, as far as I can tell, the same clock speed and overclocking capability.

digitalWriteFast seems to be fast, around 2 clock cycles per call, I guess? But setting the whole port one pin at a time will still take some time. The bad thing is that the pins are distributed quite poorly: there is no run of 8 consecutive bits that a byte could be written to in one go.
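
One possible way around the scattered pins is to spread the byte across the port bits in software and then update the register once with a mask, instead of calling digitalWriteFast eight times. This is only a minimal sketch and I haven't run it on hardware: the bit positions in `kBitPos` are made up for illustration (not a real Teensy pin map), and `FakePort` is a hypothetical stand-in for a memory-mapped GPIO data register so the code compiles on a PC.

```cpp
#include <cstdint>

// Hypothetical port bit positions for data bits D0..D7.
// These are illustrative only, NOT the actual Teensy 4.x pin mapping.
constexpr uint8_t kBitPos[8] = {2, 3, 16, 17, 18, 19, 22, 23};

// Spread an 8-bit value across the scattered bit positions.
constexpr uint32_t scatter(uint8_t value) {
    uint32_t out = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if (value & (1u << i)) {
            out |= (1u << kBitPos[i]);
        }
    }
    return out;
}

// Mask covering every port bit that belongs to the 8-bit bus.
constexpr uint32_t kMask = scatter(0xFF);

// Stand-in for a 32-bit GPIO data register (assumption for this sketch).
struct FakePort {
    uint32_t dr = 0;
};

// Write one byte: clear the bus bits, then set the new pattern.
// All other bits of the register are left untouched.
inline void writeByte(FakePort& port, uint8_t value) {
    port.dr = (port.dr & ~kMask) | scatter(value);
}
```

If the per-bit loop turns out to be too slow at run time, `scatter` could be precomputed into a 256-entry lookup table, trading about 1 KB of memory for a single indexed read per byte. On the real chip the read-modify-write would presumably be replaced by writes to the port's set and clear registers, but I haven't verified that yet.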