Taking a closer look through SPI and thinking about ili9341_t3-like code...
I (or we) may need to play around some more. Hopefully tomorrow morning I will have the updated Transfer(buf, retbuf, cnt) version implemented, plus a transfer16 that does it in one transfer...
I also took a closer look at the SPI registers and FIFO queues. I believe both the TX and RX FIFOs are 16 entries deep and 32 bits wide, which should be nice for the above transfer.
I think I may have been somewhat wrong about the T4 not having some T3-like support for hardware CS.
Whereas the T3.x boards allowed you to encode the state of the hardware CS signals as part of a PUSH operation, the new T4 processor does give you some support, which will be interesting to try out.
To control this you use the Transmit Command Register (TCR), which controls several things, including the transfer speed and the word width (here is where I will experiment with changing from 8 to 16 bits).
In addition, you can select a Peripheral Chip Select (PCS), and also whether you want the transfer to be continuous... So I'm not sure yet whether we can do some hacks to turn it on and off on demand.
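To make the TCR fields a bit more concrete, here is a rough host-side sketch of how a TCR value might be put together. The bit positions are how I read the LPSPI TCR layout and should be double checked against the reference manual. The names (TCR_FRAMESZ, TCR_PCS, TCR_CONT, TCR_PRESCALE, makeTcr) are made up for this example and are not the core's own macros.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: bit positions below are my reading of the LPSPI TCR layout.
// Verify against the reference manual before relying on them.
constexpr uint32_t TCR_FRAMESZ(uint32_t bits) { return (bits - 1) & 0xFFFu; } // frame size in bits, minus 1
constexpr uint32_t TCR_PCS(uint32_t n)        { return (n & 3u) << 24; }      // which peripheral chip select
constexpr uint32_t TCR_PRESCALE(uint32_t n)   { return (n & 7u) << 27; }      // clock prescaler (divide by 2^n)
constexpr uint32_t TCR_CONT = 1u << 21;                                       // keep PCS asserted between frames

// Hypothetical helper: build a TCR word for a given word width and PCS.
uint32_t makeTcr(uint32_t frameBits, uint32_t pcs, bool continuous,
                 uint32_t prescale = 0) {
    assert(frameBits >= 8 && frameBits <= 4096); // FRAMESZ is a 12-bit field
    uint32_t tcr = TCR_FRAMESZ(frameBits) | TCR_PCS(pcs) | TCR_PRESCALE(prescale);
    if (continuous) {
        tcr |= TCR_CONT;
    }
    return tcr;
}
```

So switching from 8 to 16 bit words would just be a matter of writing makeTcr(16, ...) instead of makeTcr(8, ...) to the port's TCR register, assuming the layout above is right.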
Note: in the ILI9341 case, I would not try to use this for the CS pin, but hopefully for the DC pin. My ili9341_t3n library works with only one hardware CS pin, used for DC, and I found that using a standard IO pin for CS did not impact performance much.
The way this TCR is used is that the FIFO transmit queue is 16 units deep, and each unit can be either a command (TCR) or data (Transmit Data Register, TDR). Again, this looks like some stuff that might be fun to experiment with.
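As a way to think about the mixed command/data queue, here is a small host-side model (not real hardware code). Everything in it (TxFifoModel, FifoEntry, Frame) is a hypothetical name for illustration. The only behavior it models is that entries are processed in order, and that a command entry changes the frame size used for the data entries that follow it.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// One slot in the transmit FIFO: either a command (TCR) or data (TDR) word.
struct FifoEntry { bool isCommand; uint32_t value; };

// What ends up on the wire in this model: a data word plus its width in bits.
struct Frame { uint32_t bits; uint32_t data; };

// Hypothetical model of a 16-deep transmit FIFO holding commands and data.
class TxFifoModel {
public:
    static constexpr std::size_t kDepth = 16;

    bool pushCommand(uint32_t tcr) { return push({true, tcr}); }
    bool pushData(uint32_t tdr)    { return push({false, tdr}); }
    std::size_t size() const       { return q_.size(); }

    // Process entries in order. A command updates the current frame size
    // (low 12 bits hold size minus 1, an assumption about the TCR layout);
    // a data word is emitted with whatever frame size is currently active.
    std::vector<Frame> drain() {
        std::vector<Frame> out;
        while (!q_.empty()) {
            FifoEntry e = q_.front();
            q_.pop_front();
            if (e.isCommand) {
                frameBits_ = (e.value & 0xFFFu) + 1;
            } else {
                out.push_back({frameBits_, e.value});
            }
        }
        return out;
    }

private:
    bool push(FifoEntry e) {
        if (q_.size() >= kDepth) return false; // FIFO full, caller must wait
        q_.push_back(e);
        return true;
    }
    std::deque<FifoEntry> q_;
    uint32_t frameBits_ = 8; // model starts out in 8 bit mode
};
```

With this model, queuing an 8 bit command, one data byte, a 16 bit command, and then one data word drains as one 8 bit frame followed by one 16 bit frame, which is the sort of on-the-fly switching I would like to try on the real hardware.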