Regarding the new hardware, there's a consistent theme of "it's complicated"....
... for the newer T4 hardware cache and pipelining? If I found the right info, it seems the 600 MHz processor more than doubles the flash (off-chip) access rate, and the flash is probably not 32 bits wide at that?
Flash in T4: 4-bit parallel access, I think, like the ESP.
There is a cache.
Yes, the flash memory is 4-bit QPI DDR clocked at 60 MHz. So the raw data rate, before command overhead, is 60 MByte/sec (4 bits x 2 clock edges x 60 MHz = 480 Mbit/sec). When accessing the flash directly as addressable memory, there are caches, 32K for data and 32K for instructions. A cache miss is pretty slow, but once the row is filled you get the single-cycle speed of the cache. The flash isn't accessed as bytes or words, but in bursts that fill cache rows.
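To make the arithmetic behind that 60 MByte/sec figure concrete, here is a rough sketch. The function names are made up for illustration, and the 32-byte cache row size is an assumption (it isn't stated above), so treat the miss-cost number as a ballpark only. Command and address overhead are ignored, so a real miss will cost more than this.

```cpp
#include <cstdint>

// Raw bytes per second on the flash bus, ignoring command/address overhead.
// data_lines: data bits moved per clock edge (4 for a quad interface).
// ddr:        true if data moves on both clock edges.
constexpr std::uint64_t raw_flash_bytes_per_sec(unsigned data_lines, bool ddr,
                                                std::uint64_t clock_hz) {
    return clock_hz * data_lines * (ddr ? 2u : 1u) / 8u;
}

// Nanoseconds needed to stream one cache row at the raw rate.
// row_bytes is an assumption for illustration, not a figure from this thread.
constexpr double row_fill_ns(unsigned row_bytes, std::uint64_t bytes_per_sec) {
    return 1e9 * row_bytes / static_cast<double>(bytes_per_sec);
}

// 4 bits x 2 edges x 60 MHz / 8 bits per byte = 60 MByte/sec
static_assert(raw_flash_bytes_per_sec(4, true, 60000000) == 60000000,
              "4-bit DDR at 60 MHz should give 60 MByte/sec raw");
```

With an assumed 32-byte row, `row_fill_ns(32, 60000000)` comes out a bit over 533 ns, which is a few hundred cycles of a 600 MHz CPU. That's why the miss is "pretty slow" compared to the single-cycle hit.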
But that's not the normal way you use the chip. The main usage model involves copying your code into RAM at startup. On the new chip, basically everything will default to running from RAM, the way code marked "FASTRUN" does now. But there too, the "it's complicated" theme applies.
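For anyone who hasn't used it, this is roughly what marking a function with FASTRUN looks like today. It's a minimal sketch: `sum_bytes` is a made-up function, and the empty fallback `#define` is only there (my assumption) so the snippet also compiles on a desktop compiler, where the Teensy core headers that normally define FASTRUN aren't present.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// On a Teensy the core headers already define FASTRUN. This empty fallback
// is an assumption added only so the snippet builds off-target too.
#ifndef FASTRUN
#define FASTRUN
#endif

// Hypothetical hot function. FASTRUN asks for it to be copied into RAM at
// startup and run from there, instead of being fetched from flash.
FASTRUN std::uint32_t sum_bytes(const std::uint8_t *data, std::size_t len) {
    assert(data != nullptr || len == 0);  // a null pointer is only OK for an empty range
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        sum += data[i];
    }
    return sum;
}
```

If the new chip really does default to this behavior, the attribute would presumably become unnecessary for ordinary code, but that's my reading of the description above, not something confirmed.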
The RAM has a total size of 512K, which is organized in 32K blocks (16 blocks in all). Each block can be configured to connect in 1 of 3 possible ways: ITCM, DTCM or AXI. All 3 of these buses are 64 bits wide (but technically DTCM is a pair of 32 bit paths). The TCM buses run at the full clock speed. AXI runs at 1/4 of the CPU speed, but the bus provides amazing features. The general rule is you want all your code on the ITCM bus, your stack and all your normal variables accessed via the DTCM bus, and you want buffers accessed by DMA from peripherals on the AXI bus.
The two 32K caches only apply to memory on the AXI bus. Caching is never used on the ITCM and DTCM buses, since those buses give single-cycle access to all of the memory assigned to them.
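Here is a small model of the block bookkeeping described above, just to show how the numbers fit together. Everything in it (`BankSplit`, `is_valid`, `preferred_bus`, the example splits) is hypothetical naming for illustration. It is not the chip's actual configuration interface and doesn't claim any particular default split.

```cpp
#include <cstdint>

constexpr unsigned kTotalRamKB = 512;                   // total RAM, from the post
constexpr unsigned kBankKB     = 32;                    // block size, from the post
constexpr unsigned kBankCount  = kTotalRamKB / kBankKB; // 16 blocks

enum class Bus { ITCM, DTCM, AXI };

// Hypothetical description of how many blocks go to each bus.
struct BankSplit {
    unsigned itcm_banks;
    unsigned dtcm_banks;
    unsigned axi_banks;
};

// A split is only consistent if every block is assigned exactly once.
constexpr bool is_valid(BankSplit s) {
    return s.itcm_banks + s.dtcm_banks + s.axi_banks == kBankCount;
}

constexpr unsigned size_kb(unsigned banks) { return banks * kBankKB; }

// The "general rule" from the post: code on ITCM, stack and normal
// variables on DTCM, DMA buffers on AXI.
enum class Content { Code, StackAndVariables, DmaBuffer };

constexpr Bus preferred_bus(Content c) {
    return c == Content::Code                ? Bus::ITCM
         : c == Content::StackAndVariables   ? Bus::DTCM
                                             : Bus::AXI;
}

static_assert(kBankCount == 16, "512K in 32K blocks is 16 blocks");
```

For example, a made-up split of `BankSplit{4, 4, 8}` would mean 128K of ITCM, 128K of DTCM and 256K on AXI, and `is_valid` accepts it because 4 + 4 + 8 = 16.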