mborgerson
In my current WIP project, implementing USB Test and Measurement Class (USBTMC) drivers for Teensy 4.1 hosts and devices, I have encountered an issue that may concern other developers who depend on EXTMEM for large buffers. In my case, I use 4MB of extmem as a circular buffer to handle incoming 8KB USBTMC data packets. The circular buffer is needed to buffer the incoming data stream at those times when SD Card writes take an unusually long time to complete (100+ milliseconds in some cases). The need to buffer incoming data during slow SD writes is well-known and solutions have been developed to manage the problem.
During USBTMC logging sessions, the USB host may send up to 9MB/second to the EXTMEM Buffer. At the same time, the SD driver is reading from the buffer to write data to the SD Card. EXTMEM seems to handle this aggregate 18MB/second of DMA transfers MOST OF THE TIME. One characteristic of the transfers helps the EXTMEM driver: The write and read accesses occur at monotonically increasing addresses. This minimizes the command overhead involved in sending the data to the PSRAM chip.
A second factor is that, in normal operation, the SD Card driver is reading the data block just behind the last block written by the USB host. I suspect that some system cache magic may simplify things for the EXTMEM driver, but I'm not sure how that cache interacts with DMA-based USB writes and SD Reads.
In my case, things seem to work well EXCEPT when the circular buffer reaches the end and wraps back to the beginning at the same time the SD card is taking extra time to write the last block of data. This seems to happen only about once every few seconds. The circular buffer wraps back about every 400 msec at the max data rate. I'm not sure exactly how often SD card write times are extended with the 128GB SanDisk card, but I intend to add some pin toggles to show that on my oscilloscope. The error appears to be that a block from some other location is written to EXTMEM instead of the current data from the USB host.
When the long write occurs near the end of the circular buffer, the read pointer is stalled there until the next write. At the same time, the USB write pointer continues past the buffer end and restarts at the buffer beginning. When this happens, the write pointer may be a megabyte or two ahead of the SD read pointer until the SD reads can catch up. Furthermore, the read and write pointers can be at opposite ends of the 4MB circular buffer. If the 32KB system cache is in play, it will suddenly have to cope with widely separated data segments. In addition, the PSRAM driver will be coping with the command overhead needed to switch between the 8KB read and write segments.
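To make the wrap scenario concrete, here is a minimal sketch of the pointer bookkeeping, using the 4MB ring and 8KB block sizes from above. The `RingState` struct and the helper names are hypothetical, not taken from my driver; the sketch only models the indices and does not touch EXTMEM, DMA, or the cache.

```cpp
#include <cassert>
#include <cstdint>

// Sizes from the post: 4MB ring in EXTMEM, 8KB USBTMC packets.
constexpr uint32_t kBlockSize = 8u * 1024u;
constexpr uint32_t kRingSize  = 4u * 1024u * 1024u;
constexpr uint32_t kNumBlocks = kRingSize / kBlockSize;  // 512 blocks

// Hypothetical bookkeeping: block indices instead of raw pointers.
struct RingState {
  uint32_t writeBlock = 0;  // next block the USB side fills
  uint32_t readBlock  = 0;  // next block the SD side drains
};

// Advance an index by one block, wrapping at the end of the ring.
inline uint32_t nextBlock(uint32_t i) {
  assert(i < kNumBlocks);
  return (i + 1u) % kNumBlocks;
}

// Blocks written but not yet read (valid while the writer never laps the reader).
inline uint32_t blocksPending(const RingState &r) {
  return (r.writeBlock + kNumBlocks - r.readBlock) % kNumBlocks;
}

// Byte distance between the two access addresses inside the 4MB region.
inline uint32_t addressGap(const RingState &r) {
  uint32_t w  = r.writeBlock * kBlockSize;
  uint32_t rd = r.readBlock * kBlockSize;
  return (w > rd) ? (w - rd) : (rd - w);
}
```

If the reader stalls at block 510 while the writer fills 100 more blocks (roughly a 100 msec SD stall at one packet per millisecond), the writer wraps to block 98. Only 100 blocks are pending, yet the two access addresses are about 3.2MB apart, which is the "opposite ends" situation described above.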
The occasional glitches might never be noticed when capturing video frames. In my USBTMC application, a missing or misplaced data block shows up because the USBTMC header has an incrementing tag byte that rolls over every 255 transfers (zero values are not allowed, so the tag rolls from 255 to 1). In addition, my simulated 32-byte samples have an incrementing sample number as their first long word. The glitches were immediately obvious when I started plotting the sample numbers.
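The same check can be done in code rather than by eye. The following is a rough sketch, assuming each block carries a fixed number of samples; `SequenceChecker` and the `samplesPerBlock` parameter are hypothetical names for illustration, and how the tag and first sample number are extracted from a block is left out.

```cpp
#include <cassert>
#include <cstdint>

// USBTMC tag: increments per transfer, valid range 1..255, 0 is not allowed.
inline uint8_t nextTag(uint8_t tag) {
  return (tag == 0 || tag == 255) ? 1 : static_cast<uint8_t>(tag + 1);
}

// Hypothetical checker fed with the tag and first sample number of each block.
struct SequenceChecker {
  uint8_t  expectedTag    = 1;
  uint32_t expectedSample = 0;
  uint32_t errors         = 0;

  // Returns true if the block is the one expected next.
  bool check(uint8_t tag, uint32_t firstSample, uint32_t samplesPerBlock) {
    assert(samplesPerBlock > 0);
    bool ok = (tag == expectedTag) && (firstSample == expectedSample);
    if (!ok) ++errors;
    // Resynchronise on what was actually seen so one glitch counts once.
    expectedTag    = nextTag(tag);
    expectedSample = firstSample + samplesPerBlock;
    return ok;
  }
};
```

Counting errors this way while toggling a pin on each failure would make it easier to line the glitches up against slow SD writes on the oscilloscope.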
I have an interim solution: I have the USB send all data packets to a single 8KB buffer in DTCM. In the end-of-transfer USB callback function, I memcpy() the data from the DTCM buffer into the appropriate place in the circular buffer. I suppose that this works because the memcpy() does not use DMA to put the data into the circular buffer and can take advantage of the system cache. The downside is that the 8KB memcpy() takes about 270uSec--greatly increasing the time spent in the USB callback function.
I can move the memcpy() out of the callback function by setting a flag there and having the loop() function check the flag often and do the memcpy() when required.
I am also considering having USB incoming data go into an intermediate circular buffer in DTCM of perhaps four 8KB blocks. I could then transfer from that buffer to the larger EXTMEM buffer, perhaps using DMA to handle that transfer. However, I am afraid that might result in the same problems as the original transfer directly from USB to EXTMEM. If the DMA from loop() is at a lower rate than the USB DMA, it might work.
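A minimal sketch of that intermediate buffer, combined with the idea of deferring the copy to loop(), might look like the following. All names here (`StagingQueue`, `nextFillTarget`, `commitFill`, `drainOne`) are hypothetical. It uses memcpy() for the drain rather than DMA, assumes a single producer (the USB callback) and a single consumer (loop()), and on a Teensy the struct would need to be placed in DTCM.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint32_t kBlockSize   = 8u * 1024u;
constexpr uint32_t kStageBlocks = 4;  // must be a power of two

struct StagingQueue {
  uint8_t blocks[kStageBlocks][kBlockSize];
  volatile uint32_t head = 0;  // total blocks published by the USB callback
  volatile uint32_t tail = 0;  // total blocks drained by loop()

  bool full() const  { return head - tail == kStageBlocks; }
  bool empty() const { return head == tail; }

  // Buffer to hand to the next USB transfer, or nullptr if loop() has fallen behind.
  uint8_t *nextFillTarget() {
    return full() ? nullptr : blocks[head % kStageBlocks];
  }

  // Called from the end-of-transfer callback: publish the block, no copying here.
  void commitFill() {
    assert(!full());
    head = head + 1;
  }

  // Called from loop(): copy one pending block to its slot in the big buffer.
  bool drainOne(uint8_t *dest) {
    if (empty()) return false;
    memcpy(dest, blocks[tail % kStageBlocks], kBlockSize);
    tail = tail + 1;
    return true;
  }
};
```

With four blocks, loop() has roughly 3 msec of slack at one packet per millisecond before `nextFillTarget()` returns nullptr and a packet would have to be dropped or the device stalled. Whether replacing the memcpy() in `drainOne()` with DMA reintroduces the original glitch is exactly the open question above.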
My original intention was to avoid all use of memcpy() as it can use up almost 27% of the CPU cycles. When collecting 8MB/second, there is a new 8KB packet once per millisecond. If it takes 270uSec to move that packet, there goes 27% of the CPU cycles! With some careful (and more complex) programming, I can live with the decreased CPU availability--especially if the memcpy() is interruptible so that the host can do some timer-based data collection of its own.
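The arithmetic behind that 27% figure, written out as a small sketch so it is easy to re-run with other block sizes or copy times. The function names are hypothetical; the inputs are the measured 270uSec copy time and the one-packet-per-millisecond rate quoted above.

```cpp
#include <cassert>

// Fraction of CPU time spent copying: copy time divided by packet period
// (both in microseconds).
inline double cpuFraction(double copyUs, double periodUs) {
  assert(periodUs > 0.0);
  return copyUs / periodUs;
}

// Effective memcpy() rate in bytes per second for one block.
inline double copyRate(double bytes, double copyUs) {
  assert(copyUs > 0.0);
  return bytes / (copyUs * 1e-6);
}
```

`cpuFraction(270.0, 1000.0)` gives 0.27, and `copyRate(8192.0, 270.0)` works out to a little over 30MB/second for the DTCM-to-EXTMEM copy.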
I'm open to other solutions. I also plan to investigate how the USB MTP driver handles the interaction of SD Card reads and writes and USB transfers to and from the PC. I suspect that the USBTMC driver is at a disadvantage as I want to maintain continuous high-speed transfers from connected devices, while the PC and Teensy MTP driver can occasionally pause to catch their breath.