Glitches when using EXTMEM circular buffer for USB Host writes and reads by SD Driver

mborgerson

In my current WIP project, implementing USB Test and Measurement Class (USBTMC) drivers for Teensy 4.1 hosts and devices, I have encountered an issue that may concern other developers who depend on EXTMEM for large buffers. In my case, I use 4MB of EXTMEM as a circular buffer to handle incoming 8KB USBTMC data packets. The circular buffer is needed to absorb the incoming data stream when SD Card writes take an unusually long time to complete (100+ milliseconds in some cases). The need to buffer incoming data during slow SD writes is well-known and solutions have been developed to manage the problem.

During USBTMC logging sessions, the USB host may send up to 9MB/second to the EXTMEM Buffer. At the same time, the SD driver is reading from the buffer to write data to the SD Card. EXTMEM seems to handle this aggregate 18MB/second of DMA transfers MOST OF THE TIME. One characteristic of the transfers helps the EXTMEM driver: The write and read accesses occur at monotonically increasing addresses. This minimizes the command overhead involved in sending the data to the PSRAM chip.

A second factor is that, in normal operation, the SD Card driver is reading the data block just behind the last block written by the USB host. I suspect that some system cache magic may simplify things for the EXTMEM driver, but I'm not sure how that cache interacts with DMA-based USB writes and SD Reads.

In my case, things seem to work well EXCEPT when the circular buffer reaches the end and wraps back to the beginning at the same time the SD card is taking extra time to write the last block of data. This seems to happen only about once every few seconds. The circular buffer wraps back about every 400 msec at the max data rate. I'm not sure exactly how often SD card write times are extended with the 128GB SanDisk card, but I intend to add some pin toggles to show that on my oscilloscope. The error appears to be that a block from some other location is written to EXTMEM instead of the current data from the USB host.

When the long write occurs near the end of the circular buffer, the read pointer is stalled there until the next write. At the same time, the USB write pointer continues past the buffer end and restarts at the buffer beginning. When this happens the write pointer may be a megabyte or two ahead of the SD read pointer until the SD reads can catch up. Furthermore, the read and write pointers can be at opposite ends of the 4MB circular buffer. If the 32KB system cache is in play, it will suddenly have to cope with widely separated data segments. In addition, the PSRAM driver will be coping with the command overhead needed to switch between the 8KB read and write segments.
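The pointer arithmetic described above can be sketched as follows. This is only an illustration, not code from the project: the names `kRingBytes`, `kBlockBytes`, `ringBytesPending()` and `ringAdvance()` are hypothetical, and the sizes simply mirror the 4MB buffer and 8KB packets mentioned in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants mirroring the sizes described in the post.
constexpr std::size_t kRingBytes  = 4u * 1024u * 1024u;  // 4MB ring in EXTMEM
constexpr std::size_t kBlockBytes = 8u * 1024u;          // one 8KB USBTMC packet

// Bytes written by the USB side that the SD side has not yet read.
// Both arguments are byte offsets in [0, kRingBytes).
inline std::size_t ringBytesPending(std::size_t writeOfs, std::size_t readOfs) {
  assert(writeOfs < kRingBytes && readOfs < kRingBytes);
  // Adding kRingBytes before subtracting keeps the result non-negative
  // when the write offset has already wrapped and the read offset has not.
  return (writeOfs + kRingBytes - readOfs) % kRingBytes;
}

// Advance an offset by one block, wrapping at the end of the ring.
inline std::size_t ringAdvance(std::size_t ofs) {
  return (ofs + kBlockBytes) % kRingBytes;
}
```

With the read offset stalled on the last block and the write offset 2MB past the wrap, `ringBytesPending()` reports 2MB plus one block, which is the "megabyte or two ahead" situation described above.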

The occasional glitches might never be noticed when capturing video frames. In my USBTMC application a missing or misplaced data block shows up because the USBTMC header has an incrementing tag byte which rolls over every 255 packets (zero values are not allowed, so the tag rolls from 255 to 1). In addition, my simulated 32-byte samples have an incrementing sample number as their first long word. The glitches were immediately obvious when I started plotting the sample numbers.
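A check of the tag sequence along these lines might look like the sketch below. `nextTag()` and `TagChecker` are hypothetical names, and the assumption that the first packet carries tag 1 is mine, not something stated in the post.

```cpp
#include <cstdint>

// Next expected tag: values run 1..255 and never take the value 0.
inline std::uint8_t nextTag(std::uint8_t tag) {
  return (tag == 255) ? 1 : static_cast<std::uint8_t>(tag + 1);
}

// Hypothetical checker that counts out-of-sequence packets.
struct TagChecker {
  std::uint8_t  expected = 1;   // assumption: the first tag seen is 1
  std::uint32_t glitches = 0;   // number of packets whose tag was unexpected

  // Returns true when the tag matches the expected value.
  bool check(std::uint8_t tag) {
    const bool ok = (tag == expected);
    if (!ok) ++glitches;
    expected = nextTag(tag);    // resynchronize on the tag actually observed
    return ok;
  }
};
```

Resynchronizing on the observed tag means a single misplaced block is counted once rather than flagging every packet that follows it.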

I have an interim solution: I have the USB send all data packets to a single 8KB buffer in DTCM. In the end-of-transfer USB callback function, I memcpy() the data from the DTCM buffer into the appropriate place in the circular buffer. I suppose that this works because the memcpy() does not use DMA to put the data into the circular buffer and can take advantage of the system cache. The downside is that the 8KB memcpy() takes about 270uSec--greatly increasing the time spent in the USB callback function.
I can move the memcpy() out of the callback function by setting a flag there and having the loop() function check the flag often and do the memcpy() when required.
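The flag-and-copy arrangement could be sketched roughly as below. This is a host-compilable illustration under my own assumptions, not the author's code: `StagedRing`, `onTransferComplete()` and `service()` are hypothetical names, the staging array stands in for the DTCM buffer, and `ring` stands in for the EXTMEM circular buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBlockBytes = 8u * 1024u;  // one 8KB packet

struct StagedRing {
  std::uint8_t* ring;                  // on a Teensy this would point into EXTMEM
  std::size_t   ringBytes;             // must be a multiple of kBlockBytes
  std::size_t   writeOfs = 0;          // next block position in the ring
  std::uint8_t  staging[kBlockBytes];  // on a Teensy this would live in DTCM
  volatile bool blockReady = false;    // set in the callback, cleared in loop()

  StagedRing(std::uint8_t* r, std::size_t n) : ring(r), ringBytes(n) {
    assert(n % kBlockBytes == 0);
  }

  // Called from the USB end-of-transfer callback: do no copying here.
  void onTransferComplete() { blockReady = true; }

  // Called often from loop(): copy one staged block into the ring if present.
  bool service() {
    if (!blockReady) return false;
    std::memcpy(ring + writeOfs, staging, kBlockBytes);
    writeOfs = (writeOfs + kBlockBytes) % ringBytes;  // wrap at the ring end
    blockReady = false;
    return true;
  }
};
```

With a single staging block, `service()` has to run before the next USB transfer starts overwriting `staging`, which at one packet per millisecond leaves little slack.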

I am also considering having USB incoming data go into an intermediate circular buffer in DTCM of perhaps 4 8KB blocks. I could then transfer from that buffer to the larger EXTMEM buffer, perhaps using DMA to handle that transfer. However, I am afraid that might result in the same problems as the original transfer directly from USB to EXTMEM. If the DMA from loop() is at a lower rate than the USB DMA, it might work.

My original intention was to avoid all use of memcpy() as it can use up almost 27% of the CPU cycles. When collecting 8MB/second, there is a new 8KB packet once per millisecond. If it takes 270uSec to move that packet, there goes 27% of the CPU cycles! With some careful (and more complex) programming, I can live with the decreased CPU availability--especially if the memcpy() is interruptible so that the host can do some timer-based data collection of its own.

I'm open to other solutions. I also plan to investigate how the USB MTP driver handles the interaction of SD Card reads and writes and USB transfers to and from the PC. I suspect that the USBTMC driver is at a disadvantage as I want to maintain continuous high-speed transfers from connected devices, while the PC and Teensy MTP driver can occasionally pause to catch their breath.
 
I suspect this sustained speed may simply be too fast for the PSRAM chip to constantly maintain.
At the default clock speed of 88MHz, 30MB/s is the rough upper sustained limit for mixed reads and writes.
There are a couple of simple options for increasing this:
- Raise the FlexSPI2 (PSRAM) clock speed, up to a maximum of 133MHz. There are a few posts on the forum about how to do this; sometimes the max supported speed is lower because of the specific PSRAM chip being used.
- Enable FlexSPI2 prefetching. This may speed up the reads depending on how the SD card driver is doing DMA - if it's already using 64-byte transfers (8 beats of 8 bytes) then it probably won't help.

In either case, if the data is never being read/written by the CPU and all the moving is done via DMA (by the EHCI host and SD controller) the 32KB CPU cache is not in play at all, since it only caches data going into/out of the CPU.
 
If the EXTMEM driver can handle 30MB/sec long term, that would explain why the EXTMEM handles the 18MB/sec 99.99% of the time, but sometimes gets address bits mixed up during the circular buffer wraparound. One problem I face is that, once the host has requested a response packet, it has no real control over when the bytes will really arrive from the devices. As a result, I can't figure out how to tell the host to quiet down if I detect that the next response packets will be stored near the buffer wraparound.

Since the memcpy() option halves the DMA load and shifts it to the CPU where the cache IS effective, I think I'll try to figure out how to make the memcpy() calls have minimum impact on the rest of the program.

I did a quick test with the PSRAM clock at 133MHz. It didn't appear to solve the problem.
 
Incorrect data being written to PSRAM shouldn't be happening under any circumstances.
The USB Host controller has its own internal buffers and won't initiate read transactions from devices if they're full (due to not being able to flush to RAM fast enough), so I don't see how it could end up placing the wrong data in memory.
 
I did some tests with a timer ISR writing to a ring buffer in PSRAM at various rates, then reading from the ring buffer and writing to SD in loop(). I found it did not quite run reliably at 8 MB/s with the default 88 MHz FlexSPI2 clock rate. Increasing the clock to 133 MHz allowed it to work reliably at 8 MB/s, but not quite at 9 MB/s. I did not use DMA, and my longest run was 128 seconds @ 8MB/s, so the file was a little over 1GB. I didn't see any "anomalies" in terms of the time to read/write PSRAM.

At 8 MB/s, the max ring buffer usage was about 600KB, and the total CPU for writing PSRAM, reading PSRAM, and writing to SD was ~58%, with 27% for writing to PSRAM (same as OP), about the same for read from PSRAM, and a few percent for writing to SD.

I was also able to get it to run at 8 MB/s with the largest ring buffer I could fit in RAM2 (~500K), and no use of PSRAM. The largest ring buffer I could fit into DTCM (RAM1) was ~400K, and that wasn't quite enough. With the buffer in RAM2, for the same run of 128 sec @ 8MB/sec, total CPU usage was less than 5%. The max busy time of my SanDisk Ultra 32GB card was just under 50 ms. I don't know how often those long busy times occurred, but if the max busy time were any greater, the max buffer in RAM2 would not be enough.
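The relationship between data rate, worst-case SD busy time and the buffer size needed can be estimated with a short calculation. The sketch below assumes data keeps arriving at a constant rate and nothing is drained while the card is busy; `ringBytesNeeded()` is a hypothetical helper, not part of SdFat or the test program.

```cpp
#include <cstdint>

// Bytes that pile up while the SD card is busy, rounded up to whole blocks.
// bytesPerSec : incoming data rate
// maxBusyUs   : longest observed SD busy time, in microseconds
// blockBytes  : size of one ring buffer block
constexpr std::uint64_t ringBytesNeeded(std::uint64_t bytesPerSec,
                                        std::uint64_t maxBusyUs,
                                        std::uint64_t blockBytes) {
  return ((bytesPerSec * maxBusyUs / 1000000u + blockBytes - 1) / blockBytes)
         * blockBytes;
}
```

For 8192-byte chunks every 976 us and a busy time just under 50 ms this gives a little over 400KB, which fits the observation that a ~400K buffer was not quite enough while ~500K was.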

EDIT: I was able to get it working for 128 sec @ 12 MB/s with 1 MB buffer in PSRAM and FlexSPI2 clock 133 MHz. The file is ~1.5 GB and total CPU usage was ~80%. For 128 sec @ 8 MB/s, total CPU time for PSRAM and SD is ~47% with max RingBuf usage only 400K.

Apr 5 2024 14:42:20
Teensy 4.1
Teensyduino version 159
SdFat version 2.1.2
Type any character to begin
Log for 128 seconds at 8.00 MB/s in 8192 byte chunks
Pre-allocated file 1073741824 bytes
RingBuf 4194304 bytes
Start dataTimer (period = 976 us)
................................................................
...............................................................
Stop dataTimer
2096864 writes to SD in 127.927 s (27.552 s writing to file = 21.537 %)
131072 writes to RB in 127.927 s (32.448 s writing to RB = 25.365 %)
rbMaxUsed = 402432
max busy us = 47903
 
DMA operations bypass the cache on WRITE and also READ. The cache needs proper invalidation.

A: > 'NEW' data values in the CACHE will not be READ/seen by DMA, but the old values on the physical PSRAM will be used.
--> for valid DMA READ the cache must be FLUSHED over the desired region after application WRITE

B: > DMA written data on the physical PSRAM will not be returned if the CACHE is 'thought' to hold those memory values.
--> for valid application READ after DMA WRITE the cache must be DELETED before application READ.

This seems it may be a case of "A" where DMA following behind is pulling stale physical data where fresh data is in the cache.

Code:
// ...\hardware\teensy\avr\cores\teensy4\imxrt.h

// Flush data from cache to memory
//
// Normally arm_dcache_flush() is used when metadata written to memory
// will be used by a DMA or a bus-master peripheral.  Any data in the
// cache is written to memory.  A copy remains in the cache, so this is
// typically used with special fields you will want to quickly access
// in the future.  For data transmission, use arm_dcache_flush_delete().
__attribute__((always_inline, unused))
static inline void arm_dcache_flush(void *addr, uint32_t size)
// ...

// Delete data from the cache, without touching memory
//
// WARNING: This function is DANGEROUS!!  The address must be
// 32 byte aligned and the size must be a multiple of 32 bytes.
//
// DO NOT USE this function with arbitrarily aligned data,
// especially pointers from malloc() or C++ new.  The ARM cache
// can only delete with granularity of 32 byte cache rows.  If
// you attempt to delete improperly aligned data, any other
// cached variables shared within the same 32 byte cache row(s)
// will become collateral damage!
//
// If you wish to assure some variable or array or other data
// is not cached, use arm_dcache_flush_delete().  This
// arm_dcache_delete() should only be used for very special
// cases like DMA buffers or hardware testing & benchmarks.
//
// See this forum thread for more detail:
// https://forum.pjrc.com/threads/68100-BUG-in-arm_dcache_delete
//
// Normally arm_dcache_delete() is used before receiving data via
// DMA or from bus-master peripherals which write to memory.  You
// want to delete anything the cache may have stored, so your next
// read is certain to access the physical memory.
__attribute__((always_inline, unused))
static inline void arm_dcache_delete(void *addr, uint32_t size)
// ...

// Flush data from cache to memory, and delete it from the cache
//
// Normally arm_dcache_flush_delete() is used when transmitting data
// via DMA or bus-master peripherals which read from memory.  You want
// any cached data written to memory, and then removed from the cache,
// because you no longer need to access the data after transmission.
__attribute__((always_inline, unused))
static inline void arm_dcache_flush_delete(void *addr, uint32_t size)
// ...
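
Given the alignment warning in the comments above, a buffer could be checked before it is passed to arm_dcache_delete(). The helper below is a hypothetical sketch (`isCacheLineSafe()` and `kCacheLine` are my names, not part of the Teensy core); it only tests the address and size against the 32-byte cache row size quoted above.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 32;  // cache row size quoted in imxrt.h

// True when the region starts on a cache row boundary and covers whole rows,
// so deleting it from the cache cannot touch neighbouring variables.
inline bool isCacheLineSafe(const void* addr, std::size_t size) {
  const auto a = reinterpret_cast<std::uintptr_t>(addr);
  return (a % kCacheLine == 0) && (size % kCacheLine == 0);
}

// Intended use on a Teensy (not compiled here; assumes the core functions
// declared in imxrt.h):
//   if (isCacheLineSafe(block, 8192)) arm_dcache_delete(block, 8192);
//   else                              arm_dcache_flush_delete(block, 8192);
```

An 8KB block placed at an 8KB-aligned offset in the circular buffer passes this check, provided the buffer itself starts on a 32-byte boundary.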
 
re p#6: Rather than using memcpy(), after the data is written to PSRAM perform an arm_dcache_flush() on that region, or perhaps arm_dcache_flush_delete().

This may cause a stall of some duration as the PSRAM data will be flushed to the physical media.

It may work to perform the flush just prior to a DMA based write operation from the PSRAM as the cache may find time to write some of the cache data in the background - and if not cached would not take any significant time.
 
The ARM cache isn't the issue. Both the USB Host controller and the SD controller use DMA, the data never goes through the CPU.
 
The SD Controller does not necessarily use DMA. It can use either DMA or code which directly copies bytes:

Code:
// From the SdioTeensy.cpp source code:

bool SdioCard::writeData(const uint8_t* src) {
  DBG_IRQSTAT();
  if (!waitTransferComplete()) {
    return false;
  }
  const uint32_t* p32 = reinterpret_cast<const uint32_t*>(src);
  if (!(SDHC_PRSSTAT & SDHC_PRSSTAT_WTA)) {
    SDHC_PROCTL &= ~SDHC_PROCTL_SABGREQ;
    SDHC_PROCTL |= SDHC_PROCTL_CREQ;
  }
  SDHC_PROCTL |= SDHC_PROCTL_SABGREQ;
  if (waitTimeout(isBusyFifoWrite)) {
    return sdError(SD_CARD_ERROR_WRITE_FIFO);
  }
  for (uint32_t iw = 0 ; iw < 512/(4*FIFO_WML); iw++) {
    while (0 == (SDHC_PRSSTAT & SDHC_PRSSTAT_BWEN)) {
    }
    for (uint32_t i = 0; i < FIFO_WML; i++) {
      SDHC_DATPORT = p32[i];  // <-------------Writing long words to IO port
    }
    p32 += FIFO_WML;
  }
  m_transferActive = true;
  return true;
}


Whether DMA is used or not seems to depend on the initialization. The same word-by-word copy FROM the hardware is used in SdioCard::readData().

I have also observed that a memcpy from DTCM to EXTMEM can take 30% longer when the SD card is writing from EXTMEM. Perhaps that is because both memcpy() and the SD card are competing for cache resources.
 
My understanding is that it will use DMA by default as long as the source/destination buffers are suitably aligned, which I assume you've taken care of (since otherwise the speed would be reduced and CPU utilization would be increased). I think SdFat may actually take care of this automatically with its use of a cache.
 
The suggestion in post #7 would be an easy change: do the FLUSH before performing the SD WRITE.

If it still fails, then that is known. If there are no more failures in extended use it fixed/hid the issue. Perhaps it will just add some (measurable) slowdown for the flush hiding the issue as the MCU gets its ducks aligned, or perhaps the cache was 'in the way'.
 