USBHost_t36 slow reception of incoming data

mborgerson

I am working on a driver for the FLIR Lepton camera. As part of the testing, I send frames of 38400 bytes to a PC host for display. The driver is running on a custom T3.2 connected to the Lepton. It receives Lepton data over SPI using DMA, and transmits to the PC using the T3.2 USB Serial port. When sending to the PC, the T3.2 reports that the transmission of the frame takes about 49mSec. That's about 764KB/second--which is pretty good considering it is a full-speed link moving data at 12Mbits/second.

I am now trying to make a more portable system where the Lepton/T3.2 will send to a T4.1 for analysis and recording on SD card. (A thermal-imaging trail cam of sorts). The first major issue I have found is that the same firmware on the T3.2 that can send a frame in 49mSec takes about 330mSec when sending to a T4.1 running USBHost_t36. The transmission rate is more than 6 times slower! It seems that the transmitting T3.2 is spending a lot of time waiting for the T4.1 to suck up the incoming bytes. It seems that there must be a bottleneck in the CDC/ACM driver that keeps it from receiving at the maximum potential speed of the T3.2 transmission.

C++:
uint8_t framebuffer[38400];
uint16_t Checkuserial1(void) {
  uint16_t rd, n;
  uint32_t numread;
  uint8_t *buffptr;

  numread = 0;
  buffptr = framebuffer;
  do {
    rd = userial1.available();
    if (rd > 0) {
      LEDON
      // put data from userial1 into frame buffer--checking for overflow
      if ((numread + rd) > sizeof(framebuffer)) buffptr = framebuffer;
      n = userial1.readBytes(buffptr, rd);
      buffptr += n;
      numread += n;
      // delayMicroseconds(1400);
    }
  } while (rd > 0);
 // if(numread)Serial.printf("Read %lu bytes from userial1\n",numread);
  LEDOFF
  return numread;
}

One potential clue is that when the T4.1 is reading data, it reports that userial1 has 128 bytes available (two times the 64-byte USB packet size).
Another possible problem is that I have the T3.2 set up for dual-serial output. Serial is used for command and status comms. USBSerial1 is used only for the transmission of binary data frames.

If anyone has some insight into the reason for the slow read speed of the T4.1, I'd appreciate your input. My experiments with the T4.1 as a receiver of USBTMC (Test and Measurement Class) data show that the T4.1 can reliably read more than 9MB/second (3+ MB/second from each of three devices on three high-speed links), so I don't think the slow read performance with USB Serial full-speed links is an intrinsic hardware problem.
 
Have you tried using the bigbuffer version of the Serial objects, to see if that helps? For higher speeds, we might need to update the code to have more than two USB buffers at a time. For example, the RAWHID code was updated to use 4 RX and TX buffers to handle higher speeds.
 
Have you tried using the bigbuffer version of the Serial objects?
I tried using the BigBuffer version. It didn't help.

After a bit more research, I've come to two conclusions:

1. The availability of just two buffers for transactions may be a limiting factor. Making the buffers larger with BigBuffer doesn't help.

2. The internal code for block transfers is much less efficient in Host Serial than is the case for Serial or USBSerial1.

Here is the code for block transfers for HostSerial:
Code:
// read characters from stream into buffer
// terminates if length characters have been read, or timeout (see setTimeout)
// returns the number of characters placed in the buffer
// the buffer is NOT null terminated.
//

size_t Stream::readBytes(char *buffer, size_t length)
{
        if (buffer == nullptr) return 0;
        size_t count = 0;
        while (count < length) {
                int c = timedRead();
                if (c < 0) {
                        setReadError();
                        break;
                }
                *buffer++ = (char)c;
                count++;
        }
        return count;
}
Since the Host Serial class doesn't directly implement a block read, it uses the inherited readBytes method from the Stream class. This function transfers just one byte at a time.

In contrast, the Serial class has a block read function:

Code:
// read a block of bytes to a buffer
int usb_serial_read(void *buffer, uint32_t size)
{
        uint8_t *p = (uint8_t *)buffer;
        uint32_t count=0;
        NVIC_DISABLE_IRQ(IRQ_USB1);    //  don’t let USB receive interrupt change avail, etc,  while we transfer
        //if (++maxtimes > 15) while (1) ;
        uint32_t tail = rx_tail;
        //printf("usb_serial_read, size=%d, tail=%d, head=%d\n", size, tail, rx_head);
        while (count < size && tail != rx_head) {
                if (++tail > RX_NUM) tail = 0;
                uint32_t i = rx_list[tail];
                uint32_t len = size - count;
                uint32_t avail = rx_count[i] - rx_index[i];
                //printf("usb_serial_read, count=%d, size=%d, i=%d, index=%d, len=%d, avail=%d, c=%c\n",
                //  count, size, i, rx_index[i], len, avail, rx_buffer[i * CDC_RX_SIZE_480]);
                if (avail > len) {
                        // partially consume this packet
                        memcpy(p, rx_buffer + i * CDC_RX_SIZE_480 + rx_index[i], len);
                        rx_available -= len;
                        rx_index[i] += len;
                        count += len;
                } else {
                        // fully consume this packet
                        memcpy(p, rx_buffer + i * CDC_RX_SIZE_480 + rx_index[i], avail);
                        p += avail;
                        rx_available -= avail;
                        count += avail;
                        rx_tail = tail;
                        rx_queue_transfer(i);
                }
        }
        NVIC_ENABLE_IRQ(IRQ_USB1);
        return count;
}

This function transfers a packet of data with a single memcpy() call. Further investigation also shows that the USB Serial class attempts to transfer 512-byte packets and relies on the EHCI hardware or a hub to break up the 512-byte packets into 64-byte packets when connected to a full-speed host.

I still haven't figured out exactly why the USB Serial driver on a T3.2 can send data blocks faster than the Host Serial driver on the T4.1 can accept them. It seems that the extra CPU speed of the T4.1 should make up for the byte-by-byte Host Serial read.

In the end, I may have to implement a block transfer driver using a custom device, as I have done for the USBTMC driver and the Boson camera USB driver. Squeezing all that into a T3.2 will be interesting--but should be possible, as most of the descriptor stuff can be set up as 'constant' and reside in flash memory, along with the extra code to allow it to enumerate.

I may also try adding a native readBlock() function (not inherited from Stream) to the HostSerial class. That might also benefit from more transfer buffers than the two now allocated.
 
Sorry, yes, the read code was implemented with KISS in mind to get it up and running. I meant at some point to go back and do the memcpy version. At the time we were chasing other bugs, so we did not add the complexity of dealing with the boundary condition where the read needs to wrap back around to the start of the buffer, which requires two memcpy operations.

Would be great if you implemented it and did a PR back to Paul...
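
As a rough illustration of the wrap-around case mentioned above, here is a minimal sketch of a block read from a byte ring buffer that uses at most two memcpy calls. The `RingBuffer` struct, its 1024-byte size, and the `readBlock()` name are assumptions made for this example; they are not the actual USBHost_t36 data structures. A real driver would also need to guard against the USB interrupt changing the indices during the copy.

```cpp
#include <algorithm>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical byte ring buffer; names and size are illustrative only,
// not members of the USBHost_t36 serial class.
struct RingBuffer {
    static constexpr size_t kSize = 1024;
    uint8_t data[kSize];
    size_t head = 0;   // next write position
    size_t tail = 0;   // next read position
    size_t count = 0;  // bytes currently stored

    // Append up to len bytes; returns the number actually stored.
    size_t write(const uint8_t *src, size_t len) {
        assert(src != nullptr);
        len = std::min(len, kSize - count);
        size_t first = std::min(len, kSize - head);   // run up to end of storage
        memcpy(data + head, src, first);
        memcpy(data, src + first, len - first);       // wrapped remainder (may be 0)
        head = (head + len) % kSize;
        count += len;
        return len;
    }

    // Block read: copies up to len bytes using at most two memcpy calls.
    size_t readBlock(uint8_t *dst, size_t len) {
        assert(dst != nullptr);
        len = std::min(len, count);
        size_t first = std::min(len, kSize - tail);   // bytes before the wrap point
        memcpy(dst, data + tail, first);
        memcpy(dst + first, data, len - first);       // bytes after the wrap (may be 0)
        tail = (tail + len) % kSize;
        count -= len;
        return len;
    }
};
```

The second memcpy copies zero bytes whenever the requested range does not cross the end of the storage array, so the common case costs one real copy.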
 
I'm having other issues with the HostSerial receive that are puzzling. The T3.2 device sends good frames to my PC, but when I try to read with HostSerial, I get lots of garbage data or no data at all in the destination buffer. I think I may switch over to a simpler block-mode driver. There is also the advantage that the block-mode driver can be set up to be free of memcpy() calls by programming the final destination address into the USB receive queue headers. (Having the hardware writing directly into user memory space probably violates lots of security protocols. A more sophisticated OS might require that all hardware data transfers use dedicated driver (or kernel) memory, then be moved to user memory using a secure messaging system.)

There's also the advantage that you can use the EHCI hardware to break up a 16KB transfer and jam the 64-byte packets into frames and microframes for you. So far, I've only done that between T4.x high-speed devices. It will be interesting to see if the USB interface on the T3.2 has the same capabilities as the EHCI on the T4.x boards.

It seems amazing that the ability of the EHCI to do a single transfer of a block of up to at least 16KBytes directly to a destination address has not been used in any of the current core software---at least as far as I have found. The fact that the EHCI can queue a single large transfer rather than dozens of 64 or 512-byte transfers seems not to have been exploited in the current core routines. It may not be needed for small and intermittent stream-type data, and was probably not possible with earlier hardware.

Another thing I have found very useful is that receiving a short transfer (fewer bytes than requested) stops the transfers and you can set a flag in the handler. In my particular case, I could queue up three 16KB transfers. The camera module would send two 16KB transfers followed by one of 10752 bytes. The short transfer would signal to my host program that the complete frame has been received.
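
The short-transfer trick can be sketched roughly as follows. This is only an illustration: `FrameReceiver` and `onTransferDone()` are hypothetical names, and the sketch models just the bookkeeping a transfer-complete handler might do, not the EHCI queue setup itself.

```cpp
#include <assert.h>
#include <stddef.h>

// Hypothetical receiver state for one frame; not part of any Teensy library.
struct FrameReceiver {
    size_t requested;            // bytes requested per queued transfer
    size_t bytesInFrame = 0;     // running total for the current frame
    bool frameComplete = false;  // set when a short transfer arrives

    explicit FrameReceiver(size_t req) : requested(req) {}

    // Intended to be called from a transfer-complete handler with the
    // number of bytes the hardware actually delivered.
    void onTransferDone(size_t received) {
        assert(received <= requested);
        bytesInFrame += received;
        if (received < requested) {
            frameComplete = true;  // short transfer marks the end of the frame
        }
    }
};
```

With 16KB (16384-byte) requests, feeding this 16384, 16384, and then 10752 bytes sets `frameComplete` only on the third call. One caveat under these assumptions: a frame that is an exact multiple of the request size never produces a short transfer, so the sender would have to follow it with a zero-length packet.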
 
I made an interesting discovery this morning: If I put a powered USB 3.0 hub between the Lepton T3.2 and the Host Serial port of the T4.1, the data upload speed increases dramatically! Without the hub, the upload of a 38400-byte frame takes about 300mSec. With the hub in between, the upload takes about 50mSec--about the same as uploading directly to the PC.

My current hypothesis is that the transaction translator inside the hub is grouping the 64-byte full-speed packets into 512-byte transactions and sending them to the T4.1 at 480Mb/Second. I'm not sure I fully understand what is going on inside the hub, but the increase in data upload speed is consistent. With a frame upload time of 50mSec, I should be able to keep up with the Lepton ~9FPS frame rate for video capture. (The frame rate is limited to 9FPS due to ITAR export restrictions).
 
Sorry, the EHCI implementation stuff was done by @PaulStoffregen; I only know enough to be dangerous.

With USB Serial, I believe it is set up to do 2K writes but 512-byte receives. As for USB Host, I don't remember whether he built in support for larger packets or not.

Hubs: I'm not sure it matters whether it is a USB 3 hub or just one that supports USB 2 High Speed. I remember that earlier Paul needed to make some fixes to support what I believe is called Multi-TT. If I remember correctly, the hubs sort of convert the LS/FS stuff into HS-type packets. I don't know if they convert to 512 bytes or not, or if they use microframes to speed things up. I believe that without Multi-TT support a hub is limited to just one port at a time, but with Multi-TT it can do more. I am pretty foggy on the details.

Sounds like you are making some progress.
 
I did a bit of research on hubs. It seems that USB 3.0 hubs maintain two separate data paths: one for USB 2.0 with 480Mb/Sec capability and one for USB 3.0 with 5Gb/Sec capability. I verified that a USB 2.0 hub gives the same improvement in HostSerial read speeds. With an old, unpowered USB 2.0 hub, a Lepton frame is received in 50mSec. Without the hub, the frame reception takes 300mSec.

Some simple math seems to support the hypothesis that the TT in the hub groups transactions:

Without transaction grouping, the HostSerial, with its two-transaction queue and buffer limit, can only receive 128 bytes per 1mSec USB frame.

38400 bytes / 128 bytes/mSec = 300mSec.

With transaction grouping, HostSerial can receive two 512-byte packets per millisecond.

38400bytes / 1024bytes/mSec = 37.5mSec.
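
The arithmetic above can be reproduced with a small sketch. It assumes, as the hypothesis does, that the receiver accepts a fixed number of packets per 1mSec USB frame; the function name `frameTimeMs` and its parameters are made up for illustration.

```cpp
#include <assert.h>
#include <stdint.h>

// Estimated time in milliseconds to receive one frame, assuming the host
// takes exactly packetsPerMs packets of packetBytes each per 1 ms USB frame.
// This models only the two-buffer hypothesis, not real bus scheduling.
inline double frameTimeMs(uint32_t frameBytes, uint32_t packetBytes, uint32_t packetsPerMs) {
    assert(packetBytes > 0 && packetsPerMs > 0);
    double bytesPerMs = static_cast<double>(packetBytes) * packetsPerMs;
    return static_cast<double>(frameBytes) / bytesPerMs;
}
```

Calling `frameTimeMs(38400, 64, 2)` gives 300 and `frameTimeMs(38400, 512, 2)` gives 37.5, matching the two cases worked out by hand.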

I attribute part of the difference between the observed 50mSec and the theoretical 37.5mSec to the fact that I add some delays between packets. Those delays keep my PC from choking on continuous data input that doesn't allow the PC USB driver time to move data from its own buffers to the destination buffer. There are probably other subtle timing issues involving when the first packets arrive in relation to the start of a USB frame.

For now I'm content to live with an incomplete understanding of the USBHost timing as long as I can keep up with the Lepton 9FPS output.

I verified that yesterday with a T4.1 sketch that collected frames from the Lepton T3.2 and wrote them to a file on the SD card. The SD file, with its series of Lepton frames, was transferred to my PC for processing with MatLab. The MatLab script reads the frames, scales them up to 640x480 pixels with bicubic interpolation to fill in the new pixels, translates the linear temperature values to colors, and writes the frames to a .MP4 file. The movie can be seen at this YouTube location:
There are a few glitches in the movie---possibly due to SD Card long write times or interruption of the data stream by the Lepton Flat-Field-Correction shutter activation.
 