USB Serial Receive Speed Improvement

PaulStoffregen

Well-known member
I'm planning to start working on speed improvements to Teensy USB Serial receive functions, specifically Serial.readBytes() and Serial.readBytesUntil(). These functions have never been well optimized, partly because Arduino's API doesn't include an optimized read() function. This thread on Adafruit's forum is just one example where better speed is needed.

As a first step, I've written a simple benchmark. The code is attached. I'd like to encourage anyone interested in the receive speed to run this benchmark and post their results, and of course subscribe to notifications on this thread.

Running on my Linux desktop, I measured 408 kbytes/sec to Teensy 3.0 and 139 kbytes/sec to Teensy 2.0. Both have 0 volts (no free CPU time) on pin 3.

Much better speed is possible, and I intend to make it happen....
 

Attachments

  • receive_test.zip (11 KB)
Here is my first attempt at speeding up Teensy 3.0. These files go into hardware/teensy/cores/teensy3.

This optimizes Serial.readBytes(), raising it from 408 kbytes/sec using 100% CPU to 1020 kbytes/sec using 20% CPU.
 

Attachments

  • Stream.h (1.9 KB)
  • usb_serial.c (5.4 KB)
  • usb_serial.h (2.8 KB)
Initial test was about 408KB/sec, alternating between 405KB/sec and 410KB/sec, as shown below. This used only 1% of CPU.
Bytes per second = 405510
Bytes per second = 410756
Bytes per second = 405394
Bytes per second = 410908
Bytes per second = 405373

BTW, the first read is higher, such as:
Bytes per second = 1265449

I just saved those files, but I'm not sure what I need to do for Arduino. Can I just replace the files and run Arduino as usual?
Do I need to recompile Arduino? I think I just need to replace the files and upload as usual, but I wanted to check because I hadn't thought through how things work.

-----

nvm, just replaced files and apparently that's it, cause the benchmark changed.

Bytes per second = 2165440
Bytes per second = 790326
Bytes per second = 789411
Bytes per second = 789328
Bytes per second = 789411
Bytes per second = 789432

CPU usage is at 2% now. But I do have quite a few things running during these tests. Just shy of a 100% increase is pretty good. Almost as fast as it sends to the PC.
 
While the test is running, pin 3 indicates how much CPU time is free on the Teensy. The pin is high while Serial.available() indicates there's no data to read, which is time that would be available for a program to do something else... like transmit that data to RGB LED strips!

You can measure pin 3 with a DC voltmeter while running the test. On Teensy 3.0, the range is 0 to 3.3 volts. If you measure 1.65 volts, that means readBytes() is consuming approx 50% of the CPU time and the other half is free. My goal is not merely to achieve fast data rates, but to do so with as much CPU time available as possible on the Teensy, so you can actually do something useful with all the rapidly incoming data.....
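The voltage-to-CPU conversion is simple enough to sketch in a few lines of Python on the PC side (this does not run on the Teensy). The function name `cpu_split` and the `full_scale_volts` parameter are made up for illustration; the second parameter just lets you plug in whatever your meter reads for a constantly high pin instead of the nominal 3.3 volts.

```python
def cpu_split(pin_volts, full_scale_volts=3.3):
    """Return (busy_percent, free_percent) from the averaged pin 3 voltage.

    Pin 3 is high while there is nothing to read, so the fraction of
    full scale is the fraction of CPU time that is free.
    """
    if not 0 <= pin_volts <= full_scale_volts:
        raise ValueError("pin voltage must be between 0 and full scale")
    free = 100.0 * pin_volts / full_scale_volts
    return 100.0 - free, free


if __name__ == "__main__":
    busy, free = cpu_split(1.65)
    print(f"1.65 V -> {busy:.0f}% busy, {free:.0f}% free")
```

A voltmeter averages the pin over time, so this only gives a rough duty cycle, not a precise measurement.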

On your Linux box, this test should not use much CPU time. PCs have very efficient USB host controller chips.

The first line probably measures how much data the kernel accepted into buffers, so it'll show a big number because so much data was "sent" from the program. It's the sustained rate that matters.
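As a rough sketch of how to get the sustained rate out of the benchmark output, the Python below drops the first reading and averages the rest. The helper name `sustained_rate` and the `skip` parameter are hypothetical; the numbers are the readings posted earlier in this thread.

```python
from statistics import mean


def sustained_rate(samples, skip=1):
    """Average the readings after dropping the first `skip` of them.

    The first reading is inflated by data the kernel had already
    accepted into its buffers, so it is left out.
    """
    steady = samples[skip:]
    if not steady:
        raise ValueError("need at least one sample after the skipped ones")
    return mean(steady)


# "Bytes per second" lines copied from the post above.
readings = [2165440, 790326, 789411, 789328, 789411, 789432]
print(f"sustained: {sustained_rate(readings):.0f} bytes/sec")
```

Skipping exactly one reading is an assumption that matches the output seen here; a different PC or buffer size might need a larger `skip`.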
 
2.842 volts, which should be 14% CPU usage, 86% free. My 3.3 volt reading is 3.265, so it looks like it's more like:

13% usage/87% free

I'll have to re-do the initial test, but I need to replace those files again.
 
This is very nice :) Will test these out later when I get a moment as I use the USB Serial a fair bit :)
 
Here are files to add the readBytes() optimization to Teensy 2.0.

usb_api.cpp and usb_api.h go into hardware/teensy/cores/usb_serial and Stream.h goes into hardware/teensy/cores/teensy.

I measured Teensy 2.0 at approx 1 Mbyte/sec. However, there's almost no free CPU time (as indicated by pin 3). The 8 bit CPU at only 16 MHz is just barely able to manage the incoming packets and copy the data at that speed.
 

Attachments

  • usb_api.cpp (11.6 KB)
  • usb_api.h (943 bytes)
  • Stream.h (1.9 KB)
I'd like to encourage anyone interested in the receive speed to run this benchmark and post their results, and of course subscribe to notifications on this thread.
Using a Teensy 3, on Linux, with the files from your second post.

Using a USB 2 port:
Code:
Bytes per second = 999101
Bytes per second = 1000033
Bytes per second = 1004184
Bytes per second = 999900
Bytes per second = 1004184
...
Mean: 997610,1
StDev: 56003,3
Measured a duty cycle of ~81.5% on pin three.

Using a USB 3 port:
Code:
Bytes per second = 1151587
Bytes per second = 1148765
Bytes per second = 1151631
Bytes per second = 1150704
Bytes per second = 1148765
...
Mean: 1152191,7
StDev: 57978,9
Measured a duty cycle of ~79.5% on pin three. This seems to exceed the actual speed of the serial port? :O That's pretty amazing.

Every 19th high pulse seems to be longer than the others. I'm not sure what caused this though, this periodic behaviour was not seen in my measurement results. (See attachment)
 

Attachments

  • periodic.jpg (30.7 KB)
Bytes per second = 1148765
Bytes per second = 1151631
Bytes per second = 1150704
Bytes per second = 1148765

Wow, that's the closest to the theoretical maximum (1216000) I've ever seen!
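For anyone wondering where 1216000 comes from, here is a small Python sketch of the arithmetic. It assumes the commonly cited full-speed USB limit of 19 bulk packets of 64 bytes per 1 ms frame (which fits the every-19th-pulse observation above); the constant and function names are just for illustration.

```python
MAX_PACKET_BYTES = 64      # full-speed bulk endpoint maximum packet size
PACKETS_PER_FRAME = 19     # assumed limit of 64-byte bulk packets per frame
FRAMES_PER_SECOND = 1000   # one frame every 1 ms


def bulk_ceiling():
    """Theoretical maximum bulk throughput in bytes per second."""
    return MAX_PACKET_BYTES * PACKETS_PER_FRAME * FRAMES_PER_SECOND


def percent_of_ceiling(rate):
    """How close a measured rate (bytes/sec) comes to the ceiling."""
    return 100.0 * rate / bulk_ceiling()


print(bulk_ceiling())
print(f"{percent_of_ceiling(1151587):.1f}% of the theoretical maximum")
```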

Every 19th high pulse seems to be longer than the others. I'm not sure what caused this though, this periodic behaviour was not seen in my measurement results. (See attachment)

Those are the USB start-of-frame events. Near the end of each frame, the host controller allows the USB line to become idle so it can send the SOF packet at precisely 1 ms intervals. During that time, Teensy 3.0 is just waiting for more data to arrive, so pin 3 is high longer.
 
I'm working on another improvement that dramatically speeds up Serial.available(). Initial testing shows the speed of the commonly used Serial.available() and Serial.read() is about 680 kbytes/sec. Previously it was only 290 kbytes/sec.

Will post the code soon...
 
A megabyte per second is very good, you can send reasonable quality (MPEG2 or MPEG4 compressed) video at that rate...
 
Paul,

Thank you for posting these benchmarks. I have a pseudo-hijack question/follow-up to this that I figured I'd post here quickly, rather than make a new thread. When I bought the Teensy 2.0 I sort of ignored the USB serial libraries (non-Arduino only) because I was going on the (possibly false) assumption that the rawhid library was going to be the best way to get throughput over a USB cable. Likely this was just my ignorance of what is possible with USB serial, as I was equating "USB serial" with "USB UART" and thinking that meant slow by USB standards, since it would be limited to RS-232 serial baud rates (I now believe that is not true). Now that I see these benchmarks, and see that rawhid runs at 64k bytes a second by default, I'm wondering what the best method is for fast communication on the Teensy 2.0 (again, not using Arduino). I am going for fast-as-possible receiving of analog sensor data from the Teensy on a host (both Win7 and Linux) that will be receiving USB communication through a hub, so it still has to share bandwidth with other devices.

Thanks again!
 
The advantage of USB HID is, you are getting guaranteed bandwidth.

USB serial is using a different USB transfer mode (bulk transfers), which has a lower priority, but can use the entire USB bandwidth. But, if you have other devices hogging the bus temporarily, you may need to throw away data on the Teensy (when your buffers overflow).

You can theoretically have multiple HID interfaces on the same device to get around the HID bandwidth limit, but I don't know how easy that would be with the Teensy.
 
Thanks tni, good to know. I just skipped over all the USB serial stuff as soon as I saw the word "baud" :( I did consider trying to open up the HID bandwidth, but gave up on that when I saw the rawhid code. I guess I should be thankful for the lesson I have gotten from being able to learn both HID and USB serial :p
 
The readBytes() optimization, which allows for fast PC-to-Teensy transfer, is only in the newer Arduino code. The older C-only code has not been updated, and probably never will be.

Someday I'm going to make a newer C-only download which is built directly from the Arduino code (only without the rest of Arduino). You can get that now by just installing Teensyduino and then deleting the Arduino IDE.

In USB serial, the baud rate is never used for actual communication. It's just a 32 bit number that's passed from the PC side to Teensy. You can read it and use the number to set the speed of an actual UART on Teensy. But it's never actually used for the USB communication. The data always moves as quickly as USB bulk protocol can transfer it.

USB interrupt protocol does reserve bandwidth, but it's not like a carpool lane on a highway where capacity goes unused when others are waiting in the other slow lanes. When something like HID doesn't have data to transfer, its unused bandwidth automatically becomes available for bulk and control protocols.
 
Paul, thanks so much for answering my question, and sorry to pseudo-hijack the thread. As of last night I've updated my code to use usb_serial (of course I won't go into how that didn't help my project much, because I didn't realize that the analog clock on the mega only samples in the kHz range, so it made no difference lol). I did see the usb_serial speed difference though. Thanks!
 
The advantage of USB HID is, you are getting guaranteed bandwidth.

USB serial is using a different USB transfer mode (bulk transfers), which has a lower priority, but can use the entire USB bandwidth. But, if you have other devices hogging the bus temporarily, you may need to throw away data on the Teensy (when your buffers overflow).

Can you explain something to me, please? In Paul's benchmark, Windows is the fastest OS for bulk transfers. I can't understand how Windows can send up to 4008 bytes per second! Linux and Mac don't send byte by byte; they buffer the data, so it is not real time. Microsoft chose to be slow but to send each byte (1 byte per transfer) as quickly as possible once the system gets the "send" command. Is that right?
What I don't understand is:
- if a USB HID device can do up to 1000 transfers each second, how is it possible to make 4008 transfers each second with USB serial?
 
I'm not an expert on this, but my understanding is that HID mode throttles the data rate down to 64 bytes per packet, 1000 packets a second. That is well below the bandwidth that full speed USB is capable of, which is why HID can guarantee you will get the full data rate (i.e. if you plug in more HID devices they will all go at that same rate, guaranteed up to a certain number of devices). HID devices also have a higher priority than bulk devices, so they get their bandwidth first. Bulk devices can eat up as much bandwidth as they want, but USB doesn't guarantee it, so one second they might have the entire bus, and the next second a bunch of HID devices could be waiting to send data and the bulk transfer gets bumped down to a small data rate. USB serial uses bulk, so it can use the full bandwidth of USB (in this case full speed USB) if no higher priority HID devices are there.

Please correct if I'm wrong, this is just conversational knowledge --
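To put numbers on the figures quoted in this post, here is a small Python sketch. It takes "64 bytes per packet, 1000 packets a second" at face value and compares it with the 1020 kbytes/sec readBytes() measurement reported earlier in the thread; the function names are hypothetical.

```python
def interrupt_rate(packet_bytes=64, packets_per_second=1000):
    """Bytes per second for one endpoint polled once per 1 ms frame."""
    return packet_bytes * packets_per_second


def bulk_advantage(bulk_bytes_per_second):
    """How many times faster a measured bulk rate is than that endpoint."""
    return bulk_bytes_per_second / interrupt_rate()


print(interrupt_rate())                   # rawhid-style rate, bytes/sec
print(f"{bulk_advantage(1020000):.1f}x")  # measured bulk rate vs. that
```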
 
Can you explain something to me, please? In Paul's benchmark, Windows is the fastest OS for bulk transfers.

On this benchmark, Windows is slightly faster for very large write sizes, but much slower for small writes.

I can't understand how Windows can send up to 4008 bytes per second! Linux and Mac don't send byte by byte; they buffer the data, so it is not real time. Microsoft chose to be slow but to send each byte (1 byte per transfer) as quickly as possible once the system gets the "send" command. Is that right?
What I don't understand is:
- if a USB HID device can do up to 1000 transfers each second, how is it possible to make 4008 transfers each second with USB serial?

USB Serial is not HID protocol. It uses USB "bulk" transfer, not "interrupt" transfer as HID does. The bulk transfer type is able to allocate all the USB bandwidth which isn't used by other devices.

So in theory, if there aren't other USB devices using substantial bandwidth (as was the case in these tests), speeds of approximately 1.0 to 1.2 Mbyte/sec should be possible. In practice, all 3 operating systems are able to achieve close to this speed when given large blocks of data to transmit. When given the same data in smaller blocks, each operating system is dramatically different in its ability to use USB efficiently.
 
Please correct if I'm wrong, this is just conversational knowledge --

You're actually pretty close.

The host controller chip allocates bandwidth on a 1ms frame basis. It has queues for all 4 types of transactions, so when the PC wants to move data, it puts a transaction in one of the queues. A transaction can be anywhere from 0 to 65535 bytes. The host controller automatically uses multiple packets for transfers bigger than the device's maximum packet size.
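The packet splitting can be pictured with a short Python sketch. This is only a model of the idea, not host controller code: `split_into_packets` is a made-up name, 64 bytes is the full-speed bulk packet size used elsewhere in this thread, and details such as terminating zero-length packets after an exact multiple of 64 are not modeled.

```python
def split_into_packets(data, max_packet=64):
    """Split one transfer into packets no larger than max_packet bytes."""
    if not data:
        # A zero-byte transfer still goes out as a single empty packet.
        return [b""]
    return [data[i:i + max_packet] for i in range(0, len(data), max_packet)]


# A 150-byte transfer becomes two full packets plus a short one.
sizes = [len(p) for p in split_into_packets(bytes(150))]
print(sizes)
```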

At the beginning of each frame, the host controller cycles through pending control and bulk transfers. At a configurable point in the frame, it begins doing isochronous and interrupt transfers. When all of those are done, it goes back to control and bulk for the remainder of the frame. Shortly before the end of the frame, it allows some bus idle time, so it can transmit the start-of-frame token at precisely 1ms intervals.

There's one more important detail. The isochronous and interrupt transfers can have a polling interval, which is the number of frames between each time the host controller tries a transaction. So a device can use an interval of 64 to get checked only once every 64 frames. That saves a little bandwidth for slow devices that don't need to transfer much data, because the device isn't checked in 63 of every 64 frames. But the IN/NAK tokens are small, so even if it's polled every frame, relatively little bandwidth is used when no data is moving. The rest of the reserved isochronous+interrupt transfer time gets used for control+bulk, until the end of the frame.
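A toy Python model of the polling interval, just to show the 1-in-64 idea. Real host controllers schedule this in hardware and may pick a different starting frame; the `polled_frames` helper and the "frame number divisible by the interval" rule are simplifying assumptions.

```python
def polled_frames(interval, total_frames):
    """Frame numbers in which an endpoint with this interval is polled.

    Assumes polling starts at frame 0 and repeats every `interval` frames.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1 frame")
    return [f for f in range(total_frames) if f % interval == 0]


# Over 640 frames (640 ms), an interval-64 endpoint is polled 10 times,
# while an interval-1 endpoint is polled in every frame.
print(len(polled_frames(64, 640)), len(polled_frames(1, 640)))
```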

The PC "reserves" bandwidth for isochronous and interrupt transfers by configuring how soon in the frame the controller switches to doing the isochronous and interrupt transfers. If none are pending for that particular frame, then the entire frame is used for control+bulk. Even if a HID device like a mouse has reserved bandwidth, if it answers the IN token with a NAK token, the host controller quickly moves on to whatever else is pending.

So when other USB devices aren't using much bandwidth, bulk can transfer quite a lot of data!
 
Thank you Paul.
Yes, I meant that on large data, Windows is faster.
For each byte, I suppose, Windows waits until the byte/packet has been sent before sending another one. So it is normal that it is not as fast as Mac or Linux.
But if I understood the concept:
- On Windows, USB serial can be up to 4 times faster than USB HID if you send only a few bytes in each packet, but the speed is not guaranteed.
-> But this means that if you have enough bandwidth, you can send up to 1 byte and also 24 bytes(-packet) each 250ms. Is that right?

I ask because I always thought that USB RAW HID was the fastest communication protocol.
 
Yes, I meant that on large data, Windows is faster.

On the large write size test, from the computer to Teensy3, Windows was slightly faster.

But it's important to keep a sense of perspective. Windows measured 983 kbytes/sec. Mac measured 960. So Windows was only 2.4% faster, and only on this one case where unusually large block sizes were used.
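The 2.4% figure is just the relative difference between the two measurements. A one-function Python check, with `percent_faster` as a made-up helper name:

```python
def percent_faster(rate_a, rate_b):
    """How much faster rate_a is than rate_b, as a percentage of rate_b."""
    return 100.0 * (rate_a - rate_b) / rate_b


# Windows (983 kbytes/sec) versus Mac (960 kbytes/sec), large writes.
print(f"{percent_faster(983, 960):.1f}%")
```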

Windows was dramatically slower for small write sizes.


For each byte, I suppose, Windows waits until the byte/packet has been sent before sending another one. So it is normal that it is not as fast as Mac or Linux.

It's difficult to say with certainty exactly why each system performs the way it does. One thing is pretty obvious, from some additional work I've done watching the USB packets with a protocol analyzer. Macintosh is able to combine the small writes into 64 byte packets. Linux and Windows do not. Each write appears to be sent to the USB host controller as-is.
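To see why combining small writes matters, here is a toy Python model that counts packets both ways. It is not how any of these drivers are actually implemented (real behavior depends on timing and buffering); the function names are hypothetical, and 64 bytes is the full-speed bulk packet size.

```python
import math


def packets_without_coalescing(write_sizes, max_packet=64):
    """Each write goes out as-is, split only if it exceeds one packet.

    A zero-byte write is counted as one empty packet.
    """
    return sum(max(1, math.ceil(n / max_packet)) for n in write_sizes)


def packets_with_coalescing(write_sizes, max_packet=64):
    """All pending writes are packed back to back into full packets."""
    if not write_sizes:
        return 0
    return max(1, math.ceil(sum(write_sizes) / max_packet))


one_byte_writes = [1] * 1000
print(packets_without_coalescing(one_byte_writes))  # one packet per write
print(packets_with_coalescing(one_byte_writes))     # packed into 64s
```

With 1000 one-byte writes the model needs 1000 packets sent as-is but only 16 when packed, which is the kind of gap that shows up on a bus limited to a fixed number of packets per frame.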


But if I understood the concept:
- On Windows, USB serial can be up to 4 times faster than USB HID if you send only a few bytes in each packet, but the speed is not guaranteed.
-> But this means that if you have enough bandwidth, you can send up to 1 byte and also 24 bytes(-packet) each 250ms. Is that right?

I ask because I always thought that USB RAW HID was the fastest communication protocol.

I'm having a difficult time making sense of these statements and questions.

First of all, this benchmark only tested actual performance with USB serial. The complete source code is published, and the PC side includes a pre-compiled copy for Windows (since getting a working compiler on Windows is much harder than it is on Linux and Mac). You can run the benchmark on your own computer. Perhaps you ought to give it a try?

In theory, USB serial (using USB's "bulk" transfer type) ought to be able to use all of the available USB bandwidth (any bandwidth not consumed by other USB devices) to transfer data as rapidly as possible. That's how USB is supposed to work, in theory. But in practice, only Macintosh has USB serial drivers good enough to come close to achieving this over a wide range of conditions.
 