USB Serial Receive Speed Improvement



PaulStoffregen
05-04-2013, 09:24 PM
I'm planning to start working on speed improvements to Teensy USB Serial receive functions, specifically Serial.readBytes() and Serial.readBytesUntil(). These functions have never been well optimized, partly because Arduino's API doesn't include an optimized read() function. This thread on Adafruit's forum (http://forums.adafruit.com/viewtopic.php?f=24&t=39368&start=15) is just one example where better speed is needed.

As a first step, I've written a simple benchmark. The code is attached. I'd like to encourage anyone interested in the receive speed to run this benchmark and post their results, and of course subscribe to notifications on this thread.

Running on my Linux desktop, I measured 408 kbytes/sec to Teensy 3.0 and 139 kbytes/sec to Teensy 2.0. Both have 0 volts (no free CPU time) on pin 3.

Much better speed is possible, and I intend to make it happen....

PaulStoffregen
05-05-2013, 12:11 AM
Here is my first attempt at speeding up Teensy 3.0. These files go into hardware/teensy/cores/teensy3.

This speeds up Serial.readBytes() from 408 kbytes/sec using 100% CPU to 1020 kbytes/sec using 20% CPU.

linuxgeek
05-05-2013, 12:24 AM
Will try it out later when I can.

linuxgeek
05-05-2013, 05:38 AM
Initial test was about 408KB/sec, alternating between 405KB/sec & 410KB/sec. Like below. This only used 1% of CPU.
Bytes per second = 405510
Bytes per second = 410756
Bytes per second = 405394
Bytes per second = 410908
Bytes per second = 405373

BTW, the first read is higher, such as:
Bytes per second = 1265449

I just saved those files, but I'm not sure what I need to do for arduino. Can I just replace the files and run arduino as usual?
Do I need to recompile arduino? I think I just need to replace the files and upload as usual, but I wanted to check because I hadn't thought it through before of how things work.

-----

nvm, just replaced files and apparently that's it, cause the benchmark changed.

Bytes per second = 2165440
Bytes per second = 790326
Bytes per second = 789411
Bytes per second = 789328
Bytes per second = 789411
Bytes per second = 789432

CPU usage is at 2% now. But I do have quite a few things running during these tests. Just shy of a 100% increase is pretty good. Almost as fast as it sends to the PC.

PaulStoffregen
05-05-2013, 03:11 PM
While the test is running, pin 3 indicates how much CPU time is free on the Teensy. The pin is high while Serial.available() indicates there's no data to read, which is time that would be available for a program to do something else... like transmit that data to RGB LED strips!

You can measure pin 3 with a DC voltmeter while running the test. On Teensy 3.0, the range is 0 to 3.3 volts. If you measure 1.65 volts, that means readBytes() is consuming approx 50% of the CPU time and the other half is free. My goal is not merely to achieve fast data rates, but to do so with as much CPU time available as possible on the Teensy, so you can actually do something useful with all the rapidly incoming data.....

On your Linux box, this test should not use much CPU time. PCs have very efficient USB host controller chips.

The first line probably measures how much data the kernel accepted into buffers, so it'll show a big number because so much data was "sent" from the program. It's the sustained rate that matters.

linuxgeek
05-06-2013, 06:10 AM
2.842 volts, which should be 14% CPU usage, 86% free. My 3.3 volt reading is 3.265, so it looks like it's more like:

13% usage/87% free

I'll have to re-do the initial test, but I need to replace those files again.
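The pin 3 readings discussed above can be turned into a CPU figure with a small calculation. This is only a sketch of the arithmetic, assuming the voltmeter averages the pin's duty cycle linearly; the function names are made up for illustration and are not part of the benchmark.

```cpp
#include <cassert>

// Fraction of CPU time left free, estimated from the averaged pin 3 voltage.
//   v_pin:  DC voltmeter reading on pin 3 while the benchmark runs
//   v_high: reading when the pin is held high (nominally 3.3 V on Teensy 3.0)
// Assumption: the meter reading is proportional to the pin's duty cycle.
double free_cpu_fraction(double v_pin, double v_high) {
    assert(v_high > 0.0 && v_pin >= 0.0 && v_pin <= v_high);
    return v_pin / v_high;
}

// Share of CPU time spent receiving data, as a percentage.
double cpu_usage_percent(double v_pin, double v_high) {
    return 100.0 * (1.0 - free_cpu_fraction(v_pin, v_high));
}
```

With the 1.65 V example, cpu_usage_percent(1.65, 3.3) gives 50%. For the 2.842 V reading, the nominal 3.3 V full scale gives roughly 14% usage, and the measured 3.265 V full scale gives roughly 13%, matching the numbers in the post.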

Madox
05-06-2013, 11:56 AM
This is very nice :) Will test these out later when I get a moment as I use the USB Serial a fair bit :)

PaulStoffregen
05-06-2013, 07:10 PM
Here are files to add the readBytes() optimization to Teensy 2.0.

usb_api.c and usb_api.h go into hardware/teensy/cores/usb_serial and Stream.h goes into hardware/teensy/cores/teensy.

I measured Teensy 2.0 at approx 1 Mbyte/sec. However, there's almost no free CPU time (as indicated by pin 3). The 8 bit CPU at only 16 MHz is just barely able to manage the incoming packets and copy the data at that speed.

iwanders
05-07-2013, 04:42 PM
I'd like to encourage anyone interested in the receive speed to run this benchmark and post their results, and of course subscribe to notifications on this thread.
Using a Teensy 3, on linux, with the files from your second post.

Using a USB 2 port:

Bytes per second = 999101
Bytes per second = 1000033
Bytes per second = 1004184
Bytes per second = 999900
Bytes per second = 1004184
...
Mean: 997610,1
StDev: 56003,3
Measured a duty cycle of ~81.5% on pin three.

Using a USB 3 port:

Bytes per second = 1151587
Bytes per second = 1148765
Bytes per second = 1151631
Bytes per second = 1150704
Bytes per second = 1148765
...
Mean: 1152191,7
StDev: 57978,9
Measured a duty cycle of ~79.5% on pin three. This seems to exceed the actual speed of the serial port? :O That's pretty amazing.

Every 19th high pulse seems to be longer than the others. I'm not sure what caused this though, this periodic behaviour was not seen in my measurement results. (See attachment)

PaulStoffregen
05-07-2013, 06:59 PM
Bytes per second = 1148765
Bytes per second = 1151631
Bytes per second = 1150704
Bytes per second = 1148765


Wow, that's the closest to the theoretical maximum (1216000) I've ever seen!



Every 19th high pulse seems to be longer than the others. I'm not sure what caused this though, this periodic behaviour was not seen in my measurement results. (See attachment)

Those are the USB start-of-frame events. Near the end of each frame, the host controller allows the USB line to become idle so it can send the SOF packet at precisely 1 ms intervals. During that time, Teensy 3.0 is just waiting for more data to arrive, so pin 3 is high longer.
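The 1216000 figure quoted above can be reproduced from full-speed USB framing. The sketch below assumes 64-byte bulk packets, 1000 frames per second, and at most 19 such packets per frame; the packet limit comes from the USB 2.0 specification rather than from this thread, and the names are illustrative only.

```cpp
#include <cassert>

// Assumed full-speed USB figures (not stated explicitly in the thread):
constexpr int kFramesPerSecond = 1000;       // one frame per 1 ms SOF interval
constexpr int kMaxPacketBytes = 64;          // full-speed bulk packet payload
constexpr int kMaxBulkPacketsPerFrame = 19;  // limit listed in the USB 2.0 spec

// Bytes per second when every frame carries the given number of full packets.
constexpr long bulk_bytes_per_second(int packets_per_frame) {
    return static_cast<long>(packets_per_frame) * kMaxPacketBytes * kFramesPerSecond;
}

// Fraction of the theoretical ceiling that a measured rate achieves.
double fraction_of_max(long measured_bytes_per_second) {
    assert(measured_bytes_per_second >= 0);
    return static_cast<double>(measured_bytes_per_second) /
           bulk_bytes_per_second(kMaxBulkPacketsPerFrame);
}
```

bulk_bytes_per_second(19) is 1216000. For comparison, 18 packets per frame would be 1152000, which is close to the USB 3 port results above, so those runs may be averaging about one packet short of the limit per frame.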

PaulStoffregen
05-11-2013, 11:21 PM
I'm working on another improvement that dramatically speeds up Serial.available(). Initial testing shows the speed of the commonly used Serial.available() and Serial.read() is about 680 kbytes/sec. Previously it was only 290 kbytes/sec.

Will post the code soon...

JBeale
05-12-2013, 08:49 AM
A megabyte per second is very good, you can send reasonable quality (MPEG2 or MPEG4 compressed) video at that rate...

PaulStoffregen
06-03-2013, 12:27 AM
I published all this benchmark stuff on the website today.

http://www.pjrc.com/teensy/benchmark_usb_serial_receive.html

andersod2
08-11-2013, 09:29 AM
Paul,

Thank you for posting these benchmarks. I have a pseudo-hijack question/follow-up to this that I figured I'd post here quickly, rather than make a new thread. When I bought the Teensy 2.0 I sort of ignored the USB serial libraries (non-Arduino only) because I was going on the (possibly false) assumption that the rawhid library was going to be the best way to get throughput over a USB cable. Likely this was just my ignorance of what is possible with USB serial, as I was equating "USB serial" with "USB UART" and thinking that meant slow by USB standards, since it would be limited to RS-232 serial baud rates (I now believe that is not true). Now that I see these benchmarks, and seeing that rawhid runs at 64k bytes a second by default, I'm wondering what the best method is for fast communication on the Teensy 2.0 (again, not using Arduino). I am going for fast-as-possible receiving of analog sensor data from the Teensy on a host (both Win7 and Linux) that will be receiving USB communication on a hub, so it still has to share bandwidth with other devices.

Thanks again!

tni
08-11-2013, 10:21 AM
The advantage of USB HID is, you are getting guaranteed bandwidth.

USB serial is using a different USB transfer mode (bulk transfers), which has a lower priority, but can use the entire USB bandwidth. But, if you have other devices hogging the bus temporarily, you may need to throw away data on the Teensy (when your buffers overflow).

You can theoretically have multiple HID interfaces on the same device to get around the HID bandwidth limit, but I don't know how easy that would be with the Teensy.

andersod2
08-11-2013, 11:10 AM
Thanks tni, good to know. I just skipped over all the USB serial stuff as soon as I saw the word "baud" :( I did consider trying to open up the HID bandwidth, but gave up on that when I saw the rawhid code. I guess I should be thankful for the lesson I got from having to learn both HID and USB serial :P

PaulStoffregen
08-11-2013, 04:25 PM
The readBytes() optimization, which allows for fast PC-to-Teensy transfer, is only in the newer Arduino code. The older C-only code has not been updated, and probably never will be.

Someday I'm going to make a newer C-only download which is built directly from the Arduino code (only without the rest of Arduino). You can get that now by just installing Teensyduino and then deleting the Arduino IDE.

In USB serial, the baud rate is never used for actual communication. It's just a 32 bit number that's passed from the PC side to Teensy. You can read it and use the number to set the speed of an actual UART on Teensy. But it's never actually used for the USB communication. The data always moves as quickly as USB bulk protocol can transfer it.
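To make the "just a 32 bit number" point concrete, here is a sketch of decoding the 7-byte line coding block that a host sends to a CDC serial device. The field layout is taken from the USB CDC specification, not from this thread or the Teensy source, and the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Little-endian 32-bit baud value, as carried in the first four bytes.
constexpr std::uint32_t baud_from_le_bytes(std::uint8_t b0, std::uint8_t b1,
                                           std::uint8_t b2, std::uint8_t b3) {
    return static_cast<std::uint32_t>(b0) |
           (static_cast<std::uint32_t>(b1) << 8) |
           (static_cast<std::uint32_t>(b2) << 16) |
           (static_cast<std::uint32_t>(b3) << 24);
}

// Hypothetical holder for the decoded line coding fields.
struct LineCoding {
    std::uint32_t baud;      // dwDTERate: requested bits per second
    std::uint8_t stop_bits;  // bCharFormat: 0 = 1 stop bit, 1 = 1.5, 2 = 2
    std::uint8_t parity;     // bParityType: 0 = none, 1 = odd, 2 = even
    std::uint8_t data_bits;  // bDataBits: 5, 6, 7, 8 or 16
};

// Decode the raw 7-byte block. Nothing here changes the USB transfer speed;
// the values only matter if they are forwarded to a real UART.
LineCoding decode_line_coding(const std::uint8_t raw[7]) {
    assert(raw != nullptr);
    LineCoding lc;
    lc.baud = baud_from_le_bytes(raw[0], raw[1], raw[2], raw[3]);
    lc.stop_bits = raw[4];
    lc.parity = raw[5];
    lc.data_bits = raw[6];
    return lc;
}
```

For example, the bytes 00 C2 01 00 decode to 115200. A device could pass that value to a hardware UART, but the USB side moves data at bulk speed whatever the number is.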

USB interrupt protocol does reserve bandwidth, but it's not like a carpool lane on a highway where capacity goes unused when others are waiting in the other slow lanes. When something like HID doesn't have data to transfer, its unused bandwidth automatically becomes available for bulk and control protocols.

Xeon
08-11-2013, 05:17 PM
Hmm handy stuff u got brewing here.
Thanks Paul.

andersod2
08-11-2013, 11:29 PM
Paul, thanks so much for answering my question, and sorry to pseudo-hijack the thread. As of last night I've updated my code to use usb_serial (of course I won't go into how that didn't help my project much because I didn't realize that the analog clock on the mega only samples in kHz range, so it made no difference lol). I did see the usb_serial speed difference though. Thanks!

Adriano
08-14-2013, 05:40 PM
The advantage of USB HID is, you are getting guaranteed bandwidth.

USB serial is using a different USB transfer mode (bulk transfers), which has a lower priority, but can use the entire USB bandwidth. But, if you have other devices hogging the bus temporarily, you may need to throw away data on the Teensy (when your buffers overflow).


Can you explain something to me, please? In Paul's benchmark, Windows is the fastest OS for bulk transfers. I can't understand how Windows can send up to 4008 bytes per second! Linux and Mac don't send byte by byte; they buffer the data, so it is not real time. Microsoft chose to be slow but to send each byte (1 byte per transfer) as fast as possible once the system gets the "send" command. Is that right?
What I don't understand is:
- if a USB HID device can do up to 1000 transfers per second, how is it possible to make 4008 transfers per second with USB serial?

andersod2
08-14-2013, 07:18 PM
I'm not an expert on this, but my understanding is that HID mode is throttling the data rate down to 64 bytes per packet, 1000 packets a second. That is well below the bandwidth that full speed USB is capable of - which is why HID can guarantee you will get the full data rate (i.e. if you plug in more HID devices they will all go that same rate, guaranteed up to a certain number of devices). HID devices are also a higher priority than bulk devices so they also will get their bandwidth first. Bulk devices can eat up as much bandwidth as they want, but USB doesn't guarantee it, so one second they might have the entire bus, and the next second, a bunch of HID devices could be waiting to send data and the bulk transfer gets bumped down to a small data rate. So usb serial is using bulk, so it can use the full data bandwidth of USB (in this case full speed USB) if no other higher priority HID devices are there.

Please correct if I'm wrong, this is just conversational knowledge --

PaulStoffregen
08-14-2013, 11:34 PM
Can you explain something to me, please? In Paul's benchmark, Windows is the fastest OS for bulk transfers.


On this benchmark (http://www.pjrc.com/teensy/benchmark_usb_serial_receive.html), Windows is slightly faster for very large write sizes, but much slower for small writes.



I can't understand how Windows can send up to 4008 bytes per second! Linux and Mac don't send byte by byte; they buffer the data, so it is not real time. Microsoft chose to be slow but to send each byte (1 byte per transfer) as fast as possible once the system gets the "send" command. Is that right?
What I don't understand is:
- if a USB HID device can do up to 1000 transfers per second, how is it possible to make 4008 transfers per second with USB serial?

USB Serial is not HID protocol. It uses USB "bulk" transfer, not "interrupt" transfer as HID does. The bulk transfer type is able to allocate all the USB bandwidth which isn't used by other devices.

So in theory, if there aren't other USB devices using substantial bandwidth (as was the case in these tests), speeds of approximately 1.0 to 1.2 Mbyte/sec should be possible. In practice, all 3 operating systems are able to achieve close to this speed when given large blocks of data to transmit. When given the same data in smaller blocks, each operating system is dramatically different in its ability to use USB efficiently.

PaulStoffregen
08-15-2013, 12:04 AM
Please correct if I'm wrong, this is just conversational knowledge --

You're actually pretty close.

The host controller chip allocates bandwidth on a 1ms frame basis. It has queues for all 4 types of transactions, so when the PC wants to move data, it puts a transaction in one of the queues. A transaction can be anywhere from 0 to 65535 bytes. The host controller automatically uses multiple packets for transfers bigger than the device's maximum packet size.
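The packet splitting described above can be sketched as a small calculation. This assumes a 64-byte maximum packet size, as used elsewhere in this thread, and ignores any extra zero-length packet a driver may append when a transfer is an exact multiple of the packet size. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Number of packets the host controller needs for one transfer.
// An empty transfer is still sent as a single zero-length packet.
std::size_t packets_for_transfer(std::size_t transfer_bytes, std::size_t max_packet) {
    assert(max_packet > 0);
    if (transfer_bytes == 0) return 1;
    return (transfer_bytes + max_packet - 1) / max_packet;  // round up
}

// Payload size of the final packet of the transfer.
std::size_t last_packet_bytes(std::size_t transfer_bytes, std::size_t max_packet) {
    assert(max_packet > 0);
    if (transfer_bytes == 0) return 0;
    const std::size_t remainder = transfer_bytes % max_packet;
    return remainder == 0 ? max_packet : remainder;
}
```

A 65535-byte transfer to a device with 64-byte packets works out to 1024 packets, the last one carrying 63 bytes. At no more than 19 bulk packets per 1 ms frame (a USB 2.0 specification figure), that is at least 54 frames on the wire.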

At the beginning of each frame, the host controller cycles through pending control and bulk transfers. At a configurable point in the frame, it begins doing isochronous and interrupt transfers. When all of those are done, it goes back to control and bulk for the remainder of the frame. Shortly before the end of the frame, it allows some bus idle time, so it can transmit the start-of-frame token at precisely 1ms intervals.

There's one more important detail. The isochronous and interrupt transfers can have a polling interval, which is the number of frames between each time the host controller tries a transaction. So a device can use an interval of 64 to only get checked once every 64 frames. That saves a little bandwidth for slow devices that don't need to transfer much data, because the device isn't checked in 63 of every 64 frames. But the IN/NAK tokens are small, so even if it's polled every frame, relatively little bandwidth is used when no data is moving. The rest of the reserved isochronous+interrupt transfer time gets used for control+bulk, until the end of the frame.

The PC "reserves" bandwidth for isochronous and interrupt transfers by configuring how soon in the frame the controller switches to doing the isochronous and interrupt transfers. If none are pending for that particular frame, then the entire frame is used for control+bulk. Even if a HID device like a mouse has reserved bandwidth, if it answers the IN token with a NAK token, the host controller quickly moves on to whatever else is pending.

So when other USB devices aren't using much bandwidth, bulk can transfer quite a lot of data!

Adriano
08-15-2013, 09:09 AM
Thank you Paul.
Yes, I meant that on large data, Windows is faster.
For each byte, I suppose Windows waits until the byte/packet has been sent before sending another one. So it is normal that it is not as fast as Mac or Linux.
But if I understood the concept:
- On Windows, USB serial can be up to 4 times faster than USB HID if you send only a few bytes in each packet, but the speed is not guaranteed.
-> But this means that, if you have enough bandwidth, you can send 1 byte, or also a 24-byte packet, every 250ms. Is that right?

I ask because I always thought that USB RAW HID was the fastest communication protocol.

PaulStoffregen
08-15-2013, 04:18 PM
Yes, I meant that on large data, Windows is faster.


On the large write size test, from the computer to Teensy3, Windows was slightly faster.

But it's important to keep a sense of perspective. Windows measured 983 kbytes/sec. Mac measured 960. So Windows was only 2.4% faster, and only on this one case where unusually large block sizes were used.

Windows was dramatically slower for small write sizes.




For each byte, I suppose Windows waits until the byte/packet has been sent before sending another one. So it is normal that it is not as fast as Mac or Linux.


It's difficult to say with certainty exactly why each system performs the way it does. One thing is pretty obvious, from some additional work I've done watching the USB packets with a protocol analyzer. Macintosh is able to combine the small writes into 64 byte packets. Linux and Windows do not. Each write appears to be sent to the USB host controller as-is.
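The effect of combining small writes into 64-byte packets can be illustrated with a toy buffer. This is only a sketch of the general idea. It is not how the Macintosh driver is implemented, which this thread does not describe, and the class and helper names are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sender-side buffer: collects small writes and emits full packets.
class PacketCoalescer {
public:
    explicit PacketCoalescer(std::size_t max_packet) : max_packet_(max_packet) {
        assert(max_packet > 0);
    }

    // Queue data; each time max_packet bytes accumulate, one packet is emitted.
    void write(const std::uint8_t* data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            pending_.push_back(data[i]);
            if (pending_.size() == max_packet_) flush();
        }
    }

    // Emit whatever is pending as a short packet (for example on a timeout).
    void flush() {
        if (pending_.empty()) return;
        packet_sizes_.push_back(pending_.size());
        pending_.clear();
    }

    // Sizes of the packets emitted so far, in order.
    const std::vector<std::size_t>& packet_sizes() const { return packet_sizes_; }

private:
    std::size_t max_packet_;
    std::vector<std::uint8_t> pending_;
    std::vector<std::size_t> packet_sizes_;
};

// Packets used when n one-byte writes are coalesced and then flushed.
std::size_t coalesced_packet_count(std::size_t n_writes, std::size_t max_packet) {
    PacketCoalescer c(max_packet);
    const std::uint8_t byte = 0x55;
    for (std::size_t i = 0; i < n_writes; ++i) c.write(&byte, 1);
    c.flush();
    return c.packet_sizes().size();
}
```

With 130 one-byte writes, coalescing produces 3 packets (64, 64 and 2 bytes), whereas sending each write as-is would produce 130 packets. That difference in packet count is the kind of gap that would show up in the small-write benchmark results.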




But if I understood the concept:
- On Windows, USB serial can be up to 4 times faster than USB HID if you send only a few bytes in each packet, but the speed is not guaranteed.
-> But this means that, if you have enough bandwidth, you can send 1 byte, or also a 24-byte packet, every 250ms. Is that right?

I ask because I always thought that USB RAW HID was the fastest communication protocol.

I'm having a difficult time making sense of these statements and questions.

First of all, this benchmark only tested actual performance with USB serial. The complete source code is published, and the PC side includes a pre-compiled copy for Windows (since getting a working compiler on Windows is much harder than it is on Linux and Mac). You can run the benchmark on your own computer. Perhaps you ought to give it a try?

In theory, USB serial (using USB's "bulk" transfer type) ought to be able to use all of the available USB bandwidth (any bandwidth not consumed by other USB devices) to transfer data as rapidly as possible. That's how USB is supposed to work, in theory. But in practice, only Macintosh has USB serial drivers good enough to come close to achieving this over a wide range of conditions.

Headroom
08-15-2013, 05:15 PM
My 2 cents.
Given the abilities of modern computers and microcontrollers, it would not make sense for anything related to HID to be "the fastest". HID stands for Human Interface Device. Humans are not really fast by any means.

The reason for these guaranteed transfer rates in HID modes is likely that there is a guaranteed response time to human interactions. If you push the mouse, you want the cursor to move instantly (as perceived by a slow human ;-) ) and not experience lag until the 4K video has been loaded from the USB hard drive.

teensyfan1
10-26-2013, 11:50 AM
I published all this benchmark stuff on the website today.

http://www.pjrc.com/teensy/benchmark_usb_serial_receive.html

I was looking at the website because Teensy is mentioned often in the #arduino and #43oh (MSP430) chatrooms, and I noticed at the bottom of the page that you tried the benchmarks with the Launchpad LM4F120 (Stellaris) and it didn't work out well. In case you haven't been made aware, the MSP430 Launchpad now has a USB HID/MSC/CDC model (http://www.ti.com/ww/en/launchpad/msp430_head_usb.html) and Energia now seems to have support for some type of USB serial with MSP430s (https://github.com/energia/Energia/commit/92cc703ee3d4c441699e68bbf0ccfda2eb9ce60a), which might be a better one to compare with as it is more popular. Also, the Tiva C Launchpad (the rebranded and very slightly improved version of the Stellaris) has come out with USB On-The-Go, so maybe better USB support will be coming in future releases of Energia. Perhaps posting a ticket about the needed USB serial feature at https://github.com/energia/Energia/issues might help ensure it is on the checklist in case it was overlooked.

PaulStoffregen
10-26-2013, 03:10 PM
Looks like Energia is improving.

Has anyone tried to actually run the benchmark? Since their code is a library, the benchmark would need to be modified slightly to #include the header and create the USBSerial object, but other than that minor detail, the benchmark ought to be able to run.

I'm curious to hear how Energia/MSP430 performs. Right now, probably not curious enough to actually buy that new board and go to the trouble of running it myself. But maybe when I have more time.....

Edit: I did look at their source code just now, but only briefly. They've definitely got some work in there for performance, like DMA-based memcpy. They have interrupt based events, but it seems none are currently used for data movement. I didn't see any optimization on the Arduino API layer. Stream readBytes() just calls available() and read() for each byte. Achieving good performance on this benchmark requires careful design throughout the entire stack, from the hardware all the way up to the Arduino API. I hope someone will run the benchmark and report the results....
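The difference between a per-byte readBytes() and a block-copy readBytes() can be sketched with a mock receive buffer. RxPacket below is a hypothetical stand-in for one received USB packet. It is not the Teensy or Energia implementation, and both functions are simplified to read from a single packet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one received 64-byte USB packet.
struct RxPacket {
    std::uint8_t data[64];
    std::size_t len;    // number of valid bytes in data
    std::size_t index;  // next unread byte

    std::size_t available() const { return len - index; }
    int read() { return index < len ? data[index++] : -1; }
};

// Generic approach: one available() call and one read() call per byte.
std::size_t read_bytes_per_byte(RxPacket& p, std::uint8_t* out, std::size_t want) {
    std::size_t n = 0;
    while (n < want && p.available() > 0) {
        out[n++] = static_cast<std::uint8_t>(p.read());
    }
    return n;
}

// Block approach: one bounds check, then a single memcpy.
std::size_t read_bytes_block(RxPacket& p, std::uint8_t* out, std::size_t want) {
    const std::size_t n = p.available() < want ? p.available() : want;
    std::memcpy(out, p.data + p.index, n);
    p.index += n;
    return n;
}

// Self-check helper: both approaches must deliver identical data.
bool same_result(std::size_t len, std::size_t want) {
    assert(len <= 64 && want <= 64);
    RxPacket a{}, b{};
    for (std::size_t i = 0; i < len; ++i) {
        a.data[i] = b.data[i] = static_cast<std::uint8_t>(i);
    }
    a.len = b.len = len;
    std::uint8_t out_a[64] = {0}, out_b[64] = {0};
    const std::size_t na = read_bytes_per_byte(a, out_a, want);
    const std::size_t nb = read_bytes_block(b, out_b, want);
    return na == nb && std::memcmp(out_a, out_b, na) == 0 && a.index == b.index;
}
```

Both functions return the same bytes, but the per-byte version pays two function calls and a bounds check for every byte. On a real board those calls also touch USB driver state, which is presumably part of why an optimized readBytes() frees so much CPU time.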

manitou
12-24-2013, 11:43 AM
Here are some recent runs of your benchmark. Host is Ubuntu 12.04.



Average bytes per second:

Board        read()     readBytes()
teensy 3.0   685260     1150279
teensy 3.1   1150390    1150244
DUE          133103     123476
maple        214973     (not listed)


On 3.1, standard read() is just as fast as readBytes().

teensy 1.0.5/r18, DUE 1.5.4, maple v0.12

Charly86
06-17-2014, 05:26 PM
Hi Guys,
I've read this whole thread and tested serial with the source code provided. I'm on Teensy 3.1 with the code provided by the Teensyduino environment (not changed with the files provided by Paul at the beginning of the thread) and the exe file on the host. With that I get approx 1000 kbytes/sec, which is fine.
I'm not a specialist, but does this mean that if I need to transfer bulk data (say about 64 Kbytes) at the best possible speed, I must use USB Serial instead of USB HID?
At first I thought the best performance would come from USB HID, but it seems I was wrong, since USB HID performance is about 1000 * 64 bytes (per second), so 64 Kbytes/sec, far less than 900 kbytes/sec with USB serial.
Did I understand correctly?
Thanks for your help.

andersod2
06-17-2014, 05:56 PM
Charly86 - check out my question earlier in this thread and the subsequent posts. I think it will go a long way to answering your question.

Charly86
06-17-2014, 06:15 PM
andersod2
Thanks for your quick reply.
For sure I did this, but time passed, and I was dreaming someone had a short answer :p
I also took a look at the code of usb_api.cpp, just in case I wanted to allow more than a 64-byte buffer (but that seems a very bad idea ;-)

// write bytes from the FIFO
#if (RAWHID_TX_SIZE >= 64)
UEDATX = *buffer++;
#endif
#if (RAWHID_TX_SIZE >= 63)
UEDATX = *buffer++;
#endif
#if (RAWHID_TX_SIZE >= 62)
UEDATX = *buffer++;
#endif
#if (RAWHID_TX_SIZE >= 61)
UEDATX = *buffer++;
#endif
#if (RAWHID_TX_SIZE >= 60)
UEDATX = *buffer++;
#endif
#if (RAWHID_TX_SIZE >= 59)
UEDATX = *buffer++;
#endif

... oops, no way for me to have an 8 kbyte buffer, too many lines to add ;-)

So, as my Teensy will use USB only for transfers from the host, and will be the only device transferring, I'm going to use USB Serial. It's fast enough for me, with no need to check 64-byte chunks and reassemble them, so I will keep things simple (at least for this part).

I also wanted to thank Paul for his brilliant job on Teensy products.

PaulStoffregen
06-17-2014, 08:42 PM
At first I thought the best performance would come from USB HID, but it seems I was wrong, since USB HID performance is about 1000 * 64 bytes (per second), so 64 Kbytes/sec, far less than 900 kbytes/sec with USB serial.


That's correct. HID is limited in bandwidth.
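The arithmetic behind this comparison is short enough to write down. The sketch assumes one 64-byte interrupt packet per 1 ms frame for raw HID, as described earlier in the thread, and a sustained USB serial rate of about 1000000 bytes/sec taken from the benchmark results. The function names are illustrative only.

```cpp
#include <cassert>

// Upper bound for an interrupt endpoint that moves one packet every
// interval_frames frames, with 1000 frames per second.
// interval_frames must be at least 1.
constexpr long interrupt_bytes_per_second(int packet_bytes, int interval_frames) {
    return 1000L * packet_bytes / interval_frames;
}

// Seconds needed to move payload_bytes at a sustained rate.
double transfer_seconds(long payload_bytes, long bytes_per_second) {
    assert(payload_bytes >= 0 && bytes_per_second > 0);
    return static_cast<double>(payload_bytes) / bytes_per_second;
}
```

interrupt_bytes_per_second(64, 1) is 64000. Moving a 65536-byte block therefore takes about 1.02 seconds over raw HID, against roughly 0.07 seconds at 1000000 bytes/sec over USB serial.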



andersod2
I also took a look at the code of usb_api.cpp


That code is for Teensy 2.0. It's also the code that transmits from Teensy to the PC. The benchmarks mentioned above are in the other direction, from PC to the Teensy.



just in case I wanted to allow more than a 64-byte buffer (but that seems a very bad idea ;-)


Since the USB packets are limited to 64 bytes, larger buffers will have little benefit. But I wouldn't call bigger buffers a "very bad idea", just an idea without extra benefit beyond a pretty small size.


If you're using Teensy 3.1, the CPU is fast and the USB stack is very efficient. Even very simple 1-byte-at-a-time code works pretty well. If you use even small buffers with Serial.write() and Serial.readBytes(), you'll have lots of extra CPU time.

WaywardGeek
09-25-2014, 09:22 PM
I published all this benchmark stuff on the website today.

http://www.pjrc.com/teensy/benchmark_usb_serial_receive.html

I copied your benchmarks and hacked them (quite poorly, I'm afraid) to send instead of receive. I get 1214583 bytes/sec sent to my USB3 port, even better than the receive test!

Bill

PaulStoffregen
09-25-2014, 09:34 PM
Wow, I've never seen anyone get so very close to the theoretical limit, which is 1216 kbyte/sec.