Using Teensy Libraries without Arduino/Maximum USB performance

pramilo

Well-known member
Hi all,

I am starting work on a project where I will be using a Teensy LC to act as a USB to Serial converter with a twist: the Teensy LC will need to intercept and tweak communication for the specific application, so it won't be a simple get-send case but get-parse-send (which is why an FTDI won't do).
The other issue is that we need the lowest possible latency (where FTDI doesn't excel either).

Therefore, for this project to be successful, I need to get the absolute best performance out of the USB Serial implementation.

I've used Teensyduino for quite a while now, and on occasion, when I need to tweak a function, I dig into the Arduino libraries and the Processor Datasheet and tweak what I need. (for example, enabling 1 wire communication (i.e. merging RX and TX) on the UART using the LOOPS processor capability)

While doing so, it has become clear that most of the Arduino-style functions (such as Serial.read(), Serial.write(), and many others) are actually encapsulated in separate C files (such as serial.c), which are properly organized in functions.

This brings me to the following questions:

1) Wouldn't it be faster (and less bloated) if I just call the Teensy functions directly, bypassing the Arduino abstraction layer?

1a) In doing so, is there any toolchain or sample where I could start from (including Makefiles)?
1b) I realize that initializing the processor is not a trivial task. It is done by the Teensyduino library, but I'm unclear whether I can reuse barebones libraries for this; are there any pointers or advice?

2) There are some published USB benchmark tests at the end of this page https://www.pjrc.com/teensy/usb_serial.html
2a) How do these compare to the current implementation of the stack on the Teensy LC / 3.2?
2b) Were they achieved using a full Teensyduino/Arduino sketch or lighter-weight code as I mention in 1)?

3) Finally, any advice on reducing latency?
The biggest enemy we've had with FTDI is really the high latency; we're doing high speed USB-serial (as native Serial ports have practically vanished) and FTDI has several issues with unpredictable timing. Some seem to be on the OS side but others seem to be in the chip's own firmware (caching, etc).


To be clear: I'm not looking to dig into 100s of pages of the processor or ARM manual; my question is about using the existing C libraries and bypassing as many Arduino layers as possible.

Thank you very much
Pedro
 
USB has some latency which is built into the protocol. Then, it depends on the number of devices on the bus and if they are transferring Data.

You did not mention how much latency is allowed? So..can you tell us?
It is not even clear to me what is more important: latency or transfer speed? It would help to know more about your project, too.
 
There is no need to do any *.ino or *.cpp coding with Arduino/Teensyduino. You can program completely in plain C and still use Arduino/Teensyduino for compiling and downloading.
Simply keep the sketch.ino file empty and write your own main.c code. There is a makefile in the Teensyduino cores directory that you can use as starting point, then you do not need a fake ino file.

(I have done that for years)
 
Latency: ..you did not test if the latency is OK as it is... have you?( I guess you'd mention it if you tested it)
 
Well.... I'll try to answer the best I can bc the issue is not just plain latency. So more details:

We are running a master-slave setup at 1Mbps baud, where the master (in this case the PC, via Teensy LC), queries the slaves in any given sequence.

I can see, using the oscilloscope, that as soon as the slaves get the command, they respond very quickly (much less than 1ms). However, that response takes some time to get transferred to the computer by the FTDI chip (typically up to 5ms).
This is where we have a major bottleneck.
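
For a sense of scale, the time a reply spends on the UART line can be estimated with a few lines of arithmetic. This is only a rough sketch: it assumes 8N1 framing (10 bit times per byte), which is not stated in the thread, and `wireTimeMicros` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Time a run of bytes occupies the UART line, in microseconds.
// Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop = 10 bit times per byte.
double wireTimeMicros(std::size_t byteCount, double baud) {
    assert(baud > 0.0);
    const double bitsPerByte = 10.0;
    return static_cast<double>(byteCount) * bitsPerByte * 1e6 / baud;
}
```

Under that assumption, one byte at 1,000,000 baud takes about 10 us and a 64-byte reply about 640 us, which is well below the delay of up to 5ms seen on the PC side. That is consistent with the delay coming mostly from the USB side rather than from the serial bus.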

What we're aiming for is about a 20% improvement over FTDI and possibly more consistent timing.

I've read somewhere (can't find the link now) that FTDI caches the bytes and sends them to the PC in one go, as a USB transfer has some important overhead.
Therefore, transferring as soon as you get 1 byte would not make sense.

Furthermore, I presume that, in cases where the USB frame doesn't get filled, the FTDI times out and still sends the bytes to the PC.
There's also the famous "16ms latency timer setting", that you can tweak on the Windows driver down to 1ms; on Linux it's usually set to 1ms by default but you can't go any lower than this.

The interest in using Teensy is also about the possibility of tuning these parameters.
I've read (again, can't find the link) that you'd be able to tweak the Teensy USB code and adjust a parameter setting how much data you want to cache before sending it to the PC. (and maybe also the timeout)

From peeking around the Teensy USB libraries in the hardware\teensy\cores\... I can see there are some constants that may be relevant, but I'm not confident enough to tweak them:

Code:
in usb_serial.c: #define TRANSMIT_FLUSH_TIMEOUT	5   /* in milliseconds */

EDIT:


I have located the additional lines inside usb_desc.h. Can anyone shed some light on what the effect of tweaking them is?

Code:
#define CDC_ACM_SIZE          16
#define CDC_RX_SIZE           64
#define CDC_TX_SIZE           64


/EDIT

I'd also like to know how much data Teensy caches before transmitting it to the PC, and confirm whether I can tweak the default timeout for transmitting data to the PC when the cache hasn't been filled.

... or any other approach you might find more effective to increase throughput.

Since this is a very specific scenario, where the USB to serial conversion will be working in a setup that is well known, it is acceptable for us to trigger more USB transfers with smaller chunks of data rather than fewer transfers with larger amounts of data (which would be more efficient on the PC side, but terrible for us in terms of latency).

Thank you
 
FTDI? Teensy has native high speed USB hardware; there is no FTDI chip involved, if that was the understanding.

Arduino imposes no penalty in general on the Teensy implementation, especially USB Serial. It will run at the full 12 Mbit speed with only the USB spec limits/constraints on timing and packet size/delivery. It will typically deliver 1 MByte/sec net throughput and overwhelm most PCs doing text display or processing unless the code is very efficient. It will hold partial packets under 64 bytes for some short time, as sending one fuller packet is more efficient; there are ways to force the send, or just sending full 64 byte packets will stop that.

Here is what Arduino means to Teensy:
Code:
	// Arduino's main() function just calls setup() and loop()....
	setup();
	while (1) {
		loop();
		yield();
	}

The only compromise is that yield() call that looks to activate the Arduino defined serialEvent() processing code.

PJRC makes this a weak function, so putting this in sketch code will stop that: void yield(){}

Or just don't return from loop(). Any call to delay() also results in a call to yield(), though delayMicroseconds() does not.

If that adds anything - go ahead and give it a try and if latency or problems can be documented please provide a sample as Paul would like to fix it I'd suppose.

Paul has spent years on the hundreds - rather thousands of pages of the multiple Teensy microcontroller manuals and refined this in conjunction with extensive analysis and reference to USB spec and years of use and history.

For Teensy, Arduino just means a supported platform and way to use uniform libraries for multiple devices/bus types - often Teensy optimized but presented in a common familiar way.
 
FTDI? Teensy has native high speed USB hardware; there is no FTDI chip involved, if that was the understanding.

I understand that. That's why we're looking to use a Teensy as a USB to Serial device in our setup (to replace the FTDI).
Hence the questions about improving code efficiency.

It will hold partial packets under 64 bytes for some short time as sending one fuller packet is more efficient, there are ways to force the send - or just sending full 64 byte packets will stop that.

Is there a way/place where I can tweak this 64byte value for caching?

Also, what do you mean by "there are ways to force the send"? This would be very interesting to us. Is there a special function to be called?

Thanks
 
Thanks for pointing that out.

I've looked at the actual implementation of Serial.flush() - function usb_serial_flush_output(void) in file usb_serial.c and it seems to rely on usb_tx()

Glancing at usb_tx() - in the same file - seems to indicate the call to flush() won't necessarily guarantee the data is sent; it seems to depend on the state of the endpoint.

Anyway, I presume this would be as good as it gets as there's no Real Time guarantee when relying on USB.

It'd be good if there was some more documentation on the CDC implementation.

As far as I can tell, I may tweak the buffer sizes in
#define CDC_RX_SIZE 64
#define CDC_TX_SIZE 64

I will probably test with the defaults first and then bring them down to 16 (which is what LUFA seems to use).

Still not sure about the TRANSMIT_FLUSH_TIMEOUT though but will also give it a try.

Thank you all for the information.
It'd be great if someone could shed some definitive light on these constants though.

Best,
 
Have Fun. The USB spec and details are enormous, and Paul has poured years into refinement and operation to optimize what the spec allows. There may be some special case usage with a suitable host where this optimized interface is not optimal, but working to find that point would be the best first step. Make it work as it is, then look to tweak if you see opportunities for optimization.

There are other interface methods, and the T_3.6 has a second USB port that may be higher speed? FrankB asked some good questions that could be answered regarding needed throughput for data and timing. Getting a solution in hand and finding the actual issues to demonstrate would perhaps allow getting to the best solution within the confines of what USB can offer.

Not sure what is hosting this? Typically, as noted, the Teensy is capable of overwhelming the host, so that can be the weak spot.
 
Before answering, please understand the goals of maximum bandwidth and lowest latency are often in conflict with one another.

Now, to try to specifically answer your questions...


1) Wouldn't it be faster (and less bloated) if I just call the Teensy functions directly, bypassing the Arduino abstraction layer?

Yes, or maybe, if you do things very carefully. But relative to the USB speed, this will matter very little. Unless you have limitless time to spend on this project, you should prioritize your efforts. This ought to be at the very bottom of those priorities.


2) There are some published USB benchmark tests at the end of this page https://www.pjrc.com/teensy/usb_serial.html
2a) How do these compare to the current implementation of the stack on the Teensy LC / 3.2?

Near the top of your priority list should be measuring actual performance. Without real measurements, you're flying blind. I believe that's the essence of what Frank is saying. You need to make measurements. Put that at the top of your list, far more important than saving just a few CPU cycles in places that (most likely) will make no difference.


2b) Were they achieved using a full Teensyduino/Arduino sketch or a lighterweight code as I mention in 1)

The code is published on that page. You should run it and do the measurements yourself. When you do, you'll see it uses the normal Arduino functions.



3) Finally, any advice on reducing latency?

Yes. Use Serial.send_now(). But only do so when you're at the end of a distinct message, because it causes partial USB packets to transmit. Remember, the goal of low latency is often at odds with maximizing bandwidth.


We are running a master-slave setup at 1Mbps baud, where the master (in this case the PC, via Teensy LC), queries the slaves in any given sequence.

Query-and-wait-for-response style communication always performs poorly over USB. This has come up many times on this forum where others wanted to optimize performance, so I'm going to keep this short. Maybe search for those older conversations.

Again, you need to prioritize your effort. Avoiding a query-and-wait approach should be at the top of your list. This is where your work will really pay dividends. Don't squander your time fiddling with avoiding Arduino APIs if you're communicating this way. Put your effort where it matters!

Even Teensy LC is very fast compared to the slow pace of 12 Mbit/sec USB. Don't be shy about adding more code and sending even more bytes if it avoids a query-wait approach on the PC side. One common approach is to serial number your messages, so the PC can send several messages ahead before receiving every reply, and the Teensy side can be working on the next operation before the PC has completed processing of the prior one. There are unavoidable latencies with USB. More than any other factor, a smart way to structure your communication on both the PC side and Teensy side, so you can always work ahead and avoid waiting, will give you the best results.
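
As a rough illustration of the serial-numbering idea, here is a minimal sketch of the bookkeeping a sender could keep so that several requests stay in flight at once. The class name `RequestWindow`, the 8-bit sequence number, and the in-order reply rule are all assumptions made for this example; none of them come from the Teensy libraries or from any particular protocol.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical helper: tracks sequence-numbered requests that still await a reply,
// so the sender can work ahead instead of doing query-and-wait.
class RequestWindow {
public:
    explicit RequestWindow(std::size_t maxInFlight) : maxInFlight_(maxInFlight) {
        // Keep the window below 256 so 8-bit sequence numbers stay unambiguous.
        assert(maxInFlight > 0 && maxInFlight < 256);
    }

    // True if another request may be sent without waiting for a reply first.
    bool canSend() const { return pending_.size() < maxInFlight_; }

    // Records a new outgoing request and returns the sequence number to tag it with.
    std::uint8_t send() {
        assert(canSend());
        const std::uint8_t seq = nextSeq_++;  // wraps from 255 back to 0
        pending_.push_back(seq);
        return seq;
    }

    // Matches a reply against the oldest pending request.
    // Returns false if the reply is unexpected (assumes replies arrive in order).
    bool onReply(std::uint8_t seq) {
        if (pending_.empty() || pending_.front() != seq) return false;
        pending_.pop_front();
        return true;
    }

    std::size_t inFlight() const { return pending_.size(); }

private:
    std::size_t maxInFlight_;
    std::uint8_t nextSeq_ = 0;
    std::deque<std::uint8_t> pending_;
};
```

The sender would call `send()` while `canSend()` is true and call `onReply()` as confirmations come back, so it only blocks when the window is full. Whether replies are strictly in order depends on the protocol in use, so that rule would need adjusting for anything that can answer out of sequence.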


I can see, using the oscilloscope, that as soon as the slaves get the command, they respond very quickly (much less than 1ms). However, that response takes some time to get transferred to the computer by the FTDI chip (typically up to 5ms).
This is where we have a major bottleneck.

Serial.send_now() will completely eliminate the 5ms timeout.


I've read somewhere (can't find the link now) that FTDI caches the bytes and sends them to the PC in one go, as a USB transfer has some important overhead. Therefore, transferring as soon as you get 1 byte would not make sense.

Transmitting whatever data you have as a block is always more efficient. On the Teensy side, the USB code is pretty smart. It will efficiently combine writes together into USB packets.

On the PC side, Linux and especially Windows are quite dumb (but Macintosh turns out to be pretty good). That's what those benchmarks show. If you transmit 1 byte at a time on the PC side, the drivers on Linux and Windows will put 1 byte into each USB packet. Windows has other limitations too, which make things really slow in this mode.


The interest in using Teensy is also about the possibility of tuning these parameters.
I've read (again, can't find the link) that you'd be able to tweak the Teensy USB code and adjust a parameter setting how much data you want to cache before sending it to the PC. (and maybe also the timeout)

From peeking around the Teensy USB libraries in the hardware\teensy\cores\... I can see there are some constants that may be relevant, but I'm not confident enough to tweak them:

Code:
in usb_serial.c: #define TRANSMIT_FLUSH_TIMEOUT	5   /* in milliseconds */

EDIT:

Only mess with this if you can't use Serial.send_now().

If you know when you're at the end of a message, using Serial.send_now() at those times, and *NOT* using it any other times will give the best performance.


I have located the additional lines inside usb_desc.h. Can anyone shed some light on what the effect of tweaking them is?

Code:
#define CDC_ACM_SIZE          16
#define CDC_RX_SIZE           64
#define CDC_TX_SIZE           64

Don't mess with this stuff.

At the very beginning you said "I'm not looking to dig into 100s of pages of the processor or ARM manual".

This is a very deep rabbit hole, with pages of required reading measured in many thousands of pages, not merely hundreds.

Seriously, if you're reading this code, you're spending your effort in all the wrong places. Use the Arduino functions, especially Serial.write(buffer, size) and Serial.send_now(), and focus your programming effort on your communication technique. Put your time to productive use by avoiding waiting. Transmit a few messages ahead. If you require acknowledgement of prior messages, never stop working while you wait. Be smart and allow those confirmations to come later. That is the sort of work which will pay off with huge increases in performance.


I'd also like to know how much data Teensy caches before transmitting it to the PC, and confirm whether I can tweak the default timeout for transmitting data to the PC when the cache hasn't been filled.

... or any other approach you might find more effective to increase throughput.

If you're sending large messages, this is one place where tweaking settings might help. There's a define for the number of USB buffers in usb_desc.h. That controls how many USB packets Teensy can buffer. If you fill up all the buffers on the Teensy side with a large Serial.write(), then it will have to wait for more buffers to become available.

If you think this may be a problem, do something like using elapsedMicros to measure the time spent in Serial.write(). If it spends a long time, more than 10-20 us, do something like turn on a LED or pulse a pin you watch with an oscilloscope or logic analyzer.

Again, the key to success is focusing your effort in the places that matter. If Serial.write() is always returning quickly, then more buffers will not help.


I understand that. That's why we're looking to use a Teensy as a USB to Serial device in our setup (to replace the FTDI).
Hence the questions about improving code efficiency.
....
Is there a way/place where I can tweak this 64byte value for caching?

You really should avoid messing with the low level USB code. You'll only make things much worse. Very deep knowledge of USB is required to play there. Even deeper knowledge is needed to have any hope of doing something that actually makes any improvement.


Also, what do you mean by "there are ways to force the send"?

Serial.send_now()

But remember, certain latencies with USB are unavoidable, due to the design of the host controller chip in your PC (that's a *really* deep rabbit hole...) and other issues imposed by the drivers on each operating system are effectively outside your control.

Design your communication in a smart way which avoids query-and-wait-for-reply on the PC side. That, more than any other factor, will give you the best results.

Seriously, don't waste your time trying to avoid the Arduino APIs. They are terrible on many other boards, but on Teensy this stuff is quite efficient. Using it in a smart way will give you the best performance possible on 12 Mbit/sec USB.
 
Thank you all for the advice.

I will try to gather data in a presentable way and hopefully post back with some performance measurements using the Teensy LC, for future reference.
 
Hi all

Once again, thank you all for your advice.

I have built a benchmark suite and have actual values and findings to share.

The setup: as I explained earlier, this is a master-slave setup, where the PC queries the slaves and they reply.
There are two benchmarks: benchmark 1 addresses each slave individually and waits for the reply, while Benchmark 2, sends a global query to the bus and the slaves reply in order.

The benchmark code is written in C#, compiled in VS 2015, Windows 10 64bit, AMD Ryzen 2400.

The point is not each exact millisecond but to get an idea of how the devices compare to each other.

Reference benchmarks:
FT232R based device:
IMPORTANT: The Device was configured in Windows Device Manager (Tab Port Settings->Advanced->Latency Timer) to a Latency of 1ms (instead of the default 16ms). This has a real impact on performance.
Code:
#1 Individual READ commands:
==========================================

Statistics:                                                                           
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  7536 uSec
        Slowest Cycle time : 15727 uSec
        Average Cycle time : 9150,925 uSec
        ----------------------
        Fastest Individual Device: ID 14, reply time   186 uSec
        Slowest Individual Device: ID 18, reply time  1658 uSec

#2 Read multiple devices using one BULK_READ command:
==========================================

Statistics:                                                                              
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  2277 uSec
        Slowest Cycle time :  3790 uSec
        Average Cycle time : 3043,8275 uSec
        ----------------------
        Fastest Individual Device: ID 17, reply time     3 uSec
        Slowest Individual Device: ID 13, reply time  1910 uSec

Reference benchmark:
ATMEGA32u2 based Usb to Serial Device (CDC class) - Actual device: USB2AX (fw 0.5RC1) http://www.xevelabs.com/doku.php?id=product:usb2ax:usb2ax
Code:
#1 Individual READ commands:
==========================================

Statistics:                                                                           
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  7173 uSec
        Slowest Cycle time : 11597 uSec
        Standard Deviation : 8659,5053031351
        ----------------------
        Fastest Individual Device: ID 11, reply time   208 uSec
        Slowest Individual Device: ID 17, reply time  1081 uSec

#2 Read multiple devices using one BULK_READ command:
==========================================

Statistics:
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  3486 uSec
        Slowest Cycle time :  5802 uSec
        Average Cycle time : 4361,7075 uSec
        ----------------------
        Fastest Individual Device: ID 14, reply time     5 uSec
        Slowest Individual Device: ID 11, reply time  3821 uSec


Now for the Teensy results.

Used Teensy 3.2, on Arduino 1.0.6 and Teensyduino 1.25 (I realize this is an ancient toolchain, but it's the one we've been using and works for all our code):

Teensy Sketch:
Code:
/* NO OPTIMIZATIONS; simple pass from one side to the other */
void setup()
{
	Serial.begin(10000);
	Serial2.begin(1000000);
}


void loop()
{
	if (Serial.available()) {
		Serial2.write(Serial.read());
	}

	if (Serial2.available()) {
		Serial.write(Serial2.read());
	}

}

Results, compiling the sketch with unmodified libraries:
Code:
#1 Individual READ commands:
==========================================

Statistics:
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time : 39685 uSec
        Slowest Cycle time : 45540 uSec
        Average Cycle time : 41101,475 uSec
        ----------------------
        Fastest Individual Device: ID 11, reply time  4297 uSec
        Slowest Individual Device: ID 12, reply time  5868 uSec

#2 Read multiple devices using one BULK_READ command:
==========================================

Statistics:
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  6678 uSec
        Slowest Cycle time :  8981 uSec
        Average Cycle time : 7718,0725 uSec
        ----------------------
        Fastest Individual Device: ID 13, reply time     5 uSec
        Slowest Individual Device: ID 11, reply time  7284 uSec

With unmodified libraries, this can be up to 4.5 times slower than the Reference benchmark using an FT232R based device.

Next up, editing usb_serial.c and reducing TRANSMIT_FLUSH_TIMEOUT (in milliseconds) to 1 (default is 5) makes a significant difference in benchmark results, as shown below:

Code:
#1 Individual READ commands:
==========================================

Statistics:                                                                           
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  7653 uSec
        Slowest Cycle time : 13721 uSec
        Average Cycle time : 8940,105 uSec
        ----------------------
        Fastest Individual Device: ID 11, reply time   204 uSec
        Slowest Individual Device: ID 12, reply time  1636 uSec

#2 Read multiple devices using one BULK_READ command:
==========================================

Statistics:                                                                              
        Cycles ran         :   400
        Cycles failed      :     0
        Fastest Cycle time :  2738 uSec
        Slowest Cycle time :  4725 uSec
        Average Cycle time : 3290,9425 uSec
        ----------------------
        Fastest Individual Device: ID 17, reply time     4 uSec
        Slowest Individual Device: ID 11, reply time  2579 uSec


We're looking at up to 5x faster communication by tweaking this library setting, coming close to matching the performance of an FT232R on benchmark #2 and actually surpassing the performance of the FT232R on benchmark #1.
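
To put numbers on that comparison, the ratios between the average cycle times can be worked out directly. The snippet below is just arithmetic on the averages reported in the benchmark output in this post (decimal commas converted to points); `speedRatio` and the constant names are hypothetical labels chosen for the example.

```cpp
#include <cassert>

// Ratio of two average cycle times; a value above 1 means the first one took longer.
double speedRatio(double slowerMicros, double fasterMicros) {
    assert(fasterMicros > 0.0);
    return slowerMicros / fasterMicros;
}

// Average cycle times in microseconds, copied from the benchmark output.
constexpr double kTeensyDefaultIndividual = 41101.475;  // TRANSMIT_FLUSH_TIMEOUT 5
constexpr double kTeensyDefaultBulk       = 7718.0725;
constexpr double kTeensyTunedIndividual   = 8940.105;   // TRANSMIT_FLUSH_TIMEOUT 1
constexpr double kTeensyTunedBulk         = 3290.9425;
constexpr double kFt232rIndividual        = 9150.925;   // latency timer set to 1ms
constexpr double kFt232rBulk              = 3043.8275;
```

On averages, `speedRatio(kTeensyDefaultIndividual, kTeensyTunedIndividual)` comes out at about 4.6 and the bulk-read equivalent at about 2.3. Against the FT232R, the tuned Teensy is roughly 2% faster on benchmark #1 and roughly 8% slower on benchmark #2.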

The next steps are to test using newer Teensyduino libraries and Arduino versions (done, see the EDIT below), port this to a Teensy LC (which would be the intended device) and implement optimizations in the code.
As far as optimizations go, it doesn't seem relevant to buffer communication on the main sketch as the USB Serial code already has its own buffer.

EDIT:
I've repeated the benchmarks using Arduino 1.8.8 and Teensyduino 1.45 and the benchmark results are the same as using the older 1.0.6/TD1.25
In an earlier post, I had laid out the hypothesis of editing the CDC_ACM_SIZE/CDC_TX_../CDC_RX_.. constants in the library; I tried this but it didn't seem to have any relevant impact on performance.
Only the TRANSMIT_FLUSH_TIMEOUT seems to have an impact.

I hope you find the data useful.

I will post further data once I have the code running on the Teensy LC.

Best Regards
Pedro
 
Next up, editing usb_serial.c and reducing TRANSMIT_FLUSH_TIMEOUT (in milliseconds) to 1 (default is 5) makes a significant difference in benchmark results, as shown below:

Sounds like you didn't embrace the advice to use Serial.send_now(), did you?

When you call Serial.send_now(), you're effectively making that timeout zero, which is even better than edit the code to make it 1 ms.
 
Sounds like you didn't embrace the advice to use Serial.send_now(), did you?

When you call Serial.send_now(), you're effectively making that timeout zero, which is even better than edit the code to make it 1 ms.

I did try to use that with a timer. Code below:
Code:
#define USB_FLUSH_TIMEOUT_MICROS	500 // tested with 1000 as well

uint32_t last_usb_flush;
void setup()
{
	Serial.begin(10000);
	Serial2.begin(1000000);
}


void loop()
{
	// force flush every X microsecs
	if ((micros() - last_usb_flush) > USB_FLUSH_TIMEOUT_MICROS) {
		Serial.send_now();
		last_usb_flush = micros();
	}

	if (Serial.available()) {
		Serial2.write(Serial.read());
	}

	if (Serial2.available()) {
		Serial.write(Serial2.read());
	}
}

The issue is that calling send_now() on a regular interval will lose bytes on occasion (whether called every 1ms or 0.5ms/500micros).
If I tweak the TRANSMIT_FLUSH_TIMEOUT to 1 I do not lose bytes and communication works flawlessly.
FYI: the ATMEL based device I listed (USB2AX) seems to work using a low USB transfer timeout as well; their code is open source, and can be found by following the link I posted.


BTW when calling it every 500microsecs indeed it can be faster in certain scenarios as shown below, but I lose bytes.
Code:
#1 Individual READ commands:
==========================================

Statistics:
        Cycles ran         :   400
        Cycles failed      :    11 (check if the IDs are all connected and responding)
        Fastest Cycle time :  7626 uSec
        Slowest Cycle time : 12690 uSec
        Average Cycle time : 9587,42416452442 uSec
        Standard Deviation : 9572,9244387353
        ----------------------
        Fastest Individual Device: ID 11, reply time   178 uSec
        Slowest Individual Device: ID 13, reply time  1211 uSec

#2 Read multiple devices using one BULK_READ command:
==========================================

Statistics:
        Cycles ran         :   400
        Cycles failed      :     3 (check if the IDs are all connected and responding)
        Fastest Cycle time :  2140 uSec
        Slowest Cycle time :  3747 uSec
        Average Cycle time : 2837,94710327456 uSec
        ----------------------
        Fastest Individual Device: ID 16, reply time     5 uSec
        Slowest Individual Device: ID 11, reply time  1566 uSec

I'm not sure why I'm losing bytes but I realize that send_now() is not meant to be used like this so I won't dig any further.

One of the planned optimizations is to spy on the slave replies and once it sees it's finished (because we know the reply format), we call send_now(). I am confident this should give a performance boost.
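
A minimal sketch of that "spy on the reply" idea, kept free of Arduino calls so it can be tried on a PC first. It assumes Dynamixel Protocol 1.0 style framing (0xFF 0xFF ID LENGTH, followed by LENGTH more bytes ending in the checksum); that framing is my assumption about the bus, and the class name `FrameEndDetector` is hypothetical. The checksum is not verified, only the length is tracked.

```cpp
#include <cassert>
#include <cstdint>

// Watches a byte stream framed as: 0xFF 0xFF ID LENGTH <LENGTH bytes, last one is the checksum>.
// feed() returns true on the byte that completes a packet.
class FrameEndDetector {
public:
    bool feed(std::uint8_t b) {
        switch (state_) {
        case State::Header1:
            if (b == 0xFF) state_ = State::Header2;
            return false;
        case State::Header2:
            state_ = (b == 0xFF) ? State::Id : State::Header1;
            return false;
        case State::Id:
            if (b == 0xFF) return false;  // extra 0xFF: treat as header padding, keep waiting for the ID
            state_ = State::Length;
            return false;
        case State::Length:
            if (b < 2) {                  // LENGTH must cover at least the error byte and the checksum
                state_ = State::Header1;
                return false;
            }
            remaining_ = b;
            state_ = State::Body;
            return false;
        case State::Body:
            assert(remaining_ > 0);
            if (--remaining_ == 0) {
                state_ = State::Header1;  // packet complete, start looking for the next header
                return true;
            }
            return false;
        }
        return false;
    }

private:
    enum class State { Header1, Header2, Id, Length, Body };
    State state_ = State::Header1;
    std::uint8_t remaining_ = 0;
};
```

In the pass-through sketch, the intended use would be to feed each byte read from Serial2 into the detector right after forwarding it with Serial.write(), and call Serial.send_now() only when feed() returns true. For a bulk read, where several replies arrive back to back, it may be better to flush only after the last expected reply rather than after every packet.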

However, so far, I'm only running generic benchmark testing.


PS: I've repeated the benchmarks in my previous post, using Arduino 1.8.8 and Teensyduino 1.45 and the benchmark results are the same as using the older 1.0.6/TD1.25
 
I did try to use that with a timer.

*sigh* - Seem you've completely misunderstood how to use this feature, quite dramatically so.

Looks like you utterly disregarded my prior advice:

If you know when you're at the end of a message, using Serial.send_now() at those times, and *NOT* using it any other times will give the best performance.

I could try to explain this all over again, but really, do you think that would do any good here?
 
defragster
can this be shared: "The benchmark code is written in C#, compiled in VS 2015, Windows 10 64bit"

The source code for the Benchmark is attached.
FYI: It's written in C#. I tested compiling with VS2015; in theory it should also compile with Mono under Linux but I haven't gotten around to testing that yet.

The benchmark is performed by communicating with 8x Robotis Dynamixel MX-28T devices. These are the slaves.

The benchmark code implements the communication protocol and subclasses the Stopwatch timer for Microsecond resolution. (there are plenty of articles online about doing this).

I'm not sure if the code is useful without Dynamixel Slave devices (either MX-28/MX-64/MX-106)

FILE DOWNLOAD: View attachment SeedDynamixelBenchmarking.zip


*sigh* - Seem you've completely misunderstood how to use this feature, quite dramatically so.

Paul, if you read through my post, you'll see I acknowledge that using a timer with send_now() is not the correct use for it and even explain how we intend to use it. (that's why I dismissed the results using that technique and _didn't even include that in the first post with benchmarks_)

I really appreciate your earlier advice in narrowing down what I should look for and how to approach the task of best performance.

However, the fact of the matter is that these experiments that I shared, compare a much simpler approach: a simple direct benchmark of sending and receiving, using the reference FTDI chips vs Teensy acting as a USB to Serial.

I applaud your great work on making USB so accessible and versatile to newcomers, in the Teensyduino. I really do.

However, from a performance standpoint, if someone can't or doesn't want to optimize using send_now(), Teensy performs a lot worse than FTDI; I took the time to look under the hood and share the findings, something I believe has value to others.

I recognize my limited knowledge of USB and that I don't fully understand the implications of tweaking TRANSMIT_FLUSH_TIMEOUT. However, I've seen others use short flush timeouts in LUFA implementations (not exactly the same, I know, I know!!) and this gave me the confidence to try it out.

The results are positive and I think worth sharing, if anything else, as a token of appreciation for all your advice.

Thank you,
Pedro
 
As has been mentioned by several people already in this thread, there is often a conflict in performance between throughput and latency...

What I find somewhat interesting is that you talk about FTDI as the reference. On Windows, I believe the default latency setting for FTDI is 16 ms. Yes, the user can change this, if they know to go to Device Manager, find the COM port, open the Port Settings page, and then click the Advanced button...

It is also interesting that on different Arduino boards the timeouts are all over the place, as I have seen this week when trying out USB stuff on the T4 beta...
Teensy is at 5 ms, I believe the Robotis OpenCM9.04 board is 3 ms, and the Robotis OpenCR1 board is 1 ms...

What is right? Hard to say...

Note: I have played around with using Teensy 3.x (and soon 4) as a servo controller for Robotis Dynamixel servos, typically running the hardware serial port at 1 Mbps, but I will probably run at 2 Mbps or more...
The communication works using packets sent over USB to the board, which get sent out over the hardware serial port (half duplex). Any responses from the servos arriving on the hardware serial port are typically sent back over USB to the host program... Hopefully with as little latency as possible...

However, with this I also typically have the Teensy monitor the messages, and if a message is addressed to its own logical ID, it intercepts the message and generates the response...

Personally the idea of trying to write a new version of the USB code was not something I considered... Instead I would spend more of the time just organizing when things get called...
Note: I use Serial.flush(), instead of Serial.send_now() as they are the same code and flush() is now the Arduino standard...

Things I would do include: maybe avoid returning from loop() as that avoids all of the calls to yield() which checks all of the Serial ports for stuff...

Use the available() and availableForWrite() calls to know how much I can read/write without blocking...

Either understand by context when the end of message has been received and call Serial.flush().
Or what I have done also in past is do my own timeout... That is in my main loop code, where SerialX is my hardware Serial port:
Code:
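    // Fragment from loop(); SerialX is the hardware serial port, Serial is USB.
    // last_serial (uint32_t) is 0 when no flush is pending.
    if (SerialX.available()) {
        // Read it in and forward to USB, limited to what Serial can accept without blocking
        int n = min(SerialX.available(), Serial.availableForWrite());
        while (n-- > 0) Serial.write(SerialX.read());
        last_serial = micros();
    } else if (last_serial && (micros() - last_serial) > timeout_value) {
        Serial.flush();   // gap detected: send the partial packet to the host now
        last_serial = 0;
    }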
Here I would set timeout_value to roughly the time it takes to receive 2-3 characters, on the assumption that a gap that long marks the end of a message...
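As a rough sketch of how that timeout could be derived from the baud rate: assuming 8N1 framing (1 start + 8 data + 1 stop = 10 bits per character), the time for a few characters is easy to compute. The helper name idleTimeoutMicros and the round-up choice are illustrative assumptions, not part of Teensyduino.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Teensyduino API): microseconds needed to
// receive `chars` characters at `baud`, assuming 8N1 framing, i.e.
// 10 bits on the wire per character.
uint32_t idleTimeoutMicros(uint32_t baud, uint32_t chars) {
    assert(baud > 0);
    const uint64_t bits = static_cast<uint64_t>(chars) * 10u;
    // Round up so the timeout never undershoots the real character time.
    return static_cast<uint32_t>((bits * 1000000u + baud - 1) / baud);
}
```

With these assumptions, 3 characters at 1 Mbps come to 30 microseconds, and at 2 Mbps to 15 microseconds, so the value passed as timeout_value would be in that range.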

And again there are lots of other things to try out before believing that you need to rewrite the underlying USB system.

Good luck
 
Hi Kurt,

Thanks for all the advice.

Can you clarify 2 points for me:
"Or what I have done also in past is do my own timeout..."

What I've seen from further experiments is that if you call send_now() (or flush()) and the host happens to start transmitting at the same moment, bytes are occasionally missed. Have you had this happen to you?

On the other hand, if I tweak the TIMEOUT in the library and let the library handle it, I have never seen bytes lost.


The other question:
"which gets sent out over the hardware Serial port (Half duplex)"

How are you doing half duplex for the Robotis servos?
Are you using RS485 with a TX_ENABLE pin, or a tri-state buffer?

The T3.2 processor has a nice UART_LOOPS mode feature, which makes the whole UART run on 1 wire in half duplex. However, I'm dependent on the 3.3 V output signal level (passing through a 100 ohm resistor for protection on the data line), which is sometimes not enough on longer chains because Dynamixels run at 5 V (this is when operating with TTL versions of the servos, obviously).

(fyi, for other readers: the Dynamixels are supposed to be TTL 5V, so a CMOS at 3.3V has borderline compatible signal levels)

Thank you,
 
Hi Pramilo,

As for issues of losing data, the answer is typically no. That is, anything missed is typically because my program on one side or the other was not working properly...

The host side knows, when it sends out a message, whether it expects a response from the other side. If not, it is free to send another message; otherwise it waits for the response...

Now if the host sends a message which may generate a response, does not wait for that response, and sends out a new packet, the code on the Teensy switches the hardware UART to TX mode and you are likely to end up with corrupted data...

Obviously, on the Teensy you can also track what state you are in (waiting for data from the DXL bus), hold on to USB data coming in, and only output it when the coast is clear...

But again, it depends on how you wish your code to work.
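A minimal sketch of that "hold until the coast is clear" idea, in plain C++ so it can be tried off-target. The class name BusGate, its methods, and the response timeout are assumptions made for illustration; the timeout is there only so a missing servo cannot block the bus forever.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical gate (not from any library): tracks whether the half-duplex
// bus is still waiting for a servo response, so that buffered USB data is
// only written out when the bus is free.
class BusGate {
public:
    explicit BusGate(uint32_t responseTimeoutUs) : timeoutUs_(responseTimeoutUs) {
        assert(responseTimeoutUs > 0);
    }

    // Call after a host packet has been fully written to the bus.
    void onPacketSent(bool expectsResponse, uint32_t nowUs) {
        waiting_ = expectsResponse;
        sentAtUs_ = nowUs;
    }

    // Call once the complete servo response has been received.
    void onResponseComplete() { waiting_ = false; }

    // True when it is safe to switch the UART to TX and send the next packet.
    // Unsigned subtraction keeps this correct across micros() wraparound.
    bool clearToSend(uint32_t nowUs) {
        if (waiting_ && (nowUs - sentAtUs_) >= timeoutUs_) waiting_ = false;  // give up waiting
        return !waiting_;
    }

private:
    uint32_t timeoutUs_;
    uint32_t sentAtUs_ = 0;
    bool waiting_ = false;
};
```

On a Teensy, nowUs would come from micros(), and incoming USB bytes would stay in a local buffer while clearToSend() returns false.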

I have played around with DXL support on several different boards (note I did not write the code in all of these different cases)...

Things like: Trossen Robotics Arbotix boards, Xevel Labs USB2AX, and the Robotis OpenCM9.04 and OpenCR boards. Note that with some of these boards, if you are relaying communications between a host and servos, some of the coding may differ in order to reduce latency... For example, the Arbotix board needs something like an FTDI cable to connect it to the PC, so you have not only the USB latency but also the time it takes for the FTDI board to transmit the data to the AVR/ARM processor over a UART... So the question is when to start forwarding data received from the host to the DXL bus.
So for example, if you are using Protocol 2, each packet starts with: 0xff 0xff 0xfd 0x00 <ID> ...
So it is only after you have received the 5th byte that you can decide whether the packet is for you or should be forwarded. If the USB is built into the processor, as it is with Teensy, that is not a problem: you can look at all of these bytes and then choose to eat the packet locally. But if the data comes through an FTDI over a UART, and that UART runs at about the same speed as the DXL UART, you have just wasted 5 byte output cycles on the off chance that you might eat it. So in the FTDI-like case you might always forward, even though you are going to process it yourself...
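The decision after the 5th byte can be sketched as a small stateless check on the bytes received so far. The names Route and classifyHeader are illustrative assumptions, and the sketch deliberately ignores everything after the ID byte (length, instruction, CRC).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical routing decision for a packet whose first bytes are in buf.
enum class Route { Pending, Local, Forward };

// Checks the Protocol 2 header 0xFF 0xFF 0xFD 0x00 followed by the ID byte.
// Returns Pending until 5 bytes are available, Local if the ID matches myId,
// and Forward if the header does not match or the ID belongs to another device.
Route classifyHeader(const uint8_t* buf, size_t len, uint8_t myId) {
    assert(buf != nullptr || len == 0);
    static const uint8_t kHeader[4] = {0xFF, 0xFF, 0xFD, 0x00};
    for (size_t i = 0; i < len && i < 4; ++i) {
        if (buf[i] != kHeader[i]) return Route::Forward;  // not a Protocol 2 header
    }
    if (len < 5) return Route::Pending;  // ID byte not received yet
    return (buf[4] == myId) ? Route::Local : Route::Forward;
}
```

With built-in USB the five bytes usually arrive together in one USB packet, so the Pending state costs nothing; on a UART-fed board the same check would mean holding back up to five byte times before forwarding.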

As for how to hook up servos... I have gone several different ways...

a) Using the built-in LOOPS mode... Worked fine for me on T3.2... But at times I may have had issues where I did not know why things happened, like AX servos resetting their ID to #1... (Probably a power issue)...
b) I have also done the LOOPS approach with one bidirectional TTL level shifter...
c) I have used a level shifter on both RX and TX, closer to the spec that Robotis shows.

Note: There are several threads up here talking about driving DXL servos, so you can search to get lots more information, including from one user who was very certain that there is no problem driving them with 3.3 V...

Note: At one point I updated the Trossen Robotics library, bioloid, to work with Teensy boards and the like using both ways. It is up at: https://github.com/KurtE/BioloidSerial

I have also made a version of the Robotis library: Dynamixel SDK that worked with Teensy boards (again I believe both ways). Was trying to get Robotis to take in support for 3rd party boards, but so far they have not.
But I will be trying it out again soon with T4...

hope that helps
 
Hi all,

I have now built the code and have been benchmarking it:

--> I left all the constants and the whole Teensy library in its original form. I use send_now() when I wish to send immediately.
From experimenting, as Paul suggested, if you know when the data is ready to send to the PC, calling send_now() is the most efficient.

However, I am now facing an issue which I am having trouble overcoming:
The first transmission consistently shows extremely high latency which throws the control loops off. I believe a picture explains this better:

DSYNC_MEU_TXLEN8.PNG
(Each point in the chart is plotted as (discrete cycle time - average cycle time). Therefore it shows the deviation from the average time.)


- You can see the high latency on the first transmission. This behavior is consistent across several runs, disconnects, and reconnects.
- After the first transmission it appears to normalize, as you can see, with the smaller variations one would expect.
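For readers who want to reproduce the plotted quantity, the deviation of each cycle from the average can be computed as in this small sketch. The function name deviationsFromMean is an assumption for illustration and is not taken from the attached benchmark.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: returns (cycle time - average cycle time) for every
// sample, in the same unit as the input (e.g. microseconds).
std::vector<double> deviationsFromMean(const std::vector<double>& cycleTimesUs) {
    std::vector<double> out;
    if (cycleTimesUs.empty()) return out;
    const double mean =
        std::accumulate(cycleTimesUs.begin(), cycleTimesUs.end(), 0.0) /
        static_cast<double>(cycleTimesUs.size());
    out.reserve(cycleTimesUs.size());
    for (double t : cycleTimesUs) out.push_back(t - mean);
    assert(out.size() == cycleTimesUs.size());
    return out;
}
```

Note that a single large outlier, such as a slow first cycle, also raises the average, so all the remaining points shift slightly below zero.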

I went peeking around the USB code and, in usb_serial.c, found:
Code:
// Maximum number of transmit packets to queue so we don't starve other endpoints for memory
#define TX_PACKET_LIMIT 8

I played around with reducing this constant to 1 but it's unclear if this helps; the overall behavior is still quite visible, so I don't think this is the solution.
I can confirm I am using send_now() at the end of all transmissions, so all those data points are taken after teensy issues a send_now().


Any pointers?

Thank you
 