TeensyLC USB Stack Questions on Latency

Hello all,

Hoping someone can shed some light on these questions. I'm creating a USB device with a TeensyLC, and have successfully altered the usb_desc & usb_dev files as well as adding my own USB device-specific files. I can compile my device with Teensyduino. I'm currently trying to reduce the latency from when information is gathered from an IMU (ISM330DHCX) over I2C to when the USB packet is collected by the host/computer.

I've read through this thread: https://forum.pjrc.com/threads/58663-Teensy-USB-delay and checked out that person's GitHub repository https://github.com/maziac/fastestjoystick/blob/master/FastestJoystick.ino. From what I can understand, they tried to push data into the USB hardware buffer every time the number of packets buffered was 0. However, it looks like before the data is handed to the USB hardware, it is placed in a buffer in memory via the usb_malloc() function in the usb_mem files. That function is called if the number of packets queued at the endpoint is less than TX_PACKET_LIMIT (I think). They still encountered a delay of one USB poll interval, which leads me to believe that because the standard TX_PACKET_LIMIT is set to 3 (at least for joysticks), the data that needs to be sent over USB is written to that buffer first, before being handed over to the actual USB hardware in the chip.
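
For reference, my reading of the pattern in that sketch boils down to roughly this (a paraphrase, not their actual code; it assumes the joystick USB type, so JOYSTICK_ENDPOINT, usb_tx_packet_count() and Joystick.send_now() come from the Teensy core):
Code:
// Paraphrase of the approach in that repository: only queue a new report when nothing
// is already waiting at the endpoint, so a freshly built packet never sits behind
// older ones in the shared buffer pool.
#include "usb_dev.h"   // usb_tx_packet_count() on Teensy LC/3.x

void setup() {
    Joystick.useManualSend(true);   // suppress the automatic send on every change
}

void loop() {
    // ...read inputs here and set Joystick.X(), Joystick.button(), etc...
    if (usb_tx_packet_count(JOYSTICK_ENDPOINT) == 0) {
        Joystick.send_now();        // hand exactly one packet to the USB stack
    }
}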

If my device only has one endpoint for transmitting, would setting the TX_PACKET_LIMIT to 0 bypass the buffer created by usb_malloc(), and allow for lower latency (assuming I have my device set to only call my USB send function once every poll interval)? Would this be a bad idea? From the comments in this code, the TX_PACKET_LIMIT is there to prevent starving endpoints of memory.

related code:
Code:
usb_packet_t * usb_malloc(void)
{
	unsigned int n, avail;
	uint8_t *p;

	__disable_irq();
	avail = usb_buffer_available;
	n = __builtin_clz(avail); // clz = count leading zeros
	if (n >= NUM_USB_BUFFERS) {
		__enable_irq();
		return NULL;
	}
	//serial_print("malloc:");
	//serial_phex(n);
	//serial_print("\n");

	usb_buffer_available = avail & ~(0x80000000 >> n);
	__enable_irq();
	p = usb_buffer_memory + (n * sizeof(usb_packet_t));
	//serial_print("malloc:");
	//serial_phex32((int)p);
	//serial_print("\n");
	*(uint32_t *)p = 0;
	*(uint32_t *)(p + 4) = 0;
	return (usb_packet_t *)p;
}

Code:
int usb_imu_ao_send(void)
{
    uint32_t wait_count=0;
    usb_packet_t *tx_packet;

    while (1) {
        if (!usb_configuration) {
            return -1;
        }
        if (usb_tx_packet_count(IMU_AO_ENDPOINT) < TX_PACKET_LIMIT) {
            tx_packet = usb_malloc(); 
            if (tx_packet) break; 
        }        
        if (++wait_count > TX_TIMEOUT || transmit_previous_timeout) {
            transmit_previous_timeout = 1;
            return -1;
        }
        yield();
    }

    transmit_previous_timeout = 0;
    memcpy(tx_packet->buf, usb_imu_ao_data, IMU_AO_SIZE);
    tx_packet->len = IMU_AO_SIZE;
    usb_tx(IMU_AO_ENDPOINT, tx_packet);

    return 0;
}
 
As I2C is orders of magnitude slower than USB, attempting to reduce latency by optimizing the USB side is pretty questionable.

Then, if it made sense to speed up IMUs, don't you think the manufacturers would use a faster interface than I2C?

Physics isn't that fast. I built a balancing robot 10 years ago - it worked best around 20Hz... a slow AVR chip was more than enough.
 
However, it looks like before the data is handed to the USB hardware, it is placed in a buffer in memory via the usb_malloc() function in the usb_mem files. That function is called if the number of packets queued at the endpoint is less than TX_PACKET_LIMIT (I think). They still encountered a delay of one USB poll interval, which leads me to believe that because the standard TX_PACKET_LIMIT is set to 3 (at least for joysticks), the data that needs to be sent over USB is written to that buffer first, before being handed over to the actual USB hardware in the chip.

Yes, this is fundamentally how USB device mode works. Well, the way buffers are allocated can of course vary. But the concept that packets must be put into a buffer and get actually transmitted later is a fundamental aspect of all USB devices. This isn't something unique to Teensy. All USB works this way.

The reason USB became so widely used is because it has this design where a single USB host manages all the communication, which allows for inexpensive hardware in the devices. USB devices are built with much simpler hardware which only responds to tokens the USB host transmits. So when Teensy runs in USB device mode, like all USB devices, it can not transmit a packet to the USB host whenever it wants. Packets can only be transmitted by devices when the host sends an IN token, giving that particular device permission to transmit at that moment.

The host controller chip inside your PC implements 2 schedules for when it allows the many connected USB devices to communicate. USB manages bandwidth in frames, either 1ms or 125us. First a start-of-frame token is transmitted. Then the periodic schedule is checked, and every endpoint in every device which is supposed to get guaranteed latency (all HID devices) is sent an IN or OUT token to give it its opportunity to communicate. Then after all those are completed, the asynchronous schedule is used to allow all the endpoints without guaranteed latency to share whatever bandwidth remains in the frame.
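
For a concrete picture, the polling interval the host uses for a full-speed interrupt endpoint comes from the bInterval byte at the end of its endpoint descriptor, which in the Teensy core is part of the config descriptor in usb_desc.c. Roughly this (illustrative layout, not copied from any particular USB type):
Code:
// Interrupt IN endpoint descriptor (7 bytes). At 12 Mbit/sec full speed, bInterval is
// measured in 1 ms frames, so 1 is the fastest service the periodic schedule can give.
        7,                          // bLength
        5,                          // bDescriptorType = ENDPOINT
        JOYSTICK_ENDPOINT | 0x80,   // bEndpointAddress (IN direction)
        0x03,                       // bmAttributes = interrupt transfer type
        JOYSTICK_SIZE, 0,           // wMaxPacketSize (low byte, high byte)
        JOYSTICK_INTERVAL,          // bInterval = polling interval in ms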


If my device only has one endpoint for transmitting, would setting the TX_PACKET_LIMIT to 0 bypass the buffer created by usb_malloc(),

No. The code doesn't recognize 0 as meaning anything special. It wouldn't be able to work at all with that setting!


and allow for lower latency

Hypothetically, if there were no limit or if you raise the limit to a higher number, you can expect worse latency, not better! More packets waiting to transmit means more latency.


From comments in this code, the TX_PACKET_LIMIT is there to prevent starving endpoints of memory.

Yes. On Teensy 3.x & LC, a pool of buffers is shared by all endpoints.


When running as a joystick or any other HID device, you get a guaranteed opportunity to communicate according to the polling interval. But you can't get any more. If the polling interval is 1ms (the shortest possible at 12 Mbit/sec speed) then you simply can not transmit a message more than once every millisecond.

You really should design your program so it never generates messages faster than this rate. On Teensy, the key to achieving this is explained on the joystick page under "Precision Timing".

https://www.pjrc.com/teensy/td_joystick.html

By default a new message is sent for every change you make. That's very simple, but also very inefficient. When you use manual send mode, no USB packet is buffered until you call Joystick.send_now();

There are many ways you might structure your program. But whatever you do, make sure you call Joystick.send_now() no more than 1000 times per second. You simply can not ever get lower latency than 1ms, because HID uses "interrupt" type endpoints which are serviced by the USB host controller's periodic schedule. 1ms is the minimum possible polling interval.
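
A minimal sketch of that pattern, assuming the joystick USB type (the elapsedMicros pacing is just one way to do it; any scheme that limits send_now() to once per millisecond works):
Code:
// Build the report continuously, but queue at most one packet per 1 ms USB frame.
elapsedMicros sinceLastPacket;

void setup() {
    Joystick.useManualSend(true);    // nothing is queued until send_now() is called
}

void loop() {
    // ...read inputs and set Joystick.X(), Joystick.button(), etc. as fast as you like...
    if (sinceLastPacket >= 1000) {   // ~1000 Hz, matching the 1 ms polling interval
        sinceLastPacket -= 1000;
        Joystick.send_now();         // queue exactly one packet for the next IN token
    }
}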

The only way to get lower latency is to use 480 Mbit/sec speed and set the polling interval to 125us. This is only possible on Teensy 4.0 & 4.1 which offer 480 Mbit speed. The older models have only 12 Mbit USB, so you just can't get shorter than 1ms polling interval on those Teensy models.
 
Yes, this is fundamentally how USB device mode works. Well, the way buffers are allocated can of course vary. But the concept that packets must be put into a buffer and get actually transmitted later is a fundamental aspect of all USB devices. This isn't something unique to Teensy. All USB works this way.

The reason USB became so widely used is because it has this design where a single USB host manages all the communication, which allows for inexpensive hardware in the devices. USB devices are built with much simpler hardware which only responds to tokens the USB host transmits. So when Teensy runs in USB device mode, like all USB devices, it can not transmit a packet to the USB host whenever it wants. Packets can only be transmitted by devices when the host sends an IN token, giving that particular device permission to transmit at that moment.

The host controller chip inside your PC implements 2 schedules for when it allows the many connected USB devices to communicate. USB manages bandwidth in frames, either 1ms or 125us. First a start-of-frame token is transmitted. Then the periodic schedule is checked, and every endpoint in every device which is supposed to get guaranteed latency (all HID devices) is sent an IN or OUT token to give it its opportunity to communicate. Then after all those are completed, the asynchronous schedule is used to allow all the endpoints without guaranteed latency to share whatever bandwidth remains in the frame.


...

Hypothetically, if there were no limit or if you raise the limit to a higher number, you can expect worse latency, not better! More packets waiting to transmit means more latency.

...

There are many ways you might structure your program. But whatever you do, make sure you call Joystick.send_now() no more than 1000 times per second. You simply can not ever get lower latency than 1ms, because HID uses "interrupt" type endpoints which are serviced by the USB host controller's periodic schedule. 1ms is the minimum possible polling interval.

The only way to get lower latency is to use 480 Mbit/sec speed and set the polling interval to 125us. This is only possible on Teensy 4.0 & 4.1 which offer 480 Mbit speed. The older models have only 12 Mbit USB, so you just can't get shorter than 1ms polling interval on those Teensy models.

I think I may be using the wrong terminology - maybe. I currently have the polling interval set to 1ms and understand that the host controls that. I'm trying to have my program's structure match the poll rate, with a function almost identical to Joystick.send_now() that should be called only 1000 times per second.

You really should design your program so it never generates messages faster than this rate. On Teensy, the key to achieving this is explained on the joystick page under "Precision Timing".

I am using coroutines to separate this from a different loop which gathers the IMU data at a different interval using this library and calls a function similar to Joystick.x() and Joystick.buttons(), which puts the data in an array to later be put in the USB buffer when the usb_imu_ao_send function is called (which is posted above and should be almost identical to the Joystick.send_now() function). So theoretically, I should be able to scan the IMU at a different rate than the USB polling interval, and only generate the messages at the USB interval rate.
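
Roughly the structure I have in mind (a simplified sketch, not my actual code - the declarations are placeholders and the sample rates are just examples):
Code:
// One timer paces the IMU reads into a shared array, another hands a packet to the
// USB stack at ~1 kHz. usb_imu_ao_data and usb_imu_ao_send() are from my custom USB
// device files (send function posted above); these declarations are placeholders.
extern uint8_t usb_imu_ao_data[];
extern "C" int usb_imu_ao_send(void);

elapsedMicros sinceImuRead;
elapsedMicros sinceUsbSend;

void loop() {
    if (sinceImuRead >= 500) {        // example: sample the IMU at ~2 kHz
        sinceImuRead -= 500;
        // ...read the ISM330DHCX over I2C and copy the values into usb_imu_ao_data...
    }
    if (sinceUsbSend >= 1000) {       // queue at most one packet per 1 ms USB frame
        sinceUsbSend -= 1000;
        usb_imu_ao_send();            // same send function as posted above
    }
}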

Could it help to set the device as an "interrupt" type, as you mention on the page about Serial communication under Transmit Buffering? "If other devices are using a lot of USB bandwidth, priority is given to "interrupt" (keyboard, mouse, etc) and "isochronous" (video, audio, etc) type transfers." That same page also mentions a 3ms timeout with partially filled buffers - I am not sending 64 bytes at a time, so the timeout would apply. Can that timeout be lowered or something?
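
(For comparison, the Teensy USB Serial class has Serial.send_now(), which pushes a partially filled buffer out immediately instead of waiting for that timeout - something like the sketch below - though I'm not sure an equivalent exists for a custom endpoint like mine.)
Code:
// USB Serial only: flush a partial packet right away rather than waiting for the
// transmit timeout described under "Transmit Buffering".
void sendSmallMessageNow(const uint8_t *msg, size_t len) {
    Serial.write(msg, len);   // queues fewer than 64 bytes into the transmit buffer
    Serial.send_now();        // transmit the partially filled packet immediately
}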

Apologies if these are silly questions. (I'm starting to think this may be a similar problem to one that companies like Microsoft and Sony would have solved for their game controllers - coroutines or threads to separate scanning for input changes from sending the packets to a console/computer.)
 
Anything different from the normal USB mouse protocol requires writing a custom USB "Mouse" driver on the PC side.

Assuming a screen refresh rate of 60Hz (60 times per second), the maximum useful update interval would be 16.67 milliseconds - if your eyes were fast enough to see a difference between frames. They are not.
This also means that you must be Lucky Luke, who shoots faster than his shadow.

I admit that I therefore do not understand the meaning behind the optimization.
 
Anything different from the normal USB mouse protocol requires writing a custom USB "Mouse" driver on the PC side.

Not an issue, as this will be interfacing with either SDL2 or JoyShockMapper, which have already done the majority of the work in supporting inputs from a variety of game controllers. (My project is closer to a game controller than a mouse - I want to interface with the linked software to mix the inputs from the gyroscope with the inputs from an Xbox controller, to get this result: gyro aiming. Since Xbox controllers do not have a built-in IMU like a Sony DualShock 4 does, my only option is to mix the two different devices in software into one output.) My concern is on the hardware side.

Assuming a screen refresh rate of 60Hz (60 times per second), the maximum useful update interval would be 16.67 milliseconds - if your eyes were fast enough to see a difference between frames. They are not.
This also means that you must be Lucky Luke, who shoots faster than his shadow.

That isn't exactly true. There are more factors at play than just screen refresh rate. Many people opt for monitors with refresh rates of at least 144Hz, and 240Hz to 360Hz is becoming more common - in addition to having gaming rigs with CPUs/GPUs capable of pushing out images at 1440p at 144 frames per second or more, depending on the game. The response rate of the server has an effect as well. Here is a decent write-up on it.

Additionally, here is a video from Linus Tech Tips on 60Hz versus 240Hz monitors for gaming (if you don't want to watch the whole thing, skip to around 17 minutes for the conclusion). And here is another video from Linus Tech Tips featuring professional esports players, demonstrating that higher frames per second as well as a higher refresh rate do make a difference. Here is an article from Nvidia on the effects of frames per second and refresh rate (though to be fair, it seems more marketing-oriented, but the point still stands).

On the input/hardware side, mice and keyboards generally report at 1000Hz (or more) - and there is a noticeable difference in input lag between peripherals with different report rates. The Sony DualShock 4 reports at 250Hz versus an Xbox One controller at 125Hz, and a Nintendo Switch JoyCon/Pro Controller reports at 66.7Hz. It is widely observed that the Switch controllers feel "choppier" when used with games on a PC with high FPS and a high refresh rate monitor. Though the difference between 250Hz and 125Hz does not seem like much, a lot of people notice the difference in input lag between the two.

I admit that I therefore do not understand the meaning behind the optimization.

If you do not play competitive PC or console games, you probably would not understand.

The optimization is to ensure the least amount of input lag between my peripheral and the PC. You wouldn't want to move a mouse and then have your actions reflected on screen 2 seconds later - that would be functionally horrible. Or, for an example related to RC planes - you don't want noticeable lag between changing the throttle or adjusting yaw/pitch/roll during manual control (environmental factors notwithstanding).

I'm not saying 1-2ms of delay is bad, but I would like to push the limits if that is functionally possible.

(Allegedly this project was able to get around 0.74ms input lag with a 1ms USB poll rate, using the Arduino PluggableUSB library and NicoHood's HID library - so I am not sure what the difference is between the Teensy USB stack and the Arduino PluggableUSB stack that could cause an extra amount of delay.)
 
The factor between 60Hz and 120Hz is two. That's still more than 8 milliseconds - and that's just the screen. No eyes, no brain, no muscles, no nerves included; with those, the total time is a multiple of that.
But OK, if you think it helps, it will help.
(It reminds me of audiophiles who can clearly hear a difference between a $5 and $100 cable, and can't be convinced by anything, even if the cables just have a different color)
 
The factor between 60Hz and 120Hz is two. That's still more than 8 milliseconds - and that's just the screen. No eyes, no brain, no muscles, no nerves included; with those, the total time is a multiple of that.
But OK, if you think it helps, it will help.
(It reminds me of audiophiles who can clearly hear a difference between a $5 and $100 cable, and can't be convinced by anything, even if the cables just have a different color)

As I mentioned above, I understand there is more at play - monitor refresh rate + GPU render time + CPU + the human response rate, which encapsulates the time taken to register a change and respond to it. I don't think the comparison to people assuming a more expensive cable makes things sound better is quite fair - picking out variances in tone is different from being able to react to a single stimulus, like starting a race with a cap gun. I don't just think it helps; there is objective evidence that, controlling for other factors, a higher refresh rate coupled with higher FPS output to the screen makes a non-trivial difference.


The questions concerning whether a person can perceive changes like that aren't relevant currently, and were not the focus of my initial question. The part I am focused on is the actual delay from the input hardware, and what might be done to reduce possible delay on the hardware side.

My initial questions weren't sufficiently answered - is there something in the USB stack for the Teensy LC/3.x that could cause an additional delay equal to the poll interval?

If not, then that would indicate the code used by this project was flawed in using this check ( usb_tx_packet_count(JOYSTICK_ENDPOINT) == 0 ) as a way to make sure that the data being sent over USB would not encounter any additional delay before being put in the USB buffer to be collected at the USB poll interval. To make it extremely simple, I could just ensure that the loop calling my equivalent of the Joystick.send_now() function operates at 1000Hz. I am also planning on comparing against the timestamp generated by my IMU, which has a 25 microsecond resolution.
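
To sanity check the pacing, I'm planning something like this (a rough sketch, assuming a Serial interface is still available alongside my custom USB type for printing):
Code:
// Count successful sends per second so I can confirm the loop really runs at ~1000 Hz
// before blaming the USB stack for any extra delay.
extern "C" int usb_imu_ao_send(void);   // placeholder declaration; defined in my USB files

elapsedMicros sincePacket;
elapsedMillis sinceReport;
uint32_t sendsThisSecond = 0;

void setup() {
    Serial.begin(9600);
}

void loop() {
    if (sincePacket >= 1000) {              // attempt one send per 1 ms USB frame
        sincePacket -= 1000;
        if (usb_imu_ao_send() == 0) {       // returns 0 on success in the code posted above
            sendsThisSecond++;
        }
    }
    if (sinceReport >= 1000) {              // once per second, report the achieved rate
        sinceReport -= 1000;
        Serial.println(sendsThisSecond);    // should hover around 1000 if pacing holds
        sendsThisSecond = 0;
    }
}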

I have also noticed that the USB code used for the Teensy 4.0/4.1 differs from the LC/3.x (which makes sense - different hardware). I don't need the microsecond frame times it offers, but if it does not exhibit the same delay observed when using a 1ms poll interval, I may have to look into using that instead.
 