Teensy 2 - Hardware Serial: Serial1.write() not as fast as Arduino code base...

KurtE

Maybe not too important, but I was playing around with the Teensy 2 and some AX bus code that runs at 1 Mbps.
I am running Windows 10, 1.6.9...

Looking at the output on a logic analyzer, I see there are gaps of time between the bytes being output. Not sure if this shows it too well, but:
screenshot.jpg

So I took a look at: size_t HardwareSerial::write(uint8_t c)
in HardwareSerial.cpp and noticed that the speed hack added to Arduino in about 1.5.3 was not added here. With the hack, when you call this function and both the queue and the data register are empty, the byte is simply stuffed into the data register, without adding it to the queue and relying on the interrupt handler. I found that this helped a lot at fast baud rates.

Code in current Arduino code base:

Code:
size_t HardwareSerial::write(uint8_t c)
{
  _written = true;
  // If the buffer and the data register is empty, just write the byte
  // to the data register and be done. This shortcut helps
  // significantly improve the effective datarate at high (>
  // 500kbit/s) bitrates, where interrupt overhead becomes a slowdown.
  if (_tx_buffer_head == _tx_buffer_tail && bit_is_set(*_ucsra, UDRE0)) {
    *_udr = c;
    sbi(*_ucsra, TXC0);
    return 1;
  }
  tx_buffer_index_t i = (_tx_buffer_head + 1) % SERIAL_TX_BUFFER_SIZE;
...
 
Interesting speedup - same for one or multibyte write in T3? By the time it stuffs the byte(s) in the queue and returns - the first character may be gone rather than waiting for the transfer and the interrupt to pick up. I did that when I added a queue to Talkie - if not active - I started the sound, otherwise it got queued - faster sound start and queue is one bigger when idle, a bit easier there though.
 
I have not checked the T3 stuff, as with the larger queue it keeps up even at 2 Mbps...

The speed up was part of a fix I had for 1.5.1ish deadlock that was happening with the serial code.

My code was merged into another similar fix, which was merged into release somewhere around 1.5.3 or 1.5.4
 
Yep - T3 does not have any bypass of the queue either. May play some with the T2 to see how easy it is to speed it up.
 
Was playing around some yesterday with the T2 stuff. It looks like if I am going to make my experiment work, turning the T2 into something like the USB2AX (atmega32u2, with lufa on visual studio... ), it will take additional changes to the Hardware Serial code to make it more compatible with Arduino's code base.

In particular, the settings in UCSR1B. My code base uses the UART in half-duplex mode without additional hardware. It does this by turning the RX and TX parts of the UART on and off. This works OK with the Arduino core, as Serial1.write(...) only sets the UDRIE bit (the data-register-empty interrupt enable) and leaves the other bits alone.

Whereas in the Teensy core, Serial1.write(...) sets:
UCSR1B = (1<<RXEN1) | (1<<TXCIE1) | (1<<TXEN1) | (1<<RXCIE1) | (1<<UDRIE1);
So it re-enables RX along with the RX interrupt. It does similar stuff in a few other places as well.

Will take a cut at changing this usage to match and see if that helps.
 
I went ahead and made changes to my fork of cores to add the bypass to Serial1 in the Teensy 2 code. I also updated the code to change only specific bits of UCSR1B in the write and ISR functions, so as not to unintentionally enable the receiver, which my user code turns on and off externally to emulate half duplex. Again, this hopefully makes it more compatible with the Arduino code base.
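
A minimal host-side sketch of why the register handling matters, not the actual cores code. `fake_ucsr1b` is a hypothetical stand-in for the real register so this compiles on a PC, the function names are made up for illustration, and the bit positions assume the standard AVR USART layout:

```cpp
#include <cstdint>

// Bit positions in UCSR1B (standard AVR USART layout; assumed here).
constexpr uint8_t RXCIE1 = 7, TXCIE1 = 6, UDRIE1 = 5, RXEN1 = 4, TXEN1 = 3;

// Hypothetical stand-in for the hardware register.
static uint8_t fake_ucsr1b = 0;

// Whole-register assignment, as in the Teensy 2 line quoted above:
// every write() turns the receiver and its interrupt back on.
void write_overwrite() {
    fake_ucsr1b = (1 << RXEN1) | (1 << TXCIE1) | (1 << TXEN1) |
                  (1 << RXCIE1) | (1 << UDRIE1);
}

// Read-modify-write: only the data-register-empty interrupt bit is set,
// every other bit keeps whatever the user sketch configured.
void write_set_bit_only() {
    fake_ucsr1b |= (1 << UDRIE1);
}

// What a half-duplex user sketch might do before sending:
// transmitter on, receiver off.
void half_duplex_tx_mode() {
    fake_ucsr1b = (1 << TXEN1);
}

bool receiver_enabled() {
    return (fake_ucsr1b & (1 << RXEN1)) != 0;
}
```

Calling `half_duplex_tx_mode()` followed by `write_set_bit_only()` leaves the receiver off, while following it with `write_overwrite()` switches the receiver back on behind the sketch's back.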

Paul: I also issued a pull request in case you wish to take a look

These changes did help speed up the output to the AX bus at 1 Mbps, which you can see in the following screenshot:
screenshot2.jpg
In case anyone is interested, I am also trying to speed up the code that returns the data from the AX bus back to the host over USB, using some other IO pins to trace things. The green channel shows the loop where I receive a byte from Serial1 and then do a Serial.write(ch). I set the pin high before the write and low after, and I see that it is taking quite a bit of time, so maybe it is not buffering?

So I updated that loop to process all of the input from the AX bus, buffer it up and do one write at the end, which appears to have sped things up here:
screenshot.jpg
The green channel now shows when bytes are put into the queue, plus the last pulse when I do the buffered write. The bottom channel shows when I call Serial.flush() to hopefully make sure the data is sent to the host...
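
A minimal sketch of the buffered forwarding idea, runnable on a PC. `MockUsbSerial` is a hypothetical stand-in for the USB Serial object that just counts write() calls, and the 64-byte buffer size is an arbitrary assumption:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the USB Serial object; counts write() calls.
struct MockUsbSerial {
    std::vector<uint8_t> sent;
    int write_calls = 0;
    size_t write(uint8_t b) {
        sent.push_back(b);
        ++write_calls;
        return 1;
    }
    size_t write(const uint8_t *buf, size_t n) {
        sent.insert(sent.end(), buf, buf + n);
        ++write_calls;
        return n;
    }
};

// Forward each byte as it arrives: one write() per byte.
void forward_unbuffered(const std::vector<uint8_t> &rx, MockUsbSerial &usb) {
    for (uint8_t ch : rx) usb.write(ch);
}

// Collect the bytes first, then hand them over in as few writes as possible.
void forward_buffered(const std::vector<uint8_t> &rx, MockUsbSerial &usb) {
    uint8_t buf[64];
    size_t n = 0;
    for (uint8_t ch : rx) {
        buf[n++] = ch;
        if (n == sizeof(buf)) {  // flush when the local buffer is full
            usb.write(buf, n);
            n = 0;
        }
    }
    if (n > 0) usb.write(buf, n);  // final write for whatever is left
}
```

Both functions deliver the same bytes; the buffered one just makes far fewer calls, which is what mattered on the Teensy 2 USB side here.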

Will play some more, plus maybe try out code using different host to see how speed is again versus USB2AX, plus see how much of the changes here in my sketch impact my code for T3.2.

Kurt
 
It looks like Donziboy2 ran into this at 6Mbps last year on T_3:

I have found one issue so far, and that is a pause between transmitted bytes of roughly 2.4uS (longer than it takes to actually transmit the byte). Is there a way to reduce this dead time?
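
As a quick check of the "longer than the byte" remark, here is a small helper that computes the time one frame occupies on the wire. It assumes 8N1 framing (1 start + 8 data + 1 stop = 10 bits), which the thread does not state explicitly, and `frame_time_us` is a name made up for this sketch:

```cpp
// Time to shift out one frame, in microseconds.
// bits_per_frame defaults to 10 for 8N1 (assumed framing).
constexpr double frame_time_us(double baud, int bits_per_frame = 10) {
    return bits_per_frame * 1e6 / baud;
}
```

Under that assumption a byte takes about 1.67 µs at 6000000 baud and about 3.33 µs at 3000000 baud, so a 2.4 µs pause is indeed longer than a whole byte at the higher rate.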

edit....
Yay, it gets stranger. I went ahead and took it down to 3Mbps and now it's transmitting without pauses and it's taking less time than at 6Mbps... :(
Went from 63uS for 17 bytes @ 6Mbps to 49uS to transmit the same 17 bytes at 3Mbps.
 
The weird thing is I mentioned it to Paul and he was able to produce 6Mbps output with no dead time between bytes. I need to take another look and see if my code was doing something funny.

Although just looking at what he did gives me a clue: I separated each byte.
I converted my 16 bit values into bytes and then sent them individually in a long list (I should clean it up now that I'm almost ready to run the gocart lol). I'm also now wondering if it's limited by having to transfer the bytes around before transmitting them. Does serial output get buffered, or is it per byte/word?
Paul basically sent his data as a word and let the compiler do the optimizations.


Code:
 if(serialcounter >= 200) {   
  buffer[0] = (byte) ((executedtime >> 8) & 0xFF); 
  buffer[1] = (byte) (executedtime & 0xFF);
  buffer[2] = (byte) ((amps >> 8) & 0xFF); 
  buffer[3] = (byte) (amps & 0xFF);
  buffer[4] = (byte) ((busvaverage >> 8) & 0xFF); 
  buffer[5] = (byte) (busvaverage & 0xFF);
  buffer[6] = (byte) ((battvaverage >> 8) & 0xFF); 
  buffer[7] = (byte) (battvaverage & 0xFF);
  buffer[8] = (byte) ((motortaverage >> 8) & 0xFF); 
  buffer[9] = (byte) (motortaverage & 0xFF);
  buffer[10] = (byte) ((hstaverage >> 8) & 0xFF); 
  buffer[11] = (byte) (hstaverage & 0xFF);
  buffer[12] = (int) ((RPM >> 24) & 0xFF); 
  buffer[13] = (int) ((RPM >> 16) & 0xFF);
  buffer[14] = (int) ((RPM >> 8) & 0xFF); 
  buffer[15] = (int) (RPM & 0xFF);
  buffer[16] = drivestate;

  
//CRC,  32 bytes should take 3uS
  crc = CRC8.maxim(buffer, 17);
  buffer[17] = crc;
//  xmittime = 0;
  
//xmit, takes roughly 49uS to xmit 17 bytes @ 3Mhz
 Serial1.write(buffer[0]);
 Serial1.write(buffer[1]);
 Serial1.write(buffer[2]);
 Serial1.write(buffer[3]);
 Serial1.write(buffer[4]);
 Serial1.write(buffer[5]);
 Serial1.write(buffer[6]);
 Serial1.write(buffer[7]);
 Serial1.write(buffer[8]);
 Serial1.write(buffer[9]);
 Serial1.write(buffer[10]);
 Serial1.write(buffer[11]);
 Serial1.write(buffer[12]);
 Serial1.write(buffer[13]);
 Serial1.write(buffer[14]);
 Serial1.write(buffer[15]);
 Serial1.write(buffer[16]); 
 Serial1.write(buffer[17]);    //crc byte




I've often wondered about this, and specifically how the high res baud rate *really* works.

First, I did some experimenting, and yes, it seems you're right. 6000000 really is the fastest possible baud rate. Here's how it looks.

View attachment 5071

I was going to print "Hello World", but only 2 bytes fits really well on the scope screen.

Code:
void setup() {
  Serial1.begin(6000000);
}

void loop() {
  Serial1.print("Hi");
  delay(1);              // wait for 1 millisecond
}

Since I am tearing my test fixture down to populate my E-gocart, I have 2 Teensy 3.1's broken out for solderless breadboards that I could use to test some of this. (I'll add it to my long list of things to do lol.)
 
Based on what I read, I think the catch is keeping something in the buffers. If you leave enough time for the queue to empty, then the dead time shows up.

One byte followed by a pause probably allows that byte to transmit in full: it goes from the RAM queue to the hardware FIFO and completes before the next byte arrives. If the behavior from the Arduino speedup applies, then the next byte you send goes into the queue, and the next ISR feeds the FIFO, which then goes out. The transmit timing difference between 3 and 6 Mbps may explain that.

The code shown will queue two bytes that then get pushed as a group: once the ISR triggers, it pushes both to the FIFO and they transmit with no delay.

If your test shows it again, try converting all the bytes in advance and sending them as a group using "serial2_write(const void *buf, unsigned int count)". If that works, it might stop the inter-character time gap from causing dead time.
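
A minimal sketch of the "convert in advance, send as a group" idea. The helper names `put_u16_be` and `put_u32_be` are hypothetical, and the big-endian byte order simply mirrors the shifts in the posted code:

```cpp
#include <cstddef>
#include <cstdint>

// Append a 16-bit value to buf in big-endian order; returns the new position.
size_t put_u16_be(uint8_t *buf, size_t pos, uint16_t v) {
    buf[pos++] = static_cast<uint8_t>(v >> 8);
    buf[pos++] = static_cast<uint8_t>(v & 0xFF);
    return pos;
}

// Append a 32-bit value to buf in big-endian order; returns the new position.
size_t put_u32_be(uint8_t *buf, size_t pos, uint32_t v) {
    pos = put_u16_be(buf, pos, static_cast<uint16_t>(v >> 16));
    return put_u16_be(buf, pos, static_cast<uint16_t>(v & 0xFFFF));
}
```

Once the whole packet is packed this way, it can go out with a single buffer write such as Serial1.write(buffer, len) instead of one call per byte.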
 
I will have to get the CRC figured out for ints.
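
For what it's worth, a byte-wise CRC does not care whether the data started out as ints, as long as they are split into bytes first. Below is a minimal sketch of the Dallas/Maxim CRC-8 (polynomial 0x31, processed LSB first as 0x8C, initial value 0). It assumes that this is what `CRC8.maxim` computes; the library used in the post is not identified in this thread, and `crc8_maxim` is a name chosen for this sketch:

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise Dallas/Maxim CRC-8: reflected polynomial 0x8C, init 0x00, no final XOR.
uint8_t crc8_maxim(const uint8_t *data, size_t len) {
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x01) {
                crc = static_cast<uint8_t>((crc >> 1) ^ 0x8C);
            } else {
                crc = static_cast<uint8_t>(crc >> 1);
            }
        }
    }
    return crc;
}
```

A handy property of this variant is that running it over the message with its CRC byte appended gives 0, which makes the check on the receiving side a one-liner.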
 
This is very similar to the Teensy 2 stuff, with the caveat that with Serial1 and Serial2 we have a nice FIFO queue. I have not looked at all at Serial3, which only has the simple double buffer.

As mentioned doing a single write will remove a little overhead. However internal to it, it more or less simply does the same as a simple loop, doing an output of one byte at a time...

You also may find that you can speed up your whole thing by outputting each byte as you compute them. Sort of like:
Code:
 if(serialcounter >= 200) {   
  Serial1.write(buffer[0] = (byte) ((executedtime >> 8) & 0xFF)); 
  Serial1.write(buffer[1] = (byte) (executedtime & 0xFF));
...
  Serial1.write(buffer[16] = drivestate);

  
//CRC,  32 bytes should take 3uS
  Serial1.write(crc = CRC8.maxim(buffer, 17));
  buffer[17] = crc;
//  xmittime = 0;

That way the serial port is active during the time it takes to compute the CRC. But it may not help as much as hoped. Each one-byte write starts up output to the USART, whereas a single multi-byte write puts as much into the queue as it can before setting C2_TX_ACTIVE. So doing one-byte writes may again run into speed issues, as each write will cause the condition to be set that triggers the interrupt, and the ISR then has to move that byte on from the queue.
Again probably need to experiment.

For example, when I was talking to USB (Serial.write(...)) on the Teensy 2, doing individual Serial.write(myByte) calls was slow, and it sped up when I buffered the data and did a single Serial.write(buffer, cnt). It was as if it was doing a USB packet for each write. Whereas when I did the same buffering on the Teensy 3.2, the buffering slowed it down, probably because the Teensy 3 does the buffering for you...

Anyway, I may take another look soon at faster speeds with the T3.2, as I have been playing around with using TTL serial between the T3.2 and hosts (ODroid, and soon UP). So far I have only tried up to 2 Mbps, as the serial port on the other board I was using (Arbotix Pro) had a max baud rate of 2.25 Mbps.

I also wonder if you would get any benefit from playing around with the TX watermark, which is currently 2.
 