115200 baud is only 11.52 kbytes/sec, which really is not very fast for Teensy 3.0. I'll try to answer these questions, but first I'd like to just put this speed into perspective. Worrying about these UART specs probably does little good. Your time would be much better spent on the code that actually does whatever you're going to do with the data.
With that in mind, Serial1 and Serial2 run from the CPU clock (which you select from Tools > CPU Speed), and Serial3 runs from the bus clock. With the CPU at 96 MHz, the bus clock is 48 MHz. Otherwise, they're the same. The UART clock only really affects which baud rates you can generate, not how much CPU overhead the UART causes. But unless you're trying to generate a very fast baud rate, even that really doesn't matter much. Even at 24 MHz, the fractional divider creates 115200 with only -0.08% error. These speeds are just so slow compared to the hardware's capability that it really isn't worth worrying about. This page has a table of the baud rate errors; nearly every entry is essentially zero:
http://www.pjrc.com/teensy/td_uart.html
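If you're curious where that -0.08% comes from: the Kinetis UART derives its baud rate from a 13-bit integer divider (SBR) plus a 5-bit fine-adjust field (BRFA, in steps of 1/32). Here's a quick sketch of the arithmetic for a 24 MHz module clock; it's just an illustration you can run on a PC, not driver code:

```cpp
// Back-of-envelope check of the -0.08% figure, using the Kinetis UART's
// divider formula: baud = module_clock / (16 * (SBR + BRFA/32)).
#include <cstdio>
#include <cmath>

int main() {
    const double clock = 24e6, target = 115200;
    double ideal = clock / (16 * target);             // 13.0208... (can't hit exactly)
    int sbr  = (int)ideal;                            // integer divider: 13
    int brfa = (int)std::lround((ideal - sbr) * 32);  // nearest 1/32 step: 1
    double actual = clock / (16 * (sbr + brfa / 32.0));
    printf("actual baud = %.1f, error = %+.2f%%\n",   // ~115107.9, ~-0.08%
           actual, (actual - target) / target * 100);
    return 0;
}
```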
The FIFO does reduce the number of interrupts. But again, let's keep this in perspective. The interrupt takes about 1.5 us. A single byte in 8N1 format at 115200 baud takes about 87 us. So even without the FIFO, the interrupts are only taking approximately 2% of the CPU time at this slow baud rate. Pulling the bytes from the buffer with Serial1.read() or Serial3.read() takes about 1 more microsecond per byte.
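That 2% estimate is easy to check with back-of-envelope arithmetic; 8N1 means 1 start + 8 data + 1 stop = 10 bits on the wire per byte:

```cpp
// Quick sanity check of the ~2% interrupt overhead estimate.
#include <cstdio>

int main() {
    double byte_us = 10.0 / 115200 * 1e6;  // ~86.8 us for one 8N1 byte at 115200 baud
    double isr_us  = 1.5;                  // ~1.5 us per interrupt, one per byte without the FIFO
    printf("overhead: %.1f%%\n", isr_us / byte_us * 100);  // prints ~1.7%
    return 0;
}
```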
All the Teensy3 UARTs support DMA, but there is currently no code written to use it. While you could try to write DMA-based drivers, for such slow baud rates it would be pointless. The total overhead is under 3% of the CPU time. With DMA, you might get that down to almost zero, but it's tremendously complex, for only a tiny gain.
For only 115200 baud, I'd just use Serial3 as-is. If you're concerned about efficiency, write code that calls Serial3.available() only once and then calls Serial3.read() many times, depending on how many bytes are available (or how many you're prepared to use). If you're just copying data from one place to another, it's hard to burn lots of CPU time, but if you're parsing it, your optimization efforts should definitely focus on crafting efficient parsing code.
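For example, something along these lines; handleByte() is just a placeholder for whatever your program does with each byte, not a real API:

```cpp
void handleByte(byte b) {
  // placeholder: copy, parse, or otherwise use the byte here
}

void setup() {
  Serial3.begin(115200);
}

void loop() {
  int n = Serial3.available();    // check once; several bytes may be waiting
  while (n-- > 0) {
    handleByte(Serial3.read());   // each read() costs roughly 1 more microsecond
  }
  // ... other work can happen here without blocking on serial ...
}
```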
A good way to test your code is to have it parse some dummy data from an array. You can call micros() before and after digesting a block of data, to get an idea of how many microseconds your algorithm needs per byte. If you discover you really do need 80-some microseconds for each byte, then aggressively pursuing driver-level optimizations might free up that extra few percent of CPU time to enable your algorithm to keep up with the data rate. But 80 us is a very long time for a 32 bit ARM processor. Unless you're doing something really, really complex, it's very likely your code will need much less time per byte. If that's the case, like 40 us or less, I'd just use Serial3, knowing your program will spend about half of the CPU time just waiting for new data to arrive.
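A minimal harness for that measurement might look like this; testdata, its size, and parseByte() are all stand-ins for your own dummy data and parsing algorithm:

```cpp
byte testdata[512];               // fill with representative dummy data

void parseByte(byte b) {
  // placeholder: your actual parsing code goes here (if it's trivial,
  // the compiler may optimize the loop away, so test the real thing)
}

void setup() {
  Serial.begin(9600);             // USB serial, for printing the result
  delay(2000);                    // give the serial monitor time to connect
  unsigned long t0 = micros();
  for (int i = 0; i < 512; i++) {
    parseByte(testdata[i]);
  }
  unsigned long t1 = micros();
  Serial.print("approx us per byte: ");
  Serial.println((t1 - t0) / 512.0);
}

void loop() {
}
```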
For libraries that are reused for many projects, I'll put an incredible amount of work into optimizations. But for a specific application that has a fixed maximum speed, the very best thing you can do is measure the actual CPU time your specific code needs. If it's already much faster than the data can ever arrive, which it probably will be, I'd put my time into other things... like having a beer!