Fast data exchange between four microcontrollers

mkoch

Let's assume I have a system of four Teensy 4.0 or 4.1. They are close together on the same board, and some communication is required between them:
Teensy 1 sends (the same) 8 bytes to Teensy 2, 3 and 4.
Teensy 2 sends (the same) 8 bytes to Teensy 1, 3 and 4.
Teensy 3 sends (the same) 8 bytes to Teensy 1, 2 and 4.
Teensy 4 sends (the same) 8 bytes to Teensy 1, 2 and 3.
The whole data exchange must run as fast as possible. Almost all pins are available, except one I2C bus. How would you do this?

Thanks,
Michael
 
Often low latency is at odds with high bandwidth. Since the message size is only 8 bytes, I'm guessing this may be an application where "fast" means lowest latency? But really just guesswork since we know nothing about this intended application.

Without more info, because the message size is so small, I'd probably recommend hardware serial with a high baud rate like 1 to 6 Mbit/sec. You'd need something like logic gates to route or combine the data. But if 2 or more might transmit at the same moment (more guesswork... another reason why giving context about the intended application is so important for getting actually useful tech help) then perhaps Teensy 4.1 ethernet with an ethernet switch might be better?

In terms of raw bandwidth, USB is the fastest. But it's not great on latency and it's not a peer-to-peer protocol. Again hard to say if that matters with basically no info about the application, but from this minimal info a single-host to many-device protocol like USB probably doesn't fit.
 
Here is some more info. It's a digital coprocessor for an analog computer. There are 4 microcontrollers and each of them is doing (almost) the same thing:
1. Get data from 4 analog inputs. All 4 microcontrollers together have 16 analog inputs.
2. Share the data with the other microcontrollers.
3. Now each microcontroller has the same 16 variables and calculates 4 functions of 16 variables. The functions are quite complicated, including cross products and dot products, and possibly also sin() and cos(). All together 16 different functions.
4. Each microcontroller sends the results of the 4 functions via I2C to its own MCP4728 DAC. All together 16 analog outputs.
This procedure shall repeat as fast as possible. About 10kHz would be fine.
My question is only how to realize step 2.

Michael
 
Given the fairly low data rates ((3 * 12 bits + 4 overhead bits) * 10,000 samples/s = 400 KBits/S of payload, i.e. 50 KBytes/sec, or ~500 KBits/S on a serial line once start and stop bits are added), it might be possible to use a pseudo-star network:

(Packing three 12-bit values into 40 bits, and unpacking them again, shouldn't take a Teensy more than a microsecond.)
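
For illustration, here is a minimal sketch of that packing step in plain C++. The frame layout (a 4-bit counter in the top nibble followed by three 12-bit samples, MSB first) and the names `Frame`, `packFrame` and `unpackFrame` are assumptions made for this example, not something specified in the thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical frame: 4-bit counter + three 12-bit samples = 40 bits = 5 bytes.
struct Frame {
    uint8_t  counter;    // 0..15
    uint16_t sample[3];  // each 0..4095
};

// Pack the frame MSB first: counter nibble, then sample[0], sample[1], sample[2].
std::array<uint8_t, 5> packFrame(const Frame& f) {
    assert(f.counter < 16);
    uint64_t bits = f.counter & 0x0Fu;
    for (int i = 0; i < 3; ++i) {
        assert(f.sample[i] < 4096);
        bits = (bits << 12) | (f.sample[i] & 0x0FFFu);
    }
    std::array<uint8_t, 5> out{};
    for (int i = 0; i < 5; ++i) {
        out[i] = static_cast<uint8_t>(bits >> (8 * (4 - i)));
    }
    return out;
}

// Reverse of packFrame: rebuild the 40-bit word, then peel off the fields.
Frame unpackFrame(const std::array<uint8_t, 5>& in) {
    uint64_t bits = 0;
    for (uint8_t b : in) {
        bits = (bits << 8) | b;
    }
    Frame f{};
    for (int i = 2; i >= 0; --i) {
        f.sample[i] = static_cast<uint16_t>(bits & 0x0FFFu);
        bits >>= 12;
    }
    f.counter = static_cast<uint8_t>(bits & 0x0Fu);
    return f;
}
```

This is only shifts and masks on one 64-bit word, so it should be cheap on a Cortex-M7, but I haven't timed it on a Teensy.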

Teensy 1 sends its data to Teensies 2,3,4 on its UART1. That UART output connects to UART1 inputs on the other three Teensies. I'm hoping that with some short traces and series resistors, the T1 output can drive three UART inputs. (You could probably test this with one Teensy sending from UART1 to its own UARTs 2,3,4 over reasonable length wires.)

T1 receives data from T2, T3, T4, on the UART channel equivalent to the Teensy number.

T2 Sends on UART2 and receives on UART channels 1,3 and 4.

etc, etc, for Teensies 3 and 4.

You could use the extra 4 bits in the data stream as an incrementing sample counter to detect some loss of sync events.
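
A rough sketch of how the receiver might check such a counter, assuming it increments by one per frame and wraps from 15 to 0. The function name `missedFrames` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Number of frames missed between two consecutive received 4-bit counters.
// Returns 0 when the counters are in sequence (including the 15 -> 0 wrap).
// A loss of exactly 16, 32, ... frames cannot be detected with only 4 bits.
inline uint8_t missedFrames(uint8_t prevCounter, uint8_t currCounter) {
    assert(prevCounter < 16 && currCounter < 16);
    return static_cast<uint8_t>((currCounter - prevCounter - 1) & 0x0F);
}
```

For example, `missedFrames(3, 4)` gives 0, and `missedFrames(3, 6)` gives 2 because frames 4 and 5 never arrived.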

The OP hasn't specified the ADC and DAC resolutions required. If they are less than 12 bits, some further data packing could allow lower baud rates. If the ADC inputs are the onboard ADCs, getting 12-bit resolution on a board with four T4s, each adding its own share of digital noise, will be a challenge in itself.

Another questionable part is whether you can update 4 DAC channels at 10KHz. The MCP4728 data sheet says it has a high-speed I2C capability of 3.4Mb/s, but will the Teensy I2C go that fast? If you are limited to the normal 400KBits/S, sending 4 12-bit values + overhead at 10K samples/second seems unlikely.
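
To put rough numbers on that, here is a back-of-the-envelope estimate in C++. It assumes the DAC is updated with one transfer of 1 address byte plus 2 bytes per channel, 9 clocks per byte including the ACK, and about 2 extra clocks for START/STOP. Please check those framing assumptions against the MCP4728 data sheet; driver overhead is ignored entirely.

```cpp
#include <cassert>

// Assumed I2C framing (verify against the data sheet):
// 1 address byte + 4 channels * 2 bytes, 9 clocks per byte (8 data + ACK),
// plus roughly 2 clocks for START and STOP.
constexpr int kBytesPerUpdate  = 1 + 4 * 2;               // 9 bytes
constexpr int kClocksPerUpdate = kBytesPerUpdate * 9 + 2; // 83 clocks

// Bus time in microseconds for one 4-channel DAC update at the given SCL rate.
double dacUpdateMicros(double sclHz) {
    assert(sclHz > 0.0);
    return kClocksPerUpdate * 1e6 / sclHz;
}
```

Under these assumptions, `dacUpdateMicros(400e3)` comes out at about 207 us, which is more than two whole 100 us cycles, while `dacUpdateMicros(3.4e6)` is about 24 us.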

There has also been no mention of whether the 4 DAC outputs need to be synchronized and whether the complex calculations can be completed in under 100 microseconds on all processors for all data inputs. If you need some digital filtering to reduce noise, things get really interesting!

It might be simplest to have Teensies 2,3, and 4 responding to a digital update pulse generated by T1. Those Teensies compute their functions based on the last inputs from the others and update their DACs when they receive the sync pulse.

You could also simplify the wiring by connecting the Teensies in a token ring so that each teensy needs only one send and one receive serial port. However, that communications architecture may need higher baud rates and more complex data handling. The fact that the data has to go around the ring, taking 4 packet sends and receives, will also affect the throughput. The ring packets also have to be at least 6 bytes to hold all 4 12-bit values.


Off topic: Sounds like a good application for the USB_TMC (Test and Measurement Class) driver I was working on about a year ago. I had it working and transferring 4 to 8 MB/second from two Teensies to a host using USB at 480Mbits/S and running mostly with DMA and pretty low overhead for the host and ADC processors. For your application you would need a host T4x to talk to all of the Teensies doing the ADC conversions. Synchronization gets easier since TMC is a command-response system. One of the issues with the driver is that it is difficult and VERY unintuitive to add a completely new USB driver to the Teensy USB stack. Alas, I'm out of the country until the end of October and can't revive the driver and do a test app or try out the serial ring or star networks.
 
Teensy 4.1 is said to have 18 adc inputs. So why not share the analog input lines? Perhaps add buffers.

Also one single Teensy can do a lot at 10kHz datarate. There is plenty of Ram. Perhaps use precalculated lookup tables for 12 bit results.
 
Thank you all for your answers. I think the easiest solution is to use three UARTs on each Teensy. Each Teensy has a direct UART connection to each of the other three Teensys. No logic gates for routing. No star network. Each Teensy is doing almost the same thing:
1. Get 2 bytes from the first ADC. I haven't yet decided if I use the internal ADCs or an external MCP3304.
2. Send these 2 bytes to all three UARTs, with 6 Mbps (or 20Mbps, if required).
3. Receive 2 bytes from each UART and save them.
4. Repeat steps 1 to 3 for the other 3 ADC channels.
5. Now each Teensy has the same 16 variables and calculates 4 functions of 16 variables.
6. Each Teensy sends the 4 results via I2C to its own MCP4728 DAC. All together 16 analog outputs.
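
As a sanity check on steps 2 to 4, the wire time for the data sharing can be estimated as below. This is only a sketch: it assumes 8N1 framing (10 bit times per byte) and that the three links run in parallel, and it ignores ADC time and software overhead. The function name is made up.

```cpp
#include <cassert>

// Wire time in microseconds for one full exchange cycle on each UART link.
// Assumes 8N1 framing (start + 8 data + stop = 10 bit times per byte).
double exchangeMicros(double baud, int bytesPerChannel = 2, int channels = 4) {
    assert(baud > 0.0);
    const double bitTimesPerByte = 10.0;
    return channels * bytesPerChannel * bitTimesPerByte * 1e6 / baud;
}
```

With these assumptions the four 2-byte transfers take about 13.3 us at 6 Mbps and 4 us at 20 Mbps, out of a 100 us cycle at 10 kHz.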
Teensy 4.1 is said to have 18 adc inputs. So why not share the analog input lines? Perhaps add buffers.
16 AD conversions on a single Teensy take too much time. I need several Teensys anyway, because computing the 16 functions would take too long on a single Teensy. The functions are complicated vector arithmetic (many cross products and dot products). You want to see them? Equations 11a,b,c in the following document. Keep in mind that L, S1 and S2 are vectors of three elements.

Perhaps use precalculated lookup tables for 12 bit results.
Functions of many variables can't be precalculated. A function of 2 variables would already require 4096^2 = 16M entries.
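
The table size can be worked out like this (a small sketch; the helper name is made up, and the 2 bytes per entry is an assumption for storing a 12-bit result):

```cpp
#include <cassert>
#include <cstdint>

// Entries needed for a full lookup table over `inputs` variables
// with `bitsPerInput` bits each. Only valid while the product is below 64.
uint64_t tableEntries(unsigned bitsPerInput, unsigned inputs) {
    assert(bitsPerInput * inputs < 64);
    return uint64_t{1} << (bitsPerInput * inputs);
}
```

`tableEntries(12, 2)` is 16,777,216 entries, or 32 MiB at 2 bytes per entry. With 16 variables the exponent would be 192 bits, which doesn't even fit in a 64-bit integer.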

The OP hasn't specified the ADC and DAC resolutions required.
12-bit is sufficient.

There has also been no mention of whether the 4 DAC outputs need to be synchronized and whether the complex calculations can be completed in under 100 microseconds on all processors for all data inputs. If you need some digital filtering to reduce noise, things get really interesting!
It's sufficient if the ADC and DAC conversions are roughly synchronized within 50us. I don't yet know if the complex calculations fit in the 100us time slot (100us minus ADC time, minus time for data sharing, minus DAC time). 10 kHz is not a hard limit. 5 kHz would also be acceptable. I'll find out what's possible, and let it run as fast as possible.

Michael
 
As you work on the calculations part, unless the math is naturally integers you probably should use 32 bit float variables. The FPU in Teensy 4.0 and 4.1 computes 32 bit float math at approx the same speed as integers. 32 bit float can end up much faster if you avoid lots of if-else checks on numerical ranges commonly needed with integer math. Even though Cortex-M7 inside Teensy 4.x has branch prediction, mispredicted branches waste a few cycles or more. Best speed usually avoids if-else conditional branching on numerical results.

The FPU can also do 64 bit double math at half the speed, for basic math. However, math functions like sin(), cos(), log() typically run more than 2X slower than their 32 bit versions sinf(), cosf(), logf() because they use polynomial approximations that need more terms computed or convergence algorithms needing more iterations to reach 64 bit precision.

C / C++ has rules to automatically promote 32 bit float to 64 bit double when you perform any math operation with another 64 bit double. The main "gotcha" is numerical constants. If you write "x = y + 2.0" where x and y are 32 bit floats, the math is automatically promoted to slower 64 bit double because "2.0" defaults to 64 bit double. You would need to write "x = y + 2.0f" for the constant 2.0 to be treated as 32 bit float.
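
A small, compiler-checked illustration of that rule in standard C++. The function names are made up for the example.

```cpp
#include <cassert>
#include <type_traits>

// float + double literal is evaluated as double; float + float literal stays float.
static_assert(std::is_same<decltype(1.0f + 2.0), double>::value,
              "2.0 is a double, so the sum is promoted to double");
static_assert(std::is_same<decltype(1.0f + 2.0f), float>::value,
              "2.0f is a float, so the sum stays float");

// Computed in 64 bit double, then narrowed back to float on return.
float addWithDoubleConstant(float y) {
    return static_cast<float>(y + 2.0);
}

// Computed entirely in 32 bit float.
float addWithFloatConstant(float y) {
    return y + 2.0f;
}
```

Both functions return the same value here; the difference is only in which precision the addition is carried out, and therefore how fast it runs.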

If you need 64 bit precision, then of course this is all a moot point. But if 32 bit float is good enough, hopefully these little details can help you avoid the common problems where math you intended to be fast 32 bit float gets compiled with slower 64 bit double.
 
Thank you all for your answers. I think the easiest solution is to use three UARTs on each Teensy. Each Teensy has a direct UART connection to each of the other three Teensys. No logic gates for routing. No star network. Each Teensy is doing almost the same thing:
1. Get 2 bytes from the first ADC. I haven't yet decided if I use the internal ADCs or an external MCP3304.
2. Send these 2 bytes to all three UARTs, with 6 Mbps (or 20Mbps, if required).
You may save a lot of time and complexity in the transmission if you bypass the serial driver and simply store your two bytes in the UART transmission FIFO. That should take only a few lines of code and eliminates all the code and latency issues of the normal serial transmit handler.

Similarly, you can bypass the interrupt-driven receive handler and simply read two bytes from each UART's receive FIFO.


OTOH this kind of optimization may not be worth the trouble. If you properly interleave data transmission and calculations, the standard interrupt-driven serial driver has all the time spent in calculation and DAC transmission to send two bytes to, and receive two bytes from, each of the other Teensies.

One or more digital sync signals generated by one of the Teensies and sent to the three others will probably help a lot in the debugging stage. The control Teensy could also have a "Restart" input with a connected button that might help if you don't get your Teensy program exactly right the first time! ;-)

I took a quick look at the equations in the referenced paper. They made my head spin! (it's morning and I'm properly caffeinated, so puns are inevitable!) Those equations reminded me of many months of converting guidance and navigation calculations from MatLab to C++. I sincerely hope you have either a good data set and a way to play back the analog signals or a simulator that can generate signals to produce known outputs.

 
You want SPI ADCs and DACs if you want fast response; otherwise I2C will be the bottleneck.
You can probably just use serial connections(*) between the processors if using I2C data conversion chips...

Have you thought about the need for synchronous conversion for all 16 channels? Skewing the sampling timepoints for different channels will introduce distortion. Multiple SPI DACs or ADCs can be clocked synchronously on the same SPI bus. Some ADCs have sample/hold clocks that enable multiple channels to be sampled simultaneously.

(*) serial1 tx from teensy 1 to the others' serial1 rx's
serial2 tx from teensy 2 to the others' serial2 rx's etc etc
 