Thoughts about extremely low overhead and low bitrate communications on Teensy 4.1

madmyers

I'm working on a project with Teensy 4.1. It's very (very!) timing centric and I've got it working. I need to communicate with an external device using whatever method/protocol is best, but "easy" things like Serial take way too much time.

I'm looking to send bytes per second, nothing huge!

For example, I need to be able to send some data within a 250ns window. I have multiple of these windows, so can use them to send additional data.

Is there a better solution than just using 2 data pins, one for clock and one for data and just clock out some data very slowly?

I tried to lookup what overhead exists to use the ethernet port, even if just to send an Ethernet frame in a non-IP compliant way, but didn't find any info on overhead.

Maybe Teensy has something built in that I could just use?

Thanks!
 
So you need to send a small amount of data but at a very precise time?

250ns is a very tight timing requirement to hit; that's in the same ballpark as the processor's interrupt response time. Simply getting the transmit to land within a window that small is not simple.

I don't know the specifics of your application, but generally I'd say split this into two channels: separate the data and the timing information into two different things.
I.e., send the data early using a simple method, e.g. UART. When you want the other device to "receive" the data, toggle a GPIO using digitalWriteFast(). That triggers an interrupt on the receiving device, telling it to read / act on the byte that is waiting for it in the UART receive buffer.


And keep in mind that all the processors' clocks will be running at slightly different rates. Even if two units start with their clocks perfectly synchronised, it won't take long for them to drift apart by that much.
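
To put a rough number on the drift point, here is a minimal sketch of the arithmetic. The function name is hypothetical, and the 50 ppm relative error used in the note below is an illustrative assumption, not a measured figure for any Teensy crystal.

Code:
```cpp
// Time (in microseconds) for two free-running clocks to drift apart by
// `windowNs` nanoseconds, given their relative frequency error in ppm.
// 1 ppm of error accumulates 1 ns of drift per millisecond, so:
//   drift_ns = elapsed_us * ppm * 1e-3
//   elapsed_us = windowNs / (ppm * 1e-3)
constexpr double driftTimeUs(double windowNs, double relativePpm) {
  return windowNs / (relativePpm * 1e-3);
}
```

Under that assumed 50 ppm mismatch, driftTimeUs(250.0, 50.0) works out to about 5000 us, i.e. the two clocks would be 250 ns apart after roughly 5 ms.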
 
I'm looking to send bytes per second, nothing huge!

For example, I need to be able to send some data within a 250ns window.

Maybe Serial (real hardware serial on TX1) can work if you just write directly to the data transmit register. Just use Serial1.begin(baud) to get the hardware configured, and then rather than Serial1.write(), which does complicated buffering, just write your byte directly to LPUART6_DATA. Maybe like this:

Code:
void setup() {
  Serial1.begin(115200);
}

void loop() {
  uint32_t t1 = ARM_DWT_CYCCNT;
  LPUART6_DATA = 'T';
  LPUART6_DATA = 'e';
  LPUART6_DATA = 's';
  LPUART6_DATA = 't';
  uint32_t t2 = ARM_DWT_CYCCNT;
  Serial.printf("write 4 bytes took %u cycles\n", t2-t1);
  delay(100);
}

The catch is the serial port has only a 4 byte FIFO, so if you write a 5th byte before at least 1 of the prior written bytes finishes transmitting, you'll get an overflow condition. But hopefully not an issue if you're writing only a few bytes per second.
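
If overflow ever did become a concern, one possible guard is to check how many bytes are still queued before writing. The sketch below is an assumption on my part, not something from the code above: it relies on the LPUART WATER register reporting the transmit FIFO fill level in its TXCOUNT field (bits 10:8), and the helper names are hypothetical. The helpers are plain functions that take the register value, so the logic can be checked off-target.

Code:
```cpp
#include <cstdint>

// Depth of the LPUART transmit FIFO as described above.
constexpr uint32_t kTxFifoDepth = 4;

// Extracts the TXCOUNT field (bits 10:8) from an LPUART WATER register value.
// Assumption: this field holds the number of bytes currently in the TX FIFO.
constexpr uint32_t txFifoCount(uint32_t water) {
  return (water >> 8) & 0x7u;
}

// True when `n` more bytes can be written without exceeding the FIFO depth.
constexpr bool txFifoHasRoom(uint32_t water, uint32_t n) {
  return txFifoCount(water) + n <= kTxFifoDepth;
}
```

On the Teensy this would presumably be used as something like `if (txFifoHasRoom(LPUART6_WATER, 1)) LPUART6_DATA = b;`, which adds one register read to the cost of each write.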

If you run this, you'll see it measures 23 cycles to write 4 bytes, which at 600 MHz is only about 38ns, well under your 250ns requirement. Interestingly, if you write 3 bytes or fewer it takes much less time, only a few cycles. My best guess is the bus bridge inside the chip can buffer a couple of writes. So if you really care about the speed this code executes, plan on writing 3 or fewer bytes at once.

And just to confirm this really does work, here's what my scope sees on TX1 (pin 1):

file.png
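
For anyone wanting to redo these numbers with a different clock or baud rate, here is a small sketch of the conversions. The helper names are hypothetical, and it assumes the default 600 MHz CPU clock and the 8N1 framing (10 bits on the wire per byte) that Serial1.begin(baud) uses by default.

Code:
```cpp
#include <cstdint>

// Converts a CPU cycle count to nanoseconds for a given core clock in Hz.
constexpr double cyclesToNs(uint32_t cycles, double cpuHz) {
  return cycles * 1e9 / cpuHz;
}

// Converts a time budget in nanoseconds to CPU cycles at a given core clock.
constexpr double nsToCycles(double ns, double cpuHz) {
  return ns * cpuHz / 1e9;
}

// Time on the wire for one UART byte with 8N1 framing:
// 1 start bit + 8 data bits + 1 stop bit = 10 bit periods.
constexpr double byteTimeUs(double baud) {
  return 10.0 * 1e6 / baud;
}
```

With these, 23 cycles at 600 MHz is about 38.3 ns, the 250 ns window is 150 cycles, and one byte at 115200 baud occupies the wire for about 86.8 us, so a full 4 byte FIFO takes roughly 347 us to drain.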
 
Out of interest what is the latency between putting the data into the tx buffer and the data output starting? If you set a pin high, write the serial data and then set it low how do the signals line up? If you repeat this triggering off the GPIO and with persistence turned on how much variation is there?
Since the UARTs are using a divided clock (at least I assume they do; I've not actually checked, but on most MCUs this is the case) rather than the full processor clock, I'd expect there to be a bit of jitter in the latency between the processor writing the data and the data actually being output. 250ns is a fraction of a bit period, so depending on how the clocking is done the physical output time may not be very deterministic even if the time the processor queues the data up is.

Whether this matters depends on the details of the actual application.

edit - re-reading the question I may have missed the issue. I was thinking the aim was for the timing to be controlled to within 250 ns. Reading it again I suspect he means the transmit routine must exit within 250 ns. Actual time doesn't matter as long as the code doesn't take long to run. In which case ignore everything I said.
 
Agreed, the meaning of the question could be read a few different ways.

Maybe not really relevant to the original question (or maybe it is?) but I got curious why writing 4 bytes measures so much longer than 1, 2, or 3. So I took a quick dive into the generated assembly...

Turns out the compiler can generate much more efficient code for the 3 byte case. It generates this sequence, with 2 LDR which read the cycle counter and 3 STR which write to the UART.

Code:
      74:       684a            ldr     r2, [r1, #4]
      76:       61dd            str     r5, [r3, #28]
      78:       61dc            str     r4, [r3, #28]
      7a:       61d8            str     r0, [r3, #28]
      7c:       684b            ldr     r3, [r1, #4]

However, it spends 5 instructions before that first LDR getting the registers set up with the 2 addresses and 3 data bytes. Two of those are LDR which fetch constants. So the reported time is probably only about half, since the work to set up the registers doesn't get measured.

But for reasons I don't understand, the compiler does something completely different (rather than just using R6 or R7) for the case of 4 writes. The register setup stuff gets mixed in with the sequence of five LDR / STR which do the actual work. The first thing loop() does is push R3, R4, R5, LR onto the stack. Maybe the compiler believes the slower code is overall better because it avoided saving another register on the stack?

Anyway, the take-away message here is probably not to trust this simple cycle count measurement too much, as it doesn't measure register setup work the compiler does before the first read of the cycle counter. And the compiler can generate wildly different code. Results may vary quite a lot depending on the surrounding code of a real application.
 
Yes, I meant the transmit routine must be no more than 250 ns.
Thanks for thinking about my problem and asking related interesting questions.
 
This is amazing. Thank you!

I don't suppose there's a way to do this with Pin 4? My setup currently is using Pin 1 -- but I can change that if needed!

-- Altan
 
Interesting!
To clarify, using the ARM_DWT_CYCCNT will always give the real info. The "probably not to trust" was about expectations based on C code and the compiler?
 
I don't suppose there's a way to do this with Pin 4? My setup currently is using Pin 1 -- but I can change that if needed!
Any UART Tx pin could be used. You would need to change the UART the code references but other than that it should be identical. Just take care, the Teensy names and the UART numbers don't match (e.g. in the example Tx1 is not UART1, it's UART6)
The mapping is:
Serial1 = UART6
Serial2 = UART4
Serial3 = UART2
Serial4 = UART3
Serial5 = UART8
Serial6 = UART1
Serial7 = UART7
Serial8 = UART5

So to use pin 8 (Serial2 Tx) you would modify Paul's code to use LPUART4.
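
The mapping above can also be kept as a small lookup table so the name mismatch is documented in code. This is just a sketch; the table contents come straight from the list above, while the array and function names are hypothetical.

Code:
```cpp
#include <cstdint>

// LPUART instance behind each Teensy 4.1 SerialN, indexed by N - 1.
// Values copied from the Serial1..Serial8 mapping listed above.
constexpr uint8_t kSerialToLpuart[8] = {6, 4, 2, 3, 8, 1, 7, 5};

// Returns the LPUART number for SerialN (N = 1..8), or 0 if N is out of range.
inline uint8_t lpuartForSerial(uint8_t serialN) {
  return (serialN >= 1 && serialN <= 8) ? kSerialToLpuart[serialN - 1] : 0;
}
```

For example, lpuartForSerial(2) gives 4, which matches the pin 8 case: call Serial2.begin(baud) and then write to LPUART4_DATA instead of LPUART6_DATA.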

To clarify, using the ARM_DWT_CYCCNT will always give the real info. The "probably not to trust" was about expectations based on C code and the compiler?

Correct. It will accurately count the number of cycles between those two points in the code. The UART register is considered volatile, so the compiler is restricted in how it can optimise those specific writes. But the compiler may have reordered the surrounding work and generally tried to be clever, in a way that means you're not timing all of the extra instructions required. You will get the time spent writing the data to the UART, but not necessarily the time spent getting the data ready to write or getting things back to the previous state so that the rest of your code can continue.
 
Thanks. I wondered about the mapping of Teensy serial names to UARTs.
And based on that info, Pin 4 isn't an option.
 
Another option: DMA_UART. See https://forum.pjrc.com/index.php?th...yte-directly-in-teensy-3-5.71466/#post-315706

In your 250 ns you’ll have enough time to fill the DMA Tx buffer with up to, say, 64 bytes, and then you’ll still have enough time left to trigger the UART DMA transmit process. (Assuming running the CPU at the default 600 MHz). Irrespective of baudrate, the buffer content will appear on the tx pin and the CPU will not get interrupted. Plus, if needed, any traffic in the opposite direction will land in the Teensy DMA Rx buffer.
 
Appreciate the insights!

I've got a bit of a pin problem so I'm looking into other approaches for the short term (may adjust my PCB longer term if it's needed).

@Paul shared some great info about Ethernet in a different thread I started. It was here https://forum.pjrc.com/index.php?th...teensy-4-1-ethernet-driver.75722/#post-348692.

Is the Ethernet hardware similar to the HW serial in the sense of "once you set it up, it runs by itself"? I.e., if I've got my ring buffers set up and finally do a

Code:
        ENET_TDAR = ENET_TDAR_TDAR;

Does the hardware take over? Perhaps it could TX many more bytes than needed as long as the setup before fits into my small window?

Note: this would be kind of proprietary ethernet on a special "network". Not meant for an IP network.

UPDATE: On a standalone Teensy, I soldered the Ethernet stuff and did some tests with the raw code mentioned above. I think (early belief) that it does run with very little overhead. For example, I could use a single buffer and just make sure it's available, then write directly to that buffer for a zero-copy little network stack. It certainly runs just fine with interrupts disabled (a test I did). My minor problem is that for my other effort I have the Teensy soldered to another board and I didn't add the Ethernet headers before I did that (so I cannot access the "bottom" where I'd need to solder). I'd still love to hear people's thoughts/experiences.
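
As an illustration of what "write directly to that buffer" might look like for a non-IP frame, here is a minimal sketch that lays out an Ethernet II frame in a caller-supplied buffer. Everything here is an assumption for illustration: the function name is hypothetical, the EtherType 0x88B5 in the usage note is the IEEE "local experimental" value chosen as a placeholder, and padding to 60 bytes in software is a conservative choice rather than something the hardware is known to require. Handing the buffer to the transmit descriptor is deliberately left out.

Code:
```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kHeaderLen = 14;    // dst MAC (6) + src MAC (6) + EtherType (2)
constexpr size_t kMinFrameLen = 60;  // minimum frame size, excluding the 4-byte FCS
constexpr size_t kMaxPayload = 1500; // standard maximum payload size

// Writes an Ethernet II frame into `out`, zero-padding short frames up to
// kMinFrameLen. Returns the frame length, or 0 if it does not fit.
size_t buildRawFrame(uint8_t *out, size_t outCap,
                     const uint8_t dst[6], const uint8_t src[6],
                     uint16_t etherType,
                     const uint8_t *payload, size_t payloadLen) {
  if (payloadLen > kMaxPayload) return 0;
  size_t len = kHeaderLen + payloadLen;
  if (len < kMinFrameLen) len = kMinFrameLen;
  if (len > outCap) return 0;

  std::memcpy(out, dst, 6);
  std::memcpy(out + 6, src, 6);
  out[12] = static_cast<uint8_t>(etherType >> 8);   // EtherType is big-endian
  out[13] = static_cast<uint8_t>(etherType & 0xFF);
  if (payloadLen > 0) std::memcpy(out + kHeaderLen, payload, payloadLen);
  // Zero the padding so no stale buffer contents go out on the wire.
  std::memset(out + kHeaderLen + payloadLen, 0, len - kHeaderLen - payloadLen);
  return len;
}
```

For a few bytes of payload the result is always a 60 byte frame, e.g. buildRawFrame(buf, sizeof(buf), dstMac, srcMac, 0x88B5, data, 4) returns 60. If the header is written once at startup, only the payload bytes need touching inside the timing window.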
 
Good news is that you shouldn't have to do all this work yourself. I've designed the QNEthernet library to work without an IP stack, and with an "EthernetFrame" API similar to the "EthernetUDP" API. In other words, you can easily use the library with just raw frames. Here's how:

1. In lwipopts.h (or via project build options, say with PlatformIO), set LWIP_IPV4 to 0
2. In qnethernet_opts.h (or project build options), set QNETHERNET_ENABLE_RAW_FRAME_SUPPORT to 1
3. Use the EthernetFrame API.

I modified the RawFrameMonitor example to remove IP-related things (but kept the possibly-unnecessary VLAN things):
C++:
// C++ includes
#include <algorithm>

#include <QNEthernet.h>

using namespace qindesign::network;

// VLAN EtherType constants
constexpr uint16_t kEtherTypeVLAN = 0x8100u;
constexpr uint16_t kEtherTypeQinQ = 0x88A8u;

// Tracks the received frame count.
int frameCount = 0;

// Main program setup.
void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  printf("Starting...\r\n");

  // Print the MAC address
  uint8_t mac[6];
  Ethernet.macAddress(mac);  // This is informative; it retrieves, not sets
  printf("MAC = %02x:%02x:%02x:%02x:%02x:%02x\r\n",
         mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

  // Add listeners before starting Ethernet

  Ethernet.onLinkState([](bool state) {
    printf("[Ethernet] Link %s\r\n", state ? "ON" : "OFF");
  });

  // Initialize Ethernet
  printf("Starting Ethernet...\r\n");
  if (!Ethernet.begin()) {
    printf("Failed to start Ethernet\r\n");
    return;
  }
}

// Main program loop.
void loop() {
  int size = EthernetFrame.parseFrame();
  if (size <= 0) {
    return;
  }

  frameCount++;

  // Access the frame's data directly instead of using read()
  // size = EthernetFrame.read(buf, size);
  const uint8_t *buf = EthernetFrame.data();
  if (size < EthernetFrame.minFrameLen() - 4) {
    printf("%d: SHORT Frame[%d]: ", frameCount, size);
    for (int i = 0; i < size; i++) {
      printf(" %02x", buf[i]);
    }
    printf("\r\n");
    return;
  }

  printf("%d: Frame[%d]:"
         " dst=%02x:%02x:%02x:%02x:%02x:%02x"
         " src=%02x:%02x:%02x:%02x:%02x:%02x",
         frameCount, size,
         buf[0], buf[1], buf[2], buf[3], buf[4], buf[5],
         buf[6], buf[7], buf[8], buf[9], buf[10], buf[11]);
  uint16_t tag = (uint16_t{buf[12]} << 8) | buf[13];

  // Tag and (possibly stacked) VLAN processing
  int payloadStart = 14;
  int vlanTagNum = 0;

  // Loop because there could be more than one of these
  while (tag == kEtherTypeQinQ) {  // IEEE 802.1ad (QinQ)
    if (payloadStart + 4 > size) {
      printf(" TRUNCATED QinQ\r\n");
      return;
    }

    uint16_t info = (uint16_t{buf[payloadStart]} << 8) | buf[payloadStart + 1];
    payloadStart += 2;
    printf(" VLAN tag %d info=%04Xh", ++vlanTagNum, info);
    tag = (uint16_t{buf[payloadStart]} << 8) | buf[payloadStart + 1];
    payloadStart += 2;
  }

  if (tag == kEtherTypeVLAN) {  // IEEE 802.1Q (VLAN tagging)
    if (payloadStart + 4 > size) {
      printf(" TRUNCATED VLAN\r\n");
      return;
    }

    uint16_t info = (uint16_t{buf[payloadStart]} << 8) | buf[payloadStart + 1];
    payloadStart += 2;
    if (vlanTagNum > 0) {
      printf(" VLAN tag %d info=%04Xh", ++vlanTagNum, info);
    } else {
      printf(" VLAN info=%04Xh", info);
    }
    tag = (uint16_t{buf[payloadStart]} << 8) | buf[payloadStart + 1];
    payloadStart += 2;
  } else if (vlanTagNum > 0) {
    printf(" MISSING VLAN");
  }
  // 'tag' now holds the length/type field

  int payloadEnd = size;
  if (tag > EthernetFrame.maxFrameLen()) {
    printf(" type=%04Xh\r\n", tag);
  } else {
    printf(" length=%u\r\n", tag);
    payloadEnd = std::min(payloadStart + tag, payloadEnd);
  }

  printf("\tpayload[%d]=", payloadEnd - payloadStart);
  for (int i = payloadStart; i < payloadEnd; i++) {
    printf(" %02x", buf[i]);
  }
  printf("\r\n");
}

You can remove all the VLAN handling to simplify this even further. You could also turn on promiscuous mode to see all the frames on the network. (See QNETHERNET_ENABLE_PROMISCUOUS_MODE.) Last, make sure to have a look at the docs in EthernetFrame.h.

Update: See the next message for a version without the VLAN processing.
 
Here's a version without the VLAN processing:
C++:
// C++ includes
#include <algorithm>

#include <QNEthernet.h>

using namespace qindesign::network;

// VLAN EtherType constants
constexpr uint16_t kEtherTypeVLAN = 0x8100u;
constexpr uint16_t kEtherTypeQinQ = 0x88A8u;

// Tracks the received frame count.
int frameCount = 0;

// Main program setup.
void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  printf("Starting...\r\n");

  // Print the MAC address
  uint8_t mac[6];
  Ethernet.macAddress(mac);  // This is informative; it retrieves, not sets
  printf("MAC = %02x:%02x:%02x:%02x:%02x:%02x\r\n",
         mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

  // Add listeners before starting Ethernet

  Ethernet.onLinkState([](bool state) {
    printf("[Ethernet] Link %s\r\n", state ? "ON" : "OFF");
  });

  printf("Starting Ethernet...\r\n");
  if (!Ethernet.begin()) {
    printf("Failed to start Ethernet\r\n");
    return;
  }
}

// Main program loop.
void loop() {
  int size = EthernetFrame.parseFrame();
  if (size <= 0) {
    return;
  }

  frameCount++;

  // Access the frame's data directly instead of using read()
  // size = EthernetFrame.read(buf, size);
  const uint8_t *buf = EthernetFrame.data();
  if (size < EthernetFrame.minFrameLen() - 4) {
    printf("%d: SHORT Frame[%d]: ", frameCount, size);
    for (int i = 0; i < size; i++) {
      printf(" %02x", buf[i]);
    }
    printf("\r\n");
    return;
  }

  printf("%d: Frame[%d]:"
         " dst=%02x:%02x:%02x:%02x:%02x:%02x"
         " src=%02x:%02x:%02x:%02x:%02x:%02x",
         frameCount, size,
         buf[0], buf[1], buf[2], buf[3], buf[4], buf[5],
         buf[6], buf[7], buf[8], buf[9], buf[10], buf[11]);
  uint16_t tag = (uint16_t{buf[12]} << 8) | buf[13];

  if (tag == kEtherTypeQinQ || tag == kEtherTypeVLAN) {
    printf(" tag=%04Xh\r\n", tag);
    return;
  }

  int payloadStart = 14;

  // 'tag' now holds the length/type field

  int payloadEnd = size;
  if (tag > EthernetFrame.maxFrameLen()) {
    printf(" type=%04Xh\r\n", tag);
  } else {
    printf(" length=%u\r\n", tag);
    payloadEnd = std::min(payloadStart + tag, payloadEnd);
  }

  printf("\tpayload[%d]=", payloadEnd - payloadStart);
  for (int i = payloadStart; i < payloadEnd; i++) {
    printf(" %02x", buf[i]);
  }
  printf("\r\n");
}
 