Teensy4.1 ArduinoWebsockets NativeEthernet or QNEthernet - WebSocket messages larger than about 1440 bytes are received corrupted and crash the Teensy

I'm running into a bug on a Teensy4.1 used in a robotics application that uses WebSockets to receive information from a web server. If a message is larger than about the size of a single Ethernet packet, it arrives corrupted. This appears to be a memory overflow bug somewhere. I've removed all our code and reproduced the bug in a lightly modified version of the Teensy4.1 example code from the ArduinoWebsockets release.

The Teensy Code:
C++:
/*
  Teensy41 Websockets Client (using NativeEthernet)
  2024-08-24 Websocket test example modified by Curt Welch curt@kcwc.com
*/

#include <ArduinoWebsockets.h>
#include <TeensyID.h>
// #include <NativeEthernet.h>
#include <QNEthernet.h>

using namespace websockets;
WebsocketsClient client;

// We will set the MAC address at the beginning of `setup()` using TeensyID's
// `teensyMAC` helper.
byte mac[6];
IPAddress ip(192, 168, 1, 100); // Static IP for this Teensy

// Enter websockets url.
// Note: wss:// currently not working.
// const char* url  = "ws://echo.websocket.org";
const char *url = "ws://192.168.1.10:9000"; // Curt's Mac

auto MessageNow = millis();
auto MessageLast = MessageNow;

void setup() {
  // Start Serial and wait until it is ready.
  Serial.begin(9600);

  while (!Serial)
    ;

  Serial.printf("Start websocket test\n");

  // Configure Ethernet
  teensyMAC(mac); // Get the mac address for this Teensy from rom
  Ethernet.begin(mac, ip); // Sets the mac and ip address
  Serial.print("Ethernet connected (");
  Serial.print(Ethernet.localIP());
  Serial.println(")");

  // Set up callback when messages are received.
  client.onMessage([](WebsocketsMessage message) {
    MessageNow = millis();
    Serial.printf("Got Message len is %u ms is %lu\n", (unsigned)message.data().length(), (unsigned long)(MessageNow - MessageLast));
    MessageLast = MessageNow;
    Serial.println(message.data());
  });   
}

bool connected = false;

void loop() {
  // Check for incoming messages.

  if (!connected) {
    Serial.printf("Connecting to server.\n");
    if (client.connect(url)) {
      connected = true;
      Serial.printf("Connected!\n");
      Serial.printf("Sending ID\n");
      client.send("{\"connect\" : \"true\"}");
      // {
      //   teensyId: "ins",
      //   connect: true,
      // }
    } else {
      Serial.print("Connection Failed.\n");
    }
    return;
  }

  if (!client.available()) {
    connected = false;
    Serial.printf("Lost connection to server.\n");
    return;
  }

  if (client.poll()) {
    Serial.printf("M ");
  }
}

The server sending the messages is written in TypeScript (I don't know TypeScript; it was written by a co-worker):

JavaScript:
import { ServerWebSocket } from 'bun'

const frequency = parseInt(process.argv[2])
const characterCount = parseInt(process.argv[3])

if (isNaN(frequency) || isNaN(characterCount)) {
  console.log('Usage: bun teensyStress.ts <frequency>ms <characterCount>')
  process.exit(1)
}

let wes: ServerWebSocket

Bun.serve({
  fetch(req, server) {
    const success = server.upgrade(req)
    if (success) {
      return undefined
    }
    return new Response('Teensy Stress Test API')
  },
  websocket: {
    async message(ws, message) {
      let msg = JSON.parse(message as string)
      console.log(msg)
      if (msg.connect) {
        wes = ws
        console.log('Connected to Teensy!')
      }
    },
  },
  port: 9000,
})

setInterval(() => {
  if (!wes) return

  const characters =
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

  let message = ''
  for (let i = 0; i < characterCount; i++) {
    message += characters[Math.floor(Math.random() * characters.length)]
  }
  // console.log("Send message")
  wes.send(
    JSON.stringify({
      type: 'mock',
      message,
    })
  )
}, frequency)


When I run the server on my Mac, with the command:

bun teensyStress.ts 100 50

The above waits for a connection and then starts sending JSON packets containing a random 50-character text string every 100 ms.

For the small packets, the output of the Teensy looks like this:

Code:
Start websocket test
Ethernet connected (192.168.1.100)
Connecting to server.
Connected!
Sending ID
Got Message len is 78 ms is 8029
{"type":"mock","message":"j3JMfwvh5gFaDUxN88dZ2P3w63rqLaIWyzYtg2Kd2WrddcZUXk"}
M Got Message len is 78 ms is 99
{"type":"mock","message":"ZhdspgbsfIxXYlumNLZHnWlM7yyObMLTHARNGwZCx6Eg6ktp2W"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"VZAposqDgpv189iDqp4yzdkPlIOEzztfGUEO6nqtnkQ0qSWRh8"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"04ofBze49paexr93c8N8bk971U5NbPaLheudn9sv7NOYAri6kJ"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"mVzBdCgevJNq9H3W5qtM67GAPhHgsGFVd4ETS8Jmf94oqg6V9c"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"gfqCKqtZNe9oWLnRwauTWSj77wJolrkyzDoMzvvPknAoTNh6jQ"}
M Got Message len is 78 ms is 100
{"type":"mock","message":"DMpdjWYZw80ml55cJIEnvJAf0MXqpGpPBe2prbwscEOGnjo2nD"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"8td2Swdf24FERJL4YGH34SgX22KknzXs8do11lXP5ejhsdzYSE"}
M Got Message len is 78 ms is 101
{"type":"mock","message":"oNrLsH77EZ92bhUETW640ekifDKsgqRczdz5YKZwz2o2PsYU3T"}


Everything is fine until the message gets over around 1440 bytes in size. At that point the message passed to my handler is obviously corrupted, and the Teensy soon crashes.

Output for when I run the server with 2000-byte packets looks like this:

bun teensyStress.ts 100 2000
Code:
Start websocket test
Ethernet connected (192.168.1.100)
Connecting to server.
Connected!
Sending ID
Got Message len is 1523 ms is 8023
{"type":"mock","message":"XqtcWNLkiRwQxDKyIoVDHzEQl3sSZQpQP3lU9udFZoSRCtU0SpXmf7oFVmiTLfB2EqLHVWUjpzyFqzje1ywq2yP6jG5q1F8RDiAMNh2kRgJK1tL4tYPS4OUnko4vzL5P0J4jyg3LIoTsczKDXBn1KOJXLJQbC7pthGPX65GdlOvLmuVAp84jMkLeYiCcoXIepmeeb7OONsljDIGoTugFHstOutJ63ZMwZGMBqtTDZykYejqra8MsjRHHVTqowZzrNlCDSIpi8B4lYcTUlsCGiCG0LUfjaT9WrIJuL2as3pHYucEUGTx0kWMh0hNfS1hBjHrakU7rGrfkVXVqQx6yInA9Wjwcwr8qHLKuXcTGVtrYETZNUPj3TEY6SGNEcJypZnu5wt2SeL4mZZhr0vmXRyGoeUGwjWOrpsCHvAok6FRK1zPSEUXadvo6Csiv6tBmj1QoqWZH7GpcmZz78E5rNUiUIMwvBsQuLxFLhGyvpPSe5TJaiRfylVWbILrwnKStgxg7NU5CI6yTJuSXqGvH07lYcOvmJUUfRn9bnv4tjoBjIvYTHg2Q4qp01dJHzx3DeuH0iRYNebnpb9a1yhO3fwMlXzjtI6MvxaQmF7fQuXUTC6VFPrHQHo2sqFI4OcbnzROoeV13tpyylUEgIYtSkFQapuFVdOcPwJUNvX7oHDCZioBkyD5tFxlD6D9yh4dcf8UD63NX5wIuej4eV00FTUBH2ViO7o6rPKGIabQHWnHC68YGXIccCMIzoZodv1qdGS0M5tKtgRuKJG6ije9V8dky25W8VRduavkpVrMSAVlU8WRRslhtALuCD0o4lJt82TXf372F3TbdcGTSesgS7q8bvmWroFagYJA0E45C8eC7fIvRGocua3Q6LfPUcvBCKCpBJDkfbl5s3TuNyJrs2r2UqoSyRkbpxTvtRKbs533pov4gKew3Xta6gY98nzQiV7JpaJn7hQ7wZmVgnFLgfQFPOt1kSYRbtT5pumn1QG492UsDfYqPHu5t0ss8CUQwwqG98r7MbZo4gRbrj9BTpUrDAbd8KLF9ozqmrb5WDrQvHSGOnnOm2ZrOKQ0Xq0AVau13HE7L3ISukmr6zSi3doSJNaYHJjrRIKguAm0f52Kw43G0SsOcJ4wFwahNy6eKPcHfVtYfSwwplb5aE4Cd7uK1eom7LEHD1FBo2ISIYOm5HVf1LsooePm4BJobPJFhb7gGVEXyJW6QyX4npkTBIFPBS7vb0T4DUKPgIs5wgOENl6lxkHkZrwRG7o81AoQHnaysU1EAZtWzJNFSQ8w3TMMbwL1cbEj6I1YvXnO6DfCkbzpNmrjSoiuBVtpxOViSmbiVnPMM7y7gdWzM5G2hB6KrqCqfGPJWAHQsJib1ktAHgcNvhKa6pgtrJA1jNHqLyrRu00yYluLF2KueF9YnW9qD1gfXVZS2"}9d1w0Ef5olzRVvQAMAA5AWfxl3Yy"}}���SD����~␂�D0�␆)��,�m>�

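The Teensy reports 1523 bytes in the message instead of the roughly 2028 it should for the full JSON message (2000 random characters plus the 28 bytes of JSON wrapper, the same overhead that turns 50 characters into the 78-byte messages above). And it obviously gets garbled values at the end (scroll to the right to see it).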

I've tested with NativeEthernet as well as QNEthernet.

The version of ArduinoWebsockets I'm using is 0.5.4.

Anyone have a clue what could be causing this? Is it maybe a bug in ArduinoWebsockets that I need to report? Any ideas on how to narrow down the problem, or tests to try?

I'm compiling and downloading to the Teensy using vscode and PlatformIO on my MacBook, and I'm running the test over a USB Ethernet adapter connected directly to the Teensy, without an external router or switch. I'm using static IPs, so no DHCP is at play.
 
I spent Sunday chasing deeper into this and learned more.
  • The bug is (as I suspected) triggered when the WebSocket message is too large to fit in a single 1500-byte Ethernet packet and is split across two (or more) packets.
  • The WebSocket code receives valid data from the TCP layer for everything in the first Ethernet frame, but the data returned from reads of the socket is corrupted, starting with the data from the second packet.
  • After the TCP data becomes corrupted, the WebSocket layer reads garbage for the following WebSocket header, which includes a random invalid message length. The WebSocket layer does not recognize the corruption and just waits in a busy loop, trying to read too many characters for the next message.
  • Then the TCP layer gets even worse and goes into a mode where reads return -1, indicating there is no data to read, while client.available() still reports the connection as valid. So the TCP layer is telling the WebSocket layer that there is no data to read but that the socket is still open.
  • Using Wireshark I verified that the server is sending the correct data. Wireshark has no problem decoding the correct WebSocket data from the Ethernet packets. There are no packet protocol errors on the Ethernet.
  • The server stops sending because the TCP send window fills up, and the Teensy TCP layer never accepts data after the first message.
  • The Teensy does correctly ACK all the data in the first two packets (the full large WebSocket message). But it refuses to ACK any data after that. So the next message that follows is never confirmed as read by the TCP layer on the Teensy. I have not verified how much total corrupted data it reads before it hangs, but it seems to be more than what was actually ACKed from the TCP stream.
  • I was wrong about my comment saying I tested using QNEthernet. The WebSockets package includes NativeEthernet and does not support QNEthernet. So changing the include in my main code to QNEthernet was pointless; it was still using NativeEthernet. An hour of trying to convert WebSockets to QNEthernet yielded no results, just endless compile errors that I got tired of chasing down and trying to understand. So this is a bug related to Teensy4.1 with ArduinoWebsockets and NativeEthernet.
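
To make the header point above concrete, here is a minimal host-side sketch (plain C++, not Teensy or ArduinoWebsockets code) of decoding an RFC 6455 frame header for an unmasked server-to-client frame, with the kind of sanity check that would flag a garbage header instead of waiting on an impossible length. The names FrameHeader, ParseResult and parseHeader, and the kMaxPayload cap, are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decoded fields of one WebSocket frame header (hypothetical struct).
struct FrameHeader {
  bool fin = false;
  uint8_t opcode = 0;
  uint64_t payloadLen = 0;
  size_t headerLen = 0;  // bytes occupied by the header itself
};

enum class ParseResult { Ok, NeedMore, Invalid };

// Assumed cap for this sketch: whatever the application can actually buffer.
constexpr uint64_t kMaxPayload = 16 * 1024;

// Parse the header at the start of `buf` (`len` bytes available so far).
ParseResult parseHeader(const uint8_t* buf, size_t len, FrameHeader& out) {
  assert(buf != nullptr);
  if (len < 2) return ParseResult::NeedMore;
  // RSV1-3 must be zero when no extension was negotiated.
  if (buf[0] & 0x70) return ParseResult::Invalid;
  // Frames sent by a server must not be masked.
  if (buf[1] & 0x80) return ParseResult::Invalid;

  out.fin = (buf[0] & 0x80) != 0;
  out.opcode = buf[0] & 0x0F;
  const uint8_t len7 = buf[1] & 0x7F;

  if (len7 < 126) {            // length fits in the 7-bit field
    out.payloadLen = len7;
    out.headerLen = 2;
  } else if (len7 == 126) {    // 16-bit big-endian extended length
    if (len < 4) return ParseResult::NeedMore;
    out.payloadLen = (uint64_t(buf[2]) << 8) | buf[3];
    out.headerLen = 4;
  } else {                     // 64-bit big-endian extended length
    if (len < 10) return ParseResult::NeedMore;
    uint64_t v = 0;
    for (size_t i = 2; i < 10; ++i) v = (v << 8) | buf[i];
    out.payloadLen = v;
    out.headerLen = 10;
  }
  // Reject lengths the receiver could never buffer.
  return out.payloadLen > kMaxPayload ? ParseResult::Invalid : ParseResult::Ok;
}
```

With this, the header of a 2028-byte text frame (0x81 0x7E 0x07 0xEC) decodes to a 4-byte header and a 2028-byte payload, while two bytes of ASCII payload misread as a header (for example 'Y', 'y') are rejected because reserved bits are set.
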
This is obviously a memory corruption issue whose cause I have not found. I'm a seasoned (old as F) C systems programmer but my C++ foo is weak. Over the past days of extensive bug hunting I've been relearning, as fast as possible, what I forgot 15 years ago about C++ so I can understand all the ins and outs of its memory management. But so far, I have found no code doing anything wrong with buffers and objects that would explain the corruption. I've checked a lot of the basic code flow in ArduinoWebsockets and in the code our guys wrote (which I stripped out and replaced with the simple demo I shared above). But I've not yet dived into the TCP code.

I've added prints to show heap allocation, and there is no memory leak at play here for the heap (RAM2). Its use is stable.

There is no inherent reason to believe the TCP code can't correctly read and assemble two IP packets in a TCP stream. And there was nothing fancy or unusual in the ethernet packets I could spot with WireShark to imply something odd was at play that the TCP stack couldn't handle.
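
For reference, this is roughly what reading one message across two segments has to look like from the application side: a TCP read can return fewer bytes than were asked for, so the reader loops until it has them all. The sketch below is plain host-side C++ with a made-up FakeStream standing in for the Ethernet client. FakeStream, readExact, and the 1448/580 split are assumptions for illustration, not code from NativeEthernet or ArduinoWebsockets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <string>
#include <vector>

// Stand-in for a TCP client: data arrives in segments, and one read() never
// returns more than what is left of the segment at the front of the queue.
class FakeStream {
 public:
  void deliver(const std::string& segment) { segments_.push_back(segment); }

  // Mimics an Arduino-style read(buf, n): returns the number of bytes
  // copied, or -1 if nothing is available right now.
  int read(uint8_t* buf, size_t n) {
    if (segments_.empty()) return -1;
    std::string& front = segments_.front();
    const size_t take = std::min(n, front.size());
    std::memcpy(buf, front.data(), take);
    front.erase(0, take);
    if (front.empty()) segments_.pop_front();
    return static_cast<int>(take);
  }

 private:
  std::deque<std::string> segments_;
};

// Read exactly `want` bytes, tolerating short reads. Gives up after
// `maxIdlePolls` consecutive empty reads instead of spinning forever.
bool readExact(FakeStream& s, std::vector<uint8_t>& out, size_t want,
               int maxIdlePolls = 1000) {
  assert(maxIdlePolls > 0);
  out.assign(want, 0);
  size_t got = 0;
  int idle = 0;
  while (got < want) {
    const int n = s.read(out.data() + got, want - got);
    if (n <= 0) {                       // nothing available on this poll
      if (++idle >= maxIdlePolls) return false;
      continue;
    }
    idle = 0;
    got += static_cast<size_t>(n);      // append after what we already have
  }
  return true;
}
```

Delivering a 1448-byte segment followed by a 580-byte segment and then calling readExact(s, msg, 2028) yields all 2028 bytes in order. On an empty stream the call returns false after the idle limit rather than busy-looping, which is the failure mode described above.
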

The Teensy has 512K in RAM1 for stack space and 512K in RAM2 for heap space. Reading a 2K WebSocket message should not come anywhere near running the microcontroller out of memory.

My instincts are telling me this is some sort of bug related to the Ethernet driver being driven asynchronously from the main code loop. I don't even know how it does that yet. I don't know if it's done with interrupts, or with the yield() mechanism I've seen mentioned for parallel threads in Arduino, which I don't understand yet.

I do know, however, that the Teensy will continue to echo ICMP pings even when the WebSocket layer is hung in the endless loop, waiting for more data that the TCP layer will never deliver. So either interrupts are at play or something is calling yield(), I guess. So there seem to be concurrent threads at play that could be sharing some memory incorrectly, and that memory ends up getting corrupted.

The ArduinoWebsockets code seems to come from here:


But it looks like there's no active support from the developer who created it. And so far, I have found no alternative libraries for doing WebSockets on a Teensy4.1. But I guess I should open an issue there about this if no one here has any thoughts to share.

I'm still pretty new to how the whole microprocessor open source ecosystem works, but I'm trying to learn. It feels like I'm just going to have to hunt down and fix this bug myself. We shouldn't be using WebSockets in our embedded robotics application to start with, but that's a different issue.
 
The NativeEthernet library does not seem to be supported by its author any more.
QNEthernet has become the de facto library to use.
It is well maintained by its author, who is a frequent visitor to this forum.
 
Thanks, yeah, I noticed. At work, I have been testing QNEthernet, and we are planning on dumping WebSockets (sort of the wrong option to begin with) for our robot, and switching to a UDP protocol. But when I see bugs, it's hard for me not to trace them down and try to fix them. At least document the problems and the fixes so others can implement as well if need be.

But ArduinoWebsockets only runs with NativeEthernet, and I didn't notice any other WebSockets library that runs on QNEthernet. Is there one you know of that I failed to find? Not urgent, just curious.

I did spend about an hour trying to convert ArduinoWebsockets to use QNEthernet, but there were too many compile errors to deal with, so I just gave up.
 
The API should practically be the same. I made QNEthernet follow the Arduino-style Ethernet API. Can you show an example of some compile errors? Sounds like, however, you’ll be switching to UDP…

Check out the BroadcastChat example.
 
Actually, now that I've dug deeper into the code and understand more, I don't know why it was having such problems compiling. I don't remember the actual errors. Because ArduinoWebsockets is set up to support a wide range of hardware, it has a handful of different network socket abstraction layers, one for each lower-level networking stack it supports. The NativeEthernet "glue" code is, I believe, only one small set of methods (a few hundred lines at most), so only that code should care whether it's connected to NativeEthernet or QNEthernet.

But I guess it could be making deeper assumptions and reaching inside the implementations to get access to things below the published API, so if your internals don't match the other system internals, it could break everything. I guess.

Maybe I'll find time this weekend to try that again as a learning exercise, get a deeper understanding of it all, and see if I can connect ArduinoWebsockets to QNEthernet.

But here at work, now that I have fixed the bigger bugs, we will continue to use my patched version of ArduinoWebsockets and NativeEthernet as we test new robot hardware, and then likely switch to UDP when we have the time to do that protocol design and implementation work. We are a small start-up (about to be only 3 1/2 engineers), so our time is spread very thin across all the work required, most of which is hardware work, not software. I don't get to play with software as much as I'd like, which is why I spent a few weekends playing with it.

But if I'm able to connect ArduinoWebsockets to QNEthernet, we will stop using NativeEthernet altogether. I will share the answer if I get it working.
 
@shawn Yeah, I found some time yesterday to have a quick look at switching ArduinoWebsockets to use QNEthernet, and it wasn't hard at all. I know more now than I did when I tried it last time.

First, there are two files that include NativeEthernet.h that need to be changed; I was only changing one the first time. Second, I had not found and read the README documentation for QNEthernet, so I didn't know critical stuff like the fact that you have the classes in a different namespace. So the errors being generated were all about references to Ethernet, etc. being undefined, plus the errors that resulted from trying to mix NativeEthernet and QNEthernet in the same build. So most of my previous problem was just a lack of basic knowledge.

Once the namespace issue was dealt with, the other compatibility problem was the return value of server.available() being converted to a bool in the WebSockets code, which oddly worked by default with NativeEthernet but needed a static_cast<bool>() for QNEthernet. I don't understand that yet, but I will look deeper into why that change was needed. Another learning opportunity for me.

C++:
bool poll() override {
  yield();
#ifdef USE_QNE
  return static_cast<bool>(server.available());
#else
  return server.available();
#endif
}

I got it to compile under vscode once, and a quick test verified it was working, but vscode's automatic build process is confused because of my hacks so I have to do it better. Will probably fork Websockets to play more.

But yes, the changes needed to make it work under QNEthernet are no more than one would expect.
 
First point is that the following will also work, without the use of #ifdef's:
C++:
bool poll() override {
  yield();
  return static_cast<bool>(server.available());
}

EthernetServer::available() returns an instance of EthernetClient. EthernetClient defines a bool operator (see https://www.arduino.cc/reference/en/libraries/ethernet/if-ethernetclient/) that returns whether the client is connected. Note that the Arduino docs are ambiguous with regards to what this operator actually means. Now, EthernetClient::connected() returns whether a client is connected OR there's data available that hasn't yet been read. (https://www.arduino.cc/reference/en/libraries/ethernet/client.connected/). Since connected() provides that functionality, and if the bool operator is different, then it must mean "whether a client is connected".

I've tried to explain the differences, and how the QNEthernet library understands the Arduino Ethernet reference documentation, here:
A survey of how connections (aka `EthernetClient`) work

I'm not sure what the ArduinoWebsockets library means by poll(), but if it means "there's data available to be read", then the following is more correct:
C++:
bool poll() override {
  yield();
  return server.available().connected();
}

If, on the other hand, it just means "find a client that's connected, without there necessarily being data", then the first version, with the static_cast<bool>, is more correct.

I think, however, that since it's using EthernetServer::available(), meaning "give me a client with data available", then the second version is actually the more correct one. (With server.available().connected().)

Now, I dislike some implicit conversions because it makes the code less clear. This is the reason I chose to make the EthernetClient::operator bool() function explicit. This means that doing an implicit bool conversion won't work; it must be an explicit conversion. Hence the static_cast<bool>. The NativeEthernet library has the same bool operator function not declared explicit; this is why you don't need that static_cast<bool>. Note that it's not wrong to explicitly cast to a bool, which is why you don't actually need that #ifdef.
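
A minimal host-side sketch of that difference, using two made-up client types (ImplicitClient and ExplicitClient are stand-ins for illustration, not the real NativeEthernet or QNEthernet classes):

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for a client whose bool operator is NOT declared explicit.
struct ImplicitClient {
  bool up = false;
  operator bool() const { return up; }
};

// Stand-in for a client whose bool operator IS declared explicit.
struct ExplicitClient {
  bool up = false;
  explicit operator bool() const { return up; }
};

// `return client;` from a function returning bool is an implicit conversion:
// allowed for the first type, rejected by the compiler for the second.
static_assert(std::is_convertible<ImplicitClient, bool>::value,
              "implicit conversion compiles");
static_assert(!std::is_convertible<ExplicitClient, bool>::value,
              "implicit conversion is rejected");
// static_cast<bool>(client) is an explicit conversion, valid for both.
static_assert(std::is_constructible<bool, ImplicitClient>::value,
              "explicit conversion compiles");
static_assert(std::is_constructible<bool, ExplicitClient>::value,
              "explicit conversion compiles");

bool pollImplicit(const ImplicitClient& c) { return c; }
bool pollExplicit(const ExplicitClient& c) { return static_cast<bool>(c); }

// Conditions such as `if (c)` are contextual conversions, so they still work
// with an explicit operator bool.
bool inCondition(const ExplicitClient& c) {
  if (c) return true;
  return false;
}

// Quick self-check of the conversion paths.
void demo() {
  ExplicitClient c;
  c.up = true;
  assert(pollExplicit(c));
  assert(inCondition(c));
}
```

Changing pollExplicit to a plain `return c;` fails to compile, which is the same error the unmodified poll() produces against a class with an explicit bool operator.
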

See also: The safe bool problem

Additionally, I find that the Arduino Ethernet reference is ambiguous and ill-defined at places, so I've tried to be clear in my QNEthernet documentation how I'm interpreting the declared behaviour.
 
Thanks for all that Shawn!

I've been a C programmer for decades but have not used C++ much so the difference between the static cast and the implicit cast was lost on me. Thanks for explaining why one worked and the other didn't!

Looking into why the heck they have a poll() function, I find no answers. It's not used in their code except that WebSocketServer.poll() passes through to this TcpServer poll().

It's not used in any of their example code, and there's not so much as one comment anywhere explaining the intent.

However, on the client side, for their WebSocketClient class, they say to call client.poll() like this in user code:

C++:
void loop() {
    client.poll();
}


They claim it is required to keep getting more messages, but never explain, or even hint at, what the bool return value means and their example code only shows the above where the return is ignored.

All I can guess is that they made the server-side mirror the client side with a poll() function, but later figured out it was not needed but just left it in.

What's up with people defining an API and not including so much as one comment in the code explaining the intent of the function? It just seems absurd and sloppy AF to me. Thank you for not being like that with QNEthernet!

So clearly, an undefined function that's not used or documented isn't very important when trying to figure out what is right.
 