New lwIP-based Ethernet library for Teensy 4.1

The polling function is set up internally via EventResponder to run inside yield(); no timers are used.

Shawn, thanks for continuing to provide these examples. Per the quote above from your original post, Ethernet.begin() calls startLoopInYield(), which I see results in yield() calling Ethernet.loop(). I sometimes use a cooperative RTOS that overrides yield(), so in that case I'll call Ethernet.loop() explicitly from a cooperative task. Does that sound right?
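
Roughly what I have in mind (the task/scheduler calls here are placeholders for my RTOS's API, shown only for illustration):
Code:
#include <QNEthernet.h>
using namespace qindesign::network;

extern void taskSleep(uint32_t ms);  // placeholder for the RTOS's own yield/sleep call

// Sketch only: the point is simply that Ethernet.loop() gets called
// regularly once the overridden yield() no longer does it.
void ethernetTask() {
  while (true) {
    Ethernet.loop();  // drive lwIP: process received frames and run timeouts
    taskSleep(1);     // hand control back to the cooperative scheduler
  }
}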
 
I can preliminarily confirm that the QNEthernet library runs stably in my application, which sends sensor data to Azure Storage Tables via HTTPS requests.
https://github.com/RoSchmi/AzureDataSender_Teensy_QnEthernet
It has been running for about 14 hours now, with uploads every 10 seconds, without any issues.
NativeEthernet with FNET proved to run stably with my application as well.
The only thing that remains annoying on Teensy for my application, and requires a patch of the Stream.h file (when using khoih-prog/EthernetWebServer), is that Stream.timedRead() should be protected instead of private.
(https://github.com/PaulStoffregen/cores/issues/531)
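The patch itself is tiny; roughly like this (against Stream.h in PaulStoffregen/cores; the exact surrounding lines may vary):
Code:
// Stream.h: move timedRead() out of the private section so classes
// derived from Stream (as EthernetWebServer needs) can call it.
protected:            // was: private:
    int timedRead();  // read one byte, honoring the stream timeout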
 
I'm skeptical of the stability of the current NativeEthernet, just because of some of the choices I've made and other issues that have cropped up. Currently I'm working on a rewrite of the whole thing to hopefully make it more stable and finally add non-blocking functions. I've skimmed a bit of QNEthernet, and it inspired me to learn and implement some of the C++ features I saw being used. Some of the hodgepodge I threw together can be made so much simpler, and likely safer, by using those features; it's got my head reeling with ideas already. Most people who are used to the Arduino-style library likely won't notice a difference between the two stacks/libraries, but I myself prefer FNET at this point because I'm in too deep already.
 
I just released v0.5.0. The changes:

  • Added link-status and address-changed callbacks.
  • New `EthernetServer::end()` function to stop listening.
  • New `Ethernet.linkSpeed()` function. Returns the link speed in Mbps.
  • Fixed behaviour when using some of the functions in loops so that there's always "data movement": more internal calls to `yield()`.
  • Fixes to listening server management.

https://github.com/ssilverman/QNEthernet/releases/tag/v0.5.0

I'm working on some examples.
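
In the meantime, a quick usage sketch of a couple of the new pieces (DHCP shown; adjust for a static config):
Code:
#include <QNEthernet.h>
using namespace qindesign::network;

EthernetServer server{80};

void setup() {
  Serial.begin(115200);
  Ethernet.begin();  // DHCP

  // New in v0.5.0: link speed in Mbps (meaningful once the link is up)
  Serial.printf("Link speed: %d Mbps\r\n", Ethernet.linkSpeed());

  server.begin();
}

void loop() {
  // ...serve connections...
  // New in v0.5.0: stop listening when done:
  // server.end();
}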
 
Wow, thank you! I was about to post in desperation because Udp.parsePacket() in the NativeEthernet lib takes forever to compute. I thought I'd take one more look through the forums before doing that, and I'm glad I did.

If I fast-toggle a pin with just Udp.parsePacket() in the loop, I get a 1.4 MHz toggle rate with NativeEthernet. I just swapped the libraries and see a 10x improvement.
 
Apologies, I spoke too soon. I realized that I was putting the code in its own while(1) loop to cut everything else out, so it wasn't really parsing anything, since your library relies on yield(). So I am back to being desperate! What are the limitations and bottlenecks here? Can Udp.parsePacket() execute any faster? Below is sample code for reference. The edge-to-edge time of the pin toggle is 276 ns. Maybe this is too much to ask, but I expected a bit more from a 100 Mbps interface and a 600 MHz clock.

Code:
#include <QNEthernet.h>
using namespace qindesign::network;

IPAddress ip(192, 168, 1, 177);
IPAddress subnet_mask(255, 255, 255, 0);
IPAddress gateway(192, 168, 1, 1);

unsigned int localPort = 6454;  

#define UDP_TX_PACKET_MAX_SIZE 1480  // MHA ADDED
#define PIN_0 38

unsigned char packetBuffer[UDP_TX_PACKET_MAX_SIZE];  // buffer to hold incoming packet,

EthernetUDP Udp;

void setup() {
  pinMode(PIN_0, OUTPUT);
  
  Ethernet.begin(ip, subnet_mask, gateway);

  Udp.begin(localPort);
}

void loop() {
  while (1) {
    digitalToggleFast(PIN_0);
    int packetSize = Udp.parsePacket();
    if (packetSize) {
       Udp.read(packetBuffer, UDP_TX_PACKET_MAX_SIZE);
    }
    yield();
  }
}
 
Sorry for spamming the thread, but replacing yield(); with Ethernet.loop(); made a major improvement: edge-to-edge is now 62 ns. I can live with that. Thank you!
 
mamdos, there are a few things happening here:

  1. `loop()` is already called in a loop, so there's no real need for another loop inside it.
  2. After every call to `loop()`, the system calls `yield()`, giving the stack a chance to move forward.
  3. I'm calling `yield()` in `EthernetClient` inside the I/O functions specifically to avoid problems like the one you're seeing, but I did not apply the same thing to `EthernetUDP`. I've just pushed a change that fixes this by calling `yield()` inside `parsePacket()`. You'll see the results of a call to `yield()`, however, and not simply a call to `Ethernet.loop()`. (But maybe I should use that instead of `yield()`.) In any case, try the code from the most recent push first and remove your own call to `yield()`.
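
Putting that together, your example boils down to something like this (a sketch, keeping your pin and buffer names):
Code:
void loop() {
  digitalToggleFast(PIN_0);

  int packetSize = Udp.parsePacket();  // with the latest push, this moves the stack along
  if (packetSize) {
    Udp.read(packetBuffer, UDP_TX_PACKET_MAX_SIZE);
  }
  // No inner while (1) and no explicit yield(): loop() returns, the system
  // calls yield() after each pass, and that drives the stack too.
}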

Update: If it turns out this really hits performance, I'll change all those `yield()` calls to `Ethernet.loop()`.
 
Update: Heck with it, I just replaced all the non-looping `yield()` calls with `EthernetClass::loop()`. This should speed things up a bit for `EthernetClient` and `EthernetUDP`.
 
I just replaced all the non-looping `yield()` calls with `EthernetClass::loop()`. This should speed things up a bit for `EthernetClient` and `EthernetUDP`.

Yes, please use EthernetClass::loop() rather than yield(), for cases where yield() will not call Ethernet::loop().
 
yield() will always call loop() (unless you’ve overridden yield()). It’s just that yield() has other overhead.
 
Ah, yes, indeed. That's certainly a case where yield() won't call loop(). :)
Thanks for pointing that out.

Thanks also to everyone who's been adding to the discussion. It's making the library better.
 
I just released v0.6.0. The changes:

  • Added a new "survey of how connections work" section to the README.
  • Added low-level link receive error stats collection.
  • Added a call to `loop()` in `EthernetUDP::parsePacket()`.
  • Added `EthernetLinkStatus` enum for compatibility with the Arduino API. Note that `EthernetClass::linkStatus()` is staying as a `bool`; comparison with the enumerators will work correctly. This change should make the library very close to a drop-in replacement. (A short example follows this list.)
  • Now sending a DHCPINFORM message to the network when using a manual IP configuration.
  • Changed all the internal "`yield()` to move the stack along" calls to `EthernetClass::loop()`, for speed.
  • Ethernet.end() no longer freezes, but after restarting Ethernet, DHCP can no longer get an IP address. This is still a TODO.
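
Regarding the `EthernetLinkStatus` item above, the intent is that Arduino-style code like this keeps working unchanged:
Code:
// linkStatus() itself still returns a bool, but comparing the result
// against the Arduino enumerator works as described above:
if (Ethernet.linkStatus() == LinkON) {
  // link is up
}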

https://github.com/ssilverman/QNEthernet/releases/tag/v0.6.0

Examples are still planned.
 
Tune lwIP. I could use some help with this. (@manitou, I know you've already done some tuning; I point to this in the README.)

It's been a while since I've messed with the Teensy 4 lwIP. I recommend NativeEthernet for the Arduino Ethernet API. I did run some of my Ethernet benchmarks on your QNEthernet, see following summary
Code:
TCP xmit (mbs)       65    client.flush()      
TCP recv (mbs)       80    TCP_WND  (6 * TCP_MSS)        

UDP xmit (mbs)       99    blast 20 1000-byte pkts
UDP xmit (pps)   157074    blast 1000 8-byte pkts
UDP recv (mbs)       96    no-loss recv of 20 1000-byte pkts
UDP RTT (us)        227    RTT latency of 8-byte pkts

ping RTT (us)       360
To compare with other T4 Ethernet libs, see the table at lwIP tests

I found that Udp.remoteIP() worked just fine, but remoteIP() for a TCP server returned 0? I was having mixed results with spinning on Udp.parsePacket(); further study required...

To improve T4 receive rate, I increased TCP_WND to 6*TCP_MSS in lwipopts.h. The TCP transmit rate was only 17 mbs. Examining the TCP packets with tcptrace (linux), I observed a 250 ms idle period after the SYN packet and before the first data packet.
[Attachment: t4seq1.jpg]

Code:
 00:00:00.937040 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [S], seq 6539,win 8760, options [mss 1460], length 0
 00:00:00.000042 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [S.], seq 1673819974, ack 6540, win 64240, options [mss 1460], length 0
 00:00:00.000254 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [.], ack 1, win8760, length 0
 00:00:00.250144 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 1:1461, ack 1, win 8760, length 1460
 00:00:00.000000 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 1461:2921, ack 1, win 8760, length 1460
 00:00:00.000052 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [.], ack 1461,win 62780, length 0
 00:00:00.000012 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [.], ack 2921,win 62780, length 0
 00:00:00.000239 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 2921:4381, ack 1, win 8760, length 1460
After the 250 ms pause, the data rate approached 8 Mbytes/sec.
[Attachment: t4qput1.jpg]

I had experienced a similar pause in earlier (2016) lwIP TCP transmit tests and solved that by adding tcp_output(pcb) in the packet send loop. For QNEthernet I added client.flush() to the client.write() loop and that eliminated the pause. (250 ms is the period of one of the lwIP TCP timers.)
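
In code form, the two changes are roughly this (the buffer and length names are just placeholders, not my actual test code):
Code:
// lwipopts.h change: enlarge the TCP receive window
// #define TCP_WND (6 * TCP_MSS)

#include <QNEthernet.h>
using namespace qindesign::network;

// Transmit loop: flush after each write so lwIP pushes the segment out
// immediately (the raw-API equivalent of tcp_output(pcb)) instead of
// waiting on its ~250 ms timer.
static void sendAll(EthernetClient &client, const uint8_t *buf, size_t len) {
  while (len > 0) {
    size_t n = client.write(buf, len);
    client.flush();
    buf += n;
    len -= n;
  }
}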
 
Thanks for these tests, @manitou.

It's been a while since I've messed with the Teensy 4 lwIP. I recommend NativeEthernet for the Arduino Ethernet API.

Hopefully my library migrates to your "recommended list" one day. :)

I have some questions and comments:

I'm assuming you used your fnet_perf.ino as a base. If that's true, what did you use on the "other end", e.g. for the TCP sink and source and for the UDP source? Do you have a link to the source for that? I'd love to make some tweaks and do my own testing (i.e. with the same codebase; I could write my own, but I'd rather use the same thing you used).
(Update: I think you're using TTCP per the file comments.)

I did run some of my Ethernet benchmarks on your QNEthernet, see following summary
Code:
TCP xmit (mbs)       65    client.flush()      
TCP recv (mbs)       80    TCP_WND  (6 * TCP_MSS)        

UDP xmit (mbs)       99    blast 20 1000-byte pkts
UDP xmit (pps)   157074    blast 1000 8-byte pkts
UDP recv (mbs)       96    no-loss recv of 20 1000-byte pkts
UDP RTT (us)        227    RTT latency of 8-byte pkts

ping RTT (us)       360
To compare with other T4 ethernet libs see the table at lwIP tests

It seems like TCP performance is respectable but could use some work, and UDP performance at least matches the other tests. I do the lwIP "timeouts" and link-status polling every 125 ms, and that UDP RTT looks close to the other results plus about 125 µs. Same with the ping; it looks similar to the others but with about a 250 µs increase. I'll probably experiment with the timing in the polling.

I found that the Udp.remoteIP() worked just fine, but the remoteIP() for a TCP server returned 0? I was having mixed results with spinning on Udp.parsePacket(), further study required...

I'll have a look at this. For the parsePacket() spinning, when you say “mixed results”, do you mean “success and not-success” or do you mean just different timings?

To improve T4 receive rate, I increased TCP_WND to 6*TCP_MSS in lwipopts.h. The TCP transmit rate was only 17 mbs. Examining the TCP packets with tcptrace (linux), I observed a 250 ms idle period after the SYN packet and before the first data packet.

Code:
 00:00:00.937040 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [S], seq 6539,win 8760, options [mss 1460], length 0
 00:00:00.000042 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [S.], seq 1673819974, ack 6540, win 64240, options [mss 1460], length 0
 00:00:00.000254 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [.], ack 1, win8760, length 0
 00:00:00.250144 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 1:1461, ack 1, win 8760, length 1460
 00:00:00.000000 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 1461:2921, ack 1, win 8760, length 1460
 00:00:00.000052 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [.], ack 1461,win 62780, length 0
 00:00:00.000012 IP 192.168.1.4.5001 > 192.168.1.19.55958: Flags [.], ack 2921,win 62780, length 0
 00:00:00.000239 IP 192.168.1.19.55958 > 192.168.1.4.5001: Flags [P.], seq 2921:4381, ack 1, win 8760, length 1460
After the 250 ms pause, the data rate approached 8 Mbytes/sec.

I had experienced a similar pause in earlier (2016) lwIP TCP transmit tests and solved that by adding tcp_output(pcb) in the packet send loop. For QNEthernet I added client.flush() to the client.write() loop and that eliminated the pause. (250 ms is the period of one of the lwIP TCP timers.)

I'm going to experiment with reducing the time between timeout/link polls to see how that affects things, but for that I'll need the code for the "other side". Maybe I'm not seeing the link in your lwIP tests?

Additionally, because I rely on yield() to do most of the stack polling (data and lwIP timeouts/link), with some of the I/O functions calling loop() directly internally, this could be slower or faster depending entirely on what the program using QNEthernet is doing. NativeEthernet uses a 1 ms timer. I'm trying to avoid using a timer, even if performance takes a small hit.
 
It seems to be difficult to find documentation on what to send on the control port for `nuttcp` to have it listen for data on port 5001. (Same comment for iperf.) @manitou, what commands are you using on the server side to measure throughput?
 
There are a few reasons:

1. I'd prefer not to run anything from an ISR context. Either Ethernet functions are called, in which case they need to run quickly, or a flag is set and then polled in some main loop. For the second, we're back to where we started, because that poll must happen somewhere, and I'd choose yield() again so the user doesn't have to worry about calling some loop() function. For the first, all sorts of hairy things can happen in an ISR context, plus I don't really know how long the Ethernet functions will take. I thought I'd just avoid the whole potential issue.

2. Things like the Arduino-style IntervalTimer library don't really work across libraries if you need to do anything other than simple things with them. For example, I use one in TeensyDMX (only because I have to) to avoid conflicts with other libraries, but I have to jump through hoops to implement one-shot timers (e.g. "stop the timer when it's notified") because there's no state passed to the callback. Sure, there are other libraries and timers one could use, but the minute you do something a little more complex, there are possible conflicts with other libraries using the stock API. The same goes for other timers. It's hard writing a library that needs to work with other libraries if there's no easy way to track resource usage.

Those are the main ones.
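
To make the IntervalTimer point concrete: because the callback receives no state, a one-shot ends up needing file-scope plumbing like this (a sketch, not the actual TeensyDMX code):
Code:
#include <IntervalTimer.h>

// The timer must live at file scope so the argument-less callback can
// find it and stop it ("stop the timer when it's notified").
static IntervalTimer oneShotTimer;

static void onOneShot() {
  oneShotTimer.end();  // make it one-shot
  // ...the actual work goes here...
}

void startOneShot(uint32_t microseconds) {
  oneShotTimer.begin(onOneShot, microseconds);
}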
 
Those are valid reasons. As for FNET, it has a fairly complex ISR management system, and its default implementation uses its own timer library to run from either a PIT or GPT timer. The only reason I use an IntervalTimer is for compatibility with other libraries, much like you did in TeensyDMX.
 
I very much like that QNEthernet doesn't use hardware timers or interrupts, and the frequency of execution of Ethernet::loop() can be controlled as necessary for a given application.

@shawn, perhaps the hard-coded "timeouts" could be made configurable (a default with override capability), at least for Ethernet::loop(), or an argument could be added so one can run with timeout=0 for benchmarking.
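
Something like an overridable compile-time default is what I have in mind (the macro name below is hypothetical, not an existing QNEthernet option):
Code:
// Hypothetical illustration only -- not an actual QNEthernet macro.
#ifndef QNETHERNET_LOOP_INTERVAL_MS
// 125 is the current hard-coded period; 0 would run the lwIP
// timeouts/link check on every loop() pass for benchmarking.
#define QNETHERNET_LOOP_INTERVAL_MS 125
#endif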
 
Here's some results I got myself.

Test setup:
  • MacBook Pro (13-inch, 2020)
  • macOS Big Sur v11.6
  • Belkin USB-C Ethernet adapter
  • Both the Teensy 4.1 and the laptop plugged into an eero 6 Pro as the hub
  • Testing qneth_perf.ino file attached. It's minimally modified from fnet_perf.ino (9905baf).
  • My own compiled version of nuttcp.c v8.2.2, modified to not change to client/server mode when receiving TCP (line 2691). Also attached.

Settings: TCP_WND = 4 * TCP_MSS
TCP receive: 94.5 Mbps (`./nuttcp -t -p 5001 my.teensy.IP`)
TCP send: 93 Mbps (`./nuttcp -r -p 5001`)
UDP send: 20 packets * 1000 bytes: 123.5 Mbps
UDP send: 1000 packets * 8 bytes: 9.5 Mbps, 149477 packets/s
Ping from laptop: ~1.2ms (I wonder why this is so much higher?)

Results from TCP_WND=6*TCP_MSS: TCP send and receive are about 1Mbps faster (depends on moon position, it might not even be that much faster).

I haven't yet written a laptop-side program to do the UDP echo, receive, and RTT tests.
If you're testing yourself, please make sure you have the latest version of QNEthernet from GitHub.
 

Attachments

  • qneth_perf.ino (8.8 KB)
  • nuttcp-8.2.2-mod.c (316.7 KB)
I realized after the post was no longer editable that the UDP send speed is a value > 100 Mbps. I'm sure there's a good reason for that…
 