UDP client stops receiving packets

After a reset I get the null report and then:
11:11:34.540 -> Breadcrumb #1 was 3396345872 (0xCA702010)
11:11:34.540 -> Breadcrumb #6 was 2461095967 (0x92B1581F)
 
I tried with a queue length of 64 but it still crashes

Once again I get a stream of complete 12345 sequences (obviously the first one is 345)
 
Which version are you using? If you haven’t tried the latest from GitHub (0.19.0-snapshot), could you try that? There were a couple bugs I fixed since v0.18.0.
 
I prepared another branch that uses the latest lwIP. Would you mind also trying the `lwip-master` branch? (In addition to `master`.) I want to see if maybe there've been some lwIP fixes since the 2.1.3 release that help.
 
Thanks Shawn

I installed the 0.19.0-snapshot (replacing the 0.18 version)

So far it has not failed and I am up to 1 million packets

Do you still want me to try the latest IwIP branch?
 
That’s great news! Sure, if you have the time to test the other branch, that would be great. Now that I know that the latest v0.19.0-snapshot works much better, I would love to build confidence that the latest lwIP works well too.
 
I prepared another branch that uses the latest lwIP. Would you mind also trying the `lwip-master` branch? (In addition to `master`.) I want to see if maybe there've been some lwIP fixes since the 2.1.3 release that help.


This part of code in QNEthernetUDP.cpp might be the bug:
Code:
  // Increment the size
  if (udp->inBufSize_ != 0 && udp->inBufTail_ == udp->inBufHead_) {
    // Full
    udp->inBufTail_ = (udp->inBufTail_ + 1) % udp->inBuf_.size();
  } else {
    udp->inBufSize_++;
  }
  udp->inBufHead_ = (udp->inBufHead_ + 1) % udp->inBuf_.size();

I think it says that if your buffer is full then discard the oldest element in the FIFO queue.
But what if someone was just working on (parsing) that oldest buffer element when someone else decided it's no longer relevant and can (soon/now) be overwritten?
Should there not be a semaphore that mutually excludes simultanuous access to that oldest element?
Or would it not be wise to stop filling the head when buffer is full, i.e. discard the newest message instead of the oldest?
 
I just realized there’s a spelling misunderstanding. :)
“lwIP” stands for “Lightweight IP”. “LWIP” but with the first two letters in small case. Silly non-serif fonts…
 
@sicco I've looked at this some more and I don't think the lack of synchronization here is a problem. The library is not asynchronous, so a lock isn't needed. All the input processing happens from `Ethernet.loop()`, and that's called via `yield()`, which happens in a few places such as after each main `loop()` iteration and inside `delay()`. When Ethernet input arrives, a flag is set in the ISR and that's how `Ethernet.loop()` picks it up. In short, `EthernetUDP::recvFunc()` shouldn't be called concurrently with `EthernetUDP::parsePacket()`. In fact, none of the library functions are called concurrently with each other. That is, unless someone tries to call them asynchronously with a threading or some other asynchronous library, in which case I don't guarantee proper functioning of my library.

Another fun fact: all internal lwIP critical functions assert that they're not running from an interrupt context. See `sys_check_core_locking()` in sys_arch.cpp and `LWIP_ASSERT_CORE_LOCKED()` in lwipopts.h.
 
Hi Shawn

Thanks for explaining the spelling :)

I tried the lwIP-master branch, however Ethernet.waitForLocalIP(kDHCPTimeout) was failing. I put back the 0.19.0-snapshot version and it worked immediately

Let me know if I need to do something different with the lwIP version, such as set a static IP address?
 
In the meantime I am leaving it on a soaktest with 0.19.0-snapshot but with the queue length back to 1 (previous few tests I still had it set to 64)
 
I tried the `lwip-master` branch myself and I think I know why you're not getting an IP address. Try changing the `kDHCPTimeout` value to 15 or 20 seconds. 10 seconds may not be long enough. It was a little slower with mine, too. The latest lwIP has a new ACD (address conflict detection) feature for DHCP, so that might be taking longer.
 
@sicco I've looked at this some more and I don't think the lack of synchronization here is a problem. The library is not asynchronous, so a lock isn't needed. All the input processing happens from `Ethernet.loop()`, and that's called via `yield()`, which happens in a few places such as after each main `loop()` iteration and inside `delay()`. When Ethernet input arrives, a flag is set in the ISR and that's how `Ethernet.loop()` picks it up. In short, `EthernetUDP::recvFunc()` shouldn't be called concurrently with `EthernetUDP::parsePacket()`. In fact, none of the library functions are called concurrently with each other. That is, unless someone tries to call them asynchronously with a threading or some other asynchronous library, in which case I don't guarantee proper functioning of my library.

Another fun fact: all internal lwIP critical functions assert that they're not running from an interrupt context. See `sys_check_core_locking()` in sys_arch.cpp and `LWIP_ASSERT_CORE_LOCKED()` in lwipopts.h.

Can the Ethernet hardware not overwrite data in the RAM buffer locations anytime it wants to do so? It's doing that via DMA I think. That act will not yield will it?

I still think the FIFO buffer head 'write' that you wrote shall not be allowed to touch the tail pointer/index. If the FIFO buffers are all full/in use, then so be it, but don't let anyone write into them, anywhere. If the FIFO buffers are all in use then bin your incoming Ethernet data so that those packets will be ignored / lost. That means also forbid the Ethernet DMA to write into the buffers!
 
Hi Shawn. Increasing the timeout worked, thanks

Have left a soaktest running with lwIP-master version, so far so good

The previous 0.19.0-snapshot test got to 384 million packets, so I think we can call that "working" :)
 
@sicco First, thanks for engaging. Concurrency is hard to get right and I appreciate the discussion. I'm open to be shown new details or flaws in my thinking I haven't thought of. Here's my view of the system; please pick it apart because maybe there's something I'm missing. I'll describe my train of thought in a "think out loud" manner.

There's three layers of buffering. The first is where DMA interacts with the `rx_ring` buffers declared in lwip_t41.c. (Note that I'll be referencing IMXRT1060 Manual, Rev 3.) The Ethernet hardware is supplied with some external memory that's expected to contain instances of "enhanced buffer descriptors" (as opposed to "legacy buffer descriptors" because I'm enabling interrupts and also IEEE1588 support). (41.3.14.1 Enhanced receive buffer descriptor, page 2124.)

Initially, all buffers (there's five of them in the code, see `RX_SIZE`) have the "Empty" flag set. When a packet comes in, the hardware fills one or more of these buffers and marks the "Empty" flag(s) as false, and then an interrupt is generated. In `enet_isr()`, a flag is set indicating there's data. When next `Ethernet.loop()` is called, that flag is checked, and if there's data, each non-empty buffer is first copied into an lwIP pbuf (see `enet_rx_next()`; it calls `t41_low_level_input()` to translate the buffer into a pbuf), and then its "Empty" flag is set. pbufs live in one of the lwIP pools.

The following detail is important: if my assumption below is true, then new packets can't be stored if there's no "Empty" RX buffer, and I set the "Empty" flag only after I've copied the data into a pbuf. As well, that data is copied ultimately from the main `loop()` and not from an ISR.

An assumption I'm making: When there's no RX buffers that have the "Empty" flag set then the incoming frame is dropped. If anyone knows for sure this is the case, could you please reply here with how you know if this is so? I haven't found this detail in the docs.

So far we have "new packets get dropped if full" behaviour, but I'm not sure I can control that.

The second buffering layer is those pbufs, one pbuf per RX buffer. This is where the raw incoming DMA-transferred data gets processed. If there's no more pbuf space, then the "Empty" flag for the RX buffer is set anyway, clearing room for new packets to arrive. This is "old packets get dropped if full" behaviour. I suppose I could change this so that the RX buffer, if there's no space in a pbuf for it, does not have its "Empty" set; that would change the behaviour to "new packets get dropped if full."

For UDP, the third buffering layer is that circular buffer. The current behaviour is to "drop old packets if full." This is also done ultimately from the main `loop()` function and not from an ISR.

So, In summary, the UDP queueing buffers packets from the pbufs and not the raw RX buffers written to by the hardware.

One of the reasons I've chosen "old packets get dropped if full" behaviour is because I believe it works better with "realer"-time streaming applications, where congestion won't result in starvation. It was primarily a gut choice.

I'd love your opinion (or anyone who has experience with this, for that matter) on these questions:
1. Should I provide a way to change the RX buffer->pbuf transfer behaviour to "new packets get dropped if full"? I'd probably do it with a define or something.
2. Should I provide a way for `EthernetUDP` queueing behaviour to be changed to "new packets get dropped if full"? I'd either use a define or a `setIgnoreNewIfFull(flag)` function or something.

If the answer to either of these questions is "yes", then why?

What do you think? Does this clarify why I don't think there needs to be locking in `EthernetUDP`?
 
Hi Shawn. The test with the lpIP-master variant was workign fine until I stopped the test at 54 million packets received

Is it right that the library manager should think that library variant is also called Version 0.19.0-snapshot?

The download from GitHub has the name "QNEthernet-lwip-master.zip" but after loading it, it is listed as "Version 0.19.0-snapshot"
 
Yep, that's fine. It merely looks at what's in the library's library.properties file, and `lwip-master` is simply based on top of the current `master`, which is also currently at 0.19.0-snapshot. I didn't change that value.

Thank you for testing that.
 
Last edited:
Back
Top