Is it possible to detect if a serial write is blocking execution?

Talkiet

Well-known member
I'm trying to track down a horrible situation where when I write 32 bytes out Serial5 (115200) to a TTL<->RS485 module which then sends it to a LORA DTU - my main code halts SOMEWHERE in a reasonably complex main loop.

Main loop typically has a per second worst case execution time of 400-20000us, and achieves 700k loops/second... When there's a write to LORA though, I see a worst case execution of 400milliseconds.

I have put simple Serial.print(micros()) at the start and end of the serial write routine and it shows that the routine completes and exits in 6-7ms. I also changed the serial5 speed to 19200 and it increased the halt to around 700ms.

I've checked the serial output buffer before and after the write (127 bytes, 99 bytes) but I can't isolate what's causing the slowdown.

My code is huge and uses various interrupts, intervaltimers and multiple simultaneous serial ports (Nextion screen, GPS, RS485 adaptor, thermal printer) so isolating this will be a pain. I am hoping that perhaps I am coming across a well known quirk which is immediately obvious to more experienced programmers than I.

Specifically, is there any way I can determine if the serial (specifically serial5) port is in fact what is blocking execution? The fact that I can reliably change the length of time blocking by adjusting the data rate from 19200 to 115200 indicates that it's likely to be that. Also, when I leave the rest of the code exactly as is, and JUST comment out literally the lines as follows...

Code:
  if (loraaddr == 1) {
    DEBUG_CS.print("Sending Start test request to LORA at ");
    DEBUG_CS.print(local);
    DEBUG_CS.print(" seconds : ");
    DEBUG_CS.print(sizeof(LoraData.LoraLine));
    DEBUG_CS.print("bytes - CRC : 0x");
    DEBUG_CS.println(LoraData.timepacketdata.crc, HEX);
    //rs485.write(start_loraADDH);
    //rs485.write(start_loraADDL);
    //rs485.write(start_loraCHANNEL);
    //rs485.write("AB");
    //rs485.write(LoraData.LoraLine, timepacketsize);
    rs485lastuse = millis();
  }

The code blocking disappears.

I _know_ the right answer is to recreate this in a simple scenario but I fear that this would absolutely not happen without the rest of the code running and I'd rather attempt to identify what's causing the blocking to start with. (There's about 6000 lines in total)

Any ideas?

Cheers - Neil G
 
This is a wild guess because we can't see your code, don't know your platform, but are you using an RS485 library? If so, it may have blocking delays for RS485 bus control. If the RS485 is just a point-to-point link, and there are no control signals to the TTL/485 converter, then you can just write to Serial5, which should be non-blocking.
 
Fair point, and I KNOW I am leaving a lot undocumented given the size of the program, but I am not using an RS485 library, I'm just writing out to serial5. Also, I am using a Teensy 4.1 - Sorry, that was originally in the post. (But I helpfully edited that out at some point before posting!)

Code:
#define rs485 Serial5
..
  rs485.begin(115200);
  //  Following changes added 128 bytes to buffer, but it was already large enough (tested at 127);
  //  static char transmitbuf_rs485[128];
  //  rs485.addMemoryForWrite(transmitbuf_rs485, 128);

I added this before and after the writes, and as above, it showed me that the buffer only drops from 127 before to either 99 or 98 after.

Code:
Serial5.availableForWrite()

With the additional addMemoryForWrite above, that changed (as expected) to 255 and 255-28ish.

The actual rs485.write() commands are clearly non blocking - a before and after test shows the program execution time in those 5 lines is only about 8 MICROseconds.

Cheers - N
 
The actual rs485.write() commands are clearly non blocking - a before and after test shows the program execution time in those 5 lines is only about 8 MICROseconds

Okay, now I'm confused. You say in your OP that your problem goes away if you comment out the calls to rs485.write(), but now you say they take only 5 us. So, what's the problem?
 
Okay, now I'm confused. You say in your OP that your problem goes away if you comment out the calls to rs485.write(), but now you say they take only 5 us. So, what's the problem?

The commands themselves take only a few microseconds to execute, but measuring the entire main loop execution time shows that when they run within that one second interval (triggered by an intervaltimer that is disciplining the clock from the GPS PPS), the max_loop_execution_time for that second shows many hundreds of milliseconds.

When I comment out the Serial5/RS485 writes, the longest main loop execution time for any second is about 30ms, and typically the worst is about 4ms.

It looks like the code is happily putting all the bytes into the transmit buffer and carrying on without delay, (as the Serial5.availableForWrite() shows immediately after the last write), but then something to do with that buffer being serviced out to the RS485 chip is badly hurting and somehow causing one main loop iteration to take hundreds of milliseconds (sometimes around half a second).

Things I am suspicious of:
Flow control on the RS485 module, interrupt priority issues between the serial ports and my intervaltimer...

For the moment I have managed to schedule all the RS485 comms to be buffered in software and have scheduled the transmission to take place very early in the second - I still get the big spikes but as the execution isn't halted over a second rollover, it's not actually breaking the code anymore.

I don't like it, but it's "working" *1

Cheers - N

*1 - for small values of "working"

Cheers - N
 
Sounds like your sketch is fairly complex, so it's hard to say, but perhaps it has something to do with how you are using the other hardware serial functions after your calls to write(). For example, after your calls to write(), are you waiting for a response and calling available/peek/read? Those functions all disable interrupts briefly.
 
I calculate that 32 bytes at 115k baud would take 2.7 ms even if there were no buffering at all for the transmit bytes, nowhere near 400 ms.
And 32 bytes at 19.2k baud would take 16.6 ms, nowhere near the increase of 300 ms ( 400 ms to 700 ms ).

So I would look elsewhere such as did you set up flow control and it is the LORA that is slowing things down? Or is the LoraData.LoraLine buffer properly terminated such that you actually send only 32 bytes? Or is LoraLine string data and I would ask myself if Teensy has a string garbage collector that runs( I don't know the answer )?
 
Just a completely blind guess... maybe you're using Arduino's RS485 library? And maybe it is using a delay create the DE signal to your RS485 chip? (another blind guess about your RS485 hardware, that it's the normal type where you connect the RE' and DE pins together for half duplex mode... yet another guess that you're doing half duplex...)

Teensy's hardware serial has a built in feature to very efficiently control the DE pin on common RS485 hardware. It's Serial1.transmitterEnable(pin). Documentation is here:

https://www.pjrc.com/teensy/td_uart.html

Obviously Arduino's library won't make use of this, since it's a Teensy-specific extension that none of Arduino's products have.

If that admittedly long chain of blind guessing turns out to be applicable to your situation, maybe with some fiddling transmitterEnable() could help?

And if that guesswork was somehow off the mark, perhaps if you tell us more about what you're really doing, and even better show a small sample program which would let any of us reproduce the problem, we could probably help much more than advise based on so much guessing.
 
I really appreciate the educated guessing here. I am painfully aware of how hard it is to troubleshoot something with a very small subset of the information...

485.jpg

That's the RS485 module I am using - I am not using any specific library - it's just TTL out from the Teensy 4.1 Serial5 into the UART on that module.

I had seen the DE stuff, but I am not sure that applies here given the hardware I am using - The description for the module is just "TTL Turn To RS485 Module Hardware Automatic Flow Control Module Serial UART Level Mutual Conversion Power Supply Module 3.3V 5V".

The LORA DTU attached to that is one of these ( https://www.ebyte.com/en/product-view-news.html?id=534 )
lora.jpg

Given there's no hardware flow control mechanism connected to the Teensy, we're left with SW flow control as the only option, but it appears that the bytes exit the teensy pretty quick. I guess I could throw my scope on either the UART or RS485 side of the RS485 converter and confirm/count the bytes and time taken (and whether there is any contra traffic flow which may indicate SW flow control trying to do something)

The relevant settings for the DTU are that it's configured for 64 byte packets (as my 28 byte payload exceeds 32 once I add my header, the address and channel bytes and the RSSI byte that the unit adds on transmission), 19200 RS485 interface speed, and 1.2kbps air interface speed.

Now, since yesterday I did make some bigger changes and have dramatically improved things, but still at least an order of magnitude at least off what it should be and I am frustrated that I don't know where the code is hanging up for so long. I really wish it was possible to see where the code is hanging... So far my smattering of serial.println(micros)) haven't shown up anything useful :-(

I changed the teensy serial5 from 19200 to 115200 and I upped the air data rate to 2.4kbps. This shouldn't have made much difference but it did. The delays went from about 400-600ms down to 60-110ms. I'm glad for the change, but that factor of change isn't even close to aligning with any of the adjustments I made.

When I calculate the theoretical air transmission time of the full 64 byte packet (padded by the DTU - my input is properly terminated), that's 1.2kbps for 64 bytes = ~512ms and all of a sudden I'm in the right ballpark for what I was seeing, of course I hadn't considered this would be the case given that not only does the Teensy have a serial buffer, but the Ebyte DTU also has a documented 1000 byte buffer...

When I doubled the air rate to 2.4kbps though, I don't see the times fall to ~256ms, I see (as above) 60-110ms. Of course I did change the UART speed from 19200 to 115200 at the same time - but again, buffers should make that irrelevant as far as code execution times are involved!

I really thought with the Teensy serial buffer and the DTU buffer that this wouldn't be related, but it appears that it is so I am out of ideas.

With the improvement to 60-110ms, and scheduling the transmissions to happen shortly after the second rollover, the code is now "working" (*1) but I wish I understood why this is happening.

Cheers - N

*1 for medium values of "working" now...
 
What I will typically do in cases like this is, to instrument the code potentially with Serial.prints... But often times with digitalWriteFast/digitalToggleFast of some unused pins.
(with pinMode(OUTPUT... in setup)

And then I will typically hook up a logic analyzer to see when things are happening.
Typical places in cases like this, that I might bracket, or simply toggle are:
void loop() { digitaltoggleFast(debug_pin1); ...
likewise if I have some other main functionality loops like Interval Timer. In this case I might bracket start and end.

If I have a function, that does the main outputs to this port, maybe one around that function, or the portion that actually does the outputs...

I might also go into the core code and for example see if any of your code is calling rs485.flush();
As this will wait until all of the software and hardware queues are completely empty.

If you don't have Logic analyzer or scope, you could potentially still use the output to help localize, it is just maybe a little harder, like:
wire up LEDS to these pins, and so you can try to visually see when things are happening. You could wire it up to another processor, that maybe monitors those pins,
and prints out a rough idea of what things are taking and then to maybe adjust your locations of write/toggle statements...
 
Thanks Kurt.. I have been using a bunch of sprinkled print statements, in fact, this function has been useful for basic tracing of how long between lines being executed... I remember being quite happy to find '__LINE__' :)

Code:
whereami(__LINE__);

..

void whereami(int linenumber) {
  DEBUG_CS.print(linenumber);
  DEBUG_CS.print(": ");
  DEBUG_CS.println(micros() - linetimestamp);
  linetimestamp = micros();
}

But so far the hangup has been elusive - it's NOT happening in the functions that are doing the serial writes.

I hadn't thought of using some digital pins to indicate things as well... That's a good idea. I still am wondering about the best way to identify that program execution has paused... I guess I could have an interrupt routine log __LINE__ and timestamps to a file. (Yes I am aware there's a lot happening there, but I am looking for dozens of ms at least - and even a huge gap in the lines being executed could help.

I also just chucked the Serial5 output and the RS485 input onto a scope and they are both showing expected results... (Spot on the expected 2.9ish millisec to send the 32-33 byte payload across to the LORA DTE)

LORAserial.jpg

I think I am just going to have to wade through inserting manual logging points in the code and just try and find where it hangs. I guess I could be thoroughly evil and awk a micros() comparison into the the code every 5-10 lines and see what falls out the other side :)

Cheers - N
 
On another note, I wish I hadn't aliased Serial5 to rs485 - it's caused so much confusion here. I am not using and rs485 libraries. I'm just outputting data to Serial5 on the T4.1 to a UART-RS485 adaptor which goes into an ebyte LORA DTU over 485.

Cheers - N
 
Well, having peppered the code with a bunch of calls that measure time spent in each function, and having noticed that the issue doesn't seem to occur on a remote unit without a display connected - I think I have found the culprit.

There's an interaction with the traffic I am sending over serial5 to the external LORA box and the DAMN NEXTION DISPLAY connected via serial3. The remote unit has no display and doesn't exhibit this issue... When I schedule all the LORA (serial5) comms to NOT occur over a second boundary the issue doesn't happen. (The failure mode is that the time discipline code that has to run each second was getting missed and was messing up the calculated period that I used)

And when I disabled the Nextion on the main unit and let the Serial5 sending happen over a second boundary, there's no issue either!

Then I remembered that I had seen this before, a long time ago and it's one of the reasons I hate Nextion screens so much - the serial calls to update them ARE blocking... This hasn't been a big deal in the past as anything that absolutely had to happen (timestamping start and finish pulses, timestamping the PPS input etc) was interrupt based and just worked... BUT the time discipline code is in the main loop and for some reason if Serial5 and Serial3 overlap (I haven't figured out precisely how badly or how much overlap is needed), then the Nextion calls seem to stall for orders of magnitude longer than they should take in total. And these are blocking, so the program just sits there and waits.

I don't really want to rewrite the nextion handler, but at least knowing where this is happening has validated my workaround - I'll adjust the Serial5 (Lora) and serial3 (Nextion) schedules to be mutually exclusive so they shouldn't hit each other.

Uuuggggly.

Thanks for the suggestions - ultimately it was the couple of notes about blocking serial that triggered me to look at the screen - but I'm still confused how serial5 is affecting serial3.

Thanks heaps for the patience especially given I wasn't able to supply a distilled proof of problem!.

Cheers - N
 
I don't know anything about your Nextion handler, but I have written a Library for the Nextion. It is intended as a foundation for a user's Nextion Comms.
It contains a lot of functionality and will have all that most people will require of Nextion comms.
You can find it here.
I am not saying that you should use it, although you are quite welcome to do so, just that it might give you some ideas for your Nextion Comms.
 
Oh, now that makes a lot of sense. I recall looking at the Nextion library code about 1 year ago. It was very simple, just send the command and then busy loop waiting for the display to confirm.
 
It was very simple, just send the command and then busy loop waiting for the display to confirm.
Yes, but you don't have to wait for any reply.
Simply flag that a confirmation/reply is expected and pick up that confirmation when appropriate.
The confirmation will be waiting in the serial input buffer.
 
I think there is something like that in the RA8875/76 code bases. Where it will only wait if it is trying to send the next command and the appropriate response from the previous command had not been received.

I am glad you have localized it and have a handle one it.

Good luck
 
Thanks all - I may end up looking at the Nextion handler and seeing if there's a more appropriate way that just hanging around and waiting for the reply... It seems pretty sub-optimal as is.

Cheers - N
 
Back
Top