Teensy 4.0 USB connection crashes, not recoverable with watchdog

gijpma

New member
I am running 4 linear actuators, a CO2 sensor and a solenoid valve from a Teensy 4.0. The Teensy is always connected to a PC, from which a C# program sends simple text commands to control the linear actuators and solenoid valve and to read the CO2 sensor. In a similar project I had issues with the Teensy crashing, likely caused by buffer overflow, so I programmed in a watchdog to reset the Teensy whenever it gets overwhelmed. The watchdog works when I intentionally send long nonsensical strings to the Teensy, and the C# program can then reopen the port and continue communication. Somehow, the Teensy still crashes once in a while (about once every 12 hours, but irregularly) in a way that the watchdog cannot reset it. Apart from CO2 sensor data requests, usually no communication is happening at the moment of the crash, and the only way to recover the connection to the Teensy is by unplugging it and plugging it back in. As I am using this in a system that needs to run continuously for >5 days, I need to resolve this crashing problem. Clearly, the Teensy can crash in ways the watchdog cannot recover from, but how?

This is my code:
Code:
#define HWSERIAL Serial5
#include "Watchdog_t4.h"
WDT_T4<WDT1> wdt;             // watchdogtimer
char c = ' '; //initializes to a space
char cmdBuffer[9] = "";
int Current = 0;
int AIN2[4] = {1,5,8,11};   
int AIN1[4] = {2,4,9,10};   
int PWM[4] = {0,6,7,12};    
int STBY = 3;               
int Solenoid = 18;          
elapsedMillis AliveTime;    //timer for checking PC connection
long TimeOn = 0;            // tracks time CO2
void setup()
{
  Serial.begin(115200);     
  Serial.println("Teensy is ready");
  Serial.println("Serial started at 115200");
  
  HWSERIAL.begin(9600);     // Serial port declaration for communication with CO2 sensor
                           
  for ( int k = 0; k <= 3; k++) {   // sets mode of all channels
    pinMode(AIN2[k], OUTPUT);
    pinMode(AIN1[k], OUTPUT);
    pinMode(PWM[k], OUTPUT);
    analogWriteFrequency(PWM[k], 20);     
    analogWrite(PWM[k], 30);
  }
  pinMode(STBY, OUTPUT);
  pinMode(Solenoid, OUTPUT);
  analogWriteResolution(8);
  digitalWrite(STBY, HIGH);
   // watchdog config block
  WDT_timings_t config;
  config.trigger = 5;
  config.timeout = 10;
  //  config.callback = myCallback;
  wdt.begin(config);
  // end watchdog config block
}

void loop()
{
  wdt.feed();
  if (HWSERIAL.available())     // pass through data from CO2 sensor to PC
  {
    c = HWSERIAL.read();
    Serial.write(c);
  }

  if (Serial.available())       // respond to commands send from PC.
  {
    while (Serial.available())
    {
      c =  Serial.read();       // read by character

      cmdBuffer[Current] = c;
      Current++;

      if (c == '\n')            // jump to action when end of line is detected, otherwise keep reading
      {
        if (cmdBuffer[0] == 'Z' )   // CO2 data request.
        {                                                   
          for (int i = 0; i <= 2; i++) {
            HWSERIAL.write(cmdBuffer[i]);
            delay(10);
          }
        }
        else if (cmdBuffer[0] == 'I')       // linear actuator control
        {
          if (Current == 7)
          {                                 // command format: "I" +"0" (bathnumber) + "000" (pump speed) +"U" or "D" or "S" (direction) + "\n"
            String Temp = "";
            Temp = cmdBuffer[1];
            int Pump = Temp.toInt();
            Temp = "";
            Temp += cmdBuffer[2];
            Temp += cmdBuffer[3];
            int PumpSpeed = Temp.toInt();
            String Direction = cmdBuffer[4];
            if (Pump >= 0 && Pump <= 3 && PumpSpeed > 0 && PumpSpeed <= 100 && (Direction == "U" || Direction == "D" || Direction == "S"))
            {
             
              if (Direction == "D")
              {
                digitalWrite(AIN1[Pump], HIGH);
                digitalWrite(AIN2[Pump], LOW);
                analogWrite(PWM[Pump], PumpSpeed);
              }
              else if (Direction == "U")
              {
                digitalWrite(AIN1[Pump], LOW);
                digitalWrite(AIN2[Pump], HIGH);
                analogWrite(PWM[Pump], PumpSpeed);
              }
              else
              {
                digitalWrite(AIN1[Pump], LOW);
                digitalWrite(AIN2[Pump], HIGH);
                analogWrite(PWM[Pump], 0);
              }


            }
          }
        }
        else if (cmdBuffer[0] == 'S')       // CO2 solenoid control. command format: "S" + "0000" (time solenoid stays in open position in milliseconds)
        {
          String Temp = "";
          Temp += cmdBuffer[1];
          Temp += cmdBuffer[2];
          Temp += cmdBuffer[3];
          Temp += cmdBuffer[4];
          TimeOn = millis() + (Temp.toInt());
          digitalWrite(Solenoid, HIGH);
        }
        else if (cmdBuffer[0] == 'A')  // signals live connection
        {
          AliveTime = 0;
        }
        cmdBuffer[0] = 0;

        Current = 0;
      }
      delay(55);
    }

  }
  if (millis() > TimeOn)        // turns solenoid closed when time is up.
  {
    digitalWrite(Solenoid, LOW);
  }
  if (AliveTime > 10000) // Means connection to PC lost, so shut down all actuators
    {
     digitalWrite(Solenoid, LOW);
     analogWrite(PWM[0], 0);
     analogWrite(PWM[1], 0);
     analogWrite(PWM[2], 0);
     analogWrite(PWM[3], 0);
     Serial.println("6. External Dead");
     delay(500);
    }
}
 
If it were me some of the things I would try include:

I would probably expand cmdBuffer to something longer, let's say 80 bytes.

I would check that I did not overflow it:
Code:
      c =  Serial.read();       // read by character

      cmdBuffer[Current] = c;
      Current++;
like:
Code:
      c =  Serial.read();       // read by character
      if (Current < (int)sizeof(cmdBuffer)) {
          cmdBuffer[Current] = c;
          Current++;
      }
Again, maybe do something if you detect this, like reset or ignore everything up to the next \n
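
A minimal sketch of the "ignore everything up to the next \n" idea, written as plain C++ so it can be tried off the board. LineReader, push() and the 80-byte capacity are illustrative names and values, not part of the original sketch or of any Teensy library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative helper (hypothetical name): collects one command line at a
// time and silently drops any line that does not fit in the buffer.
struct LineReader {
    static constexpr std::size_t kCapacity = 80;  // assumed size, per the suggestion above
    char buf[kCapacity] = {};
    std::size_t len = 0;
    bool discarding = false;  // true while skipping the rest of an overlong line

    // Feed one received character. Returns true only when '\n' completes a
    // line that fit; buf then holds it NUL-terminated, without the '\n'.
    bool push(char c) {
        if (c == '\n') {
            bool complete = !discarding;
            assert(len < kCapacity);  // invariant: there is room for the terminator
            buf[len] = '\0';
            len = 0;
            discarding = false;
            return complete;
        }
        if (discarding) return false;  // still inside an overlong line
        if (len < kCapacity - 1) {
            buf[len++] = c;            // bounded store, never past the buffer
        } else {
            discarding = true;         // too long: ignore up to the next '\n'
            len = 0;
        }
        return false;
    }
};
```

In the sketch this would stand in for the unchecked cmdBuffer[Current] = c; store: call push() with each character from Serial.read() and only act on the command when it returns true.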


Personally, I hate using the String class and the toInt for doing things... But may just be me!
There are things like, what will it do if the characters you passed in are not 0-9?
So, I would do things like:
Code:
     int Pump = -1;
     if ((cmdBuffer[1] >= '0') && (cmdBuffer[1] <= '9')) Pump = cmdBuffer[1] - '0';

More likely I might offload that to some simple function like:
Code:
int ConvertToInt(int index, int len) {
    int return_value = 0;
    while (len) {
        if ((cmdBuffer[index] < '0') || (cmdBuffer[index] > '9')) return -1;
        return_value = return_value * 10 + cmdBuffer[index] - '0';
        index++;
        len--;
    }
    return return_value;
}
So then for example the 'I' part could be something like:
Code:
            int Pump = ConvertToInt(1,1);
            int PumpSpeed = ConvertToInt(2, 2);
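
Putting those pieces together, the whole 'I' command could be validated in one place before any pin is touched. This is only a sketch under assumptions: PumpCommand, digitsToInt and parsePumpCommand are made-up names, and the layout (one pump digit, two speed digits, direction at index 4) follows the indices used in the code above rather than the three-digit speed mentioned in the original comment, so adjust the offsets if your protocol differs:

```cpp
#include <cstddef>

// Hypothetical result type for a validated 'I' command.
struct PumpCommand {
    int pump;        // 0..3
    int speed;       // 1..99 with two digits
    char direction;  // 'U', 'D' or 'S'
};

// Convert `len` ASCII digits starting at text[index]; returns -1 on any non-digit.
int digitsToInt(const char* text, std::size_t index, std::size_t len) {
    int value = 0;
    for (std::size_t i = 0; i < len; ++i) {
        char ch = text[index + i];
        if (ch < '0' || ch > '9') return -1;
        value = value * 10 + (ch - '0');
    }
    return value;
}

// Parse "I" + pump digit + two speed digits + direction (line has no '\n').
// Returns false and leaves `out` untouched if anything is malformed.
bool parsePumpCommand(const char* line, std::size_t lineLen, PumpCommand& out) {
    if (lineLen != 5 || line[0] != 'I') return false;  // assumed fixed length
    int pump = digitsToInt(line, 1, 1);
    int speed = digitsToInt(line, 2, 2);
    char dir = line[4];
    if (pump < 0 || pump > 3) return false;
    if (speed <= 0 || speed > 100) return false;       // same range check as the original sketch
    if (dir != 'U' && dir != 'D' && dir != 'S') return false;
    out = {pump, speed, dir};
    return true;
}
```

The point of the design is that a garbled line is rejected as a whole, so no array index or PWM value is ever derived from unchecked characters.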

I would also add in printing of CrashReport at startup.
like:
Code:
if (CrashReport) Serial.print(CrashReport);
To see if anything shows up
 
Hi Kurt, Thanks for the quick response. I didn't know about the CrashReport option. I am running the program overnight, if/when it crashes I'll post the CrashReport.

I've also implemented the other suggested changes, because they seem to be a good idea anyway, but would you expect any of these to result in the type of crash that cannot be reset with the watchdog, and disable the USB?

Thanks
 
Odd. If it were a typical crash that CrashReport would handle, the device would go offline for 8 seconds and restart.

I wondered whether the WDT was interfering with CrashReport, but it does not seem to. The test below first runs with no WDT, then with the WDT enabled (configured for an 8 second timeout in the code below) and left unfed for 1 to 10 seconds before a fault is triggered.

CrashReport cycled properly when the fault was triggered before the WDT timeout. However, CrashReport notes the reboot as 'caused by watchdog' even though the WDT had not yet expired before the crash?
Code:
... // secs 0 with no WDT through 5 seconds omitted
Teensy wdt CRASH test ... in seconds 6

C:\T_Drive\tCode\T4\WatchDogTest\WatchDogTest.ino Aug  3 2022 14:02:51
Teensy is ready for WDT CRASH TEST
Serial started at 115200
CrashReport:
  A problem occurred at (system time) 14:5:12
  Code was executing from address 0x126
  CFSR: 82
	(DACCVIOL) Data Access Violation
	(MMARVALID) Accessed Address: 0x0 (nullptr)
	  Check code at 0x126 - very likely a bug!
	  Run "addr2line -e mysketch.ino.elf 0x126" for filename & line number.
  Temperature inside the chip was 46.25 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by watchdog 1 or 2
Teensy wdt CRASH test ... in seconds 7

C:\T_Drive\tCode\T4\WatchDogTest\WatchDogTest.ino Aug  3 2022 14:02:51
Teensy is ready for WDT CRASH TEST
Serial started at 115200
CrashReport:
  A problem occurred at (system time) 14:5:33
  Code was executing from address 0x126
  CFSR: 82
	(DACCVIOL) Data Access Violation
	(MMARVALID) Accessed Address: 0x0 (nullptr)
	  Check code at 0x126 - very likely a bug!
	  Run "addr2line -e mysketch.ino.elf 0x126" for filename & line number.
  Temperature inside the chip was 46.25 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by watchdog 1 or 2
Teensy wdt CRASH test ... in seconds 8

C:\T_Drive\tCode\T4\WatchDogTest\WatchDogTest.ino Aug  3 2022 14:02:51
Teensy is ready for WDT CRASH TEST
Serial started at 115200
Teensy wdt CRASH test ... in seconds 9

C:\T_Drive\tCode\T4\WatchDogTest\WatchDogTest.ino Aug  3 2022 14:02:51
Teensy is ready for WDT CRASH TEST
Serial started at 115200
Teensy wdt CRASH test ... in seconds 10

C:\T_Drive\tCode\T4\WatchDogTest\WatchDogTest.ino Aug  3 2022 14:02:51
Teensy is ready for WDT CRASH TEST
Serial started at 115200
HALTING :: Teensy wdt CRASH test

just for fun - EEPROM based counting cycle code:
Code:
#include "Watchdog_t4.h"
WDT_T4<WDT1> wdt;             // watchdogtimer
#include <EEPROM.h>

char eepVal;
void setup()
{
  Serial.begin(115200);
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  Serial.println("Teensy is ready for WDT CRASH TEST");
  Serial.println("Serial started at 115200");
  if (CrashReport) Serial.print(CrashReport);
  eepVal = EEPROM.read(500);
  if ( eepVal > 10 ) {
    EEPROM.write(500, 0);
    eepVal = EEPROM.read(500);
    Serial.println("HALTING :: Teensy wdt CRASH test");
    while ( 1 ) delay(100);
  }
  EEPROM.write(500, eepVal + 1);

  // watchdog config block
  WDT_timings_t config;
  config.trigger = 3;
  config.timeout = 8;
  //  config.callback = myCallback;
  if ( eepVal > 0 )
    wdt.begin(config);
  // end watchdog config block
}

void loop()
{
  if ( eepVal > 0 )
    wdt.feed();
  if (millis() > 12000 )
  {
    Serial.print("Teensy wdt CRASH test ... in seconds ");
    Serial.println((uint8_t) eepVal);
    delay( eepVal * 1000 );
    int *p = 0;
    p[0] = 1;
  }
  delay(55);
  if ( Serial.available() ) {
    EEPROM.write(500, 0);
    eepVal = EEPROM.read(500);
    Serial.println("HALTING :: Teensy wdt CRASH test");
    while ( 1 ) delay(100);
  }
}
 
but would you expect any of these to result in the type of crash that cannot be reset with the watchdog, and disable the USB?

Yes, a buffer overflow definitely could "crash" that way. But better to think "get stuck" rather than "crash".

For example, you could overwrite variables in RAM used by USB Serial, or hardware Serial5, which cause their available() functions to never return anything other than zero. So loop() could continue to run and keep resetting the watchdog timer, but not actually be able to do anything useful. Technically that's not a "crash", because your program is still running, but it certainly could be a scenario you've described when your program no longer functions and the watchdog doesn't restart it.

Watchdog timers are notoriously difficult to actually use effectively. The idea seems simple, but the practical usage is far more difficult.

This sort of situation, where the program keeps running but not working due to a software defect, is a pretty common problem. A watchdog timer won't help if your program just resets it willy nilly. To use a watchdog effectively, you need some sort of strategy where you confirm every essential function of your program is still operating properly, then only reset the watchdog when you have certainty everything is still working.
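
One way such a strategy could look, purely as a sketch: each essential function records when it last demonstrably worked, and the watchdog is only reset while all of those timestamps are recent. HealthMonitor, the two checks and the silence limits are assumptions made up for illustration, not part of Watchdog_t4 or the original sketch:

```cpp
#include <cstdint>

// Hypothetical bookkeeping for "is everything essential still working?".
struct HealthMonitor {
    enum Check { kPcLink = 0, kCo2Sensor = 1 };
    static constexpr int kChecks = 2;

    std::uint32_t lastOkMs[kChecks] = {0, 0};
    std::uint32_t maxSilenceMs[kChecks] = {10000, 30000};  // assumed limits

    // Call when a function has just proven it works, e.g. an 'A' heartbeat
    // arrived from the PC, or a plausible CO2 reading was received.
    void reportOk(Check which, std::uint32_t nowMs) { lastOkMs[which] = nowMs; }

    // True only if every check has reported success recently enough.
    bool allHealthy(std::uint32_t nowMs) const {
        for (int i = 0; i < kChecks; ++i) {
            // Unsigned subtraction stays correct across millis() rollover.
            if (nowMs - lastOkMs[i] > maxSilenceMs[i]) return false;
        }
        return true;
    }
};
```

In loop() the unconditional wdt.feed() would then become something like if (health.allHealthy(millis())) wdt.feed();. Whether a silent PC link should really force a reset, and how long each limit should be, is exactly the kind of judgment call that makes this hard to get right.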

Sometimes using a watchdog timer can make small problems much worse, especially for programs where the flow of control or algorithm changes substantially depending on inputs or other factors. When programs aren't very simple fixed functions, it can be really difficult to determine when to reset the watchdog. If you err on the side of too much reluctance to reset the watchdog, it can become overly sensitive and end up restarting your program under conditions you didn't anticipate. This is particularly difficult if you later add features to the program, causing its timing and flow to change. But you need that reluctance to reset the watchdog, otherwise your program can get stuck in a loop where you're resetting the watchdog but not actually running properly.

I don't have any easy answers. I'd be amazed if anyone did. Watchdogs are tough to use effectively for anything but fairly simple fixed function programs which always run with rigidly defined timing and behavior.
 
Thanks for the clarification. I ran the system with the suggested changes for the past 22 hours without crashes, so far so good. What I do not understand is how the Teensy getting stuck in a loop could cause the PC to stop recognizing the USB connection (i.e. my C# program indicates the port is closed and I cannot reopen it unless I unplug and replug the USB). Is it possible that when the Teensy gets stuck in a high speed loop it prevents further serial communication, which is interpreted as a lost connection by the PC?
 
Sometimes when you overwrite memory it is hard to know what the ramifications might be. It depends on what was overwritten. For example, if something is on the stack (a local variable) and you overwrite past it, the function could lose its return address, so when it returns it jumps to some random place, which could for example decide to disable interrupts and sit there, or...
 
Great that it is running.

@Paul's p#5 gave some indication about loss of USB. If the code goes awry and trashes memory the USB Stack can get disabled.

That could happen if the code was writing to that array without a limit on Current++, before the other edits were made.
 
The USB controller operates with DMA accessing linked lists of structs in memory to define the properties of all the USB endpoints and manage all the pending data transfers. If you really want to dive into the details, look for the QH and QTD structures defined in the reference manual, but be careful to read the section about device mode, because the hardware has those same struct names for in-memory config for host mode, and host mode has many important differences from the simpler device mode. If your buffer overflow corrupts those linked lists of structs that control the USB endpoints, all sorts of wrong things could happen. But almost all of them would fall into the category of "stops working".
 
Paul, thanks,
A similar question I'm trying to answer is: can the T4 perform the equivalent of a power-on reset from one of the IMXRT1060's built-in WDTs without running a wire out and back in, so to speak? I'm not asking if any outside circuitry gets reset, only the T4. A restart in some cases has to load the code from flash, init everything in the processor and on-chip peripherals, etc. Or do I need to poke the bootloader chip or blip the power on/off pin with an I/O output?
 