Teensy 3.2 lockup issue

Status
Not open for further replies.

brianmichalk

Active member
My Teensy 3.2 locks up randomly, and I've been trying for months to find the cause with little luck. It's a large program driving multiple hardware devices, most of which use interrupt routines. Removing modules one at a time for testing is very difficult, both because the other modules stop working properly and because it can take up to a day for the lockup to occur.

What could cause a Teensy to lock up in such a way that the watchdog also fails to reboot it? What I have tried so far:
1) Added ferrite beads to all wires
2) Added a big copper pour on the circuit board, with a decoupling capacitor
3) Instrumented the code with DMAMEM variables to record entry to and exit from subroutines
4) Printed the RCM registers at boot to show the cause of the last reboot
5) Compiled with "debug" and no LTO optimization
6) Set the CPU speed to 72 MHz

When it locks up, I get this message from the kernel:
[444123.283708] usb 1-14: USB disconnect, device number 95
[444123.283938] cdc_acm 1-14:1.0: failed to set dtr/rts

If I try to load firmware via Arduino Teensy Loader, it fails because there is no serial device.
If I press the button, the Teensy will reboot and enter the update mode.

The machine is a brass case feeder, and I have discovered that static builds up on the brass, and is released when the cases strike other cases at a different voltage potential, so removing this source of noise is difficult.
This could be noise on the digital wires.
This could be a misbehaving ISR.
This could be bad power.

The causes of reboot are either "External Pin Reset" when it locks completely, or watchdog timeout when one of the motor operations takes too long. I have not seen any other causes.
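
For anyone wanting to do the same, here is a minimal sketch of turning the two RCM status bytes into readable text. The bit masks are my reading of the K20 reference manual and should be checked against it, and describeReset is a hypothetical helper name, not something from the Teensy core. It formats into a fixed buffer so no String objects are involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

struct Flag { uint8_t mask; const char *name; };

// Assumed bit layout of RCM_SRS0 and RCM_SRS1 (verify against the manual).
static const Flag kSrs0[] = {
    {0x80, "power-on"},     {0x40, "external pin"},  {0x20, "watchdog"},
    {0x08, "loss of lock"}, {0x04, "loss of clock"}, {0x02, "low voltage"},
};
static const Flag kSrs1[] = {
    {0x04, "software"}, {0x02, "core lockup"},
};

// Append the name of every set flag to buf, separated by ", ".
static void appendFlags(uint8_t reg, const Flag *flags, size_t n,
                        char *buf, size_t len) {
    for (size_t i = 0; i < n; ++i) {
        if (!(reg & flags[i].mask)) continue;
        size_t used = std::strlen(buf);
        std::snprintf(buf + used, len - used, "%s%s",
                      used ? ", " : "", flags[i].name);
    }
}

// Write a comma-separated list of reset causes into buf.
void describeReset(uint8_t srs0, uint8_t srs1, char *buf, size_t len) {
    if (len == 0) return;
    buf[0] = '\0';
    appendFlags(srs0, kSrs0, sizeof kSrs0 / sizeof kSrs0[0], buf, len);
    appendFlags(srs1, kSrs1, sizeof kSrs1 / sizeof kSrs1[0], buf, len);
    if (buf[0] == '\0') std::snprintf(buf, len, "unknown");
}
```

On the Teensy the call would be something like describeReset(RCM_SRS0, RCM_SRS1, buf, sizeof buf) near the top of setup(), before anything else has a chance to hang.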

What environmental factors could cause the WDT to not reset the device?
 
This is only a guess, but I've seen MCUs that can't boot after a soft reset because the startup code doesn't initialize all registers properly, so it fails when something uncommon has been left in a bad state. If the WDT is performing a hard reset, though, that case doesn't sound possible.
 
Agreed on the watchdog, the very first thing to check is the possibility that the chip really is rebooting, but your startup code isn't able to deal with the wrong conditions that led to the reboot. One simple way to check this, if you have an extra pin free, is to wire an LED+resistor to that pin. Put some code *very* early in the startup to turn on the LED. For example, look for this in mk20dx128.c

Code:
        //PORTC_PCR5 = PORT_PCR_MUX(1) | PORT_PCR_DSE | PORT_PCR_SRE;
        //GPIOC_PDDR |= (1<<5);
        //GPIOC_PSOR = (1<<5);

If you uncomment this code, it will turn on the LED on pin 13. You can edit this code to use another pin, if 13 is already used.

Then in your main program, maybe blink the LED or simply turn it off. Next time the reboot happens, if the LED remains on you can know the chip did reboot and ran that very early code, but then got stuck somewhere in trying to initialize everything.
 
If you suspect high voltage spikes or transients are coupling to signals, ferrite beads alone will give very little protection unless the duration of the spike is very short.

Usually the way to protect pins looks like a series resistor and a pair of clamping diodes, to GND and 3.3V. If an external spike couples to a signal, the diodes tend to limit the voltage Teensy sees to about -0.7V to +4V, or about -0.4V to +3.7V if you use Schottky diodes. The resistor between the diodes and the outside world is essential. During the spike, nearly all of that voltage is across the resistor. Obviously, the higher the resistor value, the more protection you get.
 
...

Code:
        //PORTC_PCR5 = PORT_PCR_MUX(1) | PORT_PCR_DSE | PORT_PCR_SRE;
        //GPIOC_PDDR |= (1<<5);
        //GPIOC_PSOR = (1<<5);

If you uncomment this code, it will turn on the LED on pin 13. You can edit this code to use another pin, if 13 is already used.
...

I looked for that code - it is under:
Code:
#elif defined(__MK64FX512__) || defined(__MK66FX1M0__)

So uncommenting that 'in place' would not show on T_3.2 as it is T_3.5/3.6 exclusive.
 
About four months ago I went through a bout of tracking down the noise source on a different CPU board that also uses a Teensy. That board was four feet away from the brass, compared to inches on this design, but it had to accurately count teeth on a gear. With my scope, I was seeing noise pulses of about 20 ns when brass dropped. On this board, even though it's right next to the brass, noise on the brass sensor line was acceptable because counting was not a priority. I just assumed that these pulses were wreaking havoc with my board because of bad PCB design. I just had a fresh set of boards made, and they did the same thing, but my scope showed the lines to be pretty quiet.

I instrumented up to 11, and narrowed the offending code to something like this:
Code:
#define _HITWD     WDOG_REFRESH = 0xA602;\
                   WDOG_REFRESH = 0xB480;
#define LEDPIN     13
unsigned long oldTime;
unsigned long curTime;

void setup() {
  noInterrupts();
  // Unlock the watchdog so its configuration can be changed
  WDOG_UNLOCK = WDOG_UNLOCK_SEQ1;
  WDOG_UNLOCK = WDOG_UNLOCK_SEQ2;
  delayMicroseconds(1);
  WDOG_TOVALH = 0x006D; // 1 second timeout
  WDOG_TOVALL = 0xDD00;
  WDOG_PRESC = 0x400;
  WDOG_STCTRLH |= WDOG_STCTRLH_ALLOWUPDATE |
                  WDOG_STCTRLH_WDOGEN |
                  WDOG_STCTRLH_WAITEN |
                  WDOG_STCTRLH_STOPEN |
                  WDOG_STCTRLH_CLKSRC;
  interrupts();
}

void loop() {
  _HITWD
  oldTime = curTime;
  curTime = millis();
  int delta = curTime - oldTime;  // signed difference of two unsigned values

  // Pad each pass out to roughly 100 ms
  if (abs(delta) < 100) {
    delay(100 - delta);
  }
}

This is of course not the original code, which has ISRs, timers and such. I have tried to condense this down to a working example, and so far I am unable to duplicate the problem.

I think that the "int delta" was a problem, but using this in my SSCCE, the watchdog seems to work properly.

Somewhere along the way, my code has started to work properly, and I don't know why. I've removed some noInterrupts() blocks because they were no longer needed, and that may have been a factor.

My delay in the working code now looks like this:
Code:
if (delta < 100 && delta > 0){
    unsigned long d1 = millis();
    unsigned int d = 100 - delta;
    while ((millis() - d1) < d);  // spin here for a delay;
}
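
The difference between the two versions comes down to signed versus unsigned arithmetic on millis() values. A minimal sketch of the unsigned form, pulled out into a hypothetical helper (remainingDelay is my name for it, and I am not claiming the signed version was the cause of the lockup):

```cpp
#include <cstdint>

// Milliseconds still to wait so that consecutive passes are `period` ms apart.
// Unsigned subtraction keeps `elapsed` correct even if the counter wrapped
// between the two readings.
uint32_t remainingDelay(uint32_t previous, uint32_t current, uint32_t period) {
    uint32_t elapsed = current - previous;   // modulo 2^32, never negative
    return elapsed < period ? period - elapsed : 0;
}
```

Because the result can never be negative or larger than period, whatever is passed on to a delay is bounded, which matters when the watchdog timeout is only one second.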

I'm not yet convinced I've found the cause. It sometimes takes a long time to trigger the problem. I am reading the responses, and am looking into those other ideas.
 
I looked for that code - it is under:
Code:
#elif defined(__MK64FX512__) || defined(__MK66FX1M0__)

So uncommenting that 'in place' would not show on T_3.2 as it is T_3.5/3.6 exclusive.

So, where would I find the equivalent mk20dx256.c? This doesn't exist in my install.
 
I've come a long way toward solving the problem. I inserted more debug statements, and the program became less and less reliable. What I figured out is that I was bitten by the String() concatenation bug. I removed all concatenations and the program got a lot better. I've also removed my DMAMEM variables and replaced them with a single DMAMEM variable that is set from a macro like this:
Code:
#define DC  dmaCanary = __LINE__
Then in my code, I insert DC everywhere like so:
Code:
DC;curTime = millis();
When the code boots, it prints the last line number of the code that was stored with the macro.
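
Put together, the canary looks roughly like the sketch below. The empty DMAMEM fallback is only there so the snippet also builds on a PC, and sampleStep is a hypothetical stand-in for a routine under suspicion. That the variable keeps its value across a reset is what I observe on my board, not something I have confirmed from documentation.

```cpp
#include <cstdint>

#ifndef DMAMEM
#define DMAMEM   // assumption: the Teensy core defines this; empty off-target
#endif

// Deliberately no initializer, so the last value can be read back at boot.
DMAMEM volatile uint32_t dmaCanary;

// Record the current source line in the canary.
#define DC (dmaCanary = __LINE__)

// Stand-in for instrumented code; returns the line number it recorded.
uint32_t sampleStep() {
    DC; uint32_t marked = __LINE__;   // same source line as the DC before it
    return marked;
}
```

At boot, the value of dmaCanary is printed before anything overwrites it, and that number points at the last DC that ran before the lockup.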
Another improvement was to remove my watchdog code and go with the Adafruit SleepyDog library. When I discover the remaining causes of rebooting, I'll report back.
 
I am still having problems with locking up. My DMA canary shows that it's locking up at analogRead(). I also have another problem: SleepyDog is not resetting the Teensy.
I can run my machine for hours and not have any problems. My customer can lock up two of his machines reliably within five minutes.
1) What happens if Watchdog.enable() is not called at the beginning of setup()? Does startup_early_hook() come into play here? Should I include the header file earlier?
2) The SleepyDog GitHub readme says support is only partial. I think that just means there is no low-power sleep mode. Is that right?
3) My libraries:
Using library Adafruit_SleepyDog_Library at version 1.1.2 in folder: /home/michalk/Arduino/libraries/Adafruit_SleepyDog_Library
Using library Encoder at version 1.4.1 in folder: /home/michalk/Arduino/libraries/Encoder
Using library EEPROM at version 2.0 in folder: /home/michalk/Downloads/arduino-1.8.9-linux64/arduino-1.8.9/hardware/teensy/avr/libraries/EEPROM
/h
Do any of these conflict with analogRead()?
 