Teensy 4.1 Freeze (~1786ms)

defragster · Mar 10, 2021

Yes, p#24. If it still fails with the suspected problem areas removed, the problem is elsewhere.

Not sure how the 'freeze' is detected. By the last USB Serial output? That can be misleading: when the USB stack gets trashed or halted for another reason, the pending output may stall, and the problem may actually happen later than the last line printed. To test that, putting Serial.flush() after each print() makes sure the output is transmitted before moving on. That may have questionable effect if the problem happens in an interrupt - though IIRC USB generally has higher priority than the default interrupt setting.

Frank B · Mar 10, 2021

What happens if you disable interrupts before analogRead() (and enable them after that, of course)? Still freezes?

UdoZ · Mar 10, 2021

ZeitStopX are all unsigned long - and until this latest change they were storing micros().

My detection currently works as follows: I store unsigned long TimeMarkers = millis() between each call to a subroutine inside Loop(). I also measure the length of the loop using the TimeMarkers at its beginning and end. If the difference is > 1700 ms, I write out all TimeMarkers. There is little to no USBSerial activity during Loop(), which normally lasts less than 15 ms.

Each subroutine only outputs USBSerial if it runs longer than expected (times set individually for each sub).

Initially, I doubted whether my USBSerial-based timing was right, because I have to rely on millis(). I therefore added a digitalWrite(Pin) pulse at the start of each Loop and measured the time difference between two consecutive pulses with an Arduino Due. That measurement confirmed that the timings reported through USBSerial are indeed correct. I also checked my TimerISR the same way to make certain that it never runs longer than it should.

I am not relying on the timestamp at which the USBSerial output was displayed on my monitor.

I hope this makes clearer what I am doing.
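
The marker scheme described above can be sketched in host-compilable C++ roughly as follows. This is only an illustration, not the actual sketch: LoopTiming, elapsedMs and loopFroze are hypothetical names, the values passed in stand in for millis() readings, and the 1700 ms threshold is the one mentioned above. Unsigned subtraction keeps the comparison valid even if the 32-bit counter wraps between the two readings.

```cpp
#include <cstdint>

// Hypothetical record of the millis() readings taken at the start and end of one Loop() pass.
struct LoopTiming {
    uint32_t startMs;
    uint32_t endMs;
};

// Elapsed time; unsigned arithmetic gives the right answer even if the counter wrapped in between.
constexpr uint32_t elapsedMs(LoopTiming t) {
    return t.endMs - t.startMs;
}

// True when one pass took longer than the threshold (1700 ms in the post above).
constexpr bool loopFroze(LoopTiming t, uint32_t thresholdMs = 1700u) {
    return elapsedMs(t) > thresholdMs;
}
```

A normal 15 ms pass is not flagged, while a pass of about 1786 ms is; only in the second case would all markers be written out.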

UdoZ · Mar 10, 2021

Frank B said:
What happens if you disable interrupts before analogRead() (and enable them after that, of course)? Still freezes?

I have not yet tried this - unfortunately, because those Freezes are rare, each test usually takes one night.

However, I would like to point out that the problem has shifted. Initially it mostly happened at while(!client.available()), when requesting the light status from either of my two Philips HueBridges, each returning approx. 6000 bytes. That wait usually takes very little time, and the same request has been made once a minute by an Arduino Due for many months, which shows the issue is not with the HueBridge.

These status calls have been disabled at present, and since then all recorded freezes happened at the time between ZeitStopB and ZeitStopC, when analogRead(A9) was executed.

Frank B · Mar 10, 2021

I had analogRead running the whole night, with no freezes.
I suspect it must be something else, and the only idea I have now is that it happens in an interrupt. So this test can show whether that is the case.

UdoZ · Mar 10, 2021

We have been fortunate (or is it rather unfortunate?): there have already been some Freezes since this latest change (analogRead -> digitalReadFast):

The new test started at 10:36 and the first 3 Freezes within one minute occurred at 11:05 (2nd 17 loops later, 3rd 19 loops after the second). Those Freezes were no longer between ZeitStopB & ZeitStopC (both show identical ms values). However, the first Freeze still happened in the same sub, but no longer between ZeitStart and ZeitStopD, as all show identical millisecond readings. The Freeze happened between ZeitStopD and the end of the routine (ZeitStop - ZeitStart = 1767018546).

The subroutine is simple (showChange is set true, but no printout was recorded at the time of the Freezes):

Code:

void poll_Switches_and_Hell(boolean showChange)
{
  unsigned long t1 = micros();

  KNXCounterInInput = KNXCounter;

  ZeitStart = millis();

  i_SwitchUp += digitalRead(SwitchUp); ZeitStopA = millis();
  i_SwitchDw += digitalRead(SwitchDw); ZeitStopB = millis();
  i_Hell += digitalReadFast(HellSensor) * 1023; //digitalReadFast(HellSensor);
  ZeitStopC = millis();

  // Once every max_counterSwitch calls, evaluate the accumulated samples
  if (counterSwitch == max_counterSwitch - 1)
  {
    fHell = float(i_Hell) / float(max_counterSwitch);

    if (i_SwitchUp == 0) iSwitchUp = 1;
    else if (i_SwitchUp == 8) iSwitchUp = 0;

    if (i_SwitchDw == 0) iSwitchDw = 1;
    else if (i_SwitchDw == 8) iSwitchDw = 0;

    iHell = int(fHell);

    counterSwitch = 0; i_SwitchUp = 0; i_SwitchDw = 0; i_Hell = 0;
  }
  else
  {
    counterSwitch++;
  }

  ZeitStopD = millis();

  if (showChange)
  {
    boolean Change = false;

    if (old_iSwitchUp != iSwitchUp)
    {
      Serial.print("new SwitchUp: "); Serial.print(iSwitchUp);
      old_iSwitchUp = iSwitchUp;
      Change = true;
    }

    if (old_iSwitchDw != iSwitchDw)
    {
      Serial.print("\tnew SwitchDw: "); Serial.print(iSwitchDw);
      old_iSwitchDw = iSwitchDw;
      Change = true;
    }

    if (abs(old_iHell - iHell) > 15)
    {
      Serial.print("\tnew iHell: "); Serial.print(iHell);
      old_iHell = iHell;
      Change = true;
    }

    if (Change) Serial.println();
  }

  if (micros() - t1 > 100) { Serial.print("dt_us>100(Inputs):"); Serial.println(micros() - t1); }

  ZeitStop = micros();
}

The second and third Freezes happened inside the routine that checks whether a client is available. Whilst this is a lengthy routine, it prints a message if a client was detected - which wasn't the case. Therefore the "active" part in this instance was merely this:

Code:

void PCContact()
{
  if (PC_Contact_Established) { Serial.println("already inside PCContact"); return; }

  // Working parameters
  char ReadText[100];
  ReadText[0] = 0;

  byte Count = 0;
  byte Action = 99;

  boolean ClientWasDetected = false;

  unsigned long t1 = micros();  // duration of the routine

  Ethernet.setRetransmissionCount(2);    // <---
  Ethernet.setRetransmissionTimeout(50); // <---

  EthernetClient client = server.available();

  client.setConnectionTimeout(100);

  // Check whether someone has contacted us. Is anyone connected to the Arduino?
  if (client)
  {
    unsigned long t2 = micros();   Serial.print("PCC client t2-t1:"); Serial.print(t2 - t1);

    // CODE REMOVED as no client detected

  }  // end if(client)

  if (ClientWasDetected) // && micros()-t1 > 3000)
  {
    Serial.print("\tt(PC_us)): "); Serial.println(micros() - t1);
  }

  PC_Contact_Established = false;
}

Now, interestingly, at 11:18 the very same sequence of Freezes happened. The only difference: the 3rd Freeze happened 18 loops after the 2nd (previously 19), whilst the 2nd again happened 17 loops after the first.

Does this tell us anything?

I will next remove the SPI OLED from the sketch, although at present it is only written to once per minute. Are there any other suggestions?

Frank B · Mar 10, 2021

Can you upload the whole code somewhere?

UdoZ · Mar 10, 2021

Frank, I just emailed you a link where the code of my last test can be found, i.e. before removal of the SPI-linked OLED.
I also included a .txt file showing the USBSerial output from this morning.

Frank B · Mar 10, 2021

UdoZ said:
Frank, I just emailed you a link where the code of my last test can be found, i.e. before removal of the SPI-linked OLED.
I also included a .txt file showing the USBSerial output from this morning.

That's a LOT of code ..

The freeze happens outside the "433" interrupt, and this interrupt is not the reason - is this correct?
(I'm asking because the interrupt is very long and has a lot of prints inside - things that should normally be avoided)

Where can I see the freeze in the textfile? Edit: Found it
Another question: What triggers the 433 interrupt? A keypress?

WMXZ · Mar 10, 2021

While my system did not freeze (hold for 1786 ms) in the last two days, I'm still convinced that:
it is either something on CPU level (generating interrupts), in which case we cannot do anything about it,
or it has something to do with disabling interrupts and waiting explicitly or implicitly for the CPU clock (2^30 cycles / 600 MHz = 1789.6 ms is far too intriguing to be ignored).
What is the freeze length at 396 or 528 MHz?
So far, I have not found a loop of that type.
To be complete, my freeze event appeared and disappeared when I made minor changes to the code.
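
As a quick check of that arithmetic, the sketch below computes how long 2^30 CPU cycles last at the three clock speeds mentioned. It assumes the nominal clock frequencies and says nothing about whether a 2^30-cycle wait is actually the cause; cyclesToMicros is a hypothetical helper name.

```cpp
#include <cstdint>

// Duration of `cycles` CPU cycles at `cpuHz`, in whole microseconds (truncated).
constexpr uint64_t cyclesToMicros(uint64_t cycles, uint64_t cpuHz) {
    return cycles * 1000000u / cpuHz;
}

constexpr uint64_t kCycles = 1ull << 30;  // 2^30 = 1,073,741,824 cycles

constexpr uint64_t at600MHz = cyclesToMicros(kCycles, 600000000u);  // 1,789,569 us, about 1789.6 ms
constexpr uint64_t at528MHz = cyclesToMicros(kCycles, 528000000u);  // 2,033,601 us, about 2033.6 ms
constexpr uint64_t at396MHz = cyclesToMicros(kCycles, 396000000u);  // 2,711,469 us, about 2711.5 ms
```

If the freeze length scaled this way at 396 MHz (roughly 2.7 s instead of 1.8 s), that would support a cycle-counter explanation; if it stayed near 1786 ms, it would point elsewhere.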

Frank B · Mar 10, 2021

WMXZ said:
(2^30 cycles / 600 MHz = 1789.6 ms is far too intriguing to be ignored)

Yes, of course.
Perhaps Defragster is on the right path re: micros..

It's really a freeze, not just wrong values of micros or millis(), right? Do I understand this right?
So it must be a loop where micros() or a timeout is used? But in an interrupt? Hm... it seems to happen after a fixed time because with the analogRead removed it happens too, but just a little bit later?

WMXZ · Mar 10, 2021

Frank B said:
It's really a freeze, not just wrong values of micros or millis(), right? Do I understand this right?

Take my application, where I count I2S-DMA interrupts (which are also easily visible on an LED). Typically there are 750 interrupts/s and the LED is flashing.
When the freeze occurred, the LED remained off, the interrupt count dropped close to zero, and there was no processing in the ISR (monitored with digitalWriteFast and a logic analyzer).
Processing typically takes between 50 and 75 % of the available time in the ISR.
Caveat: while the freeze is similar in duration, the cause may be different from the OP's.

UdoZ · Mar 10, 2021

Frank B said:
That's a LOT of code ..
The freeze happens outside the "433" interrupt, and this interrupt is not the reason - is this correct?
(I'm asking because the interrupt is very long and has a lot of prints inside - things that should normally be avoided)

Where can I see the freeze in the textfile? Edit: Found it
Another question: What triggers the 433 interrupt? A keypress?

The 433 interrupt is triggered by 433 MHz motion sensors. Their output is pre-processed by an Arduino Mini Pro and the result is then passed to RX2. There are few of these interrupts, particularly when scanning the program at night. There was definitely no 433 interrupt during the faults reported this morning.

Many prints have been added during my fault finding - I tried to pin-point the Freeze in my code during the last 2 weeks (before asking for help here).

UdoZ · Mar 10, 2021

Frank B said:
Yes, of course.
Perhaps Defragster is on the right path re: micros..

It's really a freeze, not just wrong values of micros or millis(), right? Do I understand this right?
So it must be a loop where micros() or a timeout is used? But in an interrupt? Hm... it seems to happen after a fixed time because with the analogRead removed it happens too, but just a little bit later?

It is really a Freeze, proven by measuring the time between Loop()s using an Arduino Due, see above.

Frank B · Mar 10, 2021

I see these includes:

#include <NativeEthernet.h>
#include <EEPROM.h>
#include <TimeLib.h>
#include <U8x8lib.h>
#include <NativeEthernetUdp.h>

@wmxz do you use any of them?

UdoZ · Mar 10, 2021

Test without <U8x8lib.h> and therefore without <SPI.h>: Freeze after 86 minutes, again inside subroutine poll_Switches_and_Hell, between ZeitStopD and the end.
My next test will use the same program, but running at a CPU speed of 396 MHz.

So far excluded (as prime cause) Teensy RTC, <NativeEthernetUdp.h>, <U8x8lib.h> using SPI OLED

WMXZ · Mar 10, 2021

Frank B said:
I see these includes:

#include <NativeEthernet.h>
#include <EEPROM.h>
#include <TimeLib.h>
#include <U8x8lib.h>
#include <NativeEthernetUdp.h>

@wmxz do you use any of them?

No, I did use TimeLib,
but I removed it (reading RTC directly and not via CPU timer, which could be an issue)

WMXZ · Mar 10, 2021

I should add, I'm using micros() with an ISR

defragster · Mar 10, 2021

So analogRead() is ruled out as the source of the trouble - Cool.

Question: This "FREEZE" - is it where execution is 'elsewhere' for some time, but then returns to normal operation and function ... until the same FREEZE repeats ...?

micros() really isn't suspect - it completes in about 36 cycles and has been used Billions of times. It doesn't really do anything dangerous; the one unique thing is that a loop repeats until the two millis-tick globals are read/copied without an interrupt occurring. It was only questioned because it was near what 'appeared' to be the point of failure - but with digitalRead() showing that was not the point of failure, that was just an exercise/test to see if it was related.
> The T_3.x micros() kills/restores interrupts when called to get the 'micro' portion - and takes more cycles. T_4.x micros() does not stop/start interrupts.
> The T_4.x was given a low res timer for millis() - that gave ~10 us res for millis(). So with a running ARM_CYCCNT I added code to get us res based on the last millisecond timer tick with that. That requires reading two DWORDS, which had to be atomic.
> The MCUs have a way to repeat a do{}while(); when an interrupt happens. When adding that code I started some us timers to trigger that while calling micros(), looking to get each us value returned one or more times, and that worked with no issues. So testing here was probably a Billion+ times running micros() before it was put into TeensyDuino.
->> Perhaps there is a case where that repeat loop could stall - but only as long as there was an 'interrupt storm' so fast that the MCU could not find 6-15? cycles to read two DWORDS without an interrupt to exit that loop.

@UdoZ: the 'DSB' (Frank B offered) assures that actions/changes on the low/quarter speed hardware/IO bus are recognized and seen on the full speed memory processor bus. There are cases where interrupt code completes so fast that the interrupt complete flag doesn't make it to the other part of the MCU, and after calling the _ISR() it sees the interrupt as un-serviced and calls it again. The "DSB" stalls the _ISR() exit until the busses are synchronized.
>> Perhaps all of your _ISR()'s should be tested by adding that asm ("dsb":::"memory"); before exit, just as a test.
> None of the _ISR()'s should be doing anything complex like printing to USB or other things that could cause interrupts or other non-ISR()-safe actions.

Given the location/cause of the Freeze is still in question, the prior note about Serial.flush() might be helpful: putting it just after each Serial.print() assures that the output completes at that point of execution.

It is ODD that two sets of code with only the Teensy in common [no common library code?] are suddenly seeing this freeze in some similar recurring fashion.

Frank B · Mar 10, 2021

Yes, esp. that the freeze is the same length, and this length.
However, micros() has a minor problem: there is a minimal chance that it reports a wrong value (the result can overflow). But this is not the problem here.
And it seems to happen with millis() too, and even without time measurement (LED blinks), if I got it right.
Seems to have occurred for the first time now... reproducible... but with two different programs? Hmm..

WMXZ · Mar 10, 2021

defragster said:
It is ODD that two sets of code with only the Teensy in common [no common library code?] are suddenly seeing this freeze in some similar recurring fashion.

It is odd indeed, but IMHO, we use Teensy core and not all features are discovered.

defragster · Mar 10, 2021

Frank B said:
Yes, esp. that the freeze is the same length, and this length.
However, micros() has a minor problem: there is a minimal chance that it reports a wrong value (the result can overflow). But this is not the problem here.
And it seems to happen with millis() too, and even without time measurement (LED blinks), if I got it right.
Seems to have occurred for the first time now... reproducible... but with two different programs? Hmm..

Wrong value in micros()? Overflow? In what case? Paul added code to make sure it doesn't jump a ms when the CPU at low speeds may get behind (?) updating ms_ticks and the CYCCNT math would round up a ms.
If you have repro code and question start a thread for that?

WMXZ said:
It is odd indeed, but IMHO, we use Teensy core and not all features are discovered.

Indeed - there is tons/CORES of common code - not having an external common lib behind it means it could be 'anything else'

Not looked at either code base here - WMXZ - have you seen anything in common to your code _isr()'s or other?

Since I can't see it - not even sure where to begin finding where it might be stuck ... not faulting ... can the code detect when there is an ongoing freeze? Not sure what it could do then ...

The only thing similar WMXZ was on the MSC thread - I put an interval timer in that kept running and USB Serial printing - but USB Host stalled if a write protected flash was the first 'disk' it saw? Never got to the bottom of that - but that was a lifetime stall and triggered by USB Host code in some fashion and never returns to loop(). It didn't help find a fix - but did show that the MCU was running and printing even though loop() was dead and it never faulted.

Frank B · Mar 10, 2021

Code:

uint32_t micros(void)
{
    uint32_t smc, scc;
    do {
        __LDREXW(&systick_safe_read);
        smc = systick_millis_count;
        scc = systick_cycle_count;
    } while ( __STREXW(1, &systick_safe_read));
    uint32_t cyccnt = ARM_DWT_CYCCNT;
    asm volatile("" : : : "memory");
    uint32_t ccdelta = cyccnt - scc;
    uint32_t frac = ((uint64_t)ccdelta * scale_cpu_cycles_to_microseconds) >> 32;
    if (frac > 1000) frac = 1000;
    uint32_t usec = 1000*smc + frac;
    return usec;
}

a) frac > 1000? What if frac is > 2000? It loses usecs. (Under special circumstances only - can happen when interrupts are disabled for a long time.)
b) uint32_t usec = 1000*smc + frac; -> smc is 32 bit, too, and systick_millis_count counts endlessly until it wraps to zero. So it can be very high and then gets multiplied by 1000.
As said, not really an important issue, and very very unlikely to be the reason for the freeze.
It can't because the freeze happens without time measurement, too (if I read it right)

You can open a thread of course. I'm not interested - too minor issue.
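
Both points can be reproduced on a PC with the same arithmetic. The sketch below is a stand-in, not the core code: sketchMicros, fracMicros and clampTo1000 are hypothetical names, the tick values are passed in as parameters instead of being read from the systick globals, and the scale factor is assumed to be 0xFFFFFFFF / 600 for a 600 MHz clock.

```cpp
#include <cstdint>

// Assumed scale factor for a 600 MHz clock: roughly 2^32 / (cycles per microsecond).
constexpr uint32_t kScale600MHz = 0xFFFFFFFFu / 600u;

// Cycles since the last millisecond tick, converted to microseconds (truncated).
constexpr uint32_t fracMicros(uint32_t ccdelta) {
    return static_cast<uint32_t>((static_cast<uint64_t>(ccdelta) * kScale600MHz) >> 32);
}

// The clamp from the listing above: anything beyond 1000 us is cut off.
constexpr uint32_t clampTo1000(uint32_t frac) {
    return frac > 1000u ? 1000u : frac;
}

// Same arithmetic as the tail of micros() above, with the inputs made explicit.
constexpr uint32_t sketchMicros(uint32_t smc, uint32_t ccdelta) {
    return 1000u * smc + clampTo1000(fracMicros(ccdelta));
}

// (a) 1,500,000 cycles is about 2.5 ms of delay, but only 1000 us of it is reported.
constexpr uint32_t clamped = sketchMicros(5u, 1500000u);   // 6000, not ~7499

// (b) 1000 * smc exceeds 32 bits once smc passes 4,294,967, so the result wraps.
constexpr uint32_t wrapped = sketchMicros(4294968u, 0u);   // 704
```

In case (a) about 1499 us go missing in this example. Case (b) is the usual 32-bit wrap of the microsecond count: the difference between two readings still comes out right as long as it is computed with unsigned subtraction.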

Frank B · Mar 10, 2021

WMXZ said:
It is odd indeed, but IMHO, we use Teensy core and not all features are discovered.

Hm
Well, there is the ARM Cortex M7 core. It should take 1 minute of googling to find out whether it has such problems.
It's not used by NXP only and is in millions (billions?) of devices.
Then, there are the NXP additions.
I don't see that we use it in a way that can cause such a phenomenon. It would happen way more often.
It would have to stop the ARM Cortex M7 core. Its clock. Not very likely.
But it's entirely possible that the code does not stop and we're in an interrupt instead...
So, for the freeze, I'd use occam's razor and assume a problem with the user code. This possibility has the highest probability.

Frank B · Mar 10, 2021

I wouldn't rule out GCC, too.
It has hundreds of known bugs / missed optimizations etc. etc... and 5.4 is very old. Same for the whole rest of the toolchain.
But, again, unlikely. I bet, it's a problem with the code.
