Program hangs depending on code length

Status
Not open for further replies.

David C

Active member
Hi

I know you're supposed to post code, but I have a situation which happens only when the Teensy is more than 80% full, and that's a lot of code...

So I'm really looking for pointers as to how to investigate.

I have this big program, and have been merrily adding code for a few months. The application compiles successfully, but then hangs after a seemingly random time. The new code is not called under normal circumstances (it handles an inbound message, which arrives only very infrequently). I have extensive logging and I can determine when it stops, and it's different every time.

If I remove this new code, the system works indefinitely.

If I include this new code and remove about 200 lines of older code, the new code works fine and the system works indefinitely. (The older code I removed was also not used, being dependent on a selectable option.)

I have used RamMonitor to examine the heap and stack size as well as free memory. It's about 2kb/3kb/17kb. Total memory used is 63%.

Is there anything else anyone could suggest? Is there some compiler limitation on size? Or numbers of variables? I noticed by examining the .elf that one boolean variable took up 4 bytes, and another 2 bytes which struck me as odd.

Thanks in advance

David
 
First guess not given any code :: Sounds like a bad pointer or array overrun?

Array index not kept in bounds :: int foo[5]; // must only be indexed with foo[0] to foo[4]

Same with any pointers to memory - they have to respect the allocated space they are meant to cover.

Not noted what Teensy is in use … seems like a T_3.2? - or Arduino IDE and TeensyDuino versions? Any external libraries or devices could come into play if ill behaved.

Using almost all of memory is fine if not abused. Only trouble getting closer to full might come first with dynamic allocations or strings or the like.
 
Following degragsters answer with overflows, is the code the you have added to handle rare messages parseing messages of variable length or that can arrive (however unlikely) before the code has emptied the buffer of the last one? You may need to capture memory state during the message parse process and look closely at how worst case memory usage would look.
 
Thanks

I've been caught by the array thing before, so I'm sure it's not that. I've written a VB thing which analyses the .elf to tell me where everything is. So when I see corrupt data, I can look at how variables have been placed and the likely over-run culprits.

Both blocks run fine when included and tested independently. Only when they are included together does the issue occur. And the code in the modules does not have to be called for the system to hang.

It's a Teensy 3.2. I'm on Arduino 1.8.5, Teensyduino 1.40, on Visual Studio 2017 (Win10) with the Visual Micro add-in.
 
Is your code creating objects dynamically on the heap with new()? That's memory usage which can not be foreseen by the compiler and thus does not enter in the memory usage statistics after compilation. So, if your code tries to allocate more memory than available, it will silently die.
 
Hi All

I'm using RamMonitor to examine the heap size and alert to crossover between stack and heap. Not an issue (if I understand RamMonitor correctly).

In any event neither code creates new objects. I can arrange matters such that the message parser receives no messages to parse (that's the new code) and the old code is not executed (it depends on vehicle type). But the problem persists until I remove either source. In other words, if both code blocks are included in compilation, there's a failure elsewhere. If only one block is included, there is no failure anywhere. The failure does not occur in either block.

Is there a reason why, for example:

boolean acceptTaskHighlighted = false, taskAcknowledged = false;

would, according to the elf, result in the first variable having length 4 bytes and the second only 2 bytes given they are of the same type? And why 2 bytes (or 4) for a boolean?
 
With 32bit processors, the compilers make usually some effort to align memory addresses, i.e. for variables, to 32bit boundaries to allow much quicker access.
 
Hello David,

recently (Arduino IDE >= 1.8.5) I'm running into strange issues when flash memory is almost full AND I build in "fast/faster/fastest". This is with well-proven code without leaks, array bound issues etc.
It's difficult to pin down, since it occurs infrequently; so I'm not able to really give an example for bug reporting.
Try building with "Smallest Code" (not "Smallest with LTO". I'm having sporadic runtime issues with LTO).
I know this should not affect the case at hand, but...
 
There are always fun to find!

It may depend on lots of things, like did it fault? Did memory get corrupted? ...

Not sure if @defragster faulting code may find something or not...

I believe there is also a memory address where two parts of memory come together that if you do a memory access across this boundary can fault?

I have seen and done many things that can cause really strange results...

Things like: maybe a function is called, where you allocate a buffer on the stack that get passed into some API that holds onto it as a buffer (example doing some Asynch SPI code). Where it appears to work for a long time as this function was deep in call stack and so nothing else called corrupted this data, until some code path changed which now that part of the stack is used and ... (Fun to find!) Been there on someone elses code... Helped to have hardware debug support to watch the memory location...

Or sometimes you use things like malloc or new and you have code that does it in the main line and code that maybe does it in some form of ISR, like a timer interrupt, like interval timer. So you just happen to be inside a malloc in main line code when ISR comes along and mucks with heap and now things are corrupt...

And of course issues with volatile memory with variables not set as such... Your code is waiting for an event, that happens, but the code does not see it...

Or other timing code issues like:
Code:
   bool in_my_function;
...
   MyFunction();
   in_my_function = true;
   while (in_my_function) ;
where MyFunction may setup for an ISR and the ISR handler clears in_my_function...

Two possible problems here. First I did not setup my variable as volatile.
Second maybe while I am still in the call MyFunction() some ISR happens, which takes time, and my ISR triggers and clears in_my_function,
but when I return from MyFunction, clear that variable and then wait forever as the event already happened...

Again lots of possibilities for random hangs... Hard to say what else with very little information
 
Here is a version I uploaded before that should function:


Unzip and put into {sketchbook}\libraries\debug_t3.h.

Put this in code:: #include "debug_t3.h"

Make sure serial is online:
Code:
setup() {
  while (!Serial && millis() < 5000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
// ...
}

It has some examples, the code above will show common FAULTS by stopping out to USB with notice.

I have improved it minimally - I'm not sure what version this is - I won't be able to update with current for some hours as I need to make sure I didn't leave it broken.
>> I can confirm there is at least a small error as running this now when the HARD fault occurred my prompt was not displayed - and this is probably not in the attached version. But the fault will be trapped.
>> If you post what you see I can help decipher [ I did minimal so far to prevent code growth adding problems ].
>> HACKSTER.io told me about another more elaborate library at 1am - I may look to get some of this stuff added there, but it is more invasive and runs to an app.

Following the examples you can record some TRACE info with this in setup() :: DebState(LED_BUILTIN);
>> This will PANIC blink the Teensy LED on Fault

If your code doesn't Serial2.begin() this line will show it is working :: if ( UART2_C2 ) deb_t3( 9, __LINE__ ); // UART2 not initialized, accessing this register causes a Hard Fault
>> Output for this event it shown below.

And then things like this from one of the current examples - use only the ones in the included example - one or more items below may not be included?:
Code:
void DebugTest2() {
  DebText( __func__ ); // show function entry
[B]  deb_t3( 2, micros() );
[/B]  float x = log (0);
  haltif_t3( isfinite( x ) ); // Halt if float not NAN or INF
[B]  deb_t3( 3, 0x101101 );
[/B]  Serial.printf("INF >> %f\n", x); // Hard Fault with printf()
  deb_t3( 6, millis() );
  x = 0.0 / 0.0;
  deb_t3( 100, 0x111011 );
[B]  assert_t3( !isinf( x ) ); // Testing here - passes isinf()
[/B]  Serial.printf("NAN >> %f\n", x); // Hard Fault with printf()

[B]  deb_t3( 9, __LINE__ );
  if ( UART2_C2 ) deb_t3( 9, __LINE__ );  // UART2 not initialized, accessing this register causes a Hard Fault
[/B]}

This shows a HARD fault occurred after my line #66 "deb_t3( 9, __LINE__ );" that was called from "<< last func::DebugTest2".
>>>> debug_fault >>>> debug_fault >>>> TYPE:hrd
Debug Info:
2 => 382024 0x5D448 [L#54_C#1
3 => 1052929 0x101101 [L#57_C#2
6 => 50052 0xC384 [L#59_C#3
9 => 66 0x42 [L#66_C#4 _<< last func::DebugTest2

fault:
??: F0140430 ??: 4006A000 ??: 10000000
psr:0 adr:FFFFFFE9
lr: E000EDFC r12:1FFF07DC r3: 1FFF159C
r2: E0001004 r1: 21800000 r0: 10C00000
r4: 1FFF15D0 lr: 1FFF07DC
---

CFR:8200
HFR:40000000
DFR:0
MMA:4006C003
BFA:4006C003
ASR0

--- Halting
 
@David_C : can you confirm (post the) code for RamMonitor()?

I suppose I've seen some variants - but I could include one in the startup for the debug_t3 code to provide a notice to any user.

HERE:: Memory-Status-and-Monitoring

I see it links to :: https://sourceforge.net/projects/teensy-3-x-rammonitor

I find this - is it what you are using? I just pulled down the latest version.
Code:
class RamMonitor {
int32_t unallocated() const; // space between heap and stack (current): negitive if heap/stack crash
uint32_t stack_used() const; // stack size (current): grows into unallocated
uint32_t heap_total() const; // heap (dynamic mem) size: can grow into unallocated
uint32_t heap_used() const; // heap allocated
uint32_t heap_free() const; // free heap

int32_t free() const; // free ram: unallocated and unused heap
uint32_t total() const; // physical ram

// these functions (along with initialize and run)
// create the ellusion of stack allocation. 
uint32_t stack_total() const; // stack size (historical)
int32_t stack_free() const; // calc stack usage before next stack growth 
int32_t adj_unallocd() const; // space between heap and "alloc'd" stack: negitive if heap/stack crash
int32_t adj_free() const; // free ram: adjusted unallocated and unused heap

bool warning_lowmem() const; // returns true if stack and heap get close
bool warning_crash() const; // returns true if stack is in danger of overwriting heap

void initialize(); // initializes stack monitoring
void run(); // monitors stack and adjusted unallocated
};
 
After compiling, if you scroll up are there any compiler warnings? Warnings like "breaks strict aliasing" or "array bounds" are the sorts of things you want to investigate. Stuff like unused variables or narrowing conversion can usually be safely ignored.

I can tell you we see this sort of thing reported here regularly, where programs work with stuff written in some order but break if the order of functions is changed, or when adding or removing something. Those sorts of problems almost always turn out to be a buffer overflow. Usually it's something subtle, like using "<=" (less than or equal) in a for loop when it should have been "<" (less than) to avoid writing beyond the end of an array.

I know you said you're sure you haven't made this mistake. But I can tell you we see this here over and over. The symptoms are almost always described as the code works normally but breaks when something else unrelated is added or removed elsewhere in the code. That "something else" ends up changing the order or placement of where the linker allocates your variables. It's entirely possible your problem is different, but I'm bringing this up because we see it on this form every week or two. The description is pretty much the same as what you've said.
 
Thanks, defragster and others

Yes, that's the RAM Monitor I was using.

I added defragsters' debug code and got this output at the point it hung:

>>>> debug_fault >>>> debug_fault >>>> TYPE:hrd
Debug Info:

fault:
??: 1FF05
??: 1FFFFFA7
??: 3135322C
psr:0
adr:20007AD0
lr: 20000003
r12:FFFFFFF9
r3: 0
r2: 5C
r1: 20007A74
r0: 5C
r4: 5C
lr: 1FFFA95C

When I added the code shown in defragsters setup, I got a few more errors, such as:

C:\Users\David\Documents\Sketches\GEN3_MP_CAN\GEN3_MP_CAN.ino Oct 30 2018 23:49:41
_:L=226: Hello World 504025
_:L=229: ping
___ HALT in ___ FILE >> C:\Users\David\Documents\Sketches\GEN3_MP_CAN\GEN3_MP_CAN.ino
___ in function >> setup() at LINE# >> 231
Expression >> isfinite(x)

>>>> HALT If true >>>> program Paused Debug Info:
3 => 504 0x1F8 [L#230_C#2 _<< last func::setup
7 => 504031 0x7B0DF [L#227_C#1

fault:
??: 1CF15
??: FFF00000
??: 0
psr:20001BDC
adr:7F800000
lr: 338DC
r12:2209F
r3: 0
r2: 0
r1: E7
r0: 3238C
r4: 32380
lr: 338DC
---

___ HALT Triggered ___ 'y' to continue ...


So, if you could provide some guidance as to where to look next, I'd greatly appreciate it!
 
Cool ! … errr … that sukz :)

Hopefully the following will help get more useful info - but that shows it is working and being triggered.

That first blob:' >>>> debug_fault >>>> debug_fault >>>> TYPE:hrd'
Shows that a HARD fault was trapped, but without any calls like below there is no trace info to show where it was when that happened. Adding calls like below in a methodic fashion - or where 'just before' you suspect trouble {those 'just after' won't show } - you can locate the code that is causing the trouble.

The other 'register' information presented {psr:0 … adr:20007AD0 } is borrowed from Paul's code - and I don't know the values well enough to do a cold read and tell you what that means.

Except for the setup additions - the rest of that stuff was just for example to track and log file locations before a fault, it shows the system is working { by forcing faults you should remove - it will tell you the line # to remove or correct " setup() at LINE# >> 231" } and if 'y' had been entered on the SerMon it would have resumed execution - to begin with add the #include and ONLY this to your code:
Code:
#include "debug_t3.h"

setup() {
  while (!Serial && millis() < 5000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  DebState(LED_BUILTIN); // If you want to see the LED blink for attention of a fault.
  deb_t3( 0, millis() ); // will record the time in index [0] when Serial was online.
// ...
}

Then you can drop things like this to track progress :
DebText( __func__ ); // show function entry

And this - where param 1 is index 0-9 and will record the uint32_t value passed as param 2 if you have a variable to watch, you can change param 1 as desired within or between functions to trace progress.
deb_t3( 2, micros() );

That will also give Line# ( "[L#__" ) an idea of the last "deb_t3()" call ( "C#__" ) made before a fault occurs when it prints a table like that shown (before that is the Param 1 index and then the value passed in DEC and then HEX) :
>>>> HALT If true >>>> program Paused Debug Info:
3 => 504 0x1F8 [L#230_C#2 _<< last func::setup
7 => 504031 0x7B0DF [L#227_C#1

Shows an index ordering of when that deb_t3() was called so you can trace the order of them.
 
Excellent, thank you!

I can't pretend to understand how it works, but I've been able to use it to identify a line which causes an issue, which is:

logFile.println(strlogMessage);

This simply uses The Greiman SD library to print a line to a file.

The debug output is:

>>>> debug_fault >>>> debug_fault >>>> TYPE:hrd
Debug Info:
1 => 25315 0x62E3 [L#12_C#25 _<< last func::sysMessage

fault:
??: 1FA89
??: 1FFFFFF3
??: 65747261
psr:5
adr:20007E94
lr: 20000003
r12:FFFFFFF9
r3: 0
r2: 25
r1: 20007E84
r0: 25
r4: 25
lr: 1FFFA978
---

If I understand it correctly, it's showing that this function has been called 25 times, so has worked on the previous 24.

Unfortunately, I have no idea how to delve into the library to work out what the issue could be, but it's progress!

Thanks

David
 
Just testing the current debug_t3 code as I have tested a bit on a T_3.1 - here is an updated copy:
>> When it halts then continue or other prompts are shown, and did a quick add of the RamMonitor checking and prints.
>> Note you can put this function in your sketch to call when prompted by entering 'd'. You can customize this to DUMP/print critical values when the code stops - even during a FAULT::
Code:
void Debug_Dump(void)
{
  DebSer.print(" User Custom Debug Dump. Micros==");
  DebSer.println(micros());
  // Put code here to call after Fault or HaltIf() - or call it in sketch
  return;
}

Looks like it shows that much … Have to assume that is line 12 in sysMessage - and if that deb_t3( 1, Value? ) was only put in one place and that was the 25th call.

From that the length of strlogMessage should be monitored?

Only knowing the bits stated about the code … Perhaps:
Code:
deb_t3( 0, micros() );
deb_t3( 1, strlen( strlogMessage ) );

logFile.println(strlogMessage);

deb_t3( 2, micros() );
deb_t3(100, micros() ); // this will dump the state of the trace so far

This should add index 0 called with time for fun, then length of the message if it is properly terminated, then log the message, then add index 2 with the time again to show how long it took.

Adding the call with 100 will just dump the trace each time it is called - in your case 24 times?
 
Thanks, defragster

It does indeed do what you say.

The output is:

>>>> debug_fault >>>> debug_fault >>>> TYPE:hrd
Debug Info:
0 => 25189 0x6265 [L#14_C#76
1 => 92 0x5C [L#15_C#77 _<< last func::sysMessage
2 => 24963 0x6183 [L#19_C#75

fault:
??: 1FB29
??: 1FFFFFBF
??: 3030382C
psr:9
adr:20001A8C
lr: 20000003
r12:FFFFFFF9
r3: 0
r2: 5C
r1: 20001A48
r0: 5C
r4: 59
lr: 1FFFA978

So the string is of length 92 and strLogMessage is set to be 250, so it isn't that. I guess because the debug on line 19 has a lower count (75) than the one on line 15 (77) that it broke on the offending line 17 (16 and 18 are blank for clarity).

I notice sometimes it halts after dumping the debug output, and sometimes not. What controls that?
 
Yes ,you are reading that as designed.

This points to line #15 which has highest count #77 as the last one called: "1 => 92 0x5C [L#15_C#77 _<< last func::sysMessage"

And that is what the '<< last func' tries to point out, where I'm trying to limit the printing and code size overhead so that adding the debug code isn't problematic.

Whatever was after that kept it from getting to line #19 as you say.

So the next step is to figure out what line #17 is doing to fault the program. It should work to add the .h and some of the deb_t3 calls in that code.

It seems the deb_t3( 100, micros() ); got put in. By design that will show the TRACE log list and continue without stopping.

Do those show any trend on the string length or the time it takes to process line #17?

<edit>: David C :: Not that it matters - but the prior post was not using the latest version from p#16 - so you missed the added 'memory report' ( that adds over 1K of code _ may remove it ). I also compressed the register print since it took up many lines and only adds value at rare times.
 
Last edited:
Paul or somebody with background might be able to identify the registers and values printed to give an idea as to what caused this Hard Fault - what that may mean and how it got there?

As noted I lifted that code from PJRC sources as it had code that kept the USB alive to allow the text notices shown after the processor would normally just go DEAD.

If there are other values or programmatic ways to display useful info in cases like this hat would be welcome.
 
Paul or somebody with background might be able to identify the registers and values printed to give an idea as to what caused this Hard Fault - what that may mean and how it got there?
would a map file not be useful to identify the incriminating variable?
 
Hi WMXZ

If by map file you mean the location of all the variables in memory, I have such - obtained by decoding the .elf. I can't work out where the functions have been placed.

The offending line is a simple write of a line to the SD using the library (logFile.println(strlogMessage);), which works 24 times out of 25 when all code is included and 1000 times out of 1000 when 200 unused source code lines are omitted. The length of strLogMsg is within limits.

I need now to figure out how to get into the SD stuff to see if the issue might be there. I am not so sure.

As Paul says, it may be that the linker has done something interesting when code is added or removed and split something up which should not be split up. I have seen this with Microchips. You either press on and add more code until the problem disappears, or take something out.

It would be good to know what the other debug output is, though.
 
Build time address of a function is a truncated offset it seems as it move from flash to RAM at runtime. Is that what shows in the map?

#define addr_t3( a ) { DebSer.print( #a); DebSer.print( " addr: " ); DebSer.println ( (uint32_t)a,HEX ); }

Not sure if this helps - locates a variable on the stack in a function:

#define where_t3( ) { if(1) {int where_ii; DebSer.print( __func__); DebSer.print( " stack: " ); DebSer.println ( (uint32_t)&where_ii,HEX );} }


Code:
void foo( void ) {
  where_t3( );

// test
  addr_t3( foo );
  int ii;
  addr_t3( &ii );
  int jj;
  addr_t3( &jj );

  // Code Here
}

Shows:
foo stack: 20007FC4
foo addr: 46D
&ii addr: 20007FC0
&jj addr: 20007FC4
 
Some time ago, I added lines into Teensyduino's platform.txt file to automatically generate .sym and .lst files while compiling. The .sym file will give you the addresses of all functions and all global and static variables. The .lst shows the actual assembly generated by the compiler after linking (so it has all the actual addresses embedded).

You can find these files in the temporary folder where Arduino compiled your program. On Mac and Windows this is usually a hidden folder. The easiest way to find its location is to turn on verbose output while compiling in File > Preferences.

Of course, these files will only have the addresses that the compiler+linker assigned. Local variables on the stack and malloc/new allocations on the heap are computed as your program runs, so they won't be in these files.

One minor detail to keep in mind is the ARM processor sometimes requires the LSB of an address to be set, indicating "thumb" format instructions. Teensy always uses thumb mode. When you see a function address like 00000653 and try to match it up with the .sym file or search through the .lst file, remember to change it to 00000652. Thumb instructions are always aligned to 16 bit boundaries.
 
Status
Not open for further replies.
Back
Top