TeensyDigital
Well-known member
I have an amateur rocket flight telemetry application on a T4.1 that runs a main loop and two thread loops. The code base is (relatively) large, as it is tracking and managing over 100 variables and sampling sensors at 400 Hz. One of my thread loops (A) is for logging data to QSPI Flash using LittleFS and the other thread loop (B) is for sampling four I2C sensors (2 accelerometers, a gyro, and a barometer) and doing quaternion math on the output. The main loop is monitoring flight events, serial GPS data, serial radio data, and another dozen "housekeeping" items. Recently I have been working on the LittleFS logging and experimenting with the logging rates. In my bench testing I started to notice that the T4.1 was crashing randomly about every 90 minutes with a "Data Access Violation". After 3-4 crashes I started to get organized about isolating the cause. In the field it is rare that the device is on for over an hour, but we have seen resets in the field that we (falsely?) attributed to RFI issues. In the last week of testing I've also seen the crash happen within minutes.
I suspect the issue relates to shared variables across the sampling and logging loops that are causing instability. I have about 75 variables, organized in structs, that get populated (written) in the (B) sampling thread. The (A) logging thread "reads" these variables, constructs long strings with them and then writes the log entry to flash. There are only about 5 variables (counters) that are updated by both threads. These variables I have declared as volatile, but I do not have the other 75+ variables declared as volatile (although I tried and it didn't make a difference). I am not doing anything with pointers or writing directly to memory. The strings for logging are being constructed using strcpy, strcat, and sprintf. Most of the sensor data is different size ints and floats.
My threads are initiated with a stack size of 4096 and 2048 and are running with setMicroTimer to 5 (very small time slices for massive amounts of fast ITC reading). I've also got my code loaded with thread.yields() after every logic step in the code in all three threads.
Here is what I know, so far...
1. If I turn off thread A or thread B the device does not crash, but when they are both running it crashes at a random time from 2-120 minutes in duration.
2. In thread B (sampling), if I comment out 3 of the four sensors it still crashes, regardless of which sensor I keep active. Each sensor read also results in a lot of computation and global variables being set. There are no shared variables across the four sensor loops.
3. If I comment out all of the LittleFS code in thread A and only retain the logging string construction the board still crashes. So, it does not seem to be related at all to LittleFS.
4. I am monitoring the stack size in both threads and they are both stable and healthy.
5. If I increase the rate of logging (rate of string construction) to 150-200 times per second the crashes are more frequent (2-20 minutes). If I decrease the rate of logging (string construction) to 10X per second the crashes still happen, but I have gone as long as 8 hours without a crash.
6. I was working in TD1.55, for better LFS performance, but I moved everything to the latest TD1.56 and there was no change in behavior.
I have been experimenting with locking around the string construction functions in the logging loop, but that seemed to make it worse not better. Although, I don't feel like I fully have a handle on how locking works and the implications to the other thread. I am not sure how sensitive threading is to shared use of variables. It is obvious to me that if a variable is getting updated by two different threads it needs to be volatile, but what if one thread writes and the other thread just reads the shared variable?
I've got at least 40-50 crash reports that are all the same "Data Access Violation", but the Code Execution address and the Accessed Address are always different, event when I just recycle the power and rerun over and over without touching code.
I appreciate any help to troubleshoot further. I am going to continue to try and isolate the code to something reasonably small and frequently repeatable, but my trial and error cycle time is still 60-90 minutes.
Any insight into 1) better shared variable management with threading (e.g., how exactly does locking work?) or 2) any other diagnostics I can use to evaluate the crash reports or the environment as it is running?
Thanks,
Mike
Example Crash Reports:
ps -- I love the new CrashReport capture feature on restart. That will make recovery "in the air" much easier.
I suspect the issue relates to shared variables across the sampling and logging loops that are causing instability. I have about 75 variables, organized in structs, that get populated (written) in the (B) sampling thread. The (A) logging thread "reads" these variables, constructs long strings with them and then writes the log entry to flash. There are only about 5 variables (counters) that are updated by both threads. These variables I have declared as volatile, but I do not have the other 75+ variables declared as volatile (although I tried and it didn't make a difference). I am not doing anything with pointers or writing directly to memory. The strings for logging are being constructed using strcpy, strcat, and sprintf. Most of the sensor data is different size ints and floats.
My threads are initiated with a stack size of 4096 and 2048 and are running with setMicroTimer to 5 (very small time slices for massive amounts of fast ITC reading). I've also got my code loaded with thread.yields() after every logic step in the code in all three threads.
Here is what I know, so far...
1. If I turn off thread A or thread B the device does not crash, but when they are both running it crashes at a random time from 2-120 minutes in duration.
2. In thread B (sampling), if I comment out 3 of the four sensors it still crashes, regardless of which sensor I keep active. Each sensor read also results in a lot of computation and global variables being set. There are no shared variables across the four sensor loops.
3. If I comment out all of the LittleFS code in thread A and only retain the logging string construction the board still crashes. So, it does not seem to be related at all to LittleFS.
4. I am monitoring the stack size in both threads and they are both stable and healthy.
5. If I increase the rate of logging (rate of string construction) to 150-200 times per second the crashes are more frequent (2-20 minutes). If I decrease the rate of logging (string construction) to 10X per second the crashes still happen, but I have gone as long as 8 hours without a crash.
6. I was working in TD1.55, for better LFS performance, but I moved everything to the latest TD1.56 and there was no change in behavior.
I have been experimenting with locking around the string construction functions in the logging loop, but that seemed to make it worse not better. Although, I don't feel like I fully have a handle on how locking works and the implications to the other thread. I am not sure how sensitive threading is to shared use of variables. It is obvious to me that if a variable is getting updated by two different threads it needs to be volatile, but what if one thread writes and the other thread just reads the shared variable?
I've got at least 40-50 crash reports that are all the same "Data Access Violation", but the Code Execution address and the Accessed Address are always different, event when I just recycle the power and rerun over and over without touching code.
I appreciate any help to troubleshoot further. I am going to continue to try and isolate the code to something reasonably small and frequently repeatable, but my trial and error cycle time is still 60-90 minutes.
Any insight into 1) better shared variable management with threading (e.g., how exactly does locking work?) or 2) any other diagnostics I can use to evaluate the crash reports or the environment as it is running?
Thanks,
Mike
Example Crash Reports:
Code:
CrashReport:
A problem occurred at (system time) 0:29:48
Code was executing from address 0x17C2A
CFSR: 82
(DACCVIOL) Data Access Violation
(MMARVALID) Accessed Address: 0x343331
Temperature inside the chip was 42.50 °C
Startup CPU clock speed is 396MHz
Reboot was caused by auto reboot after fault or bad interrupt detected
CrashReport:
A problem occurred at (system time) 0:30:28
Code was executing from address 0x8B8A
CFSR: 82
(DACCVIOL) Data Access Violation
(MMARVALID) Accessed Address: 0x39003836
Temperature inside the chip was 41.20 °C
Startup CPU clock speed is 396MHz
Reboot was caused by auto reboot after fault or bad interrupt detected
CrashReport:
A problem occurred at (system time) 21:28:25
Code was executing from address 0x15C8E
CFSR: 82
(DACCVIOL) Data Access Violation
(MMARVALID) Accessed Address: 0x303000
Temperature inside the chip was 41.85 °C
Startup CPU clock speed is 396MHz
Reboot was caused by auto reboot after fault or bad interrupt detected
CrashReport:
A problem occurred at (system time) 0:3:51
Code was executing from address 0x9776
CFSR: 82
(DACCVIOL) Data Access Violation
(MMARVALID) Accessed Address: 0x380000
Temperature inside the chip was 41.85 °C
Startup CPU clock speed is 396MHz
Reboot was caused by auto reboot after fault or bad interrupt detected
ps -- I love the new CrashReport capture feature on restart. That will make recovery "in the air" much easier.