4.1 crashes with SD card in strange state

Fenichel

I have a large (counting libraries, > 400K lines) application that now crashes every few days. I have a clue as to what is going on, but I don't know how to interpret the clue. I wonder if it will ring a bell with anyone else.
What I see is that the app has tried to reboot, as if power had been cycled, but its initialization code freezes when SD.begin(BUILTIN_SDCARD) returns false. If I then manually power-cycle the system, it reboots normally. In other words, the failure seems to leave the uSD hardware in a state from which SD.begin cannot reset it, but power-cycling can.
  1. Does this story sound familiar to anyone, suggesting what might be happening?
  2. Is there a way to reset the uSD system from whatever strange state it might be in, stronger than SD.begin?
 
@Fenichel: I don't have a direct answer to your question, but maybe this may prove beneficial:

If your code (or any included library) triggers a fault (e.g. a NULL pointer dereference), the Teensy will automatically reboot, and you can get the cause of the reboot by including the following in your setup() function:

Code:
if (CrashReport) {
   Serial.print(CrashReport);
}

Note that, if a crash occurs, the crash information will be printed in the SerialMonitor. The report will include an address where the crash was recorded. Paul has also created a CrashReport() <webpage> with useful details. You can also check the entry in the unofficial Teensy wiki <here> for links to descriptions of where to find the addr2line utility for both the old (1.8.x) & new (2.3.x) Arduino IDE, as well as detailed descriptions of how to use it.

Good luck . . .

Mark J Culross
KD5RXT
 
A lot of examples have something like the following in setup():
C++:
  if (!SD.begin(chipSelect)) {
    Serial.println("Card failed, or not present");
    while (1) {
      // No SD card, so don't do anything more - stay stuck here
    }
  }
This is unhelpful, particularly as a re-try often works. If your code has something like that, try replacing it with:
C++:
  while (!SD.begin(chipSelect)) {
    Serial.println("Card failed, or not present - retrying...");
    delay(100);
  }
I've had this happen on numerous occasions, often when uploading new firmware to a running Teensy. It does suggest that begin() could be made a bit more robust, but I have no idea where.

You could also limit the number of re-tries, if your system can usefully limp into action without SD access, and e.g. prompt the user to cycle power.
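A minimal sketch of the limited-retry idea, written as plain C++ so that it can be tried off the board. The helper name beginWithRetries and its parameters are hypothetical (they are not part of the SD library); the caller supplies the actual begin call and the delay.

```cpp
#include <functional>

// Hypothetical helper (not part of the SD library): call tryBegin up to
// maxAttempts times, pausing between failed attempts.
// Returns the 1-based number of the attempt that succeeded,
// or 0 if every attempt failed.
int beginWithRetries(const std::function<bool()>& tryBegin,
                     int maxAttempts,
                     const std::function<void()>& pause)
{
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (tryBegin()) {
      return attempt;          // card came up on this attempt
    }
    if (attempt < maxAttempts) {
      pause();                 // e.g. delay(100) on the Teensy
    }
  }
  return 0;                    // gave up; caller decides how to limp along
}
```

On the Teensy this might be called as `beginWithRetries([] { return SD.begin(BUILTIN_SDCARD); }, 20, [] { delay(100); })`, with a result of 0 meaning "carry on without SD and ask the user to cycle power". The limit of 20 is only an illustration.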
 
Thanks again to @h4ynnym0u5e. Retrying the call to SD.begin gets things back on track. My app is still crashing now and then, but it is a data-collecting satellite, sending data to a central station, and the system as a whole tolerates brief absences of one or another satellite. Of course, I wish I knew what was causing it to crash, but that is not my highest priority at the moment.
 
Any chance you could craft a version of your program which reproduces this problem without any special hardware? Or with hardware that's relatively easy to buy and connect, like a specific sensor module sold by SparkFun or Adafruit.

To have any hope of investigating what's really going on and improving the SD card init to recover, we'll need a program which reproduces the crash and leaves the SD card in that state.
 
I'm willing to try; the hardware now connected to that board is
  • an XBee transceiver (UART)
  • an anemometer, connected through a debouncer board to an interrupt
  • a VC0706 camera (UART)
  • an ILI9341 TFT display (SPI)
  • an SPS30 (UART)
  • misc analog inputs
  • a DS1307 RTC (I²C)
My satellite boards all have the same code, configured by a file on the µSD card, read only as part of the boot sequence. The file tells the main program's setup routine which drivers to initialize. Thus, this board's program includes a bunch of device drivers for sensors not in use. The unused-sensor drivers are instantiated (that is, they define classes, and instances of those classes are declared), but their respective begin/setup/<etc> routines are never called. I don't think that they are where the problem will be found.

This stream was the first time I heard about the CrashReport/addr2line tools, and I was thinking about trying to use them, but the hope of capturing the crash data on the µSD card seems dim, inasmuch as the crash somehow breaks the µSD connection.

I'm open to suggestions as to what to try first.
 
@Fenichel: I would strongly suggest that you include the crash report generator as described in the earlier post. It costs you absolutely nothing if there is no crash to report, and it generates very specific information that will definitely help you to identify the cause in the event that a crash occurs. By including this simple snippet of code, you are prepared for a crash (which, from what you describe, is an eventuality). It may not be your highest priority, but resolving your crash may help to clarify anything else that is occurring.

The crash report information (with the example given) is printed to the Serial Console during the next Teensy boot cycle, but it could theoretically be sent anywhere through any connectivity, since it's just text. A broken microSD connection would therefore in no way preclude you from investigating the crash, as long as you have something connected to the Serial Console on which to view the report (or you could even send the crash report to your central station over your radio data connection).

If your program is crashing, no matter the reason, including the simple code listed in the earlier post is almost guaranteed to get you the info that you need to identify exactly what caused the crash & where. Don't be so quick to dismiss and/or ignore the value of this capability, since it costs nothing to add but a few lines of simple code !!

Mark J Culross
KD5RXT
 
the hope of capturing the crash data on the µSD card seems dim, inasmuch as the crash somehow breaks the µSD connection.
…but because…
Retrying the call to SD.begin gets things back on track
you could presumably check for the presence of the CrashReport after you have the SD online, and then write it to a crash log folder.

It has to be said I’ve never written a crash report to anything apart from Serial, but the documentation page shows examples.
 
You're right, I didn't think it through. I have added code to look for the existence of CrashReport and save it to a file on the µSD card. Now I'm just waiting for my app to crash again.

Incidentally, some of the directory structure used by the compiler seems to have changed since @carlsuttle's helpful post. I think his
Code:
C:\Users\<user name>\AppData\Local\Temp\arduino-sketch-<GUID>\<file name>.elf
should now be
Code:
C:\Users\<user name>\AppData\Local\arduino\sketches\<GUID>/<file name>.elf
 
OK, CrashReport says
Code:
A problem occurred at (system time) 21:49:23
  Code was executing from address 0x1C358
  CFSR: 82
    (DACCVIOL) Data Access Violation
    (MMARVALID) Accessed Address: 0x0 (nullptr)
      Check code at 0x1C358 - very likely a bug!
      Run "addr2line -e mysketch.ino.elf 0x1C358" for filename & line number.
  Temperature inside the chip was 53.38 °C
  Startup CPU clock speed is 600MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected
and addr2line says the problem was in
Code:
C:\Users\rrf\AppData\Local\Arduino15\packages\teensy\hardware\avr\0.60.4\cores\teensy4/memcpy-armv7m.S:125

My code must be far, far away in the call stack, unless I got here by a wild transfer. I will try some breadcrumbs.
 
This is a very common mistake: your code passed a pointer to some function call that attempts to make a copy to/from the invalid (NULL) pointer. This can certainly occur in one of several ways. You can be sure that this is NOT a problem with the identified source code and line, in spite of the fact that this is where the violation actually occurred (this time). It is absolutely a problem in your sketch (I don't mean that in any way to be insulting, but rather to provide assurance that no time should be wasted attempting to figure out what may be wrong with memcpy-armv7m !!). Maybe a visual inspection of your source, concentrating specifically on any code where a pointer is passed and/or where a copy of a variable is being made, might identify the offending line.

Good luck & hope you can discover the underlying cause quickly !!

Mark J Culross
KD5RXT
 
This can also quite easily happen if dynamic memory runs out e.g. if you're using C++ objects like String or std::vector (exceptions aren't supported on Teensy).
 
Thanks. I had no doubt that the problem was in my code. I am waiting for the next crash, and then I'll be closing in with more breadcrumbs.

This can also quite easily happen if dynamic memory runs out e.g. if you're using C++ objects like String or std::vector (exceptions aren't supported on Teensy).
Right. I don't use String, and I don't directly do any dynamic memory allocation.
 
As noted in my original post in this thread, my bug leaves the µSD drive in a state that is resistant to SD.begin() but is reset by power cycling. Using CrashReport & its associated breadcrumbs, I've found that the problem arises when SD.open(<filename>, FILE_WRITE) returns a null pointer. This call is made many times a day, with a filename like 10231105.jpg each time. The failure of SD.open(<filename>, FILE_WRITE) happens only every day or so. The pertinent code
Code:
bool tVC0706::SaveSnapshot()
  { // debulk TakeSnapshot
    bool Result = false;
    char Message[100];
    const uint16_t GulpSize = 64;  // seems OK, but 32 might be more stable
    const int ReassureAfterWrites = 100;

    SetCrumb(crumbVC0706, 68);
    FileSystem.ClosePrivateFiles();
    File SnapshotFile = SD.open(TheFileName, FILE_WRITE);

    if (SnapshotFile)  // new crash-avoidance test
      { // file opened
        uint16_t Remainder = AF_VC0706.frameLength();
        sprintf(Message, "Writing %d-byte image to %s.",
                           Remainder, TheFileName);
        Infrastructure.SayInIDE(Message);

        int WriteCount = 0;
        while (Remainder > 0)
          { // write out another gulp
            SetCrumb(crumbVC0706, 82);  // used to crash between here ***********************************
            uint8_t *Buffer;
            uint8_t BytesToRead = min(GulpSize, Remainder);
            Buffer = AF_VC0706.readPicture(BytesToRead);
            if (Buffer)
              { // got buffer
                SetCrumb(crumbVC0706, 88);
                SnapshotFile.write(Buffer, BytesToRead);
                digitalWrite(pinuSDNonEmpty, HIGH);
                WriteCount++;
                if (Infrastructure.InIDE and ((WriteCount % ReassureAfterWrites) == 0))
                  { Serial.print(DOT); } // reassure waiting user
                Remainder = Remainder - BytesToRead;
                SetCrumb(crumbVC0706, 95);
              } // got buffer
            else
              { // Remainder NZ, but no buffer
                SetCrumb(crumbVC0706, 99);
              } // Remainder NZ, but no buffer
          } // write out another gulp   // and here ****************************************************

        SnapshotFile.close();
        msSinceLastImage = 0;
        Result = true;
      } // file opened
    else
      { // couldn't open file
        Infrastructure.SayInIDE("couldn't open file");
        SetCrumb(crumbVC0706, 110);
      } // couldn't open file

    Infrastructure.SayInIDE("done");
    SetCrumb(crumbVC0706, 114);
    return Result;
  } // tVC0706::SaveSnapshot()
is inside a routine that uses the Adafruit VC0706 library to capture images from a motion-detecting camera. I have now modified the code (as shown) to simply discard an image if the file can't be created; this is OK, but I regret discarding the images.

Under what conditions should SD.open(<filename>, FILE_WRITE) be expected to fail like this?
 
bug leaves the µSD drive in a state that is resistant to SD.begin()

I wonder whether the SD card would be better served if you buffered the 64-byte gulps into a 512-byte buffer and wrote only 512 bytes at a time (until only the tail of the file was left to write).

512/64 = 8, so the image would be read and buffered faster (between the larger writes), and the SD card would write more efficiently in its preferred 512-byte blocks.

It's easy enough to try, to see whether that keeps the SD card functional.

You could time the current "SnapshotFile.write(Buffer, BytesToRead);" call, then compare the average and cumulative time for writing the whole file. You might see the overall write time improve with buffered 512-byte writes, which would mean less mucking around with the SD card manipulating data.
 
Thanks for your suggestion. Except here (where I copied the use of 64B gulps from some Adafruit code somewhere, and the comment ("32 might be more stable") comes from there too), I always read & write µSD cards in 512-byte bufferloads, for exactly the timing reasons that you mention. Probably I should have (a) obeyed Adafruit's small-gulp practice for dealing with the VC0706 but then (b) accumulated the gulps into 512B bufferloads. I will add that to the code soon, and I expect it to get faster.

But: The problem here doesn't seem to be a timing problem at all. When SnapshotFile is a null pointer, SnapshotFile.write(<buffer>, <N>) should crash no matter what <N> is. SD.open(<filename>, FILE_WRITE) is doing me wrong before the timing clock starts.
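For reference, accumulating the small camera gulps into 512-byte bufferloads could look roughly like the sketch below. It is plain C++ so it can be tested off the board; the class name BlockAccumulator and the sink callback are hypothetical, and on the Teensy the sink would be whatever actually writes to the open file.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <functional>
#include <utility>

// Hypothetical helper: collects small gulps and hands them to `sink`
// in full 512-byte blocks; flush() writes whatever tail is left.
class BlockAccumulator {
public:
  static constexpr size_t kBlockSize = 512;
  using Sink = std::function<void(const uint8_t*, size_t)>;

  explicit BlockAccumulator(Sink sink) : sink_(std::move(sink)) {}

  // Copy `len` bytes in; every time the block fills up, pass it to the sink.
  void add(const uint8_t* data, size_t len) {
    while (len > 0) {
      size_t room = kBlockSize - used_;
      size_t n = (len < room) ? len : room;
      memcpy(block_ + used_, data, n);
      used_ += n;
      data += n;
      len -= n;
      if (used_ == kBlockSize) {
        flush();
      }
    }
  }

  // Write out any partially filled block (call once before closing the file).
  void flush() {
    if (used_ > 0) {
      sink_(block_, used_);
      used_ = 0;
    }
  }

private:
  Sink sink_;
  uint8_t block_[kBlockSize];
  size_t used_ = 0;
};
```

In SaveSnapshot the sink might be `[&](const uint8_t* p, size_t n) { SnapshotFile.write(p, n); }`, with `add(Buffer, BytesToRead)` replacing the direct write inside the loop and `flush()` called just before `SnapshotFile.close()`. Each gulp is copied immediately, so this does not depend on how long the camera library keeps its buffer valid.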
 
32 might be more stable
I saw that comment, and maybe 32 would be a better size to read and buffer.

It's not clear why .open should ever fail, unless the small writes force the underlying code and hardware to deal with partial-buffer writes to fill the native 512-byte blocks. My thought was that not doing that so many times over the days before it fails might be the solution.
 
I may not understand your comment. I agree that writing short strings to the µSD is inefficient, but SD.write should be able to handle them without poisoning itself or SD.open, if only because it needs to handle short strings as the tail ends of files. Is your theory that the gulp size of 64 causes the Adafruit VC0706 code to go wild, sending garbage into the SD code or data, leading to the (much later) null result of SD.open? I suppose that's possible, but it does seem farfetched. What the hell, I'll try it out.
 
Also, don't know if this could be part of the problem, but frameLength() returns uint32_t, and you are assigning its value to uint16_t Remainder.
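To illustrate the narrowing: storing a uint32_t in a uint16_t keeps only the low 16 bits, so any length above 65535 is silently changed. The function name below is made up for the illustration.

```cpp
#include <stdint.h>

// Returns true when `length` would not survive being stored in a uint16_t.
bool truncatesInUint16(uint32_t length)
{
  uint16_t narrowed = static_cast<uint16_t>(length);  // keeps only the low 16 bits
  return narrowed != length;
}
```

For example, a 70000-byte length would be stored as 70000 - 65536 = 4464, while anything up to 65535 is stored unchanged.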
 
SD.open should return an object, yes. But sometimes it returns a null pointer, and I now detect that (as shown in post #14), so the program doesn't crash any more. I'll fix the uint16_t/uint32_t business, but all of the frameLengths from the VC0706 are the same size, so the 16/32 problem should be consequential always or never.
 
The test eliminates the null-pointer crashes that used to occur. Would
Code:
if (SnapshotFile == null)
have been more correct?
 
It never returns a pointer, it returns a File object which has a method to evaluate it and return a bool.

Code:
Buffer = AF_VC0706.readPicture(BytesToRead);
Where does the memory come from that this function returns? Are you meant to release it when you're done with it?

Opening files requires allocating memory. I think it's failing to open the output file because you have a memory leak somewhere else.
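On the point that open returns an object rather than a pointer, here is a toy model of the idiom. ToyFile and toyOpen are stand-ins invented for this illustration, NOT the real File class or SD.open, and the real class may differ in detail. The idea is only that the object itself can be tested in an if statement, because it converts to bool.

```cpp
// Toy stand-in for a file handle; it is NOT the real File class.
class ToyFile {
public:
  ToyFile() = default;                          // "open failed" state
  explicit ToyFile(int handle) : handle_(handle) {}

  // Lets `if (file)` ask the object whether it refers to an open file.
  explicit operator bool() const { return handle_ >= 0; }

private:
  int handle_ = -1;
};

// Toy stand-in for an open call that can fail.
ToyFile toyOpen(bool succeed)
{
  return succeed ? ToyFile(3) : ToyFile();
}

// Same shape as the test already used in SaveSnapshot.
bool openedOk(bool succeed)
{
  ToyFile f = toyOpen(succeed);
  if (f) {
    return true;    // file opened
  }
  return false;     // couldn't open file
}
```

Seen this way, the existing `if (SnapshotFile)` test is the usual form, and there is no pointer to compare against.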
 
I copied the core of my code from Adafruit's MotionDetect.ino example:
Code:
  File imgFile = SD.open(filename, FILE_WRITE);

  uint32_t jpglen = cam.frameLength();
  Serial.print(jpglen, DEC);
  Serial.println(" byte image");

  Serial.print("Writing image to "); Serial.print(filename);

  while (jpglen > 0)
    { // read 32 bytes at a time;
      uint8_t *buffer;
      uint8_t bytesToRead = min((uint32_t)32, jpglen); // change 32 to 64 for a speedup but may not work with all setups!
      buffer = cam.readPicture(bytesToRead);
      imgFile.write(buffer, bytesToRead);

      //Serial.print("Read ");  Serial.print(bytesToRead, DEC); Serial.println(" bytes");

      jpglen -= bytesToRead;
    } // read 32 bytes at a time;
  imgFile.close();
  Serial.println("...Done!");
  cam.resumeVideo();
  cam.setMotionDetect(true);
I haven't looked at the underlying library, but I have assumed that it is allocating & releasing the buffer appropriately.

The idea that SD.open is failing because of memory outage from a leak somewhere else is a good thought. I will work on that.
 