T4.1 SPI bus hard crash - potential bug in SPI library?

jensa

Well-known member
Hi,
On a project I have a board with no less than 14 ADCs reading K-type Thermocouples and PT100 sensors. These are all SPI and I've been running them successfully for a year by now. For the latest revision of the board, we're using the T4.1 instead of T3.6 that we've used until now and judging by the number of posts on the forum, there are indeed a few things not yet discovered when it comes to SPI.

In my case, code that worked stopped working at completely random intervals (but within a few seconds). It's not the sensor that stops working; it's a hard crash of the IMXRT1060 chip itself that stops code execution. I have now nailed it down to where the error must be and I have a workaround. I do however think that this is a bug that needs to be addressed in a future version of the SPI library. The SPI chip I'm using to reproduce the bug is the MAX31855. It's been our workhorse for thermocouples for many years and it's been very stable with T3.2 and T3.6.

Reading the MAX31855 with this code will crash within a few seconds:

Code:
uint32_t MAX31855::spiread32(void) {
  uint8_t buf[4];
  uint8_t buf2[4];
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(_cs,LOW);
  SPI.transfer(buf2, buf, 4);

  lastDataRead = buf[0];
  lastDataRead <<= 8;
  lastDataRead |= buf[1];
  lastDataRead <<= 8;
  lastDataRead |= buf[2];
  lastDataRead <<= 8;
  lastDataRead |= buf[3];

  digitalWriteFast(_cs,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}

Reading it with this code will never crash:

Code:
uint32_t MAX31855::spiread32(void) {
  uint8_t buf[4];
  uint8_t buf2[4];
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(_cs,LOW);
  
  SPI.transfer(buf2, buf, 1);
  lastDataRead = buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];

  digitalWriteFast(_cs,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}

The only difference is how I read back the 4 bytes using SPI.transfer. Reading 4 bytes at once will crash within a few seconds, but doing 4 reads of one byte will never crash. I'm guessing this has to do with timing and the faster speed of the T4.1, and that splitting the read into multiple operations solves the timing issue because it's slower?

I'm posting this as a quickfix for others with a similar problem as well as a suggestion as to something to look into the next time work is done on the SPI library. I'm not able to dig further into this at the moment, but with some luck I can revisit this when I don't have such a high workload.

PS: I have JTAG ports on this board in case that can help find the culprit.
 
Is a MAX31855 chip required, or will the code crash if just run without anything (or just a resistor on MISO) connected to the SPI port?

Could you create a minimal but complete program with this code and confirm it really does crash? I want to try reproducing the problem here. But as-is, I have to guess details about the rest of the program. For example, "lastDataRead" doesn't seem to be declared in either of these 2 code fragments.

Please, post a complete program, even if it is "trivial". Over and over, when I try to reproduce a strange bug from only a code fragment I end up guessing the wrong details for the rest of the program, which leads to wasting a lot of time not reproducing the bug.

I see Mouser and Digikey have plenty of MAX31855 chips in stock. If the chip really is needed, I can get one (or several if needed)... but to investigate this, I really need you to post a complete program which I can copy into Arduino and load onto a (unmodified) Teensy to reproduce the problem without guesswork.
 
Also with:
Code:
uint32_t MAX31855::spiread32(void) {
  uint8_t buf[4];
  uint8_t buf2[4];
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(_cs,LOW);
  SPI.transfer(buf2, buf, 4);
...
You are transferring each time whatever 4 byte garbage is on the stack, in buf2...
Does it crash if for example you fill this with 0s?

Or simply SPI.transfer(nullptr, buf, 4);
 
You are transferring each time whatever 4 byte garbage is on the stack, in buf2...
Does it crash if for example you fill this with 0s?

Or simply SPI.transfer(nullptr, buf, 4);

Using either just zeros or nullptr does not change the behavior.

I see Mouser and Digikey have plenty of MAX31855 chips in stock. If the chip really is needed, I can get one (or several if needed)... but to investigate this, I really need you to post a complete program which I can copy into Arduino and load onto a (unmodified) Teensy to reproduce the problem without guesswork.

Unfortunately, it seems the chip is required. This is the most compact code I could come up with that lets you test both variants.


Code:
#include "SPI.h"
int count = 0;
int csPin = 29;
uint32_t lastDataRead = 0;

void setup() {
  Serial.begin( 115200 );
  pinMode(csPin, OUTPUT);
  SPI.begin();
  delay(5);
}

uint32_t spiread32A(void) {
  uint8_t buf[4] = {0,0,0,0};
  uint8_t buf2[4] = {0,0,0,0};
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
  SPI.transfer(buf2,buf, 4);

  lastDataRead = buf[0];
  lastDataRead <<= 8;
  lastDataRead |= buf[1];
  lastDataRead <<= 8;
  lastDataRead |= buf[2];
  lastDataRead <<= 8;
  lastDataRead |= buf[3];

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}

uint32_t spiread32B(void) {
  uint8_t buf[4] = {0,0,0,0};
  uint8_t buf2[4] = {0,0,0,0};
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
  SPI.transfer(buf2, buf, 1);
  lastDataRead = buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}

void loop() {
  Serial.print( count );
  Serial.print(" data: ");
  //Serial.println( spiread32A() ); // will crash
  Serial.println( spiread32B() ); // will never crash
  delay(10);
  count++;
}
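
For anyone comparing the printed values between boards: the program above prints the raw 32-bit word. A minimal sketch for splitting that word into its fields on a desktop compiler might look like the following. The bit layout is my reading of the MAX31855 datasheet and is worth double-checking there, and the names Max31855Reading and decodeMax31855 are made up for illustration; they are not part of any library.

```cpp
#include <cstdint>

// Hypothetical container for the decoded fields of one 32-bit MAX31855 frame.
struct Max31855Reading {
    float thermocoupleC;  // bits 31..18, signed 14-bit, 0.25 degC per LSB
    float internalC;      // bits 15..4, signed 12-bit, 0.0625 degC per LSB
    bool fault;           // bit 16, set when any of the three fault bits is set
    bool shortToVcc;      // bit 2
    bool shortToGnd;      // bit 1
    bool openCircuit;     // bit 0
};

// Decode a raw frame as returned by spiread32A()/spiread32B().
Max31855Reading decodeMax31855(uint32_t raw) {
    // Extract the 14-bit thermocouple field and sign-extend it by hand.
    int32_t tc = static_cast<int32_t>((raw >> 18) & 0x3FFFu);
    if (tc & 0x2000) tc -= 0x4000;

    // Extract the 12-bit internal (cold junction) field and sign-extend it.
    int32_t internal = static_cast<int32_t>((raw >> 4) & 0x0FFFu);
    if (internal & 0x0800) internal -= 0x1000;

    Max31855Reading r;
    r.thermocoupleC = tc * 0.25f;
    r.internalC     = internal * 0.0625f;
    r.fault         = (raw & (1u << 16)) != 0;
    r.shortToVcc    = (raw & (1u << 2)) != 0;
    r.shortToGnd    = (raw & (1u << 1)) != 0;
    r.openCircuit   = (raw & (1u << 0)) != 0;
    return r;
}
```

Printing the decoded fault bits next to the raw value makes it easier to tell a plausible frame from bus garbage when comparing a custom board against a breadboard setup.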
 
Running it here on a Teensy 4.1 with nothing connected, and the Serial.println( spiread32A() ); line uncommented.

It's not crashing.

screenshot.png

Are you sure this will reproduce the crash on an unmodified Teensy 4.1 if I buy that MAX31855 and wire it to pins 11, 12, 13, 29? Does the crash still happen if I have no sensor connected to the MAX31855?
 
Also what does crashing mean?

Is the IMXRT still running, or is it hung? Does it reset? Or is it simply that the devices are not working?


Have you tried simply using uint32_t myval = SPI.transfer32(0);
I'm not sure whether you would have to reverse the bytes afterwards.
If so, I think you can use something like: n = __builtin_bswap32(n);
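
As a quick illustration of what that builtin does (it is a GCC/Clang builtin, and the Teensy toolchain is GCC), here is a small host-side sketch. The helper names packBigEndian and swapBytes are hypothetical. Whether a swap is actually needed depends on how the 32-bit word comes back, so comparing against the byte-wise result is one way to check.

```cpp
#include <cstdint>

// Assemble four bytes with the first-received byte in the most significant
// position, the same way the byte-wise reads in this thread build lastDataRead.
uint32_t packBigEndian(const uint8_t b[4]) {
    return (static_cast<uint32_t>(b[0]) << 24) |
           (static_cast<uint32_t>(b[1]) << 16) |
           (static_cast<uint32_t>(b[2]) << 8)  |
            static_cast<uint32_t>(b[3]);
}

// Reverse the byte order of a 32-bit word: 0x11223344 becomes 0x44332211.
uint32_t swapBytes(uint32_t n) {
    return __builtin_bswap32(n);
}
```

If the value from a 32-bit transfer matches packBigEndian() of the same four bytes read one at a time, no swap is needed; if it matches swapBytes() of that value, it is.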
 
Are you sure this will reproduce the crash on an unmodified Teensy 4.1 if I buy that MAX31855 and wire it to pins 11, 12, 13, 29? Does the crash still happen if I have no sensor connected to the MAX31855?

Let me drop by my customer tomorrow to pick up some MAX31855's for testing with an original T4.1 on a breadboard before you order anything? I'm thinking it's likely the Microchip part that is the problem here. I'll test tomorrow.

Also what does crashing mean?
Is the IMXRT still running, or is it hung? Does it reset? Or is it simply that the devices are not working?

As in "no longer running the main loop": frozen/crashed/not executing. Since I don't have a watchdog or anything running off a timer, I don't know if parts of it are "alive", but the main loop is not running and it will not restart by itself.

Have you tried simply using uint32_t myval = SPI.transfer32(0);
I'm not sure whether you would have to reverse the bytes afterwards.
If so, I think you can use something like: n = __builtin_bswap32(n);

transfer32 is originally private, but I tried making it public and calling it directly. It crashed.
 
Let me drop by my customer tomorrow to pick up some MAX31855's for testing with an original T4.1 on a breadboard before you order anything?

Ok, haven't ordered any MAX31855 chips, but I will if you can confirm that's what it takes to reproduce the problem.

FWIW, the program has kept running all day and still going, without any hardware connected to the SPI pins. It's now counted up to 3132548. Hard to imagine how live data versus just reading zero would matter, but then sometimes these obscure bugs can be really tricky.
 
Not sure what is going on yet.

But things I would potentially try out are things like with the function:
Code:
uint32_t spiread32A(void) {
  uint8_t buf[4] = {0,0,0,0};
  uint8_t buf2[4] = {0,0,0,0};
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
  SPI.transfer(buf2,buf, 4);

  lastDataRead = buf[0];
  lastDataRead <<= 8;
  lastDataRead |= buf[1];
  lastDataRead <<= 8;
  lastDataRead |= buf[2];
  lastDataRead <<= 8;
  lastDataRead |= buf[3];

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}
For example, I wonder whether changing 10 MHz to 8 MHz in beginTransaction makes a difference?


Maybe add delays in like:
Code:
  digitalWriteFast(csPin, LOW);
  delayMicroseconds(500);
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  delayMicroseconds(500);
  digitalWriteFast(csPin,LOW);
If it works, then make shorter delays...

If I were running it, I would hook up a logic analyzer and see whether it shows anything like a failed communication...
 
So now I've tested with an original T4.1 + the part on a breadboard and I cannot reproduce the problem. That indicates that the error is somewhere in my custom board. Looking at the signals with a scope, the levels look good and the edges are also OK when you zoom in. The part has a listed 5 MHz max speed for the clock, but we have previously tested it to be stable at 10 MHz, also on this board. Running it at 5 MHz vs 10 MHz does not affect the problem. Not even running it at 1 MHz. With the fix above, I can make it run at speeds up to 25 MHz, so I don't think this is a signal problem?

The only other chip on this SPI bus is a W25Q80DV Flash chip, and that can run at speeds above 100 MHz, so that should not be an issue. I make sure that all the CS lines are set high so no other chip is affecting the bus. Any pointers as to how I can troubleshoot this further?

For example, I wonder whether changing 10 MHz to 8 MHz in beginTransaction makes a difference?

I've tried reducing speed and adding delays. Doesn't matter. Whatever happens is in the transfer32 method.

SDS2104X Plus_PNG_11.png
SDS2104X Plus_PNG_10.png
 
Huh... The plot thickens. I initially tested with varying speeds from 0.5 to 10 MHz. It kept crashing no matter the speed. After I tried the "fix" above, I could not make it crash. This was at 10 MHz and I didn't try changing the speed after that. Now I tested other speeds (since KurtE mentioned it), and at different speeds the "fixed" code does indeed crash as well, just like the code I initially posted.

I've never seen such a case. I just ran all 11 MAX31855 sensors continuously every 10 ms for 2+ hours at 10 MHz. When I change the SPI to 5 MHz, the same code crashes within seconds. It also crashes at lower speeds. This is incredibly frustrating, especially since I felt so smart having figured out a solution. Now I just feel dumb and dunno where to look...

The signals and signal levels all look good. All the chips are soldered nicely. There is continuity to all legs on the ICs. The CS pins of all the MAX31855s are being set correctly and the layout has the required components. It's also been tested with 8 sensors before with no issues. Just to take that out of the equation, I removed the Flash chip from the bus and it changed nothing.
 
Can you try the crashing code on a Teensy 4.1 having no other hardware connected? Or on a breadboard with just 1 MAX31855 chip and no sensor?

We can do a lot more to help when we're able to reproduce the problem. I'm willing to buy MAX31855 chips and sensors if needed. But I need clear info about exactly the hardware & software setup to build here to reproduce the problem.
 
As mentioned, it does not crash with a stock T4.1 and the chip on a breadboard :-/ It has to do with SPI on my board specifically.

J
 
Again I might purchase one of these devices if it would help to debug... But...

So the best I can do currently is throw darts. So here are a few more... Note: I am a retired software guy... Others are better with the electrical stuff... BUT

a) You say you have 11 sensors connected and it reproduces on your board, but not on a breadboard... The question is whether they are the same thing, connected the same way.
What I am asking is: does your board have these chips soldered directly on, or are you using some breakout board, like: https://www.sparkfun.com/products/13266

For example with their Breakout:
screenshot.jpg
They have a decoupling cap and, maybe more interesting, a pull-up resistor on the CS pin.
The Adafruit version is even more complicated in that they also have PUs on SCK and MISO...

So for example wondering things like:

Are your CS pins set up with pull-up resistors? If not, are you making sure all of the CS pins are high except for the one you are talking to?

If you have something like 11 PU resistors on MISO, maybe that is causing an issue?


b) If you are trying the transfer32, I am assuming you are trying the simple one
Code:
    uint32_t transfer32(uint32_t data) {
        uint32_t tcr = port().TCR;
        port().TCR = (tcr & 0xfffff000) | LPSPI_TCR_FRAMESZ(31);  // turn on 32 bit mode
        port().TDR = data;        // output 32 bit data.
        while ((port().RSR & LPSPI_RSR_RXEMPTY)) ;    // wait while the RSR fifo is empty...
        port().TCR = tcr;    // restore back
        return port().RDR;
    }

Not the one marked private... Just the one where you pass it 32 bits to send and it returns 32 bits of data...
As you can see, there is not much to it... It tells the port to go into 32-bit word size. It puts the data out on the Transmit Data Register (TDR),
waits while the receive FIFO is empty, and then simply returns the data...


If this Hangs, I would probably hack up the code some like:
Code:
    uint32_t transfer32(uint32_t data) {
        elapsedMillis em;
        uint32_t tcr = port().TCR;
        port().TCR = (tcr & 0xfffff000) | LPSPI_TCR_FRAMESZ(31);  // turn on 32 bit mode
        port().TDR = data;        // output 32 bit data.
        while ((port().RSR & LPSPI_RSR_RXEMPTY) && (em < 1000)) ;    // wait while the RSR fifo is empty... or 1 second
        if (em >= 1000) Serial.printf("transfer32 timed out: %x %x\n", port().SR, port().RSR);
        port().TCR = tcr;    // restore back
        return port().RDR;
    }

Note: typed on the fly, so there could be typos... But it would be interesting to see whether this prints out data and what the status and receive status registers show us.
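
The same bounded-wait idea can be tried on a desktop compiler, away from the hardware. This is only a sketch under stated assumptions: waitWithTimeout is a made-up helper, std::chrono stands in for elapsedMillis, and the ready callback stands in for checking the RXEMPTY flag. Unlike the hack above, it reports the timeout through the return value, so the caller can avoid reading the data register when nothing arrived.

```cpp
#include <chrono>
#include <functional>

// Poll ready() until it returns true or the timeout elapses.
// Returns true if ready() became true, false if the wait timed out.
bool waitWithTimeout(const std::function<bool()>& ready,
                     std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!ready()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // gave up: the condition never became true
        }
    }
    return true;
}
```

On the Teensy the equivalent check would sit where the while loop in transfer32 is, with the status registers printed only in the timed-out branch.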
 
The weekend and refurbishing got in the way, but now it's Monday and I found the culprit. It's related to the transactions and multiple ICs. If I remove beginTransaction and endTransaction, it works flawlessly at any speed. I'll get my logic analyzer out tonight to capture visual proof of a crash to post here.
 
I'll get my logic analyzer out tonight to capture visual proof of a crash to post here.

I believe you. You don't need to prove anything. Please instead pour that effort into giving us a reproducible test case. What's needed is a way to reproduce the problem, hopefully without time consuming guesswork, so dev time can be spent focusing on fixing the bug.
 
I tried reproducing the bug with a stock T4.1 on a breadboard with 3 MAX31855s, to no avail. I only had 3 breakout boards, but I don't see how the number of SPI devices could be relevant other than for signal quality. It should be possible to reproduce this if it is indeed a firmware bug.
 
@guzo could you try commenting out the beginTransaction & endTransaction in the Adafruit lib? That solved the problem for me, but I am still unable to reproduce this with just a T4.1 on a breadboard.
 
@jensa

I keep meaning to mention, that for example reading one byte at a time code like this:
Code:
uint32_t spiread32B(void) {
  uint8_t buf[4] = {0,0,0,0};
  uint8_t buf2[4] = {0,0,0,0};
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
  SPI.transfer(buf2, buf, 1);
  lastDataRead = buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];
  lastDataRead <<= 8;
  SPI.transfer(buf2, buf, 1);
  lastDataRead |= buf[0];

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}

can be simplified by using the main single-byte SPI.transfer method. (I am trying to remember whether the buffer call above is short-circuited to it anyway...) Something like:

Code:
uint32_t spiread32B(void) {
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
  lastDataRead = SPI.transfer(0);
  lastDataRead <<= 8;
  lastDataRead |= SPI.transfer(0);
  lastDataRead <<= 8;
  lastDataRead |= SPI.transfer(0);
  lastDataRead <<= 8;
  lastDataRead |= SPI.transfer(0);

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}
Or you can do it all in one...
Code:
uint32_t spiread32B(void) {
  SPI.beginTransaction( SPISettings(10000000, MSBFIRST, SPI_MODE0) );
  digitalWriteFast(csPin,LOW);
  
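  // The order in which the operands of | are evaluated is unspecified in C++,
  // so read the bytes in a braced initializer (evaluated left to right) first.
  uint8_t b[4] = { SPI.transfer(0), SPI.transfer(0), SPI.transfer(0), SPI.transfer(0) };
  lastDataRead = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) | ((uint32_t)b[2] << 8) | b[3];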

  digitalWriteFast(csPin,HIGH);
  SPI.endTransaction();
  return lastDataRead;
}
But that probably does not help out your case here.
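
One caveat with the single-expression form: C++ does not specify the order in which the operands of | are evaluated, so several SPI.transfer(0) calls combined in one expression are not guaranteed to run left to right. A host-side sketch of a form that does guarantee the order, using a made-up fakeTransfer() as a stand-in for SPI.transfer(0):

```cpp
#include <cstdint>

// Hypothetical stand-in for SPI.transfer(0): returns 0x01, 0x02, 0x03, ... on
// successive calls, like a device clocking out bytes one after another.
static uint8_t nextByte = 0;
uint8_t fakeTransfer() { return ++nextByte; }

uint32_t read32InOrder() {
    // The elements of a braced initializer list are evaluated in the order
    // they appear, so b[0] is guaranteed to hold the first byte received.
    const uint8_t b[4] = { fakeTransfer(), fakeTransfer(),
                           fakeTransfer(), fakeTransfer() };
    // Cast before shifting so the shifts happen on unsigned 32-bit values.
    return (static_cast<uint32_t>(b[0]) << 24) |
           (static_cast<uint32_t>(b[1]) << 16) |
           (static_cast<uint32_t>(b[2]) << 8)  |
            static_cast<uint32_t>(b[3]);
}
```

The step-by-step version with separate statements has the same guarantee, since each statement finishes before the next one starts.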

About the SPI.beginTransaction(...) stuff:
I am assuming that you always call beginTransaction with the same settings. Even so, the code currently always rewrites the SPI configuration registers with those values.
Like:
Code:
		port().CR = 0;
		port().CFGR1 = LPSPI_CFGR1_MASTER | LPSPI_CFGR1_SAMPLE;
		port().CCR = _ccr;
		port().TCR = settings.tcr;
		port().CR = LPSPI_CR_MEN;
It used to bypass this code if the settings had not changed since the last call. However, we ran into a few issues, including that user code could and did change
which clock was used for SPI... Also, some code sets these registers itself, which we were missing,
including code that called:
void setBitOrder(uint8_t bitOrder);
void setDataMode(uint8_t dataMode);
void setClockDivider(uint8_t clockDiv) {


But wondering if it might work better if we check the registers like maybe:
Code:
		if ((port().CCR != _ccr) || (port().TCR != settings.tcr)) {
			port().CR = 0;
			port().CFGR1 = LPSPI_CFGR1_MASTER | LPSPI_CFGR1_SAMPLE;
			port().CCR = _ccr;
			port().TCR = settings.tcr;
			port().CR = LPSPI_CR_MEN;
		}
Again I have not tried this out, but might later if I get a chance
 
@guzo could you try commenting out the beginTransaction & endTransaction in the Adafruit lib? That solved the problem for me, but I am still unable to reproduce this with just a T4.1 on a breadboard.

If you have no SPI.beginTransaction(), the SPI bus could be running at any unknown speed & settings.
 