I2C hanging with SDA and SCL both high, ARB_LOST & TIMEOUT forever status

Status
Not open for further replies.

bboyes

Well-known member
So admittedly I am abusing the I2C bus, that's the point. I'm trying to make it fail and then self-correct.
See message #502 in the i2c_t3 thread for more context. In my slightly modified i2c_t3, I2C_AUTO_RETRY is enabled and I can see how many times resetBus has been executed.

I was running the PCA9548A_Test program (github code and readme about hardware in use), and decided I needed pullups on all the mux SDA outputs. Naturally I did this while the test was running. I triggered these errors, see below. What I expected was TIMEOUT would trigger the retry and resetBus would fix things. This did not happen, though resetBus did execute - but only once.

Code:
et:9185  Good:42391361  4616/sec
et:9186  Good:42395969  4616/sec
et:9187  Good:42400577  4616/sec
simple test in middle of write/read loop failed with return of 0x04: I2C_TIMEOUT
control write failed with return of 0x07: I2C_ARB_LOST
et:9188  Good:42401594  4616/sec  bad:2  busReset: 1
control write failed with return of 0x04: I2C_TIMEOUT
control write failed with return of 0x07: I2C_ARB_LOST
et:9189  Good:42401594  4615/sec  bad:4  busReset: 1
control write failed with return of 0x04: I2C_TIMEOUT
control write failed with return of 0x07: I2C_ARB_LOST

In the i2c_t3.cpp source code, the only applicable place ARB_LOST status gets set (sendTransmission, I am doing a write) is line 1011:
Code:
            // check arbitration
            if(status & I2C_S_ARBL)
            {
                i2c->currentStatus = I2C_ARB_LOST;
                *(i2c->S) = I2C_S_ARBL; // clear arbl flag
                // TODO: this is clearly not right, after ARBL it should drop into IMM slave mode if IAAS=1
                //       Right now Rx message would be ignored regardless of IAAS
                *(i2c->C1) = I2C_C1_IICEN; // change to Rx mode, intr disabled (does this send STOP if ARBL flagged?)
                return;
            }
This is a single-master system. Did I mention SDA and SCL are both high during all this time? So the Teensy master is not attempting to drive SCL. Interesting TODO there...

I reset Teensy via TyQt and, surprisingly, it was then able to run OK with no further errors in startup or init. So it seems the PCA9548A was not the culprit (makes sense: the slave can't drive either SCL or SDA high); it was apparently Teensy, which thought it couldn't drive SCL for some reason. From the comments above it looks like the mode gets changed to receive, but then Teensy should still drive SCL, and call resetBus() if TIMEOUT happens.

TIMEOUT does repeatedly happen so why is resetBus() not called? If I understand the code around line 795, it might be because Teensy thinks it is not the master.

In the meantime I am going to add the ability to call resetBus() from my test program's command line interface. I'm also changing to Wire.setDefaultTimeout(10000); though 10 msec seems overly long. I saw that value in the i2c_t3 example code somewhere.

What's the way out of this? Have I discovered a subtle bug? Can someone smarter than I shed some light and suggest a fix?
 
New PJRC Wire vs i2c_t3

The hardware I need is at the lab running tests. I tried to compile with your Wire and had to bracket a number of i2c_t3-isms like so (not sure that is the best way to do this)
Code:
#if defined I2C_T_H	
		error.ret_val = Wire.status();				// to get error value
#endif

Wire.requestFrom(address, quantity, stop) docs say stop is boolean true to send I2C stop. But this throws a warning in your lib and says stop should be int:
uint8_t TwoWire::requestFrom(int, int, int)

Why signed int types for values which are at most 8 bits and must be positive? What would a negative base address or length mean?

But now my mux library and the example test at least compile error-free.
My tech just built two more of our custom PCA9548A mux boards, so tomorrow I can run this test code with your Wire on it and see how it compares to i2c_t3. I'll do some of my typical abuse (holding SDA low, etc) too.

Our main app uses a lot of the i2c_t3 added functions so it would be hard to give those up at this point. And it has been super reliable on a board with PCA9557, TMP102, some memory, RTC, etc.

What are the intended pros and cons of your Wire vs i2c_t3, other than that Wire supports AVRs? You said hopefully "stuck-proof" was one. In the case where a slave misses a clock and holds SDA low, is there a way for Wire to send clocks in hopes of flushing out the stuck byte and resynching the slave I2C state machine? I'll test this tomorrow.
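For reference, the usual way to free a slave that is holding SDA low is the bus-clear procedure from the I2C specification: the master toggles SCL (up to nine times) until the slave finishes shifting out its byte and releases SDA. Below is a minimal sketch of that idea, written so it can be tried on a PC. `BusPins`, `clearStuckBus` and `StuckSlaveModel` are hypothetical names made up for this illustration and are not part of Wire or i2c_t3; on real hardware the hooks would have to wrap open-drain GPIO access to the SCL and SDA pins.

```cpp
#include <functional>

// Hypothetical pin-access hooks (not a real library API).
struct BusPins {
    std::function<void(bool)> setScl;       // true = release (high), false = drive low
    std::function<bool()>     readSda;      // true = SDA is high (released)
    std::function<void()>     halfBitDelay; // e.g. about 5 us at 100 kHz
};

// Clock SCL until the slave lets go of SDA, at most maxClocks times.
// Returns the number of clocks sent, or -1 if SDA is still low afterwards.
int clearStuckBus(const BusPins& pins, int maxClocks = 9) {
    int clocks = 0;
    while (!pins.readSda() && clocks < maxClocks) {
        pins.setScl(false);
        pins.halfBitDelay();
        pins.setScl(true);
        pins.halfBitDelay();
        ++clocks;
    }
    return pins.readSda() ? clocks : -1;
}

// Toy slave model for PC testing: it holds SDA low until it has seen
// `bitsPending` rising edges on SCL.
struct StuckSlaveModel {
    int  bitsPending;
    bool scl;
    explicit StuckSlaveModel(int bits) : bitsPending(bits), scl(true) {}
    void setScl(bool level) {
        if (level && !scl && bitsPending > 0) --bitsPending;  // rising edge
        scl = level;
    }
    bool sdaHigh() const { return bitsPending == 0; }
};
```

If the function returns -1, the slave never released SDA and only a device reset or power cycle is likely to help. After a successful clear, a master would normally issue a STOP before starting the next transfer.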
 
PJRC Wire library initial test results

I'm running the same test code but with the Wire library. It runs on the Adafruit breakout with simpleTest enabling four channels on the mux at the same time, which puts too heavy a pullup load on SDA (four 3.3K pullups in parallel = 825 ohms to 5V), so we would expect some errors due to SDA not being driven adequately low. On our custom board I can drive the mux reset with a pushbutton, effectively clobbering the slave's I2C response. Other test: hold SDA low with an external evil force (a test lead), sort of the opposite of the four-channels-mux test.

Forcing SDA low with test lead I can get failure:
simple test in middle of write/read loop failed with return of 0x04: Other error
simple test in middle of write/read loop failed with return of 0x02: Receive addr NAK
et:159 Good:705973 4496/sec bad:16
Error: control value=0xE1
et:160 Good:706597 4472/sec bad:16
control write failed with return of 0x04: Other error
control write failed with return of 0x04: Other error
et:161 Good:706597 4444/sec bad:18
control write failed with return of 0x04: Other error
control write failed with return of 0x04: Other error

Teensy is stuck with SCL never trying to go low because it sees SDA always driven low by the slave. This might be what I suspect to be the "ARB LOST" bug where Teensy thinks it does not ever have control of I2C, which makes no sense in a single-master system. This error (if my speculation is correct) should be handled differently.
Even resetting Teensy won't usually recover (sometimes it does, not sure why, maybe enough of a change on SCL that the slave can finish its cycle) because SDA is still driven low by the slave.

On a custom board if I reset the mux repeatedly Teensy can get into a state where it is stuck, but still driving SCL:
control write failed with return of 0x04: Other error
control write failed with return of 0x04: Other error
et:1085 Good:4747598 4383/sec bad:168
control write failed with return of 0x04: Other error

It would be really helpful if an enum of extended status values were returned, as in i2c_t3. What I see on the scope is the SDA line is always low. Resetting Teensy does not help, the slave is not able to release SDA and Teensy doesn't send any clocks to 'free' it by completing an unfinished I2C cycle. This looks like ARB LOST too, but I can't be certain. Resetting the mux recovers, since now it is not holding SDA low.

On the breakout which has too-low-value pullups I do not see consistent errors due to that, however it does get stuck with SDA low and can't recover. So maybe on that board the inability to always pull SDA low enough in a normal cycle triggers an error where the slave gets out of synch and then it becomes ARB LOST: SDA is stuck low and SCL is not being driven by Teensy. Resetting the mux recovers from that since the slave then releases SDA.

Update: I bypassed the "turn on multiple mux outputs" test for now since that sort of problem is not relevant to the I2C library behavior. On the breakout board it is pretty easy to get into what seems to be ARB LOST by forcing SDA low a few times. SCL is high, SDA low, the only recovery is resetting the mux.
 
Teensy is stuck with SCL never trying to go low because it sees SDA always driven low by the slave.

I would like to improve the Wire library to automatically detect and recover, if that is even possible without compromising compatibility with other I2C devices.

How can I set up the test? Do I need to buy anything other than this Adafruit board?

https://www.adafruit.com/product/2717

You mentioned a custom board. Do I need that? Or can I build something similar here?

As a practical matter, I really can't do anything on the Wire library until I have hardware here to reliably reproduce this SDA stuck low condition.
 
The Adafruit board has the PCA9548A on it. Our custom board has that plus a PCA9600 buffer (it will handle 4000pF and sink 60 mA vs 1/10 or so for standard I2C slave devices, so we can drive a long modular/RJ25 cable), and our custom Teensy board has a matching 9600 buffer. The result is both setups "look" the same to Teensy and I can run identical code on each. So if you have the Adafruit board that is all you need.

Caution: you need pullups on all the outputs of the MUX on the Adafruit board. I just have them on the SDA outputs, though now that I think about it maybe I do need them on SCL also, even though Teensy is always driving SCL and Teensy SCL and SDA also have pullups. I use 3.3K to 5V.

The MUX is a bit of an odd device for reasons mentioned in the readme of the PCA9548A library. For one thing, the mux doesn't isolate or buffer. It will tolerate different input and output voltages, but when you open a channel to a mux output, that enabled I2C net now includes all the resistance and capacitance of both the I2C net segment on the mux input and the enabled mux output.

It should be possible to create the stuck SDA low on any I2C slave device; I will try that this weekend... all my test setups are tied up at the moment. I haven't deliberately abused I2C this way until recently, but with slaves on several feet of pluggable cables I suddenly have a reason for it to withstand missing slaves, crushed or cut cables, slaves with address jumpers set wrong, etc. Oh, and did I mention the not infrequent incidence of rodents chewing on the cables? All of this is not possible on a nice clean circuit board with slaves always soldered in place a few inches away.

All you need to do is tie SDA to GND momentarily: I just use a .025" male end test lead. Since SDA is open drain you don't hurt anything. You want SDA to be held low in the middle of an I2C message in progress, so you might need to try a few times to see the failure in question. Holding SDA low will get Teensy and the slave out of synch, and when you release SDA, Teensy will think there has been a timeout but the slave will be waiting for more SCL pulses to finish its message. Teensy will either try another message, which will be corrupted but at least keeps the I2C bus going, or it will see SDA always low when it is not driving it and think "Aha, there is another master on the bus so I will declare ARB LOST and give up trying". That's my theory at least. That thinking leads me to similar places in the library code of both Wire and i2c_t3, with similar comments about "this can't be right".

Oddly, perhaps, running identical code on the Adafruit breakout vs our custom board, ours doesn't have random errors and the breakout does. Today the breakout is having address and data NAK errors. So far, 5 errors in 200 million or so messages. I don't have a good theory for that. On a scope the SCL and SDA are clean and swing from 0 to 4V. I'm only running at 100 kHz, so there is plenty of timing margin. Part of this network on our custom board has a device only capable of 100 kHz and we will be driving several feet of untwisted modular cable, so slower is better, for us at least.

I now have a mux board with 6 temp sensors per channel coming on line and it is working really well. Our design is surprisingly tolerant of the modular cable on the unbuffered mux outputs which have a string of TMP275s on each. We did pick a specific cable pinout which gives the lowest capacitance possible on SCL and SDA by putting them on pins 1 and 6, with ground on pins 2 and 5. 5V and reset are on the inner two pins.

I hope that all helps. Thanks for looking into this. FYI I will be out of country for a few days with unreliable email and Internet so I may not be able to respond here until after Apr 02.
 
I would like to improve the Wire library to automatically detect and recover, if that is even possible without compromising compatibility with other I2C devices.

I would imagine this is possible, similar to how i2c_t3 does it with resetBus automagically invoked if you set a manifest constant in the library header. You would think this issue also applies to other I2C slaves. But I don't seem to be able to get the TMP102 to stick this way.
 
I ordered the Adafruit board. It'll probably be here in a week.

I've put this on my high priority issue list. Timing isn't going to work out to get a fix into the 1.36 release, but I do want to look into this soon.
 
The Adafruit board arrived. I'm working with it now.

img.jpg
 
One in a billion error rate: good enough?

The Adafruit board arrived. I'm working with it now.

And I am back from almost two weeks in Cuba! The PCA9548A device was left running while I was gone and my test code accumulated over a billion cycles. It's interesting that during that time, it wasn't necessary to execute resetBus. But in testing just after that, it was: one time. So in billions of I2C messages, resetBus was only needed twice.

Also interesting is that, on our custom board, with identical parts, the error rate is much lower.

The difference is a PC board layout vs breakout boards in white protoboards with wires. Which one will have less noise? The PC layout. So a clean PC layout will give better results than a mass of prototype wires. No surprise there, really.

So the conclusion seems to be that given noise (which is always present!) and enough time, you will see I2C network errors. Less than one in a billion message cycles may have the exact conditions needed to hang the bus and need a device reset (not always possible) or extra clocks of something like resetBus to recover from a corrupted cycle. This one-in-a-billion chance might come up every few days or weeks, so it is more likely than the raw odds might suggest. If you want 100% system reliability, such small chances for errors must be considered and handled.
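A quick back-of-envelope sketch of why one-in-a-billion comes around sooner than it sounds. The helper names are made up; the inputs are the roughly 4616 messages/sec from my test log and an assumed, purely illustrative, independent one-in-1e9 chance per message.

```cpp
#include <cmath>

// Average time between events, in days, for a given message rate.
double daysBetweenEvents(double messagesPerEvent, double messagesPerSecond) {
    return messagesPerEvent / messagesPerSecond / 86400.0;
}

// Probability of at least one event in `messages` independent messages:
// 1 - (1 - p)^n, computed in a numerically stable form.
double probabilityOfAtLeastOne(double pPerMessage, double messages) {
    return -std::expm1(messages * std::log1p(-pPerMessage));
}
```

At 4616 messages/sec, a billion messages go by in about 2.5 days, and under the independence assumption the chance of at least one hit in those billion messages is roughly 63%.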
 
I think I know what might be causing that 1-in-1e9 error. Well, not the cause, but what part of the code probably isn't handling it properly. But how to fix is still a good question. I hope to have a proposed fix "soon"....
 
Just a note, from what I remember:

I2C_AUTO_RETRY does not invoke resetBus(); it attempts 3-5 times to recover, then gives up.

Wirex.resetBus() is something you call in your sketch yourself, even if I2C_AUTO_RETRY is not defined.

For me, I verify the I2C device registers; if there is a mismatch, I reset the bus, send the configuration to the chip(s), then return.
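The verify-then-recover approach described above could look roughly like this. It is a sketch only: `DeviceOps`, `RegSetting` and `verifyOrRecover` are hypothetical names, and the three hooks are assumptions standing in for whatever register read/write code and resetBus() call a real sketch would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical hooks; in a real sketch these would wrap the I2C library calls.
struct DeviceOps {
    std::function<bool(uint8_t reg, uint8_t& value)> readReg;  // false on bus error
    std::function<bool(uint8_t reg, uint8_t value)>  writeReg; // false on bus error
    std::function<void()>                            resetBus; // e.g. calls resetBus()
};

struct RegSetting { uint8_t reg; uint8_t value; };

// Returns true if the configuration was already intact, false if a mismatch
// (or read error) was found and the bus was reset and the config rewritten.
template <std::size_t N>
bool verifyOrRecover(const DeviceOps& dev, const RegSetting (&expected)[N]) {
    bool intact = true;
    for (const RegSetting& s : expected) {
        uint8_t actual = 0;
        if (!dev.readReg(s.reg, actual) || actual != s.value) {
            intact = false;
            break;
        }
    }
    if (intact) return true;

    dev.resetBus();
    for (const RegSetting& s : expected) dev.writeReg(s.reg, s.value);
    return false;
}
```

Calling this periodically catches both a hung bus (the read fails) and a device that silently lost its configuration (the read succeeds but the value is wrong).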
 
I2C_AUTO_RETRY does cause execution of resetBus()

You can see my test code in Github. I #define I2C_AUTO_RETRY, and in the comments of i2c_t3.h, code line 199,
Code:
- (v9.2) Modified 29Dec16 by Brian (nox771 at gmail.com)
...
// Auto retry - uncomment to make the library automatically call resetBus() if it has a timeout
so, it actually *does* invoke resetBus()
and I have also added a counter to the i2c_t3.cpp file so I know how many times resetBus has been called. This is in the cpp starting at line 795:
Code:
        #if defined(I2C_AUTO_RETRY)
            // if not master and auto-retry set, then reset bus and try one last time
            if(!(*(i2c->C1) & I2C_C1_MST))
            {
                resetBus_(i2c,bus);
                i2c->resetBusCount++;   // bboyes
                if(!(*(i2c->S) & I2C_S_BUSY))
                {
                    // become the bus master in transmit mode (send start)
                    i2c->currentMode = I2C_MASTER;
                    *(i2c->C1) = I2C_C1_IICEN | I2C_C1_MST | I2C_C1_TX;
                }
            }
        #endif
and my test code output verifies that in fact it does get called when needed, as noted above. Plus the simple CLI of the test code lets me invoke it manually with 'r' or 'R'
 
I put pullups on the I2C net and also on the outputs of the MUX. But now I can't remember why I put them on the outputs, since the MUX is an analog switch, so all the resistive and capacitive load on the output(s) (more than one can be enabled at a time) passes through to the input.
 
I think I know what might be causing that 1-in-1e9 error.
I'm very interested to hear more! I have a TotalPhase I2C sniffer I could put on it if I knew how to trigger it (otherwise it will be like looking for a needle in a haystack with 1e9 straws). And I am also interested in why the breakout board is so different from a custom PC board. I'm surprised the breakout would really be that much "noisier".
 
I've been continuing to look into this problem for the last couple days. If anyone's wondering why I've been so quiet the last couple days... this is it.

DSC_0538_web.jpg

As you can see, I've added a little more hardware to test. A 74HC08 chip buffers the signals and drives two LEDs. I also added a 2N3904 NPN transistor to pull down on the SDA line when the button is pressed. This allows viewing the base voltage, so it's easy to see on my scope when the button was pressed.

This is turning out to be a really tough problem, but it's looking like Freescale's arbitration lost feature is getting triggered if the transistor (or the noise in your system) happens to be forcing the SDA line low at just the wrong moment at the beginning of the last ACK bit time.

file.png

The red trace is the NPN base voltage. In this test, the SDA "noise" is starting just after the address byte.

The green trace is pin 13, where I've added a digitalWriteFast to drive it high the instant I see the arbitration lost flag.

I believe what's happening here is the I2C port quickly releases the bus because it believes another I2C master has control, but another master isn't actually in control. The result is SDA goes high almost immediately. Here's a zoom in to what I believe is likely the problem point:

file.png

That low pulse on SCL is only about 250 ns wide! No other master is actually taking control here. :(

The TCA9548 chip probably has a low-pass filter to reject stuff much higher than 400 kHz, so it's probably ignoring this pulse. So from its point of view, the master still hasn't completed the ACK bit cycle. It keeps driving SDA low, forever waiting for the master to pulse SCL and generate the stop condition.

At least that's what I think is really going on here.
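To put the 250 ns pulse in perspective, it can be compared with the minimum SCL low time (tLOW) from the I2C specification: 4.7 us in Standard-mode and 1.3 us in Fast-mode. The snippet below is only that comparison, with made-up names. Whether a particular slave actually ignores such a pulse depends on its input filtering, which hasn't been established here.

```cpp
// Minimum SCL low period (tLOW) from the I2C-bus specification, in nanoseconds.
constexpr double kMinSclLowStandardNs = 4700.0;  // Standard-mode, up to 100 kHz
constexpr double kMinSclLowFastNs     = 1300.0;  // Fast-mode, up to 400 kHz

// Fraction of the required low time that an observed pulse actually covers.
constexpr double fractionOfMinLow(double pulseNs, double minLowNs) {
    return pulseNs / minLowNs;
}
```

A 250 ns pulse is only about 5% of the Standard-mode minimum and about 19% of the Fast-mode minimum, so it is far shorter than any legitimate clock low period at these speeds.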

How to properly solve this is something I still haven't figured out.....

Anyway, just wanted to post about what I've learned so far. This is a really tough problem. I am continuing to work on it and I do intend to solve this eventually. But before crafting any horribly ugly hacks, I want to be really sure they don't fix this case but make any other situation worse.
 
I use resetBus() on my T3.5 with 12 I2C devices on Wire1. I validate the MCP expander registers to make sure they are valid; if not, I call Wire1.resetBus() and then write the registers. I don't use I2C_AUTO_RETRY, and besides, auto retry only tries a few times and gives up after, but you can call resetBus() anytime...
 