Probable race condition in Radiohead library

Paul,

Arduino 1.8.2 w/ teensyduino 1.36.b4, teensy 3.2 w/ RFM95 lora 433. No problems with the teensy-supplied library.

IMO, the rising vs high "fix" is simply narrowing an already small timing window. It's not a fix, but a bandaid that's sure to cause problems later. The proper fix is in the code that you have incorporated in 1.36.b4.

David
 
Thanks,

It is a group effort. The race condition was fixed by dgranger.

Warning, the fix eas only done for the RFM95 portion. My guess tells me, that similar changes may be needed for other radios as well.
 
Arduino 1.8.2, Teensy 3.2 @ 96MHz, Teensyduino 1.36b4, Radiohead library supplied w/ Teensyduino

>1,000 TX cycles without lockup. Also removed the 10ms delays and 1S pacing delay in the main loop of the sketch in post 1

Thanks Paul, KurtE, dgranger and others!
 
Arduino 1.8.1 TeensyDuino 1.36 Teensy 3.5 RFM95 or equivalent modules (Semtech SX127x chips)

Hello, I am still encountering identical problems, i.e. lockup's with the RF transmission while using the RadioHead Library supplied with TeensyDuino 1.36.

With the moving of the _mode = RHModexxxx statements as discussed above the Example programs of RF95 client indeed do not lock up anymore.

I had reverted to these sample programs since I was getting lockups using the above hard/sfwr for what I'm working on.

I moved to the Teensy 3.5. after running out of memory on an ATMEGA-328PB chip, a variant of the ATMEGA 328P used in the Arduino's.
I'm trying to read 8 temp and 1 pressure sensor, store data locally on the Teensy 3.5 SD and copy it to another Teensy using LORA chips.
But irratic lockups where occuring. See also : https://forum.pjrc.com/threads/4485...s-and-Radio-Head-library-not-working-reliably

Then I found this thread and made sure I was using the patched RF95 library. That worked for the sample program.
With the sample program not locking up anymore I went back to my own code, but alas, that still locks up up after two transmissions.

I've attached the code and would appreciate any suggestions.

One remark: I have a hardware/RF background and am on a near vertical learning curve for C and C++. I might have violated a few naming conventions and not used every trick in the book but it compiles without complaints :eek:

tnx
Pim
 

Attachments

  • MuliChannelTempLogger_Teensy3.5_V1.0.txt
    20.5 KB · Views: 607
Hi,

I don't have your hardware setup so I can't run your sketch. However, I did read the code you posted and saw a problem which could explain the "lock ups".

In the "loop()" portion of your sketch you have the following code:

Code:
String  fileName = (year());
        fileName += (month());
        fileName += (day());
        fileName += ".CSV" ;
      
        char charFileName[fileName.length() + 1];
        fileName.toCharArray(charFileName, sizeof(charFileName));
   
      File dataFile = SD.open(charFileName, FILE_WRITE);
      if (dataFile);{
        dataFile.println(dataString);
        dataFile.close();
      }

The problem is that the "charFileName" array is statically allocated at compile time so the compiler has no way of knowing what the run time value of "fileName.length()" is. So the char array will most certainly not be allocated to the size that you intend it to be. The subsequent "fileName.toCharArray" then copies the string data to the under-allocated char array overwriting other stack data, most likely causing the "lock up".

To test this, try declaring the char array with sufficient static size to hold whatever size fileName you need and see if the lock up goes away, like:

Code:
    char charFileName [32];

By the way, you make the same mistake in the "generateFiles" routine at the bottom of your sketch. However, that code is never referenced.

There may be other problems, but I'd start by fixing this one. If the code still locks up, I'd put a few Serial.prints in the loop section to try and narrow down where it's happening.

David
 
Hi David,

Re. charFileName array definition.
Oeps. I moved it out of the way and made it a global array. tnx.
However the issue was not fixed yet.

Re. Println inserts.
That narrowed it down to the call to the RHGenericDriver::waitPacketSent() function in the RHGenericDriver.cpp file.
It appears that while (_mode == RHModeTx) never becomes untrue after the second time through the main loop of the program.

Then as suggested in a post above I changed the interrupt state from RISING to HIGH in RH_RF95.cpp attachInterrupt(interruptNumber, isr0, HIGH); and that fixed it.

Since I'm using TeensyDuino 1.36 I'm still a bit confused that this fixes it because reading in this thread there is mentioning of it being included.
But it works now.

Thanks.

Pim
 
Last edited:
Hmm,

Glad you got it working.

I have no clue as to why HIGH works and RISING does not.

I tested the race condition fix with a Teensy 3.2, and I believe that KurtE was using a 3.2 as well. I wonder it there's something different on a 3.5? I have one here, I'll see if I can hook it up and take a look.

David
 
David,
One of the differences is clearly the clock speed: 72 MHz for the 3.2, 120 MHz for the 3.5 if I'm not mistaken.
Now this most likely being a race condition this could be the true source.


I did some more digging trying to get to the bottom of this rather then putting in the workaround for the rising vs. high state of the interrupt pin.
What's nagging me about it is that I don't understand how this workaround works.

I first went back to IRQ's on the rising edge and then tried to briefly disable IRQ's when changing the state parameter.

Code:
void RH_RF95::setModeTx()
{
	//debugging race condition
	noInterrupts();
	//
    if (_mode != RHModeTx)
    {
    _mode = RHModeTx;       // set first to avoid possible race condition
	spiWrite(RH_RF95_REG_01_OP_MODE, RH_RF95_MODE_TX);
	spiWrite(RH_RF95_REG_40_DIO_MAPPING1, 0x40); // Interrupt on TxDone
    }
	// debugging race condition
	interrupts();
	//
}
(Done for Rx, Tx, Idle and Sleep)

No effect.

Next was properly resetting the Semtech chip.
The chips do their own Power on Reset but if you do reset them manually you have to pull the reset line low and then leave it floating. (Nearly all the code I've seen force the pin to a high state after the reset. The datasheet gives no direct explanation as to why it should be left floating.)

So in radioon()
is now.

Code:
/*****************************************************************
 * Reset the radio, 
 * - program the frequency;
 * - set Tx power;
 * - set PA stage to use;
 *  and communicate progress over the Serial Monitor link
 */
void radioon()
{
 // manual reset
 // Reset pin is in input mode floating the reset line
  digitalWrite(RFM95_RST, LOW);
 // Pull reset line low
  pinMode (RFM95_RST, OUTPUT);
  delay(1);             // Spec is to keep reset low for a minimum of 100E-6 s.
  pinMode (RFM95_RST, INPUT); // Let it float again
  //  digitalWrite(RFM95_RST, HIGH);
  delay(10);            // And then wait at least 5 ms for it to initialize itself
  Serial.println(F("LORA SX127x chip was reset"));

  while (!rf95.init())

etc.
Together with a "stabililizing" delay of 2000 ms at the beginning of setup that had some effect in that it went through the loop 48 times and then still locked up.:(
Changing that delay in steps back to 100ms the number of iterations it makes through the loop varies between 3 and 48 and the numbers even vary a bit per run.

So I'm scratching my head on what's going on here and can get fixed by going to a IRQ line triggered at a high level and not a rising edge. It beats me.
 
Last edited:
Hmm,

Glad you got it working.

I have no clue as to why HIGH works and RISING does not.

I tested the race condition fix with a Teensy 3.2, and I believe that KurtE was using a 3.2 as well. I wonder it there's something different on a 3.5? I have one here, I'll see if I can hook it up and take a look.

David
I believe my testing may have been done with T3.6...
But it has been awhile since I looked at it and I simply took your fix, which appeared to work for me...
 
First of all sorry for my English. I had the same problem with teensy 3.2 and RFM65. Th waitPacketSent() hang under conditions because the system seams to remain in TxData state. After a lot of debugging and tests i find the problem. The last command in the interrupt service routine is "spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); " which clear all interrupt flags before ISR exit. But after a TxData Complete interrupt, the TxDone flag does not cleared. A second consecutive command "spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); solve the problem and clear the TxDone flag.
 
Somehow an earlier post of mine got lost.

I think the 3.2 vs 3.5 thing is not really the issue. In your earlier post (#30) you stated that the sample sketch was running without problems using the RH lib from Teensyduino 1.36. It was when you used your sketch with the 1.36 RH lib that you experienced hangs albeit somewhat randomly in terms of when.

I think you need to focus on what is making your sketch hang when the sample sketch does not. Try commenting out sections to see if you can isolate the portion which causes the hang. If you end up with an essentially do-nothing sketch that still hangs, then it's probably time to instrument the RH lib to save state vars, such as _mode and the interrupt source captured in the interrupt handler.

If catha is correct that the reset flags command isn't working on TxDone, you might try re-reading the int source state in the interrupt handler to see if the tx bit is still set after the reset. This data would be pretty helpful in trying to sort out the source of the problem, and hopefully a solution.
 
Eat the Elephant properly

Catha's fix of clearing the Interrupt flag twice works when the IRQ pins are set to respond to rising.
A first attempt as to why was to dump the chips registers around that sequence of commands with the printRegisters() function. That showed 0 for register 0x12 both times.

So time to eat the Elephant properly as you suggested, one small step at a time.

But not today anymore.
I'll be back
Pim
 
Hi
Catha is right. The Tx done interrupt flag appears not to get cleared the first time. When you clear them a second time they are.

Rather then skinning my software I realized that one of the differences with rf95_client is that there
Code:
rf95.waitPacketSent()
is called directly after the call to
Code:
rf95.send(data, sizeof(data))

In my code I'm not doing that since I intend to transmit only once a minute and the function
Code:
rf95.send(data, sizeof(data))
starts with the same function anyway
Code:
bool RH_RF95::send(const uint8_t* data, uint8_t len)
{
    if (len > RH_RF95_MAX_MESSAGE_LEN)
	return false;

    waitPacketSent(); // Make sure we dont interrupt an outgoing message

That makes the call to
Code:
rf95.waitPacketSent()
in rf95_client superfluous but possibly also the reason why the example code works.

Skinning the rf95_client example code to the bone you get.
(Note only a iterations counter got added and the total delay in loop was shortened from 3000+400 ms to 1000ms to save some time.)
Code:
// rf95_client.pde
// -*- mode: C++ -*-
// Example sketch showing how to create a simple messageing client
// with the RH_RF95 class. RH_RF95 class does not provide for addressing or
// reliability, so you should only use RH_RF95 if you do not need the higher
// level messaging abilities.
// It is designed to work with the other example rf95_server
// Tested with Anarduino MiniWirelessLoRa, Rocket Scream Mini Ultra Pro with
// the RFM95W, Adafruit Feather M0 with RFM95

#include <SPI.h>
#include <RH_RF95.h>

// Singleton instance of the radio driver
RH_RF95 rf95(10,2); // Teensy 3.5

uint16_t iterations;

void setup() 
{

  Serial.begin(9600);
  while (!Serial) ; // Wait for serial port to be available
  if (!rf95.init())
    Serial.println("init failed");
  iterations = 1; //humans start at 1 :-)
}

void loop()
{
  //debug
  Serial.print (iterations);
  // debug end
  Serial.println(" Sending to rf95_server");
  //debug
  iterations++;
  //debug end
  // Send a message to rf95_server
  uint8_t data[] = "Hello World!";
  rf95.send(data, sizeof(data));

 /* 
  rf95.waitPacketSent();
  // Now wait for a reply
  uint8_t buf[RH_RF95_MAX_MESSAGE_LEN];
  uint8_t len = sizeof(buf);

  if (rf95.waitAvailableTimeout(1000))
  { 
    // Should be a reply message for us now   
    if (rf95.recv(buf, &len))
   {
      Serial.print("got reply: ");
      Serial.println((char*)buf);
//      Serial.print("RSSI: ");
//      Serial.println(rf95.lastRssi(), DEC);    
    }
    else
    {
      Serial.println("recv failed");
    }
  }
  else
  {
    Serial.println("No reply, is rf95_server running?");
  }
  */
  delay(1000);
}

This code does reliably lockup after 1 iteration through loop().

Then modifications to RH_RF95.cpp showed indeed that the Tx done IRQ flag appears to get cleared only the second time you bang it.

Here is the relavant piece of RF_RF95.cpp

Code:
void RH_RF95::handleInterrupt()
{
    // Read the interrupt register
    // Serial.println("RH_RF95.cpp: RH_RF95::handleInterrupt");
    uint8_t irq_flags = spiRead(RH_RF95_REG_12_IRQ_FLAGS);
	Serial.print("i1 "); Serial.println(_mode<<8 | irq_flags, HEX);  // DEBUG: output _mode & irq flags
	//Serial.print ("irq_flags (reg 0x12) = ");
	//Serial.println(irq_flags, HEX);
    if (_mode == RHModeRx && irq_flags & (RH_RF95_RX_TIMEOUT | RH_RF95_PAYLOAD_CRC_ERROR))
    {
	_rxBad++;
    }
    else if (_mode == RHModeRx && irq_flags & RH_RF95_RX_DONE)
    {
	// Have received a packet
	uint8_t len = spiRead(RH_RF95_REG_13_RX_NB_BYTES);

	// Reset the fifo read ptr to the beginning of the packet
	spiWrite(RH_RF95_REG_0D_FIFO_ADDR_PTR, spiRead(RH_RF95_REG_10_FIFO_RX_CURRENT_ADDR));
	spiBurstRead(RH_RF95_REG_00_FIFO, _buf, len);
	_bufLen = len;
	
	//debug
	Serial.print("i2 "); Serial.println(_mode<<8 | irq_flags, HEX);  // DEBUG: output _mode & irq flags
	//debug end
	spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); // Clear all IRQ flags
	// debug
	uint8_t debug_pim = spiRead(RH_RF95_REG_12_IRQ_FLAGS);
	Serial.print("i3 "); Serial.println(_mode<<8 | debug_pim, HEX);  // DEBUG: output _mode & irq flags
	// debug end
	// debug
	// Clear IRQ flags again.
	spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); // Clear all IRQ flags
	// and read them back
	debug_pim = spiRead(RH_RF95_REG_12_IRQ_FLAGS);
	Serial.print("i4 "); Serial.println(_mode<<8 | debug_pim, HEX);  // DEBUG: output _mode & irq flags
	// debug end
	

	// Remember the RSSI of this packet
	// this is according to the doc, but is it really correct?
	// weakest receiveable signals are reported RSSI at about -66
	_lastRssi = spiRead(RH_RF95_REG_1A_PKT_RSSI_VALUE) - 137;

	// We have received a message.
	validateRxBuf(); 
	if (_rxBufValid)
	    setModeIdle(); // Got one 
    }
    else if (_mode == RHModeTx && irq_flags & RH_RF95_TX_DONE)
    {
	_txGood++;
	setModeIdle();
    }
    //debug
	Serial.print("i5 "); Serial.println(_mode<<8 | irq_flags, HEX);  // DEBUG: output _mode & irq flags
	//debug end
    spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); // Clear all IRQ flags
	// debug
	uint8_t debug_pim = spiRead(RH_RF95_REG_12_IRQ_FLAGS);
	Serial.print("i6 "); Serial.println(_mode<<8 | debug_pim, HEX);  // DEBUG: output _mode & irq flags
	// debug end
	// debug
	// Clear IRQ flags again.
	spiWrite(RH_RF95_REG_12_IRQ_FLAGS, 0xff); // Clear all IRQ flags
	// and read them back
	debug_pim = spiRead(RH_RF95_REG_12_IRQ_FLAGS);
	Serial.print("i7 "); Serial.println(_mode<<8 | debug_pim, HEX);  // DEBUG: output _mode & irq flags
	// debug end

And here is the edited debug output.

1 Sending to rf95_server
i1 308
i5 208
i6 208
(This should be 200 now):(
i7 200 (like this):)

So, why is this happening?
On to the spiWrite and spiRead functions I guess:(
To be continued tomorrow.
Pim
 
PimN exactly what you say is correct. I think the problem isn't the spiWrite but the Rfm95 operation because the same spiwrite operates correct with other Rfm chips.
Few words for me. I am hardware engineer up to 24 years I am working in R&D department of a big International Telecommunication company as emended hardware engineer, project manager and now Principal engineer. Now i started the implementation of small projects using teensy. One of my first projects is the drone monitoring for fishing purposes (heavy casting)
I am happy to help if i can.

Are there a better more powerful and more professional than arduino environment?

Best regards,
 
Catha, It might be something related to the higher speed of the CPU together with the RFM95. I went back to my own code with the same debugging in place and see that in some 20+% of the iterations through loop() it does not need the second clearing of the Interrupt flag. Now I think I also understand why the changing of the IRQ trigger mechanism from rising to high worked. If the first clearing of the flag is not successful a level IRQ does not get cleared so upon the return from the routine the IRQ is still set. Hence right away the IRQ routine is called again and it bangs the flag a second time. Come to think of it that might be even be a more fool proof workaround then clearing the flag twice because, what if it also is not successful the second time?? Maybe nit-picking but food for thought. Still need to read up on this high level setting for IR flags because it's not a feature of the ATMEL AVRMEGA 8 bit chips (ATMEGA328P). That only has edges and a LOW level setting.


Anyway those kind of timing issues are "interesting" to track down to to begin with. There the Arduino environment is not helping and I see your point in searching for a more professional environment.
The chip on the Teensy is from NXP and I came across this page for development tools. http://www.nxp.com/products/softwar...re/kinetis-software-and-tools:KINETIS_SWTOOLS

I've no experience with it at all.
Some of it is free download but if you want more functionality you have to pay up. For a professional company I think not too much of an issue. I saw prices of 500 $/Euro's per seat.

By the way, interesting project to work on for a living. Lot's of different technologies coming together I assume.:)

rgds
Pim
 
PimN, Yes its time related the problem of the flag erasing mechanism. Additional for information only the work around with IRQ set to High from rising doesn't works for me. I thing we find a stable solution for that problem.
Relative to IDE i try the Atmel Studio with an Visio Addon for Arduino. But i do not know if support the teensy 3.2 and the ARM processor.
I try to find a solution reading the forums but i thing the problem is complicated. I will continue with ARDUINO IDE (without Breakpoints, without Watches ertc)
 
The v1.76 RadioHead library linked above in post #43 won't compile due to an error in RHGenericDriver.cpp.

The function at around line 155 should be changed from
Code:
int8_t RHGenericDriver::lastRssi()
to
Code:
int16_t RHGenericDriver::lastRssi()
which is how it is defined in RHGenericDriver.h. "_lastRssi" is declared as an int16_t in the .h file as well.

Moreover, this version of the RadioHead library has the "_mode" sets placed after the code that initiates the I/O which was determined to cause the race condition that was the original subject that started this thread.

While the new library does work without hangs in my brief testing, I do feel that setting the _mode prior to performing the I/O is safer as it eliminates the potential race condition.

YMMV. In my local copy, I've swapped the _mode set order for the "setModeXX()" procs in RH_RF95.cpp.
 
My mileage does indeed vary. I tried both positions of the _mode statements and it was not the root cause of this lockup. (If it is "cleaner" coding is another discussion. But I won't judge not being a programmer.)
I tried either way and it did not matter. (Nor did the changing of the IRQ trigger sensitivity from rising to HIGH as mentioned somewhere above was a sure fix)

Debugging it I found that adding or removing a line of code somewhere would change the behavior of when it would lock up.
You thought you had it fixed and after 50+ or as much as 208 iterations it would still lock-up.

Only after following the suggestion by catha, clear the interrupt flags twice, everything became reliable.

Since that, still, does not make much sense I got the scope out and verified that indeed the DIO0 pin of the RFM95 module stayed high after the first clearing of the interrupt register 0x12.
Immediately after the second clearing of 0x12 the signal drops.

So the question remains: why is this chip only responding to a command the second time.
I checked timing of the spi signals, etc. etc. and all is well.
Taking it further would require some kind of logic analyzer to which I have no access.

As to 1.76 giving a compile error. I believe this is due to a bit to early upload for a last minute fix for the RSSI issue. I've notified Mike M about it.

rgds.
Pim
 
Last edited:
Pim,

I did not mean to imply that the double clear of the IRQ wasn't needed as your post (#39) clearly showed that the first clear was not reliably working for you.

I do think that the _mode sets should precede the I/O ops as it closes the potential of a "fast" interrupt getting in between and entering the IRQ handler without the proper _mode. When that happens. the IRQ source is cleared followed by the _mode being set which locks up the code. That problem provably existed for me.

So now, with your and Catha's work we have found 2 problems. The fact that v1.76 works without the _mode set change does not mean that the race condition isn't still there. Slight changes to the code timing or the CPU clock frequency may cause it to reappear. That was the point of my previous message.
 
Hi,
1. Version 1.77 of the RadioHead library is published, compile errors are gone.
2. Yes the race condition is still there somewhere and it will bite me one day but at least I'm warned.
But it must be buried deep being capable of getting the Semtech chip in such a state.
Enjoy the weekend.

Pim
 
KurtE - have you been working with your LORA units? Does this update apply to any issues you saw or worked around? I never powered mine up as I never saw that they were known to work yet.
 
Nope, I have not been doing anything recently with the LORA... Keep meaning to get back to them, but to many other distractions.
 
I have had two 3.2's communicating over LoRa without lock-up issues for quite some time now.

However, I have not had any problems since the fix that was incorporated into Teensyduino 1.36 (which is what I am using). I have never seen the issue that Catha & Pim identified requiring the double int source clear, although Pim clearly demonstrated that it exists on his setup.

Perhaps it's a difference in Semtech versions? All my testing has been with Adafruit Feather 433 LoRa units. I have several Hope-sourced units that I could try swapping in...
 
Back
Top