PDA

View Full Version : I don't think critical sections work (interrupt locking)



froeber
08-23-2013, 05:06 AM
I'm trying to port some code I've used on many other systems/OSes. It uses critical sections to safeguard manipulating critical data. When I look at the Teensy 3 code it looks like the code simply uses __disable_irq and __enable_irq to surround critical sections. The problem is that those functions unconditionally disable and reenable interrupts. The problem comes up if one piece of code has a critical section in it so calls __disable_irq before calling some other function from inside its critical section that also has a critical section. When that nested function calls __enable_irq at the end of its critical section it would allow interrupts again even though the code was still executing inside the critical section of the calling function. That would be wrong and could cause problems that were very hard (actually almost impossible) to hunt down since the outer critical section code would no longer be protected from interrupts. Without looking at every use of __disable/enable_irq in the Teensy software and all the libraries I don't know if there is an actual problem due to this issue but I worry there could be. And it's an issue for the code I'm writing where I want to implement event logging to trace all the operations of a program running on the Teensy so I can figure out why I'm getting hard faults when using low power modes.

When I've written Cortex M3 code I've used critical section code that was:



static int lock_nesting_count; /// Tracks nested critical sections

/**
* Enter a critical section in way that supports nested calls.
*
* @details
* This function can be called to allow code to enter a critical section.
* It differs from the CMSIS function __disable_irq() (which it uses) in that
* it supports nested calls. See file header for more info.
*
* Critical sections entered by a call to this function should be exited
* by a call to the core_critical_exit() function defined below.
*/
void
core_critical_enter(void)
{
__disable_irq();
++lock_nesting_count;
}

/**
* Exit critical section and enable interrupts if needed.
*
* @details
* This function can be called to exit a critical section that was entered
* by a call to the core_critical_enter() function. This function should be
* used in place of the __enable_irq() function because it correctly handles
* nested calls.
*/
void
core_critical_exit(void)
{
/* Unlock interrupts only when we are exiting the outermost nested call. */
--lock_nesting_count;
if (lock_nesting_count == 0) {
__enable_irq();
}
}


Looking at the Freescale MQX RTOS for the processor used on the T3 I see they use a similar but even more complex implementation to account for nesting in their mqx_prv.h file. They seem to be keeping track of different status register values to maybe maintain different task priority levels. But that's something we don't need with this Teensy 3 code. Their code is:

#if MQX_USE_INTERRUPTS

#define _INT_DISABLE_CODE() \
if (kernel_data->ACTIVE_PTR->DISABLED_LEVEL == 0) { \
_PSP_SET_DISABLE_SR(kernel_data->DISABLE_SR); \
} /* Endif */ \
++kernel_data->ACTIVE_PTR->DISABLED_LEVEL;

#define _INT_ENABLE_CODE() \
if (kernel_data->ACTIVE_PTR->DISABLED_LEVEL) { \
if (--kernel_data->ACTIVE_PTR->DISABLED_LEVEL == 0) { \
if (kernel_data->IN_ISR) { \
_PSP_SET_ENABLE_SR(kernel_data->INTERRUPT_CONTEXT_PTR->ENABLE_SR); \
} else { \
_PSP_SET_ENABLE_SR(kernel_data->ACTIVE_SR); \
} /* Endif */ \
} /* Endif */ \
} /* Endif */
#else
#define _INT_DISABLE_CODE()
#define _INT_ENABLE_CODE()
#endif

Since I know bugs caused by incorrect handling of nested critical sections can be hard to handle I'm just wondering if anyone (Paul?) has checked the use of the critical sections already in the code to make sure there is no nesting. I think for my code I'll just go back to using the same code I used on the M3 processors. Just wanted to provide others a head's up if they start writing critical section code that has any nested calls in it. (I do realize critical sections should be kept very short but I have often seen cases where they make function calls that can cause these problems and I have been burnt by these types of problems before requiring much wasted debug time; typically fixing someone else's code).

froeber
08-23-2013, 11:19 AM
Just realized that I can't just use my old Cortex M3 critical section code. Any critical section code that tracks critical section nesting levels only works right if ALL critical sections use the same enter/exit functions so that nesting is properly tracked. This would mean all the T3 code and libraries would have to use the same code. Probably not going to happen.

The AVR processors that normally run Arduino can easily do recoverable interrupt locking (ie critical sections can be safely nested) using:

uint8_t intr_state,;
intr_state = SREG;
cli();

SREG = intr_state;

Paul has provided as "SREG emulation" in the avr_emulation.h code. Maybe I can just use AVR style critical sections to get safe code?

tni
08-23-2013, 11:20 AM
IMO, there are extremely few legitimate uses for a recursive mutex. They are evil and often lead to nasty bugs.

If you have interrupts disabled, you should really understand everything you are calling. Otherwise you can get interrupt latency spikes that can also manifest as extremely hard to track down bugs. IMO, your 'cure' is as bad as the disease.

froeber
08-23-2013, 11:42 AM
IMO, there are extremely few legitimate uses for a recursive mutex. They are evil and often lead to nasty bugs.

If you have interrupts disabled, you should really understand everything you are calling. Otherwise you can get interrupt latency spikes that can also manifest as extremely hard to track down bugs. IMO, your 'cure' is as bad as the disease.

@tni I like your characterization of the problem I am talking about as a recursive mutex problem. Because the use of interrupt locking is really just like a simple mutex to protect critical sections. And I agree that recursive mutexes should be avoided where possible. And I agree that you have to be very careful to understand the latency implications of any use of recursive mutexes. But I disagree that they shouldn't be used.

There are often cases in complex code where nested mutexes legitimately come up. You realize of course there is a huge amount of work that has been done to handle these situations right to avoid problems like priority inversion and such. It's nice to say we shouldn't nest critical sections but the only way to guarantee it doesn't happen is to require that code in a critical section makes no function calls at all. Because any function that is called might have its own critical section in it (or, worse, have a critical section added during some maintenance sweep). And as I have said, using the __disable_irq and __enable_irq calls for critical sections will cause broken code that will be broken in very hard to find ways since the result would be an unprotected critical section that could result in corruption of the code being protected by an interrupt happening at just the wrong time running just the wrong code such that it manipulated the same variables being protected by the first critical section.

I would respectfully say that while I think it's nice to say one shouldn't nest critical sections, it's important to realize it happens a lot in "real" applications and we better be able to handle it right or risk difficult to find bugs. And since the T3 processor brings a lot more processing power to the table than the typical Arduino AVR core, T3 boards might start getting used for more complex applications that could run into these issues. And, as I said, I'm running into the issues with the event logging I'm adding to the T3 code to track operations. In fact, I'm adding the same type of code that I added to VxWorks years ago that Wind River then adopted and used as the basis for their WindView product.

PaulStoffregen
08-23-2013, 12:17 PM
First, a little background.....

On AVR, the read SREG, disable interrupts, do stuff, then restore SREG approach is necessary because AVR has such a simple interrupt system. The main concern is code executed with interrupt routines. If you set that bit while executing an interrupt routine, all interrupts (including that one) become enabled again and can recursively interrupt, which rapidly overwrites the entire RAM and effectively crashes the program. So on AVR, this approach is usually used.

ARM has a much more sophisticated system. It's safe to use __disable_irq() and __enable_irq() inside interrupt routines. Using __disable_irq() will prevent higher priority interrupts. Using __enable_irq() puts things back the way they were, but it does NOT have the dire effects that setting the global interrupt enable bit on AVR would have if used within an interrupt. ARM's NVIC (Nested Vector Interrupt Controller) has a state bit for every interrupt, so the same interrupt can never interrupt itself. It also allows configure a lower priority level for an interrupt, so if you want some interrupts to be able to interrupt others, that's all handled by the NVIC hardware. You don't build this functionality yourself using __disable_irq() and __enable_irq(), as you would do on AVR.

So, regarding your specific issue:



Without looking at every use of __disable/enable_irq in the Teensy software and all the libraries I don't know if there is an actual problem due to this issue but I worry there could be. And it's an issue for the code I'm writing where I want to implement event logging to trace all the operations of a program running on the Teensy so I can figure out why I'm getting hard faults when using low power modes.


You're going to have to design around the reality that lots of code contains already __enable_irq().

I would recommend implementing your event logging code using the system supervisor call exception. Teensyduino's mk20dx128.c defines this as void svcall_isr(void) with a weak alias to the unused function. So all you should need to do is place your event logging code in svcall_isr(). To call your function, you'll probably use asm("svc #0"). Unfortunately, passing parameters or data to it is tricky. The approach is to push the data onto the stack, do the svc instruction, then pop or adjust the stack pointer after svc. Inside the exception code, you can grab the parameters off the stack, adjusted for the 12 words that got pushed while entering the exception.

While this involves a little inline assembly to pass parameters, the huge advantage of the system call exception is your code executes with elevated priority. You can use __disable_irq() and __enable_irq() at will. Using __enable_irq() within your logging code will never enable lower priority code to run. The NVIC is pretty nice like that. :)

The other huge advantage of this approach is you can configure the NVIC priority levels for your system call exception and other interrupts. For example, if you wanted your code to be able to transmit on the USB or a hardware serial port, you could adjust the interrupt priorities so the USB or serial interrupt is able to interrupt your exception, but of course only when __disable_irq() isn't in effect. You can call the normal code which uses USB, containing __disable_irq() and __enable_irq(), and as long as you've set the USB interrupt to a higher priority, it will work properly. You could write to the NVIC registers to lower the priority of all other interrupts, so the only interrupts that can interrupt your syscall handler are the ones you need for communication.

froeber
08-23-2013, 01:49 PM
@paul, Thanks, as always, for the nice explanation. The ARM Cortex M4 interrupt model is pretty slick and sophisticated. And, even though it seems no use is made with the current code of the flexible priority scheme, it does seem pretty nice that you could set up priorities to allow special interrupt handling and nesting to happen. I've seen where many RTOS's and OS's use that to great advantage to allow efficient operation under various conditions.

But my main concern was about making sure that critical sections do what they are supposed to -- (as you well know) keeping interrupt code from messing up what is supposed to be a group of atomic operations to variables in your program. It would seem to me that the possibility of nested critical sections would be something that has to be accounted for and handled. Don't you agree that the current __disable_irq, __enable_irq method of doing critical sections doesn't allow nesting? So, the solution is to either make sure there is no nesting of critical sections by examining all the code that uses the disable/enable method and all the functions that that code calls within any critical sections. Or, using some critical section protection method that allows nested critical sections without prematurely unlocking the critical section.

Looking at your SREG emulation implementation, I believe it offers exactly what I'm looking for (it's a pretty slick implementation itself). Using the AVR style code I showed above with your SREG emulation looks to me to "do the right thing" since it allows one to preserve a PRIMASK value of 1 (ie locked interrupts) until you get back to the end of the critical section that originally set PRIMASK. And it does that without preventing any of the benefits you mention for being able to adjust priorities of interrupts. And it only costs a few machine cycles over the simpler but potentially unsafe disable/enable method. At least that's my take.

Also, your suggestion that I use SVC calls to implement the logging is certainly has merit. It would be a good way to do it but only if execution time was of little consideration. Because, in my book, SVC calls are so heavyweight and involve so much overhead. Yes, they are needed for things like RTOS's implementing task level memory protection, but would kill performance for the type of logging I'm doing. I'm looking to add logging calls to all "change of state" cases like interrupt handler entry and exit so you record and can get a time based history of what the system is doing. Most RTOS vendors now have tools that provide such capability based on instrumentation in their kernels of the type I am adding and it makes a huge difference trying to debug complex timing related problems. But to be effective it has to be low overhead. An example of the type of info I'm looking to get is the same as shown with what is now called the Wind River System Viewer (I guess they changed the name from the original WindView somewhere along the line).
http://www.windriver.com/products/workbench/workbench-details-2.html#details_2

So, I think I am actually good with using your SREG emulation to allow safe nesting of my calls. I hope people realize that they can't safely nest critical sections using the normal method. And I hope there aren't any such nested calls anywhere in the core Teensy 3 code since that could make for some subtle bugs. Not that I suspect there are any such cases, it's just that I don't know that there aren't.

I would appreciate if you might indicate if my thinking on the safety of using the SREG emulation makes sense -- and doesn't really compromise the flexibility of the NVIC due to the way you implemented the emulation.

tni
08-23-2013, 02:33 PM
But my main concern was about making sure that critical sections do what they are supposed to -- (as you well know) keeping interrupt code from messing up what is supposed to be a group of atomic operations to variables in your program. It would seem to me that the possibility of nested critical sections would be something that has to be accounted for and handled.
You are not fully solving the problem. What happens in recursive cases?

E.g. you have a memory allocator that wants to report an error. The error reporting code in turn makes a memory allocation, the memory allocation code relies on atomic access, but since you can get reentrance with the recursive lock, you may just get silent corruption of your data structures.

The same recursive behavior can easily happen with callbacks and whatnot.

Whenever you hold a lock (have interrupts disabled), you need to understand what you call. Calling random 3rd party code is a recipe for disaster.

I have worked on large concurrent code bases (10 million lines of code), where people thought recursive locks were a great idea. Sure, they 'fix' a set of issues. But they do so at the expense of burying hard to find time bombs everywhere.

IMO, if you are worried about the recursive interrupt disabling, the right thing is error out (if you try to disable interrupts, when they are already disabled, you hang the CPU and blink an LED) and not to duct-tape things with recursive locks.

froeber
08-23-2013, 03:44 PM
@paul, Again, I agree with your analysis. You write wonderful code and clearly know your stuff. Not everyone has the same breadth of experience though so I thought it was useful to make sure things were done right. And there are serious bugs lurking in incorrect use of the standard critical section method if people nest calls. And by this, I mean nest them with forethought but the assumption that things will work right and be safe.

As with you, I have a good deal of real time experience. When I worked at Raytheon in the 80's and 90's I was one of the more experienced real time OS experts at the company. One older guy I worked with is still the head of the Posix real time committee. He and I would often have to go argue with vendors (eg DEC, Sun, SGI) about why their OSes had to work a certain way. I had code access to most of those OSes. But I only mention that to indicate that I too know what I'm talking about.

Let's not argue the benefits or drawbacks of nested critical sections. Yes, most definitely they can be dangerous and you have to be careful. But clearly with all your experience you must admit you have used them to necessary effect at times? I just was hoping to get your opinion as to whether you agreed with my thoughts on being able to use your SREG emulation to safely do nested interrupt locks. Since you have clearly studied up on the Teensy processor more than I have I was wondering if you thought there might be any gotcha's with my approach?

Thanks.

tni
08-23-2013, 04:26 PM
Since you bring up Posix, here is what the guy responsible for recursive locks thinks:
http://groups-beta.google.com/group/comp.programming.threads/msg/d835f2f6ef8aed99?hl=en

PaulStoffregen
08-23-2013, 07:52 PM
Don't you agree that the current __disable_irq, __enable_irq method of doing critical sections doesn't allow nesting?


Yes, I agree. Nesting critical sections is not feasible on Teensy3.



So, the solution is to either make sure there is no nesting of critical sections by examining all the code that uses the disable/enable method and all the functions that that code calls within any critical sections.


Yes, absolutely. If you're using __disable_irq(), you can't expect a general nesting scheme to work together with code from other sources, including Teensy3's core library. You can only use such an approach solely within your own code base. As soon as you call any function from any library, include Teensy3's core library, all bets are off.



Or, using some critical section protection method that allows nested critical sections without prematurely unlocking the critical section.


That could work in an alternate universe where you every library and every other project standardized on using only the nested locking code. Such an alternate universe might look like a conventional operating system or RTOS that utilizes the ARM user/supervisor modes?

But on Teensy3 with "bare metal" programming that leverages Arduino libraries from many different sources, you simply can't depend on everyone else (or really, *anyone* else) also using the same nested locking mechanism.

froeber
09-16-2014, 03:43 PM
Hi all, I've seen this issue resurface as part of the discussions about the new TeensyDuino release. After the whole discussion above I did follow up on the post from @tni about the Posix discussions of nested locking. They were very interesting yet didn't change my mind based on 30+ years of doing safety critical SW code. I also revisited the issue with my friend who heads the POSIX Real Time group and he agreed; sometimes you need nested locks to get things to work right. I'm too busy to waste a lot of time arguing this issue. And I totally agree with Paul that getting all the code to work correctly in the face of potential nested locking issues is hard if not impossible. But I will at least post the code I'm using for my code where I know there is potential for locks to be nested. It avoids the problem.


static inline uint32_t sl_hw_lock(void) __attribute__((always_inline, unused));
static inline void sl_hw_unlock(uint32_t primask) __attribute__((always_inline,
unused));

/**
* Lock interrupts to start a critical section.
*
* When used with sl_hw_unlock() this function starts a critical section that
* works properly even if nested in another critical section because it reads
* the PRIMASK value so it can be restored.
*
* @return the priority mask (PRIMASK) register value upon entry
*/
static inline uint32_t
sl_hw_lock(void)
{
uint32_t primask;

asm volatile("mrs %0, primask\n" : "=r" (primask)::);
__disable_irq();
return primask;
}

/**
* Unlock interrupts to end a critical section.
*
* When used with sl_hw_lock() this function ends a critical section that works
* properly even if nested in another critical section because it uses the
* previous interrupt locking state (defined by PRIMASK) to selectively unlock
* interrupts.
*
* @param primask The previous priority mask (PRIMASK) register value as
* returned by sl_hw_lock().
*/
static inline void
sl_hw_unlock(uint32_t primask)
{
if (primask == 0) {
__enable_irq();
}
}