meaning of "with LTO"

Status
Not open for further replies.

NikoTeen

Well-known member
hi,
a search in this forum for "with LTO" brings a lot of threads but nothing concerning this type of compiler optimization.

My project on a Teensy 3.2:
  • A 433 MHz receiver outputs digital signals from a wind sensor and a temperature sensor.
  • These outputs are decoded by an interrupt routine, stored into a data array and flagged when a data set is available.
  • In loop() the program checks if there are results from the interrupt routine. When a data set is flagged as available the data are interpreted and printed on the Serial Monitor. There is also a counter, Scount, which counts how many times loop() is running until a new data set is available.

During testing I have curious effects:
1) with optimization fast
- loop counter contains reasonable values, shown on the Serial Monitor by Serial.print(Scount)
- some of the data sets are missed (because the RF signal to the 433 MHz receiver is simulated I know exactly what data sets should be received)

2) with optimization debug
- for loop counter (type word) the Serial Monitor shows values much higher than possible with 16 bits!
- all data sets are received and correct

3) with optimization fast with LTO
- loop counter is ok
- all data sets are ok

The effects of optimization took a lot of time for debugging. And I could not find the reasons.

Now with "fast with LTO" I have a solution. But I would like to know what happened with the other optimizations and what is specific with "with LTO".
Can anybody explain it to me or where can I find explanations?

A frustrated NikoTeen:confused:
 
LTO means Link Time Optimization.

Normally each file is compiled to "relative" machine code, meaning placeholders are left for the addresses of functions and variables from other files. Then the linker simply concatenates them all together, replacing all those relative addresses with the actual addresses where every function and variable actually ended up in memory. The linker can detect when functions and variables aren't accessed by anything and automatically exclude them. It also does a few other simple things, but generally it can't alter the compiled code.

LTO only partially compiles all your files. Then when the linker is run, all that code is run through a number of compiler steps to optimize and compile it into the final machine code. In some cases, this allows much more optimization, because the compiler is able to consider how a function or variable is actually used from everywhere else in your program. Essentially the last step is optimization across your entire project, which would have been impossible normally.

Currently we never use LTO as the default for Teensy 3 & 4. It tends to expose subtle bugs. Most of those are probably bugs in the libraries or Teensy's core lib. Some may be compiler bugs, since LTO is relatively new.

Please post complete programs to reproduce these problems you've found, and clearly say which optimization setting to use. These sorts of bugs aren't ever going to be fixed if we don't have test cases...
 
LTO means Link Time Optimization.

Currently we never use LTO as the default for Teensy 3 & 4. It tends to expose subtle bugs. Most of those are probably bugs in the libraries or Teensy's core lib. Some may be compiler bugs, since LTO is relatively new.
You say that LTO is buggy. But in my case only LTO generates correct code. Even optimization "debug", which in my understanding is no optimization at all, is printing the variable Scount as some hundred millions whereas with optimization "fast with LTO" the variable is printed as about 7 which is realistic. The receiver sends data sets every 7 sec. Scount is counting how many times loop() is repeated until a new data set is received. A Teensy 3.2, 96 MHz, cannot run loop(), which contains a delay(1000), some hundred million times within 7 sec.
Because I suspected a memory leak I printed other variables which are declared directly beneath the variable Scount. These variables were printed correctly.
During writing this post my sketch is running in parallel with optimization "debug". I had a look on the outputs on the Serial Monitor and saw that sometimes, only sometimes, the values of Scount are correct like always with "fast with LTO" :confused:.
What is going on here?

Please post complete programs to reproduce these problems you've found, and clearly say which optimization setting to use. These sorts of bugs aren't ever going to be fixed if we don't have test cases...
To reproduce the effects it is not sufficient to have the code but it is necessary to have a 433 MHz OOK-receiver and a 433 MHz transmitter which sends known signals in order to be able to check the received data.
If you really would take the effort to build such an equipment then I would send you the code. I also could provide the code for simulation of the defined data to be transmitted.
 
You say that LTO is buggy.
...

Seems you are using a T_3.2? The T_4 is buggy and the T_3.5/3.6 are perhaps not perfect.

When LTO works it is nice - and it seemed it generally did on T_3.2. But when it makes a bad decision on unused code or shared code re-order - it fails.

For the failures with FAST or the other settings, a complete sample sketch as noted in p#3 would be the only way to understand what is wrong.
 
Without seeing the code it's difficult to tell, but based on your description, the reason could be a bug in how you communicate between the interrupt routine and the main loop. Remember that interrupts may happen anywhere in the main loop if you don't disable them. If, for example, you read two values in the main loop and determine new values based on them, the interrupt could happen between the two reads and mess up the logic. The compiler can even change the order in which the values are fetched. Worse, some fetches which you may think are done atomically with a single instruction may end up being two separate instructions (e.g. reading a 16-bit variable on an 8-bit MCU). You need to be super careful when writing this communication code between interrupts and the main loop.

Furthermore, it's important to declare variables used for communicating between interrupts and the main loop "volatile" so that compiler doesn't try to optimize memory fetching of these variables. If you have something like:
Code:
static bool s_wait_interrupt;
while(s_wait_interrupt) {}
Without volatile, s_wait_interrupt could be fetched into a register before the while loop and never read from memory again inside the loop (resulting in an infinite loop), because the compiler doesn't realize that the variable could be updated by an interrupt outside the main program flow. It can be helpful to inspect the generated assembly code. Declaring the variable volatile forces the compiler to read it from memory every time it is accessed.
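For reference, here is a minimal sketch of the volatile version that also runs on a desktop compiler. It is not Teensy code: a standard C++ signal handler stands in for the interrupt routine, and the names on_fake_interrupt and wait_for_interrupt are made up for illustration. Because the signal is raised from inside the loop (so the sketch terminates), it only shows the declaration pattern; it does not reproduce the hang you can get without volatile.

```cpp
#include <cassert>
#include <csignal>

// Flag shared between the "interrupt" and the main flow.
// volatile forces a real memory read on every loop iteration.
static volatile std::sig_atomic_t s_wait_interrupt = 1;

// Stand-in for an interrupt routine (hypothetical name).
extern "C" void on_fake_interrupt(int) {
  s_wait_interrupt = 0;
}

// Spins until the handler clears the flag and returns the number of polls.
// The signal is raised from inside the loop only so that the sketch ends on
// a desktop; real hardware would fire the interrupt asynchronously.
unsigned wait_for_interrupt(unsigned raise_after) {
  assert(raise_after > 0);
  auto previous = std::signal(SIGINT, on_fake_interrupt);
  assert(previous != SIG_ERR);

  s_wait_interrupt = 1;
  unsigned polls = 0;
  while (s_wait_interrupt) {
    ++polls;
    if (polls == raise_after) std::raise(SIGINT);
  }

  std::signal(SIGINT, previous);  // restore the earlier handler
  return polls;
}
```

On a Teensy the handler would be whatever function you attach to the pin interrupt, and the flag would typically be a volatile bool or volatile uint8_t.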

So, understanding these two things, depending on compiler/linker optimizations, compiler may end up organizing the code differently so that things just happen to work in your case. In other words, you just happen to get lucky. But if you change your code, the compiler may end up organizing code differently and break the LTO build as well.
 
... optimization "debug", which in my understanding is no optimization at all

The "debug" menu item causes gcc to be run with "-Og", which gcc's documentation describes as "-Og enables all -O1 optimization flags except for those that may interfere with debugging". It does indeed apply many optimizations.

A simple way to get a good idea of how much each setting optimizes code is the CoreMark benchmark.

https://github.com/PaulStoffregen/CoreMark
 
JarkkoL's example is a strong argument for the "Always post complete source code" warning at the top of the forum. With access to the source JarkkoL and others could certainly spot potential problems in the communication between the interrupt routine and the main loop. No source code, and you get only hypothetical advice.
 
To give a bit more context, consider the following innocent looking code:
Code:
  static uint8_t s_buffer[256]; // interrupt fills the ring buffer with data
  static uint8_t s_buffer_pos; // interrupt updates the buffer position
  static uint8_t s_interrupt_triggered; // interrupt sets to 1 when it was triggered
  ...

  // in main loop
  if(s_interrupt_triggered)
  {
    uint8_t data=s_buffer[s_buffer_pos];
    ...
    s_interrupt_triggered=0;
  }
  ...

  // interrupt routine
  uint8_t data=...; // fetch data from a sensor
  s_buffer[++s_buffer_pos]=data;
  s_interrupt_triggered=1;
However, there are many things that are wrong here:
- s_interrupt_triggered & s_buffer_pos could be (in theory) fetched from memory to registers only once during the whole program and never updated from memory again (should be declared as "volatile"). It really depends on the surrounding code complexity what happens without volatile.
- another interrupt could happen immediately after "if(s_interrupt_triggered)" statement messing up the logic and miss data (should precede the if-statement with noInterrupts() and follow with interrupts())
- even with noInterrupts() there could be two (or more) interrupts happening before getting to the main loop if-statement thus you would miss data. You should probably maintain separate read & write positions instead to deal with multiple interrupts within single loop iteration.
- the memory fetch for "uint8_t data=s_buffer[s_buffer_pos]" could be (in theory) moved before the if-statement by the compiler, if for some reason compiler considers that to be a better option. You should add an "acquire" memory fence here to prevent this reordering: std::atomic_thread_fence(std::memory_order_acquire);

There are likely other issues as well, but this is just to show how challenging it can be to write robust code with interrupts even for a simple case like this. And it can be very difficult to reproduce the issues as they may be related to compiler/linker options or timing that happen quite rarely.
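As a rough sketch of the "separate read & write positions" idea, something like the following could work. It is one possible design, not a drop-in fix: it assumes the toolchain provides <atomic>, that exactly one interrupt routine writes and only the main loop reads, and that 8-bit atomics are lock-free on the target. The names IsrRing, push, pop and ring_usage_example are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-producer/single-consumer ring buffer (hypothetical type).
// One slot always stays free, so it holds at most 255 bytes.
struct IsrRing {
  std::uint8_t data[256];
  std::atomic<std::uint8_t> head{0};  // written only by the interrupt routine
  std::atomic<std::uint8_t> tail{0};  // written only by the main loop

  // Call from the interrupt routine. Returns false if the buffer is full.
  bool push(std::uint8_t value) {
    std::uint8_t h = head.load(std::memory_order_relaxed);
    std::uint8_t next = static_cast<std::uint8_t>(h + 1);  // wraps at 256
    if (next == tail.load(std::memory_order_acquire)) return false;
    data[h] = value;
    head.store(next, std::memory_order_release);  // publish after the write
    return true;
  }

  // Call from the main loop. Returns false if nothing is waiting.
  bool pop(std::uint8_t &out) {
    std::uint8_t t = tail.load(std::memory_order_relaxed);
    if (t == head.load(std::memory_order_acquire)) return false;
    out = data[t];
    tail.store(static_cast<std::uint8_t>(t + 1), std::memory_order_release);
    return true;
  }
};

// Usage sketch: two "interrupts" arrive before the main loop looks.
inline void ring_usage_example() {
  IsrRing ring;
  ring.push(0x11);  // first interrupt
  ring.push(0x22);  // second interrupt, the first byte is not overwritten
  std::uint8_t value = 0;
  unsigned count = 0;
  while (ring.pop(value)) ++count;  // main loop drains everything
  assert(count == 2 && value == 0x22);
}
```

Since each index has exactly one writer, the main loop does not need a separate "triggered" flag and does not have to disable interrupts just to read a byte.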
 
Because my program is quite complex and uses external interrupts, I tried to reduce it to a version which can be posted here and made several tests. But then the effects are gone. This leads me to the conclusion that something is wrong with my interrupt handling. Therefore I have some questions about your statements.
(the reduced program is attached to this reply)

However, there are many things that are wrong here:
- s_interrupt_triggered & s_buffer_pos could be (in theory) fetched from memory to registers only once during the whole program and never updated from memory again (should be declared as "volatile"). It really depends on the surrounding code complexity what happens without volatile.
This is ok: all variables in my program touched by the interrupt routine are declared volatile.

- another interrupt could happen immediately after "if(s_interrupt_triggered)" statement messing up the logic and miss data (should precede the if-statement with noInterrupts() and follow with interrupts())
If another interrupt happens here which is not changing s_buffer or s_buffer_pos, why could it mess up the logic?
I only stopped the specific external interrupt (not all interrupts), then copied variables used by the interrupt to local variables, and then enabled the external interrupt. Pulses triggering the external interrupt can follow each other within some tens of microseconds.

- even with noInterrupts() there could be two (or more) interrupts happening before getting to the main loop if-statement thus you would miss data. You should probably maintain separate read & write positions instead to deal with multiple interrupts within single loop iteration.
Again, I don't understand why these interrupts would cause data to be missed. Of course, if they are changing s_buffer or s_buffer_pos then wrong/changed data would be read.

- the memory fetch for "uint8_t data=s_buffer[s_buffer_pos]" could be (in theory) moved before the if-statement by the compiler, if for some reason compiler considers that to be a better option. You should add an "acquire" memory fence here to prevent this reordering: std::atomic_thread_fence(std::memory_order_acquire);
If really the compiler would move "uint8_t data=s_buffer[s_buffer_pos]" before the if-statement, then this is a failure of the compiler. In this case the compiler is changing the logic of the program!
 

Attachments

  • Oregondec.zip (7 KB)
You don't need to make all the variables used by the interrupt function volatile, just the ones shared between interrupt function and the main loop. Anyway, it doesn't hurt the validity of the program, only the performance. And sorry about the confusion with "another interrupt". I meant the same interrupt function, but another invocation of it. Moving the memory fetch before the if-statement wouldn't change the logic of the program and compiler is free to make such changes in the program flow. If you declare a variable as volatile though, it also prevents reordering across other volatile variable accesses, but not across non-volatile.
 
Moving the memory fetch before the if-statement wouldn't change the logic of the program and compiler is free to make such changes in the program flow.
Can you, please, give us a reference, that compiler can change the scope of a variable from local to higher levels? (moving "uint8_t data" from inside to outside the "{…}")
 
You can check for example this post about compiler memory access reordering: https://preshing.com/20120625/memory-ordering-at-compile-time/

The scope of local variables isn't the crux here, but the flexibility of memory access reordering during compilation while maintaining "the behavior of a single-threaded program". Compiler is free to rearrange local variable allocation as well, e.g. all the local variables in a function could be allocated from the stack in the beginning of the function, regardless of their scope within the function.
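To make the if(flag) case concrete, here is a small desktop sketch of where the fences would go. It uses std::atomic_signal_fence rather than std::atomic_thread_fence, on the assumption that the interrupt and the main loop run on the same core, so only compile-time reordering has to be stopped. A signal handler again stands in for the interrupt, and Sample, on_fake_sample, simulate_interrupt and try_read_sample are made-up names.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

struct Sample { int wind; int temperature; };  // hypothetical payload

static Sample s_sample;                   // plain data, written by the handler
static std::atomic<bool> s_ready{false};  // flag checked by the main flow

// Stand-in for the interrupt routine: write the data first, then the flag.
extern "C" void on_fake_sample(int) {
  s_sample.wind = 12;
  s_sample.temperature = 21;
  std::atomic_signal_fence(std::memory_order_release);  // data before flag
  s_ready.store(true, std::memory_order_relaxed);
}

// Desktop-only helper that "fires" the interrupt once.
void simulate_interrupt() {
  auto previous = std::signal(SIGINT, on_fake_sample);
  assert(previous != SIG_ERR);
  std::raise(SIGINT);
  std::signal(SIGINT, previous);  // restore the earlier handler
}

// Main-loop side: the fence keeps the copy of s_sample after the flag check.
bool try_read_sample(Sample &out) {
  if (!s_ready.load(std::memory_order_relaxed)) return false;
  std::atomic_signal_fence(std::memory_order_acquire);  // flag before data
  out = s_sample;
  s_ready.store(false, std::memory_order_relaxed);
  return true;
}
```

This only pins down the order of the accesses. If a second interrupt can arrive while the main loop is still copying s_sample, the copy itself still has to be protected, for example by masking that interrupt during the copy or by using separate read and write positions.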
 
You can check for example this post about compiler memory access reordering: https://preshing.com/20120625/memory-ordering-at-compile-time/

thanks for link, but I'm not sure if that answers my question on the particular example you presented in #9.
I read the example in that post that the compiler can change the sequence of operations within the "{…}" which may have side effects outside, e.g. in an other thread.
Also, all variables there were declared global, whereas in #9 you used a local variable as the destination of the memory access.

Nevertheless, this is something I have to investigate more, as I use if(flag) continuously.
 