Generated Code of teensy3.6

Status
Not open for further replies.

ossi

Well-known member
Is it possible to have a look at the code that Teensyduino generates? I want to see what optimizations are done.
 
Yes.

First you need to find the temporary folder where it's creating the files. The easiest way is to turn on verbose output while compiling in File > Preferences. Then look at some of the compiler commands to find the temporary folder pathname. On Mac and Windows, those locations are often hidden folders, so a little more work is needed to access them.

When you find that folder, you'll see Arduino creates a .lst file that shows you the assembly code it generated, and a .sym file with the memory assignments for all statically allocated variables.

This feature is specific to Teensy. Most other Arduino compatible boards have a "platform.txt" file which does not cause the .lst and .sym files to be created.
 
It does the optimizations that GCC does. There are no Teensyduino-specific code optimizations other than those in the source code. If you turn on "verbose", you see the GCC command-line switches that are used.
 
Thanks, that was exactly the information I needed. Seems I have opened Pandora's box somehow. Now I have to learn ARM Code and see what an optimizing compiler generates.
 
There are many good books on ARM assembly and ARM architecture. Something to keep in mind when reading these is Teensy (and all Cortex-M chips) use only the Thumb mode instructions. If you find a book that talks about ARM & Thumb mode, you can ignore all the info about non-Thumb mode stuff, and subjects like "interworking" stuff compilers do between the 2 modes.

While it doesn't go into deep detail on ARM assembly, and it's kind of written with the assumption you already at least sort-of know ARM assembly, this book gives a lot of really essential info.

https://www.amazon.com/dp/0124080820

If you're just trying to optimize code, you might not really need to know these deeper details.

On Thumb assembly, if you're used to the assembly instructions used on 8 bit chips, the thing that takes the most adjustment (at least for me) is how function calls work. There's no "call" instruction which pushes the return address onto the stack. Instead, one of the 16 registers is dedicated to the return address, called the Link register (LR). Functions are called with the "Branch Link" instruction, which copies the return address to that register, as it jumps to the function. When you wish to return, the LR is just copied back to the program counter (PC). By default there's no pushing onto the stack, only copying between registers. If the called function needs to call something else, it needs to either copy LR into one of the registers called functions preserve, or push it onto the stack. If you're used to call and return instructions, this copying between registers, and different handling depending on what the function does can take a little mental adjustment.
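To make the leaf versus non-leaf difference concrete, here is a minimal C sketch. The function names are made up for illustration, and the instructions named in the comments are only what GCC typically emits for Thumb; the exact output depends on compiler version and optimization flags, so check your own .lst file. The `noinline` attribute is a GCC extension, used here only so the call isn't optimized away.

```c
#include <stdint.h>

/* Leaf function: it calls nothing else, so the return address can simply
 * stay in LR. A Thumb build typically returns with "bx lr" and never
 * touches the stack. */
__attribute__((noinline))
int32_t add_leaf(int32_t a, int32_t b)
{
    return a + b;
}

/* Non-leaf function: the "bl add_leaf" below overwrites LR with a new
 * return address, so this function's own return address has to be saved
 * first. GCC typically does that with something like "push {r4, lr}" in
 * the prologue and returns with "pop {r4, pc}", which loads the saved LR
 * value straight into the program counter. */
__attribute__((noinline))
int32_t sum3_nonleaf(int32_t a, int32_t b, int32_t c)
{
    int32_t partial = add_leaf(a, b);  /* compiled as a BL to add_leaf */
    return partial + c;                /* c must survive the call, hence a saved register */
}
```

Compiling something like this and comparing the two functions in the .lst file is a quick way to see the "different handling depending on what the function does" in practice.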
 
There's no "call" instruction which pushes the return address onto the stack. Instead, one of the 16 registers is dedicated to the return address, called the Link register (LR). Functions are called with the "Branch Link" instruction, which copies the return address to that register, as it jumps to the function.

Hi Paul, thanks a lot for your help and the great teensy stuff.

A comment on the BL instruction: You are probably too young to have learnt IBM 360 assembly :) . When I went to university around 1980 we did some IBM 360 assembly programming and learned the BALR (branch and link register) instruction for subroutine calls. So this instruction poses no problems for "oldtimers" like me. I think this sort of instruction fits nicely into RISC instruction sets, where pushing and popping has to be done by the user with additional instructions.
 
Thanks, that was exactly the information I needed. Seems I have opened Pandora's box somehow. Now I have to learn ARM Code and see what an optimizing compiler generates.

:D:D:D

Quite a bit, and of course some places that aren't optimized as well as could be.

Note, GCC has been worked on by hundreds of people in the 32 years since it was first released in 1987, though the number of core people working on it today is much smaller.
 
Yes, some issues listed on the GCC bug tracker have been marked "open" for many years.

Nevertheless, GCC often does a great job. In other cases it is far from optimal.

ARM Cortex doesn't have many registers. I have the feeling this is the root problem, and the register allocator has a difficult job(?)
Well, the ARM compiler sometimes produces 30% faster code.

I fear it's above my pay grade to contribute :)
 
Sometimes I wonder if that compiler has specific optimizations crafted for the coremark code.

I don't know about the ARM compiler, but it is certainly part of the compiler game. And it has been alleged that various compilers work better for building benchmarks than real code.

Generally, I do try to put in general optimizations, but sometimes, you do put something in that particularly helps a particular benchmark.

Of course with benchmarks, it helps if the benchmark actually is close to what is being run by the users. Of the 22 benchmarks in SPEC 2017 speed (and 30 in SPEC 2006 CPU) I tend to think that two benchmarks match typical user code. Perlbench tends to match interpreter workloads, where it might be dominated by a switch statement, but you can get caught by a function that needs to save a lot of registers in the prologue and restore them in the epilogue. Gcc also tends to match normal code, because it has one of the flattest profiles around (i.e. there isn't a single function that the benchmark spends 50% of its time in, which would make it easier to optimize for; instead the highest function is something like 1% -- it is just a lot of data and branches).

In the distant past I was working at a now-dead minicomputer company, Data General. This was about the time when I was moving over from the AOS/VS C compiler I had written the front end for to the 88000 GCC compiler for the AViiON systems. Now, DG's penchant was to name their MV-Eclipse computers with a number that indicated the relative performance. The first machine was the MV/8000 (this was the machine in Tracy Kidder's Soul of a New Machine). The next generation had an MV/4000 and an MV/10000, with the MV/4000 generally being 1/2 the speed and 1/2 the cost of the MV/8000. The third generation had a mid-life kicker for the MV/8000 called the MV/7800. It was supposed to be similar in performance to the MV/8000 at a lower cost. That was the theory.

Now inside, the machine had a really good FP unit that was roughly equal to the MV/8000, but the integer unit was closer to the MV/4000. Unfortunately the customers who bought the MV/7800 mostly used a weird Cobol variant (icobol if memory serves). This language only used integer instructions. So needless to say users were rather angry that they bought this MV/7800 and it didn't run any faster than the less expensive MV/4000.

In the post mortem after the machine was delivered, the perf. group looked at the benchmarks to see what went wrong. Now the benchmark that they used at the time was Whetstone, which is primarily a floating point benchmark that was originally written in Algol 60, and then converted to Fortran. So using Whetstone pretty much showed the speed of the FPU. The performance group looked around at the other benchmarks at the time, and settled on using Dhrystone, which favored integer stuff, which more closely matched what their customers were using. Dhrystone BTW was originally written in Ada and then converted to C, and yes, the name was chosen as a pun against Whetstone.

So, I'm cranking away on making the GCC 88000 compiler work, and we get this call from on high to measure the performance of GCC. Now you have to understand, GCC wasn't the original compiler DG used for the 88000, but Green Hills was. However, Green Hills had some wacky license restrictions that upper management felt they could not sign. So they went in search of other compilers to use. I was off in my little cube. I had a Sun workstation, and I wanted to make Emacs run faster. So I found the initial GCC, and it indeed did make Emacs run faster (mostly because Emacs and GCC were written by the same person, and he put in special optimizations for Emacs into the compiler). But as I was looking at GCC, I saw a 1/2 done port of GCC for the 88000.

I brought this to my manager's attention, and he ran it up the flag pole, eventually getting legal to sign off on using GCC (due to the GPL). The theory was GCC was going to be the free compiler, but if you wanted a better compiler, you could pay for Green Hills.

But people did want to validate using GCC even if GH was supposed to be a better compiler. As I said, Dhrystone was the measure of performance in DG those days. And true enough, GH was faster than GCC. I dug into the benchmark, and discovered the core of Dhrystone is a copy of a 31 byte character literal with strcpy to an array. Now in the original Ada, this copy was pretty fast because Ada has real strings, and the compiler could know the length. The naive C library just does load byte, is it 0, stop, store byte, rinse, lather, repeat. The GH compiler saw that the strcpy was of a constant string literal, and changed this into a memcpy, which they could then inline. Initially GCC was just calling the strcpy function.
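The difference between the two approaches can be sketched in C. This is only an illustration of the idea, not the Green Hills or GCC implementation: the helper names and the string literal are made up, and a real compiler does the transformation internally rather than through functions like these.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical literal standing in for the benchmark's string constant. */
static const char kLiteral[] = "SOME BENCHMARK STRING LITERAL";

/* The naive library approach: load a byte, store it, stop after the
 * terminating 0. The length is discovered one byte at a time at run time. */
char *naive_strcpy(char *dst, const char *src)
{
    char *out = dst;
    while ((*out++ = *src++) != '\0') {
        /* body intentionally empty: the copy happens in the condition */
    }
    return dst;
}

/* What a compiler can do when the source is a string literal: the length
 * (including the terminating 0) is known at compile time, so the copy can
 * become a fixed-size memcpy, which can then be expanded inline as a few
 * word-sized loads and stores. */
char *copy_known_length(char *dst)
{
    memcpy(dst, kLiteral, sizeof kLiteral);  /* sizeof counts the trailing 0 */
    return dst;
}
```

Both functions leave the same bytes in the destination buffer; the second just never has to test each byte for 0.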

I figured two can play that game, and added similar support to GCC. Now in my code, I carefully made sure the cutoff before calling the function was 32 bytes, so that it would generate inline code for the 31 byte literal. When I coded this up and ran it, it was much faster than GH. Since then I never really trust benchmarks, though I seem to spend a lot of time actually looking at benchmark results (SPEC 2017 being the current target).
 
Yes. Partly by selecting the optimization options in the menu. More freedom when editing boards.txt. Totally free when composing your own makefile.
 
I have now located the boards.txt file. I am totally lost with so many options. Where can I put compile options or the objdump options?
 