GCC 5, -flto

Frank B

Senior Member
Hi,

I played a bit with GCC 5 and -flto.

It works well and reduces flash usage.

Howto:

1) Edit the linker script: move the line "*(.startup*)" below the line "KEEP(*(.flashconfig*))".

Code:
/* Teensyduino Core Library
 * http://www.pjrc.com/teensy/
 * Copyright (c) 2013 PJRC.COM, LLC.
 *
 * Permission is hereby granted, free of charge, to any person obtaining
 * a copy of this software and associated documentation files (the
 * "Software"), to deal in the Software without restriction, including
 * without limitation the rights to use, copy, modify, merge, publish,
 * distribute, sublicense, and/or sell copies of the Software, and to
 * permit persons to whom the Software is furnished to do so, subject to
 * the following conditions:
 *
 * 1. The above copyright notice and this permission notice shall be 
 * included in all copies or substantial portions of the Software.
 *
 * 2. If the Software is incorporated into a build system that allows 
 * selection among a list of target devices, then similar target
 * devices manufactured by PJRC.COM must be included in the list of
 * target devices and selectable in the same manner.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */


MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 256K
    RAM  (rwx) : ORIGIN = 0x1FFF8000, LENGTH = 64K
}




SECTIONS
{
    .text : {
        . = 0;
        KEEP(*(.vectors))
        /* TODO: does linker detect startup overflow onto flashconfig? */
        . = 0x400;
        KEEP(*(.flashconfig*))
        *(.startup*)
        *(.text*)
        *(.rodata*)
        . = ALIGN(4);
        KEEP(*(.init))
        . = ALIGN(4);
        __preinit_array_start = .;
        KEEP (*(.preinit_array))
        __preinit_array_end = .;
        __init_array_start = .;
        KEEP (*(SORT(.init_array.*)))
        KEEP (*(.init_array))
        __init_array_end = .;
    } > FLASH = 0xFF


    .ARM.exidx : {
        __exidx_start = .;
        *(.ARM.exidx* .gnu.linkonce.armexidx.*)
        __exidx_end = .;
    } > FLASH
    _etext = .;


    .usbdescriptortable (NOLOAD) : {
        /* . = ORIGIN(RAM); */
        . = ALIGN(512);
        *(.usbdescriptortable*)
    } > RAM


    .dmabuffers (NOLOAD) : {
        . = ALIGN(4);
        *(.dmabuffers*)
    } > RAM


    .usbbuffers (NOLOAD) : {
        . = ALIGN(4);
        *(.usbbuffers*)
    } > RAM


    .data : AT (_etext) {
        . = ALIGN(4);
        _sdata = .; 
        *(.fastrun*)
        *(.data*)
        . = ALIGN(4);
        _edata = .; 
    } > RAM


    .noinit (NOLOAD) : {
        *(.noinit*)
    } > RAM


    .bss : {
        . = ALIGN(4);
        _sbss = .;
        __bss_start__ = .;
        *(.bss*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = .;
        __bss_end = .;
        __bss_end__ = .;
    } > RAM


    _estack = ORIGIN(RAM) + LENGTH(RAM);
}

This change alone costs about 500 bytes of additional flash, because the space in front of the flash configuration at 0x400, where the startup code used to sit, is now unused.
Maybe we can place a table or some other short code there.
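
To make the size of that gap concrete: the vector table starts at address 0 and the flash configuration is pinned at 0x400 by the ". = 0x400;" line, so everything between the end of the table and 0x400 becomes padding once .startup no longer fills it. Below is a small host-side sketch of the arithmetic. The function name and the entry count used in the note afterwards are assumptions for illustration, not values taken from this thread.

```c
#include <stdint.h>

/* Address where the linker script pins .flashconfig (". = 0x400;"). */
#define FLASHCONFIG_ADDR 0x400u

/* Bytes left unused between the end of the vector table and .flashconfig.
 * num_vectors is the number of 4-byte entries in the table; the caller
 * supplies it because the value depends on the chip. */
static uint32_t startup_gap_bytes(uint32_t num_vectors)
{
    uint32_t table_bytes = num_vectors * 4u;
    if (table_bytes >= FLASHCONFIG_ADDR)
        return 0u;              /* table already reaches 0x400: no gap */
    return FLASHCONFIG_ADDR - table_bytes;
}
```

With an assumed table of 111 entries, startup_gap_bytes(111) is 1024 - 444 = 580 bytes, which is in the same range as the roughly 500 bytes mentioned above. Check the real entry count against the core's vector table before relying on the number.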

[edit]
2) Comment out the following lines in avr_functions.h (these functions are now in the library, I think since 4.9 (?), and are not needed anymore):
Code:
/*
static inline char * utoa(unsigned int val, char *buf, int radix) __attribute__((always_inline, unused));
static inline char * utoa(unsigned int val, char *buf, int radix) { return ultoa(val, buf, radix); }
static inline char * itoa(int val, char *buf, int radix) __attribute__((always_inline, unused));
static inline char * itoa(int val, char *buf, int radix) { return ltoa(val, buf, radix); }
*/

3) Edit boards.txt.
- Choose the arm-none-eabi-gcc-ar tool instead of arm-none-eabi-ar
- Add the -flto flag

Code:
teensy31.build.command.ar=arm-none-eabi-gcc-ar
teensy31.build.flags.common=-g -Wall -ffunction-sections -fdata-sections -nostdlib -flto
teensy31.build.flags.ld=-flto -Wl,--gc-sections,--relax,--defsym=__rtc_localtime={extra.time.local} "-T{build.core.path}/mk20dx256.ld"

"Blink" is now reduced from 10.523 Bytes (o1) to 10.248 (plus we have still the ~500 unused bytes)
A larger sketch (about 70KB) was reduced by 3 KB


In addition, GCC has had some ARM Cortex-M4 improvements for better code generation since version 4.9.
(One of them is -mslow-flash-data; I'm not sure whether it improves anything for us, I haven't tested it enough.)

 
Cool! Normally I see people using LTO more for speed improvements than code size. In PowerPC land, I see 9 of the 20 SPEC 2006 CPU benchmarks improve in speed (22% in the best case), and 3 regress (one of them because LTO interferes with the math vector library support, which I still need to track down).
 
I did not test the speed; it may be faster, too.
Why is LTO faster for PowerPC?

LTO can be faster on multiple architectures (x86, ARM, PowerPC, MIPS, etc.) because the compiler can see the bodies of functions defined in another module. For instance, if you have the code:

Code:
// file outer.cc
#include <stddef.h>
#include <stdint.h>

extern uint16_t inner (int, uint16_t);

void outer (uint16_t *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p[i] = inner (5, p[i]);
}

// file inner.cc
#include <stdint.h>

uint16_t inner (int which, uint16_t value)
{
  if (which == 5)
    return value + 1;
  else
    return value - 1;
}

Without LTO, the compiler has to generate the call. With LTO, the compiler could inline inner and, after optimization, change the program to:
Code:
void outer (uint16_t *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p[i]++;
}

Even if inner can't be inlined, the compiler can see that the call always passes a constant, so it can clone the function with the argument replaced by that constant and use constant propagation to eliminate the test, something like:

Code:
// file outer.cc
extern uint16_t inner (int, uint16_t);

void outer (uint16_t *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p[i] = inner__5 (p[i]);
}

// file inner.cc
uint16_t inner (int which, uint16_t value)
{
  if (which == 5)
    return value + 1;
  else
    return value - 1;
}

uint16_t inner__5 (uint16_t value)
{
  const int which = 5;
  if (which == 5)
    return value + 1;
  else
    return value - 1;
}

Which after optimization would be:

Code:
uint16_t inner__5 (uint16_t value)
{
  return value + 1;
}
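
The cloning step can be checked on an ordinary host compiler. The sketch below writes the clone by hand and provides two versions of outer so the results can be compared. The names inner_5, outer_general and outer_specialised are made up for this illustration; GCC picks its own names for such clones. It only shows that the specialised version computes the same values, not that a given GCC release will actually perform the transformation.

```c
#include <stddef.h>
#include <stdint.h>

/* General version, as in inner.cc above. */
static uint16_t inner(int which, uint16_t value)
{
    if (which == 5)
        return (uint16_t)(value + 1);
    else
        return (uint16_t)(value - 1);
}

/* Hand-written stand-in for the clone specialised for which == 5. */
static uint16_t inner_5(uint16_t value)
{
    return (uint16_t)(value + 1);
}

/* Calls the general function with the constant argument. */
static void outer_general(uint16_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = inner(5, p[i]);
}

/* Calls the specialised clone instead. */
static void outer_specialised(uint16_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = inner_5(p[i]);
}
```

Running both versions over the same input, including the wrap-around case 65535, should leave identical arrays behind.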

The major slowdown that I mentioned is specific to PowerPC: we have an option that vectorizes calls to the math library when you enable auto-vectorization and fast-math, so if you have code of the form:

Code:
float a[1024], b[1024];

void foo (void)
{
  for (size_t i = 0; i < 1024; i++)
    a[i] = expf (b[i]);
}

and it would be transformed into:

Code:
float a[1024], b[1024];

void foo (void)
{
  vector float *p_a = (vector float *)&a[0];
  vector float *p_b = (vector float *)&b[0];

  for (size_t i = 0; i < 256; i++)
    p_a[i] = expw4 (p_b[i]);
}

In the particular benchmark that uses a lot of the math library functions (tonto), the LTO build isn't doing the conversion to vector form, so it calls the scalar math library four times as often as it called the vector math function. The vector library (MASS) is hand-tuned for PowerPC.
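
The factor of four follows directly from the loop trip counts (1024 scalar calls versus 256 vector calls). The sketch below counts the calls in both forms. scalar_fn and vector_fn are placeholders invented for this example, not the real expf or MASS routines, and the four-float struct merely stands in for a hardware vector type.

```c
#include <stddef.h>

#define NELEM 1024

typedef struct { float lane[4]; } vec4f;   /* stand-in for a 4-wide vector */

static unsigned long scalar_calls;  /* number of scalar_fn calls made */
static unsigned long vector_calls;  /* number of vector_fn calls made */

/* Placeholder for a scalar math routine such as expf. */
static float scalar_fn(float x)
{
    scalar_calls++;
    return x * 2.0f;
}

/* Placeholder for a 4-wide vector math routine: one call, four results. */
static vec4f vector_fn(vec4f x)
{
    vector_calls++;
    for (int k = 0; k < 4; k++)
        x.lane[k] = x.lane[k] * 2.0f;
    return x;
}

/* Scalar form: one call per element. */
static void foo_scalar(float *a, const float *b)
{
    for (size_t i = 0; i < NELEM; i++)
        a[i] = scalar_fn(b[i]);
}

/* Vector form: one call per group of four elements. */
static void foo_vector(float *a, const float *b)
{
    for (size_t i = 0; i < NELEM; i += 4) {
        vec4f v;
        for (int k = 0; k < 4; k++)
            v.lane[k] = b[i + k];
        v = vector_fn(v);
        for (int k = 0; k < 4; k++)
            a[i + k] = v.lane[k];
    }
}
```

Both loops produce the same results; only the number of library calls differs, which is why falling back to the scalar routine costs so much in a benchmark dominated by math calls.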

The x86 has a similar option that targets either the AMD or the Intel math libraries.

I just noticed it recently, but I haven't had time to investigate it further.
 
I guess the library (newlib) should be compiled with -flto, too? Is that correct?
Is it worth a try ?
 
Yes, and so should arm_cortexM4l_math.

Ok.

Two questions:
- I built it, but get the following error when compiling a sketch:
Code:
f:\build4def553c7c5cc6638db6fb6a30be3b17.tmp/core\core.a(mk20dx128.c.o): In function `ResetHandler':
C:\Arduino\hardware\teensy\avr\cores\teensy3/mk20dx128.c:903: undefined reference to `__libc_init_array'
f:\build4def553c7c5cc6638db6fb6a30be3b17.tmp/core\core.a(Print.cpp.o): In function `Print::printf(char const*, ...)':
C:\Arduino\hardware\teensy\avr\cores\teensy3/Print.cpp:91: undefined reference to `vdprintf'
collect2.exe: error: ld returned 1 exit status
exit status 1
Error compiling.

I guess I have to add a switch? Could you post your build script?


- Where can I find the source of arm_cortexM4l_math? (for M0 too)
 
- I built it, but get the following error when compiling a sketch:
Code:
C:\Arduino\hardware\teensy\avr\cores\teensy3/Print.cpp:91: undefined reference to `vdprintf'

I recently committed a fix for this.

- where can i find the source of arm_cortexM4l_math ? (for M0 too)

I've uploaded the source here.

https://github.com/PaulStoffregen/arm_math

At least, I believe this is the source. I've never actually compiled it. There isn't any build script, other than stuff which requires an expensive commercial toolchain.

I'm pretty sure it's ok to share this code on github, as long as the purpose is for developing Tools which will only be used to build code for chips licensed from ARM, which the Freescale/NXP Kinetis certainly is, and as long as we don't change the API and "logical functionality". At least that's how I read part 1.1 (iv).
 
I've added a makefile and a few headers that were missing.

It's able to compile for Cortex-M4. M0+ and M3 are broken, probably due to hacks I made to arm_math.h.

So far, I haven't tried actually using the generated library. But hopefully this makefile can be a starting point?
 
I compiled newlib with and without LTO, using the build script. With -flto added to the newlib compiler switches in that script, it does not work (see the error above, missing __libc_init_array). I guess we need a patch or additional configuration switches. I haven't tried your makefile so far.
 
Code:
--enable-initfini-array
Force the use of sections .init_array and .fini_array (instead of .init and .fini) for constructors and destructors. Option --disable-initfini-array has the opposite effect. If neither option is specified, the configure script will try to guess whether the .init_array and .fini_array sections are supported and, if they are, use them.

...does it make sense to try this setting ?
 
Unless you are going to rewrite how constructors and destructors work, I would think it would be best to use the default for your system.
 
OK, I asked because GCC complains about
Code:
undefined reference to `__libc_init_array'
when compiling sketches after rebuilding the toolchain with LTO.
 
It has been at least 15 years since I last worked on a newlib-based toolchain, so I can't help you. You might want to use the -v option on the ARM gcc to see exactly what options were used to configure it. I don't recall if there is any place to find what options were used to build the library.
 
No Problem.

libc.a has no symbol "libc_init_array"

As a quick test, I commented out __libc_init_array(); in mk20dx128.c.
Some simple sketches that don't use printf work now ("Blink", for example), so at a quick glance it seems to be a minor issue.

My "coremark" test does not work; I don't know why at the moment.
I googled a bit and found nothing relevant to this issue, so I'm a bit lost.

I found an old patch from 2004: https://sourceware.org/ml/newlib/2004/msg00579.html

It has an "#ifdef HAVE_INITFINI_ARRAY" inside, so I guess it is somehow not defined with LTO.
Unfortunately, I don't know whether this patch was approved or not.
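
For context on what the commented-out call does: in newlib, __libc_init_array walks the function-pointer tables that the linker script above brackets with __preinit_array_start/__preinit_array_end and __init_array_start/__init_array_end, and calls each entry. This is how C++ static constructors get run before main. The sketch below models only that walk on a host, using an ordinary array in place of the linker-provided symbols. run_init_array and the two sample initialisers are names made up for this example, and the real routine also calls _init between the two tables.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

static int init_order[4];   /* records which initialisers ran, in order */
static size_t init_count;

static void init_first(void)  { init_order[init_count++] = 1; }
static void init_second(void) { init_order[init_count++] = 2; }

/* Call every function pointer in [start, end), lowest address first.
 * On the target, start/end would be linker symbols such as
 * __init_array_start and __init_array_end. */
static void run_init_array(const init_fn *start, const init_fn *end)
{
    for (const init_fn *fn = start; fn != end; fn++)
        (*fn)();
}
```

If the call is skipped, none of the table entries run, so anything that relies on them is left uninitialised, even though simple sketches may appear to work.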


 
I compiled the arm_math source (https://github.com/PaulStoffregen/arm_math) with GCC 5.2 and -flto.
With the -flto switch added, sketches are indeed a bit faster (tried with the FFT example from the audio library).

I uploaded my build here: https://drive.google.com/open?id=0Bx2Jw84lqebkcnFtU1FVOGxnNlk
(M4, M4 hardfloat + M0)
 
I have a working GCC 5.2 version with a link-time-optimized (LTO) newlib.
It's currently uploading to my Google Drive (will take ~1 hour).

It includes versions for Linux (not tested), the modified source package, and a Windows zip and exe. Sorry, no Mac.
It's not entirely tested, but all sketches I have tried so far work.

I had to modify the build script:
- added to the compiler flags:
Code:
-flto -ffat-lto-objects -fuse-linker-plugin
- added to configure:
Code:
--enable-lto \
libc_cv_initfinit_array=yes \
AR_FOR_TARGET=arm-none-eabi-gcc-ar \
NM_FOR_TARGET=arm-none-eabi-gcc-nm \
RANLIB_FOR_TARGET=arm-none-eabi-gcc-ranlib

If you want to test it:
- extract the toolchain to a new subdirectory inside c:\arduino\hardware\tools (Windows)
- modify boards.txt and the linker script as described above (adjust the path in boards.txt!)
- copy the math lib (above) to it

The toolchain is M0 and M4 only, as building for just these targets saved a lot of time.
There are a few additional libs that could be built with LTO too, but I tried newlib first and will try the others in a later step.

The above change looks simple, but I had a hard time... especially that "libc_cv_initfinit_array=yes" was §$%&/

Edit:
We could reduce the buffers inside newlib, which would save RAM on the Teensy.
I think they are 1024 bytes by default, and Google says it works with 64 bytes, too.
I don't know what they are for... maybe reducing them for "nano" only would be a good idea?
 
OK, it does NOT work.

Now functions like SQRT or POW are not found.

I think I'll give up now.

"§$%&/

I have other projects that take time..
 
Michael, it's not only math; perhaps I should have mentioned that.
Funnily enough, things like printf() work now; even that problematic __libc_init_array() call in the Teensy startup code works now.

Here are some of the error messages from a random, slightly more complex sketch:

- <artificial>: (.text+0x2e8): undefined reference to `free'
- <artificial>: (.text+0x2fc): undefined reference to `strcpy'
- <artificial>: (.text+0x50a): undefined reference to `malloc'
- <artificial>: (.text+0x28e): undefined reference to `memcpy' <- that's interesting, because Teensyduino has its own memcpy

And, OK, the problem isn't a missing "pow()" but, more exactly, this:
(.....) /arm-none-eabi/lib/armv7e-m\libm.a(lib_a-w_pow.o): In function `pow':
w_pow.c: (.text.pow+0x26): undefined reference to `__fpclassifyd'

I have no idea where __fpclassifyd is.

In function `__errno':
errno.c: (.text.__errno+0x8): undefined reference to `_impure_ptr'

Doesn't look good.
 