Teensy 3.x - Large Sketch/Hex Files?

Status
Not open for further replies.

dozy

Member
Hey Everyone,

I have a quick question about the size of compiled sketches on the Teensy 3.x.
I'm wondering if anyone can enlighten me as to why the sketch sizes on the 3.x are so much larger than those of previous models (v <= 2)?

Here is a comparison of the binary sketch sizes for the stock Blink example.
Teensy++ 1.0 = ~2000 bytes
Teensy++ 2.0 = ~2000 bytes
Teensy 3.0/3.1 = ~12500 bytes
(...and without linker optimization, the size on v3.x can reach ~40000 bytes)

Any ideas?

Thanks for your time,
Dozy
 
There are several factors at work here.

The Teensy3 compiler settings don't build the core library into core.a first. All the files are just given to the linker, which means some unused code gets linked in. This is on my to-do list, but it's a very low priority.

A related matter is the vector table symbols. On AVR there's some special handling of interrupt vectors which allows the linker to not include interrupt code if the rest of the code from that same .c or .cpp file isn't actually used. Replicating this on Teensy3 is also on my to-do list, but also a low priority. The only harm is lots of serial and other code always gets compiled in, even if you never use it. I believe this special interrupt vector stuff works together with the core.a build process, but to be honest, I just haven't put a lot of work into this yet. It's only expected to save about 3-4K flash and maybe a couple hundred bytes of RAM for programs not using hardware serial. Other features have been in far more demand.
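For what it's worth, the usual trick for this on ARM cores is to point every unimplemented vector at a shared default handler using weak aliases, so an un-overridden interrupt pulls in nothing beyond that one fallback. A rough GCC/ELF sketch (the handler names and the counter are hypothetical, added so this runs on a host PC; a real core would spin or reset in the fallback):

```c
/* Count how often the fallback runs, purely so this sketch is
   observable on a host PC; a real core would just hang here. */
int unused_calls = 0;

/* Fallback handler shared by every interrupt the program
   doesn't implement. */
void unused_isr(void) { unused_calls++; }

/* Weak alias: the vector table entry resolves to unused_isr unless
   the sketch or core library provides its own uart0_status_isr.
   Combined with per-file linking (or -ffunction-sections plus
   --gc-sections), the unused serial interrupt code can then be
   dropped entirely. */
void uart0_status_isr(void) __attribute__((weak, alias("unused_isr")));
```

Overriding is then just defining a non-weak `uart0_status_isr()` somewhere; the linker prefers the strong definition automatically.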

Another factor is the USB stack. On AVR, the USB hardware is pretty simple and limited. Other than endpoint 0 control requests, the code is extremely simple and uses a register-based FIFO-style interface for the data endpoints. The hardware uses a fixed 2-packet buffer for each endpoint, accessed only through the FIFO register. On Teensy3, the USB hardware is far more capable, with DMA that efficiently copies data to/from buffers in memory. That gives you pretty amazing performance, but the USB stack has to manage a pool of memory-based buffers and perform many data management tasks that aren't needed when the hardware restricts you to only the simplest access method, without actual access to USB packets. The USB buffers are transferred between the pool, slots waiting for DMA, and queues that allow a high-latency Arduino sketch (eg, one that spends time in libraries that delay or do lots of computation) to still achieve excellent USB bandwidth utilization. All this performance does come at a cost of more code. You can get a good idea by reading through hardware/teensy/cores/teensy3/usb_dev.c.
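To give a flavor of what "managing a pool of buffers" means, here is a hypothetical sketch of a fixed-pool packet allocator (not the actual usb_dev.c code; the buffer count, sizes, and names are invented):

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_BUFS 8    /* hypothetical pool size */
#define BUF_SIZE 64   /* one full-speed bulk packet */

typedef struct { uint8_t data[BUF_SIZE]; } usb_packet_t;

static usb_packet_t pool[NUM_BUFS];
static uint32_t free_mask = (1u << NUM_BUFS) - 1;  /* bit set = buffer free */

/* Grab a free packet buffer, or NULL if the pool is exhausted
   (meaning the sketch must wait for DMA to complete and free one). */
usb_packet_t *usb_malloc(void) {
    for (int i = 0; i < NUM_BUFS; i++) {
        if (free_mask & (1u << i)) {
            free_mask &= ~(1u << i);
            return &pool[i];
        }
    }
    return NULL;
}

/* Return a buffer to the pool; the index is recovered from the
   pointer, so no per-buffer bookkeeping is stored in the packet. */
void usb_free(usb_packet_t *p) {
    free_mask |= 1u << (p - pool);
}
```

Buffers like these circulate between the free pool, the DMA descriptors, and per-endpoint queues; all of that bookkeeping is code that simply has no AVR counterpart.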

If you use anything that causes printf() to be built into your program, your code size will jump about 25K. The newlib library has a large printf implementation; on avr-libc, printf is much smaller. Like the compiler/linker settings, someday I'm going to do something about this. Part of the issue is that newlib's printf() code is somewhat bloated. But a large difference is floating point support. On AVR, printf() does not handle floats at all. On ARM, it handles both float and double, and it supports both decimal and scientific notation. The float+double printing code adds a lot of size, even if you only ever print strings and integers, because the linker can't know whether you have a "%f" in your format strings.
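The usual workaround is to avoid printf() entirely and format integers by hand, so none of newlib's machinery gets linked in. A minimal hypothetical helper of that sort:

```c
#include <stddef.h>

/* Convert an unsigned long to decimal ASCII in out[]; returns the
   digit count.  No format strings, no floats -- so nothing pulls
   newlib's large printf implementation into the link. */
size_t utoa_dec(unsigned long n, char *out) {
    char tmp[24];           /* enough for a 64-bit value */
    size_t i = 0, len;
    do {                    /* emit digits least-significant first */
        tmp[i++] = (char)('0' + (n % 10));
        n /= 10;
    } while (n);
    len = i;
    while (i)               /* reverse into the caller's buffer */
        *out++ = tmp[--i];
    *out = '\0';
    return len;
}
```

A routine like this plus a raw string-write covers most debug output at a tiny fraction of printf()'s footprint.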

Likewise, if you use the ARM math library for fast fourier transforms, about 90K of lookup tables are compiled in.

On top of all that, ARM code tends to compile somewhat larger than AVR, despite what certain PowerPoint slides from Arm, Ltd. might say. The actual executable code is pretty similar, and slightly smaller on ARM when manipulating 16 or 32 bit data, which takes 2 or 4 instructions on AVR. But AVR has special I/O instructions that avoid needing 16 bit addresses for many commonly used things. On ARM, everything is memory mapped using 32 bit pointers. The compiler initializes 32 bit pointers using tables of 32 bit numbers, usually located right after the function that needs them. It uses an indirect PC-relative addressing mode, so each pointer initialization costs 48 bits of flash, and then another 16 or 32 to actually use it. On AVR, a 32 bit opcode includes a 16 bit address AND the instruction that does the load/store. The AVR approach results in smaller, slower code. On ARM, usually the compiler optimizes all that bulky initialization stuff outside of loops. It's all a lot of subtle trade-offs in how each processor is designed, and on ARM they created a very powerful and flexible 32 bit system with great performance, but even with the Thumb instruction encoding, it just doesn't result in compact code compared to most 8 bit microcontrollers.
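The literal-pool pattern described above can be seen from a trivial register write. In this sketch, fake_gpio is a hypothetical stand-in for a memory-mapped peripheral register (a hard-coded address like 0x400FF0C4 wouldn't run on a PC), and the comments describe the code the compiler typically emits:

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO register.  On a
   real Kinetis part this would be a fixed peripheral address. */
volatile uint32_t fake_gpio;

void set_pin(void) {
    /* Thumb-2 typically compiles this as: a PC-relative LDR that
       fetches the 32-bit address from a literal pool placed after
       the function (literal + instruction, roughly 48 bits of
       flash), followed by the STR itself.  AVR instead encodes a
       16-bit address directly inside a single 32-bit sts opcode. */
    fake_gpio = 1;
}
```

Inside a loop the address load usually gets hoisted out, so the size penalty is mostly paid once per pointer, not once per access.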
 
My first work after 8-bitters was with ARM7s in both ARM mode and Thumb mode, and I was taken aback by the larger size of the resulting code. Thumb mode was a lot smaller but somewhat slower.
As the cost falls for non-MMU MCUs with larger flash memory (128K, 256K, 512K), I no longer have to be as frugal as I was on MCUs with less than 32KB.

I do have a good/simple printf() for types other than floats... public domain, if you wish. I really dislike the Arduino Serial.print() overloads.
 
I do have a good/simple printf() for types other than floats... public domain, if you wish.

Yes. :) :)

But of course, it may be many months until I even look at it. If you post it here, I'll copy the link onto my written to-do list, so I won't forget to look at it when I do work on code size optimization.
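For readers landing here later: the snippet was never actually posted in this thread. A minimal integer-only printf along the lines being discussed might look like this (a hypothetical sketch in the same public-domain spirit, not the poster's code):

```c
#include <stdarg.h>
#include <stddef.h>

/* Emit an unsigned value in the given base; returns digit count. */
static size_t put_unum(char *out, unsigned long v, unsigned base) {
    const char digits[] = "0123456789abcdef";
    char tmp[24];
    size_t i = 0, n;
    do { tmp[i++] = digits[v % base]; v /= base; } while (v);
    n = i;
    while (i) *out++ = tmp[--i];
    return n;
}

/* Tiny printf into a caller-provided buffer: %d %u %x %s %c %%.
   No field widths and no floats -- the float formatting is what
   makes newlib's printf so large. Returns the output length. */
size_t tiny_sprintf(char *out, const char *fmt, ...) {
    char *p = out;
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') { *p++ = *fmt; continue; }
        fmt++;
        if (!*fmt) break;           /* stray trailing '%' */
        switch (*fmt) {
        case 'd': {
            long v = va_arg(ap, int);
            if (v < 0) { *p++ = '-'; v = -v; }
            p += put_unum(p, (unsigned long)v, 10);
            break;
        }
        case 'u': p += put_unum(p, va_arg(ap, unsigned), 10); break;
        case 'x': p += put_unum(p, va_arg(ap, unsigned), 16); break;
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s) *p++ = *s++;
            break;
        }
        case 'c': *p++ = (char)va_arg(ap, int); break;
        case '%': *p++ = '%'; break;
        default:  *p++ = '%'; *p++ = *fmt; break; /* pass unknown through */
        }
    }
    va_end(ap);
    *p = '\0';
    return (size_t)(p - out);
}
```

Wrapping this around Serial's raw write gives typed formatted output without either newlib's printf or the Serial.print() overload chains.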
 