Program storage space in Teensy LC

Status
Not open for further replies.

Pedvide

Senior Member+
Hello,

I've noticed something that I think is strange about how much program memory a sketch uses in Teensy LC versus in 3.x.

I've done several tests; they simply involve reading the program storage space and dynamic memory values that the Arduino IDE shows at the end of a compilation.
I've compared the Blink example with the analogRead.ino example in my ADC library. My initial objective was to know how "efficient" my code is.

These are the results for all boards, using the default options you get when you select the board in the Arduino IDE (apart from switching between optimized and non-optimized):

| Teensy 3.0 | Progmem abs / B | RAM abs / B | Progmem rel | RAM rel |
|---|---|---|---|---|
| Blink | 9.508 | 2.172 | 7% | 13% |
| analogRead | 15.252 | 2.304 | 11% | 14% |
| Difference | 5.744 | 132 | | |

| Teensy 3.1 optimized (default) | Progmem abs / B | RAM abs / B | Progmem rel | RAM rel |
|---|---|---|---|---|
| Blink | 12.268 | 3.508 | 4% | 5% |
| analogRead | 21.512 | 4.624 | 8% | 7% |
| Difference | 9.244 | 1.116 | | |

| Teensy 3.1 non-optimized | Progmem abs / B | RAM abs / B | Progmem rel | RAM rel |
|---|---|---|---|---|
| Blink | 9.844 | 2.428 | 3% | 3% |
| analogRead | 16.020 | 2.564 | 6% | 3% |
| Difference | 6.176 | 136 | | |

| Teensy LC non-optimized (default) | Progmem abs / B | RAM abs / B | Progmem rel | RAM rel |
|---|---|---|---|---|
| Blink | 10.076 | 2.176 | 15% | 26% |
| analogRead | 21.196 | 2.304 | 33% | 28% |
| Difference | 11.120 | 128 | | |

| Teensy LC optimized | Progmem abs / B | RAM abs / B | Progmem rel | RAM rel |
|---|---|---|---|---|
| Blink | 12.188 | 3.252 | 19% | 39% |
| analogRead | 26.400 | 4.364 | 41% | 53% |
| Difference | 14.212 | 1.112 | | |

The RAM usage makes sense: the non-optimized versions differ by the same amount, ±2 words. The optimized versions use much more, but are again similar to each other.

It's the program memory that doesn't make much sense. Going from Blink to the (non-optimized) analogRead example costs about 5.7 kB on Teensy 3.0, 6.2 kB on 3.1 and about 11.1 kB on LC!! The optimized build costs 14.2 kB on LC, while on 3.1 it costs 9.2 kB!

Why does LC in general use so much program space?
 
This is all of the command line output from compiling my latest project with and without optimization:

48 MHz optimized
Code:
Sketch uses 44,888 bytes (70%) of program storage space. Maximum is 63,488 bytes.
Global variables use 6,508 bytes (79%) of dynamic memory, leaving 1,684 bytes for local variables. Maximum is 8,192 bytes.
Low memory available, stability problems may occur.



48 MHz
Code:
Sketch uses 38,720 bytes (60%) of program storage space. Maximum is 63,488 bytes.
Global variables use 4,324 bytes (52%) of dynamic memory, leaving 3,868 bytes for local variables. Maximum is 8,192 bytes.

I can't confirm anything; this is all I get from the Arduino IDE, but there is clearly a difference.
 
In Teensy LC the only differences I've been able to find in the compilation/linking procedure are:

Compilation: Not-optimized: -Os, Optimized: -O.
Linking: Not-optimized: --specs=nano.specs, Optimized: (no other difference).

Is it possible that the difference in code size is due to the difference between Cortex-M0+ and Cortex-M4 assembly code? That the M0+ lacks some instructions, so they have to be "emulated" using more assembler instructions?
 
Two things:

1: Both have quite a bit of code that's always linked, even if your sketch doesn't use it. I've known of this for a long time, but until Teensy-LC it hardly seemed worthwhile to spend time improving.

2: Small details matter. Please post the sketches you're using for testing.
 
Hello,

The sketches I used were the standard Blink that comes with Teensyduino and analogRead.ino from my ADC library:
Code:
/* Example for analogRead
*  You can change the number of averages, bits of resolution and also the comparison value or range.
*/


#include <ADC.h>

const int readPin = A9; // ADC0
const int readPin2 = A2; // ADC1

ADC *adc;

void setup() {

    pinMode(LED_BUILTIN, OUTPUT);
    pinMode(readPin, INPUT); //pin 23 single ended
    pinMode(readPin2, INPUT); //pin 16 single ended

    pinMode(LED_BUILTIN+1, OUTPUT);

    Serial.begin(9600);

    Serial.println("Begin setup");

    adc = new ADC(); // adc object

    ///// ADC0 ////
    // reference can be ADC_REF_3V3, ADC_REF_1V2 (not for Teensy LC) or ADC_REF_EXT.
    //adc->setReference(ADC_REF_1V2, ADC_0); // change all 3.3 to 1.2 if you change the reference to 1V2

    adc->setAveraging(4); // set number of averages
    adc->setResolution(12); // set bits of resolution

    // it can be ADC_VERY_LOW_SPEED, ADC_LOW_SPEED, ADC_MED_SPEED, ADC_HIGH_SPEED_16BITS, ADC_HIGH_SPEED or ADC_VERY_HIGH_SPEED
    // see the documentation for more information
    adc->setConversionSpeed(ADC_HIGH_SPEED); // change the conversion speed
    // it can be ADC_VERY_LOW_SPEED, ADC_LOW_SPEED, ADC_MED_SPEED, ADC_HIGH_SPEED or ADC_VERY_HIGH_SPEED
    adc->setSamplingSpeed(ADC_HIGH_SPEED); // change the sampling speed

    //adc->enableInterrupts(ADC_0);

    // always call the compare functions after changing the resolution!
    //adc->enableCompare(1.0/3.3*adc->getMaxValue(ADC_0), 0, ADC_0); // measurement will be ready if value < 1.0V
    //adc->enableCompareRange(1.0*adc->getMaxValue(ADC_0)/3.3, 2.0*adc->getMaxValue(ADC_0)/3.3, 0, 1, ADC_0); // ready if value lies out of [1.0,2.0] V

    ////// ADC1 /////
    #if defined(ADC_TEENSY_3_1)
    adc->setAveraging(32, ADC_1); // set number of averages
    adc->setResolution(16, ADC_1); // set bits of resolution
    adc->setConversionSpeed(ADC_VERY_LOW_SPEED, ADC_1); // change the conversion speed
    adc->setSamplingSpeed(ADC_VERY_LOW_SPEED, ADC_1); // change the sampling speed

    // always call the compare functions after changing the resolution!
    //adc->enableCompare(1.0/3.3*adc->getMaxValue(ADC_1), 0, ADC_1); // measurement will be ready if value < 1.0V
    //adc->enableCompareRange(1.0*adc->getMaxValue(ADC_1)/3.3, 2.0*adc->getMaxValue(ADC_1)/3.3, 0, 1, ADC_1); // ready if value lies out of [1.0,2.0] V
    #endif

    Serial.println("End setup");

}

int value;
int value2;
char c;

void loop() {

    value = adc->analogRead(readPin); // read a new value, will return ADC_ERROR_VALUE if the comparison is false.

    Serial.print("Pin: ");
    Serial.print(readPin);
    Serial.print(", value ADC0: ");
    Serial.println(value*3.3/adc->getMaxValue(ADC_0), DEC);

    #if defined(ADC_TEENSY_3_1)
    value2 = adc->analogRead(readPin2, ADC_1);

    Serial.print("Pin: ");
    Serial.print(readPin2);
    Serial.print(", value ADC1: ");
    Serial.println(value2*3.3/adc->getMaxValue(ADC_1), DEC);
    #endif

    /* fail_flag contains all possible errors,
        They are defined in  ADC_Module.h as

        ADC_ERROR_OTHER
        ADC_ERROR_CALIB
        ADC_ERROR_WRONG_PIN
        ADC_ERROR_ANALOG_READ
        ADC_ERROR_COMPARISON
        ADC_ERROR_ANALOG_DIFF_READ
        ADC_ERROR_CONT
        ADC_ERROR_CONT_DIFF
        ADC_ERROR_WRONG_ADC
        ADC_ERROR_SYNCH

        You can compare the value of the flag with those masks to know what's the error.
    */

    if(adc->adc0->fail_flag) {
        Serial.print("ADC0 error flags: 0x");
        Serial.println(adc->adc0->fail_flag, HEX);
        if(adc->adc0->fail_flag & ADC_ERROR_COMPARISON) { // test the mask, other error bits may be set too
            adc->adc0->fail_flag &= ~ADC_ERROR_COMPARISON; // clear that error
            Serial.println("Comparison error in ADC0");
        }
    }
    #if defined(ADC_TEENSY_3_1)
    if(adc->adc1->fail_flag) {
        Serial.print("ADC1 error flags: 0x");
        Serial.println(adc->adc1->fail_flag, HEX);
        if(adc->adc1->fail_flag & ADC_ERROR_COMPARISON) { // test the mask, other error bits may be set too
            adc->adc1->fail_flag &= ~ADC_ERROR_COMPARISON; // clear that error
            Serial.println("Comparison error in ADC1");
        }
    }
    #endif

    digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));

    delay(500);
}

// If you enable interrupts make sure to call readSingle() to clear the interrupt.
void adc0_isr() {
        adc->adc0->readSingle();
}

One thing I've realized is that Blink doesn't use Serial at all, while my example does.
The second thing is that Cortex-M0+ and Cortex-M4 differ quite a bit in some respects. As far as I understand, M0+ uses a smaller instruction set, basically most of Thumb and some Thumb-2, while M4 supports more instructions. It's logical to assume that, in order to do the same thing, M0+ may need more assembler instructions than M4.
I've disassembled (with objdump) both the Blink and analogRead examples and I'm (very) slowly going through them seeing if I can notice some difference.
As an example I show here a very simple function in the ADC library (checkPin):
Code:
ADC library code:

// check whether the pin is a valid analog pin
bool ADC_Module::checkPin(uint8_t pin) {

    if(pin>ADC_MAX_PIN) {
        return false;   // all others are invalid
    }

    // translate pin number to SC1A number, that also contains MUX a or b info.
    uint8_t sc1a_pin = channel2sc1a[pin];

    if( (sc1a_pin&ADC_SC1A_CHANNELS) == ADC_SC1A_PIN_INVALID ) { // note: ADC_SC1A_CHANNELS=ADC_SC1A_PIN_INVALID=0x1F
        return false;   // all others are invalid
    }

    return true;
}

LC: 13 instructions

d58:	1c03      	adds	r3, r0, #0
d5a:	2000      	movs	r0, #0
d5c:	292c      	cmp	r1, #44	; 0x2c
d5e:	d807      	bhi.n	d70 <_ZN10ADC_Module8checkPinEh+0x18>
d60:	6a9b      	ldr	r3, [r3, #40]	; 0x28
d62:	5c58      	ldrb	r0, [r3, r1]
d64:	231f      	movs	r3, #31
d66:	4018      	ands	r0, r3
d68:	381f      	subs	r0, #31
d6a:	1e43      	subs	r3, r0, #1
d6c:	4198      	sbcs	r0, r3
d6e:	b2c0      	uxtb	r0, r0
d70:	4770      	bx	lr


T 3.0: 11 instructions

c58:	292c      	cmp	r1, #44	; 0x2c
c5a:	d807      	bhi.n	c6c <_ZN10ADC_Module8checkPinEh+0x14>
c5c:	6a83      	ldr	r3, [r0, #40]	; 0x28
c5e:	5c58      	ldrb	r0, [r3, r1]
c60:	f000 001f 	and.w	r0, r0, #31
c64:	381f      	subs	r0, #31
c66:	bf18      	it	ne
c68:	2001      	movne	r0, #1
c6a:	4770      	bx	lr
c6c:	2000      	movs	r0, #0
c6e:	4770      	bx	lr

The same C++ code produces somewhat different asm for Teensy LC and 3.0. The size difference in this example is not great though: 26 bytes on LC versus 24 bytes on 3.0.
 
I've finally isolated the problem (or at least one problem).
Compare the program memory that these very simple two sketches use (Teensyduino 1.23, Arduino 1.6.1):

EXAMPLE_INT:
Code:
void setup() {
    //pinMode(LED_BUILTIN, OUTPUT);

    Serial.begin(9600);

}

int test;

void loop() {

    Serial.print(3*test);

    //digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));

    delay(500);
}

EXAMPLE_FLOAT:
Code:
void setup() {
    //pinMode(LED_BUILTIN, OUTPUT);

    Serial.begin(9600);

}

int test;

void loop() {

    Serial.print(3.3*test);

    //digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));

    delay(500);
}


| | T 3.0 | T 3.1 (no optimizations) | T LC |
|---|---|---|---|
| EXAMPLE_INT | 9.628 B | 9.964 B | 10.212 B |
| EXAMPLE_FLOAT | 12.172 B | 12.516 B | 16.292 B |
| Diff | 2.544 B | 2.552 B | 6.080 B |

Teensy LC uses much more space when it has to operate with floats, both in absolute numbers and relative to its smaller memory.
I'll try to analyze the assembler output to see which functions increase in size so much, but I think this is way out of my league.
 
No, it doesn't, but neither does Teensy 3.0 or 3.1.
The only K20 chips with an FPU are the K20-120 line:
big chips with 1024 kB flash and 128 kB RAM in 144-pin packages.
Lots of possibilities, but they won't fit the Teensy form factor.
 
As far as I know none of them has!

In the disassembled code I'm seeing many functions that deal with floats (or doubles) like __aeabi_dmul, for T3.0 it uses 594 lines of asm code, for LC 1308.
I've opened an issue on GitHub so Paul can have a look.
 
I think this is because of the different architectures:

The LC is a Cortex-M0+:
- It is a different architecture: von Neumann, ARMv6-M
- It is "Thumb" and has only some Thumb-2 instructions
- There is no hardware instruction for division, which means more code for divisions

The T3 is a Cortex-M4: Harvard, ARMv7E-M
- Complete set of Thumb and Thumb-2 instructions
- Division in hardware
- DSP (don't know if this plays a role here)


(Please correct me if I'm wrong)
 
Frank B, I think you are right. And I guess the lack of hardware division makes any float algorithm much slower and longer.
I also guess that Paul can't do anything about it, but it's important to remind people not to use floats unless it's really necessary.
 
There are some things I can do about the program size. Obviously I can't make Cortex-M0+ have hardware divide and other great features of M4.
 
Worth looking at cost and benefits of ST's M41x line. Hardware floating point, lots of peripherals, 168MHz, etc. NXP/Freescale is a known entity to PJRC but...
 
LC has no hardware floating point, right?
Steve, you asked this (several times) about Teensy 3.0 and 3.1 and were consistently told that the MCU in Teensy 3.x does not have a hardware FPU.
Given that LC costs less, what are the chances that it has an FPU?
 
I beg your forgiveness. I forgot that the T3 has no FPU. I work daily with several different ARM MCUs, and some M4's have an FPU even in the low cost, low pin count MCUs.

An FPU saves code space. Moreover, there is good demand for hardware floating point in mid-range MCUs these days, or else the ARM vendors wouldn't offer it! As I recall, the ARM standards define the 32-bit floating point hardware and it's about the same on all Cortex MCUs.
 