
Thread: Program storage space in Teensy LC

  1. #1

    Hello,

    I've noticed something that I think is strange about how much program memory a sketch uses in Teensy LC versus in 3.x.

    I've run several tests. They simply involve reading the program storage space and dynamic memory values that the Arduino IDE shows at the end of a compilation.
    I've compared the Blink example with the analogRead.ino example in my ADC library. My initial objective was to find out how "efficient" my code is.

    These are the results for all boards, using the default options for each board in the Arduino IDE (except for switching between optimized and non-optimized). Sizes are in bytes; percentages are relative to each board's capacity:

    Teensy 3.0                         Flash (B)  RAM (B)  Flash (%)  RAM (%)
    Blink                                  9,508    2,172         7%      13%
    analogRead                            15,252    2,304        11%      14%
    Difference                             5,744      132

    Teensy 3.1 optimized (default)     Flash (B)  RAM (B)  Flash (%)  RAM (%)
    Blink                                 12,268    3,508         4%       5%
    analogRead                            21,512    4,624         8%       7%
    Difference                             9,244    1,116

    Teensy 3.1 non-optimized           Flash (B)  RAM (B)  Flash (%)  RAM (%)
    Blink                                  9,844    2,428         3%       3%
    analogRead                            16,020    2,564         6%       3%
    Difference                             6,176      136

    Teensy LC non-optimized (default)  Flash (B)  RAM (B)  Flash (%)  RAM (%)
    Blink                                 10,076    2,176        15%      26%
    analogRead                            21,196    2,304        33%      28%
    Difference                            11,120      128

    Teensy LC optimized                Flash (B)  RAM (B)  Flash (%)  RAM (%)
    Blink                                 12,188    3,252        19%      39%
    analogRead                            26,400    4,364        41%      53%
    Difference                            14,212    1,112

    The RAM usage makes sense: going from Blink to analogRead, the non-optimized versions add the same amount ±2 words. The optimized versions add much more, but again similar to each other.

    It's the program memory that doesn't make much sense. Going from Blink to the (non-optimized) analogRead example adds about 5.7 kB on Teensy 3.0, 6.2 kB on 3.1 and about 11 kB on LC!! The optimized version adds 14 kB on LC but only about 9 kB on 3.1!

    Why does LC in general use so much program space?
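
    The percentages in the table appear to be the byte counts divided by the board's capacity and truncated to a whole percent. A minimal sketch of that arithmetic follows, assuming the Teensy LC limits that the Arduino IDE prints in its build summary (63,488 bytes of program storage and 8,192 bytes of dynamic memory). `percent_used` is a hypothetical helper name, and the truncation rule is inferred from the numbers rather than taken from the IDE source.

```cpp
#include <cstdint>

// Hypothetical helper: whole-percent usage, truncated rather than rounded
// (an assumption inferred from the values in the table).
constexpr std::uint32_t percent_used(std::uint32_t used_bytes,
                                     std::uint32_t capacity_bytes) {
    return capacity_bytes == 0 ? 0 : (used_bytes * 100u) / capacity_bytes;
}

// Teensy LC limits as printed in the Arduino IDE build summary.
constexpr std::uint32_t kLcFlashBytes = 63488;
constexpr std::uint32_t kLcRamBytes   = 8192;

// Extra program space taken by analogRead relative to Blink on Teensy LC.
constexpr std::uint32_t kLcNonOptimizedDiff = 21196 - 10076;  // 11120 bytes
constexpr std::uint32_t kLcOptimizedDiff    = 26400 - 12188;  // 14212 bytes

static_assert(percent_used(10076, kLcFlashBytes) == 15, "Blink, LC non-optimized");
static_assert(percent_used(26400, kLcFlashBytes) == 41, "analogRead, LC optimized");
static_assert(percent_used(4364, kLcRamBytes) == 53, "analogRead RAM, LC optimized");
```

    The same helper reproduces the other LC rows, which suggests the table values are internally consistent.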

  2. #2
    I thought my LC sketches were unusually large, glad there's some evidence on it.

  3. #3
    Quote, originally posted by ohnoitsaninja:
        I thought my LC sketches were unusually large, glad there's some evidence on it.
    Did you confirm that the IDE compiler optimization level does indeed appear in the compiler command line?

  4. #4
    This is all of the command line output of compiling my latest project with and without optimization

    48 MHz optimized
    Code:
    Sketch uses 44,888 bytes (70%) of program storage space. Maximum is 63,488 bytes.
    Global variables use 6,508 bytes (79%) of dynamic memory, leaving 1,684 bytes for local variables. Maximum is 8,192 bytes.
    Low memory available, stability problems may occur.


    48 MHz
    Code:
    Sketch uses 38,720 bytes (60%) of program storage space. Maximum is 63,488 bytes.
    Global variables use 4,324 bytes (52%) of dynamic memory, leaving 3,868 bytes for local variables. Maximum is 8,192 bytes.
    I can't confirm anything; this is all I get from the Arduino IDE, but there is clearly a difference.

  5. #5
    In Teensy LC, the only differences I've been able to find in the compilation/linking procedure are:

    Compilation: non-optimized uses -Os, optimized uses -O.
    Linking: non-optimized adds --specs=nano.specs, optimized does not (no other difference).

    Is it possible that the difference in code size is due to the difference between the Cortex-M0+ and Cortex-M4 instruction sets? That the M0+ lacks some instructions, so they have to be "emulated" using more assembler instructions?

  6. #6
    PaulStoffregen
    Two things:

    1: Both have quite a bit of code that's always linked, even if your sketch doesn't use it. I've known of this for a long time, but until Teensy-LC it hardly seemed worthwhile to spend time improving.

    2: Small details matter. Please post the sketches you're using for testing.

  7. #7
    Hello,

    The sketches I used were the standard Blink that comes with Teensyduino and analogRead.ino from my ADC library:
    Code:
    /* Example for analogRead
    *  You can change the number of averages, bits of resolution and also the comparison value or range.
    */
    
    
    #include <ADC.h>
    
    const int readPin = A9; // ADC0
    const int readPin2 = A2; // ADC1
    
    ADC *adc;
    
    void setup() {
    
        pinMode(LED_BUILTIN, OUTPUT);
        pinMode(readPin, INPUT); //pin 23 single ended
        pinMode(readPin2, INPUT); // A2 (pin 16), single ended
    
        pinMode(LED_BUILTIN+1, OUTPUT);
    
        Serial.begin(9600);
    
        Serial.println("Begin setup");
    
        adc = new ADC(); // adc object
    
        ///// ADC0 ////
        // reference can be ADC_REF_3V3, ADC_REF_1V2 (not for Teensy LC) or ADC_REF_EXT.
        //adc->setReference(ADC_REF_1V2, ADC_0); // change all 3.3 to 1.2 if you change the reference to 1V2
    
        adc->setAveraging(4); // set number of averages
        adc->setResolution(12); // set bits of resolution
    
        // it can be ADC_VERY_LOW_SPEED, ADC_LOW_SPEED, ADC_MED_SPEED, ADC_HIGH_SPEED_16BITS, ADC_HIGH_SPEED or ADC_VERY_HIGH_SPEED
        // see the documentation for more information
        adc->setConversionSpeed(ADC_HIGH_SPEED); // change the conversion speed
        // it can be ADC_VERY_LOW_SPEED, ADC_LOW_SPEED, ADC_MED_SPEED, ADC_HIGH_SPEED or ADC_VERY_HIGH_SPEED
        adc->setSamplingSpeed(ADC_HIGH_SPEED); // change the sampling speed
    
        //adc->enableInterrupts(ADC_0);
    
        // always call the compare functions after changing the resolution!
        //adc->enableCompare(1.0/3.3*adc->getMaxValue(ADC_0), 0, ADC_0); // measurement will be ready if value < 1.0V
        //adc->enableCompareRange(1.0*adc->getMaxValue(ADC_0)/3.3, 2.0*adc->getMaxValue(ADC_0)/3.3, 0, 1, ADC_0); // ready if value lies out of [1.0,2.0] V
    
        ////// ADC1 /////
        #if defined(ADC_TEENSY_3_1)
        adc->setAveraging(32, ADC_1); // set number of averages
        adc->setResolution(16, ADC_1); // set bits of resolution
        adc->setConversionSpeed(ADC_VERY_LOW_SPEED, ADC_1); // change the conversion speed
        adc->setSamplingSpeed(ADC_VERY_LOW_SPEED, ADC_1); // change the sampling speed
    
        // always call the compare functions after changing the resolution!
        //adc->enableCompare(1.0/3.3*adc->getMaxValue(ADC_1), 0, ADC_1); // measurement will be ready if value < 1.0V
        //adc->enableCompareRange(1.0*adc->getMaxValue(ADC_1)/3.3, 2.0*adc->getMaxValue(ADC_1)/3.3, 0, 1, ADC_1); // ready if value lies out of [1.0,2.0] V
        #endif
    
        Serial.println("End setup");
    
    }
    
    int value;
    int value2;
    char c;
    
    void loop() {
    
        value = adc->analogRead(readPin); // read a new value, will return ADC_ERROR_VALUE if the comparison is false.
    
        Serial.print("Pin: ");
        Serial.print(readPin);
        Serial.print(", value ADC0: ");
        Serial.println(value*3.3/adc->getMaxValue(ADC_0), DEC);
    
        #if defined(ADC_TEENSY_3_1)
        value2 = adc->analogRead(readPin2, ADC_1);
    
        Serial.print("Pin: ");
        Serial.print(readPin2);
        Serial.print(", value ADC1: ");
        Serial.println(value2*3.3/adc->getMaxValue(ADC_1), DEC);
        #endif
    
        /* fail_flag contains all possible errors,
            They are defined in  ADC_Module.h as
    
            ADC_ERROR_OTHER
            ADC_ERROR_CALIB
            ADC_ERROR_WRONG_PIN
            ADC_ERROR_ANALOG_READ
            ADC_ERROR_COMPARISON
            ADC_ERROR_ANALOG_DIFF_READ
            ADC_ERROR_CONT
            ADC_ERROR_CONT_DIFF
            ADC_ERROR_WRONG_ADC
            ADC_ERROR_SYNCH
    
            You can compare the value of the flag with those masks to know what's the error.
        */
    
        if(adc->adc0->fail_flag) {
            Serial.print("ADC0 error flags: 0x");
            Serial.println(adc->adc0->fail_flag, HEX);
            if(adc->adc0->fail_flag == ADC_ERROR_COMPARISON) {
                adc->adc0->fail_flag &= ~ADC_ERROR_COMPARISON; // clear that error
                Serial.println("Comparison error in ADC0");
            }
        }
        #if defined(ADC_TEENSY_3_1)
        if(adc->adc1->fail_flag) {
            Serial.print("ADC1 error flags: 0x");
            Serial.println(adc->adc1->fail_flag, HEX);
            if(adc->adc1->fail_flag == ADC_ERROR_COMPARISON) {
                adc->adc1->fail_flag &= ~ADC_ERROR_COMPARISON; // clear that error
                Serial.println("Comparison error in ADC1");
            }
        }
        #endif
    
        digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
    
        delay(500);
    }
    
    // If you enable interrupts make sure to call readSingle() to clear the interrupt.
    void adc0_isr() {
            adc->adc0->readSingle();
    }
    One thing I've realized is that Blink doesn't use Serial at all, while my example does.
    The second thing is that Cortex-M0+ and M4 are quite different in some respects. As far as I understand, M0+ uses a smaller instruction set, basically most of Thumb and some Thumb-2, while M4 supports more instructions. It's logical to assume that M0+ may need more assembler instructions than M4 to do the same thing.
    I've disassembled (with objdump) both the Blink and analogRead examples and I'm (very) slowly going through them to see if I can spot any differences.
    As an example I show here a very simple function in the ADC library (checkPin):
    Code:
    ADC library code:
    
    // check whether the pin is a valid analog pin
    bool ADC_Module::checkPin(uint8_t pin) {
    
        if(pin>ADC_MAX_PIN) {
            return false;   // all others are invalid
        }
    
        // translate pin number to SC1A number, that also contains MUX a or b info.
        uint8_t sc1a_pin = channel2sc1a[pin];
    
        if( (sc1a_pin&ADC_SC1A_CHANNELS) == ADC_SC1A_PIN_INVALID ) { // note: ADC_SC1A_CHANNELS=ADC_SC1A_PIN_INVALID=0x1F
            return false;   // all others are invalid
        }
    
        return true;
    }
    
    LC: 13 instructions
    
    d58:	1c03      	adds	r3, r0, #0
    d5a:	2000      	movs	r0, #0
    d5c:	292c      	cmp	r1, #44	; 0x2c
    d5e:	d807      	bhi.n	d70 <_ZN10ADC_Module8checkPinEh+0x18>
    d60:	6a9b      	ldr	r3, [r3, #40]	; 0x28
    d62:	5c58      	ldrb	r0, [r3, r1]
    d64:	231f      	movs	r3, #31
    d66:	4018      	ands	r0, r3
    d68:	381f      	subs	r0, #31
    d6a:	1e43      	subs	r3, r0, #1
    d6c:	4198      	sbcs	r0, r3
    d6e:	b2c0      	uxtb	r0, r0
    d70:	4770      	bx	lr
    
    
    T 3.0: 11 instructions
    
    c58:	292c      	cmp	r1, #44	; 0x2c
    c5a:	d807      	bhi.n	c6c <_ZN10ADC_Module8checkPinEh+0x14>
    c5c:	6a83      	ldr	r3, [r0, #40]	; 0x28
    c5e:	5c58      	ldrb	r0, [r3, r1]
    c60:	f000 001f 	and.w	r0, r0, #31
    c64:	381f      	subs	r0, #31
    c66:	bf18      	it	ne
    c68:	2001      	movne	r0, #1
    c6a:	4770      	bx	lr
    c6c:	2000      	movs	r0, #0
    c6e:	4770      	bx	lr
    The same C++ code produces somewhat different assembly for Teensy LC and 3.0. The size difference in this example is not great, though.
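
    Counting bytes rather than instructions narrows the gap further, because the Teensy 3.0 listing contains one 32-bit instruction (`and.w`) while every instruction in the LC listing is 16 bits wide. A small sketch of that arithmetic, using the addresses from the two listings above; `function_bytes` is a hypothetical helper introduced only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: bytes occupied by a function, given the address of its
// first instruction, the address of its last instruction, and the size of
// that last instruction's encoding.
constexpr std::uint32_t function_bytes(std::uint32_t first_addr,
                                       std::uint32_t last_addr,
                                       std::uint32_t last_insn_bytes) {
    return last_addr - first_addr + last_insn_bytes;
}

// Addresses copied from the two listings; both end in a 16-bit `bx lr`.
constexpr std::uint32_t kLcBytes  = function_bytes(0xd58, 0xd70, 2);
constexpr std::uint32_t kT30Bytes = function_bytes(0xc58, 0xc6e, 2);

static_assert(kLcBytes == 26, "13 instructions, all 16-bit");
static_assert(kT30Bytes == 24, "ten 16-bit instructions plus one 32-bit and.w");
```

    For checkPin the difference is therefore 2 bytes, so this function alone does not explain the larger LC builds.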

  8. #8
    I've finally isolated the problem (or at least one problem).
    Compare the program memory that these two very simple sketches use (Teensyduino 1.23, Arduino 1.6.1):

    EXAMPLE_INT:
    Code:
    void setup() {
        //pinMode(LED_BUILTIN, OUTPUT);
    
        Serial.begin(9600);
    
    }
    
    int test;
    
    void loop() {
    
        Serial.print(3*test);
    
        //digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
    
        delay(500);
    }
    EXAMPLE_FLOAT:
    Code:
    void setup() {
        //pinMode(LED_BUILTIN, OUTPUT);
    
        Serial.begin(9600);
    
    }
    
    int test;
    
    void loop() {
    
        Serial.print(3.3*test);
    
        //digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
    
        delay(500);
    }

                      T 3.0      T 3.1 (no optimizations)   T LC
    EXAMPLE_INT:       9,628 B    9,964 B                   10,212 B
    EXAMPLE_FLOAT:    12,172 B   12,516 B                   16,292 B
    Diff:              2,544 B    2,552 B                    6,080 B

    Teensy LC uses much more space when it has to operate with floats, both in absolute numbers and relative to its smaller memory.
    I'll try to analyze the assembler output to see which functions grow so much, but I think this is way out of my league.
    Last edited by Pedvide; 05-27-2015 at 07:38 PM.

  9. #9
    LC has no hardware floating point, right?

  10. #10
    No, it doesn't, but neither does Teensy 3.0 or 3.1.
    The only K20 chips with an FPU are the K20-120 line,
    big chips with 1024 kB flash and 128 kB RAM in 144-pin packages.
    Lots of possibilities, but they won't fit the Teensy form factor.

  11. #11
    As far as I know none of them has!

    In the disassembled code I'm seeing many functions that deal with floats (or doubles), like __aeabi_dmul; on T3.0 it uses 594 lines of asm code, on LC 1308.
    I've opened an issue at github so Paul can have a look.
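
    `__aeabi_dmul` is the ARM EABI run-time helper for double-precision multiplication. In standard C++ the literal `3.3` has type `double`, so `3.3*test` converts `test` to `double` and multiplies in double precision, whereas a `3.3f` literal keeps the product in single precision. The sketch below only checks those types on a host compiler. Whether a single-precision version would be noticeably smaller on Teensy LC is an assumption that would need measuring, and build flags (GCC's `-fsingle-precision-constant`, for example) can change how literals are treated.

```cpp
#include <type_traits>

// The same expressions as in EXAMPLE_INT and EXAMPLE_FLOAT, plus a
// single-precision variant that is not part of the original test.
constexpr int    scale_int(int test)    { return 3 * test; }
constexpr double scale_double(int test) { return 3.3 * test; }
constexpr float  scale_float(int test)  { return 3.3f * test; }

static_assert(std::is_same<decltype(3 * 1), int>::value,
              "integer multiply stays int");
static_assert(std::is_same<decltype(3.3 * 1), double>::value,
              "3.3 is a double literal, so the product is double");
static_assert(std::is_same<decltype(3.3f * 1), float>::value,
              "3.3f keeps the product in float");
```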

  12. #12
    Frank B
    I think, this is because of the different architectures:

    The LC is a Cortex-M0+:
    - It is a different architecture: Von-Neumann, ARMv6-M
    - It is "Thumb" and has only some Thumb2-instructions
    - There is no instruction for division in hardware, which means more code for divisions

    The T3 is a Cortex-M4: Harvard, ARMv7E-M
    - Complete set of Thumb & Thumb2 instructions
    - Division in Hardware
    - DSP (don't know if this plays a role here)


    (Please correct me if I'm wrong)
    Last edited by Frank B; 05-27-2015 at 09:42 PM.
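
    To illustrate the "more code for divisions" point: without a hardware divide instruction, the compiler has to call a library routine (`__aeabi_uidiv` in the ARM EABI) for an unsigned 32-bit division. The following is a simplified shift-and-subtract sketch of what such a routine has to do, not the actual library implementation. `soft_udiv` and `DivResult` are hypothetical names, returning zeros on division by zero is an arbitrary choice made for this example, and the `constexpr` loop needs C++14 or later.

```cpp
#include <cstdint>

struct DivResult {
    std::uint32_t quotient;
    std::uint32_t remainder;
};

// Bit-by-bit long division: bring down one numerator bit per step and
// subtract the denominator whenever the running remainder is large enough.
constexpr DivResult soft_udiv(std::uint32_t numerator, std::uint32_t denominator) {
    DivResult result{0, 0};
    if (denominator == 0) {
        return result;  // arbitrary convention for this sketch
    }
    std::uint64_t remainder = 0;  // 64-bit so the shift below cannot overflow
    for (int bit = 31; bit >= 0; --bit) {
        remainder = (remainder << 1) | ((numerator >> bit) & 1u);
        if (remainder >= denominator) {
            remainder -= denominator;
            result.quotient |= (std::uint32_t{1} << bit);
        }
    }
    result.remainder = static_cast<std::uint32_t>(remainder);
    return result;
}

static_assert(soft_udiv(100, 7).quotient == 14, "100 / 7");
static_assert(soft_udiv(100, 7).remainder == 2, "100 % 7");
```

    On a Cortex-M4 the same operation is a single `udiv` instruction, which is one reason identical source code can compile to fewer bytes there.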

  13. #13
    Frank B, I think you are right. And I guess the lack of hardware division makes any float algorithms much slower and longer.
    I also guess that Paul can't do anything about it, but it's important to remind people not to use floats unless it's really necessary.
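
    One way to follow that advice for ADC readings is to keep the conversion in integer arithmetic, for example by reporting millivolts instead of volts. A minimal sketch, assuming a 3.3 V reference as in the analogRead example and a non-negative raw reading; `raw_to_millivolts` is a hypothetical helper, not part of the ADC library.

```cpp
#include <cstdint>

// Hypothetical helper: convert a raw ADC count to millivolts using only
// integer arithmetic, rounding to the nearest millivolt.
// vref_mv defaults to 3300 to match the 3.3 V reference used in the thread.
constexpr std::uint32_t raw_to_millivolts(std::uint32_t raw,
                                          std::uint32_t max_value,
                                          std::uint32_t vref_mv = 3300) {
    // raw * vref_mv fits in 32 bits for readings up to 16 bits wide.
    return max_value == 0 ? 0 : (raw * vref_mv + max_value / 2) / max_value;
}

static_assert(raw_to_millivolts(4095, 4095) == 3300, "full scale at 12 bits");
static_assert(raw_to_millivolts(0, 4095) == 0, "zero reading");
```

    In a sketch this could stand in for the `value*3.3/adc->getMaxValue(ADC_0)` expression, for instance `Serial.print(raw_to_millivolts(value, adc->getMaxValue(ADC_0)));`. Integer division on Cortex-M0+ still goes through a software routine, so any saving in program space would have to be measured.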

  14. #14
    PaulStoffregen
    There are some things I can do about the program size. Obviously I can't make Cortex-M0+ have hardware divide and other great features of M4.

  15. #15
    Worth looking at cost and benefits of ST's M41x line. Hardware floating point, lots of peripherals, 168MHz, etc. NXP/Freescale is a known entity to PJRC but...

  16. #16
    Frank B
    That would mean rewriting tons of source code.

  17. #17
    Or lessen reinvention by using the MCU vendor's libraries, with Arduino API wrappers on top of them.

  18. #18
    Quote, originally posted by stevech:
        LC has no hardware floating point, right?
    Steve, you asked this (several times) about Teensy 3.0 and 3.1 and were consistently told that the MCU in Teensy 3.x does not have a hardware FPU.
    Given that LC costs less, what are the chances that it has an FPU?

  19. #19
    Frank B
    I can not imagine many applications where an FPU is really necessary.

  20. #20
    Quote, originally posted by Nantonos:
        Steve, you asked this (several times) about Teensy 3.0 and 3.1 and were consistently told that the MCU in Teensy 3.x does not have a hardware FPU.
        Given that LC costs less, what are the chances that it has an FPU?
    I beg your forgiveness. I forgot that the T3 has no FPU. I work daily with several different ARM MCUs, and some M4's have an FPU even in the low cost, low pin count MCUs.

    FPU saves code space. But moreover, there is good demand for hardware floating point in mid-range MCUs these days, or else the ARM vendors wouldn't offer it! As I recall, the ARM standards define the 32-bit floating point hardware and it's about the same on all Cortex MCUs.
