
Thread: LC is 10.9 times slower than T3.1?

  1. #51
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    9,754
    Indeed - my question was: is sprintf codebase open to mod as was done for print in the base 10 case? Won't help 3.1 - but sprint prepping decimal data for (serial) xfer would be improved on LC.

  2. #52
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    Quote Originally Posted by defragster View Post
    Indeed - my question was: is sprintf codebase open to mod as was done for print in the base 10 case? Won't help 3.1 - but sprint prepping decimal data for (serial) xfer would be improved on LC.
    sprintf is part of newlib.
    https://sourceware.org/newlib/

    They have a mailing list - maybe ask there?

    Edit :
    What is it?

    Newlib is a C library intended for use on embedded systems. It is a conglomeration of several library parts, all under free software licenses that make them easily usable on embedded products.
    Newlib is only available in source form. It can be compiled for a wide array of processors, and will usually work on any architecture with the addition of a few low-level routines.
    Contributions

    Newlib thrives on net contributions from people like you. We're looking for contributions of code, bugfixes, optimizations, documentation updates, web page improvements, etc. A nice testsuite to automate the testing of newlib is also needed. Contributions are currently done by posting patches and ideas to newlib@sourceware.org; check out the mailing list section to find out more.
    Who are we?

    Newlib is maintained by:
    Corinna Vinschen <vinschen AT redhat DOT com>
    Jeff Johnston <jjohnstn AT redhat DOT com>
    Please use the mailing list for all inquiries, bug reports, and patch submissions to Newlib. Please refrain from sending unsolicited personal email
    Last edited by Frank B; 06-21-2015 at 08:27 PM.

  3. #53
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    9,754
    Thx Frank - outside lib - that answers my question.

    Helps explain why my machine search didn't seem clear in the sources I saw (sprintf > printf > under WiFi). Had seen 'newlib' in forum posts - but never (knew I) had reason to look at what you posted.

  4. #54
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    It's still possible to patch "our" copy and build our own toolchain; all sources are available (but I have not looked at it).
    That's a lot of work, and it must be well tested.
    Or one could tell Corinna or Jeff about the better udiv10 by mlu - it's worth a try.
    In that case, it will take some time, if they accept it.

    On the other hand, "our" print is much smaller and better if one wants to print fast.
    Printf is not meant to be fast, I think. Its advantage is flexibility.

    We can mix both.
    Last edited by Frank B; 06-21-2015 at 10:19 PM.

  5. #55
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    9,754
    Maybe a string-aware version of the Print class? Or am I missing something there too - is there already a szRAM.print? That would allow T3.1 to take advantage of the dedicated print.

  6. #56
    Junior Member
    Join Date
    May 2014
    Posts
    7
    I haven't read the whole thread, just wondering... Does this mean I'd have to buy 11 or so LC's in order to replicate the regular cost board's speed? If so it doesn't seem so "low cost"

  7. #57
    Senior Member
    Join Date
    Jun 2013
    Location
    So. Calif
    Posts
    2,828
    If this is to speed up printf and sprintf (vsprintf) for the divide case on the M0, and printf output often goes to a screen for a human to read, what's the impetus to make a %d a bit faster?

  8. #58
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    Quote Originally Posted by Budmo View Post
    Does this mean I'd have to buy 11 or so LC's in order to replicate the regular cost board's speed?
    Clearly you've missed most of this thread!

    Despite the dramatic title of this thread, you really should keep things in perspective. For example, in message #21 is a simple test that measures the time to print several large integers. If you run this on any regular 8 bit Arduino, like Uno or Leonardo, it will take 1380 microseconds.

    Serial.print() and micros() have become highly optimized as a result of this conversation. Until a couple days ago, Teensy-LC ran that test in 228 microseconds. Now it takes only 62 microseconds! A similar speedup was made in micros().

    However, Teensy-LC is still no match for the incredible speed of Teensy 3.1, which runs that test in 29 microseconds.

    So in terms of speed per dollar, Teensy 3.1 is indeed a much better deal than Teensy-LC. If your code will do a lot of division, definitely get Teensy 3.1.

    But keep these numbers in perspective. Even before the optimization, Teensy-LC was over 6X faster than regular 8 bit Arduino boards, which retail for much more as genuine boards, or about half the price as Chinese clones. Now with the optimization, it's 22X faster than regular AVR boards.
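The divmod10 optimization behind these numbers typically replaces the hardware divide with a fixed-point reciprocal multiply. Here is a minimal sketch of that standard textbook trick - not necessarily the exact code in the Teensy core:

```c
#include <assert.h>
#include <stdint.h>

/* Divide by 10 via reciprocal multiplication: 0xCCCCCCCD is
 * ceil(2^35 / 10), so (n * 0xCCCCCCCD) >> 35 equals n / 10
 * exactly for every 32-bit n -- no hardware divide needed. */
static inline uint32_t divmod10(uint32_t n, uint32_t *rem)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
    *rem = n - q * 10u;
    return q;
}
```

On a Cortex-M0+ the 32x64 multiply is itself synthesized from 32-bit multiplies, so the real core code may restructure this; the point is only that the quotient comes out exact without a divide instruction.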

  9. #59
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    For comparison, I just ran the message #21 test on Arduino Due and Arduino Zero.

    Arduino Due takes 71 microseconds. I suspect Due's slowness compared to Teensy 3.1 is extra one-byte-at-a-time overhead of moving the data from Print to Serial.

    Arduino Zero took well over 30000, but that's because its Serial code lacks transmit buffering. I couldn't get it working on SerialUSB, and now I've managed to get my only Zero board locked up.

  10. #60
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    9,754
    Quote Originally Posted by PaulStoffregen View Post
    Despite the dramatic title of this thread
    Dramatic indeed on the title - the first line would have been better as "LC is unexpectedly slower than a T3.1".

    Math-wise it was true enough - as far as apples-and-oranges comparisons go. After this speedup it is closer to 3.5 times - compared to an overclocked 3.1 - and at the same clock speed 1.7X IIRC, which is in line with my expectations given what I knew of the core MCU differences.

    <edit> I previously updated the 1st post to say it was improved - and just added a quote from the above to avoid further confusion
    Last edited by defragster; 06-22-2015 at 01:06 AM.

  11. #61
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    127
    Just wondering out loud:

    I wonder how well the following, completely different approach, would compare speed-wise:
    Code:
    #include <stdint.h>
    
    static const uint32_t udecimal[9][3] = {
        {         10uL,         30uL,         50uL },
        {        100uL,        300uL,        500uL },
        {       1000uL,       3000uL,       5000uL },
        {      10000uL,      30000uL,      50000uL },
        {     100000uL,     300000uL,     500000uL },
        {    1000000uL,    3000000uL,    5000000uL },
        {   10000000uL,   30000000uL,   50000000uL },
        {  100000000uL,  300000000uL,  500000000uL },
        { 1000000000uL, 3000000000uL,          0   }
    };
    
    char *decimal(char *buffer, uint32_t value)
    {
        int  digit;
    
        if (value >= udecimal[8][0]) {
            *buffer = '0';
            if (value >= udecimal[8][1]) {
                value -= udecimal[8][1];
                (*buffer) += 3;
            }
            if (value >= udecimal[8][0]) {
                value -= udecimal[8][0];
                (*buffer) += 1;
                if (value >= udecimal[8][0]) {
                    value -= udecimal[8][0];
                    (*buffer) += 1;
                }            
            }
            buffer++;
            digit = 8;
        } else {
            /* Non-optimal binary search for highest power of ten less than value */
            if (value >= udecimal[4][0]) {
                if (value >= udecimal[6][0]) {
                    if (value >= udecimal[7][0])
                        digit = 8;
                    else
                        digit = 7;
                } else {
                    if (value >= udecimal[5][0])
                        digit = 6;
                    else
                        digit = 5;
                }
            } else {
                if (value >= udecimal[2][0]) {
                    if (value >= udecimal[3][0])
                        digit = 4;
                    else
                        digit = 3;
                } else {
                    if (value >= udecimal[1][0])
                        digit = 2;
                    else
                    if (value >= udecimal[0][0])
                        digit = 1;
                    else
                        digit = 0;
                }
            }
        }
    
        while (digit--) {
            *buffer = '0';
    
            if (value >= udecimal[digit][2]) {
                value -= udecimal[digit][2];
                (*buffer) += 5;
            }
            if (value >= udecimal[digit][1]) {
                value -= udecimal[digit][1];
                (*buffer) += 3;
            }
            if (value >= udecimal[digit][0]) {
                value -= udecimal[digit][0];
                (*buffer) += 1;
                if (value >= udecimal[digit][0]) {
                    value -= udecimal[digit][0];
                    (*buffer) += 1;
                }
            }
    
            buffer++;
        }
    
        *(buffer++) = '0' + value;
        *buffer = '\0';
    
        return buffer;
    }
    This conversion starts with a simple unoptimized binary search for the highest power of ten not greater than the value. The highest digit is handled separately, since it can only be 1, 2, 3, or 4.

    All other digits are handled starting at the leftmost (most significant) decimal digit, adding 5, 3, and 1 or 2 to the digit while subtracting the corresponding multiples of the power of ten from the value. There are no multiplications, divisions, or modulus operations used.

    I'm not at all sure this is faster than repeated division by ten, but if speed were really important to me, this is one of the approaches I'd carefully check.
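The digit loop can be illustrated in isolation. Here is a hypothetical stand-alone helper (not taken from the posted routine) that extracts one decimal digit using only compares and subtracts:

```c
#include <assert.h>
#include <stdint.h>

/* Extract the digit (*value / p) for a power of ten p, assuming
 * *value < 10*p, using only compares and subtracts: try 5*p, then
 * 3*p, then p up to twice, so every digit 0..9 is reachable as a
 * sum of at most 5+3+1+1. Updates *value to the remainder. */
static char digit_at(uint32_t *value, uint32_t p)
{
    char d = '0';
    if (*value >= 5 * p) { *value -= 5 * p; d += 5; }
    if (*value >= 3 * p) { *value -= 3 * p; d += 3; }
    if (*value >= p) {
        *value -= p; d += 1;
        if (*value >= p) { *value -= p; d += 1; }
    }
    return d;
}
```

Calling it with p = 100, then 10, on 987 yields '9' then '8', leaving 7 as the final digit - the same subtract-5/3/1/1 pattern as the inner loop above.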

  12. #62
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    9,754
    That looks like 408 bytes of code with conditionals and looping, plus the arrays to index and dereference - and indexing a 2D array takes multiplies, as might a 1D array depending on the offset instructions for the element size, i.e. is there an offset[word] instruction or just an offset[byte]?

    The _v2 looks like 72 bytes and the ASM code Paul started with looks like 120 bytes? Both are linear code, and IIRC the M0 can do multiplies in one cycle.

    I was wondering if inline was making things FAT/BIG, so I ran without it - and it takes twice as long on 3.1 non-optimized:
    inline _v2 is 19 minutes; removing that inline takes the time to 37 minutes.
    Adding all the CALL/push/pop/RETURN overhead 2^32 times adds up - apparently to the run time of the 72 bytes worth of _v2 code.

  13. #63
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    Quote Originally Posted by Nominal Animal View Post
    I wonder how well the following, completely different approach, would compare speed-wise:
    I tried it. 72 microseconds on Teensy-LC.

    That's very good, but not as good as 62 us from the optimized divmod10 function.

    Here's the code I tested, if you want to try reproducing the result or further optimizing.
    Attached Files Attached Files
    Last edited by PaulStoffregen; 06-23-2015 at 01:02 PM.

  14. #64
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.

    Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.

    These numbers are from the simple benchmark test of message #21, which is different from how Defragster is testing.
    Last edited by PaulStoffregen; 06-23-2015 at 01:01 PM.

  15. #65
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    Quote Originally Posted by PaulStoffregen View Post
    Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.

    Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.

    These numbers are from the simple benchmark test of message #21, which is different from how Defragster is testing.
    Just for info :-)

    simple benchmark test of message #21 With gcc 4.9-2015-q2-update :

    • 48 MHz : 68..69 us
    • 48 MHz "Optimized" (-O): 65..66 us
    • 48 MHz -O2: 66..67 us (?)


    slightly (-1us) better with -O

    Edit:
    Just for fun:
    Const table in RAM (without const), same compiler

    • 48 MHz : 64..65 us
    • 48 MHz "Optimized" (-O): 62 us
    • 48 MHz -O2: 63..64 us (?)


    Edit2:
    MLU Version same compiler

    • 48 MHz : 56..57 us
    • 48 MHz "Optimized" (-O): 56..57 us
    • 48 MHz -O2: 59..60 us (?)


    gcc 4.9-q2 still seems to be a bit buggy (re optimizations) for Cortex-M0+
    Last edited by Frank B; 06-23-2015 at 08:27 PM.

  16. #66
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    Code:
        while (digit--) {
            char tmp = '0';
            //*buffer = '0';
    
            if (value >= udecimal[digit][2]) {
                value -= udecimal[digit][2];
                tmp += 5;
            }
            if (value >= udecimal[digit][1]) {
                value -= udecimal[digit][1];
                tmp += 3;
            }
            if (value >= udecimal[digit][0]) {
                value -= udecimal[digit][0];
                tmp += 1;
                if (value >= udecimal[digit][0]) {
                    value -= udecimal[digit][0];
                    tmp += 1;
                }
            }
            (*buffer) = tmp;
            buffer++;
        }
    is 2 us faster (60 us, table in RAM) with -O

  17. #67
    Senior Member+ MichaelMeissner's Avatar
    Join Date
    Nov 2012
    Location
    Ayer Massachussetts
    Posts
    3,265

    Cool

    Quote Originally Posted by PaulStoffregen View Post
    Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.

    Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.
    I get that a lot with my day job (PowerPC 64-bit GCC support). But I guess tracking down these things is what keeps me employed.

    That reminds me of Conway's law:
    organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

    Which restated for GCC means you have lots of independent passes, many of which don't talk with the other passes, and some passes probably have feuds between them.

  18. #68
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    Maybe we should do a toolchain update soon? I've been a little nervous to start this while the Arduino releases have been coming out so quickly.

    Edit: the newer toolchain might be incompatible with Mac OS-X 10.6? Then again, Arduino dropped 10.6 support after 1.6.1...
    Last edited by PaulStoffregen; 06-24-2015 at 02:55 AM.

  19. #69
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    127
    Quote Originally Posted by PaulStoffregen View Post
    That's very good, but not as good as 62 us from the optimized divmod10 function.
    I suspected so (i.e. not as good as divmod10).

    I felt it was interesting because it's basically a similar approach to CORDIC -- which is still one of the easiest ways to implement the sin()/cos()/tan()/arcsin()/arccos()/arctan() functions without hardware division or a 32×32=64-bit multiply. You only need bit shifts and add/subtract with carry.
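For reference, a minimal fixed-point CORDIC rotation sketch that computes sin and cos with only shifts and adds. The Q16 format and 16 iterations are arbitrary choices for illustration, not anything from this thread:

```c
#include <assert.h>
#include <stdint.h>

/* atan(2^-i) in Q16 radians, for i = 0..15 */
static const int32_t cordic_atan[16] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

/* CORDIC rotation mode: theta is an angle in Q16 radians with
 * |theta| <= pi/2; on return *c ~= cos(theta) and *s ~= sin(theta),
 * both in Q16. The start vector is pre-scaled by the CORDIC gain
 * K ~= 0.60725 (39797 in Q16), so no final multiply is needed. */
static void cordic_sincos(int32_t theta, int32_t *s, int32_t *c)
{
    int32_t x = 39797, y = 0;
    for (int i = 0; i < 16; i++) {
        int32_t dx = x >> i, dy = y >> i;
        if (theta >= 0) { x -= dy; y += dx; theta -= cordic_atan[i]; }
        else            { x += dy; y -= dx; theta += cordic_atan[i]; }
    }
    *c = x;
    *s = y;
}
```

Each iteration rotates the vector by ±atan(2^-i), steering theta toward zero; as the post says, only shifts and add/subtract are involved.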

  20. #70
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    20,585
    Long ago, I wrote assembly optimized float routines for the SDCC compiler. Cordic worked great for sin, cos and atan, but for only 24 bit mantissa the polynomial approximations ended up being faster after I'd heavily optimized multiply and add. As I recall, cordic was much harder for asin, acos and tan.

  21. #71
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    127
    It does need extra bits in the accumulator. I only experimented with fixed-point arithmetic (Q15.48, was it?) using angular units where 1.0 corresponds to 2π radians, with Maple calculating the coefficients at arbitrary precision; I was more worried about getting the error down to half an LSB than about optimum speed. It might have been on my Teensy++ 2.0, too. I think -- but am not sure; I'm just blabbing here -- that the problems are mirrored using these units: the polynomial approximations are nasty to get to converge down to half an ulp of error, while CORDIC is nice and easy (if you have the extra bits). Could be because π is irrational; could be I'm totally wrong.

  22. #72
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    Quote Originally Posted by PaulStoffregen View Post
    Maybe we should do a toolchain update soon? I've been a little nervous to start this while the Arduino releases have been coming out so quickly.
    Some thoughts:

    - In 4.9 there is a newer version of newlib; itoa() and utoa() are already included. These must be removed from avr_functions.h to make it work (I had error messages).

    - The Launchpad build of "4.9-2015-q2-update" is a few days old (the current major version of GCC is 5.x).
    But we are not the only ones who use it; I think there are already thousands of beta testers. The bug list (https://bugs.launchpad.net/gcc-arm-embedded) is very small, and it contains nothing important for us.
    I've used 4.9-2015-q1-update and 4.9-2015-major before, without problems.

    - There may be a way to test it with Teensyduino without much effort from your side:
    Write a short how-to on installing the new version (I'd do it if my English were better) in your blog, and wait for feedback.
    I'm sure some of us are willing to try the new toolchain; you don't have to update Teensyduino.
    There is no reason to hurry. Everyone can decide for himself what he wants to use.

  23. #73
    Senior Member
    Join Date
    Aug 2013
    Location
    Gothenburg, Sweden
    Posts
    293
    This might sound like gibberish to some but here goes anyway:

    The reason sin and cos are easy to approximate is that they are entire functions in the complex plane, so their Taylor series have an infinite radius of convergence.
    atan is OK since it has singularities at ±i, and its Taylor series has a radius of convergence of at least one on the real axis.

    tan, asin and acos have singularities on the real axis, which is nasty for approximation.
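To illustrate the convergence point: because sin is entire, even a short truncated Taylor series does well over a whole quadrant. A sketch with the plain 7th-order textbook series in Horner form (not production-quality minimax coefficients):

```c
#include <assert.h>
#include <math.h>

/* sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
 * Since the series converges everywhere, the truncation error on
 * [-pi/2, pi/2] is bounded by (pi/2)^9 / 9! ~= 1.6e-4 at the
 * interval ends and shrinks rapidly toward zero. */
static double sin_poly(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 * (1.0 / 6.0
               - x2 * (1.0 / 120.0 - x2 / 5040.0)));
}
```

Trying the same naive truncation for tan or asin near their real-axis singularities gives errors orders of magnitude worse, which is the contrast the post describes.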

  24. #74
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    5,679
    A quick info for the new gcc 4.9-q2: I just switched back to Teensy 3.1 and got the message "cannot move location pointer backwards" when compiling "optimized".
    I solved that by changing the line to
    __attribute__ ((section(".startup"),optimize("Os")))
    void ResetHandler(void) [....]

    (EDIT: in mk20dx128.c; perhaps I had done this for gcc 4.9-q1 too, can't remember... I skipped some Teensyduino versions)

    Is that OK? "ResetHandler" calls setup() and loop(); I hope they are not optimized for size now? (inlined??) Edit: No.
    If yes, this function has to be made a bit smaller in some other way... eventually.
    Last edited by Frank B; 06-26-2015 at 07:26 PM.
