LC is 10.9 times slower than T3.1?

Indeed - my question was: is the sprintf codebase open to modification, as was done for print in the base-10 case? It won't help the 3.1 - but sprintf prepping decimal data for (serial) transfer would be improved on the LC.
 
is the sprintf codebase open to modification, as was done for print in the base-10 case?

sprintf is part of newlib.
https://sourceware.org/newlib/

they have a mailinglist - maybe ask there ?

Edit:

What is it? Newlib is a C library intended for use on embedded systems. It is a conglomeration of several library parts, all under free software licenses that make them easily usable on embedded products.
Newlib is only available in source form. It can be compiled for a wide array of processors, and will usually work on any architecture with the addition of a few low-level routines.

Contributions: Newlib thrives on net contributions from people like you. We're looking for contributions of code, bugfixes, optimizations, documentation updates, web page improvements, etc. A nice testsuite to automate the testing of newlib is also needed. Contributions are currently done by posting patches and ideas to newlib@sourceware.org; check out the mailing list section to find out more.

Who are we? Newlib is maintained by:
Corinna Vinschen <vinschen AT redhat DOT com>
Jeff Johnston <jjohnstn AT redhat DOT com>
Please use the mailing list for all inquiries, bug reports, and patch submissions to Newlib. Please refrain from sending unsolicited personal email.
 
Thanks Frank - an outside library - that answers my question.

Helps explain why my search of the sources I saw didn't turn up anything clear (sprintf > printf > under WiFi). I had seen 'newlib' mentioned in forum posts - but never knew I had reason to look at what you posted.
 
It's still possible to patch "our" copy and build our own toolchain; all sources are available (but I have not looked at it).
That's a lot of work, and it would have to be well tested.
Or one could tell Corinna or Jeff about the better udiv10 by mlu - it's worth a try.
In that case it will take some time, if they accept it.
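(For reference, a divide-free udiv10 of this kind is usually a shift-and-add approximation of multiplying by 1/10. A rough sketch along the lines of the well-known Hacker's Delight divu10 - illustrative only, not necessarily mlu's exact routine:)
Code:
#include <stdint.h>

/* Divide-free divmod10 sketch: approximate n * 0.8 with shifts and adds,
   shift right by 3 to get roughly n / 10, then fix the possible off-by-one.
   Based on the classic Hacker's Delight divu10; shown only to illustrate
   the idea being discussed, not as the routine used in this thread. */
static inline uint32_t divmod10_sketch(uint32_t n, uint32_t *rem)
{
	uint32_t q = (n >> 1) + (n >> 2);   /* q ~= 0.75 * n */
	q += (q >> 4);
	q += (q >> 8);
	q += (q >> 16);                     /* q ~= 0.8 * n  */
	q >>= 3;                            /* q ~= n / 10, possibly 1 too low */
	uint32_t r = n - q * 10;
	if (r >= 10) {                      /* correct the off-by-one */
		r -= 10;
		q++;
	}
	*rem = r;
	return q;
}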

On the other hand, "our" print is much smaller and better if one wants to print fast.
Printf is not meant to be fast, I think. Its advantage is its flexibility.

We can mix both.
 
Maybe a string-aware print version of the Print class? Or am I missing something there too - is there already something like a szRAM.print? That would allow T_3.1 to take advantage of the dedicated print.
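(A minimal sketch of what that could look like, assuming only the standard Arduino Print interface, where a subclass just implements virtual size_t write(uint8_t). The BufferPrint name and details are made up here, not an existing Teensyduino class:)
Code:
#include <Print.h>

// Hypothetical helper: a Print subclass that writes into a RAM buffer, so
// the optimized Print number formatting can build a string instead of
// going through sprintf's %d / %u.
class BufferPrint : public Print {
public:
	BufferPrint(char *buf, size_t size) : _buf(buf), _size(size), _len(0) {
		if (_size) _buf[0] = '\0';
	}
	virtual size_t write(uint8_t b) {
		if (_len + 1 >= _size) return 0;   // keep room for the terminator
		_buf[_len++] = (char)b;
		_buf[_len] = '\0';
		return 1;
	}
private:
	char *_buf;
	size_t _size, _len;
};

// Usage:
//   char szRAM[32];
//   BufferPrint bp(szRAM, sizeof(szRAM));
//   bp.print(1234567890);   // szRAM now holds "1234567890"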
 
I haven't read the whole thread, just wondering... Does this mean I'd have to buy 11 or so LCs in order to replicate the regular-cost board's speed? If so, it doesn't seem so "low cost".
 
If this is to speed up printf, sprintf (vsprintf) for the divide case of the M0, and if printf often goes to a screen for a human to read, what's the impetus to make a %d a bit faster?
 
Does this mean I'd have to buy 11 or so LC's in order to replicate the regular cost board's speed?

Clearly you've missed most of this thread!

Despite the dramatic title of this thread, you really should keep things in perspective. For example, message #21 has a simple test that measures the time to print several large integers. If you run this on any regular 8 bit Arduino, like Uno or Leonardo, it will take 1380 microseconds.

Serial.print() and micros() have become highly optimized as a result of this conversation. Until a couple of days ago, Teensy-LC ran that test in 228 microseconds. Now it takes only 62 microseconds! A similar speedup was made in micros().

However, Teensy-LC is still no match for the incredible speed of Teensy 3.1, which runs that test in 29 microseconds.

So in terms of speed per dollar, Teensy 3.1 is indeed a much better deal than Teensy-LC. If your code will do a lot of division, definitely get Teensy 3.1.

But keep these numbers in perspective. Even before the optimization, Teensy-LC was over 6X faster than regular 8 bit Arduino boards, which retail for much more as genuine boards, or about half the price as Chinese clones. Now with the optimization, it's 22X faster than regular AVR boards.
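(For readers without message #21 handy, the test is roughly of this shape - a reconstruction from the description above, so the exact values printed and the structure are assumptions:)
Code:
// Rough reconstruction of the kind of test described above; message #21 is
// not reproduced here, and the specific numbers printed are assumptions.
void setup() {
	Serial.begin(115200);
	while (!Serial) ;           // wait for the serial monitor
	delay(1000);

	uint32_t t0 = micros();
	Serial.println(4294967295u);
	Serial.println(2345678901u);
	Serial.println(1234567890u);
	uint32_t t1 = micros();

	Serial.print("elapsed us: ");
	Serial.println(t1 - t0);
}

void loop() {
}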
 
For comparison, I just ran the message #21 test on Arduino Due and Arduino Zero.

Arduino Due takes 71 microseconds. I suspect Due's slowness compared to Teensy 3.1 is the extra one-byte-at-a-time overhead of moving the data from Print to Serial.

Arduino Zero took well over 30000 microseconds, but that's because its Serial code lacks transmit buffering. I couldn't get it working on SerialUSB, and now I've managed to get my only Zero board locked up.
 
Despite the dramatic title of this thread
Dramatic indeed on the title - the first line would have been better: "LC is unexpectedly slower than a T3.1".

Math-wise it was true enough - as far as apples-and-oranges comparisons go. After this speedup it is closer to 3.5 times - compared to an overclocked 3.1 - and at the same clock speed 1.7X IIRC, which is in line with my expectations given what I knew of the core MCU differences.

<edit> I previously updated the 1st post to say it was improved - and just added a quote from the above to avoid further confusion.
 
Just wondering out loud:

I wonder how well the following, completely different approach, would compare speed-wise:
Code:
#include <stdint.h>

static const uint32_t udecimal[9][3] = {
    {         10uL,         30uL,         50uL },
    {        100uL,        300uL,        500uL },
    {       1000uL,       3000uL,       5000uL },
    {      10000uL,      30000uL,      50000uL },
    {     100000uL,     300000uL,     500000uL },
    {    1000000uL,    3000000uL,    5000000uL },
    {   10000000uL,   30000000uL,   50000000uL },
    {  100000000uL,  300000000uL,  500000000uL },
    { 1000000000uL, 3000000000uL,          0   }
};

char *decimal(char *buffer, uint32_t value)
{
    int  digit;

    if (value >= udecimal[8][0]) {
        *buffer = '0';
        if (value >= udecimal[8][1]) {
            value -= udecimal[8][1];
            (*buffer) += 3;
        }
        if (value >= udecimal[8][0]) {
            value -= udecimal[8][0];
            (*buffer) += 1;
            if (value >= udecimal[8][0]) {
                value -= udecimal[8][0];
                (*buffer) += 1;
            }            
        }
        buffer++;
        digit = 8;
    } else {
        /* Non-optimal binary search for highest power of ten less than value */
        if (value >= udecimal[4][0]) {
            if (value >= udecimal[6][0]) {
                if (value >= udecimal[7][0])
                    digit = 8;
                else
                    digit = 7;
            } else {
                if (value >= udecimal[5][0])
                    digit = 6;
                else
                    digit = 5;
            }
        } else {
            if (value >= udecimal[2][0]) {
                if (value >= udecimal[3][0])
                    digit = 4;
                else
                    digit = 3;
            } else {
                if (value >= udecimal[1][0])
                    digit = 2;
                else
                if (value >= udecimal[0][0])
                    digit = 1;
                else
                    digit = 0;
            }
        }
    }

    while (digit--) {
        *buffer = '0';

        if (value >= udecimal[digit][2]) {
            value -= udecimal[digit][2];
            (*buffer) += 5;
        }
        if (value >= udecimal[digit][1]) {
            value -= udecimal[digit][1];
            (*buffer) += 3;
        }
        if (value >= udecimal[digit][0]) {
            value -= udecimal[digit][0];
            (*buffer) += 1;
            if (value >= udecimal[digit][0]) {
                value -= udecimal[digit][0];
                (*buffer) += 1;
            }
        }

        buffer++;
    }

    *(buffer++) = '0' + value;
    *buffer = '\0';

    return buffer;
}
This conversion starts with a simple unoptimized binary search for the highest power of ten not higher than the value. The highest possible digit is handled separately, since it can only be 1, 2, 3, or 4.

All other digits are handled starting at the leftmost (most significant) decimal digit, subtracting 5, 3, and 1 or 2 times the current power of ten from the value in turn, while adding the same amounts to the output digit. No multiplications, divisions, or modulo operations are used.

I'm not at all sure if this is faster than repeated division-by-ten, but if the speed was really important to me, this is one of the approaches I'd carefully check.
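(A quick usage sketch, assuming the decimal() function above - the buffer needs room for the ten digits of the largest 32-bit value plus the terminating '\0':)
Code:
/* Example use of decimal() from above. */
char buf[11];
char *end = decimal(buf, 4294967295uL);
/* buf now holds "4294967295"; end points at the trailing '\0',
   so the string length is end - buf (10 here). */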
 
That looks like 408 bytes of code with conditionals and looping, plus the arrays to index and dereference, and indexing a 2D array takes multiplies, as might a 1D array depending on the offset instructions for the element size, i.e. is there an offset[word] instruction or just offset[byte]?

The _v2 looks like 72 bytes and the ASM code Paul started with looks like 120 bytes? Both are linear code and IIRC the M0 can do multiplies in one cycle.

I was wondering if inline was making things FAT/BIG so I ran without it - and it takes twice as long on the 3.1 non-optimized:
inline _v2 is 19 minutes; removing that inline takes the time to 37 minutes.
Adding all the CALL/push/pop/RETURN overhead 2^32 times adds up - apparently to roughly the run time of the 72 bytes worth of _v2 code.
 
I wonder how well the following, completely different approach, would compare speed-wise:

I tried it. 72 microseconds on Teensy-LC.

That's very good, but not as good as 62 us from the optimized divmod10 function.

Here's the code I tested, if you want to try reproducing the result or further optimizing.
 

Attachments:

  • Print.cpp (10.3 KB)
Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.

Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.

These numbers are from the simple benchmark test of message #21, which is different from how Defragster is testing.
 
Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.

Just for info :)

The simple benchmark test of message #21, with gcc 4.9-2015-q2-update:

  • 48 MHz : 68..69 us
  • 48 MHz "Optimized" (-O): 65..66 us
  • 48 MHz -O2: 66..67 us (?)

slightly (-1us) better with -O

Edit:
Just for fun:
Const table in RAM (without const), same compiler

  • 48 MHz : 64..65 us
  • 48 MHz "Optimized" (-O): 62 us
  • 48 MHz -O2: 63..64 us (?)

Edit2:
MLU version, same compiler

  • 48 MHz : 56..57 us
  • 48 MHz "Optimized" (-O): 56..57 us
  • 48 MHz -O2: 59..60 us (?)

gcc 4.9-q2 still seems to be a bit buggy (re optimizations) for Cortex-M0+
 
Code:
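    /* Modified inner loop of the decimal() function above: collect the digit
       in a local tmp and write *buffer once at the end, instead of
       read-modify-writing through the pointer for every adjustment. */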
    while (digit--) {
        char tmp = '0';
        //*buffer = '0';

        if (value >= udecimal[digit][2]) {
            value -= udecimal[digit][2];
            tmp += 5;
        }
        if (value >= udecimal[digit][1]) {
            value -= udecimal[digit][1];
            tmp += 3;
        }
        if (value >= udecimal[digit][0]) {
            value -= udecimal[digit][0];
            tmp += 1;
            if (value >= udecimal[digit][0]) {
                value -= udecimal[digit][0];
                tmp += 1;
            }
        }
        (*buffer) = tmp;
        buffer++;
    }
is 2 us faster (60 us, table in RAM) with -O.
 
Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.
I get that a lot with my day job (PowerPC 64-bit GCC support). But I guess tracking down these things is what keeps me employed. :cool:

That reminds me of Conway's law:
organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

Which, restated for GCC, means you have lots of independent passes, many of which don't talk with the other passes, and some passes probably have feuds between them.
 
Maybe we should do a toolchain update soon? I've been a little nervous to start this while the Arduino releases have been coming out so quickly.

Edit: the newer toolchain might be incompatible with Mac OS-X 10.6? Then again, Arduino dropped 10.6 support after 1.6.1...
 
That's very good, but not as good as 62 us from the optimized divmod10 function.
I suspected so (i.e. not as good as divmod10).

I felt it was interesting, because it's basically a similar approach to CORDIC -- which is still one of the easiest ways to implement sin()/cos()/tan()/arcsin()/arccos()/arctan() functions without hardware division or a 32×32=64-bit multiply. You only need bit shifts and add/subtract with carry.
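(A minimal fixed-point CORDIC sketch for sin/cos, just to illustrate the idea - this is not code from this thread. The rotation loop needs only shifts and add/subtract; the tables are filled with the C library here for brevity, though on a real target they would be precomputed constants:)
Code:
#include <stdint.h>
#include <math.h>

#define CORDIC_BITS 16
#define CORDIC_ONE  (1L << 30)            /* Q2.30 fixed point */

static int32_t cordic_atan[CORDIC_BITS];  /* atan(2^-i) in Q2.30 radians */
static int32_t cordic_k;                  /* combined gain 1/prod(sqrt(1 + 2^-2i)) */

static void cordic_init(void)
{
	double k = 1.0;
	for (int i = 0; i < CORDIC_BITS; i++) {
		cordic_atan[i] = (int32_t)(atan(ldexp(1.0, -i)) * CORDIC_ONE + 0.5);
		k /= sqrt(1.0 + ldexp(1.0, -2 * i));
	}
	cordic_k = (int32_t)(k * CORDIC_ONE + 0.5);
}

/* angle in Q2.30 radians, |angle| <= pi/2; *s and *c are returned in Q2.30 */
static void cordic_sincos(int32_t angle, int32_t *s, int32_t *c)
{
	int32_t x = cordic_k, y = 0, z = angle;
	for (int i = 0; i < CORDIC_BITS; i++) {
		int32_t dx = x >> i, dy = y >> i;   /* arithmetic shifts: multiply by 2^-i */
		if (z >= 0) { x -= dy; y += dx; z -= cordic_atan[i]; }
		else        { x += dy; y -= dx; z += cordic_atan[i]; }
	}
	*c = x;  /* ~ cos(angle) */
	*s = y;  /* ~ sin(angle) */
}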
 
Long ago, I wrote assembly-optimized float routines for the SDCC compiler. CORDIC worked great for sin, cos and atan, but with only a 24-bit mantissa the polynomial approximations ended up being faster after I'd heavily optimized multiply and add. As I recall, CORDIC was much harder for asin, acos and tan.
 
It does need extra bits in the accumulator. I only experimented with fixed-point arithmetic (Q15.48, was it?) with angular units where 1.0 corresponds to 2π radians, using Maple to calculate the coefficients at arbitrary precision; I was more worried about getting the error down to half an LSB than about optimum speed. It might have been on my Teensy++ 2.0, too. I think -- but am not sure; I'm just blabbing here -- that the problems are mirrored with these units: the polynomial approximations are nasty to get to converge down to half an ulp of error, while CORDIC is nice and easy (if you have the extra bits). Could be because π is irrational, could be I'm totally wrong.
 
Maybe we should do a toolchain update soon? I've been a little nervous to start this while the Arduino releases have been coming out so quickly.

Some thoughts:

- 4.9 includes a newer version of Newlib; itoa() and utoa() are already there. These must be removed from avr_functions.h to make it work (I had error messages).

- The Launchpad build of "4.9-2015-q2-update" is only a few days old (the current major version of GCC is 5.x).
But we are not the only ones who use it; I think there are already thousands of beta testers. The bug list (https://bugs.launchpad.net/gcc-arm-embedded) is very small, and it contains nothing important for us.
I've used 4.9-2015-q1-update and 4.9-2015-major before, without problems.

- There may be a way to test it with Teensyduino without much effort from your side:
Write a short how-to on installing the new version (I'd do it if my English were better) on your blog, and wait for feedback.
I'm sure that some of us are willing to try the new toolchain; you don't have to update Teensyduino.
There is no reason to hurry. Everyone can decide for themselves what they want to use.
 
This might sound like gibberish to some but here goes anyway:

The reason sin and cos are easy to approximate is that they are entire functions in the complex plane, so their Taylor series have an infinite radius of convergence. atan is OK since its singularities are at ±i, so its Taylor series has a radius of convergence of at least one on the real axis.

tan, asin and acos have singularities on the real axis, which is nasty for approximation.
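(For concreteness, the standard series behind that remark - textbook formulas, not from the post itself:)

$$\sin x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!} \quad \text{(converges for every } x\text{)}, \qquad \arctan x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{2k+1} \quad \text{(only for } |x| \le 1\text{, limited by the poles of } \tfrac{1}{1+x^2} \text{ at } \pm i\text{)}.$$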
 
A quick info for the new gcc 4.9-q2: I just switched back to Teensy 3.1 and got the message "cannot move location pointer backwards" when compiling "optimized".
I solved that by changing the line to
__attribute__ ((section(".startup"),optimize("Os")))
void ResetHandler(void) [....]

(EDIT: in mk20dx128.c; perhaps I did this for gcc 4.9-q1 too, can't remember... I skipped some Teensyduino versions)

Is that ok? "ResetHandler" calls setup() and loop(); I hope they are not optimized for size now? (inlined??) Edit: No.
If they were, this function would have to be made a bit smaller in another way, eventually...
 