defragster
Senior Member+
Indeed - my question was: is the sprintf codebase open to modification, as was done for print in the base-10 case? It won't help the 3.1 - but sprintf prepping decimal data for (serial) transfer would be improved on the LC.
[h=2]What is it?[/h] Newlib is a C library intended for use on embedded systems. It is a conglomeration of several library parts, all under free software licenses that make them easily usable on embedded products.
Newlib is only available in source form. It can be compiled for a wide array of processors, and will usually work on any architecture with the addition of a few low-level routines.
[h=2]Contributions[/h] Newlib thrives on net contributions from people like you. We're looking for contributions of code, bugfixes, optimizations, documentation updates, web page improvements, etc. A nice testsuite to automate the testing of newlib is also needed. Contributions are currently done by posting patches and ideas to newlib@sourceware.org; check out the mailing list section to find out more.
[h=2]Who are we?[/h] Newlib is maintained by:
Corinna Vinschen <vinschen AT redhat DOT com>
Jeff Johnston <jjohnstn AT redhat DOT com>
Please use the mailing list for all inquiries, bug reports, and patch submissions to Newlib. Please refrain from sending unsolicited personal email.
Does this mean I'd have to buy 11 or so LCs in order to replicate the regular-cost board's speed?
Quoting "Despite the dramatic title of this thread": dramatic indeed on the title - the first line would have been better as "LC is unexpectedly slower than a T3.1".
[CODE]
#include <stdint.h>

/* Row d holds { 1, 3, 5 } * 10^d.  Each decimal digit 0..9 can be built
 * from at most one 5, one 3, and two 1s, so no division is needed. */
static const uint32_t udecimal[9][3] = {
	{         10uL,         30uL,         50uL },
	{        100uL,        300uL,        500uL },
	{       1000uL,       3000uL,       5000uL },
	{      10000uL,      30000uL,      50000uL },
	{     100000uL,     300000uL,     500000uL },
	{    1000000uL,    3000000uL,    5000000uL },
	{   10000000uL,   30000000uL,   50000000uL },
	{  100000000uL,  300000000uL,  500000000uL },
	{ 1000000000uL, 3000000000uL,           0uL }  /* 5e9 would overflow */
};

char *decimal(char *buffer, uint32_t value)
{
	int digit;

	if (value >= udecimal[8][0]) {
		/* Billions digit (at most 4): handled with the 3x and 1x
		 * entries only, since 5e9 does not fit in a uint32_t. */
		*buffer = '0';
		if (value >= udecimal[8][1]) {
			value -= udecimal[8][1];
			(*buffer) += 3;
		}
		if (value >= udecimal[8][0]) {
			value -= udecimal[8][0];
			(*buffer) += 1;
			if (value >= udecimal[8][0]) {
				value -= udecimal[8][0];
				(*buffer) += 1;
			}
		}
		buffer++;
		digit = 8;
	} else {
		/* Non-optimal binary search for the highest power of ten
		 * not greater than value. */
		if (value >= udecimal[4][0]) {
			if (value >= udecimal[6][0]) {
				if (value >= udecimal[7][0])
					digit = 8;
				else
					digit = 7;
			} else {
				if (value >= udecimal[5][0])
					digit = 6;
				else
					digit = 5;
			}
		} else {
			if (value >= udecimal[2][0]) {
				if (value >= udecimal[3][0])
					digit = 4;
				else
					digit = 3;
			} else {
				if (value >= udecimal[1][0])
					digit = 2;
				else if (value >= udecimal[0][0])
					digit = 1;
				else
					digit = 0;
			}
		}
	}
	while (digit--) {
		*buffer = '0';
		if (value >= udecimal[digit][2]) {
			value -= udecimal[digit][2];
			(*buffer) += 5;
		}
		if (value >= udecimal[digit][1]) {
			value -= udecimal[digit][1];
			(*buffer) += 3;
		}
		if (value >= udecimal[digit][0]) {
			value -= udecimal[digit][0];
			(*buffer) += 1;
			if (value >= udecimal[digit][0]) {
				value -= udecimal[digit][0];
				(*buffer) += 1;
			}
		}
		buffer++;
	}
	*(buffer++) = '0' + value;
	*buffer = '\0';
	return buffer;
}
[/CODE]
Quoting "inline _v2 is 19 mins, removing that inline takes the time to 37 minutes": adding all the CALL/push/pop/RETURN overhead 2^32 times adds up - apparently to the run time of the 72 bytes' worth of _v2 code.
I wonder how well the following, completely different, approach would compare speed-wise:
Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC.
Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.
These numbers are from the simple benchmark test of message #21, which is different from how Defragster is testing.
[CODE]
	/* Same inner loop, but the digit is accumulated in a local (which
	 * the compiler can keep in a register) instead of repeated
	 * read-modify-writes through the buffer pointer. */
	while (digit--) {
		char tmp = '0';		/* was: *buffer = '0'; */
		if (value >= udecimal[digit][2]) {
			value -= udecimal[digit][2];
			tmp += 5;
		}
		if (value >= udecimal[digit][1]) {
			value -= udecimal[digit][1];
			tmp += 3;
		}
		if (value >= udecimal[digit][0]) {
			value -= udecimal[digit][0];
			tmp += 1;
			if (value >= udecimal[digit][0]) {
				value -= udecimal[digit][0];
				tmp += 1;
			}
		}
		*buffer = tmp;
		buffer++;
	}
[/CODE]
Quoting "Oh, compiling this with -O instead of -Os results in 67 microseconds on Teensy-LC. Edit: tried -O2 & -O3. Both are slower than -O. Probably the same compiler bug we've seen before with -O2 on Cortex-M0+.": I get that a lot in my day job (PowerPC 64-bit GCC support). But I guess tracking down these things is what keeps me employed.
Quoting "That's very good, but not as good as 62 us from the optimized divmod10 function": I suspected so (i.e., not as good as divmod10).
Maybe we should do a toolchain update soon? I've been a little nervous to start this while the Arduino releases have been coming out so quickly.