LC is 10.9 times slower than T3.1?

I've reproduced your results for display & timing {added to my micros() test} after going to 1.24b3 on the LC, and no problems on the 3.1 [1 usec higher at 30 before and after]. Good for Beta_4?
?: does this affect Serial# usage and sprintf or other families as well?

Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266 outWhile Count==33182 Total Time==9999427
100
4294967295
12345
-97847383
_print_ microseconds = 69
{ BEFORE WAS :: _print_ microseconds = 233 }

Hello World! ... T_3.1 w/F_CPU == 96000000
inWhileMax==822 outWhile Count==33331 Total Time==9999856
100
4294967295
12345
-97847383
_print_ microseconds = 30

<edit> I added a count, 'outWhile', of the loops run around the 'inWhile' test I had been doing. It shows the cycles lost to overhead between the tests, but also the overall time spent trying - perhaps showing whether the results are influenced by a 'compiler quirk or missed optimization' if those numbers are out of whack.
 
I wonder how many of those benchmarks have improved on the LC with the code and toolchain changes? I don't see the testing source posted there? I don't see that you tested Read/Write FAST?

delay() type stuff will probably be as slow ( :) ) - but may be more accurate as the micros() won't be as sluggish.
 
LC tests were done back in February, sketch is derived from Arduino-land, see
http://playground.arduino.cc/Main/ShowInfo

EDIT: Duff used speedTest in overclocking measurements discussed at
https://forum.pjrc.com/threads/25755-Teensy-3-1-overclock-to-168MHz?p=48695#post48695
 
The toolchain hasn't changed since Teensy-LC was released. I would be surprised if any of those benchmarks change significantly, at least with the default that optimizes for smaller code size.

So far, the setting to optimize for speed seems to make very little difference on LC. But redesigning code to avoid division makes a tremendous speedup!
 
okay - with the IDE compile breaks I wasn't sure it hadn't evolved - the LC seems so long ago, with changes ever since. An interesting set point to track against for +/- surprises, though.
 
This may be beating a dead horse... but if anyone else is curious to check the optimization, here's a program that computes and verifies all 2^32 possible quotients and remainders.

Code:
#include <EEPROM.h>

void setup() {
  while (!Serial) ;
  delay(1000);
  Serial.println("Begin divmod10_asm optimization verify");
}

#define divmod10_asm(div, mod, tmp1, tmp2, const3333) \
asm (                              \
  " lsr   %2, %0, #16"     "\n\t"  \
  " mul   %2, %4"          "\n\t"  \
  " uxth  %1, %0"          "\n\t"  \
  " mul   %1, %4"          "\n\t"  \
  " add   %1, #1"          "\n\t"  \
  " lsr   %0, %2, #16"     "\n\t"  \
  " lsl   %2, %2, #16"     "\n\t"  \
  " add   %1, %2"          "\n\t"  \
  " mov   %3, #0"          "\n\t"  \
  " adc   %0, %3"          "\n\t"  \
  " lsl   %0, %0, #15"     "\n\t"  \
  " lsr   %2, %1, #17"     "\n\t"  \
  " orr   %0, %2"          "\n\t"  \
  " lsl   %1, %1, #15"     "\n\t"  \
  " lsr   %2, %1, #16"     "\n\t"  \
  " lsl   %3, %0, #16"     "\n\t"  \
  " orr   %2, %3"          "\n\t"  \
  " lsr   %3, %0, #16"     "\n\t"  \
  " add   %1, %0"          "\n\t"  \
  " adc   %0, %1"          "\n\t"  \
  " sub   %0, %1"          "\n\t"  \
  " add   %1, %2"          "\n\t"  \
  " adc   %0, %3"          "\n\t"  \
  " lsr   %1, %1, #4"      "\n\t"  \
  " mov   %3, #10"         "\n\t"  \
  " mul   %1, %3"          "\n\t"  \
  " lsr   %1, %1, #28"     "\n\t"  \
  : "+l" (div),                    \
    "=&l" (mod),                   \
    "=&l" (tmp1),                  \
    "=&l" (tmp2)                   \
  : "l" (const3333)                \
  :                                \
)

void loop() {
  uint32_t d, m, t1, t2, c;
  uint32_t i, correct_d, correct_m;
  uint32_t dotcount=100000, linecount=0;

  i = (EEPROM.read(3) << 24) | 0x00FFFFFF;
  Serial.print("begin at: ");
  Serial.println(i, HEX);

  correct_d = i / 10;
  correct_m = i % 10;

  c = 0x3333;
  d = 15842193;
  m = 0;

  while (1) {
    //correct_d = i / 10;  // very slow  :-(
    //correct_m = i % 10;
    d = i;
    divmod10_asm(d, m, t1, t2, c);
    if (d != correct_d || m != correct_m) {
      Serial.println();
      Serial.println("Error:");
      Serial.print("in  = ");
      Serial.println(i, HEX);
      Serial.print("in  = ");
      Serial.println(i);
      Serial.print("div = ");
      Serial.print(d);
      Serial.print(", correct = ");
      Serial.println(correct_d);
      Serial.print("mod = ");
      Serial.print(m);
      Serial.print(", correct = ");
      Serial.println(correct_m);
    }
    if (correct_m > 0) {
      correct_m = correct_m - 1;  // fast :-)
    } else {
      correct_m = 9;
      correct_d = correct_d - 1;
    }
    i--;
    if ((i & 0xFFFF) == 0) {
      if ((i & 0xFFFFFF) == 0) {
        EEPROM.write(3, i >> 24);
        Serial.print(i >> 24, HEX);
      }
      if (++dotcount > 65) {
        dotcount = 0;
        linecount++;
        Serial.println();
        Serial.print(linecount);
        Serial.print(": ");
      }
      Serial.print(".");
      
      if (i == 0) break;
    }
  }
  Serial.println();
  Serial.println("Done");
  EEPROM.write(3, 0xFF);  // reset to FF, to run again
  while (1) ; // end
}

It takes hours to do them all. Every 2^24 numbers, it stores a count to EEPROM, so it can resume about where it left off... if you don't have time to run it all at once.
 
Copied to my T3.1 and see lots of printing scrolling - and the word Error?

I assumed you put the ASM in to emulate what the LC does but would run on either MCU?

T3.1 @ 96 optimized - Win7x64 with IDE=1.6.3 and Teensy Loader 1.24-beta3
Error:
in = FFF90844
in = 4294510660
div = 429483836, correct = 429451066
mod = 5, correct = 0

?: I never ran it before, so what start value is used from the EEPROM? Edit the code for the first run? Or to reset the test?
 
I have a variant that works on my setup - not the latest of anything (Arduino nor Teensyduino), but it works ok, so...
Anyway, the following code beats both Paul's asm code and my own asm version, not by huge amounts but by around 10%.

Code:
void inline divmod10_v2(uint32_t n,uint32_t *div,uint32_t *mod) {
  uint32_t p,q;
  /* Using 32.16 fixed point representation p.q */
  /* p.q = (n+1)/512 */
  q = (n&0xFFFF) + 1;
  p = (n>>16);
  /* p.q = 51*(n+1)/512 */
  q = 13107*q;
  p = 13107*p;
  /* p.q = (1+1/2^8+1/2^16+1/2^24)*51*(n+1)/512 */
  q = q + (q>>16) + (p&0xFFFF);
  p = p + (p>>16) + (q>>16);
  /* divide by 2 */
  p = p>>1;
  *div = p;
  *mod = n-10*p;  
}
 
mlu - I swapped in the m#36 code and it seems to be working?

" divmod10_v2(i, &d, &m);"

Begin divmod10_v2 optimization verify
begin at: FFFFFFFF
begin micros: 1733982

1: ..................................................................
2: ..................................................................
3: ..................................................................
4: .........................................................FF.........
5: ..................................................................
6: ..................................................................
7: ..................................................................
8: .................................................FE.................

989: ..................................................................
990: .....1.............................................................
991: ..................................................................
992: ..................................................................
993: ...............................................................0.
Done
END micros: 1176985217

manual diff : 1,175,251,235
Minutes: 19.58

No errors noted - a run is now started on the LC.
 
would signed variables make a speed difference?

Signed values would produce the wrong results. If you shift a negative signed value right, the result is implementation-defined. On just about every computer you will run into today (except for some ancient mainframes) the implementation-defined behavior is to replicate the sign bit. Unsigned values shifted right always fill the top bits with 0's.
 
I assumed you put the ASM in to emulate what the LC does but would run on either MCU?

No, that code is only for Cortex-M0+.

Cortex-M4 has more instructions, including variants which handle the carry bit differently. To make it work on M4, at least some of those instructions would need an "s" appended, like "add" becoming "adds". In fact, one of the most frustrating things about M0 assembly (at least to me) is the almost complete lack of instructions that preserve the carry bit. Even "mov" with an immediate constant clobbers the carry bit. M4 has more options, but some of the mnemonics need to change to get the same behavior as on M0.

But there's no point in running 27 instructions requiring five of the low-8 registers on Cortex-M4, because the M4 processor has a divide instruction that does the same thing in 2 cycles, and it also has a dedicated multiply-and-subtract instruction which can be used to get the remainder.
 
Interesting to know the ASM isn't compatible.

20 minutes for the mlu code - running now on the LC - 3-4 times longer . . . update in an hour?

It looks like 75% done in ~54 minutes ...

Begin divmod10_v2 optimization verify
begin at: FFFFFFFF
begin micros: 1252063

1: ..................................................................
2: ..................................................................
3: ..................................................................
4: .........................................................FF.........
{ ... }
990: .....1.............................................................
991: ..................................................................
992: ..................................................................
993: ...............................................................0.
Done
END micros: 107977615

manual diff: 106,725,552
minutes: 1.77 mins - oops, add ~71 minutes for a micros() rollover? - dang, meant to use millis()
No Errors reported on LC (or T3.1)
 
Speaking of Dead Horses - the mlu code tests out on LC and Teensy:

No errors from the mlu code, and I see it as 13.6% or 15.8% faster depending on which way you take the ratio.

Begin divmod10_v2 optimization verify

Hello World! ... T_3.1 w/F_CPU == 120000000
begin at: FFFFFFFF
begin millis: 1862

1: ..................................................................
{ ... }
993: ...............................................................0.
Done
ERRORS detected: 0
END millis diff: 940313
END mins: 15

Begin divmod10_v2 optimization verify
Hello World! ... LC w/F_CPU == 48000000
{ ... }
ERRORS detected: 0
END millis diff: 3970690
END mins: 66

View attachment pjrcMathVer.ino

...the PJRC asm code there on LC:
ERRORS detected: 0
END millis diff: 4599301
END mins: 76
 
Signed values would produce the wrong results. If you shift a negative signed value right, the result is implementation-defined. On just about every computer you will run into today (except for some ancient mainframes) the implementation-defined behavior is to replicate the sign bit. Unsigned values shifted right always fill the top bits with 0's.

Sure, 2's complement. So, wondering if an unsigned-only speed test would be fudging, as compared to signed with various cases of operand values. And corner cases (over/underflow).
 
So, wondering if an unsigned-only speed test would be fudging,

Since the algorithm is only for unsigned numbers, and Print.cpp only needs it for unsigned numbers because signed integers are turned into unsigned for the conversion into a string of base-10 digits, I don't see how testing with only unsigned input could be considered fudging.

And corner cases (over/underflow).

divmod10() takes a single 32 bit number as input. It has no state. There are exactly 2^32 cases.

The test program verifies all 2^32 cases.
 
Can sprintf benefit from this @mlu code?

The same numbers that print in 64 usecs take sprintf 795 usecs (12.4X) [before the mlu optimization]. [The same print code on the 3.1 shows a 4X difference.] [Commas add 7 usecs, but 4 sprintf calls add another 7.]

LC w/F_CPU == 48000000
_print_ sprintf microseconds = 795
100,4294967295,12345, -97847383

100
4294967295
12345
-97847383
_print_ microseconds = 64
 
OK. This optimizing software divide is only used by print()?

With unsigned.. the overflow/underflow is a bit simpler than with signed.
 
This optimizing software divide is only used by print()?

Yes, this particular optimization is only meant for printing numbers as ordinary base-10 digits. Since it only divides by 10, it's of limited value for other uses. But printing integers is certainly worth some effort to optimize!
 