LC is 10.9 times slower than T3.1?

defragster

Senior Member+
IMPROVED:: LC much slower than T3.1 : MICROS updated

<edit>: Confirming that 1.24b3 compiles LC to run micros FASTER and well.

<edit>:
Serial.print() and micros() have become highly optimized as a result of this conversation. Until a couple days ago, Teensy-LC ran that test in 228 microseconds. Now it takes only 62 microseconds! A similar speedup was made in micro().
{ ... }
Even before the optimization, Teensy-LC was over 6X faster than regular 8 bit Arduino boards ... Now with the optimization, it's 22X faster than regular AVR boards.



LC is unexpectedly slower than a T3.1 for seemingly simple counting loop? Is this from using elapsedMicros [unsigned long us;] comparison? [Putting a (uint16_t) cast on it doesn't help]
IDE_1.6.3 w/TLoader_1.24.b2

Code:
  elapsedMicros EUwait;
...
while ( 300 > EUwait ) {       inWhile++;     }

Full Code below gives:
> inWhile==78 for LC OPTIMIZED 48MHz
> inWhile==77 for LC 48MHz
> inWhile==44 for LC OPTIMIZED 24MHz

> inWhile==846 for T3.1 OPTIMIZED 96MHz
> inWhile==422 for T3.1 OPTIMIZED 48MHz

Code:
#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))

void setup() {
  digitalWriteFast(LED_BUILTIN, 0);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);  // USB
  qBlink();
  while (!Serial && millis() <= 3000) if (!(millis() % 500))  qBlink();
  Serial.print("\n\n  Hello World! ... ");
  Serial.println(millis());
  qBlink();
}

void loop() {
  qBlink();

  elapsedMillis waiting;     // "waiting" starts at zero
  elapsedMillis EMwait;
  uint mStart, mStop, mMax = 0;
  mStop = millis();
  waiting = 0;           // reset waiting back to zero
  elapsedMicros EUwait;
  int inWhile;
  int inWhileMax = 0;

  mStop = micros();
  waiting = 0;           // reset waiting back to zero
  while (waiting < 10000) {
    inWhile = 0;
    EUwait = 0;
    mStart = micros();
[B]    while ( 300 > EUwait ) {
      // Do what you must while waiting
      inWhile++;
    }[/B]
    mStop = micros();
    if ( inWhile > inWhileMax )
    {
      qBlink();
      inWhileMax = inWhile;
      Serial.print(" inWhile==");
      Serial.print(inWhileMax);
    }
    if ( (mStop - mStart) > mMax )
    {
      mMax = mStop - mStart;
    }
  }
  Serial.print("\ninWhile==");
  Serial.print(inWhileMax);
  Serial.print(" Max Time Seen==");
  Serial.println(mMax);
}

PS: I should recode this with METRO to compare.
 
Last edited:
At the same clock frequency 48MHz the ratio is more like 5.5 times.

Probably most of difference is from the internals of calculating the value of EUWait.
EUWait calls micros() and micros contans one 32 multiplication and one 32 bit division.

I suspect that its this division that is significantly slower on the LC's Cortex-M0+ core.
 
Last edited:
Yup, Cortex-M4 has 32 bit division as a 2 cycle instruction.

Cortex-M0+ has a call a library function, which divides 2 numbers one bit at a time. At least I think it's 1 bit at a time, but I haven't looked at that specific code.

Maybe there's a crafty optimization possible for divide-by-1000? One of my low priority tasks involves porting the "Stimmer optimization" from the AVR version of Print.cpp to Teensy-LC. Maybe it can be extended to 1000?
 
its divide by 48 I think, using a small denominator 'should' help. so factoring out powers of two
into shift operations might help.
 
Last edited:
ODD RESULTS??? More is better iteration count.

I put it here - I:\Teensy163\hardware\teensy\avr\cores\teensy3 - REMEMBER the old copy needs to be removed or a NEW extension if the old file is preserved!
LC::
OLD CODE:: inWhile==78

24 MHz:: inWhile==338
24 MHz OPT:: inWhile==367

48 MHz OPT : inWhile==247
48 MHz : inWhile==247

T3.1::
inWhile==822 [ SLOWER than > inWhile==846 for T3.1 OPTIMIZED 96MHz ]
inWhile==410 [ SLOWER than > inWhile==422 for T3.1 OPTIMIZED 48MHz ]

Code DIFF I see shows the added KINETISL and F_CPU, going to rewrite output so it is self documenting:
Code:
#if defined(KINETISL) && F_CPU == 48000000
	return count * 1000 + ((current * (uint32_t)87381) >> 22);
#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)174763) >> 23);
#endif
	return count * 1000 + current / (F_CPU / 1000000);
 
Last edited:
@mlu - thanks found that - now if I could ID OPTIMIZATION compile switch ...

I suspect a math error here because the 24MHz result is too HIGH:
Code:
#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)174763) >> 23);

Since all I'm doing is RECOMPILING I need to move to my other faster workstation . . .

OPTIMIZED RESULTS :: WITH updated pins_teensy.c as provided
Hello World! ... T_3.1 w/F_CPU == 48000000
inWhileMax==411

Hello World! ... T_3.1 w/F_CPU == 96000000
inWhileMax==822

Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266

Hello World! ... LC w/F_CPU == 24000000
inWhileMax==368

Refined CODE - removed extra notes - shows F_CPU and compiled type and only prints the inWhileMax on repeat tests if it increases:
Code:
#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))

void setup() {
  digitalWriteFast(LED_BUILTIN, 0);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);  // USB
  qBlink();
  while (!Serial && millis() <= 3000) if (!(millis() % 500))  qBlink();

#if defined(KINETISL)
  Serial.print("\n\n  Hello World! ... LC w/F_CPU == ");
#else
  Serial.print("\n\n  Hello World! ... T_3.1 w/F_CPU == ");
#endif
  Serial.println( F_CPU );
  qBlink();
}

int inWhileMaxMax = -1;

void loop() {
  qBlink();

  elapsedMillis waiting;     // "waiting" starts at zero
  elapsedMicros EUwait;
  int inWhile;
  int inWhileMax = 0;

  waiting = 0;           // reset waiting back to zero
  while (waiting < 10000) {
    inWhile = 0;
    EUwait = 0;
    while ( 300 > (uint16_t) EUwait ) {
      inWhile++;
    }
    if ( inWhile > inWhileMax )
    {
      qBlink();
      inWhileMax = inWhile;
    }
  }
  if ( inWhileMax > inWhileMaxMax )
  {
    inWhileMaxMax = inWhileMax;
    Serial.print("inWhileMax==");
    Serial.println(inWhileMaxMax);
  }
}
 
I saw that strange result too. I don't know why it's happening.

I did some other quick checks to make sure 48 MHz mode really is about twice as fast as 24 MHz mode, which looked good.
 
Luckily I did a test on the old code (post #1) and saw it wasn't that way before. I'll repeat the unchanged code test when I can get to the other machine.

Perhaps add a calibration busy loop for a baseline. Ideally something that optimization would affect to indicate compile flags? I should also bring in METRO and see how it acts as it can do the same AFAIK.

Paul: Do you have a little snip that would give a baseline and maybe change perf based on optimization? I may add this code in as it was an LC issue that I don't see 'resolved' but have not revisited: https://forum.pjrc.com/threads/2829...lWriteFast-on-LC?p=70204&viewfull=1#post70204

// pseudo
count=200,000
start=micros
while (count) {
____count --;
____// some optimizable code snip
}
printtime(micros-start)
 
With the unexpected release of Arduino 1.6.5r2 today, I won't be able to look at this for a couple days, and more likely sometime next week.

My guess is your benchmark may be hitting a compiler quirk or missed optimization. I'm pretty sure it's not anything serious, but I can't be 100% certain until I investigate deeply. But that just can't happen soon. Sorry... this is simply a much lower priority.
 
The 24MHz define should probably be:

Code:
#if defined(KINETISL) && F_CPU == 48000000
	return count * 1000 + ((current * (uint32_t)87381) >> 22);
#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)87381) >> 21);

Multiplying with 87381 and then dividing with 2^21 (shifting) is the same as (to six digits)
dividing with 24, the number of clocks per millisecond.
 
Last edited:
@Paul - Indeed the IDE is first for all. If there is an optimization here it can wait. Maybe a big help for the LC - but must be done right.

@mlu - thanks - didn't have any time to move this yet to look myself.

Top is what was from the thread above p#5, bottom two are shown after the mlu p#12 code swap
Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266
Hello World! ... LC w/F_CPU == 24000000
inWhileMax==368

Hello World! ... LC w/F_CPU == 24000000
inWhileMax==184
Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266
 
Putting this default code under #else would save it from being compiled into the KINETISL if && elif cases. I compiled and tested, results as follows:

https://github.com/PaulStoffregen/cores/blob/master/teensy3/pins_teensy.c#L888
Code:
888 #if defined(KINETISL) && F_CPU == 48000000 
889 	return count * 1000 + ((current * (uint32_t)87381) >> 22); 
890 #elif defined(KINETISL) && F_CPU == 24000000 
891 	return count * 1000 + ((current * (uint32_t)174763) >> 22); 
[B]892 #else [/B]
893 	return count * 1000 + current / (F_CPU / 1000000); 
[B]894 #endif [/B]

ALL OPTIMIZED:
Hello World! ... LC w/F_CPU == 24000000
inWhileMax==184
Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266

Hello World! ... T_3.1 w/F_CPU == 48000000
inWhileMax==411
Hello World! ... T_3.1 w/F_CPU == 96000000
inWhileMax==822

<edit> Compiler seems to see the 2nd return as unreachable and drops it as the size is the same
 
Last edited:
Confirming that 1.24b3 compiles LC FASTER and well.

This will be a big improvement there for ANY use of Micros! In fact some LC users may need a warning about this build - turbo charging their app!
Hello World! ... LC w/F_CPU == 48000000 inWhileMax==248
Hello World! ... LC w/F_CPU == 24000000 inWhileMax==163

Hello World! ... T_3.1 w/F_CPU == 96000000 inWhileMax==822
Hello World! ... T_3.1 w/F_CPU == 48000000 inWhileMax==411

Versus these older numbers for the 1.23 build from above:
> inWhile==78 for LC OPTIMIZED 48MHz
> inWhile==44 for LC OPTIMIZED 24MHz
 
Good plan to remove other key points - glad I stumbled on this one and BUG'd it, sweet it had an easy fix and published quickly - given the prevalence and casual use of some things like micros() - direct and indirect.
 
Ok, here's a quick and dirty benchmark for printing base 10 numbers.

Code:
void setup() {
}

void loop() {
  uint32_t t1 = micros();
  Serial.println(100);
  Serial.println(0xFFFFFFFFul);
  Serial.println(12345);
  Serial.println(-97847383l);
  uint32_t t2 = micros();
  Serial.print("microseconds = ");
  Serial.println(t2 - t1);
  Serial.println();
  delay(1000);
}

Teensy 3.1 (hardware div) 29
Teensy-LC unoptimized 228
Teensy 2.0 with Stimmer 204
Teensy 2.0 unoptimized 1120

With similar unoptimized code, of course 32 bit Teensy-LC easily beats 8 bit AVR.

But you can see how impressive the Stimmer optimization is.....
 
Last edited:
This function implements the same mathematics as Stimmer describes

Code:
void divmod10(uint32_t n,uint32_t *div,uint32_t *mod) {
  uint32_t p,q;
  /* Using 32.16 fixed point representation p.q */
  /* p.q = (n+1)/512 */
  q = ((n<<7)&0xFFFF) + 0x80;
  p = (n>>9);
  /* p.q = 51*(n+1)/512 */
  q = 51*q;
  p = 51*p;
  /* p.q = (1+1/2^8)*51*(n+1)/512 */
  q = q + (q>>8) + ((p<<8)&0xFFFF);
  p = p + (p>>8) + (q>>16);
  q = q&0xFFFF;
  /* p.q = (1+1/2^8+1/2^16+1/2^24)*51*(n+1)/512 */
  q = q + (p&0xFFFF);
  p = p + (p>>16) + (q>>16);
  *div = p;
  *mod = n-10*p;  
}

Needs some careful testing.
 
I don't recall the timings of the Cortex-M0 that the Teensy-LC uses, but some M0's multiply takes 32 cycles, while others take 1 cycle. If GCC is thinking it is one method and the LC's cpu is the other, you can generate slower code. I.e. if the chip can do multiply in 1 cycle but the compiler thinks it has a slow multiply, it will do the *51 as a series of adds and shifts. Likewise, if the chip has a slow multiply and the compiler thinks it has a fast multiply, it will be slower to do the MUL instruction.

Unfortunately, the Cortex-M0 level does not have a multiply upper instruction, which would make calculating division by a constant easier.
 
True, this is the same for toolchain libraries, and everytime a multiplication is done on a Cortex M0,
and the problem persists until toolchains has flags for specifying the speed of M0 hardware multiply.

But in this case we can measure if there is a significant speedup or perhaps the opposite.
 
I've implemented the Print speedup for Teensy-LC. Here's the results. :)

Code:
Teensy 3.1 (hardware div)   29
Teensy-LC  optimized        69
Teensy-LC  unoptimized     228   (or 140 with unoptimized Teensy 2.0 code)
Teensy 2.0 Stimmer opt     204
Teensy 2.0 unoptimized    1120

Here are the modified files, if you'd like to drop them into hardware/teensy/avr/cores/teensy3 to give it a try.
 

Attachments

  • Print.cpp
    7 KB · Views: 198
  • Print.h
    5.3 KB · Views: 209
Last edited:
Back
Top