LC is 10.9 times slower than T3.1?

defragster · Jun 17, 2015

IMPROVED:: LC much slower than T3.1 : MICROS updated

<edit>: Confirming that 1.24b3 compiles LC to run micros FASTER and well.

<edit>:

PaulStoffregen said:
Serial.print() and micros() have become highly optimized as a result of this conversation. Until a couple days ago, Teensy-LC ran that test in 228 microseconds. Now it takes only 62 microseconds! A similar speedup was made in micro().
{ ... }
Even before the optimization, Teensy-LC was over 6X faster than regular 8 bit Arduino boards ... Now with the optimization, it's 22X faster than regular AVR boards.

LC is unexpectedly slower than a T3.1 for seemingly simple counting loop? Is this from using elapsedMicros [unsigned long us;] comparison? [Putting a (uint16_t) cast on it doesn't help]
IDE_1.6.3 w/TLoader_1.24.b2

Code:

  elapsedMicros EUwait;
...
while ( 300 > EUwait ) {       inWhile++;     }

Full Code below gives:
> inWhile==78 for LC OPTIMIZED 48MHz
> inWhile==77 for LC 48MHz
> inWhile==44 for LC OPTIMIZED 24MHz

> inWhile==846 for T3.1 OPTIMIZED 96MHz
> inWhile==422 for T3.1 OPTIMIZED 48MHz

Code:

#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))

void setup() {
  digitalWriteFast(LED_BUILTIN, 0);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);  // USB
  qBlink();
  while (!Serial && millis() <= 3000) if (!(millis() % 500))  qBlink();
  Serial.print("\n\n  Hello World! ... ");
  Serial.println(millis());
  qBlink();
}

void loop() {
  qBlink();

  elapsedMillis waiting;     // "waiting" starts at zero
  elapsedMillis EMwait;
  uint mStart, mStop, mMax = 0;
  mStop = millis();
  waiting = 0;           // reset waiting back to zero
  elapsedMicros EUwait;
  int inWhile;
  int inWhileMax = 0;

  mStop = micros();
  waiting = 0;           // reset waiting back to zero
  while (waiting < 10000) {
    inWhile = 0;
    EUwait = 0;
    mStart = micros();
[B]    while ( 300 > EUwait ) {
      // Do what you must while waiting
      inWhile++;
    }[/B]
    mStop = micros();
    if ( inWhile > inWhileMax )
    {
      qBlink();
      inWhileMax = inWhile;
      Serial.print(" inWhile==");
      Serial.print(inWhileMax);
    }
    if ( (mStop - mStart) > mMax )
    {
      mMax = mStop - mStart;
    }
  }
  Serial.print("\ninWhile==");
  Serial.print(inWhileMax);
  Serial.print(" Max Time Seen==");
  Serial.println(mMax);
}

PS: I should recode this with METRO to compare.

mlu · Jun 17, 2015

At the same clock frequency 48MHz the ratio is more like 5.5 times.

Probably most of difference is from the internals of calculating the value of EUWait.
EUWait calls micros() and micros contans one 32 multiplication and one 32 bit division.

I suspect that its this division that is significantly slower on the LC's Cortex-M0+ core.

PaulStoffregen · Jun 17, 2015

Yup, Cortex-M4 has 32 bit division as a 2 cycle instruction.

Cortex-M0+ has a call a library function, which divides 2 numbers one bit at a time. At least I think it's 1 bit at a time, but I haven't looked at that specific code.

Maybe there's a crafty optimization possible for divide-by-1000? One of my low priority tasks involves porting the "Stimmer optimization" from the AVR version of Print.cpp to Teensy-LC. Maybe it can be extended to 1000?

mlu · Jun 17, 2015

its divide by 48 I think, using a small denominator 'should' help. so factoring out powers of two
into shift operations might help.

PaulStoffregen · Jun 17, 2015

Please give this speedup a try.

defragster · Jun 17, 2015

ODD RESULTS??? More is better iteration count.

I put it here - I:\Teensy163\hardware\teensy\avr\cores\teensy3 - REMEMBER the old copy needs to be removed or a NEW extension if the old file is preserved!
LC::

OLD CODE:: inWhile==78

24 MHz:: inWhile==338
24 MHz OPT:: inWhile==367
48 MHz OPT : inWhile==247
48 MHz : inWhile==247

T3.1::

inWhile==822 [ SLOWER than > inWhile==846 for T3.1 OPTIMIZED 96MHz ]
inWhile==410 [ SLOWER than > inWhile==422 for T3.1 OPTIMIZED 48MHz ]

Code DIFF I see shows the added KINETISL and F_CPU, going to rewrite output so it is self documenting:

Code:

#if defined(KINETISL) && F_CPU == 48000000
	return count * 1000 + ((current * (uint32_t)87381) >> 22);
#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)174763) >> 23);
#endif
	return count * 1000 + current / (F_CPU / 1000000);

mlu · Jun 17, 2015

Configured, and compiled, clock speed is F_CPU

defragster · Jun 17, 2015

@mlu - thanks found that - now if I could ID OPTIMIZATION compile switch ...

I suspect a math error here because the 24MHz result is too HIGH:

Code:

#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)174763) >> 23);

Since all I'm doing is RECOMPILING I need to move to my other faster workstation . . .

OPTIMIZED RESULTS :: WITH updated pins_teensy.c as provided

Hello World! ... T_3.1 w/F_CPU == 48000000
inWhileMax==411

Hello World! ... T_3.1 w/F_CPU == 96000000
inWhileMax==822

Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266

Hello World! ... LC w/F_CPU == 24000000
inWhileMax==368

Refined CODE - removed extra notes - shows F_CPU and compiled type and only prints the inWhileMax on repeat tests if it increases:

Code:

#define qBlink() (digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN) ))

void setup() {
  digitalWriteFast(LED_BUILTIN, 0);
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);  // USB
  qBlink();
  while (!Serial && millis() <= 3000) if (!(millis() % 500))  qBlink();

#if defined(KINETISL)
  Serial.print("\n\n  Hello World! ... LC w/F_CPU == ");
#else
  Serial.print("\n\n  Hello World! ... T_3.1 w/F_CPU == ");
#endif
  Serial.println( F_CPU );
  qBlink();
}

int inWhileMaxMax = -1;

void loop() {
  qBlink();

  elapsedMillis waiting;     // "waiting" starts at zero
  elapsedMicros EUwait;
  int inWhile;
  int inWhileMax = 0;

  waiting = 0;           // reset waiting back to zero
  while (waiting < 10000) {
    inWhile = 0;
    EUwait = 0;
    while ( 300 > (uint16_t) EUwait ) {
      inWhile++;
    }
    if ( inWhile > inWhileMax )
    {
      qBlink();
      inWhileMax = inWhile;
    }
  }
  if ( inWhileMax > inWhileMaxMax )
  {
    inWhileMaxMax = inWhileMax;
    Serial.print("inWhileMax==");
    Serial.println(inWhileMaxMax);
  }
}

PaulStoffregen · Jun 17, 2015

I saw that strange result too. I don't know why it's happening.

I did some other quick checks to make sure 48 MHz mode really is about twice as fast as 24 MHz mode, which looked good.

defragster · Jun 17, 2015

Luckily I did a test on the old code (post #1) and saw it wasn't that way before. I'll repeat the unchanged code test when I can get to the other machine.

Perhaps add a calibration busy loop for a baseline. Ideally something that optimization would affect to indicate compile flags? I should also bring in METRO and see how it acts as it can do the same AFAIK.

Paul: Do you have a little snip that would give a baseline and maybe change perf based on optimization? I may add this code in as it was an LC issue that I don't see 'resolved' but have not revisited: https://forum.pjrc.com/threads/2829...lWriteFast-on-LC?p=70204&viewfull=1#post70204

// pseudo
count=200,000
start=micros
while (count) {
____count --;
____// some optimizable code snip
}
printtime(micros-start)

PaulStoffregen · Jun 17, 2015

With the unexpected release of Arduino 1.6.5r2 today, I won't be able to look at this for a couple days, and more likely sometime next week.

My guess is your benchmark may be hitting a compiler quirk or missed optimization. I'm pretty sure it's not anything serious, but I can't be 100% certain until I investigate deeply. But that just can't happen soon. Sorry... this is simply a much lower priority.

mlu · Jun 17, 2015

The 24MHz define should probably be:

Code:

#if defined(KINETISL) && F_CPU == 48000000
	return count * 1000 + ((current * (uint32_t)87381) >> 22);
#elif defined(KINETISL) && F_CPU == 24000000
	return count * 1000 + ((current * (uint32_t)87381) >> 21);

Multiplying with 87381 and then dividing with 2^21 (shifting) is the same as (to six digits)
dividing with 24, the number of clocks per millisecond.

defragster · Jun 17, 2015

@Paul - Indeed the IDE is first for all. If there is an optimization here it can wait. Maybe a big help for the LC - but must be done right.

@mlu - thanks - didn't have any time to move this yet to look myself.

Top is what was from the thread above p#5, bottom two are shown after the mlu p#12 code swap

Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266
Hello World! ... LC w/F_CPU == 24000000
inWhileMax==368

Hello World! ... LC w/F_CPU == 24000000
inWhileMax==184
Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266

PaulStoffregen · Jun 17, 2015

I'm merging a fix now, so it'll be in the next installer.

defragster · Jun 17, 2015

Good. I'll look for that and to get the Arduino 1.6.5r2 and make sure all goes well . . .

defragster · Jun 17, 2015

Putting this default code under #else would save it from being compiled into the KINETISL if && elif cases. I compiled and tested, results as follows:

https://github.com/PaulStoffregen/cores/blob/master/teensy3/pins_teensy.c#L888

Code:

888 #if defined(KINETISL) && F_CPU == 48000000 
889 	return count * 1000 + ((current * (uint32_t)87381) >> 22); 
890 #elif defined(KINETISL) && F_CPU == 24000000 
891 	return count * 1000 + ((current * (uint32_t)174763) >> 22); 
[B]892 #else [/B]
893 	return count * 1000 + current / (F_CPU / 1000000); 
[B]894 #endif [/B]

ALL OPTIMIZED:
Hello World! ... LC w/F_CPU == 24000000
inWhileMax==184
Hello World! ... LC w/F_CPU == 48000000
inWhileMax==266

Hello World! ... T_3.1 w/F_CPU == 48000000
inWhileMax==411
Hello World! ... T_3.1 w/F_CPU == 96000000
inWhileMax==822

<edit> Compiler seems to see the 2nd return as unreachable and drops it as the size is the same

PaulStoffregen · Jun 18, 2015

1.24-beta3 has the fix.

defragster · Jun 18, 2015

Confirming that 1.24b3 compiles LC FASTER and well.

This will be a big improvement there for ANY use of Micros! In fact some LC users may need a warning about this build - turbo charging their app!

Hello World! ... LC w/F_CPU == 48000000 inWhileMax==248
Hello World! ... LC w/F_CPU == 24000000 inWhileMax==163

Hello World! ... T_3.1 w/F_CPU == 96000000 inWhileMax==822
Hello World! ... T_3.1 w/F_CPU == 48000000 inWhileMax==411

Versus these older numbers for the 1.23 build from above:

> inWhile==78 for LC OPTIMIZED 48MHz
> inWhile==44 for LC OPTIMIZED 24MHz

PaulStoffregen · Jun 18, 2015

This really shows I ought to port the Stimmer optimization from AVR to Cortex-M0+

defragster · Jun 18, 2015

Good plan to remove other key points - glad I stumbled on this one and BUG'd it, sweet it had an easy fix and published quickly - given the prevalence and casual use of some things like micros() - direct and indirect.

PaulStoffregen · Jun 19, 2015

Ok, here's a quick and dirty benchmark for printing base 10 numbers.

Code:

void setup() {
}

void loop() {
  uint32_t t1 = micros();
  Serial.println(100);
  Serial.println(0xFFFFFFFFul);
  Serial.println(12345);
  Serial.println(-97847383l);
  uint32_t t2 = micros();
  Serial.print("microseconds = ");
  Serial.println(t2 - t1);
  Serial.println();
  delay(1000);
}

Teensy 3.1 (hardware div) 29
Teensy-LC unoptimized 228
Teensy 2.0 with Stimmer 204
Teensy 2.0 unoptimized 1120

With similar unoptimized code, of course 32 bit Teensy-LC easily beats 8 bit AVR.

But you can see how impressive the Stimmer optimization is.....

mlu · Jun 19, 2015

This function implements the same mathematics as Stimmer describes

Code:

void divmod10(uint32_t n,uint32_t *div,uint32_t *mod) {
  uint32_t p,q;
  /* Using 32.16 fixed point representation p.q */
  /* p.q = (n+1)/512 */
  q = ((n<<7)&0xFFFF) + 0x80;
  p = (n>>9);
  /* p.q = 51*(n+1)/512 */
  q = 51*q;
  p = 51*p;
  /* p.q = (1+1/2^8)*51*(n+1)/512 */
  q = q + (q>>8) + ((p<<8)&0xFFFF);
  p = p + (p>>8) + (q>>16);
  q = q&0xFFFF;
  /* p.q = (1+1/2^8+1/2^16+1/2^24)*51*(n+1)/512 */
  q = q + (p&0xFFFF);
  p = p + (p>>16) + (q>>16);
  *div = p;
  *mod = n-10*p;  
}

Needs some careful testing.

MichaelMeissner · Jun 19, 2015

I don't recall the timings of the Cortex-M0 that the Teensy-LC uses, but some M0's multiply takes 32 cycles, while others take 1 cycle. If GCC is thinking it is one method and the LC's cpu is the other, you can generate slower code. I.e. if the chip can do multiply in 1 cycle but the compiler thinks it has a slow multiply, it will do the *51 as a series of adds and shifts. Likewise, if the chip has a slow multiply and the compiler thinks it has a fast multiply, it will be slower to do the MUL instruction.

Unfortunately, the Cortex-M0 level does not have a multiply upper instruction, which would make calculating division by a constant easier.

mlu · Jun 19, 2015

True, this is the same for toolchain libraries, and everytime a multiplication is done on a Cortex M0,
and the problem persists until toolchains has flags for specifying the speed of M0 hardware multiply.

But in this case we can measure if there is a significant speedup or perhaps the opposite.

PaulStoffregen · Jun 19, 2015

I've implemented the Print speedup for Teensy-LC. Here's the results.

Code:

Teensy 3.1 (hardware div)   29
Teensy-LC  optimized        69
Teensy-LC  unoptimized     228   (or 140 with unoptimized Teensy 2.0 code)
Teensy 2.0 Stimmer opt     204
Teensy 2.0 unoptimized    1120

Here are the modified files, if you'd like to drop them into hardware/teensy/avr/cores/teensy3 to give it a try.

LC is 10.9 times slower than T3.1?

Senior Member+

Well-known member

Well-known member

Well-known member

Well-known member

Attachments

Senior Member+

Well-known member

Senior Member+

Well-known member

Senior Member+

Well-known member

Well-known member

Senior Member+

Well-known member

Senior Member+

Senior Member+

Well-known member

Senior Member+

Well-known member

Senior Member+

Well-known member

Well-known member

Senior Member+

Well-known member

Well-known member

Attachments