Execution Speed of the Teensy 3.1

Frank B · Aug 16, 2015

The bug is still present with -D__FPU_PRESENT -mfloat-abi=hard -mfpu=fpv4-sp-d16,
and even on Intel!! Unbelievable... is this really a bug?

x86:

Code:

    .file    "bug.c"
    .section    .text.unlikely,"x"
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4,,15
    .globl    bug
    .def    bug;    .scl    2;    .type    32;    .endef
    .seh_proc    bug
bug:
    .seh_endprologue
    movl    $999, dummy(%rip)
    xorl    %eax, %eax
    .p2align 4,,10
.L2:
    pxor    %xmm0, %xmm0
    cvtsi2ss    %eax, %xmm0
    addl    $1, %eax
    cmpl    $1000, %eax
    jne    .L2
    movss    %xmm0, dummyfloat(%rip)
    movl    $1000, n(%rip)
    ret
    .seh_endproc
    .section    .text.unlikely,"x"
.LCOLDE0:
    .text
.LHOTE0:
    .comm    dummyfloat, 4, 2
    .comm    dummy, 4, 2
    .comm    n, 4, 2
    .ident    "GCC: (tdm64-1) 4.9.2"

GNU C (tdm64-1) version 4.9.2 (x86_64-w64-mingw32) compiled by GNU C version 4.9.2, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
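
The C source behind this listing is not shown in the thread. Here is a minimal sketch of what bug.c may have looked like, inferred only from the assembly above. The loop body and the function signature are assumptions, not the original file:

```c
/* Hypothetical reconstruction of bug.c, inferred from the x86 listing.
   The globals match the .comm symbols; the loop body is a guess. */
int n;
int dummy;
float dummyfloat;

void bug(void)
{
    for (n = 0; n < 1000; n++) {
        dummy = n;              /* listing: folded to one store of 999        */
        dummyfloat = (float)n;  /* listing: still a 1000-pass conversion loop */
    }
}
```

With source of this shape, the listing shows GCC reducing the integer assignment to a single `movl $999` while keeping the int-to-float conversion inside a real loop, which is the behaviour being questioned.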

stevech · Aug 16, 2015

For grins, I did some changes to the source to make it more hardware-focused.

Used an STM32F415 which is a low cost Cortex M4 with hardware floating point. Mikroe's mini-M4 DIP board uses it. I tested with a custom board.
Used the IAR compiler. ST-Link v2 SWD flash loader/debugger.
Ran CPU at 64MHz and at 168MHz.
I put all the code in one file... no separate .h.
Changed vars used inside compute loops to be static volatile to avoid optimizer omitting code.
Typedef'd fpType to either float or double, at my choice.
Changed all floats and doubles in the code to fpType, as above.
For all floating constants, I prefixed a cast to fpType.
Changed the elapsed-time computation to happen before the print, not within it, as that likely colors the time.
Changed to use printf instead of Serial.print
Trivial changes to eliminate dependencies on Arduino or Teensyduino libraries.
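
A minimal sketch of the changes listed above, assuming a hosted C environment. The helper name time_fp_add is hypothetical, and clock() only stands in for whatever timer the STM32 build actually used:

```c
#include <time.h>

typedef float fpType;          /* switch to double to compare precisions */

/* volatile forces a real load and store on every pass, so the
   optimizer cannot drop the loop. */
static volatile fpType acc;

/* Hypothetical helper: run `iters` FP additions and return the elapsed
   clock() ticks. The final value is handed back through `result`. */
long time_fp_add(int iters, fpType *result)
{
    clock_t start = clock();
    acc = (fpType)0.0;
    for (int i = 0; i < iters; i++) {
        acc = acc + (fpType)2.0;   /* constant cast to fpType, as in the post */
    }
    clock_t stop = clock();

    long elapsed = (long)(stop - start);  /* computed before any printing */
    *result = acc;
    return elapsed;
}
```

The caller would print `elapsed` only after the function returns, so the cost of printing never lands inside the measured interval.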

Code:

Compiler IAR EWARM 7.4 
All floating constants are prefixed with a cast as: (fpType)2.0. NOTE: This slowed some math but may help assessment of hardware not compiler strategy.
NOTE: declared static volatile the vars using in compute-loops.

CPU STM32F415 at 64MHz <<<
Optimization: SPEED/HIGH
TIMES in microseconds
Optimization: SPEED/HIGH
chosen floating point precision:float <<<
1000 iterations, FP Addition Time 314 uSec  << not optimized out
1000 iterations, FP Subtraction Time 314 uSec
1000 iterations, INTEGER Multiply Time 224 uSec
1000 iterations, FP Multiply Time 282 uSec
1000 iterations, FP Divide Time 486 uSec
1000 iterations, Sine Evaluation Time 31367 uSec
1000 iterations, Sin Table Lookup Time 2850 uSec
1000 iterations, Arctangent Time 28555 uSec
1000 iterations, Rational Approximation B Arctangent Time 1000 iterations, Rational Approximation C Arctangent Time 13138 uSec

CPU STM32F415 at168MHz <<<
Optimization: SPEED/HIGH
TIMES in microseconds
chosen floating point precision:float <<<
1000 iterations, FP Addition Time 120 uSec
1000 iterations, FP Subtraction Time 120 uSec
1000 iterations, INTEGER Multiply Time 86 uSec
1000 iterations, FP Multiply Time 108 uSec
1000 iterations, FP Divide Time 185 uSec
1000 iterations, Sine Evaluation Time 13153 uSec
1000 iterations, Sin Table Lookup Time 1081 uSec
1000 iterations, Arctangent Time 12042 uSec
1000 iterations, Rational Approximation B Arctangent Time 1000 iterations, Rational Approximation C Arctangent Time 4991 uSec

CPU STM32F415 at168MHz <<<
TIMES in microseconds
chosen floating point precision:double <<<<<<<<<
1000 iterations, FP Addition Time 697 uSec
1000 iterations, FP Subtraction Time 697 uSec
1000 iterations, INTEGER Multiply Time 85 uSec
1000 iterations, FP Multiply Time 564 uSec
1000 iterations, FP Divide Time 983 uSec
1000 iterations, Sine Evaluation Time 13738 uSec
1000 iterations, Sin Table Lookup Time 9629 uSec
1000 iterations, Arctangent Time 12534 uSec
1000 iterations, Rational Approximation B Arctangent Time 1000 iterations, Rational Approximation C Arctangent Time 11066 uSec

the modified test code

mlu · Aug 16, 2015

Frank B said:
The bug is still present with -D__FPU_PRESENT -mfloat-abi=hard -mfpu=fpv4-sp-d16,
and even on Intel!! Unbelievable... is this really a bug?

This is not the ARM code generator and optimizer! And which optimization level?

Frank B · Aug 16, 2015

-O2 or -O3

Let's see what happens... i filed it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67242

Frank B · Aug 16, 2015

stevech said:
Changed vars used inside compute loops to be static volatile to avoid optimizer omitting code.

Could you try it without that, and with enabled optimizations ? I am very curious.
..and can we use the IAR compiler with Teensyduino?

stevech · Aug 16, 2015

Frank B said:
Could you try it without that, and with enabled optimizations ? I am very curious.
..and can we use the IAR compiler with Teensyduino?

IAR... yes, but...
1) the free version is code size limited to 32KB on ARM.
2) It's normally used with its IDE (w/ SWD-JTAG debugging). It uses conventional logic for C/C++ compiling with headers, function prototypes, etc. I have no idea what it takes to use it with Arduino's oddball concepts. I'd use GCC w/ Visual Micro if Windows is OK as a platform. I use IAR professionally and own several costly licenses.

Frank B · Aug 16, 2015

I bet 50% of the user "sketches" are smaller than 32 KB...

stevech · Aug 16, 2015

Frank B said:
Could you try it without that, and with enabled optimizations ? I am very curious.

Really? Most of the for loops and the code in them would be tossed out. When I did so, many of the loop times went to 1 uSec (rounded up).

Frank B · Aug 17, 2015

The bug was confirmed.
It's in the tree-optimization "This is SCEV-const-prop not handling floats"
So - I think - all targets are affected.

mlu said:
So we better write good code and not overly rely on the optimizer.

True. The best optimizer is between the ears.

stevech said:
Really? Most of the for loops and the code in them would be tossed out. When I did so, many of the loop times went to 1 uSec (rounded up).

I was just curious how IAR handles this.

stevech · Aug 17, 2015

@Frank B.
Same as this, circa 2013?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58417

Here's output from UART, with IAR EWARM for target=STM32F415.
Optimizer = "high" and "speed".

sum = 0
sum = 0
sum = 1
sum = 3
sum = 6
Final Result: 10

Note the %lld format string in printf, for long long (%llu is the unsigned form). Use %d and get junk.

Code:

#include <stdio.h>

long long arr[6] = {0, 1, 2, 3, 4, 5};

/* Prints each running sum, then returns the final value. */
int SCEV_const_check(void)
{
    int n = 5;
    long long sum = 0, prevsum = 0;

    for (int i = 1; i <= n; i++)
    {
        printf("sum = %lld\n", sum);
        sum = (i - 1) * arr[i] - prevsum;
        prevsum += arr[i];
    }
    printf("Final Result: %lld\n", sum);
    return (int)sum;
}

Frank B · Aug 17, 2015

stevech said:
@Frank B.
Same as this, circa 2013?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58417

Hm, no.
That one was wrong code generation, and it is fixed.

"Our" bug is only a missed scalar optimization (and only for floats). Nothing dramatic: the generated code is correct, just not optimal.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67242

Frank B · Aug 17, 2015

In 2009 I reported a more serious bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41894

Jake · Aug 17, 2015

Thanks to all for all the great information.

I took a look to try to determine how to change the optimization level using the Arduino IDE with the Teensy, after I saw many people playing with it and found no obvious menu item. I ended up changing the boards.txt file in the .../arduino1.6.5/hardware/teensy/avr directory by adding the lines marked "added" below:

Code:

# added: -O2 and -O3 menu entries
teensy31.menu.speed.96optO3=96 MHz -O3 optimized (overclock)
teensy31.menu.speed.96optO2=96 MHz -O2 optimized (overclock)
# existing entries
teensy31.menu.speed.96opt=96 MHz optimized (overclock)
teensy31.menu.speed.72opt=72 MHz optimized
teensy31.menu.speed.48opt=48 MHz optimized
teensy31.menu.speed.24opt=24 MHz optimized
teensy31.menu.speed.96=96 MHz (overclock)
teensy31.menu.speed.72=72 MHz
teensy31.menu.speed.48=48 MHz
teensy31.menu.speed.24=24 MHz
teensy31.menu.speed.16=16 MHz (No USB)
teensy31.menu.speed.8=8 MHz (No USB)
teensy31.menu.speed.4=4 MHz (No USB)
teensy31.menu.speed.2=2 MHz (No USB)

# added: build settings for the two new menu entries
teensy31.menu.speed.96optO3.build.fcpu=96000000
teensy31.menu.speed.96optO3.build.flags.optimize=-O3
teensy31.menu.speed.96optO3.build.flags.ldspecs=
teensy31.menu.speed.96optO2.build.fcpu=96000000
teensy31.menu.speed.96optO2.build.flags.optimize=-O2
teensy31.menu.speed.96optO2.build.flags.ldspecs=
# existing entries
teensy31.menu.speed.168opt.build.fcpu=168000000
teensy31.menu.speed.168opt.build.flags.optimize=-O
teensy31.menu.speed.168opt.build.flags.ldspecs=
teensy31.menu.speed.144opt.build.fcpu=144000000

I am curious if that is the normal way to do this.

I modified my code as per the suggestions and reran at varying levels of optimization. Here are the results:

96 MHz optimized (overclock)        -O     -O2     -O3
Starting Timer                      10      10      10
Addition Time                       32       1       1
Subtraction Time                    32       1       1
Integer Multiply Time               32       1       0
float Multiply Time               1357    1332    1292
float Divide Time                 2480    2732    2609
Sine Evaluation Time             27837   27959   27960
Sin Table Lookup Time            33249   32809   32907
Arctangent Time                  53537   53596   53604
B Arctangent Time                20770   20819   20726
C Arctangent Time                24616   24781   24315
Polynomial                        6231    6319    6409
Polynomial optimized              5072    4974    5104
n/2.0/PI                          9645    9686    9686
n/(2.0*PI)                        8726    8893    8891
Bhaskara approx for sin          19024   19136   19177

So what have I learned?

1. This community is very helpful, and can find bugs I would not know how to look for.
2. The -O2 and -O3 optimization levels will optimize very simple statements, but more complex routines tend to run slower than with -O.
3. Table lookups are hopelessly slow, at least when using floats.
4. If you can take a hit on accuracy, rational approximations speed up transcendental functions.
5. The optimization that goes on between your ears does far better than the compiler optimizations.
6. The compiler will not even move a constant in the denominator to its reciprocal in the numerator to change a divide to a multiply.
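
On point 6: multiplying by a precomputed reciprocal can round differently from a true divide, so GCC normally makes that substitution on its own only under -freciprocal-math (which -ffast-math turns on) or when the reciprocal is exact. Doing it by hand might look like this sketch; the function and constant names are hypothetical, not taken from the attached code:

```c
#define TWO_PI 6.283185307179586

/* Reciprocal computed once, at compile time. */
static const float INV_TWO_PI = (float)(1.0 / TWO_PI);

/* Straightforward form: one float divide per call. */
float turns_divide(float n)
{
    return n / (float)TWO_PI;
}

/* Hand-optimized form: one float multiply per call.
   May differ from the divide in the last bit or so. */
float turns_multiply(float n)
{
    return n * INV_TWO_PI;
}
```

The two functions agree to within float rounding, so the multiply form is only a safe swap where a last-bit difference does not matter.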

Modified code attached
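
The attachment is not reproduced in this thread. For readers who have not met the "Bhaskara approx for sin" row in the table, here is a minimal sketch of the standard Bhaskara I formula, which is a rational approximation in the sense of point 4. It is not necessarily the same as the attached implementation, and the names are hypothetical:

```c
#define PI_F 3.14159265f

/* Bhaskara I's approximation to sin(x), intended for 0 <= x <= pi
   (radians):
       sin(x) ~= 16 x (pi - x) / (5 pi^2 - 4 x (pi - x))
   One divide and a few multiplies, no library call. */
float bhaskara_sinf(float x)
{
    float p = x * (PI_F - x);
    return (16.0f * p) / (5.0f * PI_F * PI_F - 4.0f * p);
}
```

The formula is exact at 0, pi/6, pi/2, 5pi/6 and pi, and is only approximate in between, which is the accuracy hit mentioned in point 4.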

doughboy · Aug 17, 2015

There is a thread in arduino forum about fast sin lookup table
http://forum.arduino.cc/index.php?topic=69723.0
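
The linked thread is not reproduced here. As one illustration of how a table lookup might avoid float work entirely, here is a minimal sketch that indexes a quarter-wave table with an integer phase. The 32-steps-per-circle resolution and the name sin_lut are assumptions made for this example, not taken from either thread:

```c
/* sin(k * pi/16) for k = 0..8, i.e. the first quarter wave. */
static const float quarter_wave[9] = {
    0.000000f, 0.195090f, 0.382683f, 0.555570f, 0.707107f,
    0.831470f, 0.923880f, 0.980785f, 1.000000f
};

/* phase counts 1/32ths of a full circle and wraps automatically.
   Symmetry rebuilds the other three quadrants from the one table. */
float sin_lut(unsigned phase)
{
    unsigned quadrant, idx;

    phase &= 31u;              /* wrap to one full circle      */
    quadrant = phase >> 3;     /* 0..3                         */
    idx = phase & 7u;          /* position inside the quadrant */

    switch (quadrant) {
    case 0:  return  quarter_wave[idx];
    case 1:  return  quarter_wave[8u - idx];
    case 2:  return -quarter_wave[idx];
    default: return -quarter_wave[8u - idx];
    }
}
```

Keeping the phase as an integer means the lookup needs only masks and shifts; converting a float angle to an index on every call is extra work on a core without an FPU.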

Frank B · Aug 17, 2015

Code:

 serial.print...
 time1 = micros();

I'm not 100% sure, and I'm not familiar with the USB code, but I think the USB transfer (Serial.print) might happen after
time1 = micros(); and during the following loop...

Well, IF there is a real loop.
Your "for(n=0; n<1000; n++) dummy = 5 - n;" compiles to nothing

(with -O2, and maybe -Os; I don't know)
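
A minimal sketch of why that loop can vanish, assuming n and dummy are ordinary (non-volatile) globals, which the thread does not show. Only the final values are observable, so the optimizer is free to replace the loop with two plain stores. The function names are hypothetical:

```c
int n;
int dummy;

/* The loop as written in the sketch. */
void loop_as_written(void)
{
    for (n = 0; n < 1000; n++)
        dummy = 5 - n;
}

/* What an optimizer may legally reduce it to: the last value
   written to dummy (at n == 999) and the exit value of n. */
void loop_as_optimized(void)
{
    dummy = 5 - 999;
    n = 1000;
}
```

Both functions leave the globals in the same state, so a timing taken around the first one may be measuring two stores rather than 1000 subtractions.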

stevech · Aug 17, 2015

yeah, in my code (above) I calculated elapsed before calling printf.
In embedded, I don't use iostreams and cout and all that sort of stuff that's heap-oriented.

Jake · Aug 22, 2015

Code:

  Serial.println(time);
  time = 0;
  Serial.println(time);

I tried the above code and the second Print statement mostly came to zero and seldom to 1.

Thanks for the link to the arduino forum.

jonr · Aug 23, 2015

If you really need speed, fixed point will usually work (even for fractional numbers). I used to work on large programs with lots of complex math - and no floating point anywhere. It also used table lookups for most things that you would think of as a math function.

https://en.wikipedia.org/wiki/Binary_scaling
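
As a small illustration of binary scaling, here is a minimal sketch of Q16.16 fixed point (16 integer bits, 16 fractional bits). The type and function names are hypothetical and not taken from any particular library:

```c
#include <stdint.h>

typedef int32_t q16_16;      /* value = raw / 65536 */
#define Q16_ONE 65536

/* Convert a whole number to Q16.16. */
q16_16 q16_from_int(int v)
{
    return (q16_16)(v * Q16_ONE);
}

/* Multiply: the 64-bit product carries 32 fractional bits, so shift
   16 of them away. Right-shifting a negative value is
   implementation-defined in C; common compilers shift arithmetically. */
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* Divide: pre-scale the numerator so the quotient keeps 16 fractional bits. */
q16_16 q16_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * Q16_ONE) / b);
}
```

For example, 1.5 is stored as 98304 and 2.25 as 147456; their q16_mul product is 221184, which is 3.375 in this format.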

stevech · Aug 23, 2015

Fixed point math... there are libraries for that. It's simple in concept but you have to use your short term memory a lot. And repeatedly visualize the string of bits and where the implicit decimal point falls.

It's a dying art. Many newbies have never heard of it.
It was a staple when CPU cycles cost a lot, and h/w floating point was rare and $$.
The T3's ARM family member has no hardware floating point; it's just below the cut line in the ARM family.

Frank B · Aug 23, 2015

The next generation Teensy will have hw floating point.

MichaelMeissner · Aug 23, 2015

Frank B said:
The next generation Teensy will have hw floating point.

Single precision floating point only. So you have to make sure the compiler does not automatically promote expressions to double, which is still emulated. This means always using the 'f' suffix on floating point constants, using the float data type instead of double, and using the single precision version of the math library functions (i.e. sinf instead of sin, etc.).
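
A minimal sketch of the promotion rules described above, using sizeof to show which expressions the compiler types as double. The function names are hypothetical; sizeof does not evaluate its operand, so no math library call is actually made here:

```c
#include <math.h>
#include <stddef.h>

/* float * double literal: the whole expression is promoted to double. */
size_t size_with_double_literal(float x) { return sizeof(x * 2.0); }

/* float * float literal: the expression stays float. */
size_t size_with_float_literal(float x)  { return sizeof(x * 2.0f); }

/* sin() takes and returns double; sinf() takes and returns float. */
size_t size_of_sin(float x)  { return sizeof(sin(x)); }
size_t size_of_sinf(float x) { return sizeof(sinf(x)); }

/* All-float arithmetic: nothing here needs double emulation. */
float circumference_f(float r)
{
    return 2.0f * 3.14159265f * r;
}
```

On a part with a single-precision FPU, only the float-typed expressions can run in hardware; the double-typed ones fall back to software emulation.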

Frank B · Aug 23, 2015

10 useful tips to using the floating point unit on the ARM® Cortex®-M4 processor

stevech · Aug 23, 2015

Often, a variable is declared as type double (64 bit floating point). The hardware floating point in some ARMs and others is most often 32 bit "single" precision. Perhaps the habit of "double" is from PCs?

Remembering the early Intel Pentium chips' infamous FDIV "flaw"... this is fun:
THE TOP TEN REASONS TO BUY A PENTIUM MACHINE
============================================

10. YOUR CURRENT COMPUTER IS TOO ACCURATE
9. YOU WANT TO GET INTO THE GUINNESS BOOK AS "OWNER OF MOST EXPENSIVE PAPERWEIGHT"
8. MATH ERRORS ADD ZEST TO LIFE
7. YOU NEED AN ALIBI FOR THE I.R.S.
6. YOU WANT TO SEE WHAT ALL THE FUSS IS ABOUT
5. YOU'VE ALWAYS WONDERED WHAT IT WOULD BE LIKE TO BE A PLAINTIFF
4. THE "INTEL INSIDE" LOGO MATCHES YOUR DECOR PERFECTLY
3. YOU NO LONGER HAVE TO WORRY ABOUT CPU OVERHEATING
2. YOU GOT A GREAT DEAL FROM JPL
1. IT'LL PROBABLY WORK

PaulStoffregen · Aug 24, 2015

MichaelMeissner said:
This means always using the 'f' suffix on floating point constants,

Isn't there a gcc command line arg that causes floating point literals to default to single precision? I recall seeing info about it some time ago, but I didn't save a link.

I've been considering adding that flag to Teensy's platform.txt file on Arduino 1.6.x, partly in preparation for when we get a single precision FPU, and partly because nearly all Arduino sketches are designed with the intention of single precision floating point math.

Frank B · Aug 24, 2015

"-fsingle-precision-constant"

"-Wdouble-promotion" might be interesting too ?

Out of curiosity, is it possible to disable doubles completely?
