How to be sure I'm using the FPU in T3.5/3.6

ninja2 · Dec 3, 2016

What specific coding (if any) must I use to ensure I'm using the single-precision FPU in T3.5/3.6 ?

For example I saw a post where Paul (I think) mentioned the FPU requires use of sqrtf.

Ideally a summary of all that's required ...

TIA

MichaelMeissner · Dec 3, 2016

Use the type 'float' and not 'double'. The Teensy 3.5/3.6 FPU only supports single-precision floating point; 'double' arithmetic is done in software.

If you are using math functions, use the function that takes 'float' arguments (usually this function has a 'f' suffix, such as 'sqrtf', 'sinf', 'expf', ...).

If you ever plan to run your code on other platforms, get into the habit of using an 'f' suffix on all floating point constants. On the Teensy, the boards.txt file adds the '-fsingle-precision-constant' option to treat all constants as single precision, which other platforms might not do. Of course, there are people who try to use 'double' and get frustrated that their constants are converted to single precision and lose precision (those people can use the 'l' or 'L' suffix to create a long double constant, which on ARM platforms has the same representation as double).

If you want to make extra sure, turn on the verbose compilation option and you should see the options '-mfloat-abi=hard' and '-mfpu=fpv4-sp-d16' which tell GCC to use single precision hardware floating point.
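
As a rough illustration of the points above (not taken from any Teensy library), a helper that stays in single precision from start to finish might look like the sketch below. The name rms2 is made up for this example. Every variable is 'float', the constant carries an 'f' suffix, and the root is taken with 'sqrtf'.

```cpp
#include <math.h>

// Hypothetical helper: root-mean-square of two values, kept in single precision.
float rms2(float a, float b) {
    // 0.5f is a float constant on every platform. An unsuffixed 0.5 is a double
    // in standard C/C++, which would promote the whole expression to double
    // on toolchains that do not pass -fsingle-precision-constant.
    float mean_square = 0.5f * (a * a + b * b);
    return sqrtf(mean_square);   // sqrtf takes and returns float
}
```

Calling rms2(3.0f, 3.0f) returns 3.0f. Writing sqrt(0.5 * (a * a + b * b)) instead would pull in double-precision arithmetic on a platform without the single-precision-constant option.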

ninja2 · Dec 4, 2016

great info, thanks

So if I use sqrtL and the like on a T3.5/3.6 does the program completely ignore the FPU when running?
(for clarity only - I don't need double maths)

Frank B · Dec 4, 2016

ninja2 said:
great info, thanks

So if I use sqrtL and the like on a T3.5/3.6 does the program completely ignore the FPU when running?
(for clarity only - I don't need double maths)

yes........

chipaudette · Dec 4, 2016

ninja2 said:
great info, thanks

So if I use sqrtL and the like on a T3.5/3.6 does the program completely ignore the FPU when running?
(for clarity only - I don't need double maths)

If you want to go even faster, there's a DSP-accelerated sqrt function that is available to the T3.5/3.6. You can see its docs here:

https://www.keil.com/pack/doc/CMSIS/DSP/html/group__SQRT.html

If you want to use it, you have to include the ARM Math library (#include <arm_math.h>). Then, you call the ARM sqrt function, arm_sqrt_f32, using something like:

float32_t input = 4.67992f;
float32_t output;
arm_sqrt_f32(input, &output);

This super-fast sqrt call will work on any ARM chip with an FPU...whether it's a Teensy or an STM or whatever. While it's not as portable as pure C or a standard C library (like sqrtf()), there are quite a few ARM chips out there, so it's a useful trick to know.

Chip

ninja2 · Dec 4, 2016

well that's interesting as my current project uses lots of trig calculations, which I've derived from dot and cross products. I see these are all supported by that Cortex Microcontroller Software Interface Standard (CMSIS).

Portability is not a major concern, so I would like to give this a try.

If I #include <arm_math.h> will that bring the whole of that CMSIS-DSP library into play?

PS: The sine function is offered as 3 options: arm_sin_f32, arm_sin_q15 or arm_sin_q31
I searched unsuccessfully for some info on q15 and q31.
Are they formats? when should they be used?

DD4WH · Dec 5, 2016

as far as I know:

q15 = int16_t
q31 = int32_t
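
To expand on that: as far as I understand the CMSIS-DSP docs, q15 and q31 are fixed-point fractional formats stored in those integer types. A Q15 value has 1 sign bit and 15 fraction bits, so the stored integer is the real value times 2^15 and the representable range is roughly [-1, 1). They are mainly of interest for integer sample data or for chips without an FPU. The two helpers below are hypothetical and only illustrate the scaling; CMSIS-DSP has its own conversion routines (for example arm_float_to_q15), whose rounding may differ from this sketch.

```cpp
#include <math.h>
#include <stdint.h>

// Hypothetical helper: convert a float in [-1, 1) to Q15, saturating at the ends.
int16_t float_to_q15(float x) {
    float scaled = x * 32768.0f;              // 2^15 steps per unit
    if (scaled >= 32767.0f) return 32767;     // largest Q15 value (just below +1.0)
    if (scaled <= -32768.0f) return -32768;   // exactly -1.0
    return static_cast<int16_t>(lrintf(scaled));  // round to nearest
}

// Hypothetical helper: convert a Q15 value back to float.
float q15_to_float(int16_t q) {
    return static_cast<float>(q) / 32768.0f;
}
```

With this scaling 0.5f maps to 16384, and +1.0f cannot be represented, so it saturates to 32767.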

for Pythagoras calculation, you may also consider this CMSIS function:

arm_cmplx_mag_f32

which calculates sqrtf(I*I + Q*Q) for a whole vector with interleaved I and Q values {I0,Q0,I1,Q1,I2,Q2, ...}, fast and accurate. I use it as an AM demodulator in realtime. Also available in q15 and q31 versions.

ninja2 · Dec 5, 2016

so I wrote a little test sketch to compare sqrtf with arm_sqrt_f32 but the arm_math version is slower?

maybe the arm_math doesn't use the FPU?

maybe it's just my sketch. It simply times 100 sqrt calculations using both methods:

Code:

#define CJ_ID "ARM_DSP.a"

/* test performance of arm_math.h/CMSIS-DSP functions
 * see https://www.keil.com/pack/doc/CMSIS/DSP/html/group__SQRT.html 
 */

#define LED LED_BUILTIN

#include <Streaming.h>
#include <arm_math.h>

elapsedMillis tms; // ms timer
elapsedMicros tus; // μs timer

float32_t x,y;
float32_t xroot;

void setup(){
  Serial.begin(115200);
  while (!Serial && (millis() <= 4000)){
    digitalWriteFast(LED,!digitalReadFast(LED));
    delay(50);}
  Serial << F("\n######## ") << CJ_ID << F(" ########\n"); 
  Serial << " * Serial open, millis: " << millis() << '\n';
  tms = 0;
  //x = 4.67992; // sqrt = 2.163312275;
  //tus = 0;
  //arm_sqrt_f32(x, &xroot); 
  //Serial << "time: " << tus << "us" << '\n';
  //Serial << "sqrt(" << x << ") = " << _FLOAT(xroot,10) << '\n';
  float x1 = 0.1307;
  float delta = 0.1307;
  Serial << "--------\n";
  Serial << "do 100 x square root calculations ..." << '\n';
  x = x1;
  tus = 0;
  for (int i=0; i<100; i++){
    arm_sqrt_f32(x, &xroot); 
    x = x + delta;
    }
  Serial << "arm_math time: " << tus << "us" << '\n';
  Serial << "last sqrt(" << x << ") = " << _FLOAT(xroot,10) << '\n';
  Serial << "--------\n";
  x = x1;
  tus = 0;
  for (int i=0; i<100; i++){
    y = sqrtf(x); 
    x = x + delta;
    }
  Serial << "sqrtf time: " << tus << "us" << '\n';
  Serial << "last sqrt(" << x << ") = " << _FLOAT(y,10) << '\n';

  
  Serial << "---- end setup [" << tms << "ms " << tus << "us] ----" << '\n';
  }

void loop() {
  }

Here's the serial output:

######## ARM_DSP.a ########
* Serial open, millis: 850
--------
do 100 x square root calculations ...
arm_math time: 35us
last sqrt(13.20) = 3.6152467728
--------
sqrtf time: 32us
last sqrt(13.20) = 3.6152467728
---- end setup [0ms 124us] ----

DD4WH · Dec 5, 2016

I don't think it's your sketch that is causing the problems.

You need to do three things to use the new CMSIS lib and get it running with FPU:

1.) see Duff's description of how to use the new CMSIS lib for Teensyduino --> HERE and
2.) additionally add one file --> HERE
3.) AND set #define __FPU_USED 1 in arm_math.h

Then it should run fast ;-).

ninja2 · Dec 5, 2016

thanks @DD4WH
I had a look at duff's stuff, but I need a little guidance.
When I open arm_math.h in C:\Program Files (x86)\Arduino\hardware\teensy\avr\cores\teensy3 it's not exactly the same as the CMSIS arm_math.h on GitHub.

Do I need to load the whole CMSIS library somehow? or just copy the four files listed by duff + libarm_cortexM4lf_math.a into
C:\Program Files (x86)\Arduino\hardware\teensy\avr\cores\teensy3 and then do the edits?

TIA

I'm using IDE 1.6.12 and latest teensyduino (not beta)

PS: happy to move this query to that audio projects thread if appropriate

DD4WH · Dec 5, 2016

Hmm, I did not check whether the files are the same. But you replace it with the new version anyway, so why worry about it ;-)?

The built-in CMSIS version in Teensyduino is from 2011, so it is too old. The links in Duff's message point to version 4.5.1, I think. That's perfect; we don't need the latest version 5.0 yet (but if someone has good arguments for using it, go ahead).

ninja2 said:
Do I need to load the whole CMSIS library somehow? or just copy the four files listed by duff + libarm_cortexM4lf_math.a into
C:\Program Files (x86)\Arduino\hardware\teensy\avr\cores\teensy3 and then do the edits?

No, just copy the four files as described by Duff.
Then copy libarm_cortexM4lf_math.a as described in the thread.
Then do the edits.

Include this in your sketch:

Code:

#include <arm_math.h>
#include <arm_const_structs.h>

And start to calculate fast ;-).

WMXZ · Dec 5, 2016

ninja2 said:
maybe the arm_math doesn't use the FPU?

The arm_sqrt_f32 definition in Paul's arm_math.h contains

Code:

  __STATIC_INLINE arm_status arm_sqrt_f32(
  float32_t in,
  float32_t * pOut)
  {
    if(in > 0)
    {

//    #if __FPU_USED
    #if (__FPU_USED == 1) && defined ( __CC_ARM   )
        *pOut = __sqrtf(in);
    #elif (__FPU_USED == 1) && defined ( __TMS_740 )
        *pOut = __builtin_sqrtf(in);
    #else
        *pOut = sqrtf(in);
    #endif

      return (ARM_MATH_SUCCESS);
    }
    else
    {
      *pOut = 0.0f;
      return (ARM_MATH_ARGUMENT_ERROR);
    }

  }

and the CMSIS V4.5 version says

Code:

static __INLINE arm_status arm_sqrt_f32(
  float32_t in,
  float32_t * pOut)
  {
    if(in >= 0.0f)
    {

#if   (__FPU_USED == 1) && defined ( __CC_ARM   )
      *pOut = __sqrtf(in);
#elif (__FPU_USED == 1) && (defined(__ARMCC_VERSION) && (__ARMCC_VERSION >= 6010050))
      *pOut = __builtin_sqrtf(in);
#elif (__FPU_USED == 1) && defined(__GNUC__)
      *pOut = __builtin_sqrtf(in);
#elif (__FPU_USED == 1) && defined ( __ICCARM__ ) && (__VER__ >= 6040000)
      __ASM("VSQRT.F32 %0,%1" : "=t"(*pOut) : "t"(in));
#else
      *pOut = sqrtf(in);
#endif

      return (ARM_MATH_SUCCESS);
    }
    else
    {
      *pOut = 0.0f;
      return (ARM_MATH_ARGUMENT_ERROR);
    }
  }

So CMSIS handles more compilers, but in the end the same functions are called, which may be no different from calling "sqrtf" directly.

MichaelMeissner · Dec 5, 2016

The ARM Cortex M4F microprocessors have a square root instruction for single precision floating point, but the GCC compiler typically does not use the instruction for sqrtf unless you use the -ffast-math option (or some of the more specific fast math options). This is because the ISO C specification says that the global error variable (errno) may be set if the input value is out of bounds. If you don't need the error checking, you can use the '__builtin_sqrtf' function instead, and it will generate the direct instruction (note there are 2 leading underscores).

The only other math function that has a direct ARM Cortex M4F instruction is floating absolute value ('fabsf' or '__builtin_fabsf'). Unlike 'sqrtf', 'fabsf' does not raise a domain error, so calling either function should generate the same code.
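
If the errno reporting is not wanted, one option is to do the domain check by hand and call the builtin afterwards, which is also the pattern arm_sqrt_f32 uses in the listing above. The following is only a sketch: checked_sqrtf is a made-up name, and whether the compiler really emits the bare square root instruction still depends on the GCC version and the optimization flags in use.

```cpp
// Hypothetical helper (not a CMSIS or Teensy API): reject inputs outside the
// domain explicitly, then call the GCC/Clang builtin for the root itself.
static inline bool checked_sqrtf(float in, float *out) {
    if (in >= 0.0f) {
        *out = __builtin_sqrtf(in);   // note the 2 leading underscores
        return true;
    }
    *out = 0.0f;   // same fallback value arm_sqrt_f32 writes for a bad input
    return false;  // negative (or NaN) input: report it via the return value
}
```

The caller checks the return value instead of errno, for example: if (!checked_sqrtf(x, &root)) { /* handle bad input */ }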

DD4WH · Dec 5, 2016

@Walter, @Michael: it seems you are right, and I was wrong in thinking that arm_sqrt_f32 would be faster than sqrtf!

This is the output of the modified version of ninja2's sketch:

Code:

 STARTING TEST
 ARM math time:  microseconds = 2171 us
 sqrtf time:   microseconds = 1783 us
 __builtin_sqrtf time:   microseconds = 1784 us
 sqrt time:   microseconds = 60846 us

It is very astonishing for me to see sqrtf calculate significantly faster than arm_sqrt_f32 !

And it is very clear you should not forget to put the "f" behind the "sqrt" ;-).

This is the code I used on a Teensy 3.6 (by ninja2, modified)

Code:

/* test performance of arm_math.h/CMSIS-DSP functions
 * see https://www.keil.com/pack/doc/CMSIS/DSP/html/group__SQRT.html */

#define LED LED_BUILTIN

//#include <Streaming.h>
#include <arm_math.h>

elapsedMillis tms; // ms timer
elapsedMicros tus; // μs timer

float32_t x,y;
float32_t xroot;

void setup(){
  Serial.begin(115200);
  while (!Serial && (millis() <= 4000)){
    digitalWriteFast(LED,!digitalReadFast(LED));
    delay(5000);}
Serial.println(" STARTING TEST");
//Serial.println("millis = "); Serial.print(millis);
   tms = 0;
  //x = 4.67992; // sqrt = 2.163312275;
  //tus = 0;
  //arm_sqrt_f32(x, &xroot); 
  //Serial << "time: " << tus << "us" << '\n';
  //Serial << "sqrt(" << x << ") = " << _FLOAT(xroot,10) << '\n';
  float x1 = 0.1307;
  float delta = 0.1307;
  x = x1;
  tus = 0;
  for (int i=0; i<10000; i++){
    arm_sqrt_f32(x, &xroot); 
    x = x + delta;
    }
Serial.print(" ARM math time:  ");
Serial.print("microseconds = "); Serial.print(tus); Serial.println(" us");
//  Serial << "arm_math time: " << tus << "us" << '\n';
//  Serial << "sqrt(" << x << ") = " << _FLOAT(xroot,10) << '\n';
//  Serial << "--------\n";
  x = x1;
  tms = 0;
  tus = 0;
  for (int i=0; i<10000; i++){
    y = sqrtf(x); 
    x = x + delta;
    }
Serial.print(" sqrtf time:   ");
Serial.print("microseconds = "); Serial.print(tus); Serial.println(" us");
 
//  Serial << "sqrtf time: " << tus << "us" << '\n';
//  Serial << "sqrt(" << x << ") = " << _FLOAT(y,10) << '\n';
//  Serial << "---- end setup [" << tms << "ms " << tus << "us] ----" << '\n';
  x = x1;
  tms = 0;
  tus = 0;
  for (int i=0; i<10000; i++){
    y = __builtin_sqrtf(x); 
    x = x + delta;
    }
Serial.print(" __builtin_sqrtf time:   ");
Serial.print("microseconds = "); Serial.print(tus); Serial.println(" us");

  x = x1;
  tms = 0;
  tus = 0;
  for (int i=0; i<10000; i++){
    y = sqrt(x); 
    x = x + delta;
    }
Serial.print(" sqrt time:   ");
Serial.print("microseconds = "); Serial.print(tus); Serial.println(" us");
 }

void loop() {
  }

Have fun with the Teensy! All the best,

Frank

DD4WH · Dec 5, 2016

and also tried the difference between:

Code:

      for(i = 0; i < FFT_length; i++)
      {
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  
      }

AND

Code:

     arm_cmplx_mag_f32(FFT_buffer, FFT_magn, FFT_length);  // calculates sqrt(I*I + Q*Q) for each frequency bin of the FFT

by using it in the Teensy Convolution SDR.

arm_cmplx_mag_f32 is faster (43.3% vs. 45.5% MCU usage).

So, it seems the multiplications are faster in the arm functions, but sqrtf is faster on pure square roots!?

WMXZ · Dec 5, 2016

DD4WH said:

and also tried the difference between:

Code:

      for(i = 0; i < FFT_length; i++)
      {
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  
      }

If you use

Code:

      for(i = 0; i < FFT_length; )
      {
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  i++;
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  i++;  
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  i++;
          FFT_magn[i] = sqrtf(FFT_buffer[(i*2)] * FFT_buffer[(i*2)] + FFT_buffer[(i*2) + 1] * FFT_buffer[(i*2) + 1]);  i++;
      }

(Assuming FFT_length is a multiple of 4.)

You should get similar performance to the CMSIS routine.

Have a look into the routine; this is what they are doing. Obviously, if the number of samples is not a multiple of 4, the unrolling is not as simple as shown here, but again the CMSIS code shows you how to do it.

There are slight differences in how the loop is carried out, but the compiler will optimize.
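
For lengths that are not a multiple of 4, one common way to handle it is an unrolled main loop followed by a short clean-up loop for the leftover samples. The sketch below is written in plain C++ so it can be tried off-target; it is not the CMSIS source, and the names mag_iq and cmplx_mag_unrolled are made up for illustration.

```cpp
#include <math.h>
#include <stddef.h>

// Magnitude of one interleaved {I, Q} pair.
static inline float mag_iq(const float *iq) {
    return sqrtf(iq[0] * iq[0] + iq[1] * iq[1]);
}

// Hypothetical helper: src holds n interleaved pairs {I0,Q0,I1,Q1,...},
// dst receives n magnitudes. The main loop handles 4 samples per pass.
void cmplx_mag_unrolled(const float *src, float *dst, size_t n) {
    size_t i = 0;
    for (size_t blocks = n / 4; blocks > 0; blocks--) {
        dst[i] = mag_iq(&src[2 * i]); i++;
        dst[i] = mag_iq(&src[2 * i]); i++;
        dst[i] = mag_iq(&src[2 * i]); i++;
        dst[i] = mag_iq(&src[2 * i]); i++;
    }
    // Clean-up loop: the 0 to 3 samples left over when n is not a multiple of 4.
    for (; i < n; i++) {
        dst[i] = mag_iq(&src[2 * i]);
    }
}
```

Whether this beats the simple loop on a given board is something to measure; the compiler may already unroll the simple version at higher optimization levels.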

ninja2 · Dec 5, 2016

even if no real advantage over sqrtf, I gather CMSIS should offer superior performance for cosf, sinf, dot / cross products and the like. That's what I need ...
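
For comparison purposes, a plain single-precision baseline for the dot / cross product trig could look like the sketch below. It assumes 3-element vectors and uses only standard float math functions; Vec3, dot3, cross3 and angle_between are made-up names, not CMSIS calls. As far as I know CMSIS-DSP offers a vector dot product (arm_dot_prod_f32), so a baseline like this gives something to time it against.

```cpp
#include <math.h>

// Hypothetical 3-D vector type for this example.
struct Vec3 { float x, y, z; };

float dot3(const Vec3 &a, const Vec3 &b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 cross3(const Vec3 &a, const Vec3 &b) {
    return Vec3{ a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x };
}

// Angle between a and b in radians: atan2(|a x b|, a . b).
// Only float functions (sqrtf, atan2f) are used, so nothing is promoted to double.
float angle_between(const Vec3 &a, const Vec3 &b) {
    Vec3 c = cross3(a, b);
    return atan2f(sqrtf(dot3(c, c)), dot3(a, b));
}
```

For two perpendicular unit vectors this returns pi/2 to within float precision, and for identical vectors it returns 0.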

MichaelMeissner · Dec 5, 2016

Yeah, square root is a special case, because once you have the support within the machine to do a floating point divide, square root is pretty easy, so many machines provide both square root and divide instructions.

Except for the x86's, few machines have support for the trig functions. And even in the case of the x86, it turns out that the hardware instruction is pretty slow, and it isn't as accurate, so most times the library implements these in software and ignores the hardware implementation.

tni · Dec 5, 2016

MichaelMeissner said:
The ARM Cortex M4F microprocessors have a square root instruction for single precision floating point, but the GCC compiler typically does not use the instruction for sqrtf unless you use the -ffast-math option (or some of the more specific fast math options). This is because the ISO C specification says that the global error variable (errno) may be set if the input value is out of bounds.

The magic compiler flag you need for fast "sqrtf" is "-fno-math-errno". "-ffast-math" also enables a bunch of other stuff that can be problematic.

MichaelMeissner said:
If you don't need the error checking, you can use the '__builtin_sqrtf' function instead, and it will generate the direct instruction (note there are 2 leading underscores).

In my tests, you get exactly the same code with sqrtf and __builtin_sqrtf, regardless of optimization options. With "-Os", you get a math library call; with -O1 / -O2 / -O3, you get an inlined hardware vsqrt.f32 instruction along with a ton of error handling code (unless you used "-fno-math-errno").

DD4WH · Dec 6, 2016

@Walter, thanks for clarifying, takes a bit of the mysterious time savings out ;-).

Tried the loop with the four instructions and it was still a little bit slower than the CMSIS routine (also when I used 16 instructions in the loop). Seems they have optimized other things too.

Lesson learned for me:

You have to try out which specific function call / routine is faster. There does not seem to be a general rule like "CMSIS is always faster" etc.

ninja2 · Dec 6, 2016

this has been a very useful discussion and intro to the subtleties of using the FPU ... !
