faster FPU operations

WMXZ · Oct 27, 2016

This is less a technical problem than a technical lesson learned:

I stumbled across some counterintuitive MCU behaviour.

I need a fast, modified fir_decimate operation.
So I downloaded the latest CMSIS-5 and copied arm_fir_decimate_f32.c into a local folder to be modified.
But first I compiled the unmodified version and tested it against the one in the arm_math library.
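
For anyone who wants to check outputs as well as timing, a plain reference decimator is handy to compare against. The following is only a minimal sketch, not the CMSIS code: the name fir_decimate_ref is made up, and the sample alignment (each output uses the last of the M new input samples as its newest sample, with zero history before the first sample) is an assumption that may need adjusting.

```c
#include <stddef.h>

/* Hypothetical reference FIR decimator (not the CMSIS implementation).
 * Assumed convention: y[k] = sum_i h[i] * x[k*m + (m-1) - i],
 * with samples before x[0] treated as zero. Requires m >= 1.
 * Writes n_in / m output samples to y. */
static void fir_decimate_ref(const float *x, size_t n_in,
                             const float *h, size_t n_taps,
                             size_t m, float *y)
{
    size_t n_out = n_in / m;
    for (size_t k = 0; k < n_out; k++) {
        size_t newest = k * m + (m - 1);   /* index of newest input sample */
        float acc = 0.0f;
        /* stop early when the filter would reach before x[0] */
        for (size_t i = 0; i < n_taps && i <= newest; i++) {
            acc += h[i] * x[newest - i];
        }
        y[k] = acc;
    }
}
```

Feeding the same block to this function and to the routine under test, then comparing the outputs within a small tolerance, is usually enough to catch a broken modification before looking at cycle counts.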

To my surprise, the library version ran about 9% faster than the freshly compiled one.
I assumed the compiler flags were different.
Asking on the CMSIS-5 GitHub, I learned which compiler flags are used to build the libraries.

Interestingly, the list included "-ffp-contract=off".
With that flag, my own build was as fast as the one from the library.

So I looked up -ffp-contract and found this in the GCC documentation:

-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.

What does this mean for the K66?
With -ffp-contract=off, a single "fused" FPU instruction, "vfma.f32", is replaced by two instructions: "vmul.f32" followed by "vadd.f32".
Having two operations instead of a single one makes the overall execution faster, presumably because it allows better scheduling of the FPU.
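
To see the effect in isolation, here is a minimal sketch of the multiply-accumulate step that sits in the FIR inner loop. The function name mac_f32 and the build line in the comment are my assumptions for a Cortex-M4F target, not the flags used for the library. Compiling it to assembly once with each setting should show which instructions the compiler picks.

```c
/* Build sketch (assumed flags for a Cortex-M4F such as the K66):
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard -ffp-contract=off -S mac.c
 * Repeat with -ffp-contract=fast and compare the two .s files. */

/* One multiply-accumulate step, as used once per filter tap.
 * With contraction allowed, the compiler may fuse the multiply and
 * the add into one vfma.f32; with -ffp-contract=off it has to emit
 * a separate vmul.f32 and vadd.f32. */
float mac_f32(float acc, float x, float c)
{
    return acc + x * c;
}
```

Note that a fused multiply-add rounds only once, so the two builds can also differ in the last bits of the result, not only in speed.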

I also learned that my local build of arm_fir_decimate_q31 needed the "-fno-strict-aliasing" flag to run as fast as the library version. This is still a mystery to me, as I cannot find anything in the code that this compiler flag would switch.
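
For what it is worth, -fno-strict-aliasing is not a preprocessor switch, so there is nothing to find in the source: it changes what the optimizer is allowed to assume about pointers of different types. The sketch below is a generic illustration with a made-up function name, not taken from the CMSIS code, and it does not explain the timing difference.

```c
#include <stdint.h>

/* Hypothetical illustration of the strict-aliasing assumption.
 * With strict aliasing (the default at -O2), the compiler may assume
 * an int32_t* and a float* never point at the same object, so it can
 * return the constant 1 without reloading *acc after the float store.
 * With -fno-strict-aliasing it must assume the store through coef
 * could have changed *acc and reload it. */
int32_t store_then_read(int32_t *acc, float *coef)
{
    *acc = 1;
    *coef = 2.0f;
    return *acc;
}
```

When the two pointers refer to different objects, as they should, both builds return the same value; only the generated loads and stores differ. Comparing the assembly of the q31 routine with and without the flag would show where the compiler makes a different decision.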

Finally, I found that the f32 FIR decimation was 25% faster than the q31 version, which IMO is significant.
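
One likely contributor is the extra work per tap in fixed point. The following is a minimal sketch of a q31 dot product the way fixed-point FIR code is commonly written, with a 64-bit accumulator and a final shift back to q31. The name dot_q31 is made up and the code is not copied from CMSIS; there is no saturation, so the caller has to keep the result in range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical q31 dot product (1.31 fixed point).
 * Each q31 * q31 product is a 2.62 value that needs 64 bits, so every
 * tap costs a widening multiply and a 64-bit add, while the f32
 * version needs one FPU multiply and one FPU add.
 * The final shift by 31 converts the 2.62 sum back to 1.31;
 * overflow of the result is not handled here. */
static int32_t dot_q31(const int32_t *x, const int32_t *c, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)x[i] * c[i];
    }
    return (int32_t)(acc >> 31);
}
```

For example, 0x40000000 is 0.5 in q31, so a single tap of 0.5 times 0.5 gives 0x20000000, which is 0.25.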
