This is less a technical problem than a technical lesson learned:
I stumbled across some counterintuitive MCU behaviour.
I need a fast, modified fir_decimate operation.
So I downloaded the latest CMSIS-5 and copied arm_fir_decimate_f32.c into a local folder to modify it.
But first I compiled the unmodified version and tested it against the one in the arm_math library.
To my surprise, the library version ran about 9% faster than the freshly compiled one.
I assumed the compiler flags were different.
By asking on the CMSIS-5 GitHub, I learned which compiler flags are used to build the libraries.
Interestingly, they include "-ffp-contract=off".
With that flag, my own build was as fast as the one from the library.
So I looked up -ffp-contract and found:
-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.
What does it mean for the K66?
A single "fused" FPU instruction "vfma.f32" is replaced by two instructions, "vmul.f32" followed by "vadd.f32".
Having two operations instead of a single one makes the overall execution faster. Apparently the separate multiply and add allow better use of the FPU than the fused instruction.
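For readers wondering why this is a compiler flag at all: the fused and the separate form are not numerically identical, because the fused form rounds once and the separate form rounds twice. Below is a minimal sketch in plain C that runs on a PC. The function names are hypothetical, and the fused variant is only emulated with double arithmetic (an assumption that avoids depending on the K66 or on libm); it is not the CMSIS code.

```c
#include <stdio.h>

/* Multiply-accumulate with two roundings, as with -ffp-contract=off
 * (vmul.f32 followed by vadd.f32). The volatile store forces the product
 * to be rounded to float and keeps the compiler from fusing the steps. */
static float mac_separate(float acc, float x, float c)
{
    volatile float product = x * c;
    return acc + product;
}

/* Multiply-accumulate with a single rounding, standing in for vfma.f32.
 * Assumption: double arithmetic emulates the fused instruction here; the
 * product of two floats is exact in double, so it is not rounded early. */
static float mac_fused_emulated(float acc, float x, float c)
{
    return (float)((double)x * (double)c + (double)acc);
}

/* Hypothetical helper: prints both results for one input where they differ. */
static void show_contraction_difference(void)
{
    float x   = 1.0f + 1.0f / 4096.0f;      /* 1 + 2^-12    */
    float acc = -(1.0f + 1.0f / 2048.0f);   /* -(1 + 2^-11) */

    printf("separate: %.9g\n", mac_separate(acc, x, x));
    printf("fused   : %.9g\n", mac_fused_emulated(acc, x, x));
}
```

Calling show_contraction_difference() from main should show the separate version returning 0, while the emulated fused version keeps the small residual 2^-24 that the intermediate rounding throws away. So -ffp-contract=off trades a little accuracy per tap for the different instruction sequence.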
I also learned that the local arm_fir_decimate_q31 needed the "-fno-strict-aliasing" flag to run as fast as the library version. This is still a mystery to me, as I cannot find anything in the code that would be affected by this compiler flag.
Finally, I found that the f32 FIR decimation was 25% faster than the q31 version, which IMO is significant.