
CMSIS-DSP library supports



TronicLabs
08-08-2013, 11:50 PM
Would it be possible to integrate this library, as it is currently designed, with the Teensy 3.0?
Sorry if the question is stupid; I am a beginner in this world.

el_supremo
08-09-2013, 01:50 AM
The ARM cortex M4 math library is installed as part of Teensyduino 1.15.
See my message #42 in this thread: http://forum.pjrc.com/threads/14845 There I posted some of the CMSIS examples which run on Teensy 3.

Pete

TronicLabs
08-09-2013, 11:26 AM
Many thanks, I missed it.

PaulStoffregen
08-09-2013, 12:24 PM
If you do anything with the math library, even pretty simple stuff, I hope you'll consider posting about it.

This library is pretty complex, so even pretty simple "how to" info might really help everyone who tries to use it.

manitou
07-19-2018, 12:03 AM
PaulStoffregen wrote: "If you do anything with the math library, even pretty simple stuff, I hope you'll consider posting about it. This library is pretty complex, so even pretty simple 'how to' info might really help everyone who tries to use it."

Here is an NXP link to some DSP benchmarks (FFT, mult, sin, cos, FIR, biquad cascade): https://community.nxp.com/thread/327833
I ported some of the tests from the zip file and ran them on a T3.5 @ 120 MHz (Arduino IDE 1.8.5 / Teensyduino 1.42).


- arm_mult_f32 - 1.086 us ; // real float32 8
- arm_mult_f32 - 4.852 us ; // real float32 64
- arm_mult_f32 - 17.704 us ; // real float32 256
- arm_mult_f32 - 69.027 us ; // real float32 1024
- arm_mult_q31 - 1.671 us ; // real q31 8
- arm_mult_q31 - 6.891 us ; // real q31 64
- arm_mult_q31 - 24.553 us ; // real q31 256
- arm_mult_q31 - 95.128 us ; // real q31 1024
- arm_mult_q15 - 1.086 us ; // real q15 8
- arm_mult_q15 - 5.303 us ; // real q15 64
- arm_mult_q15 - 19.729 us ; // real q15 256
- arm_mult_q15 - 77.460 us ; // real q15 1024
- arm_sin_cos_f32 - 0.579 us ; // real float32
- arm_sin_cos_q31 - 0.671 us ; // real q31_t
- arm_cfft_radix2_q15 - 54.8 us ; // real q15_t 64
- arm_cfft_radix2_q15 - 257.0 us ; // real q15_t 256
- arm_cfft_radix2_q15 - 1169.7 us ; // real q15_t 1024
- arm_cfft_radix4_q15 - 34.5 us ; // real q15_t 64
- arm_cfft_radix4_q15 - 166.5 us ; // real q15_t 256
- arm_cfft_radix4_q15 - 784.7 us ; // real q15_t 1024
- arm_cfft_radix2_q31 - 102.6 us ; // real q31_t 64
- arm_cfft_radix2_q31 - 516.1 us ; // real q31_t 256
- arm_cfft_radix2_q31 - 2489.1 us ; // real q31_t 1024
- arm_cfft_radix4_q31 - 73.6 us ; // real q31_t 64
- arm_cfft_radix4_q31 - 390.8 us ; // real q31_t 256
- arm_cfft_radix4_q31 - 1947.9 us ; // real q31_t 1024
- arm_cfft_radix2_f32 - 71.1 us ; // real float32_t 64
- arm_cfft_radix2_f32 - 361.2 us ; // real float32_t 256
- arm_cfft_radix2_f32 - 1710.7 us ; // real float32_t 1024
- arm_cfft_radix4_f32 - 44.1 us ; // real float32_t 64
- arm_cfft_radix4_f32 - 220.9 us ; // real float32_t 256
- arm_cfft_radix4_f32 - 1079.8 us ; // real float32_t 1024
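For readers unfamiliar with the q15 rows above, here is a minimal portable sketch of what an element-wise Q15 multiply computes: the 32-bit product is shifted right by 15 and saturated to the 16-bit range. The names q15_mul_ref and mult_q15_ref are made up for illustration; this is not the CMSIS source, and the library's optimized version for the Cortex-M4 is structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One Q15 product: (a * b) >> 15, saturated to [-32768, 32767].
   Assumes >> on a negative int32_t is an arithmetic shift (true for GCC). */
int16_t q15_mul_ref(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;   /* only -1.0 * -1.0 reaches this */
    if (p < INT16_MIN) p = INT16_MIN;   /* defensive; not reachable for Q15 inputs */
    return (int16_t)p;
}

/* Element-wise multiply of two Q15 vectors of length n (hypothetical helper). */
void mult_q15_ref(const int16_t *a, const int16_t *b, int16_t *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = q15_mul_ref(a[i], b[i]);
    }
}
```

In Q15, 16384 represents 0.5, so q15_mul_ref(16384, 16384) gives 8192 (0.25), and the one overflowing case, -32768 times -32768, saturates to 32767.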

DD4WH
07-19-2018, 07:59 AM
Hi,

Thanks for your tests!

It seems the thread and the link are very old and relate to a very old version of CMSIS. Some or most of the functions are deprecated now and have been superseded by new functions, e.g. arm_cfft_f32 (https://www.keil.com/pack/doc/CMSIS/DSP/html/group__ComplexFFT.html#gade0f9c4ff157b6b9c72a1eafd86ebf80).

There are (much?) faster and more accurate versions of CMSIS available now, and they have been used on the Teensy. Jan has a detailed description of how to use one:

https://forum.pjrc.com/threads/40590-Teensy-Convolution-SDR-(Software-Defined-Radio)?p=129081&viewfull=1#post129081

All the best,

Frank DD4WH

manitou
07-19-2018, 11:17 AM
DD4WH wrote: "it seems the thread and the link is very old and relates to a very old version of CMSIS, e.g. some/most of the functions are deprecated now and have been superseded by new functions"


Tests were run with the latest IDE, 1.8.5/1.42. The arm_math.h in the current IDE appears to be V1.1.0.

DD4WH
07-19-2018, 11:34 AM
Yes, you are right, the recent IDE contains a very old version of CMSIS (and thus arm_math.h).

That is why I linked to instructions for using a newer version, which provides faster (and more accurate) algorithms, e.g. for FFTs.

manitou
07-19-2018, 03:50 PM
DD4WH wrote: "Therefore I provided the link to a way how to use a newer version which provides faster (and more accurate) algorithms, eg. for FFTs."

I see there is now a version 5 https://github.com/ARM-software/CMSIS_5
Do your instructions for updating the Teensy core CMSIS includes/libs apply to version 5 as well?

DD4WH
07-19-2018, 03:52 PM
No, they only apply to the version given.

But there was a recent post (a few weeks ago, if I remember correctly) by somebody who successfully updated to version 5 by editing only one or two more lines. Unfortunately I could not find it with a forum search . . .

DD4WH
07-19-2018, 03:57 PM
Found it:

https://forum.pjrc.com/threads/44570-Request-update-CMSIS-DSP-(arm_math-h)?p=182720&viewfull=1#post182720

I personally have successfully used v4.5, but not version 5, so experiment yourself :-)

manitou
07-20-2018, 12:18 PM
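Using the same benchmark (deprecated functions) and upgrading to v4.5 (https://github.com/ARM-software/CMSIS), most functions are faster. The exceptions are arm_sin_cos, arm_cfft_radix2_q31 and the larger arm_mult_f32 sizes, which are slower.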

T3.5@120mhz arm_math.h v1.4.5
- arm_mult_f32 - 0.919 us ; // real float32 8
- arm_mult_f32 - 5.248 us ; // real float32 64
- arm_mult_f32 - 20.090 us ; // real float32 256
- arm_mult_f32 - 79.456 us ; // real float32 1024
- arm_mult_q31 - 1.187 us ; // real q31 8
- arm_mult_q31 - 5.984 us ; // real q31 64
- arm_mult_q31 - 22.431 us ; // real q31 256
- arm_mult_q31 - 88.219 us ; // real q31 1024
- arm_mult_q15 - 1.086 us ; // real q15 8
- arm_mult_q15 - 5.179 us ; // real q15 64
- arm_mult_q15 - 19.215 us ; // real q15 256
- arm_mult_q15 - 75.363 us ; // real q15 1024
- arm_sin_cos_f32 - 1.337 us ; // real float32
- arm_sin_cos_q31 - 2.506 us ; // real q31_t
- arm_cfft_radix2_q15 - 46.0 us ; // real q15_t 64
- arm_cfft_radix2_q15 - 214.9 us ; // real q15_t 256
- arm_cfft_radix2_q15 - 982.8 us ; // real q15_t 1024
- arm_cfft_radix4_q15 - 27.0 us ; // real q15_t 64
- arm_cfft_radix4_q15 - 136.4 us ; // real q15_t 256
- arm_cfft_radix4_q15 - 668.6 us ; // real q15_t 1024
- arm_cfft_radix2_q31 - 106.6 us ; // real q31_t 64
- arm_cfft_radix2_q31 - 542.2 us ; // real q31_t 256
- arm_cfft_radix2_q31 - 2643.6 us ; // real q31_t 1024
- arm_cfft_radix4_q31 - 58.9 us ; // real q31_t 64
- arm_cfft_radix4_q31 - 316.2 us ; // real q31_t 256
- arm_cfft_radix4_q31 - 1585.1 us ; // real q31_t 1024
- arm_cfft_radix2_f32 - 57.8 us ; // real float32_t 64
- arm_cfft_radix2_f32 - 291.5 us ; // real float32_t 256
- arm_cfft_radix2_f32 - 1417.6 us ; // real float32_t 1024
- arm_cfft_radix4_f32 - 37.6 us ; // real float32_t 64
- arm_cfft_radix4_f32 - 186.1 us ; // real float32_t 256
- arm_cfft_radix4_f32 - 893.1 us ; // real float32_t 1024


For V5.3 I had to #ifdef out all of hardware/teensy/avr/cores/teensy3/core_cm4_simd.h to get it to compile.
https://github.com/ARM-software/CMSIS_5

T3.5@120mhz arm_math.h V1.5.3
- arm_mult_f32 - 0.919 us ; // real float32 8
- arm_mult_f32 - 5.248 us ; // real float32 64
- arm_mult_f32 - 20.089 us ; // real float32 256
- arm_mult_f32 - 79.454 us ; // real float32 1024
- arm_mult_q31 - 1.338 us ; // real q31 8
- arm_mult_q31 - 6.140 us ; // real q31 64
- arm_mult_q31 - 22.564 us ; // real q31 256
- arm_mult_q31 - 88.342 us ; // real q31 1024
- arm_mult_q15 - 0.944 us ; // real q15 8
- arm_mult_q15 - 4.922 us ; // real q15 64
- arm_mult_q15 - 18.558 us ; // real q15 256
- arm_mult_q15 - 73.103 us ; // real q15 1024
- arm_sin_cos_f32 - 1.254 us ; // real float32
- arm_sin_cos_q31 - 2.506 us ; // real q31_t
- arm_cfft_radix2_q15 - 45.3 us ; // real q15_t 64
- arm_cfft_radix2_q15 - 213.7 us ; // real q15_t 256
- arm_cfft_radix2_q15 - 988.5 us ; // real q15_t 1024
- arm_cfft_radix4_q15 - 27.2 us ; // real q15_t 64
- arm_cfft_radix4_q15 - 136.8 us ; // real q15_t 256
- arm_cfft_radix4_q15 - 658.5 us ; // real q15_t 1024
- arm_cfft_radix2_q31 - 117.3 us ; // real q31_t 64
- arm_cfft_radix2_q31 - 606.3 us ; // real q31_t 256
- arm_cfft_radix2_q31 - 2985.4 us ; // real q31_t 1024
- arm_cfft_radix4_q31 - 58.4 us ; // real q31_t 64
- arm_cfft_radix4_q31 - 314.3 us ; // real q31_t 256
- arm_cfft_radix4_q31 - 1577.9 us ; // real q31_t 1024
- arm_cfft_radix2_f32 - 57.9 us ; // real float32_t 64
- arm_cfft_radix2_f32 - 291.6 us ; // real float32_t 256
- arm_cfft_radix2_f32 - 1417.7 us ; // real float32_t 1024
- arm_cfft_radix4_f32 - 39.0 us ; // real float32_t 64
- arm_cfft_radix4_f32 - 192.5 us ; // real float32_t 256
- arm_cfft_radix4_f32 - 919.5 us ; // real float32_t 1024

Some FFTs (e.g. arm_cfft_radix2_q31 and arm_cfft_radix4_f32) are slower than with 4.5.

Comparative anatomy

The Teensy audio library uses arm_cfft_radix4_q15(). So here are some performance comparisons.

q15 radix 4 1024 FFT, MCU @120MHz

MCU    microseconds  REVERSEBITS  arm_math.h  notes
T3.5   784.7         860.4        1.1.0       GCC, Faster
T3.5   668.6         726.9        1.4.5
T3.5   658.5         717.0        1.5.3
K64F   635.7         691.6        1.4.5       mbed ARM CC; new arm_cfft_q15() 640.4 us
MK70F  755.0                      1.1.0?      NXP DSP benchmark, IAR CC
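To put the T3.5 rows in relative terms, the arithmetic can be sketched as below. The helper names speedup and overhead_percent are hypothetical, and reading the second numeric column as "the same FFT with bit reversal enabled" is an assumption, not something the table states outright.

```c
#include <assert.h>

/* Speedup factor: how many times faster the new timing is than the old one. */
double speedup(double old_us, double new_us)
{
    assert(old_us > 0.0 && new_us > 0.0);
    return old_us / new_us;
}

/* Extra cost of the second timing relative to the first, in percent. */
double overhead_percent(double base_us, double with_extra_us)
{
    assert(base_us > 0.0);
    return (with_extra_us - base_us) / base_us * 100.0;
}
```

With the numbers above, speedup(784.7, 668.6) is about 1.17 (1.1.0 to 1.4.5), speedup(784.7, 658.5) is about 1.19 (1.1.0 to 1.5.3), and overhead_percent(784.7, 860.4) is about 9.6 percent for the REVERSEBITS column under 1.1.0.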

willie.from.texas
07-25-2018, 04:11 AM
Good comparison, but I would instead recommend running arm_cfft_f32 for Version 5.3, as the radix2 and radix4 algorithms have been deprecated.
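As a rough sketch of what calling the newer routine could look like: arm_cfft_f32 works in place on interleaved complex data, so real samples first have to be packed as {re, im, re, im, ...}. The helper pack_real_as_complex, the wrapper forward_fft_1024 and the macro HAVE_CMSIS_DSP_CFFT are all made up for this example. The macro is meant to be defined only when building against a CMSIS-DSP version that ships arm_const_structs.h, which by the discussion above the stock V1.1.0 does not.

```c
#include <assert.h>
#include <stddef.h>

#ifdef HAVE_CMSIS_DSP_CFFT            /* hypothetical guard macro, see text above */
#include "arm_math.h"
#include "arm_const_structs.h"
#endif

/* Pack n real samples into the interleaved {re, im, ...} layout,
   zeroing the imaginary parts. cplx must hold 2 * n floats. */
void pack_real_as_complex(const float *re, float *cplx, size_t n)
{
    assert(re != NULL && cplx != NULL);
    for (size_t i = 0; i < n; i++) {
        cplx[2 * i]     = re[i];
        cplx[2 * i + 1] = 0.0f;
    }
}

#ifdef HAVE_CMSIS_DSP_CFFT
/* In-place forward transform of 1024 complex points (2048 floats). */
void forward_fft_1024(float32_t *cplx)
{
    arm_cfft_f32(&arm_cfft_sR_f32_len1024, cplx,
                 0 /* ifftFlag: forward */,
                 1 /* bitReverseFlag: output in natural order */);
}
#endif
```

Without the guard macro only the packing helper is compiled, which makes it easy to check the buffer layout on a desktop before moving to the board.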

willie.from.texas
07-30-2018, 12:44 AM
Here's the Version 5.3 performance for the floating point real and complex fft routines using the Teensy 3.6. The forward and inverse rfft (real-fft) uses arm_rfft_fast_f32. The forward and inverse cfft (complex-fft) uses arm_cfft_f32.

It is difficult to directly compare these results with manitou's numbers above because they were obtained on a Teensy 3.6 @ 180 MHz. Assuming 1417.7 us is for a 1024 point forward transform (the figure listed above for arm_cfft_radix2_f32), Version 5.3 is approximately 1.6 times faster when using the latest algorithms and taking the difference in clock speeds into account.
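The clock-speed adjustment mentioned here can be sketched as a simple rescaling. It assumes the cycle count stays the same at both clock rates, which ignores flash wait states and cache effects, and scale_time_us is a hypothetical helper name.

```c
#include <assert.h>

/* Rescale a measured time to another core clock, assuming the cycle
   count is unchanged (ignores flash wait states and cache effects). */
double scale_time_us(double measured_us, double measured_mhz, double target_mhz)
{
    assert(measured_mhz > 0.0 && target_mhz > 0.0);
    return measured_us * measured_mhz / target_mhz;
}
```

For example, scale_time_us(1417.7, 120.0, 180.0) gives roughly 945.1 us, the 120 MHz figure expressed at 180 MHz. A 1.6 times speedup against that would put the newer routine somewhere near 590 us, though that figure is inferred here and not quoted from the post.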