NXP CMSIS FFT on Teensy4?

DrM

Well-known member
Hi,

Are there instructions and examples, or some form of documentation, specifically on how to use the NXP CMSIS, and in particular the FFT, on the Teensy 4 with the Arduino IDE?

The use case: data is read from an external 18-bit ADC, using a timer, and processed 1k to 8k samples at a time to produce the complex Fourier transform.

Throughput performance is critical, so my first thought is to look to the CMSIS, which is supposed to be optimized for the i.MXRT106x architecture.

Thank you
 
I don't know of any Teensy-specific documentation on FFT other than what is in the Audio tutorial, but the T4 core includes much of a somewhat old version of CMSIS DSP, including fixed-point and 32-bit float FFT. See file arm_math.h for details. I don't know how fast it is or how it would compare to generic optimized FFTs you can find online and build for Teensy.
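
For reference, a minimal sketch of what calling the 32-bit float FFT declared in arm_math.h might look like for real-valued ADC samples. It assumes the core's arm_math.h provides arm_rfft_fast_init_f32 and arm_rfft_fast_f32 with the usual CMSIS-DSP signatures. realFFT is a hypothetical wrapper name, and the plain DFT branch is there only so the same code builds off-target for checking. Nothing here has been benchmarked.

```cpp
#include <cmath>
#include <cstdint>

// Use CMSIS-DSP only when building for ARM and the header is present
// (assumption: the Teensy 4 core ships arm_math.h, as stated in the post above).
#if defined(__arm__) && defined(__has_include)
#  if __has_include(<arm_math.h>)
#    include <arm_math.h>
#    define HAVE_CMSIS_DSP 1
#  endif
#endif

// Hypothetical wrapper: forward FFT of n real samples.
// Output: n floats, interleaved re/im for bins 0..n/2-1, with out[0] = DC
// and out[1] = the (real) Nyquist bin, i.e. the arm_rfft_fast_f32 packing.
// Returns false if the length is not supported.
bool realFFT(float *in, float *out, uint16_t n) {
#ifdef HAVE_CMSIS_DSP
  // In real code, initialize the instance once and reuse it for every frame.
  arm_rfft_fast_instance_f32 inst;
  if (arm_rfft_fast_init_f32(&inst, n) != ARM_MATH_SUCCESS) return false;
  arm_rfft_fast_f32(&inst, in, out, 0);  // 0 = forward; the input buffer is overwritten
  return true;
#else
  // Off-target fallback: direct O(n^2) DFT with the same output packing.
  if (n < 2 || (n & 1)) return false;
  const double w = -2.0 * 3.14159265358979323846 / n;
  double dc = 0.0, nyq = 0.0;
  for (uint16_t j = 0; j < n; ++j) {
    dc += in[j];
    nyq += (j & 1) ? -in[j] : in[j];
  }
  out[0] = (float)dc;
  out[1] = (float)nyq;
  for (uint16_t k = 1; k < n / 2; ++k) {
    double re = 0.0, im = 0.0;
    for (uint16_t j = 0; j < n; ++j) {
      const double a = w * (double)j * (double)k;  // angle of e^{-i*2*pi*j*k/n}
      re += in[j] * std::cos(a);
      im += in[j] * std::sin(a);
    }
    out[2 * k] = (float)re;
    out[2 * k + 1] = (float)im;
  }
  return true;
#endif
}
```

As far as I know, the CMSIS real FFT only accepts power-of-two lengths up to 4096, so an 8k transform would likely need a different routine or a split into smaller FFTs. The status check above would flag that case.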
 
There is not very much CMSIS optimization specific to the T4. The float FFT is plain floating-point multiply-add, as is typical of all FFT implementations. The 16-bit integer FFT for the T3 is different: there you find DSP optimization (as used in the Audio library). If you want to do 8k FFTs, you go beyond what the Teensy Audio library offers with its 128-sample buffers, but you can always extend the audio library (for an 8k FFT you need to accumulate 64 buffers before doing an FFT, which must very likely run asynchronously to the audio library).
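
The accumulation step described above (64 blocks of 128 samples per 8k frame) could be sketched roughly as follows. BlockAccumulator and its constants are hypothetical names, not part of the Audio library, and the sketch leaves out the asynchronous hand-off to the FFT.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kBlockSamples = 128;   // audio block size mentioned in the post
constexpr size_t kFftSamples   = 8192;  // target FFT length (8k)
constexpr size_t kBlocksPerFft = kFftSamples / kBlockSamples;  // 64 blocks

// Hypothetical helper: collects fixed-size blocks until a full FFT frame exists.
struct BlockAccumulator {
  int16_t buffer[kFftSamples];
  size_t blocks = 0;

  // Copies one 128-sample block into the frame.
  // Returns true when the frame is complete and ready for the FFT.
  bool push(const int16_t *block) {
    if (blocks == kBlocksPerFft) blocks = 0;  // previous frame consumed, start over
    std::memcpy(buffer + blocks * kBlockSamples, block,
                kBlockSamples * sizeof(int16_t));
    ++blocks;
    return blocks == kBlocksPerFft;
  }
};
```

A real implementation would probably use two such frames (double buffering) so that new blocks can keep arriving while the FFT of the previous frame is still running.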
 
Sure? Even for radix4 versions?
Yes. If you look at the butterfly operations, you see a one-to-one relation to the math as given in the comment above every instruction. The difference between, for example, radix-4 and radix-8 is simply programming style and not specific to the processor architecture. We all know that loop unrolling speeds up processing, but the compiler knows this too. Optimizing FFTs with proper mixed-radix operations has been understood for as long as radix-8 has existed (i.e. since the beginning), but I would not call this MCU-specific. AFAIK, NEON-specific and vectorized instructions are not for the T4 but for other processors.
This is not to say that CMSIS5 does not have a faster implementation, but IMO that comes from better structure and compilation, not from anything processor-specific.
 
Which butterfly operations, specifically? The arm_radix4_butterfly_q15 function for example, precompiled in libarm_cortexM7lfsp_math.a is full of DSP instructions like SHADD16, QADD16, QSUB16, etc.
 
OK, if you really want to use a 16-bit FFT that is optimized for 16-bit integer DSP, on a T4 that has float32 and float64 hardware, then... BUT these optimizations have not changed from CMSIS4 or earlier to CMSIS5.
 
Actually, the raw data is from an 18 bit ADC.
As the last two bits are noise anyway, you could easily use the Teensy Audio library, which is optimized for 16-bit DSP. Nothing specific to the T4. A 1k FFT is already implemented. To make an 8k FFT, you could extend the 1k FFT module to 8k.
 
Incorrect. The last two bits in this case are not noise.
If you need the last two bits, you cannot use the optimized 16-bit DSP, and then, AFAIK, there are no special DSP optimizations. Or you give up the top 2 bits and modify the Audio library to extract the 16 bits of interest.
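
To make the two 16-bit options concrete, here is a small sketch of both mappings. It assumes the 18-bit sample has already been sign-extended into an int32_t; the function names are hypothetical.

```cpp
#include <cstdint>

// Option A: keep the 16 most significant bits, discarding the two lowest.
// Relies on >> being an arithmetic shift for negative values, which holds
// for GCC and Clang.
inline int16_t dropLowBits(int32_t s18) {
  return (int16_t)(s18 >> 2);
}

// Option B: keep the 16 least significant bits, giving up the top two.
// Only sensible if the signal stays within the 16-bit range; anything
// larger is saturated here rather than wrapped.
inline int16_t dropHighBits(int32_t s18) {
  if (s18 > INT16_MAX) return INT16_MAX;
  if (s18 < INT16_MIN) return INT16_MIN;
  return (int16_t)s18;
}
```

Option A loses resolution for every sample, while option B keeps full resolution but clips large signals, so the choice depends on how much of the ADC range the input actually uses.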
 
Whether or not the last two bits of 18 are noise, a 16-bit FFT (q15) will not produce 16 meaningful bits, due to rounding errors internal to the FFT. So a floating-point or q31 FFT is needed to preserve 16 bits of accuracy anyway.
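
If the float or q31 route is taken, all 18 bits fit without loss. A minimal sketch of the input conversion, assuming the ADC delivers 18-bit two's-complement samples in the low bits of a 32-bit word (check the ADC datasheet; the function names are hypothetical):

```cpp
#include <cstdint>

// Sign-extend an 18-bit two's-complement sample to a full int32_t.
inline int32_t signExtend18(uint32_t raw) {
  raw &= 0x3FFFFu;                       // keep only the 18 data bits
  return (raw & 0x20000u) ? (int32_t)raw - 0x40000   // sign bit set: negative
                          : (int32_t)raw;
}

// Scale to float in [-1, 1). A float has a 24-bit significand,
// so every 18-bit value is represented exactly.
inline float toFloat(int32_t s18) {
  return (float)s18 / 131072.0f;         // divide by 2^17
}

// Left-align into q31 (multiply by 2^14); the result always fits in int32_t.
inline int32_t toQ31(int32_t s18) {
  return s18 * 16384;
}
```

Both conversions keep the two low bits that a q15 path would have to discard before the transform even starts.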
 