CMSIS 5.9.0 + CMSIS-DSP 1.12

mjs513

Alcon

The latest version of CMSIS DSP is no longer easy to port to Teensy without going through and generating your own arm_math.a files. DSP has been broken out from CMSIS5 into its own repository as of version 5.5.1.

The following folders have been removed:

CMSIS/Lib/ (superseded by CMSIS/DSP/Lib/)
CMSIS/DSP_Lib/ (superseded by CMSIS/DSP/)

The following folders are deprecated (for 5.5.1):

CMSIS/Include/ (superseded by CMSIS/DSP/Include/ and CMSIS/Core/Include/)

In previous versions ARM CMSIS provided the gcc pre-compiled binaries for arm_math. With the latest versions you have to build your own libraries. However, I think I cracked the code on how to do that.

That said, with some poking around I found that in version 5.5.0 they provided a uVision project file that you can use to compile the binaries for the library:
1. Download the MDK µVision® IDE from the ARM-Keil website
https://www2.keil.com/mdk5/uvision/ (Overview)

https://www.keil.com/download/product/ (download MDK-ARM - you will have to register)

2. Download the Arm GNU Toolchain (in my case I picked 11.3.1, to be in line with what we are currently using).
https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads

3. Then I downloaded the latest versions of CMSIS5 and CMSIS-DSP and put them on my D: drive (made my life easier).

4. Copied the project file for gcc, made some major edits to cover all the new library functions, and edited the file paths for each board I wanted to compile for.
a. In [ Project -> Manage -> Project Items ] I deleted all the groups and recreated them using all the new groups and files from the latest CMSIS-DSP source folder.
Capture.PNG

5. Then I selected the processor I wanted to compile for, did a left click, and selected: Options for target board. In the CC tab I edited the paths for the files:

Capture1.PNG

Then I did the Build for the target and got my compiled binary for the library. However, the library files are now 9 MB versus the 2-3 MB they were before, so I am not sure if all the compile options are correct, but they are the default ones that come with the project file.

The last step was to update the arm_structs.h and arm_constants.h files in the cores.

I will post the files in a repository if anybody wants to do some further testing - I need until the end of the day.

EDIT: Ref this post: https://forum.pjrc.com/threads/71074-Teensyduino-1-58-Beta-2?p=312904&viewfull=1#post312904
 
Using @manitou's dsp benchmark sketch:
Code:
Teensy MicroMod

- arm_mult_f32         -  0.092 us ; // real float32        8
- arm_mult_f32         -  0.418 us ; // real float32       64
- arm_mult_f32         -  1.539 us ; // real float32      256
- arm_mult_f32         -  6.019 us ; // real float32     1024
- arm_mult_q31         -  0.185 us ; // real q31            8
- arm_mult_q31         -  1.072 us ; // real q31           64
- arm_mult_q31         -  4.112 us ; // real q31          256
- arm_mult_q31         - 16.273 us ; // real q31         1024
- arm_mult_q15         -  0.150 us ; // real q15            8
- arm_mult_q15         -  0.804 us ; // real q15           64
- arm_mult_q15         -  3.044 us ; // real q15          256
- arm_mult_q15         - 12.004 us ; // real q15         1024
- arm_sin_cos_f32      -  0.170 us ; // real float32                
- arm_sin_cos_q31      -  0.180 us ; // real q31_t                  
- arm_cfft_radix2_q15  -    5.8 us ; // real q15_t             64
- arm_cfft_radix2_q15  -   28.0 us ; // real q15_t            256
- arm_cfft_radix2_q15  -  133.9 us ; // real q15_t           1024
- arm_cfft_radix4_q15  -    2.8 us ; // real q15_t             64
- arm_cfft_radix4_q15  -   14.5 us ; // real q15_t            256
- arm_cfft_radix4_q15  -   71.2 us ; // real q15_t           1024
- arm_cfft_radix2_q31  -    8.7 us ; // real q31_t             64
- arm_cfft_radix2_q31  -   43.9 us ; // real q31_t            256
- arm_cfft_radix2_q31  -  213.7 us ; // real q31_t           1024
- arm_cfft_radix4_q31  -    4.2 us ; // real q31_t             64
- arm_cfft_radix4_q31  -   22.6 us ; // real q31_t            256
- arm_cfft_radix4_q31  -  114.1 us ; // real q31_t           1024
- arm_cfft_radix2_f32  -    5.4 us ; // real float32_t         64
- arm_cfft_radix2_f32  -   28.2 us ; // real float32_t        256
- arm_cfft_radix2_f32  -  140.0 us ; // real float32_t       1024
- arm_cfft_radix4_f32  -    3.2 us ; // real float32_t         64
- arm_cfft_radix4_f32  -   17.0 us ; // real float32_t        256
- arm_cfft_radix4_f32  -   85.9 us ; // real float32_t       1024
- arm_cfft_q15         -   71.4 us ; // real q15_t           1024
- arm_cfft_q31         -  114.1 us ; // real q31_t           1024
- arm_cfft_f32         -   87.0 us ; // real float32_t       1024

Code:
Teensy 3.5

- arm_mult_f32         -  0.911 us ; // real float32        8
- arm_mult_f32         -  5.354 us ; // real float32       64
- arm_mult_f32         - 20.587 us ; // real float32      256
- arm_mult_f32         - 81.519 us ; // real float32     1024
- arm_mult_q31         -  1.078 us ; // real q31            8
- arm_mult_q31         -  6.222 us ; // real q31           64
- arm_mult_q31         - 23.861 us ; // real q31          256
- arm_mult_q31         - 94.415 us ; // real q31         1024
- arm_mult_q15         -  1.044 us ; // real q15            8
- arm_mult_q15         -  5.010 us ; // real q15           64
- arm_mult_q15         - 18.660 us ; // real q15          256
- arm_mult_q15         - 73.144 us ; // real q15         1024
- arm_sin_cos_f32      -  1.253 us ; // real float32                
- arm_sin_cos_q31      -  1.962 us ; // real q31_t                  
- arm_cfft_radix2_q15  -   44.7 us ; // real q15_t             64
- arm_cfft_radix2_q15  -  207.2 us ; // real q15_t            256
- arm_cfft_radix2_q15  -  929.9 us ; // real q15_t           1024
- arm_cfft_radix4_q15  -   25.0 us ; // real q15_t             64
- arm_cfft_radix4_q15  -  124.7 us ; // real q15_t            256
- arm_cfft_radix4_q15  -  609.8 us ; // real q15_t           1024
- arm_cfft_radix2_q31  -   88.7 us ; // real q31_t             64
- arm_cfft_radix2_q31  -  448.6 us ; // real q31_t            256
- arm_cfft_radix2_q31  - 2178.0 us ; // real q31_t           1024
- arm_cfft_radix4_q31  -   48.7 us ; // real q31_t             64
- arm_cfft_radix4_q31  -  257.5 us ; // real q31_t            256
- arm_cfft_radix4_q31  - 1284.1 us ; // real q31_t           1024
- arm_cfft_radix2_f32  -   63.0 us ; // real float32_t         64
- arm_cfft_radix2_f32  -  323.0 us ; // real float32_t        256
- arm_cfft_radix2_f32  - 1587.3 us ; // real float32_t       1024
- arm_cfft_radix4_f32  -   36.7 us ; // real float32_t         64
- arm_cfft_radix4_f32  -  182.9 us ; // real float32_t        256
- arm_cfft_radix4_f32  -  880.8 us ; // real float32_t       1024
- arm_cfft_q15         -  616.0 us ; // real q15_t           1024
- arm_cfft_q31         - 1281.6 us ; // real q31_t           1024
- arm_cfft_f32         -  990.0 us ; // real float32_t       1024

For comparisons check this link:
https://forum.pjrc.com/threads/24037-CMSIS-DSP-library-supports?p=183614&viewfull=1#post183614

Some are slower and some are faster.
 
Paul has a script for generating those library files. I had it, but deleted it. Maybe you can find it on GitHub.
I guess he must have run it for the new compiler(?). I don't think the lib is still the same as for the old compiler. (One could compare the files, but I won't bother.)
 
No, the library has completely changed from the version that is currently in the Teensy. If you compare what was in, say, version 5.3 with what is there now, you will see that there are close to 50% more files than there were before.

Even if I could find the script, it was probably based on the older versions and incomplete. I just posted everything to GitHub, including the libraries plus the updated core files. Pretty much just a cut and paste operation now.

https://github.com/mjs513/Teensy-DSP-1.12-Updates

Basically, just copy and paste the files in the TeensyX Files folders to the associated cores, wherever you placed your TD installs. For me that would be
..\arduino-1.8.19-1131\hardware\teensy\avr\cores\teensy3 or teensy4 directories

Then copy the lib files in the Precompiled Binaries to:
..\arduino-1.8.19-1131\hardware\tools\arm\arm-none-eabi\lib

Then you are good to go. No need to edit the files.

There are benchmarks you can run, but I haven't got around to porting them. Right now I'm recovering from getting shots in my eye and it's a bit annoying.
 
@PaulStoffregen

wondering if the following is correct for the Teensy4's - would like to reduce that 9mb file size at some point

Misc controls
Code:
-fno-strict-aliasing -ffunction-sections -fdata-sections -mfpu=fpv5-sp-d16 -mfloat-abi=hard -ffp-contract=off

Compiler Control String
Code:
-c -mcpu=cortex-m7 -mthumb -gdwarf-2 -MD -Wall -O3 -I ..\..\..\Core\Include -I D:\CMSIS-DSP-1.12.0\Source -I D:\CMSIS-DSP-1.12.0\Include -I D:\CMSIS_5-5.9.0\CMSIS\Core\Include -I D:\CMSIS-DSP-1.12.0\PrivateInclude -fno-strict-aliasing -ffunction-sections -fdata-sections -mfpu=fpv5-sp-d16 -mfloat-abi=hard -ffp-contract=off -I"D:\Program Files (x86)\Arm GNU Toolchain arm-none-eabi\11.3 rel1\arm-none-eabi\include" -I"D:\Program Files (x86)\Arm GNU Toolchain arm-none-eabi\11.3 rel1\lib\gcc\arm-none-eabi\11.3.1\include" -I"D:\Program Files (x86)\Arm GNU Toolchain arm-none-eabi\11.3 rel1\arm-none-eabi\include\c++\11.3.1" -I"D:\Program Files (x86)\Arm GNU Toolchain arm-none-eabi\11.3 rel1\arm-none-eabi\include\c++\11.3.1\arm-none-eabi" -D__UVISION_VERSION="537" -D__GCC -D__GCC_VERSION="1131" -DARMCM7_SP -DARM_MATH_MATRIX_CHECK -DARM_MATH_ROUNDING -DARM_MATH_LOOPUNROLL -o *.o
 
OK, fixed the issue I was having with the 9 MB precompiled binary sizes. Reduced it to <4 MB. Updated GitHub with the changes.

If anyone has any fun test sketches, please give it a try. Or post your examples for the fun of it.
 
Out of this tendency I have for torture, I ran an example I found online using a 697 Hz tone that was generated from Audacity: https://m0agx.eu/2018/05/23/practical-fft-on-microcontrollers-using-cmsis-dsp/.

Porting the example to the T4, I get the same bin for the max frequency as they got, bin 44. As for the graph - since I am a glutton for punishment I used @KrisKasprzak's ILI9341_controls library to generate the following graph:
IMG-0743.jpg

If you are interested here is the file:
Code:
// https://m0agx.eu/2018/05/23/practical-fft-on-microcontrollers-using-cmsis-dsp/

#include <ILI9341_t3.h>           // fast display driver lib
#include <ILI9341_t3_Controls.h>
#include <font_Arial.h>           // custom fonts that ships with ILI9341_t3.h

// you must create and pass fonts to the function
#define FONT_TITLE Arial_16
#define FONT_DATA Arial_10

// For the Adafruit shield, these are the default.
#define TFT_DC  9
#define TFT_CS 10

// Use hardware SPI (on Uno, #13, #12, #11) and the above for CS/DC
ILI9341_t3 tft = ILI9341_t3(TFT_CS, TFT_DC);

// defines for graph location and scales
#define X_ORIGIN    50      // graph X location
#define Y_ORIGIN    200     // graph Y location
#define X_WIDE      250     // graph width
#define Y_HIGH      150     // graph height
#define X_LOSCALE   0       // X axis low
#define X_HISCALE   300     // X axis high
#define X_INC       50      // X axis increment
#define Y_LOSCALE   0       // Y axis low
#define Y_HISCALE   10000   // Y axis high
#define Y_INC       2000    // Y axis increment

#define TEXTCOLOR C_WHITE
#define GRIDCOLOR C_GREY
#define AXISCOLOR C_GREEN
#define BACKCOLOR C_BLACK
#define PLOTCOLOR C_DKGREY
#define VOLTSCOLOR C_RED
#define SINCOLOR C_YELLOW
#define COSCOLOR C_BLUE

// used to monitor elapsed time
unsigned long oldTime;

// create a variable for each data point
//float x, volts;

// create an ID for each data to be plotted
int fftID;

// create the display object
ILI9341_t3 Display(TFT_CS, TFT_DC);

// create the cartesian coordinate graph object
CGraph MyGraph(&Display, X_ORIGIN, Y_ORIGIN, X_WIDE, Y_HIGH, X_LOSCALE, X_HISCALE, X_INC, Y_LOSCALE, Y_HISCALE, Y_INC);

//=== End Graph setup

#include "arm_math.h"
#include "arm_const_structs.h"
#include "testData.h"

#define printf Serial.printf

#define FFT_SIZE 256
 
typedef struct {
  const char *desc;
  unsigned char *data;
} test_wave_t;
 
static const test_wave_t WAVES[] = {
    { "697Hz", __697hz_raw },
};
 
void fft_test(void){
  static arm_rfft_instance_q15 fft_instance;
  static q15_t output[FFT_SIZE*2]; //has to be twice FFT size
  static q15_t pResult;
  uint32_t index;
  
  arm_status status;
 
  status = arm_rfft_init_q15(&fft_instance, FFT_SIZE/*real FFT length*/, 0/*forward FFT*/, 1/*output bit order is normal*/);
  printf("FFT init %d\n", status);
 
  for (uint32_t i = 0; i < sizeof(WAVES)/sizeof(WAVES[0]); i++){
 
    uint32_t c_start = micros();
 
    arm_rfft_q15(&fft_instance, (q15_t*)WAVES[i].data, output);
 
    arm_abs_q15(output, output, FFT_SIZE);
 
    uint32_t c_stop = micros();
 
    //printf("%s %ld \n", WAVES[i].desc, c_stop-c_start);
 
    for (uint32_t j = 0; j < FFT_SIZE; j++){
      //printf("%d, %d\n ", j, output[j]);
        MyGraph.setX(j);
        MyGraph.plot(fftID, output[j]);
    }
    printf("\n");

    arm_max_q15(output, FFT_SIZE, &pResult, &index);
    //printf("Max Val: %d, Bin: %d\n", pResult, index);
    Display.setCursor(50,20);
    Display.setTextColor(C_YELLOW);
    Display.print("Peak at Bin "); Display.print(index);
  }
}
void setup() {
  Serial.begin(9600);
  while(!Serial){};

  // fire up the display
  Display.begin();
  Display.setRotation(1);
  Display.fillScreen(C_BLACK);

  // initialize the graph object
  MyGraph.init("", "bin", "mag", TEXTCOLOR, GRIDCOLOR, AXISCOLOR, BACKCOLOR, PLOTCOLOR, FONT_TITLE, FONT_DATA);
  fftID = MyGraph.add("mag", SINCOLOR);
  MyGraph.drawGraph();  //draws empty graph
  
  fft_test();
  
}

void loop() {
  // put your main code here, to run repeatedly:

}
 
Very nice. Now plot that in dB's, using log10 in the library. dBout = 20 * log10( mag ). Then you will be able to detect any low level spurs in the output since the log function lets you see big and little stuff at the same time.
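
To make the dB suggestion concrete, here is a minimal sketch of the conversion in plain C++. The function name and the `floorMag` clamp are my own assumptions (they are not from the sketch above); the clamp is only there so that a zero magnitude never reaches log10.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Convert a linear FFT magnitude to decibels: dB = 20 * log10(mag).
// floorMag is a hypothetical clamp so that log10(0) can never occur;
// with the default of 1.0, anything at or below 1 count maps to 0 dB.
float magnitudeToDb(float mag, float floorMag = 1.0f) {
  assert(floorMag > 0.0f);
  return 20.0f * std::log10(std::max(mag, floorMag));
}
```

With a helper like this, each `output[j]` would be passed through `magnitudeToDb()` before plotting, and the Y axis scale of the graph would need to be changed from 0..10000 to a dB range.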
 
Been playing a bit more with the CMSIS update using a Teensy MicroMod. This time I used a test case from a MATLAB example: https://github.com/Zafar577/MATLAB-DSP/blob/main/TTT.m. Basically a sine wave plus noise.

I also ported the window functions from the GNU Radio library, which has support for:
Code:
        WIN_NONE = -1,       //!< don't use a window
        WIN_HAMMING = 0,     //!< Hamming window; max attenuation 53 dB
        WIN_HANN = 1,        //!< Hann window; max attenuation 44 dB
        WIN_HANNING = 1,     //!< alias to WIN_HANN
        WIN_BLACKMAN = 2,    //!< Blackman window; max attenuation 74 dB
        WIN_RECTANGULAR = 3, //!< Basic rectangular window; max attenuation 21 dB
        WIN_KAISER = 4, //!< Kaiser window; max attenuation see window::max_attenuation
        WIN_BLACKMAN_hARRIS = 5, //!< Blackman-harris window; max attenuation 92 dB
        WIN_BLACKMAN_HARRIS =
            5,            //!< alias to WIN_BLACKMAN_hARRIS for capitalization consistency
        WIN_BARTLETT = 6, //!< Barlett (triangular) window; max attenuation 26 dB
        WIN_FLATTOP = 7,  //!< flat top window; useful in FFTs; max attenuation 93 dB
        WIN_NUTTALL = 8,  //!< Nuttall window; max attenuation 114 dB
        WIN_BLACKMAN_NUTTALL = 8, //!< Nuttall window; max attenuation 114 dB
        WIN_NUTTALL_CFD =
            9, //!< Nuttall continuous-first-derivative window; max attenuation 112 dB
        WIN_WELCH = 10,  //!< Welch window; max attenuation 31 dB
        WIN_PARZEN = 11, //!< Parzen window; max attenuation 56 dB
        WIN_EXPONENTIAL =
            12, //!< Exponential window; max attenuation see window::max_attenuation
        WIN_RIEMANN = 13, //!< Riemann window; max attenuation 39 dB
        WIN_GAUSSIAN =
            14,         //!< Gaussian window; max attenuation see window::max_attenuation
        WIN_TUKEY = 15, //!< Tukey window; max attenuation see window::max_attenuation
and added that into the test sketch.
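
For anyone who has not used window functions before, here is a rough sketch of what the Hamming case amounts to. This is the textbook symmetric Hamming definition written as plain C++, not the ported GNU Radio code; `makeHammingWindow` and `applyWindow` are names I made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric Hamming window: w[i] = 0.54 - 0.46 * cos(2*pi*i / (N - 1)).
std::vector<float> makeHammingWindow(std::size_t n) {
  assert(n >= 2);  // the formula divides by (n - 1)
  const double kTwoPi = 6.283185307179586;
  std::vector<float> w(n);
  for (std::size_t i = 0; i < n; ++i) {
    w[i] = static_cast<float>(0.54 - 0.46 * std::cos(kTwoPi * i / (n - 1)));
  }
  return w;
}

// Multiply the samples by the window in place before running the FFT.
void applyWindow(std::vector<float>& samples, const std::vector<float>& window) {
  assert(samples.size() == window.size());
  for (std::size_t i = 0; i < samples.size(); ++i) {
    samples[i] *= window[i];
  }
}
```

The window tapers both ends of the sample block towards 0.08, which reduces the spectral leakage you get when the block does not contain a whole number of signal periods.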

Just for reference, MATLAB shows the maximum value at a frequency of 500 Hz.
Capture1.PNG

and using a 32-byte Hamming window I show the same thing.

First chart shows the raw signal (yellow) and windowed signal (red)
IMG-0750.jpg

With spectrum:
IMG-0751.jpg

for which I get:
Code:
Index: 128, freq: 500.000000,  MaxValue: 247.432098
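
The index-to-frequency step in that output line can be sketched as below. The 8000 Hz sample rate is an assumption on my part: it is what makes bin 128 of a 2048-point FFT come out at 500 Hz, but it is not stated in the post. The function name is hypothetical too.

```cpp
#include <cassert>
#include <cstdint>

// Centre frequency of FFT bin `index` for an `fftSize`-point transform:
// freq = index * sampleRate / fftSize.
float binToFrequencyHz(uint32_t index, uint32_t fftSize, float sampleRateHz) {
  assert(fftSize > 0 && index < fftSize);
  return static_cast<float>(index) * sampleRateHz / static_cast<float>(fftSize);
}
```

Under that assumption each bin is 8000 / 2048 = 3.90625 Hz wide, so `binToFrequencyHz(128, 2048, 8000.0f)` gives 500 Hz.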

If interested here is the sketch:
View attachment sig-noise.zip
 
Looks good!

I am sure glad that I don't have that same tendency ;)
 
Since I am the curious type, I wanted to see if the DSP Benchmark sketch or the sketch in post #11 would run on the T3.x. They should run unmodified. What I found was that the T3.x core appears to use only a subset of the FFT functions in the DSP library. I was getting this error:
Code:
fatal error: arm_const_structs.h: No such file or directory
    4 | #include "arm_const_structs.h"
and sure enough that header is only included in the Teensy4 core and not the Teensy3 core - @PaulStoffregen I am assuming that was the intention here?

Anyway, if I update the Teensy3 core the same way I did the Teensy4 core, here are the comparisons for the sketch in post #11:
Code:
Teensy 3.2 (96 Mhz)
---------------------------------------------------
Press anykey to continue
time to copy test array to FFT Array: 132
64
Time to setup window: 1782 (microseconds)
Press anykey to continue
FFT in Frequency Domain next
FFT init 0
Time to perform 2048 FFT: 52388 (microseconds)
Index: 128, freq: 500.000000,  MaxValue: 247.432098
=============================================
Teensy 3.6 (180Mhz)
---------------------------------------------------
Press anykey to continue
time to copy test array to FFT Array: 70
64
Time to setup window: 185 (microseconds)
Press anykey to continue
FFT in Frequency Domain next
FFT init 0
Time to perform 2048 FFT: 1103 (microseconds)
Index: 128, freq: 500.000000,  MaxValue: 247.432114

===========================================
Teensy MicroMod (600Mhz)
---------------------------------------------------
Press anykey to continue
time to copy test array to FFT Array: 9
64
Time to setup window: 36 (microseconds)
Press anykey to continue
FFT in Frequency Domain next
FFT init 0
Time to perform 2048 FFT: 203 (microseconds)
Index: 128, freq: 500.000000,  MaxValue: 247.432098

Unfortunately, if you try to use even the FFT audio example, there is not enough space for it to compile, even if the core is not updated.

Out of curiosity I changed the clock to 150 MHz on the Teensy MicroMod:
Code:
Press anykey to continue
time to copy test array to FFT Array: 35
64
Time to setup window: 147 (microseconds)
Press anykey to continue
FFT in Frequency Domain next
FFT init 0
Time to perform 2048 FFT: 803 (microseconds)
Index: 128, freq: 500.000000,  MaxValue: 247.432098

Still seems to be faster than the T3.6 at 180Mhz, :)
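
To compare the boards independently of clock speed, the FFT times above can be turned into CPU cycles. A small sketch (the function name is my own); it only uses the microsecond figures and clock rates already printed in the logs:

```cpp
#include <cassert>
#include <cstdint>

// Convert an elapsed time in microseconds to CPU clock cycles.
// 1 MHz is one cycle per microsecond, so cycles = microseconds * MHz.
uint32_t microsToCycles(uint32_t elapsedMicros, uint32_t cpuMHz) {
  assert(cpuMHz > 0);
  return elapsedMicros * cpuMHz;
}
```

For the 2048 FFT that works out to 1103 * 180 = 198,540 cycles on the T3.6, 803 * 150 = 120,450 cycles on the MicroMod at 150 MHz, and 203 * 600 = 121,800 cycles at 600 MHz. So on these numbers the MicroMod needs roughly 0.6 times the cycles of the T3.6 for the same FFT, whatever the clock.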
 
Comparisons for the DSP_Benchmark sketch:
Code:
Teensy 3.2 (96 Mhz)
---------------------------------------------------
- arm_mult_f32         -  4.763 us ; // real float32        8
- arm_mult_f32         - 31.081 us ; // real float32       64
- arm_mult_f32         - 121.313 us ; // real float32      256
- arm_mult_f32         - 482.219 us ; // real float32     1024
- arm_mult_q31         -  1.327 us ; // real q31            8
- arm_mult_q31         -  7.760 us ; // real q31           64
- arm_mult_q31         - 29.816 us ; // real q31          256
- arm_mult_q31         - 118.040 us ; // real q31         1024
- arm_mult_q15         -  1.181 us ; // real q15            8
- arm_mult_q15         -  6.152 us ; // real q15           64
- arm_mult_q15         - 23.196 us ; // real q15          256
- arm_mult_q15         - 91.366 us ; // real q15         1024
- arm_sin_cos_f32      - 28.143 us ; // real float32                
- arm_sin_cos_q31      -  2.965 us ; // real q31_t                  
- arm_cfft_radix2_q15  -   55.0 us ; // real q15_t             64
- arm_cfft_radix2_q15  -  255.8 us ; // real q15_t            256
- arm_cfft_radix2_q15  - 1161.7 us ; // real q15_t           1024
- arm_cfft_radix4_q15  -   30.9 us ; // real q15_t             64
- arm_cfft_radix4_q15  -  152.2 us ; // real q15_t            256
- arm_cfft_radix4_q15  -  732.1 us ; // real q15_t           1024
- arm_cfft_radix2_q31  -  113.3 us ; // real q31_t             64
- arm_cfft_radix2_q31  -  570.7 us ; // real q31_t            256
- arm_cfft_radix2_q31  - 2776.5 us ; // real q31_t           1024
- arm_cfft_radix4_q31  -   63.6 us ; // real q31_t             64
- arm_cfft_radix4_q31  -  326.3 us ; // real q31_t            256
- arm_cfft_radix4_q31  - 1592.7 us ; // real q31_t           1024
- arm_cfft_radix2_f32  -  930.8 us ; // real float32_t         64
- arm_cfft_radix2_f32  - 5170.9 us ; // real float32_t        256
- arm_cfft_radix2_f32  - 27314.8 us ; // real float32_t       1024
- arm_cfft_radix4_f32  -  654.5 us ; // real float32_t         64
- arm_cfft_radix4_f32  - 3772.4 us ; // real float32_t        256
- arm_cfft_radix4_f32  - 21554.9 us ; // real float32_t       1024
- arm_cfft_q15         -  732.1 us ; // real q15_t           1024
- arm_cfft_q31         - 1582.2 us ; // real q31_t           1024
- arm_cfft_f32         - 20333.8 us ; // real float32_t       1024
- arm_rfft_fast_f32    -  527.7 us ; // real float32_t         64
- arm_rfft_fast_f32    - 2702.8 us ; // real float32_t        256
- arm_rfft_fast_f32    - 13909.8 us ; // real float32_t       1024


=============================================
Teensy 3.6 (180Mhz)
---------------------------------------------------
- arm_mult_f32         -  0.606 us ; // real float32        8
- arm_mult_f32         -  3.564 us ; // real float32       64
- arm_mult_f32         - 13.705 us ; // real float32      256
- arm_mult_f32         - 54.269 us ; // real float32     1024
- arm_mult_q31         -  0.717 us ; // real q31            8
- arm_mult_q31         -  4.142 us ; // real q31           64
- arm_mult_q31         - 15.884 us ; // real q31          256
- arm_mult_q31         - 62.853 us ; // real q31         1024
- arm_mult_q15         -  0.628 us ; // real q15            8
- arm_mult_q15         -  3.275 us ; // real q15           64
- arm_mult_q15         - 12.348 us ; // real q15          256
- arm_mult_q15         - 48.641 us ; // real q15         1024
- arm_sin_cos_f32      -  0.640 us ; // real float32                
- arm_sin_cos_q31      -  0.941 us ; // real q31_t                  
- arm_cfft_radix2_q15  -   26.8 us ; // real q15_t             64
- arm_cfft_radix2_q15  -  127.5 us ; // real q15_t            256
- arm_cfft_radix2_q15  -  607.5 us ; // real q15_t           1024
- arm_cfft_radix4_q15  -   15.3 us ; // real q15_t             64
- arm_cfft_radix4_q15  -   77.0 us ; // real q15_t            256
- arm_cfft_radix4_q15  -  376.6 us ; // real q15_t           1024
- arm_cfft_radix2_q31  -   59.3 us ; // real q31_t             64
- arm_cfft_radix2_q31  -  303.0 us ; // real q31_t            256
- arm_cfft_radix2_q31  - 1484.3 us ; // real q31_t           1024
- arm_cfft_radix4_q31  -   31.0 us ; // real q31_t             64
- arm_cfft_radix4_q31  -  165.0 us ; // real q31_t            256
- arm_cfft_radix4_q31  -  822.9 us ; // real q31_t           1024
- arm_cfft_radix2_f32  -   42.7 us ; // real float32_t         64
- arm_cfft_radix2_f32  -  220.3 us ; // real float32_t        256
- arm_cfft_radix2_f32  - 1082.1 us ; // real float32_t       1024
- arm_cfft_radix4_f32  -   22.8 us ; // real float32_t         64
- arm_cfft_radix4_f32  -  116.0 us ; // real float32_t        256
- arm_cfft_radix4_f32  -  564.6 us ; // real float32_t       1024
- arm_cfft_q15         -  360.9 us ; // real q15_t           1024
- arm_cfft_q31         -  764.0 us ; // real q31_t           1024
- arm_cfft_f32         -  538.0 us ; // real float32_t       1024
- arm_rfft_fast_f32    -   22.1 us ; // real float32_t         64
- arm_rfft_fast_f32    -   98.5 us ; // real float32_t        256
- arm_rfft_fast_f32    -  388.0 us ; // real float32_t       1024

===========================================
Teensy MicroMod (600Mhz)
---------------------------------------------------
- arm_mult_f32         -  0.093 us ; // real float32        8
- arm_mult_f32         -  0.421 us ; // real float32       64
- arm_mult_f32         -  1.541 us ; // real float32      256
- arm_mult_f32         -  6.022 us ; // real float32     1024
- arm_mult_q31         -  0.182 us ; // real q31            8
- arm_mult_q31         -  1.069 us ; // real q31           64
- arm_mult_q31         -  4.109 us ; // real q31          256
- arm_mult_q31         - 16.269 us ; // real q31         1024
- arm_mult_q15         -  0.153 us ; // real q15            8
- arm_mult_q15         -  0.805 us ; // real q15           64
- arm_mult_q15         -  3.045 us ; // real q15          256
- arm_mult_q15         - 12.006 us ; // real q15         1024
- arm_sin_cos_f32      -  0.170 us ; // real float32                
- arm_sin_cos_q31      -  0.180 us ; // real q31_t                  
- arm_cfft_radix2_q15  -    5.8 us ; // real q15_t             64
- arm_cfft_radix2_q15  -   28.0 us ; // real q15_t            256
- arm_cfft_radix2_q15  -  133.9 us ; // real q15_t           1024
- arm_cfft_radix4_q15  -    2.8 us ; // real q15_t             64
- arm_cfft_radix4_q15  -   14.5 us ; // real q15_t            256
- arm_cfft_radix4_q15  -   71.2 us ; // real q15_t           1024
- arm_cfft_radix2_q31  -    8.7 us ; // real q31_t             64
- arm_cfft_radix2_q31  -   44.0 us ; // real q31_t            256
- arm_cfft_radix2_q31  -  214.5 us ; // real q31_t           1024
- arm_cfft_radix4_q31  -    4.2 us ; // real q31_t             64
- arm_cfft_radix4_q31  -   22.7 us ; // real q31_t            256
- arm_cfft_radix4_q31  -  114.6 us ; // real q31_t           1024
- arm_cfft_radix2_f32  -    5.4 us ; // real float32_t         64
- arm_cfft_radix2_f32  -   27.5 us ; // real float32_t        256
- arm_cfft_radix2_f32  -  135.1 us ; // real float32_t       1024
- arm_cfft_radix4_f32  -    3.2 us ; // real float32_t         64
- arm_cfft_radix4_f32  -   17.1 us ; // real float32_t        256
- arm_cfft_radix4_f32  -   86.3 us ; // real float32_t       1024
- arm_cfft_q15         -   71.0 us ; // real q15_t           1024
- arm_cfft_q31         -  114.6 us ; // real q31_t           1024
- arm_cfft_f32         -   87.0 us ; // real float32_t       1024
- arm_rfft_fast_f32    -    3.4 us ; // real float32_t         64
- arm_rfft_fast_f32    -   15.0 us ; // real float32_t        256
- arm_rfft_fast_f32    -   60.7 us ; // real float32_t       1024
 
Thanks for all the heavy lifting to get DSP libs updated and built!!

I tracked DSP arm_math.h versions from the comments in arm_math.h. In your github, version is "Revision: V.1.5.1", in the CMSIS-DSP/Include/arm_math.h repository it is "@version V1.10.0" and forum post suggests DSP-1.12

I realize it is cosmetic, but what do you think the proper version number is for your arm_math.h?
 
Well, that's kind of an interesting question. arm_math.h is really different than before; in 1.10 it's:
Code:
#include "arm_math_types.h"
#include "arm_math_memory.h"

#include "dsp/none.h"
#include "dsp/utils.h"

#include "dsp/basic_math_functions.h"  
#include "dsp/interpolation_functions.h"
#include "dsp/bayes_functions.h"
#include "dsp/matrix_functions.h"
#include "dsp/complex_math_functions.h"
#include "dsp/statistics_functions.h"
#include "dsp/controller_functions.h"
#include "dsp/support_functions.h"
#include "dsp/distance_functions.h"
#include "dsp/svm_functions.h"
#include "dsp/fast_math_functions.h"
#include "dsp/transform_functions.h"
#include "dsp/filtering_functions.h"
#include "dsp/quaternion_math_functions.h"

Most of what is currently in arm_math.h is now in arm_math_types.h. So I kind of just left it at the current rev.

Probably should spend a bit more time on it, but no one else seemed to be playing along.

The other thing is that for NEON processors it will include Bayes functions, and for HELIUM processors it will include vector math and MVE functions, which I have currently included.

One question for you though: currently it does support float64_t, but that is not defined for Teensy :)

If you have any suggestions or recommendations I can make the changes. Relatively easy to recompile now that I cracked the code :)

Mike
 
Working, but with a bug?

Question: How does this fit together?

Code:
void fft_test(void){
  static arm_rfft_instance_q15 fft_instance;
  static q15_t output[FFT_SIZE*2]; //has to be twice FFT size

  ...

  arm_abs_q15(output, output, FFT_SIZE);

From what I read about CMSIS FFT I suspect the SIZE is wrong, but I'm not an expert.
 
The fft_test example was a port of the example from:
https://m0agx.eu/2018/05/23/practical-fft-on-microcontrollers-using-cmsis-dsp/

After you asked the question I was also looking at this to try and figure it out: https://www.keil.com/pack/doc/CMSIS/DSP/html/group__RealFFT.html#details

Maybe someone more familiar with DSP and FFT can answer for sure. Wish I could help more.
 
An FFT of a real signal is twice the size of the real input. This is because any FFT produces both the real part and the imaginary part of the FFT. When I would use an ARM FFT I would do the complex FFT, but memset the imaginary numbers to zero. The resultant FFT was then the same size as the input. It really doesn't matter which choice you take, either RFFT with 2X output size or a CFFT with 1X output size; the number of output bytes is the same.

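The magnitude of the FFT is the square root of the sum of the squares of the real and imaginary parts, resulting in a single real value for each bin. In CMSIS-DSP that is what arm_cmplx_mag_q15 computes. arm_abs_q15 takes the absolute value of each element on its own, so calling it with FFT_SIZE only touches the first FFT_SIZE interleaved values, i.e. the real and imaginary parts of the first FFT_SIZE/2 bins.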

Hope that helps.
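
To illustrate the difference between an element-wise absolute value and a complex magnitude on an interleaved {re, im, re, im, ...} buffer, here is a small stand-alone sketch in plain C++. It is not the CMSIS-DSP implementation (which works in fixed point); the function names are made up. In CMSIS-DSP terms, the first function corresponds to what arm_abs_q15 does per element and the last to what arm_cmplx_mag_q15 does per bin.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Element-wise absolute value with saturation, one output per input value.
int16_t saturatingAbs(int16_t v) {
  const int wide = std::abs(static_cast<int>(v));
  return static_cast<int16_t>(wide > INT16_MAX ? INT16_MAX : wide);
}

// Magnitude of one complex value: sqrt(re^2 + im^2).
float complexMagnitude(int16_t re, int16_t im) {
  const float r = re;
  const float i = im;
  return std::sqrt(r * r + i * i);
}

// One magnitude per bin from an interleaved buffer holding 2 * bins values.
void magnitudesFromInterleaved(const int16_t* interleaved, float* mags,
                               std::size_t bins) {
  assert(interleaved != nullptr && mags != nullptr);
  for (std::size_t k = 0; k < bins; ++k) {
    mags[k] = complexMagnitude(interleaved[2 * k], interleaved[2 * k + 1]);
  }
}
```

With the element-wise version, a peak found at array index 44 sits in the real part of bin 44 / 2 = 22; with the per-bin magnitude the index is the bin number directly.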
 
Thanks @clinker8, was thinking that but didn't want to guess. Again - thanks for the explanation.
 
Here's an example of radar data captured and processed with CMSIS FFTs and converted to dB. I captured and processed this data on an M4 device this morning. This is the response of a Doppler radar to a tuning fork. I made this Doppler radar from an X-band door opener module and an analog front end that I designed. Sample rate = 60 kHz, 1K floating-point FFTs, running continuously, one FFT per 17.05 ms. Automatic detection and velocity estimation. ADC data collection used DMA and ping-pong buffers: one buffer was processed while the other was filling up. All the signal processing was done in 17.05 ms or less. The display of detections was updated asynchronously.
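For anyone who hasn't used ping-pong buffers, the bookkeeping can be sketched roughly as below. This is only an illustration of the idea, not the code from the radar project: the names (`adc_buf`, `on_dma_block_complete`, `take_ready_block`) are hypothetical, and the DMA/ADC setup itself is device specific and left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 1024  /* samples per FFT (1K, as in the post) */

static uint16_t adc_buf[2][BLOCK_SIZE];    /* "ping" and "pong" */
static volatile int fill_idx = 0;          /* buffer the DMA is filling now */
static volatile bool block_ready = false;  /* a full buffer is waiting */

/* Hypothetical hook: call this from the DMA transfer-complete interrupt. */
void on_dma_block_complete(void)
{
    fill_idx ^= 1;       /* DMA moves on to the other buffer */
    block_ready = true;  /* the buffer it just left is ready to process */
}

/* Called from the main loop. Returns the buffer that is safe to process,
 * or NULL if no completed block is pending. */
const uint16_t *take_ready_block(void)
{
    if (!block_ready) {
        return NULL;
    }
    block_ready = false;
    return adc_buf[fill_idx ^ 1];  /* the one NOT being filled */
}
```

The main loop polls `take_ready_block()` and has to finish the FFT and detection before the next block completes, which is the per-block time budget mentioned above.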

The peak detected line is about 60 dB greater than the median noise floor (1,000,000 times more power than the noise). I'm afraid one wouldn't know anything about the rest of the waveform had it been plotted on a linear vertical scale. Vertical scale is 10 dB per division. Basically, this is why hard-core radar guys always use log-scaled data.

Wish I could say I did this on a Teensy, but at the time I built this three years ago, I hadn't an inkling of Teensys. I used the internal "12 bit" converter on the processor. It detects small objects, like an airgun pellet in flight, at up to 1250 feet per second. Just an example of what can be done with the CMSIS DSP library.
 

Attachment: PXL_20221213_155416702_1080.jpg

Very cool.

You might want to check this project out - along the same lines but using a Teensy: https://forum.pjrc.com/threads/4562...mance-of-cheap-doppler-radar?highlight=x-band

Started playing but then dropped it.

That's where I got the initial idea. But the analog circuitry is not up to the task of detecting very small objects at 1250 FPS. That corresponds to roughly 30 kHz. JBeale's circuit is good for a couple of kHz at most, if I recall correctly. I designed an active 7-pole analog Chebyshev filter with at least 80 dB rejection at frequencies greater than fs/2. This greatly reduces the noise from fs/2 to fs that would otherwise fold over into the region 0 to fs/2. This is essential to reduce the noise in the system, which increases sensitivity.
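The speed-to-frequency relationship here is the usual two-way Doppler formula, f_d = 2 * v * f0 / c. A small sketch follows; note that the 10.525 GHz carrier in the usage note is an assumption (a common frequency for X-band door opener modules, not something stated in this thread), and the function names are made up.

```c
/* Speed of light in m/s. */
#define SPEED_OF_LIGHT_M_S 299792458.0

/* Feet per second to metres per second. */
double fps_to_mps(double fps)
{
    return fps * 0.3048;
}

/* Two-way Doppler shift in Hz for a target moving radially at
 * 'speed_mps' towards or away from a radar with carrier 'carrier_hz'. */
double doppler_shift_hz(double speed_mps, double carrier_hz)
{
    return 2.0 * speed_mps * carrier_hz / SPEED_OF_LIGHT_M_S;
}
```

With the assumed 10.525 GHz carrier, `doppler_shift_hz(fps_to_mps(1250.0), 10.525e9)` comes out a little under 27 kHz, which is in line with the "roughly 30 kHz" above and still below fs/2 for a 60 kHz sample rate.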

Mr. Beale's design used a simple filter with a couple of poles and suffers from significant noise fold-over. I have great admiration for his initial work. His work inspired me to try to roll my own, using the same radar module. I added automatic target detection and a parabolic peak estimator for sub-FFT-bin speed determination. Mr. Beale's circuit was a fine demonstrator, but was designed for a less arduous task - showing that pedestrians are seen by a radar. The target cross sections I am detecting are 100 times smaller and they are traveling 100 times faster. Totally different design goals and significantly greater demands on the design. It took me quite a while to get that filter to work right. High-order analog filter design, at least for me, is tough. It's no wonder it has been supplanted by digital filtering. Automatic detection means I get a direct readout of the speed of the object at far greater precision, without me trying to interpret data.
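For readers unfamiliar with parabolic peak estimation, here is a minimal sketch of the common three-point version: fit a parabola through the peak bin and its two neighbours and take the vertex as the fractional bin. This is a generic illustration, not necessarily the estimator used in the radar above, and the function names are hypothetical. It is often applied to log-magnitude (dB) values.

```c
#include <stddef.h>

/* Hypothetical helper: refine a spectral peak found at bin k by fitting a
 * parabola through mag[k-1], mag[k], mag[k+1]. Returns the fractional bin
 * of the vertex. Falls back to k at the array edges or on a flat top. */
float parabolic_peak_bin(const float *mag, size_t n, size_t k)
{
    if (k == 0 || k + 1 >= n) {
        return (float)k;              /* no neighbour on one side */
    }
    float a = mag[k - 1];
    float b = mag[k];
    float c = mag[k + 1];
    float denom = a - 2.0f * b + c;
    if (denom == 0.0f) {
        return (float)k;              /* three equal points, no curvature */
    }
    return (float)k + 0.5f * (a - c) / denom;
}

/* Convert a (possibly fractional) bin index to frequency in Hz. */
float bin_to_hz(float bin, float sample_rate_hz, size_t fft_size)
{
    return bin * sample_rate_hz / (float)fft_size;
}
```

With a 60 kHz sample rate and a 1024-point FFT the bin spacing is about 58.6 Hz, so a fractional bin estimate gives a correspondingly finer speed readout.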

Now in full disclosure, I had a long career in radar, so I'm familiar with the signal processing techniques, detection theory, and parameter estimation. It was what I did for a living. For me, the hard part was the DMA and the ADC stuff, along with the ping-pong buffer processing, not the radar aspects. All in all, it has been good fun. If I had had any experience with Teensys three years ago, I would have implemented it on one. But at the time, I had zero hands-on experience. But in my favor :) I have seen the light, and used a Teensy 4.1 for the electronic lead screw on my lathe. Got that working this November. Really pleased with the ease of development and the level of support here on the PJRC forums.
 