I just posted 1.15-rc1 with the math library included.
I tweaked the header files slightly, so they always work for Teensy 3.0 regardless of whether you define ARM_MATH_CM4 or other stuff. I tested only briefly, and only on Linux, with a few of Pete's samples.
Please give 1.15-rc1 a try. I'm very open to ideas on how this math stuff should be included. My hope is to keep the interface stable once 1.15 is fully released, so please give it a try now while there's still time to easily make changes.
Hi Paul,
I tested the CMSIS examples on the new beta and they worked fine. I'm trying to add an analogRead to the example and I'm getting some linking errors, which I'm not familiar with. Without the analogRead it links OK.
Thanks,
mauricio
Code:
#define ARM_MATH_CM4
#include "arm_math.h"
#define TEST_LENGTH_SAMPLES 2048
#include "arm_fft_sine_data.h"
// NOTE: q15t is int16_t in arm_math.h
uint32_t fftSize = 512;
/* ------------------------------------------------------------------
* Global variables for FFT Bin Example
* ------------------------------------------------------------------- */
uint32_t ifftFlag = 0;
uint32_t doBitReverse = 1;
uint32_t testInputIndex = 0;
float32_t testInput[TEST_LENGTH_SAMPLES];
static float32_t testOutput[TEST_LENGTH_SAMPLES/2];

void setup() {
  Serial.begin(19200);
  pinMode(13, OUTPUT);
  for (int i=0; i < 10; i++) {
    Serial.println(" start program ");
    delay(1000);
  }
}

bool pit3Triggered = false;

extern "C" {
  //! Audio input interrupt handler running at 15kHz
  void pit3_isr(void) {
    pit3Triggered = true;
    digitalWrite(13, HIGH);
    digitalWrite(13, LOW);
    PIT_TFLG3 = 1;
  }

  void startup_late_hook(void) {
    // This is called from mk20dx128.c
    //Turn on interrupts:
    SIM_SCGC6 |= SIM_SCGC6_PIT; // turn on PIT
    PIT_MCR = 0x00;
    NVIC_ENABLE_IRQ(IRQ_PIT_CH3);
    PIT_LDVAL3 = 3200 - 1; // setup timer 2 for frame timer period (15kHz) = 48MHz / 15kHz
    PIT_TCTRL3 = 0x2; // enable Timer 3 interrupts
    PIT_TCTRL3 |= 0x1; // start Timer 3
    PIT_TFLG3 |= 1;
  }
}

void loop() {
  float32_t maxValue;
  float32_t length = 256.0;
  if (pit3Triggered) {
    int sample;
    sample = analogRead (14);
    testInput[testInputIndex] = sample / 1024 * 10.0;
    testInputIndex++;
    if (testInputIndex >= TEST_LENGTH_SAMPLES) {
      testInputIndex = 0;
    }
  }
  if (testInputIndex == 0) {
    arm_cfft_radix4_instance_f32 fft_inst; /* CFFT Structure instance */
    arm_cfft_radix4_init_f32(&fft_inst, length, ifftFlag, doBitReverse);
    uint32_t startTime, fftTime, magTime, maxTime;
    Serial.println("Start");
    startTime = millis();
    /* Process the data through the CFFT/CIFFT module */
    arm_cfft_radix4_f32(&fft_inst, testInput_f32_10khz);
    fftTime = millis();
    /* Process the data through the Complex Magnitude Module for calculating the magnitude at each bin */
    arm_cmplx_mag_f32(testInput_f32_10khz, testOutput, fftSize);
    magTime = millis();
    /* Calculates maxValue and returns corresponding BIN value */
    arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);
    maxTime = millis();
    Serial.println("End");
    Serial.println(fftTime - startTime);
    Serial.println(magTime - fftTime);
    Serial.println(maxTime - magTime);
    Serial.println("TOTAL: ");
    Serial.println(maxTime - startTime);
    Serial.print("MaxValue: ");
    Serial.println(maxValue);
    Serial.print("MaxIndex: ");
    Serial.println(testIndex);
    Serial.print("Magnitudes: ");
    for (int j=0; j < length / 2; j++) {
      Serial.print(j);
      Serial.print(", ");
      Serial.println(testOutput[j]);
    }
  }
}
Code:
/Applications/Development/Arduino/Arduino1.0.5.app/Contents/Resources/Java/hardware/tools/arm-none-eabi/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/bin/ld: fftTest.cpp.elf section `.bss' will not fit in region `RAM'
/Applications/Development/Arduino/Arduino1.0.5.app/Contents/Resources/Java/hardware/tools/arm-none-eabi/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/bin/ld: region `RAM' overflowed by 7916 bytes
collect2: error: ld returned 1 exit status
I have never used this arm_math library before, but I'm following this thread with great interest.
Hopefully I can help out here. Mauricio, I think your code just runs out of ram.
You did the following:
Code:
#define TEST_LENGTH_SAMPLES 2048
float32_t testInput[TEST_LENGTH_SAMPLES];
static float32_t testOutput[TEST_LENGTH_SAMPLES/2];
A float32_t is four bytes, so you allocate 2048 * 4 + (2048 / 2) * 4 = 12288 bytes here.
I could not find the arm_fft_sine_data.h file, but I guess it looks like arm_fft_bin_data.c, which would mean another 8192 bytes (2048 floats). That would put you a few thousand bytes over the 16384 bytes of RAM the Teensy 3.0 has available.
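The RAM estimate can be written out as a few compile-time constants. This is only a sketch: the 2048-entry size of the table in arm_fft_sine_data.h is the same guess as above (the file was not posted), and all the `k...` names are made up for illustration. It ignores the stack and whatever the core libraries allocate, which is presumably why the linker reports a larger overflow than these three arrays alone explain.

```cpp
#include <cstddef>

// Teensy 3.0 RAM size, as stated in the post above.
constexpr std::size_t kRamBytes = 16384;
constexpr std::size_t kTestLengthSamples = 2048;  // TEST_LENGTH_SAMPLES

// float32_t in arm_math.h is a plain 32-bit float.
static_assert(sizeof(float) == 4, "this estimate expects a 4-byte float");

constexpr std::size_t kInputBytes  = kTestLengthSamples * sizeof(float);        // testInput
constexpr std::size_t kOutputBytes = (kTestLengthSamples / 2) * sizeof(float);  // testOutput
// ASSUMPTION: the data table in arm_fft_sine_data.h holds 2048 floats.
constexpr std::size_t kTableBytes  = 2048 * sizeof(float);

constexpr std::size_t kStaticBytes = kInputBytes + kOutputBytes + kTableBytes;

// Bytes by which these three arrays alone exceed the RAM (0 if they fit).
constexpr std::size_t kOverflowBytes =
    kStaticBytes > kRamBytes ? kStaticBytes - kRamBytes : 0;
```

Under that assumption the three arrays alone need 20480 bytes, 4096 more than the chip has, before anything else is counted.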
By the way, your length variable is a float, which is unnecessary:
Code:
float32_t length = 256.0;
arm_cfft_radix4_init_f32(&fft_inst, length, ifftFlag, doBitReverse);
The length argument should be a uint16_t according to the header file.
In the following code:
Code:
/* Calculates maxValue and returns corresponding BIN value */
arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);
You copy the fast Fourier transform result to testOutput, but fftSize is only 512, while testOutput was allocated with 2048 / 2 = 1024 entries, so you allocate double the size you need. (I'm not sure how the library deals with the symmetry of the FFT, though. As you only use an FFT length of 256, the part from 255:511 will most likely be a mirror of the left part, so perhaps it would make sense to copy only the initial 256 entries.)
Hopefully this helps. Perhaps someone who has experience with this library can say how large the output vector should be.
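On the question of buffer sizes: the CMSIS complex FFT works in place on an interleaved buffer (real, imaginary, real, imaginary, ...), so an N-point transform takes 2 * N floats, and arm_cmplx_mag_f32 then writes one magnitude per complex bin. For a purely real input signal the upper half of the magnitudes mirrors the lower half. A small sketch of that bookkeeping, with helper names that are made up for this example:

```cpp
#include <cstddef>

// Floats in the in-place buffer of an N-point complex FFT,
// interleaved as re0, im0, re1, im1, ...
constexpr std::size_t cfftInputFloats(std::size_t fftLen) { return 2 * fftLen; }

// Floats written by the magnitude step: one per complex bin.
constexpr std::size_t magnitudeFloats(std::size_t fftLen) { return fftLen; }

// For real-valued input, only bins 0 .. N/2 carry independent information.
constexpr std::size_t uniqueBinsForRealInput(std::size_t fftLen) {
  return fftLen / 2 + 1;
}
```

For the 256-point transform set up in the sketch that means 512 input floats and 256 magnitudes, so the length passed to the init call and the count passed to the magnitude and max calls would normally be the same value.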
Last edited by iwanders; 06-07-2013 at 11:26 AM.
Thank you for finding the error in my ways. I was indeed careless and shouldn't post so late (only in the mornings from now on).
Fixed and working.
Thanks again,
mauricio
I came across a Freescale application note about CMSIS on ARM Cortex M4 which seems like a useful introduction
http://www.freescale.com/files/micro...ote/AN4489.pdf
Today I saw this topic, and since I'm playing with FreeIMU and Madgwick I did a little benchmark to see if things could be sped up with the DSP.
Maybe the vector math could be sped up drastically.
Here's a "quick and dirty" benchmark for sqrt and 1/sqrt.
Results first (-Os, 96 MHz, Teensy 3.1):
Code:
1000x dspSqrt:7580us. Result:21065.8378906
1000x sqrt :11700us. Result:21065.8378906
1000x 1 / dspSqrt:9302us. Result :4294967295.
1000x 1 / sqrt :23203us. Result :4294967295.
1000x invSqrt :4576us. Result:4294967295.
Source:
Code:
#include <math.h>
#include <arm_math.h>

HardwareSerial Uart = HardwareSerial();

// Square root via the CMSIS-DSP helper
inline float dspSqrt(float x){
  float result;
  arm_sqrt_f32(x, &result);
  return result;
}

// Fast inverse square root with one Newton-Raphson step
inline float invSqrt(float x) {
  float halfx = 0.5f * x;
  float y = x;
  long i= *(long*)&y;
  i = 0x5f375a86 - (i>>1);
  y = *(float*)&i;
  y = y * (1.5f - (halfx * y * y));
  return y;
}

inline float dspInvSqrt(float x){
  float result;
  arm_sqrt_f32(x, &result);
  return 1 / result;
}

void setup() {
  Uart.begin(115200);
}

void loop(){
  int time;
  volatile float f;

  time = micros(); f=0;
  for (int i=0; i<1000; i++) { f += dspSqrt(i); }
  time = micros() -time;
  Uart.print("\r\n1000x dspSqrt:"); Uart.print(time); Uart.print("us. Result:"); Uart.println(f,7);

  time = micros(); f=0;
  for (int i=0; i<1000; i++) { f += sqrt(i); }  // was dspSqrt(i), which timed the wrong function
  time = micros() -time;                        // was missing, so the absolute time was printed
  Uart.print("1000x sqrt :"); Uart.print(time); Uart.print("us. Result:"); Uart.println(f,7);

  time = micros(); f=0;
  for (int i=0; i<1000; i++) { f += dspInvSqrt(i); }
  time = micros() -time;
  Uart.print("\r\n1000x 1 / dspSqrt:"); Uart.print(time); Uart.print("us. Result :"); Uart.println(f);

  time = micros(); f=0;
  for (int i=0; i<1000; i++) { f += 1/sqrt(i); }
  time = micros() -time;
  Uart.print("1000x 1 / sqrt :"); Uart.print(time); Uart.print("us. Result :"); Uart.println(f);

  time = micros(); f=0;
  for (int i=0; i<1000; i++) { f += invSqrt(i); }
  time = micros() -time;
  Uart.print("1000x invSqrt :"); Uart.print(time); Uart.print("us. Result:"); Uart.println(f);

  while(1);
}
...but invSqrt is faster than arm_math (?)
Last edited by Frank B; 04-15-2014 at 04:50 PM.
That invSqrt() looks like an older version of FreeIMU. Please get the latest from here:
https://github.com/PaulStoffregen/FreeIMU
Part of the reason it's so fast is its low accuracy. It only performs one iteration of the Newton-Raphson approximation.
Thank you, Paul! The C++ warning regarding "evil" operations is gone now; speed is identical.
I updated my "benchmark".
I think I'll use the "dspSqrt" version from above for my new project; it is not much slower but gives better results (I hope).
But there is another warning:
Code:
In file included from C:\Arduino\hardware\teensy\cores\teensy3/WProgram.h:15:0,
                 from C:\Arduino\hardware\teensy\cores\teensy3/Arduino.h:1,
                 from arm_math.ino:5:
C:\Arduino\hardware\teensy\cores\teensy3/wiring.h:42:0: warning: "PI" redefined [enabled by default]
In file included from arm_math.ino:3:0:
C:\Arduino\hardware\teensy\cores\teensy3/arm_math.h:303:0: note: this is the location of the previous definition
---
I don't want to optimize too much, because I'm sure now that the Teensy is fast enough.
Reading the MPU6050 & HMC5883L over I2C (400 kHz) plus the "Madgwick AHRS" 9-axis algorithm plus a few other tasks takes only 1.3 ms so far. Plenty of time left for other things.
My goal is to build my "Balancing Bot V3" (V1 with Raspberry + Arduino Nano (Mega328) here: http://www.youtube.com/watch?v=n-noFwc23y0 or Blog; V2 was the same but without the Raspberry),
with more features and eventually, this time, with only one wheel.
The teensy 3 is great !!
Last edited by Frank B; 04-15-2014 at 08:39 PM.
Wow.. playing with the dsp is fun :-)
This:
Code:
inline void deg2rad_vect(float32_t *fvect){
  float32_t m[3] = { M_PI / 180, M_PI / 180, M_PI / 180 };
  arm_mult_f32( m, &fvect[0], &fvect[0], 3);
}
is 10 times faster than
f[0] = f[0] * M_PI / 180;
f[1] = f[1] * M_PI / 180;
f[2] = f[2] * M_PI / 180;
I personally don't need these optimizations, but it's fun to find out what the DSP can do.
I think there are many more things to "teensy-"optimize in FreeIMU. AHRSupdate() is worth a look.
If somebody is interested we can open a new thread.
I'm currently working on too many other things to do much with FreeIMU lately. But if you fork the github code, just send any well tested changes as pull requests and I'll merge them.
https://github.com/PaulStoffregen/FreeIMU
Hi, here are first changes:
https://github.com/FrankBoesing/free...aster...master
20% speedup of the calculation. It is not entirely tested, but it should give the same results.
Unfortunately I can't test it with "real" flying hardware.
Last edited by Frank B; 04-18-2014 at 09:07 PM.
Hi,
It seems that the CMSIS lib that comes with Teensyduino is version 1.1.0. They are now at 1.4.4 and I'm very interested in using the newer and more convenient complex FFT functions (where you don't have to init and it will automatically select radix). I have tried to use an updated CMSIS lib, but without luck.
What I did:
1. Downloaded the latest CMSIS-DSP.
2. Changed the boards.txt file so that teensy3.build.additionalobject1 links the new libarm_cortexM4l_math.a.
3. Updated the arm_math and core header files in hardware/teensy/cores/teensy3/.
4. Included the teensy3 "fix" in arm_math (I could not see any Teensy-related edits in the other header files).
Something like this now compiles:
Code:
arm_cfft_instance_f32 fft_inst;
arm_cfft_f32(&fft_inst, buffer_f, 0, 1);
But the actual FFT calculation brings the MCU to a grinding halt. Nothing happens after that. Any suggestions are greatly appreciated.
Cheers,
Lars
I tried one of the newer versions some time ago. Unfortunately, it expanded the maximum FFT size by increasing the size of the lookup tables by 4X (even when computing a smaller FFT), so the compiled code could not fit into Teensy 3.0 or 3.1.
Thanks, Paul. But unless I did something wrong or missed something, it actually compiled and fitted into the Teensy 3.0. The sketch would run as normal until trying to use the new arm_cfft_x function. Please forgive any ignorance here - this is not within my comfort zone!