T3: linking libarm_cortexM3l_math.a

adrianfreed · Jun 5, 2013

mbustosorg said:
As for the market, in the Bay Area artistic community, this kind of audio analysis / visualization is what I'm hearing a lot about. The basic functionality with Arduino is all well and good but people are clamoring for something that can bridge the gap (like adrianfreed talks about) and keep the small footprint.

Yes, mbustosorg, watch out for CNMAT to host a BARCMUT meeting to pull this community together later in the summer.

PaulStoffregen · Jun 6, 2013

I just posted 1.15-rc1 with the math library included.

I tweaked the header files slightly, so they always work for Teensy 3.0 regardless of whether you define ARM_MATH_CM4 or other stuff. I tested only briefly, and only on Linux, with a few of Pete's samples.

Please give 1.15-rc1 a try. I'm very open to ideas on how this math stuff should be included. My hope is to keep the interface stable once 1.15 is fully released, so please give it a try now while there's still time to easily make changes.

mbustosorg · Jun 6, 2013

Hi Paul,

I tested the CMSIS examples on the new beta and they worked fine. I'm trying to add an analogRead to the example and I'm getting some linking errors that I'm not familiar with. Without the analogRead it links OK.

Thanks,
mauricio

Code:

#define ARM_MATH_CM4
#include "arm_math.h"
 
#define TEST_LENGTH_SAMPLES 2048
#include "arm_fft_sine_data.h"

// NOTE: q15_t is int16_t in arm_math.h
uint32_t fftSize = 512; 
 
/* ------------------------------------------------------------------ 
* Global variables for FFT Bin Example 
* ------------------------------------------------------------------- */ 
uint32_t ifftFlag = 0; 
uint32_t doBitReverse = 1; 
 
uint32_t testInputIndex = 0;
float32_t testInput[TEST_LENGTH_SAMPLES];
static float32_t testOutput[TEST_LENGTH_SAMPLES/2]; 
 
void setup() {
  Serial.begin(19200);
  pinMode(13, OUTPUT);
  for (int i=0; i < 10; i++) {
    Serial.println(" start program ");
    delay(1000);
  }
}
 
volatile bool pit3Triggered = false;  // written in pit3_isr, read in loop()

extern "C" {
//! Audio input interrupt handler running at 15kHz
void pit3_isr(void)
{
  pit3Triggered = true;
  digitalWrite(13, HIGH);
  digitalWrite(13, LOW);
  PIT_TFLG3 = 1;
}

void startup_late_hook(void) {
  // This is called from mk20dx128.c
  //Turn on interrupts:
  SIM_SCGC6 |= SIM_SCGC6_PIT;
  // turn on PIT
  PIT_MCR = 0x00;
  NVIC_ENABLE_IRQ(IRQ_PIT_CH3);
  
  PIT_LDVAL3 = 3200 - 1; // set up timer 3 for the frame timer period (15kHz): 48MHz / 15kHz = 3200 counts
  PIT_TCTRL3 = 0x2; // enable Timer 3 interrupts
  PIT_TCTRL3 |= 0x1; // start Timer 3
  PIT_TFLG3 |= 1;
}
}

void loop() {
 
  float32_t maxValue; 
  float32_t length = 256.0;
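  if (pit3Triggered) {
	pit3Triggered = false;  // clear the flag so each timer tick yields one sample
	int sample;
	sample = analogRead (14);
	// divide in floating point: sample / 1024 in integer math is 0 for any 10-bit reading
	testInput[testInputIndex] = sample / 1024.0f * 10.0f;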
	testInputIndex++;
	if (testInputIndex >= TEST_LENGTH_SAMPLES) {
	  testInputIndex = 0;
	}
  }
    
  if (testInputIndex == 0) {
	arm_cfft_radix4_instance_f32 fft_inst;  /* CFFT Structure instance */
	arm_cfft_radix4_init_f32(&fft_inst, length, ifftFlag, doBitReverse);
  
	uint32_t startTime, fftTime, magTime, maxTime;
	Serial.println("Start"); 
	startTime = millis();
	/* Process the data through the CFFT/CIFFT module */ 
	arm_cfft_radix4_f32(&fft_inst, testInput_f32_10khz);
	fftTime = millis();
	/* Process the data through the Complex Magnitude Module for  
	   calculating the magnitude at each bin */ 
	arm_cmplx_mag_f32(testInput_f32_10khz, testOutput, fftSize);  
	magTime = millis();
	/* Calculates maxValue and returns corresponding BIN value */ 
	arm_max_f32(testOutput, fftSize, &maxValue, &testIndex); 
	maxTime = millis();
	Serial.println("End");  

	Serial.println(fftTime - startTime);
	Serial.println(magTime - fftTime);
	Serial.println(maxTime - magTime);
	Serial.println("TOTAL: ");
	Serial.println(maxTime - startTime);

	Serial.print("MaxValue: ");
	Serial.println(maxValue);
	Serial.print("MaxIndex: ");
	Serial.println(testIndex);

	Serial.print("Magnitudes: ");
	for (int j=0; j < length / 2; j++) {
	  Serial.print(j);
	  Serial.print(", ");
	  Serial.println(testOutput[j]);
	}
  }  
}

Code:

/Applications/Development/Arduino/Arduino1.0.5.app/Contents/Resources/Java/hardware/tools/arm-none-eabi/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/bin/ld: fftTest.cpp.elf section `.bss' will not fit in region `RAM'
/Applications/Development/Arduino/Arduino1.0.5.app/Contents/Resources/Java/hardware/tools/arm-none-eabi/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/bin/ld: region `RAM' overflowed by 7916 bytes
collect2: error: ld returned 1 exit status

iwanders · Jun 7, 2013

I have never used this arm_math library before, but I'm following this thread with great interest.

Hopefully I can help out here. Mauricio, I think your code simply runs out of RAM.
You did the following:

Code:

#define TEST_LENGTH_SAMPLES 2048
float32_t testInput[TEST_LENGTH_SAMPLES];
static float32_t testOutput[TEST_LENGTH_SAMPLES/2];

A float32_t is four bytes, so you allocate 2048 * 4 + (2048 / 2 ) * 4 = 12288 bytes here.

I could not find the arm_fft_sine_data.h file, but I guess it looks like arm_fft_bin_data.c, which would mean another 8192 bytes (2048 * 4). That would put you a few thousand bytes over the 16384 bytes the Teensy 3.0 has available.
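The arithmetic above can be checked at compile time. This is only a minimal sketch: the array sizes and the 16384-byte RAM figure come from this thread, the size of the table in arm_fft_sine_data.h is a guess (2048 floats, as in arm_fft_bin_data.c), and all names are made up for illustration.

```cpp
#include <cstddef>

// Figures quoted in this thread; the names are illustrative only.
constexpr std::size_t kRamBytes = 16384;          // Teensy 3.0 RAM
constexpr std::size_t kTestLengthSamples = 2048;  // TEST_LENGTH_SAMPLES

// Bytes used by an array of `count` single-precision floats (float32_t).
constexpr std::size_t floatArrayBytes(std::size_t count) {
  return count * sizeof(float);
}

constexpr std::size_t kInputBytes  = floatArrayBytes(kTestLengthSamples);      // testInput
constexpr std::size_t kOutputBytes = floatArrayBytes(kTestLengthSamples / 2);  // testOutput
// ASSUMPTION: the table in arm_fft_sine_data.h also holds 2048 floats.
constexpr std::size_t kTableBytes  = floatArrayBytes(kTestLengthSamples);

constexpr std::size_t kTotalBytes = kInputBytes + kOutputBytes + kTableBytes;

// Number of bytes by which `used` exceeds `available` (0 if it fits).
constexpr std::size_t overflowBytes(std::size_t used, std::size_t available) {
  return used > available ? used - available : 0;
}

static_assert(sizeof(float) == 4, "this budget assumes 4-byte floats");
```

Under these assumptions the three arrays alone exceed RAM by overflowBytes(kTotalBytes, kRamBytes) = 4096 bytes. The linker reports a larger number, presumably because it also counts every other variable placed in RAM by the core and libraries.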

By the way, your length variable is a float, which is unnecessary:

Code:

float32_t length = 256.0;
arm_cfft_radix4_init_f32(&fft_inst, length, ifftFlag, doBitReverse);

The length argument should be a uint16_t according to the header file.

In the following code:

Code:

/* Calculates maxValue and returns corresponding BIN value */ 
	arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);

You copy the fast Fourier transform result to testOutput, but fftSize is only 512, while testOutput was allocated with 2048 / 2 = 1024 entries, so you allocate double the size you need. (I'm not sure how the library deals with the symmetry of the FFT, though. Since you only use an FFT length of 256, the part from 255:511 will most likely be a mirror of the left part, so perhaps it would make sense to copy only the initial 256 entries.)
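The mirror effect mentioned above can be illustrated without the library. The sketch below uses a naive DFT written for this example (it is not how the CMSIS functions work internally, and dftMagnitudes and makeSine are hypothetical helper names). For a purely real input of N samples, bin N-k has the same magnitude as bin k, so only the first N/2 magnitudes carry distinct information.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT of a real-valued signal; returns the magnitude of each bin.
// Illustrative only, far too slow for real use on a microcontroller.
std::vector<float> dftMagnitudes(const std::vector<float>& x) {
  const std::size_t n = x.size();
  const double twoPi = 2.0 * std::acos(-1.0);
  std::vector<float> mag(n);
  for (std::size_t k = 0; k < n; ++k) {
    std::complex<double> sum(0.0, 0.0);
    for (std::size_t t = 0; t < n; ++t) {
      const double angle = -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
      sum += static_cast<double>(x[t]) * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    mag[k] = static_cast<float>(std::abs(sum));
  }
  return mag;
}

// Real test tone: `cycles` whole periods of a sine across `n` samples.
std::vector<float> makeSine(std::size_t n, unsigned cycles) {
  const double twoPi = 2.0 * std::acos(-1.0);
  std::vector<float> x(n);
  for (std::size_t t = 0; t < n; ++t) {
    x[t] = static_cast<float>(
        std::sin(twoPi * cycles * static_cast<double>(t) / static_cast<double>(n)));
  }
  return x;
}
```

Calling dftMagnitudes(makeSine(64, 5)) gives a peak at bin 5 and an equal peak at bin 59, which is why searching only the first half of the magnitude array is usually enough for real input.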

Hopefully this helps, perhaps someone who has experience using this library can help out on how large the output vector should be.

mbustosorg · Jun 7, 2013

Thank you for finding the error in my ways. I was indeed careless and shouldn't post so late (only in the mornings from now on).

Fixed and working.

Thanks again,
mauricio

Nantonos · Jun 18, 2013

I came across a Freescale application note about CMSIS on ARM Cortex M4 which seems like a useful introduction
http://www.freescale.com/files/microcontrollers/doc/app_note/AN4489.pdf

Frank B · Apr 15, 2014

Today I saw this topic, and since I'm playing with FreeIMU and Madgwick I did a little benchmark to see if things could be sped up with the DSP.
Maybe the vector math could be sped up drastically.

Here's a "quick and dirty" benchmark for sqrt and 1/sqrt.
Results first (-Os, 96 MHz, Teensy 3.1):

Code:

1000x dspSqrt:7580us.  Result:21065.8378906
1000x sqrt   :11700us.  Result:21065.8378906

1000x 1 / dspSqrt:9302us.  Result  :4294967295.
1000x 1 / sqrt   :23203us.  Result  :4294967295.
1000x invSqrt    :4576us.  Result:4294967295.

Source:

Code:

#include <math.h>
#include <arm_math.h>

HardwareSerial Uart = HardwareSerial();

inline float dspSqrt(float x){
	float result;
	arm_sqrt_f32(x, &result);
	return result;
}


inline float invSqrt(float x) {
	float halfx = 0.5f * x;
	float y = x;
	long i= *(long*)&y;
	i = 0x5f375a86 - (i>>1);
	y = *(float*)&i;
	y = y * (1.5f - (halfx * y * y));
	return y;	
}

inline float dspInvSqrt(float x){
	float result;
	arm_sqrt_f32(x, &result);
	return 1 / result;
}

void setup() {
	Uart.begin(115200);
}


void loop(){
 int time;
 volatile float f;
 
 
 time = micros();
 f=0;
 for (int i=0; i<1000; i++) {
	f += dspSqrt(i);
 }
 time = micros() -time;
 
 Uart.print("\r\n1000x dspSqrt:");
 Uart.print(time);
 Uart.print("us.  Result:");
 Uart.println(f,7);
 
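 time = micros();
 f=0;
 for (int i=0; i<1000; i++) {
	f += sqrt(i);  // time the standard library sqrt here, not dspSqrt
 }
 time = micros() -time;  // turn the start timestamp into elapsed time
 Uart.print("1000x sqrt   :");
 Uart.print(time);
 Uart.print("us.  Result:");
 Uart.println(f,7);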
 
 
 
 
 time = micros();
 f=0;
 for (int i=0; i<1000; i++) {
	f += dspInvSqrt(i);
 }
 time = micros() -time;
 Uart.print("\r\n1000x 1 / dspSqrt:");
 Uart.print(time);
 Uart.print("us.  Result  :");
 Uart.println(f);
 
 time = micros();
 f=0;
 for (int i=0; i<1000; i++) {
	f += 1/sqrt(i);
 }
 time = micros() -time;
 Uart.print("1000x 1 / sqrt   :");
 Uart.print(time);
 Uart.print("us.  Result  :");
 Uart.println(f);
 

 time = micros();
 f=0;
 for (int i=0; i<1000; i++) {
	f += invSqrt(i);
 }
 time = micros() -time;
 Uart.print("1000x invSqrt    :");
 Uart.print(time);
 Uart.print("us.  Result:");
 Uart.println(f);



 
 while(1);
}

...but invSqrt is faster than arm_math (?)

PaulStoffregen · Apr 15, 2014

That invSqrt() looks like an older version of FreeIMU. Please get the latest from here:

https://github.com/PaulStoffregen/FreeIMU

Part of the reason it's so fast is its low accuracy. It only performs one iteration of the Newton-Raphson approximation.
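To make the accuracy trade-off concrete, here is a minimal sketch of the same bit-trick with a configurable number of Newton-Raphson steps. It reuses the magic constant from the benchmark above; the function names invSqrtIter and relativeError are made up for this example, and memcpy replaces the pointer casts so the code is also well defined on a desktop compiler. It is not the FreeIMU implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Fast inverse square root with `iterations` Newton-Raphson refinements.
// Valid for positive, finite x only.
float invSqrtIter(float x, int iterations) {
  const float halfx = 0.5f * x;
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);   // reinterpret the float as an integer
  bits = 0x5f375a86u - (bits >> 1);      // initial guess from the bit pattern
  float y;
  std::memcpy(&y, &bits, sizeof y);
  for (int n = 0; n < iterations; ++n) {
    y = y * (1.5f - halfx * y * y);      // one Newton-Raphson refinement
  }
  return y;
}

// Relative error of the approximation against 1 / std::sqrt(x).
float relativeError(float x, int iterations) {
  const float exact = 1.0f / std::sqrt(x);
  return std::fabs(invSqrtIter(x, iterations) - exact) / exact;
}
```

Comparing relativeError(x, 1) with relativeError(x, 2) for a few inputs shows how much accuracy a second step buys, at the cost of a few more multiplies per call.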

Frank B · Apr 15, 2014

Thank you, Paul! The C++ warning regarding "evil" operations is gone now, and the speed is identical.
I updated my "benchmark".

I think I'll use the "dspSqrt" version from above for my new project. It is not much slower but gives better results (I hope).

But there is another warning:

Code:

In file included from C:\Arduino\hardware\teensy\cores\teensy3/WProgram.h:15:0,
                 from C:\Arduino\hardware\teensy\cores\teensy3/Arduino.h:1,
                 from arm_math.ino:5:
C:\Arduino\hardware\teensy\cores\teensy3/wiring.h:42:0: warning: "PI" redefined [enabled by default]
In file included from arm_math.ino:3:0:
C:\Arduino\hardware\teensy\cores\teensy3/arm_math.h:303:0: note: this is the location of the previous definition

---

I don't want to optimize too much, because I'm sure now that the Teensy is fast enough.
Reading the MPU6050 and HMC5883L over I2C (400kHz), plus the "Madgwick AHRS" 9-axis algorithm and a few other tasks, takes only 1.3 ms so far. Plenty of time left for other things.
My goal is to build my "Balancing Bot V3" (V1 with Raspberry + Arduino-Nano (Mega328) here: http://www.youtube.com/watch?v=n-noFwc23y0 or Blog. V2 was the same but without the Raspberry),
with more features and eventually, this time, with only one wheel.

The teensy 3 is great !!

Frank B · Apr 16, 2014

Wow, playing with the DSP is fun.

this:

Code:

inline void deg2rad_vect(float32_t *fvect){
 float32_t m[3] = { M_PI / 180,  M_PI / 180, M_PI / 180 };
 arm_mult_f32( m, &fvect[0], &fvect[0], 3);
}

is 10 times faster than

Code:

f[0] = f[0] * M_PI / 180;
f[1] = f[1] * M_PI / 180;
f[2] = f[2] * M_PI / 180;

I personally don't need these optimizations, but it's fun to find out what the DSP can do.
I think there are many more things to "teensy-"optimize in FreeIMU. AHRSupdate() is worth a look.
If anybody is interested, we can open a new thread.

PaulStoffregen · Apr 18, 2014

Frank B said:
I think there are many more things to "teensy-"optimize in FreeIMU.

I'm currently working on too many other things to do much with FreeIMU lately. But if you fork the github code, just send any well tested changes as pull requests and I'll merge them.

https://github.com/PaulStoffregen/FreeIMU

Frank B · Apr 18, 2014

Hi, here are the first changes:

https://github.com/FrankBoesing/freeIMU/compare/zrecommerce:master...master

About 20% speedup of the calculation. It is not entirely tested, but it should give the same results.

Unfortunately I can't test it with "real" flying hardware.

ltj · Oct 3, 2014

Hi,

It seems that the CMSIS lib that comes with Teensyduino is version 1.1.0. They are now at 1.4.4, and I'm very interested in using the newer and more convenient complex FFT functions (where you don't have to init and the radix is selected automatically). I have tried to use an updated CMSIS lib, but without luck.
What I did:
1. Downloaded the latest CMSIS-DSP.
2. Changed the boards.txt file so that teensy3.build.additionalobject1 links the new libarm_cortexM4l_math.a.
3. Updated the arm_math and core header files in hardware/teensy/cores/teensy3/.
4. Included the teensy3 "fix" in arm_math (I could not see any Teensy-related edits in the other header files).
Something like this now compiles:

Code:

arm_cfft_instance_f32 fft_inst;
arm_cfft_f32(&fft_inst, buffer_f, 0, 1);

But the actual FFT calculation brings the MCU to a grinding halt. Nothing happens after that. Any suggestions are greatly appreciated.

Cheers,
Lars

PaulStoffregen · Oct 3, 2014

I tried one of the newer versions some time ago. Unfortunately, it expanded the maximum FFT size by increasing the size of the lookup tables by 4X (even when computing a smaller FFT), so the compiled code could not fit into Teensy 3.0 or 3.1.

ltj · Oct 3, 2014

Thanks, Paul. But unless I did something wrong or missed something, it actually compiled and fit into the Teensy 3.0. The sketch would run as normal until it tried to use the new arm_cfft_x function. Please forgive any ignorance here; this is not within my comfort zone.
