Where can I find a list of math functions available on Teensy 3.0?

Status
Not open for further replies.

offgrid

Active member
Hello,
I am using Teensyduino 1.15 (and yep...new to Teensy) and am wondering what math functions are available
when I include the math library for 3.0? I have not been able to find this answer searching thru the forums.

Am specifically looking for the multiply and accumulate (MAC) for doing RMS sampling
of an ac waveform.

And is it this simple...just include the library and call the function?

Thanks
Tim
 
A lot of the DSP instructions are available as inline macros. Have a look at:
Arduino-1.0.5\hardware\teensy\cores\teensy3\core_cm4_simd.h

For example, see the __SMLALD macro, which performs two signed 16-bit multiplies and adds both products to a 64-bit accumulator.

Pete
 
Am specifically looking for the multiply and accumulate (MAC) for doing RMS sampling
of an ac waveform.

This might be as simple as something like this each time you collect a new sample:

Code:
  sum += value * value;

If the result fits in 32 bits, I believe the compiler can implement this with 2 instructions. But the C language doesn't support saturating arithmetic.
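As a rough sketch of that per-sample accumulation: the struct and function names below are made up for illustration, and the 64-bit accumulator is an assumption on my part, chosen because a 32-bit sum can overflow after only a handful of full-scale 16-bit samples.

```c
#include <stdint.h>

/* Running sum of squares for one channel (hypothetical helper, not a library API). */
typedef struct {
    uint64_t sum_sq;  /* 64 bits wide so long runs of 16-bit samples cannot overflow */
    uint32_t count;   /* number of samples accumulated so far */
} rms_accum_t;

static void rms_accum_reset(rms_accum_t *a)
{
    a->sum_sq = 0;
    a->count = 0;
}

/* Call once per new ADC sample. */
static void rms_accum_add(rms_accum_t *a, int16_t sample)
{
    int32_t s = sample;
    a->sum_sq += (uint64_t)(s * s);  /* a 16x16 product always fits in 32 bits */
    a->count++;
}

/* Mean of the squares; the RMS value is the square root of this. */
static uint32_t rms_accum_mean_square(const rms_accum_t *a)
{
    return a->count ? (uint32_t)(a->sum_sq / a->count) : 0u;
}
```

Only the accumulator and the count need to be stored, so there is no sample buffer at all.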

Of course, for the square root you'll probably need something like the fast version in the CMSIS DSP library, or a Newton-Raphson approximation. If you use a log-weighted lookup table (e.g., 32 entries indexed by __builtin_clz) for the initial guess, Newton-Raphson can get very close with only a few iterations.
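Here is a minimal sketch of the Newton-Raphson idea for an integer square root. Note that it is a simplification: instead of the 32-entry lookup table, it derives a power-of-two starting guess directly from __builtin_clz, so it may need an iteration or two more than a tuned table would. The function name isqrt32 is hypothetical.

```c
#include <stdint.h>

/* Integer square root: returns floor(sqrt(n)). */
static uint32_t isqrt32(uint32_t n)
{
    if (n == 0u) {
        return 0u;                        /* __builtin_clz(0) is undefined */
    }
    /* Number of significant bits in n, from the count of leading zeros. */
    uint32_t bits = 32u - (uint32_t)__builtin_clz(n);
    /* Power of two that is guaranteed to be >= sqrt(n). */
    uint32_t x = 1u << ((bits + 1u) / 2u);
    /* Newton-Raphson: x <- (x + n/x) / 2. Starting from above the root,
       the estimate decreases until it reaches floor(sqrt(n)). */
    for (;;) {
        uint32_t next = (x + n / x) / 2u;
        if (next >= x) {
            return x;
        }
        x = next;
    }
}
```

Because the starting guess is never below the true root, the loop can stop as soon as the estimate stops decreasing.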

Then again, if you just collect up the samples into an array, you could use the library to do all the work the easy way. :D

http://www.keil.com/pack/doc/arm/cmsis/cmsis/documentation/dsp/html/group___r_m_s.html
 
This might be as simple as something like this each time you collect a new sample:

Code:
  sum += value * value;

If the result fits in 32 bits, I believe the compiler can implement this with 2 instructions. But the C language doesn't support saturating arithmetic.

However, GCC does support the N1169 draft of ISO/IEC DTR 18037, which defines fixed-point fractional and saturating types: http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Fixed_002dPoint.html#Fixed_002dPoint

Looking at the machine descriptions, it looks like the 4.8.1 versions of the AVR and ARM backends support some fractional and saturating types. I didn't look closely enough to see which types, or whether you need special options.
 
Callable RMS routine, what!!

Then again, if you just collect up the samples into an array, you could use the library to do all the work the easy way. :D

http://www.keil.com/pack/doc/arm/cmsis/cmsis/documentation/dsp/html/group___r_m_s.html
---------------------------

I was looking for the single cycle MAC, but this may prove even better!
When I get back to the lab, I will try out the routine, and time it.

I need to be able to disconnect from 3 phase grid in under 10 cycles, so
hopefully the RMS calculation using a large number of samples will not take too long.

Has anyone timed any of the RMS functions?

Thanks again Paul, Pete and Michael for your quick responses...
Tim
 
However, GCC does have support for the N1169 draft of ISO/IEC DTR 18037 which defines fixed point fractional and saturating types. http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Fixed_002dPoint.html#Fixed_002dPoint.

Looking at the machine descriptions, it looks like the 4.8.1 versions of AVR and ARM backends support some fractional and saturating types. I didn't look close enough to see what types, and if you need special options.

I just tried, but I get "error: '_Fract' was not declared in this scope". Same for "_Sat". :(

Currently we're building with "-std=gnu++0x". Maybe some special option is needed, or perhaps the 4.7.2 we're currently using is just too old?
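As far as I know, GCC documents these fixed-point types as a GNU C extension, so a C++ build may not recognise them whatever the compiler version; treat that as an assumption to verify. Independently of compiler support, saturation can be written by hand in plain C. The sketch below uses hypothetical helper names (sat_add32, sat_mac16) and does the addition in 64 bits before clamping.

```c
#include <stdint.h>

/* Saturating 32-bit add: clamps to INT32_MIN/INT32_MAX instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) {
        return INT32_MAX;
    }
    if (wide < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)wide;
}

/* Saturating multiply-accumulate of two 16-bit values into a 32-bit sum. */
static int32_t sat_mac16(int32_t acc, int16_t x, int16_t y)
{
    /* The 16x16 product itself always fits in 32 bits. */
    return sat_add32(acc, (int32_t)x * (int32_t)y);
}
```

The Cortex-M4 also has saturating instructions such as QADD, which may be a faster route than this portable version.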
 
I need to be able to disconnect from 3 phase grid in under 10 cycles, so
hopefully the RMS calculation using a large number of samples will not take too long.

If you square and accumulate each sample as you get them, it takes very little time at all (assuming you wait again for the next sample). You also only need to store the accumulation, rather than a big array of raw data.

If you need to do some hardware operation quickly, why not just do it ASAP and then "do the math" later with the array of data you captured? Even if you square-accumulate as you acquire each sample, the slower square root operation could be done later.


Has anyone timed any of the RMS functions?

No. I have not measured its speed, nor even ever used it (yet).

Please let us know how it works when you try it?
 
Sounds interesting

Ten cycles using a good front end for 3-phase measurements like the MCP3903 could be good enough for most applications. A lot depends on the level of decimation you'd be happy with. But 16 ENOB is possible, and the MCP line of AFE chips is custom-built for this application. My power meter is running along at 1.4 ksps, with each of those delivered samples based on a decimated set of 2048 samples collected by an MCP3911 (i.e. massive oversampling). It could run 25% faster, but I prefer a direct connection from one of the PWM pins on the Teensy 3 to the OSC1 input vs. using an external 16 MHz crystal.

Now, ten line cycles is ~ 1/5th or 1/6th of a second (depends on where you live) so you'd have 200+ samples on your hands at that point. Not sure whether the Teensy could handle the matrix math, given that each channel is represented by at least a 16-bit result, if not 24-bits (your choice). Then there are 6 channels that the MCP3903 can monitor, i.e. over 1200 samples to put into a matrix. Etc.

As for the math/code required, have a look at the openenergymonitor source code for inspiration on how to do it (including a filter that removes the DC bias required for using an ADC like the ones found on the Teensy3 ARM or Atmega line of MCUs). But if there is a more efficient way to calculate it all, I'd be interested!
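To illustrate the DC-bias removal idea in general terms, here is a minimal sketch. It is a generic slow-tracking offset estimate, not code taken from the openenergymonitor source. The names and the 1/1024 tracking rate are assumptions for illustration. It uses float for clarity; on a part without an FPU a fixed-point version is probably preferable.

```c
#include <stdint.h>

/* Tracks the slowly varying DC offset of an ADC signal (hypothetical helper). */
typedef struct {
    float offset;  /* current estimate of the DC level, in ADC counts */
} dc_blocker_t;

/* Feed one raw sample; returns the sample with the estimated offset removed. */
static float dc_blocker_step(dc_blocker_t *f, float sample)
{
    /* Move the estimate 1/1024 of the way towards the new sample, so it
       follows the DC level but largely ignores the AC waveform. */
    f->offset += (sample - f->offset) / 1024.0f;
    return sample - f->offset;
}
```

The filtered output is what would then be squared and accumulated for the RMS calculation.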
 
Well, if you collect 3 ksamples/sec on 6 channels over 0.1 seconds, that's only 1800 measurements. Even if they're 24 bits and you sign extend and store them into 32 bit "q31" variables, that's only 7200 bytes of RAM, or just under half of what Teensy3 has available.
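The arithmetic above, and the sign extension of 24-bit readings, can be sketched as follows. Both function names are hypothetical, and the raw 24-bit value is assumed to arrive in the low bits of a 32-bit word.

```c
#include <stdint.h>
#include <stddef.h>

/* Sign-extend a 24-bit two's-complement reading to a 32-bit signed value. */
static int32_t sign_extend24(uint32_t raw)
{
    raw &= 0x00FFFFFFu;  /* keep only the 24 data bits */
    /* Flip the sign bit, then subtract its weight: portable, no signed shifts. */
    return (int32_t)(raw ^ 0x00800000u) - 0x00800000;
}

/* Bytes needed to buffer every channel for a capture window. */
static size_t capture_bytes(uint32_t rate_hz, uint32_t channels,
                            uint32_t window_ms, size_t bytes_per_sample)
{
    size_t per_channel = (size_t)((uint64_t)rate_hz * window_ms / 1000u);
    return per_channel * channels * bytes_per_sample;
}
```

For example, capture_bytes(3000, 6, 100, 4) gives the 7200 bytes quoted above (1800 samples of 4 bytes each).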

I'd be impressed if anything beyond 16 bits really matters. Who knows, maybe? That MCP3903 is 91 dB signal to noise+distortion, so even the 16th bit is probably not useful, even if you have a perfect noise-free signal (e.g., no resistor noise from the voltage divider measuring the high voltages).
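The usual rule of thumb for converting signal-to-noise-and-distortion into effective bits supports that estimate. Taking the 91 dB figure quoted above at face value:

```latex
\[
\mathrm{ENOB} = \frac{\mathrm{SINAD} - 1.76\,\mathrm{dB}}{6.02\,\mathrm{dB/bit}}
\approx \frac{91 - 1.76}{6.02}
\approx 14.8\ \mathrm{bits}
\]
```

So roughly 15 bits carry information, which is consistent with 16-bit q15 storage being sufficient.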

The 16 bit functions using "q15" variables are probably much faster. The SIMD instructions in the Cortex-M4 are all about manipulating signed 16 bit data.
 
Here's a function for squaring and summing an array of int16_t values using the Cortex-M4 DSP instruction
SMLAD, Signed Multiply Accumulate Dual. (The "Long" variant with a 64-bit accumulator is SMLALD.)

This code needs the number of data points (count) to be a multiple of 4. The data array must also be aligned
on a word boundary so each 32-bit word is used as two 16-bit signed halfwords.

It squares and sums 512 16-bit signed halfwords in 13.9 µs on a 96 MHz Teensy 3; this is about 2.6 clock cycles
per data point, including function call and loop overhead.

Code:
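#include <stdint.h>

// Sum of squares of int16_t samples using the Cortex-M4 SMLAD instruction.
// Requirements: count is a multiple of 4, data is 32-bit aligned, and the
// total fits in 32 bits (SMLAD accumulates into a 32-bit register).
int32_t sumofsquares(int16_t * data, uint32_t count)
{
  int32_t accum = 0;
  int32_t val;
  int32_t * pdata = (int32_t *)data;     // each 32-bit word holds two samples
  int32_t * enddata = pdata + count/2;   // count/2 words in total
  while (pdata < enddata) {              // also safe when count is 0
     // Unrolled once: two words (four samples) per pass.
     // SMLAD: accum += lo(val)*lo(val) + hi(val)*hi(val)
     val = *pdata++;
     asm ("SMLAD %0, %1, %2, %3" : "=r"(accum) : "r"(val), "r"(val), "r"(accum));
     val = *pdata++;
     asm ("SMLAD %0, %1, %2, %3" : "=r"(accum) : "r"(val), "r"(val), "r"(accum));
  }
  return accum;
}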
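For checking results off-target, a portable reference can be handy. The sketch below is an assumption-laden companion rather than part of the measured code: smlad_step and sum_of_squares_ref are hypothetical names, the SMLAD path is used only when the compiler defines __arm__ and __ARM_FEATURE_DSP, and plain C is used otherwise. It also handles odd counts and unaligned data, at some cost in speed.

```c
#include <stdint.h>
#include <string.h>

/* One SMLAD step: acc + lo(a)*lo(b) + hi(a)*hi(b), halves taken as signed 16-bit. */
static int32_t smlad_step(uint32_t a, uint32_t b, int32_t acc)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    int32_t result;
    __asm__ ("smlad %0, %1, %2, %3" : "=r"(result) : "r"(a), "r"(b), "r"(acc));
    return result;
#else
    /* Portable emulation; the casts rely on the usual two's-complement wrap. */
    int32_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int32_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    return acc + a_lo * b_lo + a_hi * b_hi;
#endif
}

/* Sum of squares for any count; the caller must keep the total within 32 bits. */
static int32_t sum_of_squares_ref(const int16_t *data, uint32_t count)
{
    int32_t acc = 0;
    uint32_t i = 0;
    for (; i + 1u < count; i += 2u) {
        uint32_t pair;
        memcpy(&pair, &data[i], sizeof pair);  /* alignment- and aliasing-safe load */
        acc = smlad_step(pair, pair, acc);
    }
    if (i < count) {                           /* odd trailing sample */
        acc += (int32_t)data[i] * data[i];
    }
    return acc;
}
```

Since both halves of each word are squared, the result does not depend on which sample lands in the low or high halfword.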
 
These two links have info about the instructions themselves, which might make those macros clearer.
http://www.arm.com/files/pdf/dspconc...esentation.pdf

Thanks Pete:
Using the single-cycle MAC instruction for 16 x 16 + 64 = 64, which will accumulate data very quickly, the 13 ENOB ADC will be
just enough resolution for the job. The application is in the grid-tie microinverter market. We are seeing these inverters make headway in commercial 3-phase
North American grids, and the problem is this: multiples of 3 inverters do not act like a single 3-phase inverter when it comes to shutdown for IEEE 1547 conditions.
What are those conditions? High and low voltage, high and low frequency, and loss of phase. So my solution is a relay that will detect out-of-1547 conditions, open
a 3-phase contactor, and shut all inverters off instantly.

So yeah, single-cycle MAC, with a comparison based on the accumulated value, no need to even compute the root. I still need to have a zero-crossing detector to take
the same number of samples each cycle, for amplitude accuracy. I also need the zero-crossing detector for accurate frequency measurements.
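The root-free comparison can be sketched like this: instead of taking the square root of the mean square, square the limits and scale them by the sample count. The function name and the idea of expressing the limits in ADC counts are assumptions for illustration. The limits are assumed to fit in 16 bits so the 64-bit products cannot overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when lo_rms <= RMS <= hi_rms, without computing a square root.
   sum_sq is the accumulated sum of squares over n samples; the limits
   are RMS thresholds in ADC counts (assumed to fit in 16 bits). */
static bool rms_within_limits(uint64_t sum_sq, uint32_t n,
                              uint32_t lo_rms, uint32_t hi_rms)
{
    if (n == 0u) {
        return false;  /* no data: treat as out of limits */
    }
    /* RMS >= lo  <=>  sum_sq >= lo^2 * n (and likewise for the upper limit). */
    uint64_t lo = (uint64_t)lo_rms * lo_rms * n;
    uint64_t hi = (uint64_t)hi_rms * hi_rms * n;
    return sum_sq >= lo && sum_sq <= hi;
}
```

The scaled limits can be precomputed once if the number of samples per cycle is fixed, leaving only two comparisons per check.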

I've done this on 8-bit micros with 10-bit ADCs, with a lot of hard work, bit-banging and software limiting (or saturating, as it is being called now), and got as good as 1% accuracy consistently in a commercial product, so I have no doubt that with a fast ARM and a 13-bit ENOB A/D converter it can be done much better and faster!!!

Thanks Paul for a very cool product!
Tim
P.S. I'm still not perfectly clear on proper usage of the macro, (yes I read the material you suggested Pete) but once I get back after this camping trip in the Olympic Peninsula, I'll test it out, and figure it out!!
 
I've attached a zip file of a sketch which demonstrates sum of squares.
The sketch contains an array of 6000 16-bit integers which is part of an audio file of 10-bit unsigned data. The sketch converts the array to 16-bit signed format and then does two loops. The first loop does the sum of squares using a simple non-dsp technique. The second loop uses the __SMLAD macro which calls the SMLAD DSP instruction. SMLAD can do two parallel 16-bit multiplies and accumulate the results. So each pass through the second loop is equivalent to two passes through the first loop.
I've also included a conditional bit of code which unrolls both loops once.

A result for the original code running at 96MHz is:
Code:
Not unrolled
Not using DSP
time = 447
rms       = 186092262
sqrt(rms) = 13641


Using DSP
time = 224
rms       = 186092262
sqrt(rms) = 13641
As you would expect (and hope) the second loop is twice as fast as the first.

Defining UNROLL gives these results:
Code:
Unrolled
Not using DSP
time = 284
rms       = 186092262
sqrt(rms) = 13641


Using DSP
time = 161
rms       = 186092262
sqrt(rms) = 13641
Unrolling the loop further will speed it up again but you soon reach a point where it doesn't significantly improve the execution time.

Pete
 

Attachments

  • dsp_rms.zip
    8.4 KB
I've attached a zip file of a sketch which demonstrates sum of squares.
Pete


Jeese... I've got to say, I'd marry this forum if I wasn't already married!!
That was fast!! and...that is fast.
Thanks for the quick and relevant answer.
Now I'll code my solution and post what I did ...thanks again Pete!
 