Floating-Point Audio Library Extension

chipaudette

Hi All,

Because I find it so much easier to code algorithms using floating-point operations (vs fixed-point), and because the Teensy 3.6 is so dang fast with floating-point math, I've extended the Teensy Audio Library to enable floating-point operations.

I tried to follow the structure and conventions of the Teensy Audio Library as far as my understanding allowed. As we know, though, the assumption of Int16 data is deeply entwined with the library's foundational structures. By adding parallel classes and using inheritance, I was able to make floating-point objects that play nicely within the Teensy Audio Library world.

LibraryExtension.png

For more info, you can see my post here: http://openaudio.blogspot.com/2016/12/extending-teensy-audio-library-for.html.

My github repo for this library is here: https://github.com/chipaudette/OpenAudio_ArduinoLibrary

My next steps are to continue to add processing blocks. Right now, besides all of the new code that handles all the plumbing, the only audio processing block in the library is a simple gain block. I've got a block that does dynamic range compression almost complete. Then I'll do a filtering block, followed by moving to frequency domain methods.

I'd very much appreciate any comments or feedback as to how I could improve or refine my approach for enabling floating-point processing on top of the Teensy Audio Library structure. I'd love to make it better!

Chip
 
Chip, that's a brilliant idea! There are several things that were impossible to do in fixed point, but are now doable in floating point, so it's a logical development. I will try your lib in the next few days.

I would really like to offer to contribute, but my programming skills are not very bright when it comes to the implementation of those modules for the audio lib . . . however this will probably be a longer lasting effort. So maybe in the course of the project I can contribute small snippets of code to the floating point lib.

It would be very good, especially at this early point of the development, to find a way to implement and use the new CMSIS DSP lib version. For example, the old FFT functions should not be used anymore because of precision issues and they are slower than the new implementations.

Keep on the good work!
 
Yes, the idea is fine. Did you measure the CPU usage with more functions, like a mixer, FFT or peak?
What are your plans? Do you want to rewrite the whole library?
How fast is the conversion int16<->float?
 
It would be very good, especially at this early point of the development, to find a way to implement and use the new CMSIS DSP lib version. For example, the old FFT functions should not be used anymore because of precision issues and they are slower than the new implementations.

My goal right now is to not modify the Teensy Audio Library in any way, if I can. I just want to add / overload classes and methods. Can the new CMSIS be added without modifying the Teensy Audio Library to remove the existing CMSIS library?
 
Yes, the idea is fine. Did you measure the CPU usage with more functions, like a mixer, FFT or peak?

As the mixer, FFT, and peak are all still Int16 functions, I have not evaluated their impact on the floating-point extensions. As their code is unchanged, I would not expect my code to affect their performance. If I (or someone) ever make an F32 version of these modules, their load will probably be different, though the Teensy 3.6 is so good on floats, I don't expect it to be too

What are your plans? Do you want to rewrite the whole library?

I'm trying to stay pretty focused on my hearing aid thing, so I do not plan on extending the whole library. The existing library works pretty darned well in Int16, so I'm probably only going to do the functions that I specifically want as operating in F32, such as the IIR/FIR filtering and FFT. After that, we'll see.

How fast is the conversion int16<->float?

Using the Audio Library self-reporting, it says that the int2float conversion takes < 1%, the gain block takes < 1% and the float2int takes < 1%. Doing all three for two audio channels (ie, stereo) takes 3.2% of CPU.

>>CPU: Usage, Max: int2Float=0, 1 gain=0,0 float2Int1=1, 1 all=3.21, 3.27 Int16 Memory: 6, 6 Float Memory: 0, 2

I've used this same self-reporting while developing my new processing block to do dynamic range compression. Because I need to use the pow() operator on every sample, it's telling me that I'm exceeding 100% cpu due to the one operation. So, while the Teensy 3.6 is fast when you use the DSP acceleration, when there isn't acceleration (such as for the pow() operation), you'll bump into speed limits.

Thankfully the int2float and float2int conversions do not appear to be limiting.

Chip
 
I've used this same self-reporting while developing my new processing block to do dynamic range compression. Because I need to use the pow() operator on every sample, it's telling me that I'm exceeding 100% cpu due to the one operation. So, while the Teensy 3.6 is fast when you use the DSP acceleration, when there isn't acceleration (such as for the pow() operation), you'll bump into speed limits.

So I just got my compressor to work. Stupidly, I was using the pow() command when I should have been using the powf() command. In the middle of all of my floating point variables, the pow() was deathly slow, but the powf() did what I wanted!
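A minimal sketch of why the two calls behave so differently, assuming the usual cause: the FPU in the Teensy 3.6's Cortex-M4F is single precision only, so anything evaluated as a double falls back to software routines. The helper names dbToLinearSlow and dbToLinearFast are hypothetical and are not taken from the library.

```cpp
#include <math.h>
#include <type_traits>

// Hypothetical helper: convert a gain in dB to a linear factor.
// The double literals promote the whole expression to double, so pow()
// runs in double precision (software-emulated on a single-precision FPU).
inline float dbToLinearSlow(float gain_dB) {
    return static_cast<float>(pow(10.0, gain_dB / 20.0));
}

// Same computation kept entirely in single precision: float literals
// plus powf(), so a single-precision FPU can do the arithmetic.
inline float dbToLinearFast(float gain_dB) {
    return powf(10.0f, gain_dB / 20.0f);
}

// Compile-time check of the types involved in each expression.
static_assert(std::is_same<decltype(pow(10.0, 1.0f / 20.0)), double>::value,
              "pow with double arguments is evaluated in double");
static_assert(std::is_same<decltype(powf(10.0f, 1.0f / 20.0f)), float>::value,
              "powf stays in float");
```

Both helpers return essentially the same value; only the precision of the intermediate math (and therefore the cost on an M4F) differs.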

The compressor has a number of processing elements. On every audio sample it needs to apply: (1) a DSP accelerated HP biquad filter to remove DC, (2) arm_mult_f32 to square the signal, (3) the non-accelerated powf() command to calculate the compressed gain, (4) arm_sqrt_f32 to finish calculating the gain, and (5) another arm_mult_f32 to apply the gain onto the signal.
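Steps (2) through (5) can be sketched in portable C++ as below. This is not the library's compressor: the gain law (a plain threshold and ratio with no attack/release smoothing), the function name compressBlock, and its parameters are all assumptions made for illustration, and the DC-removal filter of step (1) is left out. On the Teensy, the plain multiplies and sqrtf() would be replaced by the CMSIS calls named above.

```cpp
#include <math.h>
#include <stddef.h>

// Hypothetical static compressor (assumed gain law, no smoothing, no DC filter).
// thresh: linear amplitude threshold; ratio: compression ratio (2.0f means 2:1).
void compressBlock(const float *in, float *out, size_t n,
                   float thresh, float ratio) {
    const float threshPow = thresh * thresh;   // threshold expressed as power
    const float expo = 1.0f / ratio - 1.0f;    // exponent of the power gain
    for (size_t i = 0; i < n; i++) {
        float p = in[i] * in[i];               // (2) square the signal
        float gainPow = 1.0f;                  // unity gain below threshold
        if (p > threshPow) {
            gainPow = powf(p / threshPow, expo);   // (3) compressed power gain
        }
        float gain = sqrtf(gainPow);           // (4) power gain -> amplitude gain
        out[i] = in[i] * gain;                 // (5) apply the gain
    }
}
```

With thresh = 0.5 and ratio = 2, a full-scale input of 1.0 sits about 6 dB above the threshold and comes out about 3 dB above it, while samples below the threshold pass through unchanged.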

Even with all this work with floats, and even with doing it in stereo, the Teensy 3.6 is reporting the following CPU load:

>> CPU: Usage, Max: int2Float1=1, 1 comp=18, 19 float2Int1=1, 1 all=39.49, 40.93 Int16 Memory: 6, 6 Float Memory: 2, 4

So, one channel of the compressor takes 18-19% of CPU. Two channels, with all the other stuff, is about 40%. Not bad!

Digging into the details, it appears that all of the measurable time seems to be in that powf() call. If I replace it with a lookup table, it'll scream!

Chip
 
Hi Chip,

one thing that came to my mind inspired by this thread:

it would be good to make sure that your lib is really using floating point calculations with the CMSIS package. It seems that when you use e.g. arm_sqrt_f32, it does not use the FPU when you use the standard implementation of CMSIS in the current Teensyduino (however, this has to be confirmed by people who know more about the internals of Teensyduino and CMSIS). At least there are hints that it does not use the FPU with a standard Teensyduino installation.
 
To accept this into the audio library (if that's even a goal?), I'd really like to see it structured a bit differently. My main concern is reusing the same memory allocation pool for deterministic timing. A pair of blocks will be needed to hold 32 bit float data.
 
Hi Paul,

Thanks so much for the reply.

Of course, my ego would love for my work to live beyond my own personal uses, but it's your baby and you've had the success to know what works and what doesn't. I'm very happy to make alterations based on your experience. Whether it's useful once the alterations are made is your call.

Looking at what you've suggested, I'm a bit confused by your comment above. I do understand that I did not use the same memory pool for my f32 data as you do for the int16 data, but I figured that was a good choice, as it decoupled the two portions of the library. Given my inexperience, I figured that decoupling was a good idea as it helped me avoid messing up the standard (int16) portions of the library. Also, as I will also soon be making an additional extension for complex-float data (for frequency-domain processing), I could re-use the same approach to keep me segregated from myself.

If that's the wrong approach, I could certainly (I think) re-write my floating point extensions to pull from the same memory pool that had been allocated by the original AudioStream::allocate() (or, perhaps alternately said, by the AudioStream::initialize_memory() command). The problem is that I don't understand how the shared-memory approach is deterministic whereas the segregated-memory approach is not. Is it possible for you to give a few more words on this issue so that I can see why the two approaches are different in this regard?

Chip
 
I got my first Pull Request with user-contributed code. That makes a guy feel pretty cool.

GitHub Screenshot.png

I did a quick write-up, including a test of the new code: http://openaudio.blogspot.com/2017/01/received-my-first-pull-request.html

If others are interested in floating-point audio processing, I'd love to receive more contributions. This contribution was directly modeled on Paul's existing (Int16) audio processing blocks. That's a great approach to follow.

Thanks for your interest!

Chip
 
Hi guys,

Great work, Chip. You really inspired me to start doing something with float support :)

To accept this into the audio library (if that's even a goal?), I'd really like to see it structured a bit differently. My main concern is reusing the same memory allocation pool for deterministic timing. A pair of blocks will be needed to hold 32 bit float data.

I created a branch of the core library which includes float support and which should work better with all the old stuff that is around. (It uses two blocks for floats.)
It also removes the need for all the conversion blocks Chip uses, as it automatically does a (cached) conversion from int to float and vice versa :).

Furthermore, all float support can be disabled with a single define (AUDIO_FLOAT), so older boards are not burdened with all the float processing.

A float block basically consists of two blocks. I added a pointer to point to the next float block.

Code:
typedef struct audio_block_struct {
	unsigned char ref_count;
	unsigned char memory_pool_index;
	unsigned char type;
	unsigned char reserved2;
	struct audio_block_struct * nextBlock;
	int16_t data[AUDIO_BLOCK_SAMPLES];
} audio_block_t;

Below is the mixer example to demo the API.

Code:
static void applyGain(float *data, float mult) {
	for (int i = 0; i < AUDIO_BLOCK_SAMPLES/2; i++) {
		data[i] = data[i] * mult;
	}
}

static void applyGainThenAdd(float *data, const float *in, float mult) {
	for (int i = 0; i < AUDIO_BLOCK_SAMPLES/2; i++) {
		data[i] = data[i] + in[i] * mult;
	}
}

void AudioMixerFloat4::update(void)
{
	audio_block_t *in, *out=NULL;
	unsigned int channel;

	for (channel=0; channel < 4; channel++) {
		if (!out) {
			out = receiveWritableFloat(channel);
			if (out) {
				float mult = multiplier[channel];
				applyGain((float *)out->data, mult);
				applyGain((float *)out->nextBlock->data, mult);
			}
		} else {
			in = receiveReadOnlyFloat(channel);
			if (in) {
				applyGainThenAdd((float *)out->data, (float *)in->data, multiplier[channel]);
				applyGainThenAdd((float *)out->nextBlock->data, (float *)in->nextBlock->data, multiplier[channel]);
				release(in);
			}
		}
	}
	if (out) {
		transmit(out);
		release(out);
	}
}


My demo code is at the moment not very impressive (a dummy block and a mixer). And AudioStream.cpp does need a little cleanup here and there.

You can check it out here:

https://github.com/b0rg3rt/cores/tree/floatsupport
https://github.com/b0rg3rt/Audio/tree/floatsupport
 
How costly is converting from float to int on the M4? I.e. do the tricky DSP stuff using Floats and move to int16 for the simpler fixed point arithmetic?
 
Well, the actual software code to do the conversion (and scaling) is trivial (see convertAudio_F32toI16 in this file: https://github.com/chipaudette/OpenAudio_ArduinoLibrary/blob/master/AudioConvert_F32.h).

The hard part is designing your filtering and other audio processing algorithms so that you don't get any numerical overflow and so that you minimize your numerical underflow. The amount of difficulty depends upon what kind of DSP operations you want to do.

FFT operations, for example, are relatively easy to invoke (the ARM CMSIS library has one ready to call) but the impact on your audio signal is murderous if you want to do both an FFT and an IFFT so as to listen to your audio again -- You lose something like 6-7 bits of resolution!

FIR filters are a bit trickier than simply calling an FFT because you need to choose how to scale all of your FIR coefficients in order to avoid overflow. But, that's not too hard a choice. If you choose well, the impact on your audio quality isn't too bad at all (in contrast to FFT).
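One conservative way to make that scaling choice is sketched below: the worst-case output of an FIR filter is bounded by the sum of the absolute tap values times the largest input, so scaling the taps until that sum is at most 1 keeps the output within the input's full-scale range. The function name quantizeFirQ15 is hypothetical, and this bound is only one of several possible scaling rules (it is safe but can cost a little resolution).

```cpp
#include <math.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper: scale float FIR taps so that sum(|h|) <= 1, then
// quantize them to Q15 (int16_t, where 32768 represents 1.0).
// Returns the scale factor that was applied to the taps.
float quantizeFirQ15(const float *h, int16_t *q, size_t n) {
    float sumAbs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sumAbs += fabsf(h[i]);                 // worst-case gain of the filter
    }
    const float scale = (sumAbs > 1.0f) ? 1.0f / sumAbs : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = h[i] * scale * 32768.0f;     // map 1.0 to Q15 full scale
        if (v > 32767.0f) v = 32767.0f;        // saturate to the int16_t range
        if (v < -32768.0f) v = -32768.0f;
        q[i] = (int16_t)lrintf(v);             // round to nearest integer
    }
    return scale;
}
```

The returned factor tells you how much overall gain was given up to buy the overflow guarantee, so it can be restored later in the chain if there is headroom.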

IIR filters are one of the trickiest to set up due to much more complex considerations of how to scale your coefficients. Choosing the scale factor is hard because it's dependent upon where you've put your filter's corner frequencies. Close to DC or close to Nyquist makes the scaling really hard for IIR filters. But, as long as you stay away from these extremes, you can make a pretty decent fixed-point IIR.


So, even though I'm maybe making it sound not so bad...personally, I hate worrying about the numerical implications of doing fixed-point operations. It's probably just personal bias or personal laziness, but even for those fixed-point challenges that I (kinda) understand, I just hate having to even *think* about it. Hate it. So, that's why I use floats whenever I can.

Chip
 
Thanks Chip, I'm aware of most of that. I'm particularly curious how 'costly' it is in terms of cpu cycles to convert a floating point into an int and vice versa. E.g. on most ARM A[x] and Intel chips it's typically best to stick with one or the other because conversion is so slow that any optimization benefits of mixing paradigms is completely negated. I'm wondering if the same goes for the M4.
 
Oh, the conversion from Int to Float is slow only on those M4 processors that don't have a floating point unit. But that's only because handling floating point numbers is slow, not because the conversion itself is slow.

The simplest conversion from int to float is simply "my_float = (float32_t)my_int;". That's not very expensive at all! Though I do prefer to add a division operation to change the scaling of my values: "my_float = ((float32_t)my_int) / 32767.f;". On the Teensy 3.5/3.6, which have a floating point unit (ie, they are an M4F), both the casting and the single per-sample division are pretty dang fast.

Converting back to int from float is similarly fast because it's either just a casting operation (the simplest) or a casting operation plus a multiply to change the scaling. Again, on an M4F, this is wicked fast.
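The two conversions described above can be sketched as follows. The function names are hypothetical and this is not the library's convertAudio_F32toI16 code. Two choices here are assumptions rather than anything from the post: the scale factor is 32768 (a power of two, so the int16 to float direction is exact) instead of 32767, and the float to int16 direction saturates, because casting an out-of-range float to an integer type is undefined behavior in C++.

```cpp
#include <math.h>
#include <stdint.h>

// int16 sample -> float in [-1.0, 1.0): a cast plus one scaling division.
inline float int16ToFloat(int16_t s) {
    return (float)s / 32768.0f;
}

// float -> int16 sample: scale, saturate, then round and cast.
inline int16_t floatToInt16(float f) {
    float v = f * 32768.0f;
    if (v > 32767.0f) v = 32767.0f;    // clip positive overshoot
    if (v < -32768.0f) v = -32768.0f;  // clip negative overshoot
    return (int16_t)lrintf(v);         // round to nearest, then cast
}
```

With this scaling, a round trip int16 to float to int16 returns the original sample, and float values outside the nominal range clip instead of wrapping.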

In my benchmarking on the Teensy 3.6, the conversion from Int16 to Float32 and back takes only 1-2% of CPU when running at 44 kHz. I don't consider that too much of a burden.

Chip
 
I have 3 questions:
1. I have created 2 FP modules: DC and a Moog filter. I'd like to add these to GitHub, but can't figure out how.
2. On my 3.6, a FP and Int mix doesn't work. It's either all FP flow or all integer. When there is a mix, the program seems to freeze. Also a sample by sample int<-> fp with division doesn't work, i.e. the result sounds bit-crushed. Any idea, what's going on?
3. Is there a way to get the Arduino IDE to work with the FP-CMSIS library?
 
Hi DragonSF,

if you want someone to comment on your code, you would have to provide it . . . otherwise nobody can help you.

re 1. github is hard to handle, at least for me working on the command line. But github also has the feature of drag-and-drop of files, so just create a repository and drop your files there for a quick start.

re 2. cannot comment on that, because you do not provide your code . . . maybe you would like to have a look into my Teensy Convolution SDR code, this uses conversion from fixed point to floating point and the FPU of the T3.6: https://github.com/DD4WH/Teensy-ConvolutionSDR

re 3. yes: try the search function, there are several people who have achieved that. Maybe this post is useful: https://forum.pjrc.com/threads/4059...Defined-Radio)?p=129081&viewfull=1#post129081

Have fun with the Teensy 3.6!

Frank
 
1. Here are the new files: FP-Audio-lib
2. Here is the suspicious code:
#include "filter_moog.h"
#include "utility/dspinst.h"



#if defined(KINETISK)

void AudioFilterMoog::update_fixed(const int16_t *in,int16_t *lp)
{
	const float MAX_INT = 32678.0;
	for (int i = 0; i < AUDIO_BLOCK_SAMPLES; i++) {
		float cs = (float)in[i]/MAX_INT;

		lp[i] = (int16_t)cs*MAX_INT;
	}
}


void AudioFilterMoog::update_variable(const int16_t *in,const int16_t *ctl, int16_t *lp)
{
	const float MAX_INT = 32678.0;
	for (int i = 0; i < AUDIO_BLOCK_SAMPLES; i++) {
		float cs = (float)(in[i]/MAX_INT);
		lp[i] = cs*MAX_INT;
	}
}


void AudioFilterMoog::update(void)
{
	audio_block_t *input_block=NULL, *control_block=NULL;
	audio_block_t *lowpass_block=NULL;

	input_block = receiveReadOnly(0);
	control_block = receiveReadOnly(1);
	if (!input_block) {
		if (control_block) release(control_block);
		return;
	}
	lowpass_block = allocate();
	if (!lowpass_block) {
		release(input_block);
		if (control_block) release(control_block);
		return;
	}

	if (control_block) {
		update_variable(input_block->data, control_block->data, lowpass_block->data);
		release(control_block);
	} else {
		update_fixed(input_block->data, lowpass_block->data);
	}
	release(input_block);
	transmit(lowpass_block, 0);
	release(lowpass_block);
	return;
}

#elif defined(KINETISL)

void AudioFilterMoog::update(void)
{
	audio_block_t *block;

	block = receiveReadOnly(0);
	if (block) release(block);
	block = receiveReadOnly(1);
	if (block) release(block);
}

#endif

Header:
#ifndef filter_moog_h_
#define filter_moog_h_

#include "Arduino.h"
#include "AudioStream.h"

class AudioFilterMoog: public AudioStream
{
	float g;
	float q;
	float driv;

public:
	AudioFilterMoog() : AudioStream(2, inputQueueArray) {
		frequency(1000);
		resonance(1);
		drive(1);
	}
	void frequency(float freq) {
		if (freq < 20.0) freq = 20.0;
		else if (freq > AUDIO_SAMPLE_RATE_EXACT/2.5) freq = AUDIO_SAMPLE_RATE_EXACT/2.5;
		g = 1 - expf(-2 * tanf(2 * M_PI * freq/(2 * AUDIO_SAMPLE_RATE_EXACT)));
	}
	void resonance(float qi) {
		if (qi < 0.7) qi = 0.7;
		else if (qi > 5.0) qi = 5.0;
		q=qi;
	}
	void drive(float d) {
		if (d > 10.0f) d = 10.0f;
		if (d < 0.1f) d = 0.1f;
		driv = d;
	}
	virtual void update(void);
private:
	void update_fixed(const int16_t *in,int16_t *lp);
	void update_variable(const int16_t *in, const int16_t *ctl, int16_t *lp);
	audio_block_t *inputQueueArray[2];
};

#endif

3. Thanks for the advice. I'll search the forum.
And yes, I'm having a lot of fun with the Teensy 3.2 and 3.6. BTW: In effect_delay_ext.h I had to make the initialize method public. Otherwise, I can't use the AUDIO_MEMORY_CY15B104 settings and still use the GUI. If you have a better idea, I'm open to suggestions. The AUDIO_MEMORY_CY15B104 is really nice. I'm using this on my digital hang-drum.
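One detail in the update_fixed code above is worth isolating: in the expression (int16_t)cs*MAX_INT, the cast binds more tightly than the multiplication, so cs is truncated to an integer before it is scaled. The sketch below shows the effect with two hypothetical helpers. Whether this was the cause of the bit-crushed sound is not established here, and the scale constant 32768 is an assumption (the posted 32678 may be a transposition of it).

```cpp
#include <stdint.h>

const float kScale = 32768.0f;  // assumed full-scale constant

// Mirrors the posted expression: the cast applies to cs alone, so any
// |cs| < 1 truncates to 0 before the multiplication happens.
inline int16_t castThenScale(float cs) {
    return (int16_t)((int16_t)cs * kScale);
}

// Parenthesized version: scale first, then convert the product.
// No saturation here, so the caller must keep cs * kScale within
// the int16_t range.
inline int16_t scaleThenCast(float cs) {
    return (int16_t)(cs * kScale);
}
```

For a half-scale sample (cs = 0.5) the first form yields 0 and the second yields 16384, which is why the placement of the parentheses matters.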
 
Thanks to Frank, I found the solution: using the FP settings from the given link, the int<->fp conversion now works flawlessly.
 
Are any of these 32-bit floating point forks working on Teensy 4.0? Also there doesn't seem to be F32 implementation for I2S slave mode...
 