Acoustic Echo Cancellation (AEC)



snirtikam
02-04-2016, 11:30 PM
Hello,
I'm hoping to get some help with this. I'm implementing AEC between two Teensy boards using the Audio Shield. To get rid of echo, I have to compare the incoming audio signal from the far-end Teensy with the microphone signal and do some processing (I know how to do that part). What I'm having trouble with is retrieving the sampled data from the microphone for manipulation. I know data is sampled at 44.1 kSps in blocks of 128 samples, which is about 2.9 ms, but I don't understand how to retrieve that data block as double or int data. I've looked over the 'Recorder' example: readBuffer() returns a pointer to a 128-sample block, and then a 512-byte buffer is filled. How can I get that data and maybe cast it as double, or int? Is that possible? Any help or suggestions would be awesome.

Thank you!

PaulStoffregen
02-05-2016, 12:31 PM
The reference documentation for this (and all objects in the audio library) can be found in the design tool's right-side panel.

http://www.pjrc.com/teensy/gui/?info=AudioRecordQueue



I've looked over the 'Recorder' example: readBuffer() returns a pointer to a 128-sample block, and then a 512-byte buffer is filled. How can I get that data and maybe cast it as double, or int...?


readBuffer() gives you an array of 128 int16_t samples.

So you could do something like this:



int16_t *data = queue1.readBuffer();

Serial.println(data[0]); // the first sample
Serial.println(data[127]); // the last sample

queue1.freeBuffer(); // do not use "data" after this!

snirtikam
02-06-2016, 02:48 AM
Thank you very much for your quick response. I've tried your suggestion, but I'm a little confused by the output. The way I had it set up, I'm sampling from the microphone and printing all 128 samples of the array in a 'for' loop. However, many of the printed samples are the same, and there are a bunch of zeros... Not sure what I'm doing wrong. If the queue is filled with samples from the microphone, shouldn't they be somewhat random? Another question: does each sample represent a 16-bit voltage from the ADC, from -1 to +1 volts? A little confused on this... Any help/input is greatly appreciated. Thank you!

my screen capture of serial monitor: (forum attachment 6304)
my code:

//
// Microphone to Speaker output

#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <SD.h>
#include <SerialFlash.h>


// GUItool: begin automatically generated code
AudioInputI2S Mic; //microphone
AudioRecordQueue MicMem; //queue for microphone
AudioOutputI2S headphones; //output
AudioConnection patchCord1(Mic, 0, MicMem, 0);
AudioConnection patchCord2(Mic, 0, headphones,0);
AudioConnection patchCord3(Mic, 0, headphones,1);
AudioControlSGTL5000 sgtl5000_1; //xy=339.6666564941406,285.6666488647461
// GUItool: end automatically generated code
int16_t data1[128];


const int micInput = AUDIO_INPUT_MIC;

void setup() {
Serial.begin(9600);
AudioMemory(20);
sgtl5000_1.enable();
sgtl5000_1.volume(0.5);
sgtl5000_1.inputSelect(micInput);
sgtl5000_1.micGain(36);
delay(1000);
}

void loop() {
MicMem.begin();
int16_t *data = MicMem.readBuffer();

for(int i = 0; i < 128; i++) {
Serial.println(data[i]);
}
MicMem.freeBuffer(); // do not use "data" after this!
MicMem.end();
delay(500);
}

Frank B
02-09-2016, 10:17 PM
It does not help anybody to ask questions twice. It bothers people.

snirtikam
02-09-2016, 10:37 PM
My apologies. Thank you for posting a reply. I've read over the Recorder example, but it is of no help to me because it writes the data to an SD card and plays the .wav. I just want the raw data samples from the buffer, and I can't seem to get that.

Frank B
02-09-2016, 10:44 PM
Yes, it writes the data. So, it gets the data first :)
It shows how to access the data, the usage of MicMem.begin(), MicMem.end(), MicMem.readBuffer(), and much more. Try to "read" it.

AEC will be *much* more difficult than this...

stevech
02-10-2016, 01:08 AM
Is automatic echo cancellation (AEC) normally done the way I'm accustomed to in RF, with adaptive equalization? For each frequency range A to B, amplitudes in that range are inverted and summed to cancel. The hard part, of course, is one-time training for a particular room/auditorium, or real-time adaptation. In RF, such as DOCSIS (and other than OFDM), the adaptive equalizer finds the earliest direct signal in the time span, then finds that same weaker signature at a later time or times. Lots of DSP work. A fixed, non-adaptive equalizer (one-time training, e.g., after the room's arrangement is settled, with sets in place and a simulated audience) lets the equalizer inverse be observed and retained. Much simpler. But some applications need to be adaptive.

In RF, adaptive equalizers reduce the effect of multipath (RF echoes, if you will). This is key to the faster forms of DOCSIS (cable modems) and to some forms of cellular, mostly older TDMA (not the spread spectrum/OFDM in LTE).

I apologize if this seems off the audio theme, but the principles are the same.

snirtikam
02-10-2016, 03:50 AM
Thank you for the information. I'm not quite certain how automatic echo cancellation is done in RF, but what I was referring to is Acoustic Echo Cancellation. The two might be similar. I've implemented real-time AEC between two laptops using the Fast Block Least-Mean-Square (FBLMS) algorithm in MATLAB over a TCP/IP connection. However, MATLAB is not optimized for real-time processing; it is quite slow. There are many other algorithms that could be implemented, but given MATLAB's slower processing, the FBLMS algorithm worked best for real time. My next step was to implement this on an FPGA or some dev board, and Teensy seemed like a good option. As far as the algorithm goes, I know how to implement it and have code written already, but for that I need samples from the microphone, plus the signal coming from the microphone of the other Teensy, so that I can run both through the algorithm and send the result out to the other Teensy.
So far I've been having trouble getting sampled data from the microphone. It has been a while since I've done C/C++ programming in embedded systems, so I'm trying to catch up on a lot of things here. At this point I have two Teensy boards, and I've been able to establish communication between them via WiFi chips (data can be sent both ways).

I've been studying the "Recorder" example, but, as I said, it has been a while with embedded programming, and it's a challenge.

Thanks for the input. I appreciate it.

snirtikam
02-18-2016, 04:38 AM
Hello All,

So I've been able to make a little progress on this project. To test the echo cancellation algorithm I've chosen to use a simple NLMS algorithm. The way I'm testing it is to run the setup on one Teensy and send the mic signal to both inputs of the NLMS algorithm, far side and near side. The code works, but the problem I'm having is that I'm not sure how to make it faster, or keep up with the Teensy's sampling. I've posted the code below.

Basically, because of the 'for' loops, the NLMS algorithm takes a long time to process one block of audio samples. One block is about 2.9 ms of time, but running the code in real time takes longer. As a result, the output to the speaker is choppy, because the NLMS code has to complete before audio sampling starts again. I'm looking for some advice on how to implement it smarter, or maybe I'm approaching this the wrong way. Any advice would be greatly appreciated.




//
// Microphone to Speaker output
#define NUMEL(x) (sizeof(x)/sizeof((x)[0])) // returns the number of elements in an array

//#include "NLMS.h"
#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <SD.h>
#include <SerialFlash.h>

//*****functions**********
void getBuffer1(int x);
void playData(int16_t *data, int i);
void displayData(int16_t *data);
void NLMS_AEC(int16_t *Mic, int16_t *x);

// GUItool: begin automatically generated code
AudioInputI2S Mic; //microphone
AudioPlayQueue MicMem; //queue for microphone
AudioRecordQueue Memory;
AudioOutputI2S headphones; //output
AudioConnection patchCord1(Mic, 0, Memory, 0);
AudioConnection patchCord2(MicMem, 0, headphones,0);
AudioConnection patchCord3(MicMem, 0, headphones,1);
AudioControlSGTL5000 sgtl5000_1; //xy=339.6666564941406,285.6666488647461

const int numBlocks = 2; // minimum number of blocks is 2!
const int lFilt = numBlocks-1;
const int gg = lFilt*128;
int16_t *pAddr; //pointer to 128 block of getBuffer
int16_t lblock = numBlocks*128; //length of arrays
int16_t ablock[128*numBlocks]; // set an array block
int16_t *pablock; // pointer to an array block
int16_t *pp; // pointer to getBuffer
int n;
unsigned long time1; // time variable


//***************NLMS constants and settings*************************
const int16_t numTaps = gg;
float mu = 0.0001;
float mu0 = 0;
float psi = 0.0001;
float w[numTaps];
float *pw; // pointer to w
float yhat = 0;
float xtdl = 0;


//define input/source
const int micInput = AUDIO_INPUT_MIC;

void setup() {
memset(w,0,sizeof(w)); // memset counts bytes, so use sizeof, not the element count
memset(ablock,0,sizeof(ablock));
//initialize pointers for NLMS
pw = w;
Serial.begin(115200); // initiate baud rate for serial comm.
AudioMemory(8); // allocate # of audio memory (each block has 128 sample of data)
sgtl5000_1.enable(); // enable
sgtl5000_1.volume(0.5); // output volume
sgtl5000_1.inputSelect(micInput); //input selection (mic or line in)
sgtl5000_1.micGain(36); // mic gain
pablock = ablock; // assign pointer to array
delay(1000);
}


//*************************** MAIN LOOP ***************************
void loop() {
getBuffer1(numBlocks); //returns pointer to the buffered data
//print sample to serial plotter
//displayData(pablock); // print data to serial plotter
NLMS_AEC(pablock, pablock); // near-end and far-end inputs (same mic signal here, for testing)
}



//*****************************************************************
// print data to serial plotter
void displayData(int16_t * data){
for(int i = 0; i<128*numBlocks;i++){
Serial.println(*(data+i));
}
}
// play data to speaker
void playData(int16_t *data, int i){
pAddr = MicMem.getBuffer(); //get pointer to 128 array playBuffer
// take one block from buffer array (ablock) and copy it to playBuffer pointer
// play that block
memcpy((byte*)pAddr,(byte*)data+(256*i),256);
MicMem.playBuffer();
}

// buffer an array with samples from Microphone
// function receives and array and integer indicating number of blocks
void getBuffer1(int x) {
Memory.begin(); // begin putting sample data into memory
int l = 0;
memmove((byte*)pablock,(byte*)pablock+256,256*(x-1)); // shift older blocks down; memmove allows overlapping regions
// put 128 blocks of sample into an array size of 128*(number of blocks)
while(l == 0){ // wait until block samples are available
if(Memory.available() >= 1) {// if a block is available, copy it to the array and exit the while loop
//read one 128 block of samples and copy it into array
memcpy((byte*)pablock+(256*(x-1)),Memory.readBuffer(),256);
Memory.freeBuffer(); // free buffer memory
//NLMS_AEC(pablock, pablock);
playData(pablock,(x-1)); // play 128 block from buffered array
l = 1; // set n to 1 to get out of while loop
}//end if
}//end while
l = 0;
//Memory.clear(); // clear all audio memory
Memory.end(); // end memory capture
// return ablock; // return pointer to array
}

// NLMS algorithm
// inputs are the mic signal and the far-end signal
// one filter length of past samples is used to process the current block
void NLMS_AEC(int16_t *Mic, int16_t *x){
for(int h = 0; h<128;h++){
for(int j = gg; j >0;j--){
yhat = yhat+x[j+h]*pw[gg-j];
}
pablock[h+gg] = (int16_t)(Mic[gg+h]-yhat);
yhat = 0;

for(int j = gg; j >0;j--){
xtdl = xtdl+x[j+h]*x[j+h];
}
xtdl = xtdl + psi;
mu0 = mu/xtdl;
xtdl = 0;

//update filter taps
for(int j = 0; j<gg;j++){
pw[j] = pw[j] + x[gg-j+h]*mu0*pablock[h+gg];
}//end for
}//end for outer
}// end NLMS_AEC

PaulStoffregen
02-19-2016, 07:34 PM
You're probably going to have to replace these floating point numbers with fixed point.



float mu = 0.0001;
float mu0 = 0;
float psi = 0.0001;
float w[numTaps];
float *pw; // pointer to w
float yhat = 0;
float xtdl = 0;


Before beginning this journey, you really should verify the math is correct with the floats. After converting to fixed point, it's much harder to change the math. In fact, some sorts of algorithm changes are probably best done by going back to floats, and then redoing the float-to-fixed conversion after you're sure the algorithm is correct.

The first step is determining the numerical ranges your floats actually use. The w[] array is probably the most important. The two ways are to carefully analyze the equations, or to just run a large amount of real data and collect min/max values with a little extra code. In extreme cases, you might log all the numbers and plot a histogram, to learn whether the numbers are uniformly distributed or whether there's some unusual usage, like really large numbers in some cases and very small ones in others.

After you know the range, you choose an integer format that will represent the number. For example, if you learn those w[] coefficients range from -1.7 to +1.7 with fairly uniform distribution, perhaps you'd choose a 3.13 format. The top 3 bits of an int16_t represent the whole-number part (-4 to +3 in two's complement), and the bottom 13 bits represent the fraction. So the number 1.4 would be 001 in the top 3 bits, and 0110011001101 (integer 3277) in the lower 13 bits. To understand the fractional part, consider that 13 bits give 8192 steps. Divide 3277 by 8192 and you get 0.4000244. Or if you divide the whole 16-bit number (11469) by 8192, you get 1.4000244. Of course, you can scale by any arbitrary integer, but powers of 2 are commonly used so that certain bits are the whole-number part and the other bits are the fraction part.

Generally you'd try to use 16 bit numbers for the coefficients, if the round-off errors are tolerable. Usually you'd use int32_t or even int64_t for calculated values. For example:



yhat = yhat+x[j+h]*pw[gg-j];


For this code, you'd almost certainly use int64_t for yhat. Each multiply produces an int32_t, and when you add 128 of them together the result can easily become larger than 32 bits. Fortunately, Teensy has fast instructions for accumulating into 64-bit integers.

Then after the loop, you'd add code to convert the 64 bit result to something more useful. Often the conversion is as simple as right shifting to throw away the lower bits.

Usually when doing this sort of conversion, it helps to keep the float code and do both calculations. That's even slower, but the idea is you can do them both and add a little code to compare if they ended up with the same answer (plus or minus a tiny extra error due to round-off issues). Then when it's all working nice, delete or comment out the float code.

After the integer stuff is working, there's a second round of possible optimizations you can do with the Cortex-M4 DSP extensions. But that's quite complex, so usually your first goal should be converting from float to all int in the simplest way you can.

Hopefully this helps?

Do you believe this echo cancel might be worth contributing to the audio library? Is the code open source?

snirtikam
02-20-2016, 04:00 AM
Hello,

Thank you for your response. I've tried your suggestion, experimenting with int16_t and int32_t, and it cuts the processing time from 60 ms to about 33 ms. That is with a filter length of 640, which is about 5 blocks, roughly 14.5 ms. The filter length comes from assuming the device is in a 12x12-foot room: given the speed of sound, the return time from the speaker back to the microphone (bouncing off walls) will be about 14 ms. That is a minimum; typical filters have lengths of more than 1024, depending on the environment. So I still have a bit of a bottleneck... One way I thought of dealing with this is to use a lower sampling rate. Typical speech doesn't require 44.1 kSps, so I thought to cut that to 10.025 kSps... I'm not sure how easy that would be to do in the Audio Shield library/hardware files. I was looking through the SGTL5000 files in the audio library, and there is something about setting the sampling rate and clock rate, but I'm not sure if it would be that simple.

As far as the code, I'm assuming you are asking about the NLMS, or just overall: it's open source. And as far as contributing to the audio library, I think that would be good. I'll be happy to share anything I have.

PaulStoffregen
02-20-2016, 06:00 AM
Ok, that's good progress. We only need another 5X speedup! Let's look at the DSP extensions first. :)

If this is going to be an open source contribution, and if I'm going to work on the DSP instruction speedups, I really need you to put this on github. If you've never used git & github before, there's lots of beginner tutorials. Especially git has some incredibly powerful but complex features, which you don't need. I use it every day and I don't have much idea how all that stuff works. All you really need to know is how to create a free account, create a new repository with your code, and then how to push changes from your PC to github, and how to pull contributions (like DSP speedups from me) you've merged on github down to your PC.

Well, the one obvious caveat to collaborating on GitHub is to push your changes up regularly, even if they're unfinished, so I'm not trying to work with an old version. Of course, also merge pull requests and pull the changes into your copy before starting new work, so you're working on the latest code too.

Hopefully you can do a couple simple things in the code...

1: If possible, please try to create at least one test that uses recorded or synthesized signals rather than live input. As I try to convert parts of the code to use the DSP extensions, having tests we can both run that produce known results will really help. It'll also help as an example to add to the library.

2: Please add an open source license. The Teensy Audio Library is MIT licensed, so it really needs to be MIT or something very close. Use your name, and the names of anyone else who wrote any substantial part of the code (of course with their consent to publish it under this open source license). If possible, please also add links or references to the info you used for the algorithm. As part of an open source library that people will very likely use in non-profit and commercial projects, there needs to be at least some documentation that the author(s) allow the code to be open source, and, in the case of complex algorithms, of how they were developed. If there are any published papers, articles, or books older than 21 years, a reference to such "prior art" is really nice. No need to create a detailed bibliography; URLs or casual references are fine.

Let me know when it's up on github? I'll take a look and see if I can apply the DSP extensions....

snirtikam
02-21-2016, 04:52 PM
Hello Paul,

I have heard of GitHub but have never used it myself. My partner, with whom I work on this, does have an account, so we'll be able to provide that to you. This is my first time doing open source licensing, or the MIT license, so I'll read up on that to get more familiar with your instructions.

Some success though....
I followed your advice from the earlier post about getting rid of floats, and I'm happy to say that I was able to convert everything to types int16_t and one int64_t. The conversion sped up the processing time drastically. I implemented the changes in the NLMS and tested it; happy to say it works in real-time processing. However, that is only with a filter length of 128. Increasing the filter length makes the output choppy due to the delay in processing. Below are my code and plots of the ranges of variables within the NLMS algorithm.

here is my updated code (I will work on getting the github going, promise)



//
// Microphone to Speaker output
#define NUMEL(x) (sizeof(x)/sizeof((x)[0])) // returns the number of elements in an array

#include "NLMS.h"
#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <SD.h>
#include <SerialFlash.h>

//*****functions**********
void getBuffer1(int x);
void playData(int16_t *data, int i);
void displayData(int16_t *data);
void NLMS_AEC(int16_t *Mic, int16_t *x);

// GUItool: begin automatically generated code
AudioInputI2S Mic; //microphone
AudioPlayQueue MicMem; //queue for microphone
AudioRecordQueue Memory;
AudioOutputI2S headphones; //output
AudioConnection patchCord1(Mic, 0, Memory, 0);
AudioConnection patchCord2(MicMem, 0, headphones,0);
AudioConnection patchCord3(MicMem, 0, headphones,1);
AudioControlSGTL5000 sgtl5000_1; //xy=339.6666564941406,285.6666488647461

const int numBlocks = 2; // minimum number of blocks is 2!
const int lFilt = numBlocks-1;
const int gg = lFilt*128;
int16_t *pAddr; //pointer to 128 block of getBuffer
int16_t lblock = numBlocks*128; //length of arrays
int16_t ablock[128*numBlocks]; // set an array block
int16_t error[128];
int16_t *Perror;
int16_t *pablock; // pointer to an array block
int16_t *pp; // pointer to getBuffer
int n;
unsigned long time1; // time variable


//***************NLMS constants and settings*************************
const int16_t numTaps = gg;
int16_t mu = 13;
int16_t mu0 = 0;
int8_t psi = 1;
int16_t w[numTaps];
int16_t *pw; // pointer to w
int16_t yhat = 0;
int64_t xtdl = 0;


//define input/source
const int micInput = AUDIO_INPUT_MIC;

void setup() {
memset(w,0,sizeof(w)); // memset counts bytes, so use sizeof, not the element count
memset(ablock,100,sizeof(ablock));
memset(error,0,sizeof(error));
//initialize pointers for NLMS
pw = w;
Perror = error;
Serial.begin(9600); // initiate baud rate for serial comm.
AudioMemory(8); // allocate # of audio memory (each block has 128 sample of data)
sgtl5000_1.enable(); // enable
sgtl5000_1.volume(0.5); // output volume
sgtl5000_1.inputSelect(micInput); //input selection (mic or line in)
sgtl5000_1.micGain(26); // mic gain
pablock = ablock; // assign pointer to array
delay(1000);
}


//*************************** MAIN LOOP ***************************
void loop() {
//Serial.println(millis()); // print time
getBuffer1(numBlocks); //returns pointer to the buffered data
//print sample to serial plotter
//displayData(pablock); // print data to serial plotter
//NLMS_AEC(pablock,pablock);
//for(int i = 0; i<128;i++){
// Serial.println(w[i]);
//}
}
//*****************************************************************
// print data to serial plotter
void displayData(int16_t * data){
for(int i = 0; i<128*numBlocks;i++){
Serial.println(*(data+i));
}
}
// play data to speaker
void playData(int16_t *data, int i){
pAddr = MicMem.getBuffer(); //get pointer to 128 array playBuffer
// take one block from buffer array (ablock) and copy it to playBuffer pointer
// play that block
memcpy((byte*)pAddr,(byte*)data+(256*i),256);
MicMem.playBuffer();
}

// buffer an array with samples from Microphone
// function receives and array and integer indicating number of blocks
void getBuffer1(int x) {
Memory.begin(); // begin putting sample data into memory
int l = 0;
memmove((byte*)pablock,(byte*)pablock+256,256*(x-1)); // shift older blocks down; memmove allows overlapping regions
// put 128 blocks of sample into an array size of 128*(number of blocks)
while(l == 0){ // wait until block samples are available
if(Memory.available() >= 1) {// if a block is available, copy it to the array and exit the while loop
//read one 128 block of samples and copy it into array
memcpy((byte*)pablock+(256*(x-1)),Memory.readBuffer(),256);
Memory.freeBuffer(); // free buffer memory
// NLMS_AEC(pablock,pablock);
// playData(Perror,0); // play 128 block from buffered array
l = 1; // set n to 1 to get out of while loop
}//end if
}//end while
NLMS_AEC(pablock,pablock);
playData(Perror,0); // play 128 block from buffered array
l = 0;
Memory.clear(); // clear all audio memory
Memory.end();
// return ablock; // return pointer to array
}

// NLMS algorithm
// inputs are the mic signal and the far-end signal
// one filter length of past samples is used to process the current block
void NLMS_AEC(int16_t *Mic, int16_t *x){
for(int h = 0; h<128;h+=1){
for(int j = gg; j>0;j-=1){
yhat += (x[j+h]*pw[gg-j])/32768;
xtdl += x[j+h]*x[j+h];
}
error[h] = Mic[gg+h]-yhat;
yhat = 0;
xtdl = xtdl + psi;
mu0 = (67108864*mu)/xtdl;
xtdl = 0;

//update filter taps
for(int j = 0; j<gg;j+=1){
pw[j] = pw[j] + (x[gg-j+h]*mu0*error[h])/131072;
}//end for
}//end for outer
}// end NLMS_AEC

(forum attachments 6437 and 6438: plots of the variable ranges)

PaulStoffregen
02-21-2016, 05:14 PM
That's great news!

I have some ideas for this code which should give you another large speed increase.

Github recently added the ability to just drag-and-drop files from your PC into the web page, so no software install needed. Very easy. It's free to sign up and create public projects (they only charge if you want to keep your data private - like a company developing code they don't share). I know it's some trouble to learn yet another website, but this really is important for sharing the code.

Let me know when it's on github?

Also, please try to make a copy that plays a pre-recorded sample. That way I can test with the exact same sound. For an example, look at Duff's NoteFrequency program:

https://github.com/PaulStoffregen/Audio/tree/master/examples/Analysis/NoteFrequency

Those other files are sound clips, converted to C code using the wav2sketch program.

https://github.com/PaulStoffregen/Audio/tree/master/extras/wav2sketch

PaulStoffregen
02-21-2016, 06:03 PM
I believe the DSP extensions can make a massive speed increase for these 2 loops:



for (int j = gg; j > 0; j -= 1){
yhat += (x[j+h]*pw[gg-j])/32768;
xtdl += x[j+h]*x[j+h];
}


The yhat variable will become int64_t (yes, that will actually be faster), with the 15-bit shift done only once after the loop.



for(int j = 0; j<gg;j+=1){
pw[j] = pw[j] + (x[gg-j+h]*mu0*error[h])/131072;
}


The DSP extensions can be used with 32-bit data packing, the successive-memory-fetch optimization, and partial loop unrolling, which gets the data from the arrays much faster, and then does the math much faster. The chip has 2-way SIMD which can be used in this case. The details are complicated... so it's important to get this code correct before making the difficult conversion to the DSP extensions. It will make a tremendous speed increase!

Let me know when this is on github with a MIT license, and I'll work on the speedup part.

snirtikam
02-22-2016, 01:48 AM
Hi Paul,

We got the GitHub going, and added the files and license. Here is the link: https://github.com/kfillerup/NLMS_AEC

I also added an NLMS.h file with the algorithm code. I'm working on the sound file. Is there an example of how to use the wav2sketch.c file? I use MATLAB a lot, so I can make a speech recording with it at different sampling rates and/or format types.

Thanks.

PaulStoffregen
02-22-2016, 04:35 AM
wav2sketch is a command-line program, so you must run it in a terminal (Mac, Linux) or command prompt (Windows). First, use "cd" commands to change to the directory where you have the WAV files for your sound clips. The WAV files must be 44, 22, or 11 kHz. Then type either "wav2sketch" or "wav2sketch -16". The first uses ulaw compression; adding "-16" preserves the 16-bit data, but at the cost of larger memory usage. It will read all the WAV files in the current directory and convert them all to .c and .h files.

Frank B
02-22-2016, 01:28 PM
"gg" is 128 (minimum), and there are two loops over it, which means 2*128 = 256 iterations per sample, or 128*256 (= 32768) per block. Is there a way to save some iterations?

snirtikam
02-22-2016, 03:45 PM
'gg' is the index of the last block in the 'Mic' and 'x' arrays. Say we have a filter length of 128; this means 'Mic' and 'x' need to be of length 256 and gg = 128. If the filter length is 1024, then gg = 1024 and 'Mic' will have to be of length 1024+128. The outer 'for' loop in the NLMS always calculates one block of data, an array of 128 samples. That block is calculated using past samples, and the length of the past-sample history has to equal the filter length. The block is stored in the error array and sent out to the speaker (in my case), or to the far-side speaker (the speaker of the other Teensy). The error array will always be of length 128, but the filter length will change, which changes the lengths of the 'Mic' and 'x' arrays. So I don't think there is a way of avoiding iterations; what I have in the NLMS is the minimum needed to calculate the error array. Now, as far as I understand, some optimizations can be done to make this more efficient with the DSP instructions.

PaulStoffregen
02-22-2016, 08:00 PM
At 96 MHz, we reach 100% CPU usage at approximately 2000 cycles per sample. So ultimately that's going to impose a limit on how many iterations are possible.

I'm going to make a first attempt at using the DSP extensions, so we can see how many cycles/iteration are really needed...

Frank B
02-22-2016, 08:00 PM
perhaps unrolling the outer (variable h) loop.. by two or four

Frank B
02-22-2016, 08:17 PM
Perhaps unrolling the outer (variable h) loop by two or four could help, even if only to use combined loads and stores (which are more efficient).

for example


for(int j = gg; j>0;j-=1){
yhat += (x[j+h]*pw[gg-j])/32768;
xtdl += x[j+h]*x[j+h];
}

could do 2 or more iterations of h (correct me)


int32_t pwt, xt, xt1;
int32_t yhat1 = 0;
int64_t xtdl1 = 0; // second accumulator pair for the h+1 sample
for(int j = gg; j>0;j-=1){
pwt = pw[gg-j];
xt = x[j+h];
xt1 = x[j+h+1]; //combined reads save CPU cycles on ARM Cortex-M4
yhat += (xt*pwt)/32768;
yhat1 += (xt1*pwt)/32768;
xtdl += xt*xt;
xtdl1 += xt1*xt1;
}


Then, it may be more efficient to calculate (67108864*mu) only once, not in every iteration (and of course xtdl and the others should be kept in temporary local variables).

edit: and doing the loop above backwards with do..while saves 1 cycle per iteration (= 128*128 cycles for a block)
edit: gcc does no automatic loop unrolling with -Os or -O1, not even automatic inlining...

PaulStoffregen
02-22-2016, 08:40 PM
Would this algorithm still work well enough if the w[] array were half the size and each filter tap applied to 2 audio samples?

Edit: some good recorded test cases would really help for collaborating....

snirtikam
02-22-2016, 08:55 PM
Hi Frank,
Thank you for the suggestion. I'm not sure, though, that it will work for the AEC part. It might be efficient as far as saving iterations, but not for the algorithm. In the code

pw[j] = pw[j] + (x[gg-j+h]*mu0*error[h])/131072;

mu0 is the result of the multiplication (mu*67108864)/xtdl, where xtdl is computed from signal 'x', and it has to be computed before it is used in the filter update equation; otherwise the NLMS algorithm won't work. So xtdl = x[gg:-1:0]'*x[gg:-1:0], where gg is the length of the filter and ' is the transpose of vector 'x'. This, again, has to be calculated before being used in the filter tap weight ('pw') update.

Another thought I had on how to buy some time is to do the NLMS calculations in between sample acquisitions, and also to run at a slower sampling rate. Here is what I mean. Currently we have 1 block of data being filled, which equals 2.9 ms of time. With a block of 128 we have 22.67 microseconds between samples. We could run the NLMS algorithm between each sample, and the outer 'for' loop would run only once instead of 128 times. We can also buy more time by sampling at a lower rate. For example, if we sample at 10.025 kSps we'll have 99.75 microseconds between each sample, and running the NLMS for just 1 sample will give us the needed time. I'm not sure how difficult it would be to do that, but at 10.025 kSps 1 block of data will be ready every 12.76 ms and the NLMS will run for a single sample instead of 128. Just a thought.

PaulStoffregen
02-22-2016, 09:06 PM
For example, can something like this work?



for (int h = 0; h < 128; h += 2) {

  // compute yhat & xtdl
  for (int j = gg; j > 0; j -= 1) {
    int16_t xavg = (x[j+h] + x[j+h+1]) / 2; // use 2 audio samples
    yhat += (xavg*w[gg-j])/32768;
    xtdl += xavg*xavg;
  }

  error[h] = Mic[gg+h] - yhat;     // apply same yhat to 2 samples
  error[h+1] = Mic[gg+h+1] - yhat;
  xtdl = xtdl + psi;
  mu0 = (67108864*mu)/xtdl;

  // update filter taps
  for (int j = 0; j < gg; j += 1) {
    int xavg = (x[gg-j+h] + x[gg-j+h+1]) / 2;
    int erravg = (error[h] + error[h+1]) / 2;
    w[j] = w[j] + (xavg*mu0*erravg)/131072;
  }
}


Can this allow "gg" to be half the size?

Frank B
02-22-2016, 09:10 PM
mu: I meant move only the multiplication outside, since it is constant inside that loop. This will not help much, but all of this is very heavy, and every saved cycle helps :)

Could you upload a wavefile, please ?

Frank B
02-22-2016, 09:23 PM
and also running at slower sampling rate.
Would simple downsampling ((sample1+sample2)/2) work?
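For what it's worth, that pairwise average is easy to express (a sketch with a made-up name; note that plain averaging is only a weak anti-alias filter, so a proper decimator would low-pass first):

```c
#include <stdint.h>

/* 2:1 decimation by pairwise averaging: out receives n samples from 2*n inputs.
   The average acts as a crude low-pass before the rate reduction. */
void downsample2(const int16_t *in, int16_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (int16_t)(((int32_t)in[2*i] + (int32_t)in[2*i+1]) / 2);
}
```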

linuxgeek
02-22-2016, 11:19 PM
I use MATLAB a lot, so...

In case it's useful to anyone, Octave is a drop-in replacement for MATLAB. As long as there is no GUI stuff, things mostly just work. Octave is slower than MATLAB, but in case someone needs to run a MATLAB function to verify something, Octave (https://www.gnu.org/software/octave/) would likely work.

PaulStoffregen
02-23-2016, 12:03 AM
Ok, I've made a first attempt at applying the DSP extensions to one part of this. Since there's no test cases, I don't really have any way to verify if I got all the math right.

Anyway, this original code:



for (int j = gg; j > 0; j -= 1) {
  // j from 128 to 0
  // j+h from h+128 to h
  // gg-j from 0 to 128
  yhat += (x[j+h]*w[gg-j])/32768;
  xtdl += x[j+h]*x[j+h];
}


gets replaced by this:



uint32_t *ww = (uint32_t *)(w);
uint32_t *wend = (uint32_t *)(w + gg);
uint32_t *xx = (uint32_t *)&(x[128 + h - 4]);
int64_t yhat64 = 0;
do {
  // TODO: avoid slow unaligned access when h is odd
  uint32_t w1 = *ww++; // bring in 4 values from w[]
  uint32_t w2 = *ww++;
  uint32_t x1 = *xx++; // bring in 4 values from x[]
  uint32_t x2 = *xx++;
  xx -= 4;             // step back 2 words = 4 samples for the next group
  // w order: w1b, w1t, w2b, w2t
  // x order: x2t, x2b, x1t, x1b
  yhat64 = multiply_accumulate_16tx16b_add_16bx16t(yhat64, x2, w1);
  yhat64 = multiply_accumulate_16tx16b_add_16bx16t(yhat64, w2, x1);
  xtdl = multiply_accumulate_16tx16t_add_16bx16b(xtdl, x1, x1);
  xtdl = multiply_accumulate_16tx16t_add_16bx16b(xtdl, x2, x2);
} while (ww < wend);
yhat = yhat64 / 32768;


Ideally, the outer loop would be changed to process 2 samples, so we can avoid the slow case where h is odd.

Before trying more DSP extension optimization, I really believe we should consider running the filter with 1 tap for every 2 audio samples. That should allow it to be run half as often and require only half as many taps. Presumably echoes don't occur on the time scale of 1/44100th of a second, so hopefully this won't affect the algorithm's performance too much?

Reducing the filter this way will also let the DSP extensions work much better, because processing 2 samples at once fits nicely with the 32 bit memory bus, avoiding the h = odd case that's much slower than when h is even.

snirtikam
02-23-2016, 12:47 AM
I'm not sure that this will be correct for NLMS. I will try to run it and see if it works. Thank you for the update. I didn't get a chance to upload the sound sample, but hopefully by tonight I will. Thank you again.

PaulStoffregen
02-23-2016, 12:53 AM
In theory, that code should perform exactly the same math. Well, except the divide by 32768 is moved outside the loop, so perhaps very slight differences from round off errors. Otherwise, it should be identical.

The DSP extension version just loads 4 values of w[] and 4 values of x[] into packed 32 bit registers. Then it does the exact same math, but using the special instructions which do two 16x16 multiplies and a 64 bit accumulate. Instead of index variables, pointers are used to sweep through the data.

Of course, it's entirely possible I made a mistake. This is utterly untested. But absent any mistakes, this is supposed to be exactly the same math. The code looks very different, and it should run much faster, but the results are supposed to be identical.
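If it helps to see what those intrinsics compute, here is a plain-C model of the two packed-MAC operations as I understand them from the description above: each call does two 16x16 multiplies and adds both products into the accumulator. The function names here are my own stand-ins, not the library's:

```c
#include <stdint.h>

/* Model of the "16t x 16b add 16b x 16t" form: multiply the cross halves
   of the two packed 32-bit words and add both products to the 64-bit sum. */
int64_t mac_cross(int64_t sum, uint32_t a, uint32_t b)
{
    int16_t at = (int16_t)(a >> 16), ab = (int16_t)(a & 0xFFFF);
    int16_t bt = (int16_t)(b >> 16), bb = (int16_t)(b & 0xFFFF);
    return sum + (int32_t)at * bb + (int32_t)ab * bt;
}

/* Model of the "16t x 16t add 16b x 16b" form: top*top plus bottom*bottom. */
int64_t mac_straight(int64_t sum, uint32_t a, uint32_t b)
{
    int16_t at = (int16_t)(a >> 16), ab = (int16_t)(a & 0xFFFF);
    int16_t bt = (int16_t)(b >> 16), bb = (int16_t)(b & 0xFFFF);
    return sum + (int32_t)at * bt + (int32_t)ab * bb;
}
```

On the Cortex-M4 each of these compiles to a single instruction, which is where the speedup comes from.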

snirtikam
02-23-2016, 04:11 AM
Hello Paul, thank you for the improvements to the algorithm. I've uploaded the sound sample to GitHub. Hope it will help with testing.

Thanks

PaulStoffregen
02-23-2016, 04:45 AM
That file is much too large to fit into Teensy's internal flash.

Can you upload the original WAV file?

snirtikam
02-23-2016, 05:03 AM
Sorry about that. Didn't realize it would be so big. I thought it would fit on the SD card. Anyway, I've uploaded the original .wav file.

PaulStoffregen
02-23-2016, 05:09 AM
I'm listening to the file now. I do not hear any echo.

snirtikam
02-23-2016, 05:43 AM
That file contains no echo, but if you feed it as input for both 'Mic' and 'x', then error should output nothing, or close to nothing. Basically, we are making sure that NLMS is working and eliminating the signal. The way I've tested the NLMS algorithm on Teensy is with the sketch I've uploaded to GitHub. In that sketch I'm taking the signal from the microphone and sending it to NLMS as both 'Mic' and 'x'. The error array is the one I'm sending to be played. If NLMS is working, then speaking into the microphone should not produce any output in the speaker or headphones. This setup is one-sided and just shows proof of concept that NLMS is working. The next step would be to connect two Teensys together; in that case, 'Mic' will take input from the microphone, 'x' will be the incoming signal from the other Teensy (which will also be played to the speaker), and error will be sent to the speaker of the other Teensy. Hopefully that makes sense.

6448

PaulStoffregen
02-23-2016, 06:13 AM
Oh, I misunderstood the state of this code's development. I assumed it was already known to be working and only in need of optimization. Looks like I need to undo the optimization which removed the redundant input pointer!

PaulStoffregen
02-23-2016, 06:24 AM
Ok, I've restored the Mic input pointer.

Perhaps it's much too early to consider applying the DSP extensions. As you can see from the portion I already did, the DSP extensions require complicated code. Normally this is only done after the simpler, normal-style C code has been thoroughly tested on non-realtime data.

PaulStoffregen
02-23-2016, 06:37 AM
Next step would be to connect two teensys together....
6448

I'm afraid I must disagree. That's an insane plan for debugging this code!

The next step would be composing several WAV files with varying types of echo. Then test the NLMS code on all of them, initially by reading a file and writing another file, before the code is optimized enough to run in real time.

Optimization for speed should be attempted only after the code is very thoroughly tested for correctness!

The next step is most certainly NOT to start running poorly tested code on 2 devices separated by substantial distance and operated by 2 people! That's a certain recipe for frustration and (probably) never getting this working.

snirtikam
02-23-2016, 06:54 AM
I was getting way ahead of myself. I agree, testing between two Teensys is the last step. One way to simulate echo would be to put the audio file through a delay of a few milliseconds, send that to 'Mic' in the NLMS algorithm, and send the original to 'x'. That would be a good way to simulate echo.
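That simulation is simple to generate offline. A sketch (my own helper, not library code) that adds one attenuated, delayed copy of the signal; for instance d = 132 samples is about 3 ms at 44.1 kHz:

```c
#include <stdint.h>

/* mic[i] = x[i] + gain * x[i-d]: the far-end signal plus one echo,
   delayed by d samples and attenuated, saturated to the int16 range. */
void add_echo(const int16_t *x, int16_t *mic, int n, int d, float gain)
{
    for (int i = 0; i < n; i++) {
        int32_t s = x[i];
        if (i >= d)
            s += (int32_t)(gain * x[i - d]);
        if (s > 32767) s = 32767;
        if (s < -32768) s = -32768;
        mic[i] = (int16_t)s;
    }
}
```

Running NLMS with this mic[] against the original x[] should then leave only the near-silent residual in error[].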

PaulStoffregen
02-23-2016, 07:29 AM
For the sake of testing, I believe you should create several small WAV files with various echoes already encoded into the data. You can use Audacity or any number of other sound editing programs to create these sound files. Or you could even do it in Matlab and export the data (somehow).

At the very least, you should create some files with multiple copies of the original echoed at different times. At least a couple of files where the echoes have been low-pass filtered would also be a good plan.

Even better would be to physically use a tightly closed room with hard walls and record real sound with natural room echo. Many inexpensive sound recording devices exist, and if you don't have access to any, you can probably use a modern iPhone or Android to make a real recording with similar or better fidelity than you'll get from inexpensive microphones (eg, on the Teensy audio shield).

Sound clips should be limited to 4 or 8 seconds, so they'll fit into Teensy's internal flash memory with ulaw encoding at a 44 or 22 kHz rate. For testing, the same file can be played over and over. Small files that fit entirely in Teensy's memory will make testing simpler, since they'll avoid the large overhead of SPI communication to the SD card or external flash chip.
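As a sanity check on those limits (my own arithmetic, assuming mono u-law at one byte per sample): 8 seconds at 22.05 kHz is 176,400 bytes, which can fit in a Teensy 3.2's 256 KB flash alongside the code, while 8 seconds at 44.1 kHz is 352,800 bytes and cannot.

```c
/* Flash budget for a mono u-law clip: 1 byte per sample. */
unsigned long clip_bytes(unsigned long rate_hz, unsigned long seconds)
{
    return rate_hz * seconds;
}
```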

Recording some good test sounds is relatively easy compared to developing these algorithms and testing and optimizing the code. So don't skimp on this important step. The better your test clips, the better your odds of getting the code working without unpleasant surprises later. And by unpleasant surprise, I mean a critical flaw that forces you to go back almost to the beginning and redo nearly everything!

EEKFILL
02-23-2016, 06:47 PM
The idea of the NLMS algorithm isn't to remove the echoes and allow the original speech through. It is to eliminate the speech and the echoes caused by the response of the room. This is mainly used for teleconferencing systems.

Consider the following diagram:

6451

Without cancelation the user on the far end hears their voice.

6452

The AEC NLMS algorithm removes this voice.

6453


The NLMS algorithm is basically adapting its coefficients, taking steps of size mu, until it minimizes the error. You want the output of the system e[n] to be 0. This means that the difference between the output of your adaptive filter and the signal at the mic is 0.

6450



That being said, I think recording speech data from a typical room and using it to see whether the algorithm is working is a good idea. The results we would be looking for from this experiment are what snirtikam described before: we wouldn't want to hear any speech from the wav file in the headphones.


https://en.wikipedia.org/wiki/Echo_suppression_and_cancellation

https://en.wikipedia.org/wiki/Least_mean_squares_filter

Frank B
02-23-2016, 07:10 PM
We have an example-wavefile here (dl the zip in the first post):

https://forum.pjrc.com/threads/32623-I-just-can-t-get-my-files-to-play-(Teensy-3-2-and-audio-board)?p=94492&viewfull=1#post94492