Forum Rule: Always post complete source code & details to reproduce any issue!

1. mu: i meant, move only the multiplication outside , since it is constant inside that loop. this will not help much, but this is all very heavy, and every saved cycle helps

2. Originally Posted by snirtikam
and also running at slower sampling rate.
Would simple downsampling ((sample1+sample2)/2) work ?

3. Originally Posted by snirtikam
I use MATLAB a lot, so...
In case it's useful to anyone, octave is a drop-in replacement for matlab. As long as there is no gui stuff, things mostly just work. octave is slower than matlab, but in case someone needs to run a matlab function to verify something, octave would likely work.

4. Ok, I've made a first attempt at applying the DSP extensions to one part of this. Since there's no test cases, I don't really have any way to verify if I got all the math right.

Anyway, this original code:

Code:
```    for(int j = gg; j > 0; j-=1) {
// j from 128 to 0
// j+h from h+128 to h
// gg-j from 0 to 128
yhat += (x[j+h]*w[gg-j])/32768;
xtdl += x[j+h]*x[j+h];
}```
gets replaced by this:

Code:
```    uint32_t *ww =   (uint32_t *)(w);
uint32_t *wend = (uint32_t *)(w + gg);
uint32_t *xx = (uint32_t *)&(x[128 + h - 4]);
int64_t yhat64 = 0;
do {
// TODO: avoid slow unaligned access when h is odd
uint32_t w1 = *ww++; // bring in 4 values from w[]
uint32_t w2 = *ww++;
uint32_t x1 = *xx++; // bring in 4 values from x[]
uint32_t x2 = *xx++;
xx -= 8;
// x order: w1b, w1t, w2b, w2t
// w order: x2t, x2b, x1t, x1b
} while (ww < wend);
yhat = yhat64 / 32768;```
Ideally, the outer loop would be changed to process 2 samples, so we can avoid the slow case where h is odd.

Before trying more DSP extension optimization, I really believe we should consider running the filter with 1 tap for every 2 audio samples. That should allow it to be run half as often, and require only half as many taps. Presumably echos don't occur on the time scale of 1/44100th of a second, so hopefully this won't affect the algorithm's performance too much?

Reducing the filter this way will also allow the DSP extensions to be much better, because processing 2 samples at once will work nicely with the 32 bit memory bus, to avoid the h = odd case that's much slower than when h is even.

5. I'm not sure that this will be correct for NLMS. I will try to run it and see if it works. thank you for the update. I didnt get a chance to upload the sound sample but hopefully by tonight I will. Thank you again.

6. In theory, that code should perform exactly the same math. Well, except the divide by 32768 is moved outside the loop, so perhaps very slight differences from round off errors. Otherwise, it should be identical.

The DSP extension version just loads 4 values of w[] and 4 values of x[] into packed 32 bit registers. Then it does the exact same math, but using the special instructions which do two 16x16 multiplies and a 64 bit accumulate. Instead of index variables, pointers are used to sweep through the data.

Of course, it's entirely possible I made a mistake. This is utterly untested. But absent any mistakes, this is supposed to be exactly the same math. The code looks very different, and it should run much faster, but the results are supposed to be identical.

7. Hello Paul, Thank you for improvements on the algorithm. I've uploaded the sound sample into github. Hope it will help with testing.

Thanks

8. That file is much too large to fit into Teensy's internal flash.

Can you upload the original WAV file?

9. Sorry about that. Didn't realize it would be so bit. I thought it would fit on SD card. Anyway, i've uploaded original .wav file.

10. I'm listening to the file now. I do not hear any echo.

11. That file contains no echo, but if you feed this file as input for both 'Mic' and 'x', then error should output nothing, or close to nothing. Basically, we are making sure that NLMS is working and it is eliminating the signal. They way i've tested NLMS algorithm on Teensy is with the sketch i've uploaded to github. In that sketch I'm taking signal from microphone and sending it to NLMS as both 'Mic' and 'x'. the error array is the one i'm sending to be played. If NLMS is working then speaking into microphone should not produce any output in the speaker or headphones. Now this set up is one sided and just shows proof of concept that NLMS is working. Next step would be to connect two teensys together, and in that case, 'Mic' will take input from microphone, 'x' will be incomming signal from the other teensy, which will also be played to speaker, and error will be sent to the speaker of the other teensy. Hopefully that makes sense.

12. Oh, I misunderstood the state of this code's development. I assumed it was already known to be working and only in need of optimization. Looks like I need to undo the optimization which removed the redundant input pointer!

13. Ok, I've restored the Mic input pointer.

Perhaps it's much too early to consider applying the DSP extensions. As you can see from the portion I already did, the DSP extensions require complicated code. Normally this is only done after the simpler, normal style C code has been throughly tested on non-realtime data.

14. Originally Posted by snirtikam
Next step would be to connect two teensys together....
I'm afraid I must disagree. That's an insane plan for debugging this code!

The next step would be composing several WAV files with varying types of echo. Then test the NLMS code on all of them, initially by reading the file and writing another file, before the code is optimized enough to run int real time.

Optimization for speed should be attempted only after the code is very throughly tested for correctness!

The next step is most certainly NOT to start running poorly tested code on 2 devices separated by substantial distance and operated by 2 people! That's a certain recipe for a frustration and (probably) never getting this working.

15. I was getting way ahead of myself. I agree, testing between two teensys is the last step. One way to simulate echo would be to put the audio file through a delay by few milliseconds and send that to 'Mic' in NLMS algorithm and the original one to 'x'. This would be a good way to simulate echo.

16. For the sake of testing, I believe you should create several small WAV files with various echoes already encoded into the data. You can use Audacity or any number of other sound editing programs to create these sounds files. Or you could even do it in Matlab and export the data (somehow).

At the very least, you should create some files with multiple copies of the original echoed at different times. At least a couple files where the echoes have been low-pass filtered would be a good plan.

Even better would be to physically use a tightly closed room with hard walls and record real sound with natural room echo. Many inexpensive sound recording devices exist, and if you don't have access to any, you can probably use a modern iPhone or Android to make a real recording with similar or better fidelity than you'll get from inexpensive microphones (eg, on the Teensy audio shield).

Sound clips should be limited to 4 or 8 seconds, so they'll fit into Teensy's internal flash memory, with ulaw encoding at 44 or 22 kHz rate. For testing, the same file can be played over and over. Small files that fit entire in Teensy's memory will make testing simpler, since they'll avoid the large overhead of SPI communication to the SD card or external flash chip.

Recording some good test sounds is relatively easy, compared to developing these algorithms and testing and optimizing the code. So don't skimp on this important step. The better your test clips, the better your odds of getting the code working without unpleasant surprises later. And by unpleasant surprise, I meant a critical flaw that means you have to go back to almost the beginning and redo almost everything!

17. The idea of the NLMS algorithm isn't to remove the echoes and allow the original speech through. It is to eliminate the speech and the echoes caused by the response of the room. This is mainly used for teleconferencing systems.

Consider the following diagram:

Without cancelation the user on the far end hears their voice.

The AEC NLMS algorithm removes this voice.

The NLMS algorithm is a basically adapting its coefficients by taking steps the size of mu until it minimizes the error. You want the output of the system e[n] to be 0. This means that the error of the difference between the coefficients of your adaptive filter and the signal at the mic is 0.

That being said, I think recording speech data from a typical room and using it to see if the algorithm is working is a good idea. The results we would be looking for from this experiment is what snirtikam described before. We wouldn't want to hear any speech from the wav file in the headphones.

https://en.wikipedia.org/wiki/Echo_s...d_cancellation

https://en.wikipedia.org/wiki/Least_mean_squares_filter

18. We have an example-wavefile here (dl the zip in the first post):