Audio Timestamp

janbor · Feb 23, 2015

@Janbor, regarding your question about the code: I might prepare an example, but not before the weekend

Wow. I would appreciate that very much.

kpc · Feb 23, 2015

@Janbor: Just to make sure: you want the auto-correlation and not the cross-correlation? The link you gave talks mostly about cross-correlation.
With the cross you can determine the delay between two signals.
With auto you can determine if there is a delayed signal inside the input, eg. an echo.

janbor · Feb 23, 2015

Indeed I was just realizing my confusion. I do indeed need the cross-correlation.

kpc · Feb 25, 2015

As promised, some code to do the cross-correlation
Without overclocking, for 1024 samples it takes 5 ms at 72 MHz.
In comparisson, collecting 1024 samples at 44.1 kHz takes 23 ms, so it can easily process the data real-time.

The code is not heavily optimized (I did not even bother looking at the generated assembly). Since you needed cross-cor, three FFT's are needed. I just combined two of them. Due to bit growth, I used the q31 version for the inverse fft. If you know the properties of the signal, you might apply some scaling to be able to use the q15 fft, this way also the fft and ifft buffers can be combined.

The code calculates the cross-cor by:
R = ifft(conj(fft(s1)/FFTSIZE) * fft(s2)/FFTSIZE);

Code:

/* This code is free to use */
/* Kire Pudsje, Feb. 2015 */

#include "arm_math.h"

#define FFTSIZE 1024

int16_t s1[FFTSIZE], s2[FFTSIZE];
int16_t fft_buf[FFTSIZE * 2];
int32_t ifft_buf[FFTSIZE * 2];
arm_cfft_radix4_instance_q15 fft_inst;
arm_cfft_radix4_instance_q31 ifft_inst;

void setup() {
  Serial.begin(9600);
  arm_cfft_radix4_init_q15(&fft_inst, FFTSIZE, 0, 1);
  arm_cfft_radix4_init_q31(&ifft_inst, FFTSIZE, 1, 1);
}

void loop() {
  // generate two signals with random data, s2 is delayed s1
  unsigned offset = random(FFTSIZE / 4);
  for (unsigned i = 0; i < FFTSIZE + offset; i++) {
    int16_t val = ((int16_t) random(65536));
    if (i < FFTSIZE)
      s1[i] = val;
    if (i > offset)
      s2[i - offset] = val;
  }

  // Copy samples interleaved to FFT buffer s1 in real part, s2 in complex part
  // ie. fftbuf = s1 + j * s2
  for (int i = 0; i < FFTSIZE; i++)
    ((int32_t *)fft_buf)[i] = (s2[i] << 16) | s1[i];
    
  elapsedMicros us; // start timing

  // Perform FFT
  arm_cfft_radix4_q15(&fft_inst, fft_buf);
    
  // Untangle fft of both signals and calculate conj(fft(s1)) * fft(s2)
  // Treat DC point special
  int16_t fs1_re, fs1_im, fs2_re, fs2_im;
  fs1_re = fft_buf[0]; // fs1_im = 0
  fs2_re = fft_buf[0]; // fs2_im = 0
  ifft_buf[0] = fs1_re * fs2_re;
  ifft_buf[1] = 0;
  // Then handle most other points
  int16_t *fft_fwd = fft_buf + 2;
  int16_t *fft_rev = fft_buf + 2 * (FFTSIZE - 1);
  for (size_t i = 2; i < FFTSIZE; i += 2, fft_fwd += 2, fft_rev -= 2) {
    fs1_re = (fft_fwd[0] + fft_rev[0]) >> 1;
    fs1_im = (fft_fwd[1] - fft_rev[1]) >> 1;
    fs2_re = (fft_rev[1] + fft_fwd[1]) >> 1;
    fs2_im = (fft_rev[0] - fft_fwd[0]) >> 1;
    // calculate conj(fft(s1)) * fft(s2)
    int32_t cfs1_fs2_re = (fs1_re * fs2_re + fs1_im * fs2_im);
    int32_t cfs1_fs2_im = (fs1_re * fs2_im - fs1_im * fs2_re);
    ifft_buf[i] = ifft_buf[2 * FFTSIZE - i] = cfs1_fs2_re;
    ifft_buf[i + 1] = cfs1_fs2_im;
    ifft_buf[2 * FFTSIZE - i + 1] = -cfs1_fs2_im;
  }
  // Finally handle Nyquist point
  fs1_re = fft_fwd[0]; // fs1_im = 0
  fs2_re = fft_fwd[1]; // fs2_im = 0
  ifft_buf[FFTSIZE] = fs1_re * fs2_re;
  ifft_buf[FFTSIZE + 1] = 0;

  // Perform inverse FFT
  arm_cfft_radix4_q31(&ifft_inst, ifft_buf);

  uint32_t duration = us; // end of timing
    
  // The output data is in every second element of ifft_buf

  // As an example, find the cross-cor peak
  int peak_offset = 0;
  int32_t peak_value = ifft_buf[0];
  for (size_t i = 1, j = 2; i < FFTSIZE; i++, j += 2) {
    if (ifft_buf[j] > peak_value) {
      peak_value = ifft_buf[j];
      peak_offset = i;
    }
  }
  if (peak_offset > FFTSIZE / 2)
    peak_offset -= FFTSIZE;

  Serial.print("Generated offset:  ");
  Serial.println(offset);  
  Serial.print("Calculated offset: ");
  Serial.println(-peak_offset); // delay implies negative
  Serial.print("It took ");
  Serial.print(duration);
  Serial.println(" us\n");
  
  delay(500);
}

Edit: fixed a minor problem with the Nyquist point

janbor · Feb 25, 2015

This is awesome. I'll get right on trying to implement/understand it. I'll let you know how it works. Thanks.

MacroMachines · Feb 26, 2015

WMXZ said:
I run a T3.1 with 144 MHz and can easily do four 256-point FFTs on 200 kHz sampling rate (processing time 0.72 ms within 1.28 ms available timeframe).

How do you over clock the teensy to 144mhz? I only have the option of 96 as the highest rate.

kpc · Feb 26, 2015

Uncomment the lines in
arduino-1.0.6/hardware/teensy/boards.txt
Beware, be warned, etc. You know the drill.

WMXZ · Feb 26, 2015

MacroMachines said:
How do you over clock the teensy to 144mhz? I only have the option of 96 as the highest rate.

I'm running a makefile based system, not based on arduino IDE, so I directly specify F_CPU.

Otherwise kpc's comment is the arduino-way of doing it.

MacroMachines · Feb 26, 2015

Is the CPU clock rate some sort of PLL setting?

Have you noticed anything glitchy or dangerous overheating from clocking it at such a high frequency? Does the USB still work for uploading, but just not for midi/serial/mouse/keyboard output?

Where can I learn more about your make-file method? I'm trying to grow out of the arduino IDE.. Looked into Xcode but haven't gotten that functioning yet.

PaulStoffregen · Feb 26, 2015

Wow, an impressive number of questions packed into such a short message!

MacroMachines said:
Is the CPU clock rate some sort of PLL setting?

Yes, a PLL turns the 16 MHz crystal-based signals into the various clock rates.

Have you noticed anything glitchy or dangerous overheating from clocking it at such a high frequency?

Several people have reported crashes or lockups or other wrong behavior at 144 & 168.

120 seems pretty solid.

168 rarely works.

Does the USB still work for uploading,

Yes, the USB still works when the chip runs at 144. Well, assuming it runs successfully at all.

Also, during uploading, the Mini54 takes control. It reconfigures the clock to 24 MHz during the upload, no matter what speed you had the chip running before the Mini54 grabs control.

but just not for midi/serial/mouse/keyboard output?

That stuff all works too.

Where can I learn more about your make-file method? I'm trying to grow out of the arduino IDE.. Looked into Xcode but haven't gotten that functioning yet.

There are tons of resources online and in books. But make is complicated, and the syntax of makefiles is arcane.

You might also try turning on verbose info while compiling, in File > Preferences, if you haven't already. That will give you visibility of the exact compiler commands Arduino is running for you.

PaulStoffregen · Feb 26, 2015

It's also probably worth mentioning that the CPU and flash memory have separate clocks. The flash isn't rated to run faster than 25 MHz.

Most of the speed settings clock the flash at 24 MHz, which is generally able to support about 96 MHz CPU speed, because the flash is wider than the ARM bus and has a small cache. As you clock the CPU core faster, flash-based code has only diminishing returns in speed, due to waiting on the flash.

You can use FASTRUN on functions to put them into RAM, which always clocks at the CPU speed. If you're trying to make something run much faster by overclocking, you'll probably also need FASTRUN to really reap much benefit from those higher overclock speeds.

WMXZ · Feb 27, 2015

PaulStoffregen said:
It's also probably worth mentioning that the CPU and flash memory have separate clocks. The flash isn't rated to run faster than 25 MHz.

Most of the speed settings clock the flash at 24 MHz, which is generally able to support about 96 MHz CPU speed, because the flash is wider than the ARM bus and has a small cache. As you clock the CPU core faster, flash-based code has only diminishing returns in speed, due to waiting on the flash.

You can use FASTRUN on functions to put them into RAM, which always clocks at the CPU speed. If you're trying to make something run much faster by overclocking, you'll probably also need FASTRUN to really reap much benefit from those higher overclock speeds.

I would like to add, that speeding programs up, needs at least the will to look into the code and to modify it.

For example, if you look into teensy\cores\teensy3\arm_common_tables.h you will notice that the FFT twiddle coefficients are declared as constant. While this is logically correct, the compiler puts the table into Flash, but in order to speed up FFT processing, the table is better placed into RAM. This can be achieved by removing the 'constant' keyword (I'm using pure C).
Also the whole FFT code should be placed into RAM for fastest execution. FASTRUN is then the approach to take. BUT this way, you are starting to customize core and third party routines (CMSIS), with the consequence that any future update of these routines must be also modified.

Tuning the code for fastest execution does not stop here, if you have a lot of computation, as in Paul's audio library, you may also try to use dsp instructions that give you the ultimate racing engine.

PaulStoffregen · Feb 27, 2015

Those DSP instructions can be tough to use to good optimization effect. You have to carefully plan register usage. The biggest speed gains come when you can bring 4, 8 or 16 packed samples entire into packed registers (using the pipeline bus optimization of Cortex-M4 when multiple loads or stores are done in a row) and operate on them as a block with fully unrolled code. It takes a lot of anticipating how the compiler will allocate and manage registers, which ultimately takes a lot of compiling and looking at the generated code.

janbor · Mar 2, 2015

This is a really good piece of code. At 144 MHz it takes 3 ms to complete a 1024 cross correlation and peak identification. NICE.

randomrichard · Apr 30, 2015

Thanks for pointing me to the code in your note, above. I am trying to use it to extract chirp signal spikes embedded in noise as done in sonar and radar by using correlation. I had been using VB on a PC but your ARM code is much faster with the Teensy 3.1. I am trying your software by inputting a 1024 line padded "antichirp" and a 1024 line synthetic signal plus noise file as in the code below and using an SD card:

void setup() {
//SD
Sd2Card card; //I don't know what this does - it presumably instantiates "card"
boolean status;

// wait for the Arduino Serial Monitor to open before starting
while (!Serial) ;
delay(50);

// Configure SPI for the audio shield pins
SPI.setMOSI(7); // Audio shield has MOSI on pin 7
SPI.setSCK(14); // Audio shield has SCK on pin 14

//FFT
Serial.begin(57600);
arm_cfft_radix4_init_q15(&fft_inst, FFTSIZE, 0, 1);
arm_cfft_radix4_init_q31(&ifft_inst, FFTSIZE, 1, 1);
//SD
// Now open the SD card normally
status = SD.begin(10); // Audio shield has SD card CS on pin 10
if (status) {
Serial.println("SD library is able to access the filesystem");
} else {
Serial.println("SD library can not access the filesystem!");
}
//Now write data to arrays
File aFile = SD.open("achss.txt");
// if the file is available, read it:
while (aFile.available())
{
for(unsigned i = 0; i<FFTSIZE; i++){
s1 = int16_t(Serial.write(aFile.read()));
}
aFile.close();
}

File bFile = SD.open("snss.txt");

while (bFile.available())
{
for(unsigned i = 0; i<FFTSIZE; i++){
s2 = int16_t(Serial.write(bFile.read()));
}
bFile.close();
}

// generate two signals with random data, s2 is delayed s1
// unsigned offset = random(FFTSIZE / 4);
//for (unsigned i = 0; i < FFTSIZE + offset; i++) {
// int16_t val = ((int16_t) random(65536));
// if (i < FFTSIZE)
// s1 = val;
// if (i > offset)
// s2[i - offset] = val;
//}

However, the FFT code is unfamiliar and I have been unable to read the iFFT output from the correlation of the two files properly. Instead I get a 292 (should be 1024?) line file of zeroes (32768) with one big spike at line 147 (wrong place!). Assuming my file input code works can you advise me how to extract all the iFFT output to the serial port, just once, for the single test example?

kpc · Apr 30, 2015

However, the FFT code is unfamiliar

I got involved in this thread with the discussion about optimizing the autocorrelation (which turned out to be cross-correlation). My point was that you can use one complex FFT for two real FFT's simultaneously. This is probably why my code looks unfamiliar.

It just does this in "matlab-speak"
R = ifft(conj(fft(s1)/FFTSIZE) .* fft(s2)/FFTSIZE);
The division by FFTSIZE is due to the CMSIS implementation.

I might be wrong, but this code is outputting the read data and then fills the array with the number of byes send.

Code:

s1[i] = int16_t(Serial.write(aFile.read()));

Furthermore, If you want the correlation with "normal" FFT's, you need to put the data in every second element of s1. Every odd value should be zero (assuming you have real valued data), likewise for s2.

Finally, assuming you have 8-bit data, you better place it at the high end (ie data << 8), to keep away from the quantization noise.

randomrichard · May 2, 2015

Thanks. I found necessary parseInt() function for inputting data into the s1[] array in Paul's communication "Re: Reading Values from SDHC card into an Array" of Feb 9th 2014. Now for some results...

Audio Timestamp

janbor

Active member

kpc

Well-known member

janbor

Active member

kpc

Well-known member

janbor

Active member

MacroMachines

Well-known member

kpc

Well-known member

WMXZ

Well-known member

MacroMachines

Well-known member

PaulStoffregen

Well-known member

PaulStoffregen

Well-known member

WMXZ

Well-known member

PaulStoffregen

Well-known member

janbor

Active member

randomrichard

Active member

kpc

Well-known member

randomrichard

Active member