Fast Convolution Filtering with Teensy 4.0 and audio board

Status
Not open for further replies.
I am wondering why did you use LRCLK and BCLK resistors for ADC only? I'm resetting up the same thing with T4 (had it with 3.6). Would it make sense to have the resistors on all three LRCLK, BCLK and MCLK for both ADC and DAC?
I never had crackling noise with DAC alone, the ADC was the problematic part, so my thought was to add 100 Ohm resistors to all ADC clock lines (and to the master clock for DAC). I did not check systematically however. I just installed four resistors and it worked ;-).

You should just try out what is best for your setup.
 
I was wondering if it's needed to duplicate the resistor for ADC and DAC. My current thinking is to try something like:

Code:
                                  __ ADC LRC
                                 /
  T4 LRCLK (20) --- 100 Ohm ---<
                                 \__ DAC LCK


                                  __ ADC BCK
                                 /
  T4 BCLK (21)  --- 100 Ohm ---<
                                 \__ DAC BCK


                                  __ ADC SCK
                                 /
  T4 MCLK (23)  --- 100 Ohm ---<
                                 \__ DAC SCL

Do you think it makes sense/should work?
 
Possibly!

Would be interested to hear if it works and please post a photograph of your setup.
 
It seems to work. Thanks! Attaching a couple of photos. I've also used sockets for everything, so that I don't need to resolder if I nuke something.

IMG_4883.jpgIMG_4882.jpg
 
The Teensy 4.0 seems to be able to perform a real-time two channel FIR lowpass filter with up to 21632 taps for audio I2S signals . . .

See this code below which filters I2S audio (Stereo: left & right) simultaneously with such a lowpass filter and uses 65% of the processor power and uses the following RAM ressources:

Code:
FlexRAM-Banks: [DDDDDDDDDDDDDDDI] ITCM: 32 KB, DTCM: 480 KB, OCRAM: 0(+512) KB
MEM (static usage):
RAM1:
ITCM = FASTRUN:      30672   93.60%  of  32kb     (2096 Bytes free)
DTCM = Variables:    373440   75.98%  of  480kb     (118080 Bytes free)
RAM2:
OCRAM = DMAMEM:      522304   99.62%  of  512kb     (1984 Bytes free)
FLASH:               44304   2.11%  of  2048kb    (2052848 Bytes free)
processor load:  64.56%
1720.00 microsec for 1  stereo blocks    FFT-length = 256;   FIR filter length = 21632

Code:
/* (c) Frank DD4WH 2019-10-31
 *  
 * Real Time PARTITIONED BLOCK CONVOLUTION FILTERING (Stereo)
 * 
 * FIR filtering fun with the Teensy 4.0 with up to 21632 filter taps in each channel
 * 
 * implements a lowpass filter for each channel at a user-defined lowpass cutoff frequency
 * 
 * uses Teensy 4.0 and external ADC / DAC connected on perf board with ground plane
 * but should run also with the PJRC Teensy Audio board version D (compatible with Teensy 4.0)
 * 
 * this sketch does only run with the Teensy 4.0, do not use with Teensy 3.x
 * 
 *  PCM5102A DAC module
    VCC = Vin
    3.3v = NC
    GND = GND
    FLT = GND
    SCL = 23 / MCLK via series 100 Ohm
    BCK = BCLK (21)
    DIN = TX (7)
    LCK = LRCLK (20)
    FMT = GND
    XMT = 3.3V (HIGH)
    
    PCM1808 ADC module:    
    FMT = GND
    MD1 = GND
    MD0 = GND
    GND = GND
    3.3V = 3.3V --> ADC needs both: 5V AND 3V3
    5V = VIN
    BCK = BCLK (21) via series 100 Ohm
    OUT = RX (8)
    LRC = LRCLK (20) via series 100 Ohm
    SCK = MCLK (23) via series 100 Ohm
    GND = GND
    3.3V = 3.3V
 * 
 * uses code from wdsp library by Warren Pratt
 * https://github.com/g0orx/wdsp/blob/master/firmin.c
 * 
 ********************************************************************************
   GNU GPL LICENSE v3

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>
 ********************************************************************************/

#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <arm_math.h>
#include <arm_const_structs.h> // in the Teensy 4.0 audio library, the ARM CMSIS DSP lib is already a newer version
#include <utility/imxrt_hw.h> // necessary for setting the sample rate, thanks FrankB !

extern "C" uint32_t set_arm_clock(uint32_t frequency);

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
// USER DEFINES
//uncomment for pass-thru
//#define PASSTHRU_AUDIO
// define your frequency for the lowpass filter
const double PROGMEM FHiCut = 22000.0; // for the (young) audiophile ;-)
const double PROGMEM FLoCut = -FHiCut;
const float32_t PROGMEM audio_gain = 6.5;
// define your sample rate
const double PROGMEM SAMPLE_RATE = 48000;  
// define the number of FIR taps of your filter
//const int nc = 384;
//const int nc = 128;
//const int nc = 256;
//const int nc = 512; // number of taps for the FIR filter
//const int nc = 1024; // number of taps for the FIR filter
//const int nc = 2048; // number of taps for the FIR filter
//const int nc = 4096; // number of taps for the FIR filter
//const int nc = 8192; // number of taps for the FIR filter
//const int nc = 16384; // number of taps for the FIR filter
const int PROGMEM nc = 21632; // MAXIMUM number of taps for the FIR filter --> memory constraint, not processor speed
// the latency of the filter is meant to be the same regardless of the number of taps for the filter
// partition size of 128 translates to a latency of 128/sample rate, ie. to 2.9msec with 44.1ksps
// however, measurements by bmillier contradict this . . .
// I performed changes in the code accordingly, so we need more measurements on the real-time latency of this code, I am unable to perform those . . .
const int PROGMEM partitionsize = 128; 
///////////////////////////////////////////////////////////////////////////////////////////////////////////////

#define DEBUG
#define FOURPI  (2.0 * TWO_PI)
#define SIXPI   (3.0 * TWO_PI)
#define BUFFER_SIZE 128
int32_t sum;
int idx_t = 0;
int16_t *sp_L;
int16_t *sp_R;
uint8_t PROGMEM FIR_filter_window = 1;
const uint32_t PROGMEM FFT_L = 256; 
float32_t mean = 1;
uint8_t first_block = 1; 
const uint32_t PROGMEM FFT_length = FFT_L;
const int PROGMEM nfor = nc / partitionsize; // number of partition blocks --> nfor = nc / partitionsize
float DMAMEM cplxcoeffs[nc * 2]; // this holds the initial complex coefficients for the filter BEFORE partitioning
float32_t  maskgen[FFT_L * 2];
float32_t fmask[nfor][FFT_L * 2]; // 
float32_t  fftin[FFT_L * 2];
float DMAMEM fftout[nfor][FFT_L * 2]; // 
float32_t  accum[FFT_L * 2];

int buffidx = 0;
int k = 0;

const uint32_t N_B = FFT_L / 2 / BUFFER_SIZE;
uint32_t N_BLOCKS = N_B;
float32_t float_buffer_L [BUFFER_SIZE * N_B];  // 
float32_t float_buffer_R [BUFFER_SIZE * N_B]; 
float32_t last_sample_buffer_L [BUFFER_SIZE * N_B];  
float32_t last_sample_buffer_R [BUFFER_SIZE * N_B]; 
// complex FFT with the new library CMSIS V4.5
const static arm_cfft_instance_f32 *S;
// complex iFFT with the new library CMSIS V4.5
const static arm_cfft_instance_f32 *iS;
// FFT instance for direct calculation of the filter mask
// from the impulse response of the FIR - the coefficients
const static arm_cfft_instance_f32 *maskS;

// this audio comes from the codec by I2S
AudioInputI2S            i2s_in;
AudioRecordQueue         Q_in_L;
AudioRecordQueue         Q_in_R;
AudioMixer4              mixleft;
AudioMixer4              mixright;
AudioPlayQueue           Q_out_L;
AudioPlayQueue           Q_out_R;
AudioOutputI2S           i2s_out;

AudioConnection          patchCord1(i2s_in, 1, Q_in_L, 0);
AudioConnection          patchCord2(i2s_in, 0, Q_in_R, 0);
AudioConnection          patchCord3(Q_out_L, 0, mixleft, 0);
AudioConnection          patchCord4(Q_out_R, 0, mixright, 0);
AudioConnection          patchCord9(mixleft, 0,  i2s_out, 1);
AudioConnection          patchCord10(mixright, 0, i2s_out, 0);
AudioControlSGTL5000     audio_codec;


void setup() {
  Serial.begin(115200);
  while(!Serial);

  AudioMemory(10); 
  delay(100);

  set_arm_clock(600000000);

  /****************************************************************************************
     Audio Setup
  ****************************************************************************************/
  // Enable the audio shield (if present !). select input, and enable output
  audio_codec.enable();
  audio_codec.inputSelect(AUDIO_INPUT_LINEIN);
  audio_codec.adcHighPassFilterDisable();
  audio_codec.volume(0.5); 

  mixleft.gain (0, 1.0);
  mixright.gain(0, 1.0);

  setI2SFreq(SAMPLE_RATE);

  /****************************************************************************************
     set filter bandwidth
  ****************************************************************************************/
  // this routine does all the magic of calculating the FIR coeffs
  fir_bandpass (cplxcoeffs, nc, FLoCut, FHiCut, SAMPLE_RATE, 0, 1, 1.0);

  /****************************************************************************************
     init complex FFTs
  ****************************************************************************************/
  switch (FFT_length)
  {
    case 256:
      S = &arm_cfft_sR_f32_len256;
      iS = &arm_cfft_sR_f32_len256;
      maskS = &arm_cfft_sR_f32_len256;
      break;
  }

  /****************************************************************************************
     Calculate the FFT of the FIR filter coefficients once to produce the FIR filter mask
  ****************************************************************************************/
    init_partitioned_filter_masks();

  /****************************************************************************************
     display info on memory usage
  ****************************************************************************************/
    flexRamInfo();
    
  /****************************************************************************************
     begin to queue the audio from the audio library
  ****************************************************************************************/
  delay(100);
  Q_in_L.begin();
  Q_in_R.begin();

} // END OF SETUP

void loop() {
  elapsedMicros usec = 0;
  // are there at least N_BLOCKS buffers in each channel available ?
    if (Q_in_L.available() > N_BLOCKS + 0 && Q_in_R.available() > N_BLOCKS + 0)
    {

      // get audio samples from the audio  buffers and convert them to float
      for (unsigned i = 0; i < N_BLOCKS; i++)
      {
        sp_L = Q_in_L.readBuffer();
        sp_R = Q_in_R.readBuffer();

        // convert to float one buffer_size
        // float_buffer samples are now standardized from > -1.0 to < 1.0
        arm_q15_to_float (sp_L, &float_buffer_L[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
        arm_q15_to_float (sp_R, &float_buffer_R[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
        Q_in_L.freeBuffer();
        Q_in_R.freeBuffer();
      }
 
      /**********************************************************************************
          Fast convolution filtering == FIR filter in the frequency domain
       **********************************************************************************/
      //  basis for this was Lyons, R. (2011): Understanding Digital Processing.
      //  "Fast FIR Filtering using the FFT", pages 688 - 694
      //  Method used here: overlap-and-save

      // ONLY FOR the VERY FIRST FFT: fill first samples with zeros
      if (first_block) // fill real & imaginaries with zeros for the first BLOCKSIZE samples
      {
        for (unsigned i = 0; i < partitionsize * 4; i++)
        {
          fftin[i] = 0.0;
        }
        first_block = 0;
      }
      else
      {  // HERE IT STARTS for all other instances
        // fill FFT_buffer with last events audio samples
        for (unsigned i = 0; i < partitionsize; i++)
        {
          fftin[i * 2] = last_sample_buffer_L[i]; // real
          fftin[i * 2 + 1] = last_sample_buffer_R[i]; // imaginary
        }
      }
      // copy recent samples to last_sample_buffer for next time!
      for (unsigned i = 0; i < partitionsize; i++)
      {
        last_sample_buffer_L [i] = float_buffer_L[i];
        last_sample_buffer_R [i] = float_buffer_R[i];
      }

      // now fill recent audio samples into FFT_buffer (left channel: re, right channel: im)
      for (unsigned i = 0; i < partitionsize; i++)
      {
        fftin[FFT_length + i * 2] = float_buffer_L[i]; // real
        fftin[FFT_length + i * 2 + 1] = float_buffer_R[i]; // imaginary
      }
      
      /**********************************************************************************
          Complex Forward FFT
       **********************************************************************************/
      // calculation is performed in-place the FFT_buffer [re, im, re, im, re, im . . .]
      arm_cfft_f32(S, fftin, 0, 1);
      for(unsigned i = 0; i < partitionsize * 4; i++)
      {
          fftout[buffidx][i] = fftin[i];
      }

      /**********************************************************************************
          Complex multiplication with filter mask (precalculated coefficients subjected to an FFT)
          this is taken from wdsp library by Warren Pratt firmin.c
       **********************************************************************************/
      k = buffidx;

      for(unsigned i = 0; i < partitionsize * 4; i++)
      {
          accum[i] = 0.0;
      }
      
      for(unsigned j = 0; j < nfor; j++)
      { 
          for(unsigned i = 0; i < 2 * partitionsize; i++)
          {
              accum[2 * i + 0] += fftout[k][2 * i + 0] * fmask[j][2 * i + 0] -
                                  fftout[k][2 * i + 1] * fmask[j][2 * i + 1];
              
              accum[2 * i + 1] += fftout[k][2 * i + 0] * fmask[j][2 * i + 1] +
                                  fftout[k][2 * i + 1] * fmask[j][2 * i + 0]; 
          }
          k = k - 1;
          if(k < 0)
          {
            k = nfor - 1;
          } 
      } // end nfor loop

      buffidx = buffidx + 1;
      if(buffidx >= nfor)
      {
          buffidx = 0;    
      } 
      /**********************************************************************************
          Complex inverse FFT
       **********************************************************************************/
      arm_cfft_f32(iS, accum, 1, 1);

      /**********************************************************************************
          Overlap and save algorithm, which simply means yóu take only the right part of the buffer and discard the left part
       **********************************************************************************/
      // somehow it seems it is reversed: I discard the right part and take the left part . . .
        for (unsigned i = 0; i < partitionsize; i++)
        {
          //float_buffer_L[i] = accum[partitionsize * 2 + i * 2 + 0];
          //float_buffer_R[i] = accum[partitionsize * 2 + i * 2 + 1];
          float_buffer_L[i] = accum[i * 2 + 0] * audio_gain;
          float_buffer_R[i] = accum[i * 2 + 1] * audio_gain;
        }

       /**********************************************************************
          CONVERT TO INTEGER AND PLAY AUDIO - Push audio into I2S audio chain
       **********************************************************************/
      for (int i = 0; i < N_BLOCKS; i++)
        {
          sp_L = Q_out_L.getBuffer();    
          sp_R = Q_out_R.getBuffer();
          arm_float_to_q15 (&float_buffer_L[BUFFER_SIZE * i], sp_L, BUFFER_SIZE); 
          arm_float_to_q15 (&float_buffer_R[BUFFER_SIZE * i], sp_R, BUFFER_SIZE);
          Q_out_L.playBuffer(); // play it !  
          Q_out_R.playBuffer(); // play it !
        }

       /**********************************************************************************
          PRINT ROUTINE FOR ELAPSED MICROSECONDS
       **********************************************************************************/
#ifdef DEBUG
      sum = sum + usec;
      idx_t++;
      if (idx_t > 40) {
        mean = sum / idx_t;
        if (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT < 100.0)
        {
          Serial.print("processor load:  ");
          Serial.print (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT);
          Serial.println("%");
        }
        else
        {
          Serial.println("100%");
        }
        Serial.print (mean);
        Serial.print (" microsec for ");
        Serial.print (N_BLOCKS);
        Serial.print ("  stereo blocks    ");
        Serial.print("FFT-length = "); Serial.print(FFT_length);
        Serial.print(";   FIR filter length = "); Serial.println(nc);
        Serial.print("k = "); Serial.println(k);
        Serial.print("buffidx = "); Serial.println(buffidx);
        idx_t = 0;
        sum = 0;
      }
#endif
    } // end of audio process loop
} // end main loop

void init_partitioned_filter_masks()
{
#ifndef PASSTHRU_AUDIO
    for(unsigned j = 0; j < nfor;j++)
    {
      // fill with zeroes
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          maskgen[i] = 0.0;  
      }
      // take part of impulse response and fill into maskgen
      for (unsigned i = 0; i < partitionsize * 2; i++)
      {
          maskgen[i + partitionsize * 2] = cplxcoeffs[i + j * partitionsize * 2];  
      }
      // perform complex FFT on maskgen
      arm_cfft_f32(maskS, maskgen, 0, 1);
      // fill into fmask array
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          fmask[j][i] = maskgen[i];  
      }    
    }
    
#else
// passthru
    
    for(unsigned j = 0; j < nfor;j++)
    {
      // fill with zeroes
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          maskgen[i] = 0.0;  
      }
      if(j==0) maskgen[0] = 1.0; 
      arm_cfft_f32(maskS, maskgen, 0, 1);      
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          fmask[j][i] = maskgen[i];  
      } 
    }
#endif
}

// taken from wdsp library by Warren Pratt
void fir_bandpass (float * impulse, int N, double f_low, double f_high, double samplerate, int wintype, int rtype, double scale)
{
  double ft = (f_high - f_low) / (2.0 * samplerate);
  double ft_rad = TWO_PI * ft;
  double w_osc = PI * (f_high + f_low) / samplerate;
  int i, j;
  double m = 0.5 * (double)(N - 1);
  double delta = PI / m;
  double cosphi;
  double posi, posj;
  double sinc, window, coef;

  if (N & 1)
  {
    switch (rtype)
    {
    case 0:
      impulse[N >> 1] = scale * 2.0 * ft;
      break;
    case 1:
      impulse[N - 1] = scale * 2.0 * ft;
      impulse[  N  ] = 0.0;
      break;
    }
  }
  for (i = (N + 1) / 2, j = N / 2 - 1; i < N; i++, j--)
  {
    posi = (double)i - m;
    posj = (double)j - m;
    sinc = sin (ft_rad * posi) / (PI * posi);
    switch (wintype)
    {
    case 0: // Blackman-Harris 4-term
      cosphi = cos (delta * i);
      window  =             + 0.21747
          + cosphi *  ( - 0.45325
          + cosphi *  ( + 0.28256
          + cosphi *  ( - 0.04672 )));
      break;
    case 1: // Blackman-Harris 7-term
      cosphi = cos (delta * i);
      window  =       + 6.3964424114390378e-02
          + cosphi *  ( - 2.3993864599352804e-01
          + cosphi *  ( + 3.5015956323820469e-01
          + cosphi *  ( - 2.4774111897080783e-01
          + cosphi *  ( + 8.5438256055858031e-02
          + cosphi *  ( - 1.2320203369293225e-02
          + cosphi *  ( + 4.3778825791773474e-04 ))))));
      break;
    }
    coef = scale * sinc * window;
    switch (rtype)
    {
    case 0:
      impulse[i] = + coef * cos (posi * w_osc);
      impulse[j] = + coef * cos (posj * w_osc);
      break;
    case 1:
      impulse[2 * i + 0] = + coef * cos (posi * w_osc);
      impulse[2 * i + 1] = - coef * sin (posi * w_osc);
      impulse[2 * j + 0] = + coef * cos (posj * w_osc);
      impulse[2 * j + 1] = - coef * sin (posj * w_osc);
      break;
    }
  }
}

void setI2SFreq(int freq) {
  // thanks FrankB !
  // PLL between 27*24 = 648MHz und 54*24=1296MHz
  int n1 = 4; //SAI prescaler 4 => (n1*n2) = multiple of 4
  int n2 = 1 + (24000000 * 27) / (freq * 256 * n1);
  double C = ((double)freq * 256 * n1 * n2) / 24000000;
  int c0 = C;
  int c2 = 10000;
  int c1 = C * c2 - (c0 * c2);
  set_audioClock(c0, c1, c2, true);
  CCM_CS1CDR = (CCM_CS1CDR & ~(CCM_CS1CDR_SAI1_CLK_PRED_MASK | CCM_CS1CDR_SAI1_CLK_PODF_MASK))
       | CCM_CS1CDR_SAI1_CLK_PRED(n1-1) // &0x07
       | CCM_CS1CDR_SAI1_CLK_PODF(n2-1); // &0x3f 
}

void flexRamInfo(void)
{ // credit to FrankB, KurtE and defragster !
#if defined(__IMXRT1052__) || defined(__IMXRT1062__)
  int itcm = 0;
  int dtcm = 0;
  int ocram = 0;
  Serial.print("FlexRAM-Banks: [");
  for (int i = 15; i >= 0; i--) {
    switch ((IOMUXC_GPR_GPR17 >> (i * 2)) & 0b11) {
      case 0b00: Serial.print("."); break;
      case 0b01: Serial.print("O"); ocram++; break;
      case 0b10: Serial.print("D"); dtcm++; break;
      case 0b11: Serial.print("I"); itcm++; break;
    }
  }
  Serial.print("] ITCM: ");
  Serial.print(itcm * 32);
  Serial.print(" KB, DTCM: ");
  Serial.print(dtcm * 32);
  Serial.print(" KB, OCRAM: ");
  Serial.print(ocram * 32);
#if defined(__IMXRT1062__)
  Serial.print("(+512)");
#endif
  Serial.println(" KB");
  extern unsigned long _stext;
  extern unsigned long _etext;
  extern unsigned long _sdata;
  extern unsigned long _ebss;
  extern unsigned long _flashimagelen;
  extern unsigned long _heap_start;

  Serial.println("MEM (static usage):");
  Serial.println("RAM1:");

  Serial.print("ITCM = FASTRUN:      ");
  Serial.print((unsigned)&_etext - (unsigned)&_stext);
  Serial.print("   "); Serial.print((float)((unsigned)&_etext - (unsigned)&_stext) / ((float)itcm * 32768.0) * 100.0);
  Serial.print("%  of  "); Serial.print(itcm * 32); Serial.print("kb   ");
  Serial.print("  (");
  Serial.print(itcm * 32768 - ((unsigned)&_etext - (unsigned)&_stext));
  Serial.println(" Bytes free)");
 
  Serial.print("DTCM = Variables:    ");
  Serial.print((unsigned)&_ebss - (unsigned)&_sdata);
  Serial.print("   "); Serial.print((float)((unsigned)&_ebss - (unsigned)&_sdata) / ((float)dtcm * 32768.0) * 100.0);
  Serial.print("%  of  "); Serial.print(dtcm * 32); Serial.print("kb   ");
  Serial.print("  (");
  Serial.print(dtcm * 32768 - ((unsigned)&_ebss - (unsigned)&_sdata));
  Serial.println(" Bytes free)");

  Serial.println("RAM2:");
  Serial.print("OCRAM = DMAMEM:      ");
  Serial.print((unsigned)&_heap_start - 0x20200000);
  Serial.print("   "); Serial.print((float)((unsigned)&_heap_start - 0x20200000) / ((float)512 * 1024.0) * 100.0);
  Serial.print("%  of  "); Serial.print(512); Serial.print("kb");
  Serial.print("     (");
  Serial.print(512 * 1024 - ((unsigned)&_heap_start - 0x20200000));
  Serial.println(" Bytes free)");

  Serial.print("FLASH:               ");
  Serial.print((unsigned)&_flashimagelen);
  Serial.print("   "); Serial.print(((unsigned)&_flashimagelen) / (2048.0 * 1024.0) * 100.0);
  Serial.print("%  of  "); Serial.print(2048); Serial.print("kb");
  Serial.print("    (");
  Serial.print(2048 * 1024 - ((unsigned)&_flashimagelen));
  Serial.println(" Bytes free)");
  
#endif
}

@Brian: it would be very interesting, whether the latency of this code is again much larger than the intended approx. 10msec . . . Would you be able to investigate into this in the next weeks?

My final goal (if we finally manage to cut down the latency to the intendend < 10-20msec) would be to put this code into an audio lib "frequency domain lowpass Stereo FIR filter object for T4" with flexible choice of cutoff frequency and flexible choice of number of taps from 201 to 21632 (thus supplementing the available FIR filtering with up to 200 taps in the time domain).

All the best, Frank DD4WH

P.S.: @Brian: It took me a while to think about our conversation about the difference between real Stereo filtering and I/Q signal filtering. Maybe I understood a little more now: Stereo version would take different input frequencies for the calculation of the complex FIR filter coefficients: STEREO filter: HiCut frequency = - LowCutFrequency --> which produces a lowpass filter for both channels
I/Q filtering would require something like this: LowCutFrequency = 30Hz, HiCutFrequency = 10kHz --> classical bandpass AND supression of the opposite side band because of the 90 degree shifted input frequencies (I & Q) AND the 90 degree phase shift by the "Hilbert Transform" caused by the complex FIR filter coefficients.
 
Last edited:
@Brian: before you get out your scope again, one idea: one could test the latency of the code by creating an impulse in software (just a 1.0 in one sample followed by all 0.0 samples) and then output the filtered samples in the Serial Monitor without running the system in realtime. We would need 256 samples of the output printed, if the latency is as we want it, and nfor * 256 samples, if it is not like we expect ;-). Maybe I will try this in the next days, if time allows . . .
 
@Frank: I must have accidentally "unsubscribed from this thread, as I didn't get any email alerts and just happened to run across your newest posts today. Before commenting on your remarks, I'll mention that I modified my convolution library object a few weeks back. Rather than accumulating 4 audio blocks before I perform the 1024-pt FFT-iFFT filtering, I now do 1024-pt FFT/iFFT for EACH audio block that comes in, and use 3 prior blocks from a buffer. That has reduced the overall latency from 20 ms down to 14 ms, which I think is the absolute minimum latency one can get while using the Audio library framework (allowing for the fact that the SGTL5000's ADC has a built in digital filter which adds inherent delay).
Where the old routine took 45- 50% CPU bandwidth during 1 out of 4 audio block times (T3.6 @ 240 MHz) and negligible bandwidth for the remaining 3 block times, the new one takes about 15-20% ( I forget exact number) continuously BUT it is using a T4 of course.
My new routine does not use the segmented method that you are experimenting with. It could be expanded to work with larger FFTs than the 1024 I now use. The latency would increase of course, but most of the latency (@ 1024-pt FFT) comes from the audio lib's block structure, not in doing the actual FFT/iFFT- especially on the T4 which is so fast.
Re your questions/remarks:
Stereo vs I/Q - Your SDR experience is I/Q of course, and I believe your thinking is correct in that regard. My experience is musical/consumer audio, so I think in terms of stereo. For the filter to claim to be stereo compatible, it would have to be able to take two completely different signals on L,R channels and process them independently (i.e.with no cross-talk between them showing up in the output signals). The actual filter characteristics would be immaterial, but for stereo use, they would be the same for both channels in virtually all use cases that I can think of.
The version of your filter that I checked out a month or so ago showed that a signal on L channel only would show up in both output channels, with the L channel correct, and the right channel distorted (differentiated) . I included a 'scope capture in post #21 above.
I just saw your new code now, so I haven't tried it out using hardware/scope yet.
Latency- post #31- If you do it that way you will get a low latency figure. But, you are not measuring the latency involved in loading and moving the blocks around in the audio library. There will be a couple of 2.9 ms block delays on the input side , and a couple on the output side. Plus, you won't see the inherent delay in the SGTL5000 (although you may be using another ADC with no DSP filter delay inherent in its design). But, you would be getting an accurate latency measurement apart from those two factors, I believe.
So, I think the scope is the only completely accurate way of measuring overall latency. I will try to make this measurement as soon as I get the chance.
Cheers.
 
Brian, nice to hear from you and especially that you optimized the low latency convolution object!

At the moment, there is no need to get your scope (thanks for offering that), because I discovered that the new routine also has the high latency we already observed. I tested that in software (you are of course right that this does not take into account the latency of the codec and the audio lib, it only measures the uniformly partitioned convolution routine latency).

Regarding stereo: The version in post #30 now is REAL stereo :) (Well, at least I hope it is and it sounds like real stereo with two independent channels). However, I think I will rebuild the whole routine to use two mono channels in order to get the latency problem solved . . . I hope I will be able to solve that, however I do not yet have a clue where the mistake is in the code ;-).
 
Frank- thanks for the prompt response. I hadn't yet examined your new code, so its good that you mentioned that you've re-written it to handle two discrete channels.
While I understand the basic FFT/iFFT concept well enough to have written my library object, that uniformly partitioned scheme is beyond my current level of understanding. While I thought I had solved the latency problem (post #16) by changing pointers, it screwed up the other channel with a delayed pulse of ~200 ms. However, if your new code is truly handling both channels independently, possibly the pointer changes I showed in Post #16 might work now. Did you take my post #16 into account? (I haven't had time to study your new code in detail yet as I am presently coding an ESP32 camera-based project).
My new low-latency code works well but is inefficient as it is doing the full 1024-pt FFT for every audio block (3 of the 4 blocks are "old" samples). But, with the T4, only 15-20% MCU bandwidth is needed @1024-pt. I could probably go out to 4096-pts before I ran out of processing power, and the latency would not be all that much greater than the 14 ms I get with the 1024-pt FFT now, given that most of that 14 ms is not actually taken up by the FFT/iFFT routine itself. But, I'd not get to the large FFT sizes you quote.
Good luck with your coding!
 
Brian! You already found it! You were just one off ;-). It had to be (k + 31) % nfor, NOT 32 . . .

It seems the low latency fast convolution code now does what it should.

It always has a latency of 128 samples regardless of which number of taps you use for the filtering (it has to be a power of two, however. Otherwise it still works, but the latency is a bit longer).

Still I do not understand the code thoroughly. What I changed is the following:

* use real FIR filter coefficients instead of complex ones, because no one needs Hilbert filtering for real audio input (that is necessary for IQ, however)
* for that purpose, I now use cos instead of sin for the second half of the coefficients. All the imaginary coeffs are zero now.
* the mask buffer for each block of FIR filter coefficients before the Coeffs´ FFT is now filled from the left side, and not the right side
* this makes the filtered audio appear in the right part of the inverse FFT output buffer and not in the left part
* as you suggested, I substituted [(k + kshift) % nfor] instead of k for the indexes in the complex multiply routine
* int kshift = nc / partitionsize / 2 - 1;
* for testing the latency in real time, the code does the following: it waits for one second, then it injects a very short impulse of two samples [-1, 1] in both channels. After that, it checks the output of the inverse FFT for those spikes and prints them to the Serial Monitor.
* for performing the latency test, just uncomment #define LATENCY_TEST [you will also hear a very short "smack", that is the two sample spike produced]

The code works for FIR filter sizes of 128 up to 21632 and ALWAYS has a latency of 128 samples [I hope :)].

BTW, for a filter length of 1024, it uses 5.1% of the processor ressources of a Teensy 4.0 :).

The separation of the two stereo channels seems to be perfect to my ears.

Please test thoroughly! This is still a preliminary version and probably still contains bugs.

Code:
/* (c) Frank DD4WH 2019-11-01
 *  
 * Real Time PARTITIONED BLOCK CONVOLUTION FILTERING (Stereo)
 * 
 * FIR filtering fun with the Teensy 4.0 with up to 21632 filter taps in each channel
 * 
 * LOW LATENCY is the aim of this approach, it should be no more than 128 samples
 * However, there is additional latency by the codec used and by the overhead of the audio lib
 * 
 * implements a lowpass filter for each channel at a user-defined lowpass cutoff frequency
 * 
 * uses Teensy 4.0 and external ADC / DAC connected on perf board with ground plane
 * but should run also with the PJRC Teensy Audio board version D (compatible with Teensy 4.0)
 * 
 * this sketch does only run with the Teensy 4.0, do not use with Teensy 3.x
 * 
 * My sincere thanks go to Brian Millier, who accurately measured latency in a first version and also found the bug in the system!  
 * 
 *  PCM5102A DAC module
    VCC = Vin
    3.3v = NC
    GND = GND
    FLT = GND
    SCL = 23 / MCLK via series 100 Ohm
    BCK = BCLK (21)
    DIN = TX (7)
    LCK = LRCLK (20)
    FMT = GND
    XMT = 3.3V (HIGH)
    
    PCM1808 ADC module:    
    FMT = GND
    MD1 = GND
    MD0 = GND
    GND = GND
    3.3V = 3.3V --> ADC needs both: 5V AND 3V3
    5V = VIN
    BCK = BCLK (21) via series 100 Ohm
    OUT = RX (8)
    LRC = LRCLK (20) via series 100 Ohm
    SCK = MCLK (23) via series 100 Ohm
    GND = GND
    3.3V = 3.3V
 * 
 * uses code from wdsp library by Warren Pratt
 * https://github.com/g0orx/wdsp/blob/master/firmin.c
 * 
 ********************************************************************************
   GNU GPL LICENSE v3

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>
 ********************************************************************************/

#include <Audio.h>
#include <Wire.h>
#include <SPI.h>
#include <arm_math.h>
#include <arm_const_structs.h> // in the Teensy 4.0 audio library, the ARM CMSIS DSP lib is already a newer version
#include <utility/imxrt_hw.h> // necessary for setting the sample rate, thanks FrankB !

extern "C" uint32_t set_arm_clock(uint32_t frequency);

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
// USER DEFINES
//uncomment for pass-thru
//#define PASSTHRU_AUDIO

// uncomment for Serial print out of latency
#define LATENCY_TEST

// define your frequency for the lowpass filter
const double PROGMEM FHiCut = 22000.0; // for the (young) audiophile ;-)
const double PROGMEM FLoCut = -FHiCut;
const float32_t PROGMEM audio_gain = 6.5;
// define your sample rate
const double PROGMEM SAMPLE_RATE = 48000;  
const int PROGMEM partitionsize = 128; 

// define the number of FIR taps of your filter
//const int nc = 384;
//const int nc = 128;
//const int nc = 256;
//const int nc = 512; // number of taps for the FIR filter
const int nc = 1024; // number of taps for the FIR filter
//const int nc = 2048; // number of taps for the FIR filter
//const int nc = 4096; // number of taps for the FIR filter
//const int nc = 8192; // number of taps for the FIR filter
//const int nc = 16384; // number of taps for the FIR filter
//const int PROGMEM nc = 21632; // MAXIMUM number of taps for the FIR filter --> memory constraint, not processor speed

// this is the shift necessary for k, figured out by Brian Millier, thanks a lot !
int kshift = nc / partitionsize / 2 - 1;

// the latency of the filter is meant to be the same regardless of the number of taps for the filter
// partition size of 128 translates to a latency of 128/sample rate, ie. to 2.9msec with 44.1ksps [if the number of taps is a power of two]
///////////////////////////////////////////////////////////////////////////////////////////////////////////////

#define DEBUG
#define FOURPI  (2.0 * TWO_PI)
#define SIXPI   (3.0 * TWO_PI)
#define BUFFER_SIZE 128
int32_t sum;
int idx_t = 0;
int16_t *sp_L;
int16_t *sp_R;
uint8_t PROGMEM FIR_filter_window = 1;
const uint32_t PROGMEM FFT_L = 256; 
uint32_t sample_counter = 0;
uint32_t all_samples_counter = 0;
uint8_t no_more_latency_test = 0;
float32_t mean = 1;
uint8_t first_block = 1; 
const uint32_t PROGMEM FFT_length = FFT_L;
const int PROGMEM nfor = nc / partitionsize; // number of partition blocks --> nfor = nc / partitionsize
float DMAMEM cplxcoeffs[nc * 2]; // this holds the initial complex coefficients for the filter BEFORE partitioning
float32_t  maskgen[FFT_L * 2];
float32_t fmask[nfor][FFT_L * 2]; // 
float32_t  fftin[FFT_L * 2];
float DMAMEM fftout[nfor][FFT_L * 2]; // 
float32_t  accum[FFT_L * 2];

int buffidx = 0;
int k = 0;

const uint32_t N_B = FFT_L / 2 / BUFFER_SIZE;
uint32_t N_BLOCKS = N_B;
float32_t float_buffer_L [BUFFER_SIZE * N_B];  // 
float32_t float_buffer_R [BUFFER_SIZE * N_B]; 
float32_t last_sample_buffer_L [BUFFER_SIZE * N_B];  
float32_t last_sample_buffer_R [BUFFER_SIZE * N_B]; 
// complex FFT with the new library CMSIS V4.5
const static arm_cfft_instance_f32 *S;
// complex iFFT with the new library CMSIS V4.5
const static arm_cfft_instance_f32 *iS;
// FFT instance for direct calculation of the filter mask
// from the impulse response of the FIR - the coefficients
const static arm_cfft_instance_f32 *maskS;

// this audio comes from the codec by I2S
AudioInputI2S            i2s_in;
AudioRecordQueue         Q_in_L;
AudioRecordQueue         Q_in_R;
AudioMixer4              mixleft;
AudioMixer4              mixright;
AudioPlayQueue           Q_out_L;
AudioPlayQueue           Q_out_R;
AudioOutputI2S           i2s_out;

AudioConnection          patchCord1(i2s_in, 1, Q_in_L, 0);
AudioConnection          patchCord2(i2s_in, 0, Q_in_R, 0);
AudioConnection          patchCord3(Q_out_L, 0, mixleft, 0);
AudioConnection          patchCord4(Q_out_R, 0, mixright, 0);
AudioConnection          patchCord9(mixleft, 0,  i2s_out, 1);
AudioConnection          patchCord10(mixright, 0, i2s_out, 0);
AudioControlSGTL5000     audio_codec;


void setup() {
  Serial.begin(115200);
  while(!Serial);

  AudioMemory(10); 
  delay(100);

  set_arm_clock(600000000);

  /****************************************************************************************
     Audio Setup
  ****************************************************************************************/
  // Enable the audio shield (if present !). select input, and enable output
  audio_codec.enable();
  audio_codec.inputSelect(AUDIO_INPUT_LINEIN);
  audio_codec.adcHighPassFilterDisable();
  audio_codec.volume(0.5); 

  mixleft.gain (0, 1.0);
  mixright.gain(0, 1.0);

  setI2SFreq(SAMPLE_RATE);

  /****************************************************************************************
     set filter bandwidth
  ****************************************************************************************/
  // this routine does all the magic of calculating the FIR coeffs
  fir_bandpass (cplxcoeffs, nc, FLoCut, FHiCut, SAMPLE_RATE, 0, 1, 1.0);

  /****************************************************************************************
     init complex FFTs
  ****************************************************************************************/
  switch (FFT_length)
  {
    case 256:
      S = &arm_cfft_sR_f32_len256;
      iS = &arm_cfft_sR_f32_len256;
      maskS = &arm_cfft_sR_f32_len256;
      break;
  }

  /****************************************************************************************
     Calculate the FFT of the FIR filter coefficients once to produce the FIR filter mask
  ****************************************************************************************/
    init_partitioned_filter_masks();

  /****************************************************************************************
     properly initialise variables in DMAMEM
  ****************************************************************************************/

  for(unsigned jj = 0; jj < nfor; jj++)
  {
    for(unsigned ii = 0; ii < FFT_L * 2; ii++)
    {
      fftout[jj][ii] = 0.1;
    }
  }
  
  /****************************************************************************************
     display info on memory usage and convolution stuff
  ****************************************************************************************/
    flexRamInfo();
    Serial.println();
    Serial.print("Number of blocks:   "); Serial.println(nfor);
    Serial.println();    
    Serial.print("number of k shift:   "); Serial.println(kshift);
    Serial.println();    
    Serial.print("FIR filter length = "); Serial.println(nc);
    Serial.println();    
    
  /****************************************************************************************
     begin to queue the audio from the audio library
  ****************************************************************************************/
  delay(100);
  Q_in_L.begin();
  Q_in_R.begin();

} // END OF SETUP

void loop() {
  elapsedMicros usec = 0;
  // are there at least N_BLOCKS buffers in each channel available ?
    if (Q_in_L.available() > N_BLOCKS && Q_in_R.available() > N_BLOCKS)
    {

      // get audio samples from the audio  buffers and convert them to float
      for (unsigned i = 0; i < N_BLOCKS; i++)
      {
        sp_L = Q_in_L.readBuffer();
        sp_R = Q_in_R.readBuffer();

        // convert to float one buffer_size
        // float_buffer samples are now standardized from > -1.0 to < 1.0
        arm_q15_to_float (sp_L, &float_buffer_L[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
        arm_q15_to_float (sp_R, &float_buffer_R[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
        Q_in_L.freeBuffer();
        Q_in_R.freeBuffer();
      }
 
      /**********************************************************************************
          Fast convolution filtering == FIR filter in the frequency domain
       **********************************************************************************/
      //  basis for this was Lyons, R. (2011): Understanding Digital Processing.
      //  "Fast FIR Filtering using the FFT", pages 688 - 694
      //  Method used here: overlap-and-save

      // ONLY FOR the VERY FIRST FFT: fill first samples with zeros
      if (first_block) // fill real & imaginaries with zeros for the first BLOCKSIZE samples
      {
        for (unsigned i = 0; i < partitionsize * 4; i++)
        {
          fftin[i] = 0.0;
        }
        first_block = 0;
      }
      else
      {  // HERE IT STARTS for all other instances
        // fill FFT_buffer with last events audio samples
        for (unsigned i = 0; i < partitionsize; i++)
        {
          fftin[i * 2] = last_sample_buffer_L[i]; // real
          fftin[i * 2 + 1] = last_sample_buffer_R[i]; // imaginary
        }
      }

#if defined(LATENCY_TEST)
        if(all_samples_counter > SAMPLE_RATE && !no_more_latency_test)
        {
        // latency test
        fftin[42] = 1.0; fftin[44] = -1.0;
        fftin[43] = 1.0; fftin[45] = -1.0;
        no_more_latency_test = 1;
        }
      if(!no_more_latency_test) all_samples_counter += partitionsize; 
#endif           

      // copy recent samples to last_sample_buffer for next time!
      for (unsigned i = 0; i < partitionsize; i++)
      {
        last_sample_buffer_L [i] = float_buffer_L[i];
        last_sample_buffer_R [i] = float_buffer_R[i];
      }

      // now fill recent audio samples into FFT_buffer (left channel: re, right channel: im)
      for (unsigned i = 0; i < partitionsize; i++)
      {
        fftin[FFT_length + i * 2] = float_buffer_L[i]; // real
        fftin[FFT_length + i * 2 + 1] = float_buffer_R[i]; // imaginary
      }
      
      /**********************************************************************************
          Complex Forward FFT
       **********************************************************************************/
      // calculation is performed in-place the FFT_buffer [re, im, re, im, re, im . . .]
      arm_cfft_f32(S, fftin, 0, 1);
      for(unsigned i = 0; i < partitionsize * 4; i++)
      {
          fftout[buffidx][i] = fftin[i];
      }

      /**********************************************************************************
          Complex multiplication with filter mask (precalculated coefficients subjected to an FFT)
          this is taken from wdsp library by Warren Pratt firmin.c
       **********************************************************************************/
      k = buffidx;

      for(unsigned i = 0; i < partitionsize * 4; i++)
      {
          accum[i] = 0.0;
      }
      
      for(unsigned j = 0; j < nfor; j++)
      { 
          for(unsigned i = 0; i < 2 * partitionsize; i++)
          {
              accum[2 * i + 0] += fftout[(k + kshift) % nfor][2 * i + 0] * fmask[j][2 * i + 0] -
                                  fftout[(k + kshift) % nfor][2 * i + 1] * fmask[j][2 * i + 1];
              
              accum[2 * i + 1] += fftout[(k + kshift) % nfor][2 * i + 0] * fmask[j][2 * i + 1] +
                                  fftout[(k + kshift) % nfor][2 * i + 1] * fmask[j][2 * i + 0]; 
          }
          k = k - 1;
          if(k < 0)
          {
            k = nfor - 1;
          } 
      } // end nfor loop

      buffidx = buffidx + 1;
      if(buffidx >= nfor)
      {
          buffidx = 0;    
      } 
      /**********************************************************************************
          Complex inverse FFT
       **********************************************************************************/
      arm_cfft_f32(iS, accum, 1, 1);

      /**********************************************************************************
          Overlap and save algorithm, which simply means yóu take only the right part of the buffer and discard the left part
       **********************************************************************************/
      // I reversed the reversion of this, so this is history: "somehow it seems it is reversed: I discard the right part and take the left part . . ."
      // see in function init_filter_masks -->  
        for (unsigned i = 0; i < partitionsize; i++)
        { // the right part of the iFFT output is unaliased audio
          float_buffer_L[i] = accum[partitionsize * 2 + i * 2 + 0] * audio_gain;
          float_buffer_R[i] = accum[partitionsize * 2 + i * 2 + 1] * audio_gain;
          //float_buffer_L[i] = accum[i * 2 + 0] * audio_gain;
          //float_buffer_R[i] = accum[i * 2 + 1] * audio_gain;
        }

      /**********************************************************************************
          Serial print the first output samples in order to check for latency
       **********************************************************************************/
#if defined(LATENCY_TEST)
        if(sample_counter < nc * 4 + SAMPLE_RATE)
        {
          for(unsigned i = 0; i < partitionsize; i++)
          {
            if(accum[i * 2 + 0] > 0.1 || accum[i * 2 + 0] < -0.1)
            {
              Serial.print(sample_counter + i - all_samples_counter - 21); Serial.print("   left:   "); Serial.println(accum[i * 2 + 0]);
              Serial.print(sample_counter + i - all_samples_counter - 21); Serial.print("  right:   "); Serial.println(accum[i * 2 + 1]);
            }
          }
          sample_counter += partitionsize;
        }
        else
        {
          sample_counter = nc * 4;
        }
#endif
       /**********************************************************************
          CONVERT TO INTEGER AND PLAY AUDIO - Push audio into I2S audio chain
       **********************************************************************/
      for (int i = 0; i < N_BLOCKS; i++)
        {
          sp_L = Q_out_L.getBuffer();    
          sp_R = Q_out_R.getBuffer();
          arm_float_to_q15 (&float_buffer_L[BUFFER_SIZE * i], sp_L, BUFFER_SIZE); 
          arm_float_to_q15 (&float_buffer_R[BUFFER_SIZE * i], sp_R, BUFFER_SIZE);
          Q_out_L.playBuffer(); // play it !  
          Q_out_R.playBuffer(); // play it !
        }

       /**********************************************************************************
          PRINT ROUTINE FOR ELAPSED MICROSECONDS
       **********************************************************************************/
#ifdef DEBUG
      sum = sum + usec;
      idx_t++;
      if (idx_t > 500) {
        mean = sum / idx_t;
        if (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT < 100.0)
        {
          Serial.print("processor load:  ");
          Serial.print (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT);
          Serial.println("%");
        }
        else
        {
          Serial.println("100%");
        }
        Serial.print (mean);
        Serial.print (" microsec for ");
        Serial.print (N_BLOCKS);
        Serial.print ("  stereo blocks    ");
        Serial.print("FFT-length = "); Serial.print(FFT_length);
        Serial.print(";   FIR filter length = "); Serial.println(nc);
        Serial.print("k = "); Serial.println(k);
        Serial.print("buffidx = "); Serial.println(buffidx);
        idx_t = 0;
        sum = 0;
      }
#endif
    } // end of audio process loop
} // end main loop

void init_partitioned_filter_masks()
{
#ifndef PASSTHRU_AUDIO
    for(unsigned j = 0; j < nfor;j++)
    {
      // fill with zeroes
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          maskgen[i] = 0.0;  
      }
      // take part of impulse response and fill into maskgen
      for (unsigned i = 0; i < partitionsize * 2; i++)
      { // this commented first option makes the unaliased output of the iFFT buffer appear in the left part 
//          maskgen[i + partitionsize * 2] = cplxcoeffs[i + j * partitionsize * 2];  
        // this makes the unaliased output appear in the right part of the iFFT buffer
          maskgen[i] = cplxcoeffs[i + j * partitionsize * 2];  
      }
      // perform complex FFT on maskgen
      arm_cfft_f32(maskS, maskgen, 0, 1);
      // fill into fmask array
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          fmask[j][i] = maskgen[i];  
      }    
    }
    
#else
// passthru
    
    for(unsigned j = 0; j < nfor;j++)
    {
      // fill with zeroes
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          maskgen[i] = 0.0;  
      }
      if(j==0) maskgen[0] = 1.0; 
      arm_cfft_f32(maskS, maskgen, 0, 1);      
      for (unsigned i = 0; i < partitionsize * 4; i++)
      {
          fmask[j][i] = maskgen[i];  
      } 
    }
#endif
}

// taken from wdsp library by Warren Pratt
void fir_bandpass (float * impulse, int N, double f_low, double f_high, double samplerate, int wintype, int rtype, double scale)
{
  double ft = (f_high - f_low) / (2.0 * samplerate);
  double ft_rad = TWO_PI * ft;
  double w_osc = PI * (f_high + f_low) / samplerate;
  int i, j;
  double m = 0.5 * (double)(N - 1);
  double delta = PI / m;
  double cosphi;
  double posi, posj;
  double sinc, window, coef;

  if (N & 1)
  {
    switch (rtype)
    {
    case 0:
      impulse[N >> 1] = scale * 2.0 * ft;
      break;
    case 1:
      impulse[N - 1] = scale * 2.0 * ft;
      impulse[  N  ] = 0.0;
      break;
    }
  }
  for (i = (N + 1) / 2, j = N / 2 - 1; i < N; i++, j--)
  {
    posi = (double)i - m;
    posj = (double)j - m;
    sinc = sin (ft_rad * posi) / (PI * posi);
    switch (wintype)
    {
    case 0: // Blackman-Harris 4-term
      cosphi = cos (delta * i);
      window  =             + 0.21747
          + cosphi *  ( - 0.45325
          + cosphi *  ( + 0.28256
          + cosphi *  ( - 0.04672 )));
      break;
    case 1: // Blackman-Harris 7-term
      cosphi = cos (delta * i);
      window  =       + 6.3964424114390378e-02
          + cosphi *  ( - 2.3993864599352804e-01
          + cosphi *  ( + 3.5015956323820469e-01
          + cosphi *  ( - 2.4774111897080783e-01
          + cosphi *  ( + 8.5438256055858031e-02
          + cosphi *  ( - 1.2320203369293225e-02
          + cosphi *  ( + 4.3778825791773474e-04 ))))));
      break;
    }
    coef = scale * sinc * window;
    switch (rtype)
    {
    case 0:
      impulse[i] = + coef * cos (posi * w_osc);
      impulse[j] = + coef * cos (posj * w_osc);
//      impulse[j] = - coef * cos (posj * w_osc);
      break;
    case 1:
/*      impulse[2 * i + 0] = + coef * cos (posi * w_osc);
      impulse[2 * i + 1] = - coef * sin (posi * w_osc);
      impulse[2 * j + 0] = + coef * cos (posj * w_osc);
      impulse[2 * j + 1] = - coef * sin (posj * w_osc);
*/
      impulse[2 * i + 0] = + coef * cos (posi * w_osc);
      impulse[2 * i + 1] = 0; //+ coef * cos (posi * w_osc);
      impulse[2 * j + 0] = + coef * cos (posj * w_osc);
      impulse[2 * j + 1] = 0; //+ coef * cos (posj * w_osc);

      break;
    }
  }
}

void setI2SFreq(int freq) {
  // thanks FrankB !
  // PLL between 27*24 = 648MHz und 54*24=1296MHz
  int n1 = 4; //SAI prescaler 4 => (n1*n2) = multiple of 4
  int n2 = 1 + (24000000 * 27) / (freq * 256 * n1);
  double C = ((double)freq * 256 * n1 * n2) / 24000000;
  int c0 = C;
  int c2 = 10000;
  int c1 = C * c2 - (c0 * c2);
  set_audioClock(c0, c1, c2, true);
  CCM_CS1CDR = (CCM_CS1CDR & ~(CCM_CS1CDR_SAI1_CLK_PRED_MASK | CCM_CS1CDR_SAI1_CLK_PODF_MASK))
       | CCM_CS1CDR_SAI1_CLK_PRED(n1-1) // &0x07
       | CCM_CS1CDR_SAI1_CLK_PODF(n2-1); // &0x3f 
}

void flexRamInfo(void)
{ // credit to FrankB, KurtE and defragster !
#if defined(__IMXRT1052__) || defined(__IMXRT1062__)
  int itcm = 0;
  int dtcm = 0;
  int ocram = 0;
  Serial.print("FlexRAM-Banks: [");
  for (int i = 15; i >= 0; i--) {
    switch ((IOMUXC_GPR_GPR17 >> (i * 2)) & 0b11) {
      case 0b00: Serial.print("."); break;
      case 0b01: Serial.print("O"); ocram++; break;
      case 0b10: Serial.print("D"); dtcm++; break;
      case 0b11: Serial.print("I"); itcm++; break;
    }
  }
  Serial.print("] ITCM: ");
  Serial.print(itcm * 32);
  Serial.print(" KB, DTCM: ");
  Serial.print(dtcm * 32);
  Serial.print(" KB, OCRAM: ");
  Serial.print(ocram * 32);
#if defined(__IMXRT1062__)
  Serial.print("(+512)");
#endif
  Serial.println(" KB");
  extern unsigned long _stext;
  extern unsigned long _etext;
  extern unsigned long _sdata;
  extern unsigned long _ebss;
  extern unsigned long _flashimagelen;
  extern unsigned long _heap_start;

  Serial.println("MEM (static usage):");
  Serial.println("RAM1:");

  Serial.print("ITCM = FASTRUN:      ");
  Serial.print((unsigned)&_etext - (unsigned)&_stext);
  Serial.print("   "); Serial.print((float)((unsigned)&_etext - (unsigned)&_stext) / ((float)itcm * 32768.0) * 100.0);
  Serial.print("%  of  "); Serial.print(itcm * 32); Serial.print("kb   ");
  Serial.print("  (");
  Serial.print(itcm * 32768 - ((unsigned)&_etext - (unsigned)&_stext));
  Serial.println(" Bytes free)");
 
  Serial.print("DTCM = Variables:    ");
  Serial.print((unsigned)&_ebss - (unsigned)&_sdata);
  Serial.print("   "); Serial.print((float)((unsigned)&_ebss - (unsigned)&_sdata) / ((float)dtcm * 32768.0) * 100.0);
  Serial.print("%  of  "); Serial.print(dtcm * 32); Serial.print("kb   ");
  Serial.print("  (");
  Serial.print(dtcm * 32768 - ((unsigned)&_ebss - (unsigned)&_sdata));
  Serial.println(" Bytes free)");

  Serial.println("RAM2:");
  Serial.print("OCRAM = DMAMEM:      ");
  Serial.print((unsigned)&_heap_start - 0x20200000);
  Serial.print("   "); Serial.print((float)((unsigned)&_heap_start - 0x20200000) / ((float)512 * 1024.0) * 100.0);
  Serial.print("%  of  "); Serial.print(512); Serial.print("kb");
  Serial.print("     (");
  Serial.print(512 * 1024 - ((unsigned)&_heap_start - 0x20200000));
  Serial.println(" Bytes free)");

  Serial.print("FLASH:               ");
  Serial.print((unsigned)&_flashimagelen);
  Serial.print("   "); Serial.print(((unsigned)&_flashimagelen) / (2048.0 * 1024.0) * 100.0);
  Serial.print("%  of  "); Serial.print(2048); Serial.print("kb");
  Serial.print("    (");
  Serial.print(2048 * 1024 - ((unsigned)&_flashimagelen));
  Serial.println(" Bytes free)");
  
#endif
}
 
@ Frank: Congratulations! That didn't take you too long to "zero in" on the right code. Speaking of zeros, it's funny how we sometimes forget that, in general, arrays use zero-based indexing! The 32 made sense logically, but I forgot to subtract one to match the zero-based array. I will try it out on the scope as soon as I can and get back to you.
This is another fine example of how the Teensy forum/collaboration makes problem-solving so much easier.
The T4 is a fine board. I only wish it had the form factor of the T3.5,6 with the built in SD card and many GPIO pins. I just finished a Teensy 4 MIDI Soundfont wavetable synth project. It was based upon the Audio library object of the same name, but that stored only 1 synth sound in Flash. I spent a lot of time converting that lib. so that it loaded the wavetables, etc from SRAM, which I loaded in from SD card. So, you could pick the voice you wanted from the LCD screen.
It was a bit tricky wiring up an external SD card socket to the small footprint on the bottom of the T4, but I did manage to get it working. It would have been much easier if the T4 came in the T3.6 footprint (which I understand Paul may implement in the future).
Cheers
 
@Frank: I decided to hook up my test equipment and try out your new routine. The latency is 16.4 ms @ SR 48000 - regardless of the FFT size, as you mentioned. To compare directly against my routines, I reduced SR to 44100, as that is what I use. At 44100, the latency is 17.9 ms, which is what one would expect.
There is no feedthrough from one channel to the other, so, as you say, it is now true stereo.
However, here is the bad news. Going from 1024 all the way up to 16384, I see no improvement in the filter response. I picked a cutoff frequency of 2500 Hz. At frequencies beyond the stopband, there are numerous "birdies" (hams use this term in N.A.) up until about 5000 Hz that I can easily measure on my scope- i.e. they are 1-3X larger than the SGTL5000 noise level (and I can hear them in my headphones). As I increase filter taps, these birdies are somewhat smaller but there are more of them. You would expect this, if you look at the computer-generated response curves of FIR filters- The "spurs" get smaller and more frequent as the filter taps increase.
However, with my original convolution filter object @ 1024 FFT, and everything else the same, I do not see (or hear) any such "birdies"- they are probably there but so low as to be below the SGTL5000 noise level. Since my convolution lib object, at its heart, is based upon your original routine, you should see the same results if you used your "old" code.
I am guessing that you don't have a good quality sine wave generator like I used for this test, and you wouldn't be able to detect this with normal voices or music going through the filter. But, there is no point in going to such large filter lengths if there is no improvement in filter response. The larger-tap filter may have a sharper cutoff slope, but it's hard to see much improvement when I just scan through the stop-band manually.

The only difference (apart from the filtering routine itself) between your new segmented program, and the program I run to test my convolution library, is in the FIR coefficient calc. routine. I am using the one that you embedded in your SDR program (attributed as follows:
{ // modified by WMXZ and DD4WH after
// Wheatley, M. (2011): CuteSDR Technical Manual. www.metronix.com, pages 118 - 120

whereas your current routine is from wdsp library by warren Pratt.

There is a possibility that the Pratt one isn't producing as accurate an array of coefficents.

While I am on the subject, I discovered that the FIR coef routine that I got from your SDR program works great for lowpass filters, but is off by a factor of 2 on bandpass filters. I recently used this routine in the Soundfont wavetable synth project for a 5-band graphic EQ, and when I measured the bandpass responses, they were all off by a factor of 2. I am not sure if the version of this routine that I got from your older SDR code is up-to-date or not. I can send you the lines of code I changed, if that routine is still used by you for SDR etc.

While not directly relevant to the above observations, I just did some measurements on the newer, slightly lower latency version of my convolution library. Originally, I had not compared it directly to the older version, apart from measuring the latency. It turns out it also suffers from those "birdies" that I see on your segmented routine. Apart from the program changes I made to reduce the latency, I also changed the library to accept the FIR mask array as a float, where the old FIR mask array was passed in to the convolution routine as an integer array scaled to +/- 32678. I expected this change would be more accurate/better, but there is something amiss in my newer version with regard to those "birdies". So, I have some work to do there. Overall, the latency improvement was only from 19.5 down to around 14, so I may not spend too much time on it.
Cheers
 
Hi Brian, thanks a lot for your excellent measurements and information! I have to digest all that and think about that in order to understand fully.

* as you said: the good news is that the latency problem is solved, so we now have a real solution for low latency filtering even with high number of taps.
* the high number of taps is meant to be used with impulse responses (eg. from guitar cabinets or room responses), not for filtering per se, I cannot think of a problem where you would need more than a 2048 / 4096 tap filter . . .
* so, one could theoretically use an impulse response file that is about 22000 taps long and use it with the T4
* excellent, that you investigated in the audio quality and the birdies. Have to think about that. --> Did you try to use the alternative windowing (Blackman-Harris-7-term)? I used the 4-term window as the default. Maybe you see differences when you use this :

Code:
fir_bandpass (cplxcoeffs, nc, FLoCut, FHiCut, SAMPLE_RATE, 1, 1, 1.0);

in the setup instead of

Code:
fir_bandpass (cplxcoeffs, nc, FLoCut, FHiCut, SAMPLE_RATE, 0, 1, 1.0);

Will test some more and keep you informed!

All the best, Frank DD4WH

P.S.: the latest version is now on github (in order not to spam too much PJRCs forum)
this version is using a coefficient buffer of one quarter of the original length allowing filter lengths up to 28800 taps with 99.6% memory use and 85.8% processor load at 600MHz [it was a little tricky to get all the memory used and the variables split between DMA and RAM1/DTCM]
https://github.com/DD4WH/Uniformly_partitioned_convolution
 
Last edited:
Hi Frank:
1) I tried the 7-term B-H mask. It may have changed things but I couldn't detect any difference.
2) I was wondering why you were interested in higher # of taps, as I thought you were focused on SDR, and as you say, filters > 2048 are not really needed. I got into this when challenged by a guitar-playing friend to come up with a Teensy 3.6 cabinet simulation program. He gave me a number of guitar cabinet impulse files that he had. The ones I got were only 513-tap files- but he did mention that some available ones were longer. But, I doubt even those were greater than 2048. I am not an expert on this by a long shot, but my feeling is that a 2049-tap impulse/4096-pt FFT will have frequency bins of 44100/4096 or ~11 Hz. I can't imagine the frequency response of the cabinet would be such that <11 Hz resolution would be useful.
3) You mention room response. In my experience this is what is called convolution reverb. I have such "plug-ins" in the Digital Audio Workstation software on the PC in my (hobby) music recording studio. The "room" files (impulses) for the convolution reverb would be much larger, as they have to handle reverberation times of 2 or more seconds, in general. I haven't tried to do such convolution reverb using Teensy 4, but if I understand things correctly, you would need 44100 X 2 for a 2 second reverb time. That is way beyond what even your segmented routine will handle, both in terms of memory and MCU bandwidth. Am I right about this ???
4) After seeing those weak "birdies" in your routine, and then finding them in my newest, lower latency convolution routine, I went back and looked more closely at my code. To summarize, I was unable to eliminate them (like my older, slower routine does)- they seem subjectively to be about the same as your segmented routine. However, I was able to optimize the coding a lot more, and got the latency down to 9.28 ms with 513-tap FIR @ 44,100 SR. That would be about 8.5 ms at 48000 SR. My routine is hard-coded at 513 taps, but very little of the 9.28 ms is taken up with the actual filtering algorithm: only 254 us. So the total latency wouldn't increase that much if I expanded the FIR taps to 1025 or 2049 i.e. only a few more ms.
5) To really evaluate how the higher-tap filters are working, I think that quantitative measurements need to be taken. So, for example, instead of the I2S block feeding the filter, you could use the Sinewave Audio block. Then you would sweep that over the freq. range desired. At the output, you have to leave the I2S intact (to make the audio library run) , but also "wire" in an RMS Audio block in parallel. Then the program would give a frequency response graph if you printed out the sine freq and RMS output to Serial port. The only issue here is that as the Sinewave Audio block goes up in frequency towards SR/2, it loses waveform accuracy, and if my memory serves me correctly, since the Sinewave block is a DDS, its amplitude will diminish as a sinc(x) rate. So, my good-quality waveform generator, which doesn't suffer from those two limitations, would be better, although I couldn't automate the frequency sweep.
If you're curious, here is my latest filter code- it's pretty short. I don't need the overlap/add method, as the FFT input is getting 4 blocks from a circular buffer and always has 3/4 old data and 1/4 current data

Cheers
Code:
void AudioFilterConvolution2::update(void)
{
	audio_block_t *block;
	short *bp;

	if (enabled != 1 ) return;

	block = receiveWritable(0);   // MUST be Writable, as convolution results are written into block
		if (block) {
//			elapsedMicros usec = 0;
			bp = block->data;
			if (passThru == 0) {
			memcpy(&buffer[0], &buffer[128], 768);  // shift 384 previous samples ahead in the buffer by 128
			memcpy(&buffer[384], bp, 256);
			//  do the FFT1024,complex multiply and iFFT1024  on 512 samples of data
			// not using the overlap/add method since data coming in is 1/4 current audio block and 3/4 older audio blocks
			// 1st convert Q15 samples to float
			arm_q15_to_float(buffer, float_buffer_L, 512);
			// float_buffer samples are now standardized from > -1.0 to < 1.0
			 //if (passThru ==0) {
			memset(FFT_buffer + 1024, 0, sizeof(FFT_buffer) / 2);    // zero pad last half of array- necessary to prevent aliasing in FFT
			//fill FFT_buffer with current audio samples
			k = 0;
			for (i = 0; i < 512; i++)
			{
				FFT_buffer[k++] = float_buffer_L[i];   // real
				FFT_buffer[k++] = float_buffer_L[i];   // imag
			}
			// calculations are performed in-place in FFT routines
			arm_cfft_f32(&arm_cfft_sR_f32_len1024, FFT_buffer, 0, 1);// perform complex FFT
			arm_cmplx_mult_cmplx_f32(FFT_buffer, FIR_filter_mask, iFFT_buffer, FFT_length);   // complex multiplication in Freq domain = convolution in time domain
			arm_cfft_f32(&arm_cfft_sR_f32_len1024, iFFT_buffer, 1, 1);  // perform complex inverse FFT
			k = 1024;
			for (int i = 0; i < 512; i++) {
				float_buffer_L[i] = iFFT_buffer[k += 2];
			}
		//} //end if passTHru
		// convert floats to Q15 and save in output block array space
		bp = block->data;
		arm_float_to_q15(&float_buffer_L[0], bp, BUFFER_SIZE);
	}
//			Serial.println(usec);
			transmit(block, 0);
			release(block);
	}
}
 
For guitar cabinet simulation, a resolution of 11Hz or even lower would be benifitial to resolve the bass response at 80..120Hz accurately.. with 11Hz resolution there would be just in 4 bins available to resolve the resonance peak. I would love to use 22050 tap IR....

Following this thread, and cannt wait to try and build a Guitar cabinet IR simulator with my T4.

Thanks for all the good work!
 
Hi Frank: Pls ignore my last comment about my signal generator providing a better sine wave source than the Sinewave object. It does, of course, but by the time it goes through the SGTL5000 ADC at 44100 SR, it won't look any different than the values that the Sinewave Audio lib puts out.
@tschama: Good point. I didn't think of the really low frequencies of a Bass guitar. I'm not a guitar player and only have a few cabinet impulse files from a friend who is- those are only 513 tap IR files which matched quite nicely with the 513-tap convolution filter object I made for the Teensy3.6. The T4 does open up a lot more possibilities.
 
@Brian: thanks a lot for your very insightful comments and observations. I have to admit that the partitioned convolution as it is at the moment still suffers from a problem:

* if the code stays at it is (without shifting the k index), it behaves like a very good brickwall lowpass filter, however the latency is no. of coefficients / two [i.e. for a 4096 tap filter, the latency is 2048 samples]. BTW, at 2800Hz (cutoff frequency set to 2500Hz, I cannot hear anything at the loudest volume I can set, not the tone and no artefacts: how did you analyse the birdies? Would it make sense to program an FFT with a spectrum display after the filter to display on an ILI9341 screen?)

* if the code is used with k index shifting, the latency is 128 samples regardless of the filter length, but the filter does not work properly any more [i.e. a 4096 tap filter has similar properties as a 256 tap filter, making the whole effort useless :)] --> and I can hear the tone and also some overtone/birdie stuff

So, we have to search the bug in the code, which prevents it from being a low latency brickwall filter. I spent some time on that the last days, but was not successful. I even reprogrammed the code to use a mono channel with a filter using real coefficients, but that has the same latency properties . . .
 
Hi Frank: Yes- you get one or the other. Looks like with no shifting you get great results- just like I do with my original convolution library object. However, when I use my latest routine (code which I last sent you) using the 4 audio block circular buffer to feed the filter once per 128 audio samples, I get the same residual signal beyond the stopband, and some birdies mixed in. Just like your segmented routine does. I didn't "analyse" anything per se. I just swept the audio generator frequency and listened to the headphones/ monitored the wavefrom on the scope.
What I think is needed is to feed the filter with the sinewave object, and scan it in the main program slowly through the audio range. Then attach an RMS audio object in parallel with the I2S output. That will give you an idea of the total signal, irrespective of frequency, for the region beyond the stopband. I realize that the "birdies" that you hear are often not the fundamental freq. that you are feeding in, but some mixing products. However, its the amplitude of these that is important, and one can't distinguish this well enough by ear only.
My newest code, with the circular input buffer idea could certainly be extended to 2048/4096 without increasing the latency too much beyond the 9.28 ms it has at 513. But, it still has those artifacts that your segmented version does.
regards
 
@Brian: I think I found something. Have you got an impulse response like the one you use for your guitar cabinet defined in C as a const variable, something like this:

Code:
const impulse [1024] = {0.1, 0.05, .... }

That would be helpful for me to test the latency of the code.

If not, I will try to extract that from the WAV file, however I have no idea yet, how that should be done.
 
Hi Frank: Did you get my Conv. filter routine from me directly (or thru forum) or did you get it from my Github page? There is one sample impulse file on the github page. It is in the standard IR file format i.e. a WAV file with 16-bit integers. I don't have the numbers in a C array format stored anywhere. In the example program on Github, it reads in this file from an SD card and stores it in an integer array. I then pass a pointer to that integer array into the convolution.impulse function. Its only in the convolution.impulse function that I convert the integer array into a float array , normalized to +/-1. If you load in my example program from Github, and have an SD card with that file on it, you can get at the integer array that way. If that's a problem for you, I could hook up the board again, modify the program to send the coefficients out to the serial port, and save it to a file. ( My newer convolution routine assumes the coefficients are coming from an FIR calculation, so it passes the normalized floating point numbers into the convolution/impulse function instead of integers. I mention this just for interest, as this doesn't apply to the code on the Github site)
A few comments on this:
1) The few IR files that I obtained from my guitarist friend were very long WAV files, but the sample values beyond 513 were all zeros. That's one of the reasons I chose that tap size for my convolution filter (to be honest, the Teensy 3.6 wouldn't have gone much higher anyway).
2) I play keyboards, and am used to a wealth of fancy sounds from both hardware synths and software synth plug-ins. I don't really play guitar much , although I have one. When I demonstrated my Teensy 3.6 cabinet impulse simulator to my friend- where he played his own guitar into it, and used the IR files that he had given me, he was very impressed. The 20 ms latency of the routine (at the time) was the only downside, in his comments to me. On the other hand, I could not really tell the difference between the cabinet-simulated version and the pass-through version, apart from it having a bit different tonal balance, which, in my opinion, could be obtained more easily by bass- treble controls, or a 5 band EQ for sure. I like to think I am pretty discerning when it comes to music recording quality, but I can't really appreciate the subtleties he seems to hear. I only mention this for what it's worth in tempering your expectations! In the same vein, I am not one to claim that a vacuum tube HiFi amplifier is better than a good solid state one, or that oxygen-free, high priced speaker cables make any difference, but plenty of people are of that opinion.
Please let me know if you want some help getting those coefficient array values.
Cheers
 
It seems the uniformly partitioned convolution is implemented correctly, BUT the FIR coefficients are not calculated in the right way. Warren Pratt (the designer of the wdsp library, one of the ultimate SDR libraries in C) gave me very detailed answers to my questions and stated that the FIR filter coefficients have to be minimum phase, not linear phase! [my fault, I should have read more thoroughly on this topic] At the moment, the FIR filter coeffs are linear phase and thus cannot exhibit a latency lower than half the number of coefficients . . . and thus are totally unsuitable for testing low latency algorithms ;-).

Thats why I would like to do two things in the next weeks/months:

* take a guitar cabinet impulse response, feed it into the algorithm and have a look at the latency: my guess would be, it has the intended 128 sample latency

* look for a way to calculate FIR filter minimum phase coefficients, which seems not to be the easiest task, if I believe what the Internet tells me from a quick search. The algorithms to convert linear phase coeffs to minimum phase do not seem to run on microcontrollers, because they need extraordinarily large FFT sizes and exhaustive memory. So, maybe the only way (also suggested by Warren) is to calculate the coeffs and store them in an array (or even store the FFT result = filter masks of those coeff arrays, so filter switching can be done really fast). However, even the calculation of minimum phase FIR filters on a PC does not seem to be straightforward, because we would want large filter sizes of 512, 1024, 2048, 4096, 8192 . . . I have MATLAB at my university, but have never worked with it.

Does anybody have MATLAB code to calculate FIR filter minimum phase coefficients for complex bandpass filters of sizes 512 to 8192 or can suggest any possibility to do this on the Teensy 4.0?
 
Hi Frank: That explains a lot. I'm not sure that the sample guitar IR file that I have will work as a minimum phase IR. If it does, I'm wondering why it would have worked with the conventional filter algorithm that we were using before. I'm sure that file is the raw IR file- you still have to perform an FFT on it inside the filter routine itself. That being said, it would not be too big an issue if one had to take the impulse response data collected for a specific guitar cabinet, do all the heavy math on a PC, and store the final coefficient array in a file on the Teensy.
I am in the same situation as you re Matlab. I'm retired now, but still have access to Dalhousie University site licensed software, and Matlab is available. But, I also haven't used it for a long time, and only a tiny amount back then.
I think I will still go ahead and setup a Teensy program to feed a swept sinewaveAudio lib object into my filter, and collect the RMS signal coming out. One just can't discern enough about the filter by just listening to the output when adjusting my sine wave signal generator by hand. I still am curious why our original algorithm provides "clean" response whereas my newer algorithm (low latency but high MCU usage) generates those spurious signals beyond the stop-band. My method is far different from the segmented one, but seems to suffer from the same spurious signal issue as the segmented one.
Your doing some really involved work on this filter routine- you deserve lots of credit!
 
The uniformly partitioned low latency convolution seems to work now with the guitar cabinet impulse files AND with low latency (as far as I can test it here).

The recent version uses Stereo processing and AUDIO_BLOCK_SAMPLES 128

I hardcoded four different impulse response files to choose from in the code (all at 44.1ksps sample rate):

A) 512 samples [3.52% processing load, 7% RAM1 / 4% DMAMEM] - your MG.wav file from your github

B) 4096 samples [13.83% processing load, 25% RAM1 / 15% DMAMEM] - Marshall1960A rig (from cabIR.eu)

C) 8192 samples [25.5% processing load, 44% RAM1 / 27% DMAMEM] - Marshall 1971 rig (170ms impulse response from cabIR.eu)

D) 17920 samples [53.0% processing load, 92% RAM1 / 57% DMAMEM] - Telos desIRe (400ms impulse response)

All four sound really different. The maximum impulse response length that can be processed is a bit less than 0.5 sec long (in Stereo). The size of the impulse response MUST be a multiple of 128 samples, but you can easily zero pad the file to achieve this.

My latency measurements consistently show 128 sample latency in the software regardless of the size of the impulse response.

Then I changed AUDIO_BLOCK_SAMPLES to 64 in AudioStream.h and the latency could even be reduced to 64 samples by using a 128-point-FFT. The code automatically accounts for switching FFT size, if you alter "partitionsize".

So, it seems the only remaining problem now is to search for code that calculates minimum phase FIR coefficients on the Teensy 4.0 ;-).

Thanks for all your input, support and motivation, Brian! I am also very curious about the cause of the spurious signals!

The code for playing audio through one of the four impulse responses is on github:

https://github.com/DD4WH/Uniformly_partitioned_convolution/tree/master/guitar_cabinet_impulse
 
Last edited:
Hi Frank: You've been busy! I loaded your program, using the default IR file and adding a few lines to handle the SGTL5000 that I use. The measured latency is 15 ms, which is perfectly fine for an 8192 tap filter. There is 7 ms latency inherent in the audio library/SGTL5000 at 44100 SR .
I can definitely see the freq response vary markedly over the audio range on the scope, as well as hear it ( when I sweep the signal generator over the audio range). Here is the scope capture:
SDS00026.jpg
You can see the pulse shape is radically changed, but this is due to the filtering I expect.
So, to refresh me, the "k+31" change that I had earlier suggested to you is no longer needed/used because the filter mask is now properly defined as minimum phase in both my MG and the other 2 cabinet impulse files?
Good to see that you've finally achieved success!
 
Hi Brian, yes, the k+31 change is not necessary. The code probably worked from the very beginning, but only with impulse response files, NOT with linear phase FIR filter coefficients (my testing was exclusively done with linear phase FIR filter coeffs, so I did not recognize that it was actually doing what it should). If we use FIR coeffs for the code, they have to be minimum phase. Obviously, recorded impulse responses have no inherent delay, so the latency is determined by the FFT size alone, but linear phase FIR always have a latency of half the number of taps plus the delay for the sample block of the FFT.

It would be very good if we could find someone with knowledge on minimum phase filter coeffs calculation. My MATLAB is running and I did some trials on getting coeffs out of it, but I have to test much more to be sure they could work on the Teensy 4 with the partitioned convolution code (and I am unsure how to obtain complex bandpass coefficients). Still a long way to go ;-).
 
Status
Not open for further replies.
Back
Top