Forum Rule: Always post complete source code & details to reproduce any issue!
Results 1 to 25 of 29

Thread: Fast Convolution Filtering with Teensy 4.0 and audio board

Threaded View

  1. #1
    Senior Member DD4WH's Avatar
    Join Date
    Oct 2015
    Location
    Central Europe
    Posts
    458

    Fast Convolution Filtering with Teensy 4.0 and audio board

    To test T4 performance I extracted the fast convolution filtering code from my SDR and tested it with the Teensy 4.0 and the Teensy audio board. Both were connected with flying wires (MCLK with 100ohms series resistor).

    Code:
    /*
     * just a small sketch to play around with audio processing with the Teensy 4
     * 
     * FAST CONVOLUTION FILTERING in Stereo
     * 
     * uses Teensy 4.0 and the Teensy audio board connected with flying wires
     * 
     * Frank DD4WH 2019-08-17
     * 
     * in the public domain GNU GPL
     */
    
    #include <Audio.h>
    #include <Wire.h>
    #include <SPI.h>
    #include <arm_math.h>
    #include <arm_const_structs.h> // in the Teensy 4.0 audio library, the ARM CMSIS DSP lib is already a newer version
    
    #define DEBUG
    #define FOURPI  (2.0 * TWO_PI)
    #define SIXPI   (3.0 * TWO_PI)
    #define BUFFER_SIZE 128
    int32_t sum;
    int idx_t = 0;
    float32_t mean;
    int16_t *sp_L;
    int16_t *sp_R;
    uint8_t FIR_filter_window = 1;
    uint8_t first_block = 1; 
    //double FLoCut = 150.0;
    //double FHiCut = 4500.0;
    double FLoCut = 5.0;
    double FHiCut = 20000.0;
    double SAMPLE_RATE = (double)AUDIO_SAMPLE_RATE_EXACT;  
    //const uint32_t FFT_L = 256; 
    //const uint32_t FFT_L = 512; 
    //const uint32_t FFT_L = 1024;
    //const uint32_t FFT_L = 2048;
    const uint32_t FFT_L = 4096; 
    double FIR_Coef_I[(FFT_L / 2) + 1]; // 2048 * 4 = 8kb
    double FIR_Coef_Q[(FFT_L / 2) + 1]; // 2048 * 4 = 8kb
    #define MAX_NUMCOEF (FFT_L / 2) + 1
    uint32_t m_NumTaps = (FFT_L / 2) + 1;
    uint32_t FFT_length = FFT_L;
    const uint32_t N_B = FFT_L / 2 / BUFFER_SIZE;
    uint32_t N_BLOCKS = N_B;
    float32_t float_buffer_L [BUFFER_SIZE * N_B];  // 2048 * 4 = 8kb
    float32_t float_buffer_R [BUFFER_SIZE * N_B]; // 2048 * 4 = 8kb
    
    float32_t FFT_buffer [FFT_L * 2] __attribute__ ((aligned (4)));  // 4096 * 4 = 16kb
    float32_t last_sample_buffer_L [BUFFER_SIZE * N_B];  // 2048 * 4 = 8kb
    float32_t last_sample_buffer_R [BUFFER_SIZE * N_B]; // 2048 * 4 = 8kb
    // complex FFT with the new library CMSIS V4.5
    const static arm_cfft_instance_f32 *S;
    // complex iFFT with the new library CMSIS V4.5
    const static arm_cfft_instance_f32 *iS;
    float32_t iFFT_buffer [FFT_L * 2] __attribute__ ((aligned (4)));  // 4096 * 4 = 16kb
    
    // FFT instance for direct calculation of the filter mask
    // from the impulse response of the FIR - the coefficients
    const static arm_cfft_instance_f32 *maskS;
    float32_t FIR_filter_mask [FFT_L * 2] __attribute__ ((aligned (4)));  // 4096 * 4 = 16kb
    
    // this audio comes from the codec by I2S
    AudioInputI2S            i2s_in;
    AudioRecordQueue         Q_in_L;
    AudioRecordQueue         Q_in_R;
    AudioMixer4              mixleft;
    AudioMixer4              mixright;
    AudioPlayQueue           Q_out_L;
    AudioPlayQueue           Q_out_R;
    AudioOutputI2S           i2s_out;
    
    AudioControlSGTL5000     sgtl5000_1;
    
    AudioConnection          patchCord1(i2s_in, 0, Q_in_L, 0);
    AudioConnection          patchCord2(i2s_in, 1, Q_in_R, 0);
    
    AudioConnection          patchCord3(Q_out_L, 0, mixleft, 0);
    AudioConnection          patchCord4(Q_out_R, 0, mixright, 0);
    AudioConnection          patchCord9(mixleft, 0,  i2s_out, 1);
    AudioConnection          patchCord10(mixright, 0, i2s_out, 0);
    
    
    void setup() {
      Serial.begin(115200);
      delay(100);
    
      AudioMemory(60); // must be high enough to deliver 16 stereo blocks = 32 * 128 samples at any point in time! 
      delay(100);
    
      /****************************************************************************************
         Audio Shield Setup
      ****************************************************************************************/
      sgtl5000_1.enable();
      sgtl5000_1.inputSelect(AUDIO_INPUT_LINEIN);
      sgtl5000_1.adcHighPassFilterDisable(); // does not help too much!
      sgtl5000_1.lineInLevel(14);
      sgtl5000_1.lineOutLevel(24);
      mixleft.gain(0, 1.0);
      mixright.gain(0, 1.0);
      sgtl5000_1.volume(0.5); 
    
      /****************************************************************************************
         set filter bandwidth
      ****************************************************************************************/
      // this routine does all the magic of calculating the FIR coeffs
      calc_cplx_FIR_coeffs (FIR_Coef_I, FIR_Coef_Q, m_NumTaps, FLoCut, FHiCut, SAMPLE_RATE);
    
      /****************************************************************************************
         init complex FFTs
      ****************************************************************************************/
      switch (FFT_length)
      {
        case 256:
          S = &arm_cfft_sR_f32_len256;
          iS = &arm_cfft_sR_f32_len256;
          maskS = &arm_cfft_sR_f32_len256;
          break;
        case 512:
          S = &arm_cfft_sR_f32_len512;
          iS = &arm_cfft_sR_f32_len512;
          maskS = &arm_cfft_sR_f32_len512;
          break;
        case 1024:
          S = &arm_cfft_sR_f32_len1024;
          iS = &arm_cfft_sR_f32_len1024;
          maskS = &arm_cfft_sR_f32_len1024;
          break;
        case 2048:
          S = &arm_cfft_sR_f32_len2048;
          iS = &arm_cfft_sR_f32_len2048;
          maskS = &arm_cfft_sR_f32_len2048;
          break;
        case 4096:
          S = &arm_cfft_sR_f32_len4096;
          iS = &arm_cfft_sR_f32_len4096;
          maskS = &arm_cfft_sR_f32_len4096;
          break;
      }
    
      /****************************************************************************************
         Calculate the FFT of the FIR filter coefficients once to produce the FIR filter mask
      ****************************************************************************************/
      init_filter_mask();
    
      /****************************************************************************************
         begin to queue the audio from the audio library
      ****************************************************************************************/
      delay(100);
      Q_in_L.begin();
      Q_in_R.begin();
    
    } // END OF SETUP
    
    
    void loop() {
      elapsedMicros usec = 0;
      // are there at least N_BLOCKS buffers in each channel available ?
        if (Q_in_L.available() > N_BLOCKS + 0 && Q_in_R.available() > N_BLOCKS + 0)
        {
          // get audio samples from the audio  buffers and convert them to float
          for (unsigned i = 0; i < N_BLOCKS; i++)
          {
            sp_L = Q_in_L.readBuffer();
            sp_R = Q_in_R.readBuffer();
    
            // convert to float one buffer_size
            // float_buffer samples are now standardized from > -1.0 to < 1.0
            arm_q15_to_float (sp_L, &float_buffer_L[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
            arm_q15_to_float (sp_R, &float_buffer_R[BUFFER_SIZE * i], BUFFER_SIZE); // convert int_buffer to float 32bit
            Q_in_L.freeBuffer();
            Q_in_R.freeBuffer();
          }
     
          /**********************************************************************************
              Digital convolution
           **********************************************************************************/
          //  basis for this was Lyons, R. (2011): Understanding Digital Processing.
          //  "Fast FIR Filtering using the FFT", pages 688 - 694
          //  numbers for the steps taken from that source
          //  Method used here: overlap-and-save
    
          // ONLY FOR the VERY FIRST FFT: fill first samples with zeros
          if (first_block) // fill real & imaginaries with zeros for the first BLOCKSIZE samples
          {
            for (unsigned i = 0; i < BUFFER_SIZE * N_BLOCKS; i++)
            {
              FFT_buffer[i] = 0.0;
            }
            first_block = 0;
          }
          else
          {  // HERE IT STARTS for all other instances
            // fill FFT_buffer with last events audio samples
            for (unsigned i = 0; i < BUFFER_SIZE * N_BLOCKS; i++)
            {
              FFT_buffer[i * 2] = last_sample_buffer_L[i]; // real
              FFT_buffer[i * 2 + 1] = last_sample_buffer_R[i]; // imaginary
            }
          }
          // copy recent samples to last_sample_buffer for next time!
          for (unsigned i = 0; i < BUFFER_SIZE * N_BLOCKS; i++)
          {
            last_sample_buffer_L [i] = float_buffer_L[i];
            last_sample_buffer_R [i] = float_buffer_R[i];
          }
    
          // now fill recent audio samples into FFT_buffer (left channel: re, right channel: im)
          for (unsigned i = 0; i < BUFFER_SIZE * N_BLOCKS; i++)
          {
            FFT_buffer[FFT_length + i * 2] = float_buffer_L[i]; // real
            FFT_buffer[FFT_length + i * 2 + 1] = float_buffer_R[i]; // imaginary
          }
    
          /**********************************************************************************
              Complex Forward FFT
           **********************************************************************************/
          // calculation is performed in-place the FFT_buffer [re, im, re, im, re, im . . .]
          arm_cfft_f32(S, FFT_buffer, 0, 1);
    
          /**********************************************************************************
              Complex multiplication with filter mask (precalculated coefficients subjected to an FFT)
           **********************************************************************************/
          arm_cmplx_mult_cmplx_f32 (FFT_buffer, FIR_filter_mask, iFFT_buffer, FFT_length);
    
          /**********************************************************************************
              Complex inverse FFT
           **********************************************************************************/
          arm_cfft_f32(iS, iFFT_buffer, 1, 1);
    
          /**********************************************************************************
              Overlap and save algorithm, which simply means yˇu take only the right part of the buffer and discard the left part
           **********************************************************************************/
            for (unsigned i = 0; i < FFT_length / 2; i++)
            {
              float_buffer_L[i] = iFFT_buffer[FFT_length + i * 2];
              float_buffer_R[i] = iFFT_buffer[FFT_length + i * 2 + 1];
            }
           /**********************************************************************************
              Demodulation / manipulation / do whatever you want 
           **********************************************************************************/
          //    at this time, just put filtered audio (interleaved format, overlap & save) into left and right channel       
    
           /**********************************************************************
              CONVERT TO INTEGER AND PLAY AUDIO - Push audio into I2S audio chain
           **********************************************************************/
          for (int i = 0; i < N_BLOCKS; i++)
            {
              sp_L = Q_out_L.getBuffer();    
              sp_R = Q_out_R.getBuffer();
              arm_float_to_q15 (&float_buffer_L[BUFFER_SIZE * i], sp_L, BUFFER_SIZE); 
              arm_float_to_q15 (&float_buffer_R[BUFFER_SIZE * i], sp_R, BUFFER_SIZE);
              Q_out_L.playBuffer(); // play it !  
              Q_out_R.playBuffer(); // play it !
            }
    
           /**********************************************************************************
              PRINT ROUTINE FOR ELAPSED MICROSECONDS
           **********************************************************************************/
    #ifdef DEBUG
          sum = sum + usec;
          idx_t++;
          if (idx_t > 40) {
            mean = sum / idx_t;
            if (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT < 100.0)
            {
              Serial.print("processor load:  ");
              Serial.print (mean / 29.00 / N_BLOCKS * SAMPLE_RATE / AUDIO_SAMPLE_RATE_EXACT);
              Serial.println("%");
            }
            else
            {
              Serial.println("100%");
            }
            Serial.print (mean);
            Serial.print (" microsec for ");
            Serial.print (N_BLOCKS);
            Serial.print ("  stereo blocks    ");
            Serial.print("FFT-length = "); Serial.print(FFT_length);
            Serial.print(";   FIR filter length = "); Serial.println(m_NumTaps);
    
            idx_t = 0;
            sum = 0;
          }
    #endif
        } // end of audio process loop
        
          /**********************************************************************************
              Add button check etc. here
           **********************************************************************************/
    
    }
    
    //////////////////////////////////////////////////////////////////////
    //  Call to setup complex filter parameters [can process two channels at the same time!]
    //  The two channels could be stereo audio or I and Q channel in a Software Defined Radio
    // SampleRate in Hz
    // FLowcut is low cutoff frequency of filter in Hz
    // FHicut is high cutoff frequency of filter in Hz
    // Offset is the CW tone offset frequency
    // cutoff frequencies range from -SampleRate/2 to +SampleRate/2
    //  HiCut must be greater than LowCut
    //    example to make 2700Hz USB filter:
    //  SetupParameters( 100, 2800, 0, 48000);
    //////////////////////////////////////////////////////////////////////
    
    void calc_cplx_FIR_coeffs (double * coeffs_I, double * coeffs_Q, int numCoeffs, double FLoCut, double FHiCut, double SampleRate)
    // pointer to coefficients variable, no. of coefficients to calculate, frequency where it happens, stopband attenuation in dB,
    // filter type, half-filter bandwidth (only for bandpass and notch)
    {
      //calculate some normalized filter parameters
      double nFL = FLoCut / SampleRate;
      double nFH = FHiCut / SampleRate;
      double nFc = (nFH - nFL) / 2.0; //prototype LP filter cutoff
      double nFs = PI * (nFH + nFL); //2 PI times required frequency shift (FHiCut+FLoCut)/2
      double fCenter = 0.5 * (double)(numCoeffs - 1); //floating point center index of FIR filter
    
      for (int i = 0; i < numCoeffs; i++) //zero pad entire coefficient buffer
      {
        coeffs_I[i] = 0.0;
        coeffs_Q[i] = 0.0;
      }
    
      //create LP FIR windowed sinc, sin(x)/x complex LP filter coefficients
      for (int i = 0; i < numCoeffs; i++)
      {
        double x = (float32_t)i - fCenter;
        double z;
        if ( abs((double)i - fCenter) < 0.01) //deal with odd size filter singularity where sin(0)/0==1
          z = 2.0 * nFc;
        else
          switch (FIR_filter_window) {
            case 1:    // 4-term Blackman-Harris --> this is what Power SDR uses
              z = (double)sin(TWO_PI * x * nFc) / (PI * x) *
                  (0.35875 - 0.48829 * cos( (TWO_PI * i) / (numCoeffs - 1) )
                   + 0.14128 * cos( (FOURPI * i) / (numCoeffs - 1) )
                   - 0.01168 * cos( (SIXPI * i) / (numCoeffs - 1) ) );
              break;
            case 2:
              z = (double)sin(TWO_PI * x * nFc) / (PI * x) *
                  (0.355768 - 0.487396 * cos( (TWO_PI * i) / (numCoeffs - 1) )
                   + 0.144232 * cos( (FOURPI * i) / (numCoeffs - 1) )
                   - 0.012604 * cos( (SIXPI * i) / (numCoeffs - 1) ) );
              break;
            case 3: // cosine
              z = (double)sin(TWO_PI * x * nFc) / (PI * x) *
                  cos((PI * (float32_t)i) / (numCoeffs - 1));
              break;
            case 4: // Hann
              z = (double)sin(TWO_PI * x * nFc) / (PI * x) *
                  0.5 * (double)(1.0 - (cos(PI * 2 * (double)i / (double)(numCoeffs - 1))));
              break;
            default: // Blackman-Nuttall window
              z = (double)sin(TWO_PI * x * nFc) / (PI * x) *
                  (0.3635819
                   - 0.4891775 * cos( (TWO_PI * i) / (numCoeffs - 1) )
                   + 0.1365995 * cos( (FOURPI * i) / (numCoeffs - 1) )
                   - 0.0106411 * cos( (SIXPI * i) / (numCoeffs - 1) ) );
              break;
          }
        //shift lowpass filter coefficients in frequency by (hicut+lowcut)/2 to form bandpass filter anywhere in range
        coeffs_I[i]  =  z * cos(nFs * x);
        coeffs_Q[i] = z * sin(nFs * x);
      }
    }
    
    
    void init_filter_mask()
    {
      /****************************************************************************************
         Calculate the FFT of the FIR filter coefficients once to produce the FIR filter mask
      ****************************************************************************************/
      // the FIR has exactly m_NumTaps and a maximum of (FFT_length / 2) + 1 taps = coefficients, so we have to add (FFT_length / 2) -1 zeros before the FFT
      // in order to produce a FFT_length point input buffer for the FFT
      // copy coefficients into real values of first part of buffer, rest is zero
      for (unsigned i = 0; i < m_NumTaps; i++)
      {
        FIR_filter_mask[i * 2] = FIR_Coef_I [i];
        FIR_filter_mask[i * 2 + 1] = FIR_Coef_Q [i];
      }
    
      for (unsigned i = FFT_length + 1; i < FFT_length * 2; i++)
      {
        FIR_filter_mask[i] = 0.0;
      }
      // FFT of FIR_filter_mask
      // perform FFT (in-place), needs only to be done once (or every time the filter coeffs change)
      arm_cfft_f32(maskS, FIR_filter_mask, 0, 1);
    
    } // end init_filter_mask
    The code takes stereo I2S input and performs bandpass filtering of both channels simultaneously (in one complex FFT/cplxmultiply/iFFT operation) and outputs them as I2S output to the SGTL5000 outputs.

    You can choose the FFT size (equivalent to steepness of the filter skirts / number of FIR filter taps). It uses fast convolution with overlap & save and 50% overlap and 4-term-Blackman-Harris-windowing. You can alter the FFT size [256, 512, 1024, 2048, 4096], which is equivalent to number of FIR taps [129, 513, 1025, 2049] and the FFT window. Also try to change the lower and upper bandpass cutoff frequency for the filter.

    The filter response with a 4096-sized FFT has skirts so steep, that you will never want any larger filter response ;-).

    The processor load is so amazingly low, that I can hardly believe it (they are output in the serial monitor). Maybe someone would be willing to check, if I did the math right in the sketch . . . Here are the figures:

    FFT size no. of FIR taps processing time no. of AUDIO Blocks processor load memory used
    256 129 69Ásec 1 2.38% 160kbytes
    512 257 133Ásec 2 2.29% 173kbytes
    1024 513 303Ásec 4 2.61% 193kbytes
    2048 1025 639Ásec 8 2.75% 234kbytes
    4096 2049 1249Ásec 16 2.69% 316kbytes

    Amazingly, the processor load does not differ much between the different FFT sizes and even drops when going from 1025 to 2049 FIR taps. However, memory usage is excessive!!! Can anybody explain this? It is not really fully explained by sum of the sizes of the assigned variables in the sketch . . .

    All the best,

    Frank DD4WH


    P.S.: if the figures for the processor load are really calculated correctly, we would really have to think about using fast convolution filtering for even the smallest filtering tasks, eg. a parametric Equalizer, this could solve the problems associated with low frequency IIR filtering, for example.
    Last edited by DD4WH; 08-17-2019 at 07:02 PM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •