Teensy 4.1 Accessing PSRAM quickly using EXTMEM

AndyB

Member
Hello!

I'm working on a project that requires implementing a large number of digital low-pass filters (moving averages) online. I use an array of uint16_t ring buffers to store these moving averages and the supporting data members (current sum, size, etc.). I used to keep all of this in either RAM1 or RAM2, depending on which one had the most free space, but I wanted to extend the maximum filter length by using the external memory chip option.

The problem I'm facing is that reading and writing PSRAM takes time (I believe the communication is handled over SPI, if I'm not mistaken), and since this is a real-time system, microseconds (or even tens of nanoseconds) can hurt performance by introducing delays in my ADC sampling loop. What I'm seeing is that when I move my ring buffer array from RAM1 over to PSRAM using the EXTMEM keyword, the sampling period becomes much more variable: what used to be a 60us period +-1us is now ~80us +-20us.

Relevant Code:

Here is the ring buffer class (modified lib from Jean-Luc)
Code:
/*
 * Ring Buffer Library for Arduino
 *
 * Copyright Jean-Luc Béchennec 2018
 *
 * This software is distributed under the GNU Public Licence v2 (GPLv2)
 *
 * Please read the LICENCE file
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 * DEALINGS IN THE SOFTWARE.
 */

 /*
  * Note about interrupt safe implementation
  *
  * To be safe from interrupts, a sequence of C instructions must be framed
  * by a pair of interrupt disable and enable instructions and ensure that the
  * compiler will not move writing of variables to memory outside the protected
  * area. This is called a critical section. Usually the manipulated variables
  * receive the volatile qualifier so that any changes are immediately written
  * to memory. Here the approach is different. First of all you have to know
  * that volatile is useless if the variables are updated in a function and
  * that this function is called within the critical section. Indeed, the
  * semantics of the C language require that the variables in memory be updated
  * before returning from the function. But beware of function inlining because
  * the compiler may decide to delete a function call in favor of simply
  * inserting its code in the caller. To force the compiler to use a real
  * function call, __attribute__((noinline)) have been added to the push and
  * pop functions. In this way the lockedPush and lockedPop functions ensure
  * that in the critical section a push and pop function call respectively will
  * be used by the compiler. This ensures that, because of the function call,
  * the variables are written to memory in the critical section and also
  * ensures that, despite the reorganization of the instructions due to
  * optimizations, the critical section will be well opened and closed at the
  * right place because function calls, due to potential side effects, are not
  * subject to such reorganizations.
  */

#ifndef __RINGBUF_H__
#define __RINGBUF_H__

#include <Arduino.h>

/*
 * Set the integer size used to store the size of the buffer according to
 * the size given in the template instantiation. Thanks to Niklas Gürtler
 * for sharing his knowledge of C++ template meta programming.
 * https://niklas-guertler.de/
 *
 * The Index<(S > 255)> argument below is true when the size does NOT fit
 * in a uint8_t. If the argument is true, the ring buffer stores its size
 * and index in a uint16_t (Type below) because its size is within
 * [256,65535]; intermediate computation may need a uint32_t (BiggerType
 * below). If the argument is false, the size and index are stored in a
 * uint8_t (Type below) because the size is within [1,255]; intermediate
 * computation may need a uint16_t (BiggerType below).
 */

namespace RingBufHelper {
  template<bool fits_in_uint8_t> struct Index {
    using Type = uint16_t;        /* index of the buffer */
    using BiggerType = uint32_t;  /* for intermediate calculation */
  };
  template<> struct Index<false> {
    using Type = uint8_t;         /* index of the buffer */
    using BiggerType = uint16_t;  /* for intermediate calculation */
  };
}

template <
  typename ET,
  size_t S,
  typename IT = typename RingBufHelper::Index<(S > 255)>::Type,
  typename BT = typename RingBufHelper::Index<(S > 255)>::BiggerType
>
class RingBuf
{
  /*
   * check the size is greater than 0, otherwise emit a compile time error
   */
  static_assert(S > 0, "RingBuf with size 0 are forbidden");

  /*
   * check the size is lower or equal to the maximum uint16_t value,
   * otherwise emit a compile time error
   */
  static_assert(S <= UINT16_MAX, "RingBuf with size greater than 65535 are forbidden");

private:
  ET mBuffer[S];
  IT mReadIndex;
  IT mSize;
  uint32_t curSum = 0;

  IT writeIndex();

public:
  /* Constructor. Init mReadIndex to 0 and mSize to 0 */
  RingBuf();
  //Return the curSum value
  uint32_t getCurSum();
  /* Push a large number of elements to the end of the buffer. */
  bool largePush(const ET * const inElement, uint16_t numElements);
  /* Push a data at the end of the buffer */
  bool push(const ET inElement) __attribute__ ((noinline));
  /* Push a data at the end of the buffer. Copy it from its pointer */
  bool push(const ET * const inElement) __attribute__ ((noinline));
  /* Push a data at the end of the buffer with interrupts disabled */
  bool lockedPush(const ET inElement);
  /* Push a data at the end of the buffer with interrupts disabled. Copy it from its pointer */
  bool lockedPush(const ET * const inElement);
  /* Pop a chunk of data (of size numElements) from the ring buffer */
  bool largePop(ET *outElement, uint16_t numElements);
  /* Pop the data at the beginning of the buffer */
  bool pop(ET &outElement) __attribute__ ((noinline));
  /* Pop the data at the beginning of the buffer with interrupt disabled */
  bool lockedPop(ET &outElement);
  /* Return true if the buffer is full */
  bool isFull()  { return mSize == S; }
  /* Return true if the buffer is empty */
  bool isEmpty() { return mSize == 0; }
  /* Reset the buffer  to an empty state */
  void clear()   { mSize = 0; }
  /* return the size of the buffer */
  IT size() { return mSize; }
  /* return the maximum size of the buffer */
  IT maxSize() { return S; }
  /* access the buffer using array syntax, not interrupt safe */
  ET &operator[](IT inIndex);

  bool peek(ET &outElement, const std::size_t distance = 0)  __attribute__ ((noinline));
  bool lockedPeek(ET &outElement, const std::size_t distance = 0);
};

template <typename ET, size_t S, typename IT, typename BT>
IT RingBuf<ET, S, IT, BT>::writeIndex()
{
 BT wi = (BT)mReadIndex + (BT)mSize;
 if (wi >= (BT)S) wi -= (BT)S;
 return (IT)wi;
}

template <typename ET, size_t S, typename IT, typename BT>
RingBuf<ET, S, IT, BT>::RingBuf() :
mReadIndex(0),
mSize(0)
{
}

template <typename ET, size_t S, typename IT, typename BT>
uint32_t RingBuf<ET, S, IT, BT>::getCurSum()
{
  return curSum;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::push(const ET inElement)
{
  if (isFull()) return false;
  mBuffer[writeIndex()] = inElement;
  mSize++;
  curSum += inElement;//May want to check here for overflow?
  return true;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::largePush(const ET * const inElement, uint16_t numElements)
{
  if ((uint32_t)mSize + (uint32_t)numElements >= S) return false;
  //mBuffer[writeIndex()] = *inElement;
  uint16_t wi = writeIndex();
  uint32_t tmp = (uint32_t)wi + (uint32_t)numElements;
  uint32_t size1 = numElements;
  uint32_t size2 = 0;
  if(tmp >= S){
    //Need to handle the case when the write index wraps and we must copy two chunks of data from inElement
    size2 = tmp-S;
    size1 = numElements - size2;
    memcpy(&mBuffer[wi], inElement, size1*sizeof(ET));//scale by the element size (2 bytes for uint16_t)
    memcpy(&mBuffer[0], inElement+size1, size2*sizeof(ET));//wi+size1 == S here, so the second chunk starts at index 0
  }
  else{
    memcpy(&mBuffer[wi], inElement, numElements*sizeof(ET));
  }
  mSize += numElements;//Note: unlike push(), curSum is not updated here
  return true;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::push(const ET * const inElement)
{
  if (isFull()) return false;
  mBuffer[writeIndex()] = *inElement;
  mSize++;
  return true;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::lockedPush(const ET inElement)
{
  noInterrupts();
  bool result = push(inElement);
  interrupts();
  return result;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::lockedPush(const ET * const inElement)
{
  noInterrupts();
  bool result = push(inElement);
  interrupts();
  return result;
}

//Need to check if the number of elements we want to pop actually are in the ring buffer

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::largePop(ET *outElement, uint16_t numElements)
{
  if (uint32_t(mSize) < uint32_t(numElements)) return false;
  uint32_t tmp = (uint32_t)mReadIndex + (uint32_t)numElements;
  uint32_t size1 = numElements;
  uint32_t size2 = 0;
  if(tmp >= S){
    //Need to handle the case when the read index wraps and we must copy two chunks of data into outElement
    size2 = tmp - S;
    size1 = numElements - size2;
    memcpy(outElement, &mBuffer[mReadIndex], size1*sizeof(ET));
    memcpy(outElement+size1, &mBuffer[0], size2*sizeof(ET));//mReadIndex+size1 == S here, so the second chunk starts at index 0
  }
  else{
    memcpy(outElement, &mBuffer[mReadIndex], numElements*sizeof(ET));
  }
  //Serial.write((byte*)mBuffer[mReadIndex], numElements*2);
  mReadIndex = ((uint32_t)mReadIndex + (uint32_t)numElements);
  if (mReadIndex >= S) mReadIndex -= S;
  mSize -= numElements;//Note: unlike pop(), curSum is not updated here
  return true;
}


template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::pop(ET &outElement)
{
  if (isEmpty()) return false;
  outElement = mBuffer[mReadIndex];
  mReadIndex++;
  mSize--;
  curSum -= uint32_t(outElement);
  if (mReadIndex == S) mReadIndex = 0;
  return true;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::lockedPop(ET &outElement)
{
  noInterrupts();
  bool result = pop(outElement);
  interrupts();
  return result;
}



template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::peek(ET &outElement, const std::size_t distance)
{
  if (isEmpty() || size() <= distance) return false; /* valid distances are 0 .. size()-1 */
  //Take care of the wrap around
  std::size_t temp_read_index = mReadIndex + distance;
  if(temp_read_index >= S) {
    temp_read_index -= S;
  }

  outElement = mBuffer[temp_read_index];
  return true;
}

template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::lockedPeek(ET &outElement, const std::size_t distance)
{
  noInterrupts();
  bool result = peek(outElement,distance);
  interrupts();
  return result;
}

template <typename ET, size_t S, typename IT, typename BT>
ET &RingBuf<ET, S, IT, BT>::operator[](IT inIndex)
{
  if (inIndex >= mSize) return mBuffer[0];
  BT index = (BT)mReadIndex + (BT)inIndex;
  if (index >= (BT)S) index -= (BT)S;
  return mBuffer[(IT)index];
}

#endif /* __RINGBUF_H__ */

Here is the ring buffer declaration:

Code:
// Rolling average allocation space size
#define maxRollingAvgLengthFilter 8000
EXTMEM RingBuf<uint16_t, maxRollingAvgLengthFilter> rollingAvgDAC[16][16];

And an example of usage (noting that typically the "i" here is defined within some loop, and this code would be contained within a helper function in the main code):
Code:
i = 0;
uint32_t tmpRaDAC = 0, tmpSize = rollingAvgDAC[curCol][i].size();
tmpRaDAC = ((rollingAvgDAC[curCol][i].getCurSum() + tmpSize/2) / tmpSize); //Calculate the new filtered DAC value


My question is this:

Other than reducing the number of memory accesses (i.e., function calls that read or write the buffer), is there a way I can speed up or maximize the efficiency of memory access so that I don't run into severe latency issues here? One thought was to move only the ring buffer storage onto PSRAM and keep all the helper data members in RAM, but I'm not exactly sure how to do this, since marking data members as EXTMEM inside a class definition throws an error at compile time.

Looking at the datasheet for the PSRAM chip (APS6404L_3SQR), it seems there is a fast-read mode and various clock settings as well, so perhaps there's room for some optimization here? Any input would be greatly appreciated!

Thanks!
Andy
 
Awesome, thanks for your reply! Do you know how prefetching works? I'll look into this today, but if it works by copying big chunks of sequentially stored data over to a queue or some other memory space, then I might take a hit on performance depending on how things are handled. My crystal ball is showing me big lag spikes in my future haha.
 
Hey Joepasquariello,

It's a good question and one that I've been exploring. The nice thing with the moving average is it is extremely fast to implement. I need the filter calculations to be fast because I am using the filtered signal as feedback for an acquisition system. Basically there is a DAC which "servos" current away from the input of a high gain transimpedance amplifier. So since the digital filter is in that feedback loop, if it introduces too much lag then we directly slow down the system's sampling rate.

There are many problems with the moving average though, one being memory usage, and another being the frequency response (high side lobes).

If you have any suggestions on filters I could check out, I'm all ears!
 
How about this as another take on moving averaging?
It takes up some code space but no array space. The downside is that it only removes a fraction of the average each time, rather than dropping an individual high or low reading, but it may be sufficient for your needs.
I am sure it could be further enhanced; it's just presented as an idea.
Code:
const uint32_t movingAverageNumReadings = 10;

// The running average is kept scaled by movingAverageNumReadings so the
// integer arithmetic does not truncate to zero.
uint32_t movingAverageScaled = 0;
uint32_t readingCount = 0;

uint32_t GetReading() {
	// Code to obtain reading
	return digitalRead(1);
}

void UpdateMovingAverage() {

	if (readingCount < movingAverageNumReadings) {
		readingCount++;
	} else {
		// Remove 1/N of the current (scaled) average
		movingAverageScaled -= movingAverageScaled / movingAverageNumReadings;
	}
	movingAverageScaled += GetReading();
}

void setup() {

}

void loop() {
	UpdateMovingAverage();
	uint32_t movingAverage = movingAverageScaled / movingAverageNumReadings;
	if (movingAverage > 652) { // some number

	}
}
 
Do you mean "fast to implement" (write the code) or "fast to execute"? I'm pretty sure a simple lag filter or even 2nd-order low-pass would be at least as fast to execute (because the memory would be in fast RAM) and would require a tiny fraction of the memory.
 
Awesome, thanks for your reply! Do you know how prefetching works? I'll look into this today, but if it works by copying big chunks of sequentially stored data over to a queue or some other memory space, then I might take a hit on performance depending on how things are handled. My crystal ball is showing me big lag spikes in my future haha.

The FlexSPI module (used to map the PSRAM chip into the memory space) has internal buffers for prefetching: when a read happens, it will automatically fill the buffer by doing SPI reads to read ahead. This comes at basically no cost, since any prefetch reads are suspended if a new read/write transaction arrives while they are in progress. If a read hits the prefetch buffer, FlexSPI can return the data immediately without doing any SPI activity.
There's a buffer for writing too, and that one is active by default; that's why, if you run memory speed tests on the PSRAM area, raw writing is usually faster than raw reading.

The CPU cache also plays a part: cache lines are 32 bytes long, so keeping any critical data structures to a small multiple of this size can help.
If you wanted to implement the idea of having only the data array in PSRAM, you'd have to make mBuffer a pointer and allocate it using extmem_malloc() in the constructor (and free it in the destructor, obviously).
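In case it helps, here's a rough, untested sketch of that change (a trimmed-down ring buffer, not the full RingBuf class above; extmem_malloc()/extmem_free() are the Teensy 4 core's PSRAM heap allocators, and DMAMEM in the usage comment is just one possible placement for the objects themselves):
Code:
#include <Arduino.h>

// Untested sketch: same idea as the RingBuf above, but only the sample storage
// is taken from the PSRAM heap via extmem_malloc(). The small bookkeeping
// members (read index, size, running sum) stay wherever the object is placed.
template <typename ET, size_t S>
class ExtRingBuf
{
  ET *mBuffer = nullptr;   // PSRAM-backed storage instead of an inline array
  uint16_t mReadIndex = 0;
  uint16_t mSize = 0;
  uint32_t curSum = 0;

  uint16_t writeIndex() {
    uint32_t wi = (uint32_t)mReadIndex + mSize;
    if (wi >= S) wi -= S;
    return (uint16_t)wi;
  }

public:
  ExtRingBuf()  { mBuffer = (ET *)extmem_malloc(S * sizeof(ET)); }
  ~ExtRingBuf() { if (mBuffer) extmem_free(mBuffer); }

  bool push(const ET inElement) {
    if (mSize == S || mBuffer == nullptr) return false;
    mBuffer[writeIndex()] = inElement;   // the only PSRAM access in this call
    mSize++;
    curSum += inElement;
    return true;
  }

  bool pop(ET &outElement) {
    if (mSize == 0) return false;
    outElement = mBuffer[mReadIndex];    // the only PSRAM access in this call
    if (++mReadIndex == S) mReadIndex = 0;
    mSize--;
    curSum -= outElement;
    return true;
  }

  uint32_t getCurSum() { return curSum; }
  uint16_t size()      { return mSize; }
};

// The objects themselves (a few bytes each) can then live in ordinary RAM, e.g.:
//   DMAMEM ExtRingBuf<uint16_t, 8000> rollingAvgDAC[16][16];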
 
If you have any suggestions on filters I could check out, I'm all ears!

A simple exponential filter is 2 multiplies and 1 addition (output = A*input + (1-A)*output). If the code and data were all in fast RAM, wouldn't that be faster than a moving average with data in PSRAM, which means at least 1 read and 1 write to PSRAM per update?
 
Do you mean "fast to implement" (write the code) or "fast to execute"? I'm pretty sure a simple lag filter or even 2nd-order low-pass would be at least as fast to execute (because the memory would be in fast RAM) and would require a tiny fraction of the memory.

I did mean fast execution, not quick development time. I suppose the IIR approach wouldn't be terribly slow, so I've taken another crack at it. I'm having a lot of trouble getting a stable filter, so I imagine this will be difficult to implement, especially because I want to give the user control over the cutoff frequency and the sampling rate. I am starting by designing the filter in MATLAB and then quantizing the coefficients for a single sampling rate and cutoff frequency, and even with all of this effort I can't seem to get a stable output.

The FlexSPI module (used to map the PSRAM chip into the memory space) has internal buffers for prefetching: when a read happens, it will automatically fill the buffer by doing SPI reads to read ahead. This comes at basically no cost, since any prefetch reads are suspended if a new read/write transaction arrives while they are in progress. If a read hits the prefetch buffer, FlexSPI can return the data immediately without doing any SPI activity.
There's a buffer for writing too, and that one is active by default; that's why, if you run memory speed tests on the PSRAM area, raw writing is usually faster than raw reading.

The CPU cache also plays a part: cache lines are 32 bytes long, so keeping any critical data structures to a small multiple of this size can help.
If you wanted to implement the idea of having only the data array in PSRAM, you'd have to make mBuffer a pointer and allocate it using extmem_malloc() in the constructor (and free it in the destructor, obviously).

Thanks for the info! This explains why I see these semi-periodic jumps in the sampling rate. The ring buffer I am using takes up 512 bytes per element (or rather 1024 bytes per two elements, more precisely), so at least this is a multiple of the 32-byte cache line. Regarding your idea of making mBuffer a pointer and allocating it with extmem_malloc(), do you know how this would be handled in terms of prefetching or the cache? For example, if I am calling a helper function (e.g. push()) on the ring buffer object, and within this function I am loading the passed data into this external memory space, would this now only require a single SPI write operation?

Code:
template <typename ET, size_t S, typename IT, typename BT>
bool RingBuf<ET, S, IT, BT>::push(const ET inElement)
{
  if (isFull()) return false;
  mBuffer[writeIndex()] = inElement;
  mSize++;
  curSum += inElement;//May want to check here for overflow?
  return true;
}

I believe as it stands now, we have an SPI read for the isFull() call, and an SPI write for the mBuffer update, mSize increment, and curSum increment. So if I can reduce the number of SPI reads/writes down to one write and one read per push/pop call respectively, this should drastically improve performance. What are your thoughts on this?

A simple exponential filter is 2 multiplies and 1 addition (output = A*input + (1-A)*output). If the code and data were all in fast RAM, wouldn't that be faster than a moving average with data in PSRAM, which means at least 1 read and 1 write to PSRAM per update?

I am worried about the floating point arithmetic but maybe I shouldn't be... But yeah as I mentioned before, the stability, ease of calculating filter coefficients, and filter coefficient quantization are all challenges here.
 
Thanks for the info! This explains why I see these semi-periodic jumps in the sampling rate. The ring buffer I am using takes up 512 bytes per element (or rather 1024 bytes per two elements, more precisely), so at least this is a multiple of the 32-byte cache line. Regarding your idea of making mBuffer a pointer and allocating it with extmem_malloc(), do you know how this would be handled in terms of prefetching or the cache? For example, if I am calling a helper function (e.g. push()) on the ring buffer object, and within this function I am loading the passed data into this external memory space, would this now only require a single SPI write operation?

Unfortunately not. Due to limitations of this particular ARM CPU, as soon as you write any value it will read the existing cache line from memory (to preserve the unwritten portion). There's no way around this, even if you know the entire cache line is going to be rewritten, unless you refactor the code so that the CPU isn't doing the writing, e.g. write to a temporary buffer in main memory and then trigger a DMA copy to clone it to EXTMEM while you move on to the next element. I don't think this would be feasible for a ring buffer due to DMA alignment restrictions.
 
I did mean fast execution, not quick development time. I suppose the IIR approach wouldn't be terribly slow, so I've taken another crack at it. I'm having a lot of trouble getting a stable filter, so I imagine this will be difficult to implement, especially because I want to give the user control over the cutoff frequency and the sampling rate. I am starting by designing the filter in MATLAB and then quantizing the coefficients for a single sampling rate and cutoff frequency, and even with all of this effort I can't seem to get a stable output.

I am worried about the floating point arithmetic but maybe I shouldn't be... But yeah as I mentioned before, the stability, ease of calculating filter coefficients, and filter coefficient quantization are all challenges here.

If you use a first-order low-pass, there are no stability issues. You can use floating-point, and you can modify the coefficients to account for whatever break frequency and sample frequency you want. Would 300 ns per update be acceptable? That's what I'm getting in a simple test. An exponential filter would probably be only about 100 ns.
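For what it's worth, here is a minimal sketch of the kind of first-order low-pass I mean (names are illustrative, not from your code; alpha comes from the usual RC/exponential mapping of cutoff frequency and sample rate):
Code:
#include <Arduino.h>

// Minimal first-order low-pass sketch. State and coefficient live in fast RAM,
// so each update is just a subtract, a multiply and an add.
struct LowPass1 {
  float alpha = 1.0f;   // smoothing factor, 0 < alpha <= 1
  float y = 0.0f;       // previous output

  // fc = cutoff frequency (Hz), fs = sample rate (Hz)
  void setCutoff(float fc, float fs) {
    float rc = 1.0f / (2.0f * 3.1415927f * fc);
    float dt = 1.0f / fs;
    alpha = dt / (rc + dt);
  }

  float update(float x) {
    y += alpha * (x - y);
    return y;
  }
};

// Example: roughly 30 Hz cutoff at a 60 us (about 16.7 kHz) sample period:
//   LowPass1 lp;
//   lp.setCutoff(30.0f, 1.0f / 60e-6f);
//   float filtered = lp.update((float)adcValue);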
 
A simple exponential filter is 2 multiplies and 1 addition (output = A*input + (1-A)*output). If the code and data were all in fast RAM, wouldn't that be faster than a moving average with data in PSRAM, which means at least 1 read and 1 write to PSRAM per update?

Can be done with only 1 multiply:

output += A * (input-output);

This structure avoids having 1-A in addition to A. It also allows fast execution on hardware without a multiplier by choosing A to have only a few bits, allowing shifts to implement the multiply. For instance:

output += (input - output) >> 8; // A = 1/256, input and output are fixed point, not floating point.

although it's much better to have unbiased operation:

output += (input - output + 0x80) >> 8; // effectively rounded rather than truncated
 
Good points, Mark. The OP mentioned concern over quantization of coefficients, so I thought he wanted to use float, but either way it will be very fast.
 
float has only a 24-bit mantissa, less than an int used as a fixed-point type. Quite often in DSP, fixed point is a good fit, as the range is strictly defined for an ADC or DAC. But it's more cumbersome to work with.
 
Yes, agree. The OP said he needed variable sample rates and break frequencies, but didn't provide actual ranges, so I was guessing that the 8-bit coefficients in your fixed-point example wouldn't be enough. Who knows. He has some options now.
 
Hi all,

Thank you again for the great discussion. I've managed to get something up and running for the IIR implementation with negligible delays! I also got the FIR implementation to run faster by increasing the SPI clock speed, but there was still a relatively large spread in sampling periods, so I'll probably stick with the IIR filter moving forward. I am using floating point here because it's easier, but it would be much better to use fixed point since, as you mentioned, the ADC and DAC both use uint16_t (the ADC has an offset, so 0 represents -5V and 65535 represents +5V). Anyway, I find that I can reliably implement stable 3rd-order LPFs with single-precision float coefficient quantization. I set up a MATLAB script to simulate the output of my system, and the benchtop measurements agree with the results! Hooray.

For those who are interested, here's how I did it. I am implementing a Direct-Form I, single-section Butterworth IIR filter using the code below:

Code:
float iirFilt(IIR *iir, float input, float coefsB[], float coefsA[])
{
    float acc1 = 0.0;
    /* b coefficients */
    //Shift in the new input x(n)
    for (int i = (iir->coefBLen - 1); i > 0; i--){
        iir->dlyX[i] = iir->dlyX[i - 1];
    }
    iir->dlyX[0] = input;
    
    for (int i = 0; i < iir->coefBLen; i++){
        acc1 += coefsB[i] * iir->dlyX[i];
    }
        
    /* a coefficients */
    for (int i = 1; i < iir->coefALen; i++){
        acc1 -= coefsA[i] * iir->dlyY[i-1];
    }
        
    for (int i = (iir->coefALen) - 1; i > 0; i--){
        iir->dlyY[i] = iir->dlyY[i - 1];
    }
    
    iir->dlyY[0] = acc1;
    
    return acc1;
}


void calculateDigitalFilterStep(uint8_t curCol){
  uint32_t tmpStep = 0;
  uint16_t tmp, tmpADC = curADCVals[0], tmpCurMap = 0, nextCurVal = 0;
  float filterInput, K = (2.0/255.0);//K is a conversion constant to go from ADC code to corresponding current DAC code
  float filterOutput;
  for(int i = 0; i<16; i++){
    tmpADC = curADCVals[i];//this is the most recent ADC value
    //tmpCurMap is the y[n-1] (previous value written to the DAC).
    tmpCurMap = currentMap[curCol][i];
    filterInput = (float)tmpCurMap + (float)(tmpADC-32767.0)*K;
    filterOutput = iirFilt(&(iirLPF[curCol][i]), filterInput, &coefB[0].bValFloat, &coefA[0].bValFloat);
    if(filterOutput > 65535.0)
      filterOutput = 65535;
    else if(filterOutput <= 0.0)
      filterOutput = 0;
    
    currentMap[curCol][i] = uint16_t(round(filterOutput));
  }
}

I'm leaving out some details here for simplicity, but basically, after each ADC sample is captured, I feed it into the filter and update the next DAC code using the filter output. The only challenge is that the gain I apply is huge, so the ADC clips, which causes the filter to settle slowly until we come back within the range of the ADC. I'm working on adding the ability to toggle the filter coefficients while maintaining the filter state (the delay lines). This would allow me to first set the LPF cutoff to, say, 30 Hz and then, once we settle, toggle it to a lower cutoff.
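Here's the rough shape of what I mean (schematic only; the names are placeholders, not from my real code). Since iirFilt() takes the coefficient arrays as arguments and keeps its delay lines inside the IIR struct, switching the cutoff is just a matter of passing a different pre-quantized set, without resetting the filter history:
Code:
// Schematic: two pre-computed, pre-quantized 3rd-order coefficient sets.
// The delay lines inside each IIR struct are left untouched, so there is no
// reset transient when the cutoff changes.
float coefB30[4] = { /* b0..b3 for the 30 Hz design */ };
float coefA30[4] = { /* a0..a3 for the 30 Hz design */ };
float coefB5[4]  = { /* b0..b3 for the lower cutoff */ };
float coefA5[4]  = { /* a0..a3 for the lower cutoff */ };

float *activeB = coefB30;
float *activeA = coefA30;

void switchCutoffIfSettled(bool settled) {
  if (settled && activeB == coefB30) {
    activeB = coefB5;   // tighten the cutoff once the loop is in range
    activeA = coefA5;
  }
}

// ...and in calculateDigitalFilterStep():
//   filterOutput = iirFilt(&(iirLPF[curCol][i]), filterInput, activeB, activeA);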

I know this is moving away from the original topic, but if anyone has more ideas about how I can improve this, I'm all ears! I'd love to get rid of the floats and just use fixed-point representations, but since I need to apply a conversion between the ADC and DAC (one is voltage, one is current), I'm not sure this will be straightforward (from a quantization/rounding perspective). I'm also curious if anyone has ideas on how to calculate filter coefficients based on the sampling rate and cutoff frequency?

EDIT:
Oh, I have another question relevant to this post. Does anyone know if there's a way I could dynamically allocate memory in RAM1 based on the type of filter I want to implement? Right now, I am using #if and #endif directives to allocate space for the different data structures of the FIR or IIR filters. This works because the #if conditionals are evaluated at compile time. But I would like to have a radio button in my computer software application that selects whether I am using FIR or IIR settings. This means I need to be able to tell the Teensy to allocate the memory space in RAM1 for either FIR or IIR after compiling.

One thought is to load all of the memory space in PSRAM first, then copy it over to RAM1 when I receive the information over the serial port telling me what data structure to use. But I'm not entirely sure if this would work or not.

Thanks again everyone!
 
I'd love to get rid of the floats and just use fixed-point representations, but since I need to apply a conversion between the ADC and DAC (one is voltage, one is current), I'm not sure this will be straightforward (from a quantization/rounding perspective). I'm also curious if anyone has ideas on how to calculate filter coefficients based on the sampling rate and cutoff frequency?

You could look at the file biquad.h in the Audio library. It contains code to compute coefficients for 2nd-order filters, with arguments including the sample rate and critical frequency. Biquad filters can be cascaded to create higher-order filters. I believe the Audio library uses fixed point, but you can decide the trade-offs there.
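If you'd rather compute them yourself, the standard "Audio EQ Cookbook" formulas for a 2nd-order low-pass are short. A sketch (my own, not taken from the Audio library; Q = 0.7071 gives a Butterworth response, and the coefficients are normalized so a0 = 1, ready for a Direct-Form I section like your iirFilt()):
Code:
#include <math.h>

// Compute normalized 2nd-order low-pass coefficients (RBJ "Audio EQ Cookbook").
// fc = cutoff (Hz), fs = sample rate (Hz), q = 0.7071f for Butterworth.
void designLowpassBiquad(float fc, float fs, float q, float b[3], float a[3])
{
  float w0    = 2.0f * 3.1415927f * fc / fs;
  float cw    = cosf(w0);
  float alpha = sinf(w0) / (2.0f * q);
  float a0    = 1.0f + alpha;

  b[0] = ((1.0f - cw) * 0.5f) / a0;
  b[1] =  (1.0f - cw)         / a0;
  b[2] = ((1.0f - cw) * 0.5f) / a0;
  a[0] = 1.0f;                     // already normalized out
  a[1] = (-2.0f * cw) / a0;
  a[2] = (1.0f - alpha) / a0;
}

// Example: 30 Hz cutoff at ~16.7 kHz (60 us period):
//   float b[3], a[3];
//   designLowpassBiquad(30.0f, 16667.0f, 0.7071f, b, a);
// Cascade two such sections (with the appropriate Q values) for a 4th-order filter.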

Regarding the memory question, can you allocate both and switch at run-time? That might be easier.
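For example, something along these lines might work (schematic only; FIRState/IIRState are placeholders for your real data structures). A union reserves RAM1 space for whichever structure is larger, rather than the sum of both, and a flag set from the serial command picks which member gets used:
Code:
struct FIRState { uint16_t taps[256]; uint16_t idx; };   // placeholder layout
struct IIRState { float dlyX[4]; float dlyY[4]; };       // placeholder layout

union FilterState {        // size = max of the two members, not the sum
  FIRState fir;
  IIRState iir;
};

FilterState filters[16];   // ordinary static allocation, ends up in RAM1 as usual
bool useIIR = false;       // set when the radio-button command arrives over serial

void processSample(uint16_t adcVal, int ch) {
  if (useIIR) {
    // ... run the IIR update on filters[ch].iir ...
  } else {
    // ... run the FIR update on filters[ch].fir ...
  }
}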
 