Teensy 4.0 Multiplexing 32 audio inputs

ghostintranslation

Hi,

I made a basic circuit with 4 multiplexers on A0 A1 A2 A3, and I coded an AudioStream class that samples them at audio frequency so they can be patched into any other audio object. 8 inputs can be sampled at 44.1kHz each, 16 at 22.05kHz, and up to 32 at 11.025kHz.
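To make the rate trade-off concrete, here is a rough sketch of the arithmetic, assuming each analog pin carries one 8-channel multiplexer and the pins are read one after another. The function names are hypothetical and only the three configurations quoted above are covered.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one 8-channel multiplexer per analog pin, pins read sequentially,
// so the per-input rate is the audio rate divided by the number of pins read.
double perInputRateHz(uint32_t inputsCount) {
  assert(inputsCount == 8 || inputsCount == 16 || inputsCount == 32);
  const uint32_t pinsRead = inputsCount / 8;
  return 44100.0 / pinsRead;
}

// Samples collected per input in the time of one 128-sample audio block.
uint32_t samplesPerBlock(uint32_t inputsCount) {
  assert(inputsCount == 8 || inputsCount == 16 || inputsCount == 32);
  return 128 / (inputsCount / 8);
}
```

With 32 inputs this gives 11025 Hz and 32 samples per block, matching the figures in the post.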

This is part of a larger project for a Teensy 4.0 Eurorack modules base (actually v2 of it), for which it is useful to read signals up to 20kHz for FM for example.

I tried many different things, and that's the best I could do so far. I used a PIT timer and the ADC lib, but I couldn't take advantage of DMA. Nothing I tried with DMA worked, probably because I am not used to it.

What the code does is read the 4 analog pins, step the multiplexers to the next channel, read again, step again... until I have 128 samples (or 64 or 32 when there are more than 8 inputs) for each input. The bottleneck is reading multiple pins "at the same time", but I'm almost sure someone will be able to add DMA or otherwise improve it, and possibly get 44.1kHz on all 32 inputs.
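The multiplexer stepping can be modelled without hardware. This is a minimal sketch assuming 8-channel multiplexers with three select lines driven from the low three bits of a channel index, which is how the code later in the thread drives pins 2, 3 and 4. The struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Levels for the three select lines of an 8-channel multiplexer.
struct MuxSelect {
  bool s0;
  bool s1;
  bool s2;
};

// Split a channel index (0..7) into the three select-line levels.
MuxSelect selectLinesFor(uint8_t muxIndex) {
  assert(muxIndex < 8);
  return {(muxIndex & 1) != 0, (muxIndex & 2) != 0, (muxIndex & 4) != 0};
}

// Advance to the next channel, wrapping back to 0 after channel 7.
uint8_t nextMuxIndex(uint8_t muxIndex) {
  assert(muxIndex < 8);
  return (muxIndex + 1) % 8;
}
```

On the real board each of the three levels would be written to its select pin before the next read.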

Filtering can be set with a method so that at slower sampling frequencies it smooths the signals better.
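The interrupt code posted later in this thread smooths each input with a one-pole low-pass and rescales the 12-bit reading to a signed 16-bit range. Below is a standalone sketch of those two steps. The function names are hypothetical, and the real class may organise this differently.

```cpp
#include <cassert>
#include <cstdint>

// Rescale an unsigned 12-bit ADC reading (0..4095) to a signed 16-bit range.
float toSigned16(uint16_t raw12) {
  assert(raw12 <= 4095);
  return static_cast<float>(raw12) * 16.0f - 32768.0f;
}

// One step of a one-pole low-pass filter.
// coeff = 1 passes the input through unchanged; smaller values smooth more.
float lowPassStep(float previous, float input, float coeff) {
  assert(coeff > 0.0f && coeff <= 1.0f);
  return coeff * input + (1.0f - coeff) * previous;
}
```

A lower coefficient would suit the slower per-input rates, since fewer samples are available to average out noise.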

Here is a quick demo video:
https://www.youtube.com/watch?v=uRedUWVYY1Y

And here is the code:
https://gist.github.com/ghostintranslation/7804fd1ef46d85e38ad0f74df730480e

I'd be happy to hear from anyone that could increase the sampling frequency when more than 1 pin is read (more than 8 inputs) :)
 
It is very poorly documented, but many Teensys, including the 4.0, have two ADCs which can be used simultaneously.
If you choose the appropriate analog input pins, you can read two pins at the same time using analogSynchronizedRead. Alternatively, you can schedule adc0 and adc1 to sample independently using startSingleRead and readSingle. This thread has some relevant discussion on the ADC library.

Some analog pins are only connected to one of the ADCs, so you need to know which combinations of pins will work.
There is some discussion about the Teensy 4.0 pins on this thread.
Member @KurtE has kindly provided a reference table for Teensy 4. If you check the "Analog" column of sheet "T4" you can see that pins A0-A9 are connected to both ADCs, pins A10-A11 connect to the first ADC, and pins A12-A13 connect to the second ADC.

Since you have used A0-A3 you should be safe to read any combinations of pins simultaneously, which should give you roughly twice the sampling throughput in your timer callback.
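The pin constraints described above can be written down as a small lookup. This is only a sketch of the table as quoted in this reply (A0-A9 on both ADCs, A10-A11 on the first, A12-A13 on the second), not checked against the datasheet, and the names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum AdcMask : uint8_t { kAdcFirst = 1, kAdcSecond = 2, kAdcBoth = 3 };

// Which ADCs can reach analog pin An, per the table quoted in the post.
AdcMask adcsForAnalogPin(uint8_t n) {
  assert(n <= 13);
  if (n <= 9) return kAdcBoth;
  if (n <= 11) return kAdcFirst;
  return kAdcSecond;
}

// A pair can be read simultaneously when the first pin is reachable from the
// first ADC and the second pin from the second ADC.
bool canReadSynchronized(uint8_t pinOnFirstAdc, uint8_t pinOnSecondAdc) {
  return (adcsForAnalogPin(pinOnFirstAdc) & kAdcFirst) != 0 &&
         (adcsForAnalogPin(pinOnSecondAdc) & kAdcSecond) != 0;
}
```

Any pair drawn from A0-A3 passes this check, which is why the existing wiring should work as is.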
 
Thanks for the suggestions,

My understanding was that the Teensy 4.0 does have 2 ADCs, but that only ADC0 was connected to pins A0 to A9, so that's one thing I learned.

I gave it a try by setting up adc1 like adc0 with the fastest settings, using adc0 on A0 and A2 and adc1 on A1 and A3, and calling adc->adc0->analogRead(A0) and adc->adc1->analogRead(A1), so that I can have 16 inputs with both ADCs. But when I do that it is still not fast enough.

Then I tried with adc->analogSynchronizedRead(A0,A1) and that is actually faster. I had looked at that function's source code before and thought it would not be faster and I probably didn't try at the time... So now with this one I can have 16 inputs at 44.1kHz!

But if I try to use it for 32 inputs, it's too slow and I still need to divide the sampling frequency by 4.

This is how I set up adc0 and adc1:
Code:
adc = new ADC();
  adc->adc0->setAveraging(1);   // set number of averages
  adc->adc0->setResolution(12); // set bits of resolution
  adc->adc0->setConversionSpeed(ADC_CONVERSION_SPEED::VERY_HIGH_SPEED);
  adc->adc0->setSamplingSpeed(ADC_SAMPLING_SPEED::VERY_HIGH_SPEED);
  adc->adc0->startSingleRead(A0);
  adc->adc1->setAveraging(1);   // set number of averages
  adc->adc1->setResolution(12); // set bits of resolution
  adc->adc1->setConversionSpeed(ADC_CONVERSION_SPEED::VERY_HIGH_SPEED);
  adc->adc1->setSamplingSpeed(ADC_SAMPLING_SPEED::VERY_HIGH_SPEED);
  adc->adc1->startSingleRead(A1);

This is what the code looks like in the interrupt with analogSynchronizedRead:

Code:
  if (headQueueTempCount[muxIndex] < AUDIO_BLOCK_SAMPLES && headQueueTempCount[muxIndex+8] < AUDIO_BLOCK_SAMPLES ) {
    ADC::Sync_result result = adc->analogSynchronizedRead(A0,A1);

    for (int i = 0; i < downSamplingFactor; i++) {
      accumulator[muxIndex] = (lowPassCoeff[muxIndex] * (result.result_adc0* 16 - 32768)) + (1.0f - lowPassCoeff[muxIndex]) * accumulator[muxIndex];
      headQueueTemp[muxIndex][headQueueTempCount[muxIndex]] = accumulator[muxIndex];
      headQueueTempCount[muxIndex]++;
      
      accumulator[muxIndex+8] = (lowPassCoeff[muxIndex+8] * (result.result_adc1* 16 - 32768)) + (1.0f - lowPassCoeff[muxIndex+8]) * accumulator[muxIndex+8];
      headQueueTemp[muxIndex+8][headQueueTempCount[muxIndex+8]] = accumulator[muxIndex+8];
      headQueueTempCount[muxIndex+8]++;
    }
  }

  if (inputsCount > 16 && headQueueTempCount[muxIndex+16] < AUDIO_BLOCK_SAMPLES && headQueueTempCount[muxIndex+24] < AUDIO_BLOCK_SAMPLES ) {
    ADC::Sync_result result2 = adc->analogSynchronizedRead(A2,A3);
    
    for (int i = 0; i < downSamplingFactor; i++) {
      accumulator[muxIndex+16] = (lowPassCoeff[muxIndex+16] * (result2.result_adc0* 16 - 32768)) + (1.0f - lowPassCoeff[muxIndex+16]) * accumulator[muxIndex+16];
      headQueueTemp[muxIndex+16][headQueueTempCount[muxIndex+16]] = accumulator[muxIndex+16];
      headQueueTempCount[muxIndex+16]++;
      
      accumulator[muxIndex+24] = (lowPassCoeff[muxIndex+24] * (result2.result_adc1* 16 - 32768)) + (1.0f - lowPassCoeff[muxIndex+24]) * accumulator[muxIndex+24];
      headQueueTemp[muxIndex+24][headQueueTempCount[muxIndex+24]] = accumulator[muxIndex+24];
      headQueueTempCount[muxIndex+24]++;
    }
  }

Regarding the asynchronous way, I can't actually do that because it needs to be timed precisely at 44100 * 8 = 352800Hz, or 4 pins to be read every 2.83us.
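For reference, the timing budget quoted above works out as follows. This is a plain arithmetic sketch with hypothetical function names, assuming 8 multiplexer channels per pin.

```cpp
#include <cassert>
#include <cstdint>

// Rate at which the scan must step through the multiplexer channels so that
// every channel is still sampled at perInputRateHz.
double scanRateHz(double perInputRateHz, uint32_t channelsPerMux) {
  assert(channelsPerMux > 0);
  return perInputRateHz * channelsPerMux;
}

// Time available for each scan step, in microseconds.
double stepPeriodUs(double scanRate) {
  assert(scanRate > 0.0);
  return 1.0e6 / scanRate;
}
```

For 44.1kHz and 8 channels the scan rate is 352800 Hz, which leaves about 2.83us per step for all the pin reads.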
 
The problem is that functions like analogRead and even analogSynchronizedRead will block the CPU in a loop while waiting for conversion to complete. You probably want to avoid doing this in a high-frequency timer interrupt.
With the current refactored interrupt, each analogSynchronizedRead will take one conversion time, probably at least 0.7us, to complete. Two of these calls give you 1.4us of blocking every 2.83us. In other words, roughly 50% of the CPU time is spent waiting for ADC conversions in the interrupt handler.
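The 50% figure is just the ratio of blocking time to the step period. A one-function sketch, where the 0.7us conversion time is the estimate from this post rather than a measured value:

```cpp
#include <cassert>

// Fraction of each scan step spent busy-waiting on blocking ADC reads.
double blockingFraction(double conversionUs, unsigned readsPerStep,
                        double stepPeriodUs) {
  assert(stepPeriodUs > 0.0);
  return conversionUs * readsPerStep / stepPeriodUs;
}
```

With 0.7us per conversion, 2 reads and a 2.83us step, the result is a little under 0.5.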

Instead you could try using startSynchronizedSingleRead in the timer interrupt to start the conversion.
Then in the ADC interrupt handler get the result via readSynchronizedSingle. Then start the next conversion, and handle its result, etc.
Given that you always start adc0 and adc1 together, they should finish at about the same time, but you probably still want to wait for both interrupts before reading the values. For example, use one ISR for both ADC0 and ADC1: zero a counter in the timer handler, increment it on each ADC interrupt, and process the results on every second interrupt. The counter also tells you what to do next. When it reaches 2, read and save the first pair of samples and start the next conversion. When it reaches 4, read and save the second pair of samples and advance the analog input multiplexer.

It's a bit messier to handle sampling in this asynchronous way, but if you can get it right you should have almost twice the CPU time left for other background processing, such as any Audio Library work. There will be some extra overhead for the additional interrupts, but I'll bet it's small compared to the savings you can achieve.
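The counting scheme suggested in this post can be prototyped on a desktop before touching the ADC registers. Below is a hardware-free sketch of the sequencing only. All names are hypothetical, and the real handlers would call startSynchronizedSingleRead and readSynchronizedSingle where the comments indicate.

```cpp
#include <cassert>
#include <cstdint>

// What the shared ADC handler should do after a given interrupt.
enum class Action : uint8_t {
  Wait,                      // only one ADC of the pair has finished so far
  SaveFirstPairStartSecond,  // read first pair, start the second conversion
  SaveSecondPairAdvanceMux   // read second pair, step the multiplexers
};

struct ScanState {
  uint8_t adcInterruptCount = 0;  // ADC interrupts seen since the timer tick
  uint8_t muxIndex = 0;           // current multiplexer channel, 0..7
};

// Timer tick: reset the count. Real code would start the first conversion here.
void onTimerTick(ScanState &s) { s.adcInterruptCount = 0; }

// Shared handler for both ADC interrupts: act on every second call.
Action onAdcInterrupt(ScanState &s) {
  assert(s.adcInterruptCount < 4);  // more than 4 per tick means an overrun
  s.adcInterruptCount++;
  if (s.adcInterruptCount == 2) return Action::SaveFirstPairStartSecond;
  if (s.adcInterruptCount == 4) {
    s.muxIndex = (s.muxIndex + 1) % 8;
    return Action::SaveSecondPairAdvanceMux;
  }
  return Action::Wait;
}
```

Feeding it one timer tick followed by four ADC interrupts should yield Wait, SaveFirstPairStartSecond, Wait, SaveSecondPairAdvanceMux, with the multiplexer index moving from 0 to 1.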
 
I tried the async approach as you suggested but I couldn't get it right.

I did try again with DMA though, using the timer provided by the ADC lib, and I got better results than before: I can get the 32 inputs at 22.05kHz, so the sampling frequency is only divided by 2 now.

I have a simple procedural code if you want to try, I haven't yet converted it into my Input class:

Code:
#include <ADC.h>
#include "DMAChannel.h"

DMAChannel dmaChannel1;
DMAChannel dmaChannel2;
ADC* adc;
uint16_t adc1PinIndex = 0;
uint16_t adc2PinIndex = 1;
const uint16_t buffSize = 128;
const uint16_t inputsCount = 32;
uint16_t buffers[inputsCount][buffSize * 2] {{0}};
uint16_t bufferCount[inputsCount] = {0};
uint16_t val1 = 0;
uint16_t val2 = 0;
uint16_t isr1Count = 0;
uint16_t isr2Count = 0;
uint16_t muxIndex = 0;
uint8_t pinToChannel[4] = {
  7, // 14/A0  AD_B1_02
  8,  // 15/A1  AD_B1_03
  12, // 16/A2  AD_B1_07
  11, // 17/A3  AD_B1_06
};
void setup() {
  Serial.flush();
  Serial.begin(9600);
  while (!Serial && millis() < 5000) ;

  pinMode(A0, INPUT);
  pinMode(A1, INPUT);
  pinMode(A2, INPUT);
  pinMode(A3, INPUT);
  pinMode(2, OUTPUT);
  pinMode(3, OUTPUT);
  pinMode(4, OUTPUT);

  // Reset multiplexer to channel 0
  digitalWriteFast(2, LOW);
  digitalWriteFast(3, LOW);
  digitalWriteFast(4, LOW);

  dmaChannel1.source((volatile uint16_t &)(ADC1_R0));
  dmaChannel1.destination((volatile uint16_t &)val1);
  dmaChannel1.transferSize(2);
  dmaChannel1.transferCount(1);
  dmaChannel1.interruptAtCompletion();
  dmaChannel1.attachInterrupt(isr1);
  dmaChannel1.triggerAtHardwareEvent(DMAMUX_SOURCE_ADC1);
  dmaChannel1.enable();

  dmaChannel2.source((volatile uint16_t &)(ADC2_R0));
  dmaChannel2.destination((volatile uint16_t &)val2);
  dmaChannel2.transferSize(2);
  dmaChannel2.transferCount(1);
  dmaChannel2.interruptAtCompletion();
  dmaChannel2.attachInterrupt(isr2);
  dmaChannel2.triggerAtHardwareEvent(DMAMUX_SOURCE_ADC2);
  dmaChannel2.enable();

  adc = new ADC();
  adc->adc0->setAveraging(1);   // set number of averages
  adc->adc0->setResolution(12); // set bits of resolution
  adc->adc0->setConversionSpeed(ADC_CONVERSION_SPEED::VERY_HIGH_SPEED);
  adc->adc0->setSamplingSpeed(ADC_SAMPLING_SPEED::VERY_HIGH_SPEED);
  adc->adc0->enableDMA();
  adc->adc0->startSingleRead(A0);

  adc->adc1->setAveraging(1);   // set number of averages
  adc->adc1->setResolution(12); // set bits of resolution
  adc->adc1->setConversionSpeed(ADC_CONVERSION_SPEED::VERY_HIGH_SPEED);
  adc->adc1->setSamplingSpeed(ADC_SAMPLING_SPEED::VERY_HIGH_SPEED);
  adc->adc1->enableDMA();
  adc->adc1->startSingleRead(A1);

  // Should be *16 to get 128 samples in 2902us with 32 inputs (128 / 44100 Hz),
  // but then it actually takes 3780us, so max is *8 to get 64 samples under 2902us
  adc->adc0->startTimer(AUDIO_SAMPLE_RATE * 16);
  adc->adc1->startTimer(AUDIO_SAMPLE_RATE * 16);
}

void loop() {
}


void isr1() {
  isr1Count++;

  if (isr1Count <= 2) {
    uint16_t inputIndex = muxIndex + 8 * adc1PinIndex;

    buffers[inputIndex][bufferCount[inputIndex]] = val1;
    bufferCount[inputIndex]++;

    if (adc1PinIndex == 0) {
      // Switching ADC1 mux to pin A2
      adc1PinIndex = 2;
      ADC1_HC0 = pinToChannel[adc1PinIndex] | ADC_HC_AIEN;
    }
  }
  processBothIsr();
  dmaChannel1.clearInterrupt();
  asm("DSB");
}

void isr2() {
  isr2Count++;

  if (isr2Count <= 2) {
    uint16_t inputIndex = muxIndex + 8 * adc2PinIndex;

    buffers[inputIndex][bufferCount[inputIndex]] = val2;
    bufferCount[inputIndex]++;

    if (adc2PinIndex == 1) {
      // Switching ADC2 mux to pin A3
      adc2PinIndex = 3;
      ADC2_HC0 = pinToChannel[adc2PinIndex] | ADC_HC_AIEN;
    }
  }

  processBothIsr();
  dmaChannel2.clearInterrupt();
  asm("DSB");
}

elapsedMicros timer;

void processBothIsr() {
  if (isr1Count < 2 || isr2Count < 2) {
    return;
  }

  // Switching ADC1 mux to pin A0
  adc1PinIndex = 0;
  ADC1_HC0 = pinToChannel[adc1PinIndex] | ADC_HC_AIEN;
  
  // Switching ADC2 mux to pin A1
  adc2PinIndex = 1;
  ADC2_HC0 = pinToChannel[adc2PinIndex] | ADC_HC_AIEN;
  
  isr1Count = 0;
  isr2Count = 0;

  muxIndex++;
  muxIndex = muxIndex % 8;

  digitalWriteFast(2, muxIndex & 1);
  digitalWriteFast(3, muxIndex & 2);
  digitalWriteFast(4, muxIndex & 4);

  if (bufferCount[inputsCount - 1] >= buffSize) {
    // UNCOMMENT THIS TO LOOK AT THE SIGNALS
    //    for (int i = 0; i < buffSize; i++) {
    //      for (int j = 0; j < inputsCount; j++) {
    //        if (j % 8 == 0) {
    //          Serial.print(buffers[j][i]);
    //          Serial.print(",");
    //        }
    //      }
    //      Serial.println("");
    //    }

    // UNCOMMENT THIS TO LOOK AT THE TIMING
    Serial.println(timer);
    timer = 0;

    Serial.flush();

    for (int i = 0; i < inputsCount; i++) {
      bufferCount[i] = 0;
    }
  }
}

You can try this without the hardware: the code prints over serial the time it takes to get 128 samples.

It looks like the bottleneck is switching the ADC input mux in isr1 and isr2:
ADC1_HC0 = pinToChannel[adc1PinIndex] | ADC_HC_AIEN;
and
ADC2_HC0 = pinToChannel[adc2PinIndex] | ADC_HC_AIEN;

These same instructions do not seem to affect the timing in processBothIsr though, probably because that one runs less often.

Is there a faster way to switch the muxes? Or any other idea?
 
Strangely, if I run the ADCs like this:
Code:
adc->adc0->startContinuous(A0);
adc->adc1->startContinuous(A1);

instead of:
Code:
adc->adc0->startTimer(AUDIO_SAMPLE_RATE * 16);
adc->adc1->startTimer(AUDIO_SAMPLE_RATE * 16);

it manages to get 128 samples in 2403us, but with the timer the limit is 3780us unless I comment out the ADC mux switching instructions...

Any idea how to get it to run at AUDIO_SAMPLE_RATE * 16? It looks like the hardware is capable at least.
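To compare those measurements against the target, the block time can be turned into an effective per-input rate. A small sketch, assuming the printed values are microseconds (the sketch above times with elapsedMicros); the function name is made up.

```cpp
#include <cassert>

// Effective per-input sample rate given how long a block of samples took.
double effectiveRateHz(unsigned samples, double blockTimeUs) {
  assert(blockTimeUs > 0.0);
  return samples * 1.0e6 / blockTimeUs;
}
```

128 samples in 3780us is roughly 33.9kHz, short of 44.1kHz, while 128 samples in 2403us is roughly 53.3kHz, which suggests the converters themselves have headroom.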
 
I have reworked this class and the best I can get is 16 inputs at 44.1kHz and 32 at 22.05kHz:
https://gist.github.com/ghostintranslation/7804fd1ef46d85e38ad0f74df730480e

This time the multiplexers are split between the ADCs. For 8 inputs there is just 1 multiplexer on A0, controlled by pins 2-3-4. For 16 inputs there are 2 multiplexers, on A0 and A1, using both ADCs, and the second multiplexer is controlled by pins 5-6-10. For 32 inputs the multiplexers on A0 and A2 work together, as do those on A1 and A3. This way it doesn't have to wait for both ADCs to finish all 4 readings before stepping all the multiplexers; each ADC now steps its own.
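One way to picture that split is as a mapping from input number to ADC, analog pin and multiplexer channel. The sketch below assumes the input numbering from the earlier DMA code (input = muxIndex + 8 * pin index) and that even pins go to one ADC and odd pins to the other. The reworked class may number things differently, and the names here are invented.

```cpp
#include <cassert>
#include <cstdint>

struct InputRoute {
  uint8_t adc;         // 0 for the ADC serving A0/A2, 1 for the one serving A1/A3
  uint8_t analogPin;   // 0..3, meaning A0..A3
  uint8_t muxChannel;  // 0..7, channel on that pin's multiplexer
};

// Map an input number (0..31) to the ADC, pin and channel that carry it.
InputRoute routeFor(uint8_t inputIndex) {
  assert(inputIndex < 32);
  const uint8_t pin = inputIndex / 8;
  return {static_cast<uint8_t>(pin % 2), pin,
          static_cast<uint8_t>(inputIndex % 8)};
}
```

Under this layout each ADC serves 16 inputs on its own 2 pins, so neither has to wait for the other before stepping its multiplexers.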

I think I tried everything and that's the limit for Teensy 4.0. Please prove me wrong if you can :)
 