Daisy Chain SPI Shift Register Control with Teensy 4.1

grinch

Hi, I'm working on a project where I need to control a large number of shift register outputs updated at a 44.1 kHz audio rate. I built a board in the past that worked well for controlling 64 outputs (8 shift registers), and I'm now working to expand it to more outputs for a different application.

To control more outputs I'm designing a daisy-chainable PCB. The master PCB will have a Teensy 4.1 populated, sending the SPI signal to the shift registers. This PCB will send its output to a series of daisy-chained PCBs without the 4.1 populated, using differential transmitters and receivers between each PCB.

The intention for this project is to control a very large number of shift register outputs (somewhere in the low thousands). To this end I'm planning to run SPI at around 100 MHz clock speed (100 MHz / 44.1 kHz = ~2267 bits per sample period) and set up an interrupt that starts an SPI transfer via DMA at the 44.1 kHz sample rate.
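
As a rough check of that bit budget, here is a minimal sketch. It assumes one bit per output, 8 outputs per register, and a clock that runs with no gaps between bits; the function names are illustrative and not from the original project.

```cpp
#include <cassert>
#include <cstdint>

// Bits that fit in one sample period if the serial clock runs continuously
// (assumption: no idle time for latch pulses or DMA setup).
inline uint32_t bitsPerSamplePeriod(uint32_t clockHz, uint32_t sampleRateHz) {
  assert(sampleRateHz > 0);
  return clockHz / sampleRateHz;  // integer division rounds down
}

// Number of 8-bit shift registers that bit budget can feed.
inline uint32_t registersPerSamplePeriod(uint32_t clockHz, uint32_t sampleRateHz) {
  return bitsPerSamplePeriod(clockHz, sampleRateHz) / 8;
}
```

For a 100 MHz clock at 44.1 kHz this gives 2267 bits, or 283 eight-bit registers, per sample period.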

For sending SPI between the PCBs I'm using SN65LVDS32DR and SN65LVDS31DR differential transmitters + receivers with shielded Cat6 cable. For the shift registers I'm using 74LVC595 high speed shift registers. I'm also using SN74LVC125 Schmitt trigger buffers for signal cleanup on the input and output of the PCB.

Please find my schematic attached. Does this seem like a reasonable approach to the stated application? I'm trying to make this as modular as possible while using a single PCB design.

Attachments: Screenshot 2025-02-09 at 2.08.13 AM.png, Printing Print Schematic.pdf (192.1 KB)

The SPI modules have a limit of ~30MHz. You'll need to use FlexIO (configured as SPI) to reach 100MHz
 
I thought they could be run faster if you configure the clock outside of the library function?

For FlexIO, can that be controlled via DMA? And can the pin assignment be the same as the regular SPI port?
 
Before answering, I have a question about the data you wish to transmit. Is it audio or audio-like, where maintaining a consistent sample rate is of paramount importance? Or is it insensitive to minor timing fluctuations, where small variable size gaps between each byte/word don't really matter?
 
It's audio-like, essentially a bunch of 1-bit / square wave audio channels. The project is for musical robotics. I used the 64 channel version to build this, if you're interested: https://emmettpalaima.com/CATHEDRAL-64

Because the shift registers have a latch / storage feature, the outputs update on the rising "chip select" edge (really this isn't technically a chip select because the signal isn't technically SPI, it's just SPI-like). So if all the shift registers detect the rising chip select edge at the same time, the outputs will update simultaneously, regardless of what is happening with other digital signals. The rising chip select edge also happens just once per 44.1 kHz sample period, even if the SPI clock is set to be very fast.
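
The latch behaviour described above can be modelled in a few lines. This is a toy software model, not driver code: the class and method names are made up for illustration, and it assumes a single chain clocked one bit at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a chain of 74x595-style registers.
// 'shift' is what the serial clock moves; 'outputs' is what the pins show.
class ShiftRegisterChain {
public:
  explicit ShiftRegisterChain(std::size_t bitCount)
      : shift(bitCount, false), outputs(bitCount, false) {
    assert(bitCount > 0);
  }

  // One rising edge on the serial clock: every bit moves one stage down
  // the chain and the new bit enters stage 0. Output pins are unaffected.
  void clockIn(bool bit) {
    for (std::size_t i = shift.size() - 1; i > 0; --i) {
      shift[i] = shift[i - 1];
    }
    shift[0] = bit;
  }

  // One rising edge on the latch ("chip select") line: all outputs take
  // the shifted-in values at the same instant.
  void latch() { outputs = shift; }

  bool output(std::size_t i) const { return outputs.at(i); }

private:
  std::vector<bool> shift;
  std::vector<bool> outputs;
};
```

However fast or unevenly `clockIn()` is called, `output()` only changes when `latch()` is called, which is why the latch edge alone sets the output timing.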

The way I've approached this in previous iterations is to call an interrupt at 22.05 kHz and have it alternate between two functions. The first function does the SPI style shift register data write (CS low, clock out all the data). CS then stays low until the beginning of the next interrupt call, which calls a second function to write the CS pin high. Because the CS pin is always written high at the very beginning of the interrupt call the timing between outputs is very consistent. At least this has been the case audibly on the 64 channel version.
 
I believe your best option is probably TDM using 4 or 5 data pins. The TDM code in the audio library transmits a 256 bit frame, which was (nearly) the fastest Teensy 3.x hardware was rated to handle. I believe the max bitrate is twice as high for Teensy 4.x, so you ought to be able to transmit 512 bit frames. SAI1 can use 4 transmit pins, and SAI2 can use 1 transmit pin. So with all 5 in use, you could transmit 2560 bits per 44.1kHz update. Maybe 5 data pins clocking at 22.6 MHz isn't as appealing as only 1 at ~100 MHz, but it could really ease the requirements for the rest of your project. For example, TI's 74LV595 shift register has a 5.5ns spec for clock high and low time, so according to official specs it can't quite clock at 100 MHz. With buffer chips, connectors, cables you'll probably have quite a challenging time reliably achieving 100 MHz bitrate. Several pins at under 25 MHz still might give some challenges, but dealing with it should be much simpler.

The other huge advantage of TDM is your sample rate will be much more stable. The SAI hardware will generate the sample rate entirely in hardware. It also has a FIFO, so even when you feed it using DMA, the FIFO allows the sample rate to remain perfect even if there is occasional bus access latency. It's also CPU efficient, and with so many signals you're probably going to want as much CPU time as possible just to work with the data. An interrupt-based approach from a timer will have per-sample jitter because of slightly varying interrupt latency. From your comment about doing this before, maybe quality problems from jitter aren't a main concern? Still, with digital audio it's well known that sample rate jitter becomes noise+distortion, so best to avoid it if you can.

The downside to TDM, or any SAI usage that's not already implemented by the audio library, is you'll need to dive into the SAI registers to figure out how to get it to create the waveforms you want. Then you need to edit the DMA config to match. The SAI and DMA peripherals are very configurable, which is good, but that goodness comes with some learning curve. You won't have to start from zero, since code already exists in the audio library for TDM with 256 bit frame size.
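
The clock figures quoted for TDM follow from simple multiplication. A small sketch, assuming every data pin carries one frame per sample; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bit clock needed to send one frame of 'bitsPerFrame' bits per sample.
inline uint64_t bitClockHz(uint32_t bitsPerFrame, uint32_t sampleRateHz) {
  assert(bitsPerFrame > 0 && sampleRateHz > 0);
  return static_cast<uint64_t>(bitsPerFrame) * sampleRateHz;
}

// Total bits delivered per sample when each data pin carries one frame.
inline uint32_t bitsPerUpdate(uint32_t bitsPerFrame, uint32_t dataPins) {
  return bitsPerFrame * dataPins;
}
```

With 512 bit frames at 44.1 kHz this gives 22,579,200 Hz (the 22.6 MHz figure above), and five data pins give 2560 bits per update.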
 
So I guess the idea would be to write TDM data out once per sample and use the TDM word clock as the "chip select"?

Could the TDM data outputs share one word and bit clock output?

What is the maximum number of data pins for TDM on the Teensy 4.1?

Would it be possible to run TDM at a higher sample rate, say 96 kHz, and alternate sample frames, while still running the chip select at 48 kHz, to get 1024 bit frame performance?

This project will likely go through several iterations, but exploring different approaches is helpful. Are there any pins on the Teensy that could be used for both TDM and SPI output so I can build one PCB and try both firmware approaches?
 
Also this is more in the category of general project advice, but does the differential transmitter / receiver + Shielded CAT6 and Schmitt trigger setup seem reasonable for sending signal between PCBs?

CAT6 runs will mostly be < 12", at most < 36".
 
First quick answers and non-answers...

Yes, with TDM the word clock becomes a pulse at the beginning (or end) of each frame.

Max number of data pins is 5. SAI1 can have 4 output pins, and SAI2 can have 1 output pin. I'd highly recommend reading chapter 38 in the reference manual starting on page 1985.

With SAI1, the 4 data pins must share clocks. SAI2 is separate, so it would have its own clock pins not necessarily in sync with SAI1 even if running at the exact same clock speed.

96 kHz sample rate is possible. But if you look at the datasheet on page 61, minimum BCLK cycle time is 40ns (25 MHz). So if you stay within the official specs, you'd be limited to 256 bit frame size at 96 kHz (25 MHz / 96 kHz is about 260 bits). Even at 48 kHz, you simply can't (within published specs) have 1024 bit frame size on a single pin.
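
The frame size limit above follows from the 40ns minimum BCLK cycle time. A hedged sketch of that arithmetic, assuming frame sizes are restricted to powers of two (as with the 256, 512 and 1024 bit sizes discussed in this thread); the function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Largest power-of-two frame size (in bits) whose bit clock stays at or
// below maxBitClockHz for the given sample rate. Returns 0 if none fits.
inline uint32_t maxPowerOfTwoFrameBits(uint32_t maxBitClockHz, uint32_t sampleRateHz) {
  assert(sampleRateHz > 0);
  const uint32_t limit = maxBitClockHz / sampleRateHz;  // bits that fit per sample
  uint32_t bits = (limit >= 1) ? 1 : 0;
  // Double while the next power of two still fits (written to avoid overflow).
  while (bits != 0 && bits <= limit / 2) {
    bits *= 2;
  }
  return bits;
}
```

With a 25 MHz limit this yields 256 bits at 96 kHz and 512 bits at 48 kHz or 44.1 kHz, so a 1024 bit frame on one pin does not fit at any of those rates.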

TDM pins are fixed and not on the same pins as SPI.

I have not used those transceiver chips. Even if I had, whether a high bandwidth signal will reliably work over many feet of a particular cable is not the sort of question for a quick & simple answer. For example, consider the complex modulation ethernet transceiver/PHY chips use for even 100 Mbit.

And just one answer of lengthy explanation...

Generally speaking, the SAI hardware isn't meant to be used in a 1-sample-at-a-time mode. You can do this, but it's extremely inefficient. Even with the 600 MHz Cortex-M7 CPU, you'll quickly burn up most or all of the CPU time as you try to scale up to a large number of signals. For example, interrupting at 88.2kHz is an interrupt every 11.3us. There's considerable overhead just in the hardware to enter and return. Your code will also have overhead, even if very simple, just from the compiler setting up registers to point to memory and peripherals.

Efficient audio is almost always processed in small blocks. The audio library defaults to 128 sample block size, though it can be changed by editing core library files. Block processing gives massive efficiency gains. DMA moves the actual data at audio rate and an interrupt is needed only every 2.9ms.

While code generating the waveforms needs to do the same work, that work is usually far more efficient if done over a block rather than 1 sample at a time. Usually there's considerable overhead just to set up local variables with pointers and constants, and even more for complex algorithms. With block processing, you suffer all that overhead only once per 128 samples.

You're not the first person to talk of scaling up simple 1-sample-at-a-time audio on this forum. In fact it's been discussed over and over. I've written similar answers several times, and so have others here. Maybe you can find those by searching, if you need any more info & confirmation. It's also pretty much universal practice on PCs for processing audio, for the same reasons. Even with GHz clocked CPUs and vast amounts of RAM, you just don't achieve good scaling without using efficient techniques.

I'd highly recommend spending a little time using the Teensy audio library. Even if you don't plan to actually use any of it, you'll certainly end up looking at its code as a starting point on the TDM and DMA configuration.
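
To make the block-processing idea concrete, here is a rough sketch of rendering a whole block of 1-bit channels in one call. The frame layout (16 words of 32 bits), the channel-to-bit mapping and all names are assumptions for illustration, not the audio library's TDM format.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSamples = 128;  // audio library default block size
constexpr std::size_t kWordsPerFrame = 16;  // hypothetical: 16 x 32 = 512 bit frame

// One square-wave channel: a phase accumulator whose top bit is the output.
struct SquareChannel {
  uint32_t phase = 0;
  uint32_t increment = 0;  // frequency * 2^32 / sampleRate
};

using Frame = std::array<uint32_t, kWordsPerFrame>;

// Render 'frameCount' consecutive frames in one call, so pointer and loop
// setup is paid once per block rather than once per sample.
void renderBlock(SquareChannel* channels, std::size_t channelCount,
                 Frame* block, std::size_t frameCount) {
  assert(channelCount <= kWordsPerFrame * 32);
  for (std::size_t n = 0; n < frameCount; ++n) {
    Frame& frame = block[n];
    frame.fill(0);
    for (std::size_t ch = 0; ch < channelCount; ++ch) {
      SquareChannel& c = channels[ch];
      const uint32_t bit = c.phase >> 31;  // 0 or 1
      frame[ch / 32] |= bit << (ch % 32);  // channel ch -> word ch/32, bit ch%32
      c.phase += c.increment;
    }
  }
}
```

In a design like this, `renderBlock()` would be called once per 128 samples (about every 2.9ms at 44.1 kHz), with DMA then moving the finished frames out at audio rate.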
 
What I mean is would I get 1 word clock per sample? And one sample written out at a time at audio rate? It sounds like this is the case. I am just making sure TDM isn't writing out a buffer of several consecutive samples at a time.

The actual audio processing / synthesis is done with the audio library like you described. I wrote a new audio object that feeds samples to a FIFO buffer used by the SPI interrupt. The only job of the SPI interrupt is triggering the SPI write and toggling the chip select GPIO once per sample. The audio library object fills the buffer and the SPI interrupt reads from it:

C++:
#include <Arduino.h>
#include "output_copybuffer.h"
#include "utility/pdb.h"
bool AudioOutputCopyBuffer::update_responsibility = false;
void AudioOutputCopyBuffer::begin(void)
{
  write = COPY_BUFFER_COUNT / 2;
  read = 0;
  error = false;
  underrun = underrunInternal = overrun = false;
  overrunTime = 0;
  index = 0;
}

void AudioOutputCopyBuffer::update(void)
{
  audio_block_t *b1;
  audio_block_t *b2;
  b1 = receiveReadOnly(0); // input 0
  b2 = receiveReadOnly(1); // input 1
  if (!b1 || !b2) {
    return;
  }
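  // Copy both channels with interrupts masked so the output interrupt never sees a half-written block.
  __disable_irq();
  memcpy(leftBuffer[write], b1->data, AUDIO_BLOCK_SAMPLES * 2); // 2 bytes per int16_t sample
  memcpy(rightBuffer[write], b2->data, AUDIO_BLOCK_SAMPLES * 2);
  __enable_irq();
  // Advance the write slot, wrapping around the ring of COPY_BUFFER_COUNT blocks.
  write++;
  if(write >= COPY_BUFFER_COUNT){ write = 0;}
  // The write slot has caught up with the read slot: flag it for the read side.
  if(write == read){ underrunInternal = true; underrun = true; }
  else{ underrunInternal = false; }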
  release(b1);
  release(b2);
}

void AudioOutputCopyBuffer::readFromBuffers(int16_t *p1, int16_t *p2)
{
  if(write != read){
    *p1 = leftBuffer[read][index];
    *p2 = rightBuffer[read][index];
    index++;
    if(index >= AUDIO_BLOCK_SAMPLES){
      index = 0;
      read++;
      if(read >= COPY_BUFFER_COUNT){
        read = 0;
      }
    }
    overrunCounter++;
  }else{
    overrun = true;
    overrunTime = overrunCounter;
    overrunCounter = 0;
  }
}
void AudioOutputCopyBuffer::readFromBuffersAlternating(int16_t *p1, int16_t *p2, int16_t *p3, int16_t *p4)
{
  if(write != read){
    *p1 = leftBuffer[read][index];
    *p2 = rightBuffer[read][index];
    index++;
    *p3 = leftBuffer[read][index];
    *p4 = rightBuffer[read][index];
    index++;
    if(index >= AUDIO_BLOCK_SAMPLES){
      index = 0;
      read++;
      if(read >= COPY_BUFFER_COUNT){
        read = 0;
      }
    }
  }else{
    if(!underrun && !underrunInternal){ overrun = true; }
  }
}
void AudioOutputCopyBuffer::readFromBuffersBitFlag(int16_t *p1, int16_t *p2, int16_t *p3, int16_t *p4)
{
  if(write != read){
    if(leftBuffer[read][index] < 0){
      *p3 = leftBuffer[read][index] & 32767;
      *p4 = rightBuffer[read][index] & 32767;
    }else{
      *p1 = leftBuffer[read][index];
      *p2 = rightBuffer[read][index];
    }
    index++;
    if(leftBuffer[read][index] < 0){
      *p3 = leftBuffer[read][index] & 32767;
      *p4 = rightBuffer[read][index] & 32767;
    }else{
      *p1 = leftBuffer[read][index];
      *p2 = rightBuffer[read][index];
    }
    index++;
    if(index >= AUDIO_BLOCK_SAMPLES){
      index = 0;
      read++;
      if(read >= COPY_BUFFER_COUNT){
        read = 0;
      }
    }
  }else{
    if(!underrunInternal){ overrun = true; }
  }
}
 
There probably is a way to make the SPI process faster by using DMA instead of an interrupt. There is some complexity there because I need CS written low to high right on a sample downbeat rather than after any SPI processing. The interrupt worked well enough for the time being and did what I needed, so I never got any further with it. But I would be interested in setting this up as a proper output with DMA in the future if you have any advice on how this could work.

Here are the relevant sections of the original code:

C++:
//output callback for spdif audio
void doOutputSPDIF(){
  __disable_irq();
  if(alt){
    digitalWriteFast(SS_PIN, HIGH); //write shift reg latch high every other time, at beginning of interrupt, to ensure consistent timing
  }else{
    copySPDIF->readFromBuffersAlternating(&out[0], &out[1], &out[2], &out[3]); //copy from spdif buffers
    digitalWriteFast(SS_PIN, LOW);
    SPI.transfer16(out[3]); //write values to shift registers
    SPI.transfer16(out[2]);
    SPI.transfer16(out[1]);
    SPI.transfer16(out[0]);
  }
  alt = !alt;
  __enable_irq();
}

C++:
if(spdifIn->pllLocked() && !(spdifIn->sampleRate() > 250000) && !(spdifIn->sampleRate() < 20000)){ //PLL detected and it isn't a junk value
    debug_println("S/PDIF PLL Detected!"); //if spdif is plugged in, boot in spdif audio mode
    debug_println("Booting SPDIF Audio Program");
    copySPDIF = new AudioOutputCopyBuffer();
    patch1 = new AudioConnection(*spdifIn, 0, *copySPDIF, 0);
    patch2 = new AudioConnection(*spdifIn, 1, *copySPDIF, 1);
    prog = &loopSPDIF; //program pointer to spdif loop function
    startSPDIF_interrupt(spdifIn->sampleRate()); //start spdif timer interrupt (separate function to accommodate possible changes in sample rate from pc side while in use)
  }
 