I'm curious how much (how long in time) audio delay you're planning to use? And how many output taps, and how many instances? Each instance can give 8 output taps, but you need a new instance for each different input signal you wish to delay. Also, will your project be doing other effects, or just simple delay?
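To make the "8 taps per instance, one instance per input signal" point concrete, here's a minimal illustrative sketch using the Audio library's AudioEffectDelayExternal. The memory type (a 23LC1024 SPI RAM) and the specific tap times are just assumptions for the example:

```cpp
// Illustrative: one AudioEffectDelayExternal instance delaying ONE input
// signal, with its taps (channels 0-7, up to 8 total) used as outputs.
// Assumes a 23LC1024 SPI RAM chip is connected.
#include <Audio.h>

AudioInputI2S            in;
AudioEffectDelayExternal delayExt(AUDIO_MEMORY_23LC1024);
AudioMixer4              mix;
AudioOutputI2S           out;

AudioConnection c1(in, 0, delayExt, 0);   // the one input this instance delays
AudioConnection c2(delayExt, 0, mix, 0);  // tap 0
AudioConnection c3(delayExt, 1, mix, 1);  // tap 1
AudioConnection c4(mix, 0, out, 0);

void setup() {
  AudioMemory(12);
  delayExt.delay(0, 150.0);  // tap 0: 150 ms
  delayExt.delay(1, 300.0);  // tap 1: 300 ms
  // Delaying a second, different input signal would require
  // a second AudioEffectDelayExternal instance.
}

void loop() {}
```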
The slowness of 1-bit SPI really adds up, consuming (wasting) much of the available CPU time, because it's slow and implemented with a blocking API. Even if you manage to somehow connect a huge amount of RAM, the SPI overhead limits how many instances and taps you can realistically use. The 4-bit QSPI on the bottom side of Teensy 4.1 is much faster than 1-bit SPI: it's cached, cache misses run at a 3-4 times higher clock speed than regular SPI, and after the command overhead QSPI moves 4 bits at a time. But even that speed is very slow compared to the internal RAM, which runs at about twice the clock speed of QSPI and moves 64 bits per clock rather than only 4. If you're going to also use the CPU for non-delay effects, wasting so much CPU time waiting on slow SPI may not be a good plan.
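A quick back-of-envelope calculation makes the bandwidth gap vivid. The absolute clock figures below are illustrative assumptions, not measured values; they only encode the relative relationships described above (QSPI clocked roughly 3-4x plain SPI, internal RAM at about twice the QSPI clock):

```python
# Rough relative bandwidth of the three memory paths.
# All clock numbers are assumed for illustration only.

spi_clock_mhz = 30.0                    # assumed plain SPI clock
qspi_clock_mhz = spi_clock_mhz * 3.5    # "3-4 times higher clock speed"
ram_clock_mhz = qspi_clock_mhz * 2.0    # "about twice the clock speed of QSPI"

spi_mbps = spi_clock_mhz * 1    # SPI moves 1 bit per clock
qspi_mbps = qspi_clock_mhz * 4  # QSPI moves 4 bits per clock
ram_mbps = ram_clock_mhz * 64   # internal RAM moves 64 bits per clock

print(f"QSPI vs SPI:      {qspi_mbps / spi_mbps:.0f}x")   # 14x
print(f"RAM  vs QSPI:     {ram_mbps / qspi_mbps:.0f}x")   # 32x
print(f"RAM  vs SPI:      {ram_mbps / spi_mbps:.0f}x")    # 448x
```

Whatever exact clocks you plug in, the internal RAM path comes out orders of magnitude faster than 1-bit SPI, which is the point.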
When it comes to technical design choices, there are two modes of thought, called maximizing versus satisfying. Thinking in terms of maximizing can be particularly dangerous when a lot of complicated trade-offs are involved and every thought is focused on maximizing something like the amount of memory, without necessarily considering the trade-off costs.
But as a general rule, satisfying is usually the best way to think about designing real-time systems like audio, video, motion control, etc. Ultimately the amount of work to be done is dictated by the task at hand and absolutely must be completed on schedule to avoid problems like audio glitches. With real-time processing, extra capacity beyond exactly what is needed goes unused, because the amount of work per time is fixed. The maximizing thought process is often useful for non-real-time tasks, like file transfers, where increasing the raw processing speed causes the work to be done sooner, but thinking in terms of maximizing rather than satisfying for a real-time task like audio processing usually isn't a good way to make design choices.
So I believe you really should put some thought into how much memory is actually needed, and how many separate delay instances and taps per instance you'll use. It's easy to always want more, but is more really even useful? I can tell you from experience, years ago when I first tested the code for Frank's 6-chip memory board, that delaying a sound by 8-9 seconds is pretty much a novelty. It's such a long delay that by the time the sound finally comes out, the world (or my attention span) has moved on. Maybe you have some particular need for this sort of very long or even longer delay, and if that really is the case, I'd very much like to understand: what sort of application actually needs an audio delay that long?
This is a very interesting post, Paul!
The Teensy community seems to have a huge spectrum of users, from hobbyists to professionals, beginners to experienced, and those with a plan to those just tinkering. Around here I'd call myself an experienced hobbyist who's tinkering, for the most part. But I'm close to a beginner on Teensy; my day job includes a lot of software design and understanding hardware; and occasionally I find something where I have a plan, or at least an end goal. I started out with a vague goal of building "yet another Teensy-based synthesizer", but have happily diverted into satisfying some of
my prerequisites for the beast by delving into the audio infrastructure, and doing a bit of maximisation by implementing the dynamic updates concept. Some degree of fixing up AudioEffectDelayExternal was needed to satisfy the need to work within that framework, but while I was there I took the opportunity to push its capabilities further towards some inchoate "maximum".
However, in some cases, and this is one of them, further progress is stymied by low-level infrastructure that satisfies but does not maximise. In this case, as you note, SPI is quite slow, but more crucially the library as-is is blocking. The first attribute seems pretty much inherent, but as far as I can see, the latter is definitely not: it would be
possible to implement an asynchronous SPI API; it just wasn't done. As a result, until someone with the motivation and time to do so comes along, a 20MHz SPI bus would take 100% CPU to support a measly 28 audio streams (roughly; I've not taken overhead into account). Of course, even if the CPU load is reduced, you still only get 28 streams, but you could at least run other effects, or even a second asynchronous SPI bus.
A lot of why we settle for satisfaction rather than maximisation (apart from personal bandwidth, of course!) is that we as developers don't necessarily perceive the need. "Who needs a massively long delay - it's just a novelty?". I've had "who needs dynamic audio objects?" thrown at me. Well, I
think I do, which is good enough for me to put in many hours of effort: I'm satisfied, and we're all closer to maximisation. Similarly with long delays; I'm not a looper myself, but I think that's one use, and an 8-track looper box would only need 16 of the possible 28 streams so the CPU load is "only" 57% (less if you bump the SPI clock, of course).
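The stream-count and load figures above are easy to sanity-check. This ignores SPI command/address overhead, as the rough estimate does, and assumes the Audio library's 16-bit samples at 44.1 kHz:

```python
# Sanity check: how many 16-bit, 44.1 kHz audio streams fit in a
# 20 MHz SPI bus, and what fraction an 8-track looper would use.
# Overhead (commands, addresses, gaps) is deliberately ignored.

spi_hz = 20_000_000       # 20 MHz SPI bus
sample_rate = 44_100      # Audio library sample rate
bits_per_sample = 16

stream_bps = sample_rate * bits_per_sample   # 705,600 bits/s per stream
max_streams = spi_hz // stream_bps           # -> 28 streams

# 8-track looper: each track needs one write stream and one read stream
looper_streams = 8 * 2                       # -> 16 streams
cpu_load_pct = round(looper_streams / max_streams * 100)  # -> 57%

print(max_streams, cpu_load_pct)             # 28 57
```

Bumping the SPI clock raises `max_streams` proportionally, which is why the looper load drops if you clock the bus faster.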
In library design, I'd suggest that maximisation is much closer to satisfaction than it is in (embedded) application design, where once it works OK no-one will care if it uses 90% or 9% of resources. Though if it's 9%, you've probably over-specified something...