Compatible PSRAM chips with Teensy 4

M4ngu

Hi, would it be possible to use an ESP-PSRAM64H chip with Teensy 4 and AudioEffectDelayExternal?
In the audio design GUI I see 23LC1024 and CY15B104 FRAM chips mentioned.
Thanks
 
IIRC, before the T_4.1 was up and running and exposed the QSPI interface to the PSRAM chips, there were threads on using them over SPI, i.e. single data line SPI I/O.

But I've not seen any of those solutions standardized (like FLASH or SD over SPI) and integrated for system use.

Or did I miss where a PSRAM could be populated, for instance on the PJRC Audio board, for SPI access for general usage?
 
As AudioEffectDelayExternal exists today, no. It doesn't support the 8 MB PSRAM chip.

But it should be possible to add PSRAM chip support to AudioEffectDelayExternal, since it's pretty similar to 23LC1024. Some programming required....
 
Should be quite easy to add.
 
Thanks a lot for the answers,
so not currently, but it could be done.
That would allow some really long delays and a looper function, which is a pretty attractive feature.
 
We should probably also (someday) extend AudioEffectDelayExternal to have a mode where it just works with a user allocated buffer. Then it could be used with 16 MB on Teensy 4.1. Not only can you have twice as much memory that way, but the QSPI transfer runs with 4 bits, uses a much higher clock speed and is cached, so you take much less of a performance hit than with slow 1 bit SPI.
 
Funnily enough I just started looking at this yesterday, as part of the dynamic audio objects development. It was mainly to do with being able to re-allocate the external memory when AudioEffectDelayExternal objects were created and destroyed, but as a side-effect I implemented an AUDIO_MEMORY_PSRAM64 option. Still in development, but I should probably aim to add AUDIO_MEMORY_EXTMEM and AUDIO_MEMORY_MALLOC options, too.

A limiting factor will probably be SPI speed: at the moment it looks like it's set to 20MHz, so a rough estimate suggests a pair of 1-input 4-tap external delays will take about 33% of the CPU time. Not sure if SPI is set up to use DMA; if so, a more complex implementation could reduce the CPU load, though total SPI bus bandwidth will still impose an upper limit of course.

Note to self - also look at allowing for use of the other [Flex]SPI busses.
 
v0.10-alpha of the dynamic audio library now implements options for an 8MB PSRAM fitted to the audio adaptor, and / or direct use of EXTMEM and / or heap. Rather than pass in pointers, I've implemented AUDIO_MEMORY_PSRAM64, AUDIO_MEMORY_EXTMEM, and AUDIO_MEMORY_HEAP options in the AudioEffectDelayExternal constructor. You can mix the memory types across delay objects (only one type per object, though!), so you could have a mix of PSRAM64 and EXTMEM for a total delay memory of 24MB or nearly 5 minutes. A 9-track looper with 30s of delay memory per loop seems feasible...
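The "24MB or nearly 5 minutes" figure can be checked with a little arithmetic. This is only a sketch: it assumes 16-bit mono samples at 44.1 kHz and treats 1 MB as 1024 x 1024 bytes, and the helper names are invented here, not taken from the library.

```cpp
#include <cstdint>

// Assumptions (not from the library): 16-bit mono samples at 44.1 kHz.
constexpr double kSampleRateHz   = 44100.0;
constexpr double kBytesPerSample = 2.0;

// Hypothetical helper: convert a size in MB (1024 * 1024 bytes) to bytes.
constexpr std::uint64_t megabytes(std::uint64_t n) {
    return n * 1024u * 1024u;
}

// Seconds of audio that fit in `bytes` of delay memory.
constexpr double delaySeconds(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (kSampleRateHz * kBytesPerSample);
}

// 8 MB on the audio adaptor plus 16 MB of EXTMEM on a Teensy 4.1.
constexpr double kTotalSeconds = delaySeconds(megabytes(8) + megabytes(16));
```

Under those assumptions the total comes to roughly 285 seconds, so nine 30-second loops (270 seconds) would fit with a little to spare.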

Caveat emptor - I've not given this a huge amount of testing.
 
Yup. The same QSPI RAM that you add for 8 or 16MB on T4.1 will work on the audio adaptor, though the one on the adaptor is clocked slower and uses a single-bit interface, so it gives a higher CPU overhead.
 
Did you have to write 'SPI' interface code to get that to work on the Audio adapter? I had not seen that done so I never put any PSRAM chips there.
 
The SPI code was already there for the 23LC1024, so I just checked the datasheets to ensure that single-wire mode was valid for the QSPI PSRAM, and the [used subset of the] command set was the same. All looked OK, so I tested it unchanged, then implemented a separate memory type when it all worked. Luckily for me there were lots of clues in the existing code so that was pretty easy.
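To illustrate what "the same command set" means in single-wire mode, here is a sketch of the transfer header both parts accept. It assumes the opcodes as I read them in the 23LC1024 and ESP-PSRAM64H datasheets (0x03 for read, 0x02 for write, followed by a 24-bit address); check the datasheet for your exact part. The function name is made up for this example and is not the library's code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Opcodes assumed from the datasheets; verify for your exact chip.
constexpr std::uint8_t kCmdWrite = 0x02;
constexpr std::uint8_t kCmdRead  = 0x03;

// Hypothetical helper: build the 4-byte header sent after chip select
// goes low, i.e. the opcode followed by a 24-bit byte address, MSB first.
inline std::array<std::uint8_t, 4> spiHeader(std::uint8_t cmd, std::uint32_t byteAddr) {
    assert(byteAddr < (1u << 24));  // only 24 address bits are sent
    return {{
        cmd,
        static_cast<std::uint8_t>((byteAddr >> 16) & 0xFF),
        static_cast<std::uint8_t>((byteAddr >> 8) & 0xFF),
        static_cast<std::uint8_t>(byteAddr & 0xFF)
    }};
}
```

The data bytes follow the header on the same single data line, which is why code written for the smaller SRAM can address the larger PSRAM without changing the transfer format.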

The hardest part (and still rather untested) was doing crude "heap" management on the adapter RAM, so the dynamic version can give back its memory if you delete an AudioEffectDelayExternal object. The previous "management" simply looked at how much had already ever been allocated, and started from there: fine for a static system, not so useful for a dynamic one. The whole allocation system is in a new AudioExtMem base class, so any other (future) audio objects that need gobs of RAM should be able to co-exist happily with the delay.
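For readers wondering what crude "heap" management over an external RAM might look like, here is a minimal first-fit sketch. It is not the AudioExtMem code: the class name, the block list and the use of sample units are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical first-fit allocator over an external RAM address range.
// Addresses and sizes are in samples; nothing here touches real hardware.
class ExtRamPool {
public:
    static constexpr std::uint32_t kFail = 0xFFFFFFFFu;

    explicit ExtRamPool(std::uint32_t sizeSamples) {
        assert(sizeSamples > 0);
        blocks_.push_back({0, sizeSamples, false});
    }

    // Returns the start address, or kFail if no free block is big enough.
    std::uint32_t allocate(std::uint32_t n) {
        if (n == 0) return kFail;
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            if (blocks_[i].used || blocks_[i].size < n) continue;
            const std::uint32_t start = blocks_[i].start;
            if (blocks_[i].size > n) {
                // Split: the remainder stays free after the allocated part.
                const Block rest{start + n, blocks_[i].size - n, false};
                blocks_[i].size = n;
                blocks_.insert(blocks_.begin() + i + 1, rest);
            }
            blocks_[i].used = true;
            return start;
        }
        return kFail;
    }

    // Frees the block starting at `start` and merges it with free neighbours.
    void release(std::uint32_t start) {
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            if (blocks_[i].start != start || !blocks_[i].used) continue;
            blocks_[i].used = false;
            if (i + 1 < blocks_.size() && !blocks_[i + 1].used) {
                blocks_[i].size += blocks_[i + 1].size;
                blocks_.erase(blocks_.begin() + i + 1);
            }
            if (i > 0 && !blocks_[i - 1].used) {
                blocks_[i - 1].size += blocks_[i].size;
                blocks_.erase(blocks_.begin() + i);
            }
            return;
        }
    }

private:
    struct Block { std::uint32_t start; std::uint32_t size; bool used; };
    std::vector<Block> blocks_;  // kept sorted by start address
};
```

Merging neighbours on release is what lets a deleted delay object's memory be reused by a later, larger one, which the "start after whatever was ever allocated" scheme cannot do.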
 
I've made a static library version of the update, and opened PR#433 in the repo, so everyone can share the joy. If you want to be an early adopter, my branch can be found here. You only need the updated effect_delay_ext.cpp and .h, the all-new extmem.cpp and .h, and the updated gui/index.html for the documentation.
 
Um. The part number and link appear to be different … the link is to a Flash memory, not RAM, so not directly relevant to this thread, though you may be able to use it for other purposes - it looks like the one illustrated on the Teensy 4.1 page at PJRC.
 
I'm curious how much (how long in time) audio delay you're planning to use? And how many output taps, and how many instances? Each instance can give 8 output taps, but you need a new instance for each different input signal you wish to delay. Also, will your project be doing other effects, or just simple delay?

1 bit SPI consumes (wastes) much of the available CPU time, because the bus is slow and it's implemented with a blocking API. Even if you manage to somehow connect a huge amount of RAM, the SPI overhead limits how many instances and taps you can realistically use. 4 bit QSPI on the bottom side of Teensy 4.1 is much faster than 1 bit SPI: it's cached, cache misses run at a 3-4 times higher clock speed than regular SPI, and after the command overhead QSPI moves 4 bits at a time. But even that speed is very slow compared to the internal RAM, which runs at about twice the clock speed of QSPI and moves 64 bits per clock rather than only 4. If you're going to also use the CPU for non-delay effects, wasting so much CPU time waiting on slow SPI may not be a good plan.
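To put rough numbers on that comparison, peak throughput scales as clock speed times bus width. The sketch below uses only the ratios quoted above (QSPI at 3-4 times the SPI clock and 4 bits wide, internal RAM at about twice the QSPI clock and 64 bits wide) and ignores command overhead, caching and the blocking API, so treat the results as upper bounds rather than measurements. The names are invented for this example.

```cpp
// Relative peak throughput = relative clock speed x bus width in bits.
constexpr double peakThroughput(double relativeClock, int busWidthBits) {
    return relativeClock * busWidthBits;
}

// 1 bit SPI is the baseline: clock 1.0, 1 bit wide.
constexpr double kSpi      = peakThroughput(1.0, 1);
// QSPI: 3-4 times the SPI clock, 4 bits wide.
constexpr double kQspiLow  = peakThroughput(3.0, 4);
constexpr double kQspiHigh = peakThroughput(4.0, 4);
// Internal RAM: about twice the QSPI clock, 64 bits wide.
constexpr double kRamLow   = peakThroughput(2.0 * 3.0, 64);
constexpr double kRamHigh  = peakThroughput(2.0 * 4.0, 64);
```

On these figures QSPI peaks at 12-16 times the SPI rate, and internal RAM at about 32 times the QSPI rate.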

When it comes to technical design choices, there are 2 modes of thought called maximizing versus satisfying. Thinking in terms of maximizing can be particularly dangerous when a lot of complicated trade-offs are involved and your every thought is regarding how to maximize something like the amount of memory without necessarily considering the trade-off costs.

But as a general rule, satisfying is usually the best way to think about designing real-time systems like audio, video, motion control, etc. Ultimately the amount of work to be done is dictated by the task at hand and absolutely must be completed on schedule to avoid problems like audio glitches. With real-time processing, extra capacity beyond exactly what is needed goes unused, because the amount of work per time is fixed. The maximizing thought process is often useful for non-real-time tasks, like file transfers, where increasing the raw processing speed causes the work to be done sooner, but thinking in terms of maximizing rather than satisfying for a real-time task like audio processing usually isn't a good way to make design choices.

So I believe you really should put some thought into how much memory is really needed, and how many separate delay instances and the number of taps per instance will be needed. It's easy to always want more, but is more really even useful? I can tell you from experience years ago when I first tested the code for Frank's 6-chip memory board, delaying a sound by 8-9 seconds is pretty much a novelty. It's such a long delay that by the time the sound finally comes out, the world (or my attention span) has moved on. Maybe you have some particular need for this sort of very long or even longer delay, and if that really is the case, I would very much like to understand what sort of application really would use an audio delay that long?
 
This is a very interesting post, Paul!

The Teensy community seems to have a huge spectrum of users, from hobbyists to professionals, beginners to experienced, and those with a plan to those just tinkering. Around here I'd call myself an experienced hobbyist who's tinkering, for the most part. But I'm close to a beginner on Teensy; my day job includes a lot of software design and understanding hardware; and occasionally I find something where I have a plan, or at least an end goal. I started out with a vague goal of building "yet another Teensy-based synthesizer", but have happily diverted into satisfying some of my prerequisites for the beast by delving into the audio infrastructure, and doing a bit of maximisation by implementing the dynamic updates concept. Some degree of fixing up AudioEffectDelayExternal was needed to satisfy the need to work within that framework, but while I was there I took the opportunity to push its capabilities further towards some inchoate "maximum".

However, in some cases, and this is one of them, further progress is stymied by low-level infrastructure that satisfies but does not maximise. In this case, as you note, SPI is quite slow, but more crucially the library as-is is blocking. The first attribute seems pretty much inherent, but as far as I can see, the latter is definitely not: it would be possible to implement an asynchronous SPI API, it just wasn't done. As a result, until someone with the motivation and time to do so comes along, a 20MHz SPI bus would take 100% CPU to support a measly 28 audio streams (roughly - I've not taken overhead into account). Of course, even if the CPU load is reduced, you still only get 28 streams, but you could at least run other effects, or even a second asynchronous SPI bus.

A lot of why we settle for satisfaction rather than maximisation (apart from personal bandwidth, of course!) is that we as developers don't necessarily perceive the need. "Who needs a massively long delay - it's just a novelty?". I've had "who needs dynamic audio objects?" thrown at me. Well, I think I do, which is good enough for me to put in many hours of effort: I'm satisfied, and we're all closer to maximisation. Similarly with long delays; I'm not a looper myself, but I think that's one use, and an 8-track looper box would only need 16 of the possible 28 streams so the CPU load is "only" 57% (less if you bump the SPI clock, of course).
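The stream counts above follow from dividing the SPI clock by the bit rate of one audio stream. A rough sketch, assuming 16-bit samples at 44.1 kHz, no command or chip-select overhead, and a CPU that is fully blocked for the duration of each transfer (function and constant names are made up here):

```cpp
// Assumptions: 16-bit samples at 44.1 kHz, zero protocol overhead.
constexpr double kAudioRateHz   = 44100.0;
constexpr double kBitsPerSample = 16.0;

// Number of mono audio streams a 1 bit SPI bus can carry at `spiClockHz`.
constexpr double maxStreams(double spiClockHz) {
    return spiClockHz / (kAudioRateHz * kBitsPerSample);
}

// CPU load in percent if every transfer blocks the CPU.
constexpr double cpuLoadPercent(int streams, double spiClockHz) {
    return 100.0 * streams / maxStreams(spiClockHz);
}

// 8-track looper: one write stream and one read stream per track.
constexpr double kLooperLoad = cpuLoadPercent(16, 20e6);
```

At 20 MHz this gives a little over 28 streams, and about 56% load for the 16-stream looper, in line with the estimates above. Doubling the SPI clock would halve the load for the same stream count.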

In library design, I'd suggest that maximisation is much closer to satisfaction than it is in (embedded) application design, where once it works OK no-one will care if it uses 90% or 9% of resources. Though if it's 9%, you've probably over-specified something...
 
Oh life is bigger
It's bigger than you
And you are not me
The lengths that I will go to
R.E.M. "Losing My Religion"

"Life? Don't talk to me about life."
Marvin the Paranoid Android "The Hitchhiker's Guide to the Galaxy"
 
Hi Paul!
If I understand correctly, the standard SPI has plenty of speed for supporting, for example, multiple delays for different audio streams, but the problem is that the software implementation is blocking? Would the high CPU usage be solved by using DMA for the SPI transfers with the RAM chips?

I am mainly interested in this as I am considering using the MicroMod Teensy in a project and that does not have QSPI I think?

Best,
Miro
 
Would the high CPU usage be solved by using DMA for the SPI transfers with the RAM chips?

DMA alone is not enough. Some sort of scheduler layer which maintains & executes a list of queued transfer requests would need to be added to the SPI library. It is theoretically possible, but far from easy, especially to integrate without adding bugs to the many libraries which use SPI from main program and interrupt contexts.

Then of course the delays would need to be reworked to queue transfers they need in advance, rather than doing them when needed with the normal blocking API.
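As a very rough picture of the "list of queued transfer requests" idea, the sketch below shows a fixed-size FIFO that audio objects could fill in advance and a bus driver could drain whenever the previous transfer finishes. Everything here is hypothetical: neither type exists in the Teensy SPI library, and a real version would also need interrupt-safe access, chip-select handling and the DMA setup itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one external RAM transfer.
struct SpiRequest {
    std::uint32_t address;  // byte address in the external RAM
    std::int16_t* buffer;   // samples to write, or destination for a read
    std::size_t   count;    // number of samples to move
    bool          isWrite;  // true = write to RAM, false = read from RAM
};

// Fixed-capacity FIFO of pending requests (no dynamic allocation).
template <std::size_t N>
class SpiRequestQueue {
public:
    // Called by an audio object ahead of time; false means the queue is full.
    bool enqueue(const SpiRequest& r) {
        assert(r.count > 0);  // zero-length transfers are a caller bug
        if (size_ == N) return false;
        slots_[(head_ + size_) % N] = r;
        ++size_;
        return true;
    }

    // Called when the bus goes idle (for example from a transfer-complete
    // interrupt) to fetch the next request; false means nothing is pending.
    bool dequeue(SpiRequest& out) {
        if (size_ == 0) return false;
        out = slots_[head_];
        head_ = (head_ + 1) % N;
        --size_;
        return true;
    }

    std::size_t size() const { return size_; }

private:
    SpiRequest  slots_[N] = {};
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};
```

The hard part described above is not the queue itself but fitting it under every library that already calls the blocking SPI API from main program and interrupt contexts.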
 
Thanks for the reply, Paul!
That seems to be way over my skill level currently. Guess I will stick to T4.1 and its QSPI for now.
 