ERAM Performance

Hello all :) I have been developing with the Teensy 4.1 for about a year now and have built some pretty impressive realtime DSP software on it, but the issue I keep bumping up against is RAM. The 600MHz CPU with floating point is a dream for DSP and I can do almost as much computation as my heart desires while operating at a 48kHz sample rate.

I hoped that adding ERAM would help with the memory ceiling, but I have not been able to get it to perform at 48kHz. Everything runs just fine, but the lookups are not happening fast enough to produce sound, for example when doing wave table lookups for direct digital synthesis.

I do not have a ton of experience with embedded systems and am learning as I go here, but is there any way to increase ERAM access speeds so I can expand my overall memory footprint? Use case here could be for realtime sampling and granular processing where the entire sample needs to be in scope at a process rate of 48kHz.
 
I have a strong feeling that some important information is missing from your question.
Even simple SPI at 20 MHz, not cached, was fast enough to play >10 channels at 44 kHz on a Teensy 3.2.
ERAM is cached and uses quad SPI. It should be able to do what you want.
I can think of one issue: as you have a ton of experience, you know that a random access over SPI or quad SPI first needs to send a command, then an address over the "slow" bus. That adds up to a lot of cycles per access.
So it might be just this access pattern that leads to your observation?
Lookup tables are not a good fit for SPI/quad SPI.
It's OK if they fit into the data cache, and as long as the cache is not needed by other things.
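
To make the command-and-address overhead described above a bit more concrete, here is a rough cost model in C++. It is only a sketch: the names `BusTiming`, `transactionClocks` and `payloadFraction` are made up for illustration, and any clock counts you feed in are assumptions, not figures from the PSRAM datasheet or measurements on a Teensy.

```cpp
#include <cstdint>

// Clock counts for one read transaction on a serial memory bus.
// All values are supplied by the caller and are assumptions for illustration.
struct BusTiming {
    uint32_t commandClocks;  // clocks spent sending the read command
    uint32_t addressClocks;  // clocks spent sending the address
    uint32_t dummyClocks;    // wait clocks before the first data byte
    uint32_t clocksPerByte;  // clocks per data byte (2 on a 4-bit-wide bus)
};

// Total bus clocks to read `bytes` contiguous bytes in a single transaction.
constexpr uint32_t transactionClocks(const BusTiming& t, uint32_t bytes) {
    return t.commandClocks + t.addressClocks + t.dummyClocks +
           t.clocksPerByte * bytes;
}

// Fraction of those clocks that actually carry data.
constexpr double payloadFraction(const BusTiming& t, uint32_t bytes) {
    return static_cast<double>(t.clocksPerByte * bytes) /
           static_cast<double>(transactionClocks(t, bytes));
}
```

With a placeholder timing of `BusTiming{2, 6, 6, 2}`, a 2-byte read costs 18 clocks and a 32-byte read costs 78, so the fixed overhead dominates small scattered reads and is amortized by longer sequential ones.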
 
Thanks for the quick response Frank! And I am certain you are right that I am missing some info due to my lack of experience here. And to clarify, I do NOT have a ton of experience :) so I am not sure about the operation details of SPI or quad SPI and command sending over the "slow" address bus.

Gotcha, so the data still needs to be pulled into cache. By cache here you mean the lower 512k RAM in the Teensy 4.x case, right? I think I can free up a significant portion of lower RAM if I can move my wave tables to ERAM. But they would just be getting cached anyways and just filling it back up?

Overall, I think my lack of understanding here is the cause of the performance issue. If I designed the SW to optimize for ERAM, it sounds like it could work.
 
If you need more ram, try the usual things:
Keep code and const data in flash.
It has its own cache.

Use the FLASHMEM and PROGMEM keywords. That way code and const tables are not copied to the RAM.
Use the heap. Use local variables to use the stack.
Try to use a different linker script that does not use ITCM (I'm writing on my smartphone.. can post the link later if wanted)
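
As a minimal sketch of the FLASHMEM/PROGMEM suggestion above: the table contents and the names `kStepTable` and `sumStepTable` are placeholders. On a Teensy 4.x the macros come from the core headers. The empty fallbacks below are only there so the same file also compiles on a desktop compiler.

```cpp
#include <cstdint>

#if defined(ARDUINO)
#include <Arduino.h>  // provides PROGMEM and FLASHMEM on Teensy 4.x
#endif
#ifndef PROGMEM
#define PROGMEM       // desktop fallback: no special placement
#endif
#ifndef FLASHMEM
#define FLASHMEM      // desktop fallback: no special placement
#endif

// Constant table intended to stay in flash rather than being copied to RAM.
// The values are placeholders.
PROGMEM const int16_t kStepTable[8] = {0, 100, 200, 300, 400, 500, 600, 700};

// Rarely called code (setup, UI, configuration) marked to run from flash.
FLASHMEM int32_t sumStepTable() {
    int32_t sum = 0;
    for (int16_t v : kStepTable) {
        sum += v;
    }
    return sum;
}
```

Code that runs on every sample is usually left without FLASHMEM so it stays in the fast RAM. Only the tables and functions that can tolerate flash latency get the annotations.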
 
Crosspost..you answered while i was still writing.. :)
Oops.. had not seen the "not".

No, the cache is a dedicated 32 KB RAM that is handled automatically.
 
I have used PROGMEM extensively and only have code in RAM that needs to be run at faster speeds. The type of random access I am trying to do, for say granular processing, requires basically as much RAM as I can possibly get. I am able to sample ~2s of a mono signal at 48kHz in upper RAM (about 400kB worth) while still leaving some room for UI state etc, which is just barely adequate. The lower RAM is full of runtime code that needs to operate at process time for UI and DSP processing. I was able to use about 6% of flash using FLASHMEM and PROGMEM annotations.
 
When you say "ERAM", I'm assuming you mean this PSRAM memory added to the bottom side of Teensy 4.1?

https://www.pjrc.com/store/psram.html

If you've used something else, please be specific.

And when you say "get it to perform at 48kHz", are you talking about using the Teensy Audio Library at 48 kHz sample rate, or some other code? This is particularly important, because if you're using other code which generates a new interrupt for each individual sample, achieving good performance is really hard. That way involves tremendous overhead. The audio library processes audio in blocks of 128 samples, which allows a lot of overhead to be done at only 375 Hz.
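
The block-rate arithmetic above can be sketched as follows. This is not the Audio Library's code: `processBlock` and `gOverheadEvents` are hypothetical names, and the counter merely stands in for whatever fixed cost (interrupt entry, state loading) is paid once per call.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t kSampleRateHz = 48000;
constexpr size_t kBlockSize = 128;

// How often per-call overhead is paid: 48000 / 128 = 375 times per second,
// instead of 48000 times per second when handling one sample per interrupt.
constexpr uint32_t kBlockRateHz = kSampleRateHz / kBlockSize;

// Hypothetical counter standing in for fixed per-call overhead.
static uint32_t gOverheadEvents = 0;

// Apply a gain to a whole block while paying the fixed overhead only once.
void processBlock(const float* in, float* out, size_t n, float gain) {
    ++gOverheadEvents;  // one overhead event per block, not per sample
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}
```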

The M7 processor has two 32K caches (one for data, one for instructions) which are used when you access EXTMEM and DMAMEM. It's been a while since I looked at the cache details, but I recall it's 4-way set associative. So doing one or two other tasks elsewhere in memory isn't likely to discard cache rows your waveform synthesis might be using. But as you scale up to more processing, each time you perform some work you may be looking at an essentially cold cache. If you process 128 samples, you'll probably suffer a lot of cache misses on the first couple of loop iterations, but then probably enjoy the cache performance benefit for most of the rest. But if you process just 1 sample, not only do you incur all sorts of overhead each time, but you'll likely be doing that without much help from the cache.

So these sorts of details, like which code you're using and if you've written all your own from scratch how it actually works, matter greatly when it comes to performance optimization.
 
Yes I do mean PSRAM added to the bottom of the Teensy 4.1, 16MB in total.

No I am not using the Teensy Audio library, although I did reference it while working on my implementation. I am not using a universal buffer like you have in the Audio libs, but rather create buffers as needed in individual audio processing modules. One example of which is a granular processing module, which accepts one sample at a time, but buffers in a circular buffer located in upper RAM. I really have seen no performance issues for the system in terms of CPU cycles even when running at 96kHz, this chip is screaming fast and because I ported my implementation from a much less powerful execution context my code was already fairly optimized.
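
For readers following along, a circular buffer of the kind described above might look roughly like this. It is a guess at the general shape, not the actual module: `RingBuffer`, `gStorage` and `kCapacity` are invented names, and the capacity is a placeholder. The DMAMEM fallback only exists so the sketch compiles off-target.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(ARDUINO)
#include <Arduino.h>  // provides DMAMEM on Teensy 4.x
#endif
#ifndef DMAMEM
#define DMAMEM        // desktop fallback: no special placement
#endif

constexpr size_t kCapacity = 1024;  // placeholder size for the example

// Backing storage intended for the upper RAM region. Write to it before
// reading; do not rely on it starting out zeroed.
DMAMEM static float gStorage[kCapacity];

struct RingBuffer {
    float* data;
    size_t capacity;
    size_t writeIndex;

    // Store one incoming sample and advance, wrapping at the end.
    void push(float sample) {
        data[writeIndex] = sample;
        writeIndex = (writeIndex + 1) % capacity;
    }

    // Read the sample written `delay` pushes ago (1 = most recent).
    // `delay` must be in the range 1..capacity.
    float readBack(size_t delay) const {
        return data[(writeIndex + capacity - delay) % capacity];
    }
};
```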

But what you are saying about cache misses and interrupt overhead makes perfect sense, and I see now why you chose to implement the Audio lib with a lockable buffer passed between processing units.

I think it also makes sense that ERAM does not work well for storing waves for direct digital synthesis because, by its very nature, the lookups are non-sequential as frequency increases. Similarly for granular processing you may have 100 grains spread out across a 400kB buffer each with only 300-1000 samples each, and within those grains pitch shifting acts very similarly to wave table lookups for DDS.
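
A small host-side sketch can illustrate the point about lookups becoming less sequential as frequency rises. It counts how many distinct 32-byte cache lines (the Cortex-M7 line size) a phase-accumulator oscillator touches for a given stride through a table of `float` samples. The function name `linesTouched` and the assumption that the table starts on a line boundary are mine. It models only which lines are touched, not actual timing.

```cpp
#include <cstddef>
#include <set>

constexpr size_t kCacheLineBytes = 32;  // Cortex-M7 cache line size

// Count distinct cache lines touched by `lookups` table reads when the
// phase advances by `stride` samples per output sample.
// Assumes a line-aligned table of `tableLen` floats and stride < tableLen.
size_t linesTouched(size_t tableLen, double stride, size_t lookups) {
    std::set<size_t> lines;
    double phase = 0.0;
    const double length = static_cast<double>(tableLen);
    for (size_t i = 0; i < lookups; ++i) {
        const size_t index = static_cast<size_t>(phase);  // truncating lookup
        lines.insert(index * sizeof(float) / kCacheLineBytes);
        phase += stride;
        if (phase >= length) {
            phase -= length;  // wrap around the table
        }
    }
    return lines.size();
}
```

For 128 lookups into a 65536-sample table, a stride of 1 touches 16 lines while a stride of 8 touches 128, one new line per sample. Many grains reading at different offsets multiply that effect.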

I am happy to provide more details and code samples, but I am beginning to get the sense that I need to do some careful thinking about my design here.
 
Maybe the answer is my SW design is not too bad (obv it can always be optimized further) and I really just need more on chip RAM to support my whacky data access patterns into large arrays... so I need to just wait for the Teensy 5.x to be released! ;)

jk, but seriously I love what you guys are doing here, keep up the good work!!!
 