Low-latency multi-voice sample playback

yeahtuna

I'm starting this thread as a blog.

I'm trying to get low-latency multi-voice playback from files stored on an SSD connected to a USB host port. For now, I'm working with a T3.6 with headphones connected to the DAC pins through 1uF ceramic caps. Sounds like crap, but it'll do for testing. I have a custom T4.1 in the works with an integrated audio board circuit, but until the parts arrive, I'll settle for the T3.6.

There's an example project with a module (AudioPlayUSBWav) for playing back wave files from the USB host port. The code is spaghetti, but it got me started. I reworked it to read the wave header and prebuffer some data, so that when I want to play a track, it plays immediately. This dropped the time required to start playing from 900ms down to 70us. Huge improvement. Using this method, I could play back a max of 4 files simultaneously without any dropouts.
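For anyone curious, the header step boils down to walking the RIFF chunks until you hit the data chunk so you know where the samples start. A rough host-side sketch (illustrative names, not the actual AudioPlayUSBWav code):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Minimal RIFF/WAVE chunk walker: returns the byte offset of the "data"
// chunk's payload, or -1 if not found. Assumes a little-endian target
// (true for Teensy and most hosts).
long findWavDataOffset(const uint8_t *buf, size_t len) {
    if (len < 12 || memcmp(buf, "RIFF", 4) || memcmp(buf + 8, "WAVE", 4))
        return -1;
    size_t pos = 12;                            // first chunk after the RIFF header
    while (pos + 8 <= len) {
        uint32_t chunkSize;
        memcpy(&chunkSize, buf + pos + 4, 4);   // chunk length, little-endian
        if (!memcmp(buf + pos, "data", 4))
            return (long)(pos + 8);             // payload starts after the 8-byte chunk header
        pos += 8 + chunkSize + (chunkSize & 1); // chunks are word-aligned
    }
    return -1;
}
```

Once you have that offset, prebuffering is just reading the first chunk of samples from there ahead of time, so play() can hand data to the audio system immediately.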

But I could still hear a significant amount of latency, so I reduced the audio block size down to 16 samples. This is extremely aggressive, and it resulted in a massive amount of dropouts. In fact, I could only manage playback of a single file without glitches, but at least the sound was immediate. I figure the T4 might get 4, but clearly this is not going to be good enough. I want at least 24 voices--the more the merrier.

Tomorrow I will try to rework my single AudioPlayUSBWav object to handle reading, buffering and mixing multiple audio files instead of using one AudioPlayUSBWav module for each file. Based on my SSD read performance tests through the host port, I should be able to comfortably read over 50 stereo files simultaneously, although I didn't do those tests with audio playing.
 
Curious as to what you're measuring the latency with respect to? Some sort of trigger input to first sample output? 70us sounds a bit unbelievable, as it's only 3 samples at 44.1kHz, and it's hard to believe you can hear that!

I've done some work on buffered playback, you can find a discussion thread with links to the code here. Its basic use case was for multi-track playback and recording to SD cards, but there are functions provided to use any File object, so you should be able to use it with an SSD on the USB Host port. I've tested it up to 16 mono tracks, and there's a demo of a sample-based 88-note piano using pre-buffering for fast response - I haven't actually measured the response time, but it's probably <10ms from MIDI message to sound. Pre-buffering is absolutely essential, at least for SD cards, as the initial load time from the medium can be several milliseconds due to filesystem overhead.

A lot of the Audio library modules react badly to changed block sizes - may not be a problem for your use case.
 
Thanks for the link. I've read through your description and it looks very interesting. You're doing a lot of the things that I was considering (using PSRAM for example). That 70us was not including audio latency. I simply measured the time delay from calling my play() method to the first call to update(). It's actually probably lower now that I reduced the audio buffer down to 16.
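The measurement itself is just a timestamp taken in play() and read back on the first update(). On the Teensy that's micros(); the same pattern in portable C++ looks like this (names illustrative):

```cpp
#include <chrono>
#include <thread>

// Stamp the clock when play() is called, and record the elapsed time
// the first time update() runs afterwards.
using Clock = std::chrono::steady_clock;
static Clock::time_point playCalled;
static long long firstUpdateDelayUs = -1;

void play() {
    playCalled = Clock::now();
    firstUpdateDelayUs = -1;        // re-arm the measurement
}

void update() {
    if (firstUpdateDelayUs < 0)
        firstUpdateDelayUs = std::chrono::duration_cast<std::chrono::microseconds>(
            Clock::now() - playCalled).count();
}
```

Note this only captures the software path from play() to the first audio update, not the time until samples actually reach the DAC.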

I'm using an electric drum kit to trigger the samples from MIDI, so any amount of latency is very noticeable. I'm going to go ahead and play back multiple files within a single AudioStream to see if that removes the dropouts (compared to multiple AudioStreams).
 
So today I managed 32 simultaneous mono voices at 16 samples latency without dropping packets. The key is to do all buffering outside of the update() methods. Each voice has three buffers. I reserve one buffer for the first chunk of the file, and then I switch between two others that get updated in the background as the file is being played. I can't really go for more voices because I've already maxed out the T3.6's memory. I'll go for stereo tomorrow.
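Roughly, the per-voice scheme looks like this (a host-side sketch with illustrative names, not my actual code): one preloaded buffer for the start of the file, two more that ping-pong, with update() only ever consuming from the current one and loop() doing the file reads when flagged.

```cpp
#include <cstdint>
#include <cstddef>

constexpr size_t BUF_SAMPLES = 2048;

struct Voice {
    int16_t preload[BUF_SAMPLES];   // first chunk, loaded before play()
    int16_t ping[BUF_SAMPLES], pong[BUF_SAMPLES];
    int16_t *playBuf = preload;     // update() reads from here
    int16_t *fillBuf = ping;        // loop() refills this one in the background
    size_t  playPos = 0;
    bool    fillNeeded = false;

    // Called from the audio update: consume n samples; when the current
    // buffer is exhausted, swap to the freshly filled one and flag the
    // other for a background refill from the file.
    void consume(size_t n) {
        playPos += n;
        if (playPos >= BUF_SAMPLES) {
            playPos = 0;
            int16_t *next = fillBuf;
            fillBuf = (playBuf == preload) ? pong : playBuf;
            playBuf = next;
            fillNeeded = true;      // loop() sees this and reads the file
        }
    }
};
```

The point is that update() never touches the filesystem: it only swaps pointers, and all the slow reads happen from loop().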
 
Just did a non-rigorous test of my SDPiano demo with the audio block size set to 16 samples, so an audio update interval of 363us. Pleasingly, it all worked first time after the change of block size, no further fixes needed :D

As I’d vaguely expected, latency from play() call to first audio update() was anything up to 363us or so; actual audio output was then about 1ms (3x audio intervals?) after that, so worst case around 1.4ms total. Obviously you have to add sensing time in the drum kit and MIDI transmission time on top - no idea of the former, I usually estimate 1ms for a MIDI note transmission.

This is with a Teensy 4.1, PSRAM and audio adaptor. Given your excellent results with a 3.6 I’d expect you should be able to achieve your goal - looking forward to further posts!
 
So the first order of business today was to decouple my AudioFile class from my AudioVoice class. This allows me to have a single buffer (4K) for the AudioFile, and two (4K) buffers for the voice. The idea is to allow for many AudioFiles (hundreds perhaps) and to simply attach them to an available AudioVoice when they need to be played back. That all worked fine.

I then recorded 32 stereo tracks to see how many I could play simultaneously. Sadly, only 10 (from a USB flash drive). Actually, quite possibly more, but to go past 10 tracks, I need to increase my buffer sizes to 8K and the T3.6 just doesn't have enough RAM. I suspect that even larger buffer sizes, up to 32K, would be more efficient, but that would be pushing it even on the T4.1.

Unable to increase my buffer sizes, I decided to get an M.2 SSD (512GB) and an enclosure to see if I could get any better results. Warning: my USB port did not supply enough current to power both my T3.6 and the M.2 drive. Luckily the device I'm using to test with has a 9V DC connector, and powering it from a wall wart did the trick. Using a powered USB hub also worked. It's a shame that external power is needed at all.

With the M.2 SSD, I was able to muster 14 tracks. Significantly better than the USB flash drive, but still underwhelming. I guess random read performance on an SSD is not spectacular. I'll need to wait for my T4.1 before I go on. I suspect that 8K buffers for the voices and offloading the AudioFile buffers to PSRAM will get me a lot closer to my goal. Anyone have any data on PSRAM read/write performance measures?
 
Anyone have any data on PSRAM read/write performance measures?
I thought I did, but couldn't find anything ... it may turn up. I did find this post where Paul states the PSRAM bandwidth is about 40Mbyte/s. I reckon that's nominally enough to play about 226 stereo 44.1kHz files, though obviously that doesn't allow for any code overhead, re-loading the PSRAM with new samples etc.
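Sanity-checking that figure: a 16-bit 44.1kHz stereo stream costs 44100 × 2 channels × 2 bytes = 176,400 bytes/s, so 40Mbyte/s nominally covers 226 such streams (ignoring all overhead):

```cpp
// How many 16-bit 44.1kHz stereo streams a given raw bandwidth could
// nominally sustain, with no allowance for code or refill overhead.
int maxStereoStreams(double bandwidthBytesPerSec) {
    const double streamBytesPerSec = 44100.0 * 2 /*channels*/ * 2 /*bytes per sample*/;
    return (int)(bandwidthBytesPerSec / streamBytesPerSec);
}
```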

You might want to try the built-in SD card slot, just for comparison. This thread suggests speeds of 15 to 20Mbytes/sec are achievable, though not sure what buffer size was used. With a decent SD card I'd hope that 16 stereo tracks would be achievable on a Teensy 3.6, and maybe double that on a Teensy 4.1. Looking at real-world USB2.0 performance, you might struggle to get much over 30Mbytes/s with any drive, especially with the overhead of reading from multiple files at once (which also clobbers SD card performance).
 
I'm quite confident that with 4KB reads, 14 stereo tracks is about as good as it gets, and I doubt the T4.1 would improve much on that because it comes down to random file access time. The same is true for SD cards, I assume. To get more tracks, I need bigger buffers. The T4.1 has 4x the RAM and up to 16MB of PSRAM, so I should be able to use 8KB or 16KB buffers/reads. And if the PSRAM is fast enough, perhaps even 32KB buffers. That's the size you are using, right? BTW, my 32-track mono test yesterday was using short files, about 2 seconds each, and that's why I didn't get any dropouts. Today, my stereo tracks were about 20 seconds each, representing a much better stress test.
 
I did a quick test.

Code:
    // Read one buffer-full from each open file, timing the whole batch
    // to get the average cost per byte.
    elapsedMicros readTime = 0;
    int totalBytesRead = 0;

    for (int i = 0; i < NUM_TEST_FILES; i++) {
        if (fileToRead[i].available()) {
            totalBytesRead += fileToRead[i].read(readBuffer, READ_BUFFER_SIZE);
        }
    }

    Serial.printf("%.3f us/byte\n", (float)readTime / totalBytesRead);

Reading at 4KB, I get an average read time of 0.12 us / byte.
Reading at 8KB, I get an average read time of 0.07 us / byte.
Reading at 16KB, I get an average read time of 0.05 us / byte.
Reading at 32KB, I get an average read time of 0.04 us / byte.

There's no improvement going up to 64KB. Looks like I'll be going for 16KB, but even just going to 8K should get me almost double the tracks.
 
That all looks about like what I see using SD, topping out at about 20Mbytes/s (your 16k buffer test). The improvement I’d predict from using a Teensy 4.1 would indeed come from the bigger buffer sizes, not the CPU speed.

I do indeed seem to have adopted a 32k buffer as standard, probably because I’ve got 8MB of PSRAM installed. I think that corresponds well with your 16k reads, as my code splits the buffer into two and plays directly from one half while a read from the file into the other half is pending. So I probably use 2/3 the memory per voice compared to your triple-buffered scheme, and maybe avoid some memory shuffling? Hard to say, and probably doesn’t matter too much.

Technically my library can allocate different buffer sizes per voice, though I’ve not used it much except when using 4- to 8-channel WAV files - generally one size fits all!

Oh, and I did go to some trouble to ensure reads only rarely cross a sector boundary. Because I’m supporting WAV files, the first read is a few samples short because of the header, but ends on a sector boundary so all subsequent reads do so as well (for buffers of a sane size!). And different voices fill the buffer by different amounts at start of play, so they don’t all try to read the file system at once. That latter aspect could do with some work, possibly, as I don’t keep track of “available read slots”, so samples that don’t start in sync may end up with a read clash after all!
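The first-read trick is just shortening the initial read by the header's misalignment so it ends on a sector boundary. Assuming 512-byte sectors and the canonical 44-byte WAV header (both illustrative here), something like:

```cpp
#include <cstddef>

constexpr size_t SECTOR = 512;

// Shorten the first read so it ends on a sector boundary; every
// subsequent read of bufSize bytes (a multiple of the sector size)
// then stays sector-aligned.
size_t firstReadSize(size_t headerBytes, size_t bufSize) {
    return bufSize - headerBytes % SECTOR;
}
```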
 
I need the third buffer in case I need to restart playing a voice while it's being played.

I'll take a peek at your code tomorrow to see how you figure out those cluster boundaries. That's something I didn't think of.
 
Boundary issues indeed. I'm now up to 18 stereo tracks using 4KB buffers/reads. But again, out of memory.
 
Boundary issues indeed. I'm now up to 18 stereo tracks using 4KB buffers/reads. But again, out of memory.
Excellent news … I think when your 4.1 arrives you’re going to be very happy!
I need the third buffer in case I need to restart playing a voice while it's being played.
Ah yes. My SDPiano demo does that, total overkill with 30k pre-buffered for each of 261 samples. But that’s part of the application, not the library…
 
Not working on this much today, but I wanted to compare the performance of 1 AudioStream object playing back 16 stereo tracks simultaneously vs 4 AudioStream objects each playing back 4 stereo tracks. To get an idea of the performance, I counted the number of times loop() got called while the tracks were being played back.

While in both situations, the SSD was able to handle the workload, there is clearly an advantage to playing back multiple files in a single AudioStream.

AudioStream objects | Tracks per object | Loop calls | Playtime | ms/call
1                   | 16                | 1201219    | 11028 ms | 0.0092
4                   | 4                 | 1019927    | 11029 ms | 0.0108
 
Not for the faint of heart, but I've reimplemented my code to work with pointers so I can dynamically create and remove instruments. I envision simply putting in a flash drive or SD card with samples and some config files and having the Teensy reconfigure itself on the fly.
 
Do you know the latency from trigger to audio output?
I put a scope on my little project and I'm seeing 7 msec from trigger to sound. I have three blocks:

sample player -> gain -> i2s output. The trigger detect code takes about 500 usec, call that a millisecond to get to the loop.

The audio system uses 6 buffers, and my block size is 64 samples (44.1 KHz), or about 1 msec
 
Do you know the latency from trigger to audio output?
I put a scope on my little project and I'm seeing 7 msec from trigger to sound. I have three blocks:

sample player -> gain -> i2s output. The trigger detect code takes about 500 usec, call that a millisecond to get to the loop.

The audio system uses 6 buffers, and my block size is 64 samples (44.1 KHz), or about 1 msec
Code? Image of audio design?

Based on my post #5 above, and given 64 samples is 1.5ms, you might hope for 6ms+trigger time, which is in line with what you're seeing. If you don't pre-buffer samples then the file open, parse and first samples load can dominate the latency, and be unpredictable...
 
I got my custom T4.1 device up and running on my first attempt. I'm really happy and surprised as it was my first time working with a BGA. Unfortunately I'm having issues with the integrated audio board. Likely cause is that I used a 1.2V version of the AP7313. I've ordered the 1.8V version and hopefully that gets it working.

I've moved all my audio file and audio voice buffers to PSRAM. Because PSRAM buffers have to be declared statically (EXTMEM), I needed to make a buffer helper class to manage dynamically assigning buffers within the PSRAM. As far as I can tell, I'm able to process (and likely play back) 32 stereo files from a USB SSD drive simultaneously using 8K buffers without dropouts. If I use a USB flash drive, I can only manage 16. I'll try to get up to 64 files once I can get some audio out of the SGTL5000.
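The helper amounts to carving fixed-size slots out of one big statically declared array. A host-side sketch (sizes and names are illustrative, not my actual class; on the Teensy the array itself would carry the EXTMEM attribute):

```cpp
#include <cstdint>
#include <cstddef>

constexpr size_t SLOT_BYTES = 8192;   // one 8K buffer per slot
constexpr size_t NUM_SLOTS  = 64;

// On a real T4.1 this would be: EXTMEM static uint8_t pool[NUM_SLOTS * SLOT_BYTES];
static uint8_t pool[NUM_SLOTS * SLOT_BYTES];
static bool    slotUsed[NUM_SLOTS] = {false};

// Hand out the first free slot, or nullptr if the pool is exhausted.
uint8_t *poolAlloc() {
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        if (!slotUsed[i]) {
            slotUsed[i] = true;
            return pool + i * SLOT_BYTES;
        }
    }
    return nullptr;
}

// Return a slot to the pool; ignores pointers outside the pool's range.
void poolFree(uint8_t *p) {
    size_t i = (size_t)(p - pool) / SLOT_BYTES;
    if (i < NUM_SLOTS) slotUsed[i] = false;
}
```

A linear scan over 64 slots is cheap enough that a free list isn't worth the complexity at this scale.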
 
Sounds as if you're making good progress - hope you manage to get the hardware issue sorted out.

Inspired by this, I'm working on an update to my buffered playback which should make it easier to achieve a low-latency start with subsequent playback coming from a filesystem. The idea is to create an AudioPreload object which holds the file path, filesystem and pre-loaded data for an audio file. Using a reference to one of these as the parameter to play() will then seamlessly manage the transition without further effort from the application. The numbers of preload and playback objects are entirely independent: you can have 100 samples ready to play and 16 channels available for playback, for example, limited only by buffer memory and processing power. A nice side-effect is that it's just as easy to use an 8-channel sample as it is a mono one.

For your use case, can you see any merit in either (a) starting such a playback in "paused" mode, with the intent to resume in sync with other files, or (b) allowing playback to start at any point other than the beginning of the file? The normal play() function can do both of these things, but I think they're of more use in multi-track file player / recorders and loopers - I can't really see them being useful for a low-latency sample player.

I'd be interested to know your thoughts.
 
Inspired by this, I'm working on an update to my buffered playback which should make it easier to achieve low latency start with subsequent playback coming from a filesystem. The idea is to create an AudioPreload object which holds the file path, filesystem and pre-loaded data for an audio file.
That's how I'm doing it. With 8k buffers, I can preload 1024 stereo samples. I'd like to get that number closer to 4096, but I'm already maxed out.

For your use case, can you see any merit in either (a) starting such a playback in "paused" mode, with the intent to resume in sync with other files, or (b) allowing playback to start at any point other than the beginning of the file? The normal play() function can do both of these things, but I think they're of more use in multi-track file player / recorders and loopers - I can't really see them being useful for a low-latency sample player.

I'd be interested to know your thoughts.
Wouldn't be useful for my current plans, but would be useful for some DJ-style gear. It would get a little tricky because you still need to maintain alignment with sectors on the disk, resulting in wasted buffer space.
 
That's how I'm doing it. With 8k buffers, I can preload 1024 stereo samples. Would like to get that number closer to 4096, but I'm already maxed out.
You could add another PSRAM to get to 16MB - that should allow buffering the start of 2048 sample files, each with an 8kB buffer. Or have I stuffed up my calculations at some point?

Wouldn't be useful for my current plans, but would be useful for some dj style gear. It would get a little tricky because you still need to maintain alignment with sectors on the disk, resulting in wasted buffer space.
Yes, though you only waste an average of half a sector per sample - say 512kB of 16MB, or 3.1%, or 1.5ms of each stereo sample. Not great, but not totally horrible. I already "waste" a tiny bit, because of the WAV file header, though I do use that space to store the file path - it would waste a lot if the path was slightly longer than the file header, but that's pretty unlikely... Actually, the buffers don't have to be a multiple of a sector long ... if a WAV header is 44 bytes long, your buffer can be 8kB - 44 and you get 44kB left over when you've buffered 1024 sample files, which gives you a whole 5 extra samples! Big deal...

OK, I'll probably do the pause and offset start point options, though maybe not for the first pass. It won't be a massive amount of extra code, and it'll keep things consistent. Thanks for your input.
 