Teensy audio over UDP

o.w.varley

Hey all,

I'm doing a project at the moment where I'm pushing out audio captured from a Teensy 4.1 + shield board over a LAN.

I've created my own Audio Player (AudioPlayMemoryRaw) that can take the raw audio blocks captured from an AudioRecordQueue and play them out via an AudioOutputI2S. My plan is to prove that I can capture and play back the audio locally (on the same Teensy) before progressing to passing the raw audio blocks over UDP to another Teensy. I'm getting some strange occurrences that make me think I don't have a correct understanding of how the Teensy/Audio Shield allocates and frees memory.

With the current setup, I can happily capture and play back the first few seconds of audio; however, it then cuts out and I hear a constant tone in my headset. On further investigation, if I remove the output and the call to playRaw.enqueue, the code runs absolutely fine. This makes me think that playRaw isn't correctly freeing the blocks or, I guess, it could be my use of std::queue.

Main sketch:
Code:
#include "AudioPlayMemoryRaw.h"
#include "AudioPacket.h"
#include <Audio.h>

AudioInputI2S            mike;           //xy=105,63
AudioRecordQueue         queue;         //xy=281,63
AudioPlayMemoryRaw       playRaw;       //xy=302,157
AudioOutputI2S           headphones;           //xy=470,120
AudioConnection          patchCord1(mike, 0, queue, 0);
AudioConnection          patchCord2(mike, 0, headphones, 0);
AudioConnection          patchCord3(mike, 0, headphones, 1);
AudioControlSGTL5000     sgtl5000;

int led = 13;


// the setup routine runs once when you press reset:
void setup() 
{
	// initialize the digital pin as an output.
	pinMode(led, OUTPUT);
	digitalWrite(led, HIGH);   // turn the LED on (HIGH is the voltage level)

    Serial.begin(115200);
    AudioMemory(60);


    // Enable the audio shield, select input, and enable output
    sgtl5000.enable();
    sgtl5000.inputSelect(AUDIO_INPUT_MIC);
    sgtl5000.micGain(35);
    sgtl5000.volume(0.5);

    queue.begin();
    playRaw.play();
    
    Serial.println("Start Audio Up");
}

void report_memory()
{
    Serial.printf("CPU: %.2f (%.2f),  MEM: %i (%i)\n",
        AudioProcessorUsage(), AudioProcessorUsageMax(),
        AudioMemoryUsage(), AudioMemoryUsageMax());

}

void readFromQueue()
{
    t_AudioPacket packet;
    int16_t memoryBefore;

    if (queue.available() >= 1)
    {
        memoryBefore = AudioMemoryUsage();

        // copy one block out of the record queue, then hand the buffer back to the library
        memcpy(packet.data, queue.readBuffer(), sizeof(packet.data));
        queue.freeBuffer();

        playRaw.enqueue(packet);

        // sanity check: did freeBuffer() return a block to the audio pool?
        if (memoryBefore == AudioMemoryUsage())
            Serial.println("No memory released by freeBuffer");

    }
}

// the loop routine runs over and over again forever:
void loop() 
{
    readFromQueue();
}

AudioPlayMemoryRaw:
Code:
#include "AudioPlayMemoryRaw.h"
#include <Arduino.h>
#include "play_sd_raw.h"
#include "spi_interrupt.h"
#include <queue>

std::queue <t_AudioPacket> _OutputQueue;

bool AudioPlayMemoryRaw::play()
{
	stop();

	playing = true;
	return true;
}

void AudioPlayMemoryRaw::enqueue(t_AudioPacket packet)
{
	_OutputQueue.push(packet);
}

void AudioPlayMemoryRaw::stop(void)
{
	if (playing)
	{
		playing = false;
	}
}

void AudioPlayMemoryRaw::update(void)
{
	unsigned int i;
	audio_block_t* block;

	// only update if we're playing
	if (!playing) return;

	// allocate the audio blocks to transmit
	block = allocate();
	if (block == NULL) return;

	if (_OutputQueue.size() > 0)
	{
		t_AudioPacket packet = _OutputQueue.front();
		_OutputQueue.pop();

		memcpy(block->data, packet.data, sizeof(block->data));

		transmit(block);

		release(block);
	}
}

AudioPacket:
Code:
#pragma once

#include <Arduino.h>      // int16_t
#include <AudioStream.h>  // AUDIO_BLOCK_SAMPLES

struct t_AudioPacket
{
    int16_t data[AUDIO_BLOCK_SAMPLES];
};

Any help would be greatly appreciated.
 
Figured it out: my dodgy code was allocating memory blocks in the update function regardless of the state of _OutputQueue, but only releasing them if the queue had any data in it.
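For anyone hitting the same thing, the fix is roughly this (a sketch against the class above; the block is only allocated once the queue actually has data, so every allocate() gets a matching release()):
Code:
void AudioPlayMemoryRaw::update(void)
{
	// only update if we're playing and there is actually data queued
	if (!playing) return;
	if (_OutputQueue.empty()) return;

	// now every allocated block is guaranteed to be transmitted and released
	audio_block_t* block = allocate();
	if (block == NULL) return;

	t_AudioPacket packet = _OutputQueue.front();
	_OutputQueue.pop();

	memcpy(block->data, packet.data, sizeof(block->data));
	transmit(block);
	release(block);
}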

The above implementation suffered from about a second's worth of lag between speaking and hearing yourself. Some investigation showed this was because I was copying the data around twice: I memcpy the data from the queue into a structure, push this onto the std::queue, and then memcpy it again across to the audio block within the update method. By swapping to pointers rather than a custom structure I've managed to get the lag down to next to nothing.
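The pointer-based version looks roughly like this (a sketch only; _OutputQueue becomes a std::queue<int16_t*>, and note that nothing here guards the queue against being touched from both loop() and the audio interrupt):
Code:
// _OutputQueue is now std::queue<int16_t*> instead of std::queue<t_AudioPacket>
void AudioPlayMemoryRaw::enqueue(int16_t* buf)
{
	_OutputQueue.push(buf);
}

// loop side: a single copy into a heap buffer, then only the pointer is queued
void readFromQueue()
{
	if (queue.available() >= 1)
	{
		int16_t* buf = new int16_t[AUDIO_BLOCK_SAMPLES];
		memcpy(buf, queue.readBuffer(), AUDIO_BLOCK_SAMPLES * sizeof(int16_t));
		queue.freeBuffer();
		playRaw.enqueue(buf);
	}
}

// update() then memcpys from the pointer into the transmitted block and delete[]s it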

Next challenge is working out how to read audio blocks from UDP and to see what the lag on those is like when they are played back.
 
Also think about how you will sync the audio clocks between the sender and receiver. Without sync, you will typically end up with buffer underruns or overruns eventually.
 
Hey Jonr, thanks for the pointer, and it's definitely something I'll need to solve. There are going to be about 20 Teensy 4.1s with audio shields running on a LAN, so coordinating the timing is going to be fun.

I'm still trying to prove whether the idea is technically feasible at the moment. I've got a single Teensy receiving 4 different audio streams over UDP sent from a PC, which is working OK (with bodged timing delays). I'm using a different AudioPlayMemoryRaw object for each of the streams and then mashing them together using a mixer (wired roughly as sketched below). I've noticed that it works fine with 4 different audio streams going to 4 different objects; however, the queues quickly overrun the memory if I try to send two audio streams over UDP to the same AudioPlayMemoryRaw object.
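Roughly the wiring I'm using on the receive side (object names here are just illustrative):
Code:
AudioPlayMemoryRaw   player1, player2, player3, player4;   // one player per UDP stream
AudioMixer4          mixer;
AudioOutputI2S       headphones;

AudioConnection      c1(player1, 0, mixer, 0);
AudioConnection      c2(player2, 0, mixer, 1);
AudioConnection      c3(player3, 0, mixer, 2);
AudioConnection      c4(player4, 0, mixer, 3);
AudioConnection      c5(mixer, 0, headphones, 0);
AudioConnection      c6(mixer, 0, headphones, 1);

void setupMixer()
{
	// keep the summed output inside range when all four streams are active
	for (int ch = 0; ch < 4; ch++) mixer.gain(ch, 0.25f);
}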

Currently pondering reducing the audio block size and/or reducing the sampling rate to 8 kHz to help reduce the processing and network load.
 
Reducing the block size does not help to reduce processing. With half the block size you'll double the number of processed blocks... and that needs more time, due to overhead!

UDP: as far as I know, it is not guaranteed that you'll receive the blocks in the same order you sent them. That might be a non-issue on a local LAN.
 
With proper design and buffering, I expect that a Teensy 4.1 could handle lots of UDP audio streams.
 
Hey Frank, thanks for that gem, it saves me chasing a red herring. I'm presuming that downsampling to 8 kHz would help reduce the size of the packets being sent and might help me a little with the inherent delay between reading an audio block via the mic on one Teensy, sending it by UDP, receiving it on another Teensy and then transmitting it to the output device.

In terms of design, I'm aiming to support 10 distinct audio channels. Each channel has an ID that is attached to the UDP packet to identify which Audio Player should handle it. When a receiving Teensy gets a packet, it sticks it onto the relevant player's queue and the player's update function then handles outputting it.
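For illustration, the packet layout I have in mind is roughly this (field names are just placeholders; the sequence counter is only there because of the UDP ordering point Frank raised):
Code:
// one UDP datagram = a small header plus one audio block
struct t_UdpAudioPacket
{
    uint8_t  channelId;                    // which Audio Player should handle this block
    uint8_t  reserved;
    uint16_t sequence;                     // per-channel counter, to spot reordered or lost packets
    int16_t  data[AUDIO_BLOCK_SAMPLES];    // 128 samples = 256 bytes at the default block size
};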

At the moment I'm just using a std::queue which holds int16_t pointers to memory allocated when the UDP packet is read in. I'm not convinced that std::queue is the right way to go, but I wanted to get the mechanics up and running before seeing whether I needed to optimise the buffering/queuing logic. Any hints or tips on the most efficient way to buffer/queue the input from the UDP packets would be appreciated!
 
Following on from what Frank B said above, and doing some more reading around, can I check my understanding of some of the basic theory?

The standard sampling rate is 44.1 kHz, each sample is two bytes (int16_t), and the default audio block size is 128 samples (256 bytes), which roughly equates to 2.9 ms of audio. By using a block size of 128, is it correct to say that we are accepting an inherent 2.9 ms delay (plus any processing time) between capturing audio and playing it back? If we reduce the block size, say down to 64 samples (128 bytes), would an audio block now be equivalent to 1.45 ms of audio? Obviously, as Frank mentioned, this means processing and handling twice as many blocks, with the bonus of reducing the latency (delay) between capture and playback.
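Just to sanity-check my own arithmetic:
Code:
constexpr float sampleRateHz = 44100.0f;
constexpr float blockMs128   = 1000.0f * 128 / sampleRateHz;   // ~2.90 ms per 128-sample block
constexpr float blockMs64    = 1000.0f * 64  / sampleRateHz;   // ~1.45 ms per 64-sample block
constexpr int   blockBytes   = 128 * sizeof(int16_t);          // 256 bytes per default block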

I'm doing some benchmarking now to look at how long it takes to transmit and receive 256 bytes of data between Teensys via UDP. My plan is to work out how long it takes and then adjust the audio block size and sampling rate to minimise the delay between capturing, sending and then receiving the audio block via UDP.
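The benchmark itself is nothing fancy; roughly this (it assumes an already-initialised EthernetUDP-style object called Udp from NativeEthernet/QNEthernet and a destination IPAddress destIp, both placeholders here; the port number is arbitrary):
Code:
uint8_t payload[256];                       // one audio block's worth of data

void benchmarkSend()
{
	elapsedMicros t = 0;                    // Teensy helper type that counts microseconds
	Udp.beginPacket(destIp, 8888);
	Udp.write(payload, sizeof(payload));
	Udp.endPacket();
	Serial.printf("sending %u bytes took %lu us\n",
	              (unsigned)sizeof(payload), (unsigned long)t);
}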
 
Let's assume our sampling rate of 44.1 kHz - what does that mean?
You divide a second into 44100 parts: 44100 samples.
If you want to play each sample, the output needs to run 44100 times per second, too.

So far, easy.

Now we have blocks. These blocks have nothing to do with the sample rate. They are just there to let the Audio library do its work.
Our default block size is 128 samples.
So, you're right, 128 samples are ~2.9 ms.

For a second, you need:

~344 blocks with 128 samples.
- or -
~689 blocks with 64 samples

So, reducing the block size increases the number of blocks. There is (a little) overhead per block, so you need _more_ CPU time.

You could try to increase the block size:

~172 blocks of 256 samples


One sample is 16 bits, or 2 bytes.

On the other hand, a T4 is easily fast enough; the CPU should not be the limiting factor here.
 
And yes, you always have at least 2.9 ms of delay. How much exactly depends on the number of patchcords, the processing objects used, and the details of their internal code.
So, it is a multiple of 2.9 ms (with a 128-sample block size).

Good news: 2.9 ms is not noticeable.
 
So, you have to make sure that you can send and receive faster than your sample rate requires.
When this is solved, you will have the next problem:

Two independent oscillators are _never_ synchronous.
On the receiver you have to compare the speed of the incoming data with the speed of your DAC.
In other words: one Teensy has a real sample rate of 44100.0001 Hz, the next 44099.9999 Hz, etc.

You will have to insert a sample from time to time - or you will have to delete one.

With USB audio, there is the same problem; a complex syncing mechanism is involved.
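As a very rough illustration of the insert/delete idea (not how USB audio does it; the caller still has to deal with the output block being one sample shorter or longer than the input):
Code:
#include <stdint.h>

enum DriftAction { KEEP, DROP_ONE, REPEAT_ONE };

// Copy one block of samples, optionally dropping or repeating a single sample in
// the middle. 'out' must have room for inCount + 1 samples. Returns the output length.
int adjustBlock(const int16_t* in, int inCount, int16_t* out, DriftAction action)
{
	int outCount = 0;
	for (int i = 0; i < inCount; i++)
	{
		if (action == DROP_ONE && i == inCount / 2) continue;   // data arriving faster than playback: skip one
		out[outCount++] = in[i];
		if (action == REPEAT_ONE && i == inCount / 2)
			out[outCount++] = in[i];                            // data arriving slower than playback: repeat one
	}
	return outCount;   // inCount - 1, inCount, or inCount + 1
}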
 
Thanks Frank, great response.

Is there a known standard for when a delay becomes perceptible to a human? What should I be aiming to keep the delay below?

Sending and receiving doesn't look like it's going to be the bottleneck. At the moment my benchmarks show that sending 256 bytes takes 77 microseconds and receiving the same size takes 62 microseconds; when sending and receiving in the same loop, a send takes 96 microseconds, a receive takes 69, and the loop iteration takes 99 microseconds. So, like you say, the CPU isn't going to be an issue and UDP send/receive should be fine; the issue will be the timing between the two. I'll try to get my head round what I need to do to get the timing right.
 
I don't know. Personally, I think that in a normal living environment 10 ms is OK. (Just waiting for the first user who says "No... never... that's too much!")

Just try it. If you hear the lag, reduce the block size. If not, you can increase it to reduce computing requirements.
 
Is there a known standard for when a delay becomes perceptible to a human? What should I be aiming to keep the delay below?

Opinions vary wildly on this topic!

50 ms is a pretty good general-purpose audio latency goal. Many usability studies have found that around 100 ms is the threshold where most people notice the delay between a touch and the audio/visual response, so you definitely want to stay away from that much delay.

But if you're building certain types of musical instruments, you might want under 20 ms or maybe even 10 ms. Cases where aftertouch pressure modifies an already-playing note or other sound can be particularly sensitive. A skilled musician probably wouldn't be able to tell you they hear delay, but would probably give you feedback that the "feel" isn't good. Adrian Freed probably knows the most on this subject. Maybe he'll see this message and comment?

Human hearing is quite sensitive to differing delays heard by each ear. People perceive those differences in the 10-80 ms range as the location of the sound source. So if you're transmitting the audio to multiple locations, matching the delay can be much more important than minimizing it.

Likewise, rapid changes in delay alter the pitch of sound. People can perceive that very easily, so keeping the delay consistent is important.

It's really easy to obsess over milliseconds & microseconds, so just to keep the physics of sound in perspective, every millisecond corresponds to approximately 1.1 feet or 34 cm more distance between the sound source and your ear. When you start feeling worried about a few milliseconds, just consider all the times you have a conversation or listen to music or movies while moving around a room.
 
My personal experience as a musician is that anything over 30 ms of latency gets annoying; under 30 ms ranges from acceptable to undetectable depending on context.
 
> You will have to insert a sample from time time - or you have to delete it.

Or, much better, keep tweaking the audio clocks so they match. As systems like Dante prove, the match can be sufficiently perfect that you _never_ need to create distortion by adding or dropping a sample.
 
Thank you all for the guidance and insight on audio delay. Given that in this scenario each Teensy will be connected to an individual headset, I think I might be able to get away without synchronising the delay across multiple Teensys, which should simplify the solution somewhat.

On the subject of audio clocks and timing: if the audio clocks aren't aligned, I'm presuming this means that the audio samples will be stretched or squashed depending on whether the receiving Teensy's clock is ahead or behind? What effect does this have on the audio signal? Would it be significant?

I guess I can use a common timing signal across the network to align the clocks. I'm presuming setting the board's audio clock is possible via the library?
 
Opinions vary wildly on this topic!

Indeed they do. This is probably because delay perception is so dependent on context.

In some situations people can hear jitter (variation in delay) down below 1 ns. https://www.audiosciencereview.com/...audio-measurements-and-listening-tests.21115/

This is why you have to work so hard to have a jitter-free clean clock for audio codecs.

We can hear fine delays between channels in stereo and multi-channel situations, so you have to work to keep these streams in sync with each other and keep the reconstruction and anti-aliasing filters reasonably close in specs.

I think the context we are talking about here relates to gesture-to-sound response delays. We published a lot using the 10±1 ms goal for this, and this is bettered regularly in dedicated synths and specialized systems (like Bela). It's also possible to achieve this with special flavors of Ethernet like AVB, which use reservation protocols to let devices communicate without contention.

If you are making drone music or other styles where sloppy timing isn't important or is masked, you can head up to 50 ms and beyond, but people can only adapt to play well under these circumstances if there is low jitter, i.e. little variation in the delay. It can be irritating to retrain if you improve or worsen the system every day.
 
Actually, people don't really hear jitter; they hear the spurious tones caused by jitter corrupting the spectrum. Jitter causes intermodulation with the signal tones, and if the intermodulation products are far enough from the signal frequencies to not be perceptually masked, they can be audible.
 
> if the audio clocks aren't aligned ... the audio samples will be stretched or squashed

The typical behavior is that you eventually get a buffer underrun or overrun. Resampling would stretch/squash.
 
Interesting. So I've got the implementation up and running (I'll share the code shortly); however, I'm experiencing a static-like tone when playing the received blocks rather than audio. It feels like a timing issue, as the tone modulates with corresponding changes in the input volume on the microphone.

I'm going to record the audio to SD and play it back to see whether it is timing or something else.
 
The reason I think it's the frequency issue that was previously discussed is that it sounds exactly like listening to an old analogue radio when you are really close to the actual station's frequency but not exactly on it. Would that make sense in this context?
 
Hi all,

Got some more time to look at the problem this evening. To try to get my head round what's happening, I've created a 1 kHz tone raw file. I'm now playing this on the Teensy using a custom AudioStream object that plays each audio block and also sends it out via UDP broadcast. I've got a C# program running that reads the UDP packets and saves them to disk. I'm using Audacity to compare the original 1 kHz tone raw file against what's being picked up over UDP; the results are below:

Initially all seems OK, with the phase slightly out (I'm not starting to capture the UDP packets at the beginning of the transmission):
[Attachment: Audacity screenshot comparing the original 1 kHz tone with the captured UDP samples, initially aligned]

It doesn't take long before the samples start getting out of sync.
[Attachment: Audacity screenshot showing the two waveforms drifting out of sync]

Is this the issue that has been described above regarding frequency/clock differences?
 
I'd say yes... but it's difficult to say for sure.
You may want to add some debug output to your program, something that prints a message when (a) the receiver has no block to play, or (b) it gets too much data.

If you try syncing:
- the crystal speed changes with temperature, so do it dynamically.
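As a sketch of that debug output (counters only; the names and threshold are placeholders, and printing is left to loop() so the audio interrupt never calls Serial):
Code:
volatile uint32_t underruns = 0;   // update() ran with nothing queued to play
volatile uint32_t overruns  = 0;   // enqueue() found the queue already suspiciously deep

void AudioPlayMemoryRaw::update(void)
{
	if (!playing) return;
	if (_OutputQueue.empty()) { underruns++; return; }
	// ... allocate / memcpy / transmit / release as before ...
}

void AudioPlayMemoryRaw::enqueue(int16_t* buf)
{
	if (_OutputQueue.size() > 8) overruns++;   // threshold is arbitrary
	_OutputQueue.push(buf);
}

// in loop():
//   static elapsedMillis report;
//   if (report > 1000) { report = 0;
//       Serial.printf("underruns %lu  overruns %lu\n",
//                     (unsigned long)underruns, (unsigned long)overruns); }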
 