Teensy 4.0 problem with DMA to/from OCRAM interacting with Audio library

Blackaddr · Nov 2, 2019

After months of struggling to get DMA SPI memory working with the T4 OCRAM I think I finally found the problem, and I think it's in AudioStream.h. Basically, I'm getting audio buffer corruption when the Audio buffer pool is being used by both I2S and SPI when both involve DMA.

The memory pool for the audio buffers is defined as an array of audio_block_t in AudioStream.h. The L1 cache line size is 32 bytes. The ARM manual states that DMA buffers must be 32-byte aligned (but its size should really be a multiple of 32-bytes as well) otherwise indeterminate behaviour can result. Why this is important becomes clear when you consider what happens if two different buffers share a cache line.

Let's say two adjacent 48-byte buffers in memory have their boundary in the middle of a cache line. Cache lines are at 32-byte aligned addresses.

Code:

// OCRAM starts at 0x0202 0000
// Assume that cache line index 0 is mapped to 0x0202 0000 and ends at 001F, cache line index 1 starts at 0x0202 0020 and ends at 003F
uint8_t bufferA[48]; // starts at address 0x0202 0000
uint8_t bufferB[48] // starts at address 0x0202 0030 // this buffer starts in the middle of cache line index 1

If two different contexts are using those buffers, and each is attempting to use DMA to get data in and out of OCRAM, I think we can run into some cache coherency issues. Either context could be modifying OCRAM and needs to invalidate the cache, or is writing to the cache (dirty) and needs it flushed.

My thinking at the moment is the the atomic barriers in the cache maintenance functions are not sufficient to maintain coherency in this case because a DMA transfer initiated earlier by one context could asynchronously being running in parallel while the CPU is performing a cache maintenance operation. In other words, even if two different contexts are using the two buffers independently, CPU cache maintenance is not aware if a DMA transfer from a peripheral to OCRAM is currently underway and race conditions occur in the overlapping cache line.

The result is corruption of the portion of the buffers that share a cache line.

In my case, when I use I2S without SPI DMA, I get clean audio. When I stress test a SPI memory via DMA using test patterns, I get no data errors. When I use SPI memory to DMA audio buffers than have been allocated as receiveWritable(), I get corruption. if I disable caching in startup.c, the audio corruption goes away.

I seem to be able to fix the problem by restructuring the audio buffer pool to ensure two conditions
i) every data buffer in the audio blocks starts on a 32-byte aligned address
ii) no two data buffers share the same cache line.

The problem is the current audio_block_t structure puts the data buffer on unaligned boundaries, and two different data buffers from adjacent audio_block_t in the array will overlap in the same cache line.

Code:

// first audio block will start at 0x0202 0000 in OCRAM
typedef struct audio_block_struct {
	uint8_t  ref_count; // 0x00
	uint8_t  reserved1; // 0x01
	uint16_t memory_pool_index; // 0x02
        int16_t  data[AUDIO_BLOCK_SAMPLES]; // 0x04 (starts at unaligned 32-byte address)
} audio_block_t;

The following code change to AudioStream.h fixes the cache corruption:

Code:

typedef struct audio_block_struct {
	int16_t  data[AUDIO_BLOCK_SAMPLES]; // relocate the data buffer to the start of the struct
	uint8_t  ref_count;
	uint8_t  reserved1;
	uint16_t memory_pool_index;
        uint8_t padding[28]; // pad out the audio block from 260 to 288 bytes. This ensure all data buffers in an audio_block_t array are correctly aligned

} audio_block_t;

// don't forget to but the aligned(32) attribute on the array itself to ensure it starts at the correct address.
#define AudioMemory(num) ({ \
	static DMAMEM audio_block_t data[num] __attribute__ ((aligned(32))); \
	AudioStream::initialize_memory(data, num); \
})

Note, this requires a change to effect_freeverb.cpp which has an odd way of attempting to create a zero'd audio_block instance. You can fix this compile error by replace the #ifdefs with this:

Code:

static const audio_block_t zeroblock = { 0 };

In summary, all buffers in all the Teensy libraries that are used with DMA must start on a 32-byte aligned boundary and be rounded up to an integer size of 32-bytes to ensure no data corruption of adjacent memory.

...or I could be totally wrong...

Blackaddr · Nov 3, 2019

Okay, I was able to reproduce this issue with the Teensy Audio Shield. Maybe someone with a working Teensy 4.0 and audio shield using LINE IN left can give this a try?

The sketch is written to send audio to/from the SPI RAM on the board but it looks like you can actually reproduce the problem without needing the SPI RAM. When I run this on my PJRC Audio Shield I either get corrupted audio, or I get silence depending on whether the SPI RAM is used or not.

I can restore clean audio by doing one of two things:
1) comment out the call DMA read of the SPI memory. This prevents the cache line collision discussed in my previous post.
2) OR, change AudioStream.h so that audio data buffers are always 32-byte aligned as shown in the previous post. (you also need to fix the compile issue in freeverb as mentioned above).

It's important to note that the transmitted audio is simply the input audio. I never touch this buffer, but I can corrupt by doing DMA transfers to the adjacent buffer in the pool (obtained by calling allocate()).

Code:

#include <Audio.h>
#include <SPI.h>

// demo will create a cache-coherency problem between in the Audio pool buffers by perming SPI DMA
// on them. Here's what this demo does
// 1) Use AudioInputI2S for audio input which will DMA audio from SAI into the audio buffers
// 2) Pass it through AudioEffectFixedDelay
// 3) Send the output of AudioEffectFixedDelay to the AudioOutputI2S
// 
// Each time update() is called, AudioEffectDelay will read the contents of SPI RAM into the new buffer, then write
// the input audio block out to SPI RAM using DMA transfers.
// The misaligned cache maintenance operations will corrupt the audio buffers. If there is no actual SPI ram, the data
// will always read back as zeros. As result you many hear corrupted audio or zero'd audio.

// SPI Memory definitions
#define SPI_WRITE_CMD 0x2
#define SPI_READ_CMD 0x3
#define SPI_ADDR_2_MASK 0xFF0000
#define SPI_ADDR_2_SHIFT 16
#define SPI_ADDR_1_MASK 0x00FF00
#define SPI_ADDR_1_SHIFT 8
#define SPI_ADDR_0_MASK 0x0000FF

// SPI pinouts
constexpr uint8_t SPI_SCK = 13;
constexpr uint8_t SPI_CS = 10;
constexpr uint8_t SPI_MISO = 12;
constexpr uint8_t SPI_MOSI = 11;

AudioInputI2S i2sIn;
AudioOutputI2S i2sOut;
AudioControlSGTL5000 codec;

// DMA SPI stuff
constexpr size_t AUDIO_BLOCK_BYTES = sizeof(int16_t) * AUDIO_BLOCK_SAMPLES;
constexpr size_t CMD_SIZE = 4;
uint8_t commandBuffer[CMD_SIZE];
SPISettings memSettings(20000000, MSBFIRST, SPI_MODE0);
EventResponder callbackHandler, dummyCallbackHandler;
volatile bool dmaBusy = false;

// DMA Functions
void setSpiCmdAddr(int command, size_t address, volatile uint8_t *dest)
{
  dest[0] = command;
  dest[1] = ((address & SPI_ADDR_2_MASK) >> SPI_ADDR_2_SHIFT);
  dest[2] = ((address & SPI_ADDR_1_MASK) >> SPI_ADDR_1_SHIFT);
  dest[3] = ((address & SPI_ADDR_0_MASK));
}

void callback(EventResponderRef eventResponder)
{
  dmaBusy = false;
}

void mem0DmaWrite(unsigned address, volatile uint8_t *src, size_t count)
{
  if (!src) { return; }

  // First set the command
  setSpiCmdAddr(SPI_WRITE_CMD, address, commandBuffer);

  digitalWrite(SPI_CS, LOW);
  dmaBusy = true;
  SPI.beginTransaction(memSettings);
  SPI.transfer((void *)commandBuffer, nullptr, CMD_SIZE, callbackHandler);
  SPI.endTransaction();

  while (dmaBusy) { yield(); }
  dmaBusy = true;
  SPI.beginTransaction(memSettings);
  SPI.transfer((void *)src, nullptr, count, callbackHandler);
  SPI.endTransaction();

  while(dmaBusy) { yield(); }
  
  digitalWrite(SPI_CS, HIGH);
}

void mem0DmaRead(unsigned address, volatile uint8_t *dest, size_t count)
{
  if (!dest) { return; }

  // First set the command
  setSpiCmdAddr(SPI_READ_CMD, address, commandBuffer);
  
  digitalWrite(SPI_CS, LOW);
  dmaBusy = true;
  SPI.beginTransaction(memSettings);
  SPI.transfer((void *)commandBuffer, nullptr, CMD_SIZE, callbackHandler);
  SPI.endTransaction();

  while (dmaBusy) { yield(); }
  dmaBusy = true;
  SPI.beginTransaction(memSettings);
  SPI.transfer(nullptr, dest, count, callbackHandler);
  SPI.endTransaction();

  while(dmaBusy) { yield(); }
  
  digitalWrite(SPI_CS, HIGH);
}

class AudioEffectFixedDelay : public AudioStream {
  public:
  AudioEffectFixedDelay() : AudioStream(1, m_inputQueueArray) {}  
  
  virtual void update(void) {
    audio_block_t *inputAudioBlock = receiveReadOnly(); // get the next block of input samples

    if (!inputAudioBlock) { return; }
    
    audio_block_t *outputAudioBlock = allocate();
    if (!outputAudioBlock) {
      transmit(inputAudioBlock);
      release(inputAudioBlock);
      return;
    }

    // read the previous block from SPI memory to output block.
    // Comment out this line to avoid the problem or else fix the audio pool buffers (data field) to be 32-byte aligned
    mem0DmaRead(0, (uint8_t*)outputAudioBlock->data, AUDIO_BLOCK_BYTES);

    // write the current block out to SPI memory
    mem0DmaWrite(0, (uint8_t*)inputAudioBlock->data, AUDIO_BLOCK_BYTES);

    transmit(inputAudioBlock);    // If you have SPI RAM installed you can comment out this line
    //transmit(outputAudioBlock); // and uncomment this once. You should get clean audio if the cache fix is applied to the audio buffers
    release(inputAudioBlock);
    release(outputAudioBlock);
  }

  private:
  audio_block_t *m_inputQueueArray[1];
};

AudioEffectFixedDelay fixedDelay;

AudioConnection input0(i2sIn, 0, fixedDelay, 0);
AudioConnection output0(fixedDelay, 0, i2sOut, 0);
AudioConnection output1(fixedDelay, 0, i2sOut, 1);

void setup() {
  // put your setup code here, to run once:
  delay(100); // wait a bit for serial to be available
  Serial.begin(57600); // Start the serial port
  while(!Serial) { yield(); }
  delay(100);

  // Setup the SPI stuff
  pinMode(SPI_CS, OUTPUT);
  digitalWrite(SPI_CS, HIGH);
  SPI.setMOSI(SPI_MOSI);
  SPI.setMISO(SPI_MISO);
  SPI.setSCK(SPI_SCK);
  SPI.begin();
  
  // Setup the callback
  callbackHandler.attachImmediate(&callback);

  codec.disable();
  AudioMemory(32);

  Serial.println("Enabling codec...");
  codec.enable();
  codec.volume(0.3);
  codec.inputSelect(AUDIO_INPUT_LINEIN);
}

size_t loopCounter = 0;
void loop() {
  // put your main code here, to run repeatedly:
  if (loopCounter == 0) {
    Serial.println("First loop");
  }
  loopCounter++;
  if (loopCounter % (65536*64) == 0) {
    Serial.println("Still alive");
  }

}

Blackaddr · Nov 5, 2019

Any developers who contributed to the Audio library or DMA development care to comment on this? I'd like to get some validation or discussion before submitting a bug report on github.

Once again, in summary it's not enough that the calls to the cache maintenance instructions have the address modified to be 32-byte aligned, the actual buffers themselves must be as well. Here's the conditions for this fault:

Precondition: the current code has data buffers in the audio pool at 4-byte aligned addresses, but not 32-byte aligned address which is L1 cache line size. Adjacent audio_blocks in the pool array will have their data buffers that fall within the same 32-byte aligned cache line.
1. I2S is being used (AudioOutputI2S is using DMA and making calls to arm_dcache_flush_delete()).
2. SPI is being used to DMA data to from audio_blocks in the memory pool.
3. one buffer used by I2S is adjacent in the pool to one being used for SPI. These buffers are "entangled" if they both have bytes within the same cache line.

In other words, assume a 32-byte cache line contains the last 16 bytes of of bufferA.data, then the next 16-bytes contains the first four bytes of bufferB, {ref_count, reserved1, memory_pool_index}, then the first 12-bytes of bufferB.data.

Code:

Cache line: || bufferA.data (last 16 bytes) | bufferB.ref_count (1 bytes) | bufferB.reserved1 (1 byte) | bufferB.memory_pool_index (2 bytes) | bufferB.data (first 12 bytes)||

When a cache maintenance operation is done on a given cache line, it affects all 32-bytes of memory, not just the one the user is interested in. When the user of bufferA does a flush on that shared cache line, they are overwriting the contents of OCRAM with whatever is in the cache line. If the user of bufferB has the DMA engine moving bytes into entangled part of the buffer, their data has been corrupted by the flush operation triggered by bufferA's user.

Similarly, if the user of bufferB has been using DMA to move bytes to the entangled part of the buffer and now wants to invalidate the cache line, if the cache line was dirty in the bufferA bytes and the bufferB user simply invalidates, the dirty bytes will be lost. You might think that always flushing before invalidating will solve this problem but it doesn't. There is still a time interval from when the the flush completes, to when the following DMA transfer completes. During that interval, the user of the other buffer could still be writing to the cache line and it becomes dirty again.

The only way to ensure to different contexts don't tramp on each other's memory buffers is to break the coupling caused by sharing a cache line between a DMA buffer and really anything else, including any adjacent heap registers. This guarantees that the when you perform a cache line operation, no other context, interrupt or AXI/DMA transfer in progress is going to be trying to do something with the other bytes that just happen to be in the same cache line as the DMA buffer.

PaulStoffregen · Nov 5, 2019

Sorry, I can't look at this in depth right now. But I can quickly point you to the 3 cache maintenance functions in imxrt.h.

https://github.com/PaulStoffregen/cores/blob/master/teensy4/imxrt.h#L8349

I tried to write detailed comments about the intended usage of each one. Hopefully that helps?

At this moment I'm working on an urgent matter, and then I have travel planned. I know this isn't a great answer, but please understand I probably won't be able to do much with this or any other "deep" issues until Nov 25th. If this isn't resolved, please ping me then... even by direct email with a link to this thread.

Blackaddr · Nov 5, 2019

Hi Paul, thanks for the reply. I've looked at the cache functions in quite some detail as I've been investigating this issue for about 2 months so I'm happy to wait a few more weeks. I've also had about half a dozen long emails back and forth with KurtE as he's spent a lot of time with the cache and DMA subsystem as well, but not the audio system.

This all boils down to whether or not the audio pool data buffers need to be 32-byte aligned (cache line size) or not, as at the moment they are only 4-byte aligned. I believe they do if you want to support using SPI DMA with audio applications.

Thanks again for taking the time to reply.

PaulStoffregen · Nov 5, 2019

Quick answer, going only from my memory & without looking into the actual code, is everything using DMA within the lib copies from the audio buffers to dedicated (usually larger) buffers specifically for DMA. Those bigger buffers almost certainly should be 32 bit aligned.

The audio buffers are 4 byte aligned not for DMA, but because many parts of the library fetch pairs of 16 bit samples using 32 bit LD & ST instructions (or more precisely, C++ code which compiles to those instructions). Perhaps those should now be 8 byte aligned, since DTCM and AXI are 64 bit buses?

That's the best I can do at this moment. I do have an urgent PJRC business matter needing my attention right now, so I just can't do more at this moment. Hopefully I'll get a few days to catch up before travel, but it's unlikely I'll manage to look into deep issues like this until after Nov 25th. Please ping me then. And in the meantime, I do hope you'll post all the relevant info to this public thread where I'll be able to see read it later this month.

Blackaddr · Nov 5, 2019

PaulStoffregen said:
Quick answer, going only from my memory & without looking into the actual code, is everything using DMA within the lib copies from the audio buffers to dedicated (usually larger) buffers specifically for DMA. Those bigger buffers almost certainly should be 32 bit aligned.

The audio buffers are 4 byte aligned not for DMA, but because many parts of the library fetch pairs of 16 bit samples using 32 bit LD & ST instructions (or more precisely, C++ code which compiles to those instructions). Perhaps those should now be 8 byte aligned, since DTCM and AXI are 64 bit buses?

That's the best I can do at this moment. I do have an urgent PJRC business matter needing my attention right now, so I just can't do more at this moment. Hopefully I'll get a few days to catch up before travel, but it's unlikely I'll manage to look into deep issues like this until after Nov 25th. Please ping me then. And in the meantime, I do hope you'll post all the relevant info to this public thread where I'll be able to see read it later this month.

Once again, thanks for your time Paul.
- just to be clear, please note my suggestion is the AUDIO_BLOCK_SAMPLE sized audio data buffers (audio_block_t.data) must be be 32 BYTE aligned for DMA, not 32-BIT aligned.

All relevant technical info is posted to this thread, as well as example code that reproduces the problem and a potential solution on the PJRC Audio board.

Safe travels! We can discuss here when you get back.

WMXZ · Nov 15, 2019

Blackaddr said:
- just to be clear, please note my suggestion is the AUDIO_BLOCK_SAMPLE sized audio data buffers (audio_block_t.data) must be be 32 BYTE aligned for DMA, not 32-BIT aligned.

AFAIK, there should be NO DMA that directly accesses audio-block data. If there are few objects that do that, maybe they should be rewritten. Could you give some examples, where audioblocks are used as DMA addresses?

Blackaddr · Nov 16, 2019

WMXZ said:
AFAIK, there should be NO DMA that directly accesses audio-block data. If there are few objects that do that, maybe they should be rewritten. Could you give some examples, where audioblocks are used as DMA addresses?

Thanks WMXZ for taking the time to read and comment, the help is appreciated.

I looked over the output_i2s.cpp, you're right WMXZ, the output_i2s.cpp creating a local 32-byte aligned i2s_tx_buffer for DMA. It uses that buffer for DMA then performs memcpy between this buffer and the audio pool data buffer. I'll be honest, I overlooked this because I would have expected the DMA to go directly to/from the target audio buffer pool and avoid extra memcpy's.

The problem then is that I am in fact DMAing directly to the audio data buffer because I need to move audio blocks to and from SPI memory for delay. This is done in Pauls external audio delay with regular SPI though, he just doesn't use SPI DMA which I use quite a bit in my BALibrary.

So I can alter my summary of this issue a bit. If you want to use SPI DMA with audio, either the audio pool data buffers need to become 32-byte aligned or I need to do the same thing as output_i2s: make a local 32-byte aligned buffer and memcpy every data buffer going in and out. It's my SPI DMA transfers that are clobbering the audio bool.

I'll be honest that I would greatly prefer to avoid memcpy's on all these buffers just to get around the fact the audio pool buffers aren't 32-byte aligned for DMA access.

PaulStoffregen · Nov 16, 2019

Why so much aversion to memcpy()?

Perhaps in the future the blocks many align to 32 byte cache boundaries. But doing that (easily) with the existing data structure would waste quite a lot of RAM. As trade-offs go, the choice between these 2 seems pretty clear to me.

Sure, restructuring how the block metadata is stored could avoid both. But that has a huge trade-off of requiring engineering time and extensive testing, which are the *really* limited resources!

Blackaddr · Nov 16, 2019

PaulStoffregen said:
Why so much aversion to memcpy()?

Perhaps in the future the blocks many align to 32 byte cache boundaries. But doing that (easily) with the existing data structure would waste quite a lot of RAM. As trade-offs go, the choice between these 2 seems pretty clear to me.

Sure, restructuring how the block metadata is stored could avoid both. But that has a huge trade-off of requiring engineering time and extensive testing, which are the *really* limited resources!

The aversion to memcpy is every single word copied takes multiple CPU instructions from the various load-stores. At 44 Khz, you're talking a couple hundred thousand instructions per second for a single stream. If you are having to copy multiple buffers in and out of SPI RAM as well as the I2S transfers perhaps were up to millions of instructions per second just to copy data. Yeah, it's probably only a few percent of a Teensy 4 but it would have been nice to avoid it with DMA so copies are not done by CPU load/stores at all.

As for wasting ram, yeah, it's gonna increase the the total memory pool consumption by 10% for padding bytes. Not ideal. It would be nice if the data buffers were separate from the control struct but that would have to be done in a completely backwards compatible otherwise it's gonna break everything.

Blackaddr · Nov 20, 2019

PaulStoffregen said:
Sure, restructuring how the block metadata is stored could avoid both. But that has a huge trade-off of requiring engineering time and extensive testing, which are the *really* limited resources!

There is a way to restructure the metadata in a fairly backwards compatible way. I tried this out and it it worked on my setup.

1) Modify the audio_block_t to contain (int16_t*) instead of a fixed array (int16_t[AUDIO_BLOCK_SAMPLES]). Replacing an array with a pointer is highly compatible as the array variable is really just a pointer anyway.

Code:

typedef struct audio_block_struct {
	uint8_t  ref_count;
	uint8_t  reserved1;
	uint16_t memory_pool_index;
	int16_t  *data; // changed from int16_t[AUDIO_BLOCK_SAMPLES]
} audio_block_t;

2) Next we will define our data buffers int16_t array at the same time we define the array for the audio_block_t. We also add a new parameter to the initialize function in order to pass the address of the new dataBuffers array.

Code:

#define AudioMemory(num) ({ \
	static DMAMEM audio_block_t data[num]; \
        DMAMEM __attribute__((aligned(32))) static int16_t dataBuffers[AUDIO_BLOCK_SAMPLES*num]; \
	AudioStream::initialize_memory(data, num, dataBuffers); \
})

3) We update the initialize_memory() function to assign the int16_t* data pointer.

Code:

void AudioStream::initialize_memory(audio_block_t *data, unsigned int num, int16_t *dataBuffers)
{
        // ...
	for (i=0; i < num; i++) {
		data[i].memory_pool_index = i;
                data[i].data = &dataBuffers[i*AUDIO_BLOCK_SAMPLES]; // assign the pointer to the slot in the dataBuffers array.
	}
        // ...
}

Compatability issues? There is only one. AudioStream.cpp does a sizeof() call on the original int16_t[AUDIO_BLOCK_SAMPLES] array variable. You can't call sizeof() on it if we change it to a simple pointer. But since this size is already known, it doesn't need to be obtained by sizeof(). Note: I still had to fix the issue in effect_freeverb.cpp as mentioned above.

Code:

//if (p) memcpy(p->data, in->data, sizeof(p->data)); // Old code called sizeof(), which would return the size in bytes of the int16_t[AUDIO_BLOCK_SAMPLES] array.
if (p) memcpy(p->data, in->data, sizeof(int16_t) * AUDIO_BLOCK_SAMPLES); // new code obtains the same value without relying on the audio_block_t.data being a type of fixed array size.

Does this approach have any chance of making it into the code base if I make a pull request? Ultimately, all we are doing is simply replacing the fixed size array variable with a pointer instead.

Blackaddr · Jan 23, 2020

PaulStoffregen said:
Sorry, I can't look at this in depth right now. But I can quickly point you to the 3 cache maintenance functions in imxrt.h.

https://github.com/PaulStoffregen/cores/blob/master/teensy4/imxrt.h#L8349

I tried to write detailed comments about the intended usage of each one. Hopefully that helps?

At this moment I'm working on an urgent matter, and then I have travel planned. I know this isn't a great answer, but please understand I probably won't be able to do much with this or any other "deep" issues until Nov 25th. If this isn't resolved, please ping me then... even by direct email with a link to this thread.

Hey Paul, I created a pull request to illustrate a potential solution to this problem. The method is to replace the audio data buffer array in audio_block_t with a pointer. This makes the change fairly backwards compatible for accessing the data as an array is a pointer anyway.

I'm happy to ping you to discuss but I can't send you private message due to permissions. Feel free to send me a private message and I can reply to you.

Frank B · Jan 24, 2020

The memcpys are not that slow. In fact they are quite fast, and most do not simply copy:
We have to deal with different formats, a simple example is I2S input : The output is LRLR but we need LLRR.
Then, the memcpys use optimized 32bit accesses (Take a look at the audio memcpy asm file and feel free to optimize it further). The gain with using pointers can not be that much..?

If we ever add some padding bytes ( on T4 only ), I'd knew to use some of them.. like a timestamp or some flags.

Blackaddr · Jan 26, 2020

It's more than just a speed difference. My particular issue is backwards comparability. Teensy 3.X permits to you DMA directly in and out of Audio buffer because that processer only as a 4-byte alignment requirement on the buffer. The Teensy 4.X has a 32-byte alignment requirement. The DMA audio support in my BALibrary is completely broken because of this (it's currently disabled for T4.0 builds), and it took me 3 months of work to track down the problem to what's described in this thread. By separating the data buffers from the control blocks, it allows you to support maintain direct DMA transfers in/out of the buffers for any cache alignment requirement. This is important if you want to DMA audio to any peripherals.

I proposed two ways we could change this, though there may be others. I'm just trying to find an acceptable way to get those buffers aligned for DMA. One way is to simply pad the audio_block_t and move the data buffer first instead of last. Another solution is to break out the data buffer as an embedded array and replace it with a pointer. Both are fairly backwards compatible solutions. The first is completely 100% compatible but wastes bytes, the second has no waste but any users who wrote code treating the audio_block_t fields and its data buffer as a contiguous block of raw memory will have issues (this is done in effect_freeverb.c but for creating a zero'd audio block but can easily adjust to support a pointer).

If we don't fix it in the AudioStream files, I'll need to go add custom code to my library specifically for the T4 to create extra 'man-in-the-middle' buffers and do memcpys in and out of them in order to allow the DMA engine to work betwen the middle-man buffers and the SPI device.

It looks like at this point I'll have to take the man-in-the-middle approach as their doesn't seem to be much appetite to make the audio buffers themself DMA compatible.

Frank B · Jan 26, 2020

Are there any negative side effects with your pointer solution?

Can I just use your pullrequest, and i.e. my MP3-thing or LTC, or other user-written code will still work?

WMXZ · Jan 26, 2020

Blackaddr said:
If we don't fix it in the AudioStream files, ….

Yes, the AudioStream way of handling data is... a little bit restrictive...
I ended up to use my own version "mAudioStream" only to have integer32 or even float data or keeping multiplexed data, etc.
the need to duplicate a whole file only to change the description of the data is not ideal.

Anyhow, If there are more flexible ways to implement data-driven processing, I would vote for it.

Frank B · Jan 26, 2020

@Wmxz: The least painful way to support different dataformats would involve a flag in every packet that tells the dataformat.
The default receiveReadOnly();, receiveWritable, allocate.. would need a automatic translation, i.e. float -> int16, float-capable objects would use a new "receiveReadOnlyFloat()" and so on.
So, worst case, i.e. a not float capable filter would get int16 and transmit it. Best case would be all parts in the chain would support the new format.
This way we have all time we want to make the existing code work with other data-formats.

Blackaddr · Jan 26, 2020

Frank B said:
Are there any negative side effects with your pointer solution?

Can I just use your pullrequest, and i.e. my MP3-thing or LTC, or other user-written code will still work?

You can use my pull request for cores/teensy4 as-is and it should work for your user code, but you must also make a change in the Audio library, effect_freeverb.cpp to compile. There is a nasty bit of non-portable code that hard-codes audio_block_t as an arrays of zeros, this can be temporarily replaced with

Code:

static const audio_block_t zeroblock = {0};

In fact, this hard-coded zeros bothers me so much I might make a pull request regardless to clean that up.

Regarding negative size effects, an array is really just a pointer anyway except it's guaranteed to be valid. It's only when you do something weird (like in effect_freeverb.cpp) that you have problems.

Frank B · Jan 26, 2020

Blackaddr said:
bothers me so much I might make a pull request regardless to clean that up.

Please do

I've seen this too, a few month ago..

I'll try your PR and then vote for it...

Blackaddr · Jan 26, 2020

WMXZ said:
Yes, the AudioStream way of handling data is... a little bit restrictive...
I ended up to use my own version "mAudioStream" only to have integer32 or even float data or keeping multiplexed data, etc.
the need to duplicate a whole file only to change the description of the data is not ideal.

Anyhow, If there are more flexible ways to implement data-driven processing, I would vote for it.

You are right. If we wanted AudioStream to be more future-proof, we need to have at least two objectives:
- separate the control information (meta data) from the audio data. This could be done with a pointer to the buffer rather than an embedded struct
- allow new fields to be added to the control struct in the future in backwards compatibility way. An enum specifying the format, and a size_t specifying the size would be ideal.

By abstracting the size of the buffer and the format in audio_stream_t, it provides an interface for users to customize the format to their needs. The Audio library doesn't need to implement support for different audio formats, it really just needs to provide an interface that is expandable. The way it is now "bakes in" a few things.

Frank B · Jan 26, 2020

Another great thing would be to find a way to support different sampling frequencies. For example input SPDIF 48kHz (or qeues, or...) , another input 96kHz, things inbetween (i.e. mixer), output 44.1kHz. The samplerate conversion is only one (solvable)part - there are more things to do, as the 44.1kHz are hardcoded everywhere.

Blackaddr · Jan 26, 2020

Frank B said:
P
I'll try your PR and then vote for it...

Let me know if you have any problems with the PR and I'll look into it. If you also have any ideas on how to accomplish the same thing in a cleaner or more-portable, future-proof way I'd love to hear it!

Frank B · Jan 26, 2020

Blackaddr said:
Let me know if you have any problems with the PR and I'll look into it. If you also have any ideas on how to accomplish the same thing in a cleaner or more-portable, future-proof way I'd love to hear it!

eh.. call me silly, but I can't find your pullrequest anymore..

Edit: Took your code from #12 (and deleted freeverb )
MP3 works. Great, since it is WIP again (I'm working on parallel playing files today)

No problems, no warnings, all is fine.
I'll support this. Esp. since it is for T4 only. (hmmm... is it? )

Edit: Found your PR.

Blackaddr · Jan 27, 2020

Frank B said:
No problems, no warnings, all is fine.
I'll support this. Esp. since it is for T4 only. (hmmm... is it? )

You are correct, the purpose of this is to fix the direct-DMA capability with the audio buffers which is only broken on the T4. There's no pressing need to make a similar change to the other Teensies, though for consistency it might make sense to do so once the change has had sufficient testing on the T4.

Teensy 4.0 problem with DMA to/from OCRAM interacting with Audio library

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Senior Member

Well-known member

Senior Member

Well-known member

Senior Member

Well-known member

Senior Member

Well-known member

Senior Member

Well-known member

Senior Member

Well-known member