Issues with DMAChannels

jmarsh

I'm having a problem related to Teensyduino's DMAChannels/eDMA module and am curious to hear any suggestions.

My project is for the T4. I have display driver code that constantly uses a pair of DMAChannels to render data. These channels are "bursty" - they activate infrequently to transfer large amounts of data from one part of memory to another.
I also have an I2S DAC connected (a PCM5102 but that's not particularly relevant) driven by the Audio library using an AudioOutputI2S object. This also uses a DMAChannel to feed the SAI TDR register as required to maintain continuous playback.

The problem: when the bursty DMA channels activate, they are given priority over the audio DMA channel. This lasts long enough that the audio output underruns (which the audio library doesn't attempt to handle) and it shuts off completely. The entire Audio library stops functioning due to the updates not being triggered.

The DMA channels operate in fixed priority mode. Higher channel numbers by default are given priority over lower ones. The DMAChannels are more-or-less initialized statically, which means their order is up to the whim of the compiler.
Inevitably the AudioOutputI2S object seems to get initialized first, meaning its DMAChannel ends up using channel 0 with the lowest priority.
It's a protected member of the class so it's inaccessible without breaking the API, meaning I can't use the DMAPriorityOrder() functions to give it higher priority than the display DMA channels.
The DMA engine has individual registers for setting each channel's priority, but you can't touch them once the DMA engine has been initialized without running the risk of triggering a priority conflict error (the priority field in each channel's register must hold a value that is unique within its group).
The DMAChannel library only manages the lowest 16 channels; there are another 16 channels above those, but there is no API method to allocate them. If you use them you run the risk of stomping on someone else's code. Plus the upper 16 channels belong to group 1 while the lower 16 channels are all group 0, and group 1 by default has higher priority while I want my display channels to be lower priority.
I could reverse this by rewriting the DMA_CR register to swap the channel group priorities. But if another DMAChannel is initialized after this, it will reinitialize the DMA_CR register to the default setting (group 1 back to higher priority than group 0). Likewise for reconfiguring DMA_CR to activate round-robin arbitration rather than fixed priority mode.
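For reference, the per-channel priority registers mentioned above pack several fields into one byte. Here's a sketch of composing such a value (field layout per my reading of the i.MX RT reference manual; `compose_dchpri` is my own helper name, not part of DMAChannel):

```cpp
#include <cstdint>

// DCHPRIn field layout for the i.MX RT eDMA (per my reading of the
// reference manual -- double-check against your silicon's docs):
//   bit 7   ECP   - enable channel pre-emption (channel CAN be pre-empted)
//   bit 6   DPA   - disable pre-empt ability (channel canNOT pre-empt others)
//   bits 3:0 CHPRI - arbitration priority within the group (must be unique)
static uint8_t compose_dchpri(uint8_t chpri, bool preemptible, bool no_preempt_ability) {
  return (preemptible ? 0x80 : 0x00)
       | (no_preempt_ability ? 0x40 : 0x00)
       | (chpri & 0x0F);
}
```

Writing such a value into DMA_DCHPRIn while the engine is idle would set both the priority and the pre-emption flags in one go, but as noted above, doing so after other channels are running risks a priority-conflict error.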

So in short, I'm stuck. I want my code to be portable and usable with other libraries, but the DMAChannel library isn't flexible enough and will potentially stomp on any workarounds I try to use.
 
You could try creating the AudioOutputI2S object last using new in your setup() code, along with the required AudioConnections to connect it to the rest of your design. I’m assuming here (no code posted…) there’s nothing else that will take on update responsibility.

I know Paul added some code to deal with updates not happening if the design had no object with update responsibility, e.g. with only USB output. I can’t find it in the codebase right now, so you may need to tinker a bit to get things happening in the order needed.

With the stock Audio library you’re OK to create objects at run time, just not to delete them.
 
…found it, it’s in the code where AudioMemory is set up. So, make sure you create your AudioOutputI2S before you call AudioMemory().
 
USB audio output works fine, that's how I tracked the problem down to the audio DMA channel being starved of runtime.
Dynamic object creation doesn't really fix the problem; it just shifts it. In this case my display driver is for a USB device, so in the ideal programming model the driver would only be instantiated when the device gets connected. Then the same issue would reappear.

The real problem is there's no way of properly setting DMA priorities, or alternatively specifying the desired channel to use* (since they effectively have fixed priorities). Even the existing DMAPriorityOrder functions are ineffective since they just swap the TCD pointers without transferring the contents, which wouldn't work with something like AudioOutputI2S since it only initializes the TCD one time during object creation.

(* Being able to specify a desired channel would allow periodic DMA triggering to be used, since that is limited to DMA channels 0-3 along with PIT channels 0-3.)
 
I was thinking in terms of a workaround for this:
The DMAChannels are more-or-less initialized statically, which means their order is up to the whim of the compiler.
Inevitably the AudioOutputI2S object seems to get initialized first, meaning its DMAChannel ends up using channel 0 with the lowest priority.
The mention of Paul’s timer mod to create updates if there’s no object with update responsibility was only to help with potential odd behaviour if you thought you’d created the AudioOutputI2S to pace the audio, when in fact you might not have done so.

Your post #4 appears to contradict #1 - first it was an issue with uncontrolled ordering of static objects (a known C++ issue), then you’ve started talking about things being instantiated at connection time.

Maybe you could allocate some DMA channels, then the AudioOutputI2S, then deallocate the DMA channels again, so the USB uses them? You’d need to look at the DMAchannel code to get an idea of whether that stands a chance of success.

It seems pretty unlikely, at least in the short term, that any changes to the Teensyduino API are going to happen, though you could code up what you want and submit a PR. But TD releases seem to take nearly a year each, and new functionality often takes several cycles before it’s adopted…
 
Trying to influence the dice roll of which DMA channel gets assigned to which object is not a solution. There isn't even a way to know which channel the AudioOutputI2S object has been assigned because it's a protected member.

I'm thinking maybe the simplest solution would be to add a "pre-emptible" bool to the DMAChannel constructor which would default to false (which matches existing behaviour). Channels initialized with that flag set would have preemption enabled and be allocated from 0-15 (group 0), while channels without the flag would have "can preempt" ability and be allocated from 16-31 (group 1). Channels could still be prioritized relative to other channels within the same group, if anyone is in fact using the PriorityOrder functions.
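To illustrate the proposed allocation scheme (this is just my sketch of the idea, not actual DMAChannel code; the 32-bit mask mirrors the library's existing allocation mask):

```cpp
#include <cstdint>

// Sketch of the proposed split allocator: pre-emptible channels come from
// group 0 (channels 0-15), non-pre-emptible ones from group 1 (16-31).
// 'mask' has one bit per channel; a set bit means already allocated.
// Returns the channel number, or -1 if the chosen group is exhausted.
static int alloc_channel(uint32_t &mask, bool preemptible) {
  int first = preemptible ? 0 : 16;
  int last  = preemptible ? 15 : 31;
  for (int ch = first; ch <= last; ch++) {
    if (!(mask & (1u << ch))) {
      mask |= (1u << ch);
      return ch;
    }
  }
  return -1; // group full
}
```

With group 1 holding higher priority by default, existing code that never sets the flag keeps its current "can preempt, can't be preempted" behaviour.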
 
As noted above ... yes, your suggestion may be "best", but it's highly unlikely to get into Teensyduino soon, even less likely if you don't put the work in to generate a PR which meets Paul's criteria.

From your original post it sounds as if you have a use case very specific to you, maybe a product you want to market, or at least something you can publish on github; it's unclear if it's a library or an application. You would obviously like your code to stand a chance of still working for future releases of Teensyduino. As you also noted, within the existing API, you're stuck for a robust formal solution at this time. That's your baseline.

If you care to put in the work, you could try my suggested workarounds, or any others that occur to you. Yes, it will probably involve you unprotecting various classes so you can observe the system behaviour for various orders of execution, instantiation and so forth, but you can still re-test your system using a vanilla Teensyduino once you think you have it working.

Should you decide to do that, you will obviously have to look at the DMAChannel code very closely. A cursory inspection suggests that it's eminently possible to choose exactly which DMA channel is used (it's code, not a "dice roll"...), at least at construction of the AudioOutputI2S object, provided you do so using new in setup(). It'll be kludgy and risks not working if the DMAChannel code gets a major overhaul (unlikely, IMHO), but you really only need to re-test for each Teensyduino release.

If none of the above is palatable to you, then I'd have to agree you're right. You're stuck.
 
A solution to the specific problem in the opening post isn't what I created the thread for. I have other DMA-based code that performs table lookups by poking values into DMA TCDs and chaining them together; it's not triggered by external hardware signals so I'm pretty sure it would have the same issue if I tried to use audio at the same time.

We need a bulletproof way to create either a low priority DMA channel that won't block others, or a high priority channel that can't be blocked. At the moment we have neither, at best we can set relative priorities of channels that are created (but not yet initialized!) and accessible by the same piece of code, which is impractical when channel usage is sprinkled throughout libraries.
(The Audio library in particular always allocates a couple of channels regardless of them actually ever getting used, triggered merely by #including the Audio.h header.)

A cursory inspection suggests that it's eminently possible to choose exactly which DMA channel is used (it's code, not a "dice roll"...)
It's not when you have no idea how many static DMA channels there are in total, how many were initialized before you, or if some have been released causing the mapping to be fragmented...

The DMAChannel code as it currently exists is already broken. The T4 code also has at least one bug due to copy-pasting the T3 code. An overhaul (not that I would consider my proposed changes that extensive; handling shared IRQs is really not difficult) would not be a bad thing. At the very least it requires some attention.

Frankly I'm surprised this issue hasn't been discovered before given how common it is for projects to combine DMA driven ILI9XXX displays with the Teensy audio board...
Edit: now that I look around, it has come up at least once, with the suggested solution being to use pre-emptive channels. "Audio can probably tolerate many microseconds of latency"? Not so much in my case, apparently.
 
Added support for pre-empt/non pre-emptible channels: https://github.com/A-Dunstan/cores/commit/331ad25279fd87351aee8e9ee6b69497339b5b41

The default is the old behaviour (non pre-emptible), so the Audio library and everything else that already uses DMAChannels will work as they did before. But now for my bursty channels, I can use channel.begin(false, true) to get a channel that will be both lower priority and pre-emptible.

Out of interest I timed how long the bursty transfers typically take to run, and they run for 666us at the beginning of each frame to transfer 32KB, then the two channels alternate transfers taking 333us to move 16KB each. So I suspect it's that initial long transfer at the beginning of the frame that is starving the I2S audio. Shrinking the size of transfers isn't an option since that increases the amount of interrupts per frame.
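As a sanity check on those numbers (plain arithmetic, nothing hardware-specific), both transfer sizes work out to the same burst rate:

```cpp
// 32 KB in 666 us and 16 KB in 333 us both come out to roughly 49 MB/s,
// so the burst rate is consistent across the two transfer sizes.
static double rate_mb_per_s(double bytes, double micros) {
  return bytes / micros; // bytes per microsecond == (decimal) MB per second
}
```

At roughly 49 MB/s the bus is effectively saturated by the burst for its whole duration, which is consistent with the audio channel being starved.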
 
Looks good, from a glance at the code. Is there a "simple" scenario that would enable me to do a bit of testing? It's obviously a bit hardware dependent ... I have various audio boards and TFT screens I could roll out ... can SPI transfers to the screen be made to use DMA, e.g. save/restore an area? Thinking aloud here...
 
This example runs a (very stupid) DMA channel to demonstrate:
Code:
#include <DMAChannel.h>
#include <Audio.h>

DMAChannel dma(false);

AudioSynthWaveform waveform1;
AudioOutputI2S audio_out;
AudioConnection pc1(waveform1, 0, audio_out, 0);
AudioConnection pc2(waveform1, 0, audio_out, 1);

uint32_t g_int1;
uint32_t g_int2;

static void dma_isr(void) {
  static int count;
  if (++count == 1000000) {
    count = 0;
    digitalToggleFast(LED_BUILTIN);
  }
  dma.clearInterrupt();
}

void setup() {
  Serial.begin(0);
  while (!Serial);

  AudioMemory(15);
  waveform1.begin(WAVEFORM_SINE);
  waveform1.frequency(440);
  waveform1.amplitude(0.3);

  pinMode(LED_BUILTIN, OUTPUT);
  digitalWriteFast(LED_BUILTIN, LOW);

  // change second parameter here to true to make this channel pre-emptible
  dma.begin(false, false);
  dma.attachInterrupt(dma_isr);
  dma.triggerContinuously();
  dma.interruptAtCompletion();
  dma.sourceBuffer(&g_int1, sizeof(g_int1));
  dma.destinationBuffer(&g_int2, sizeof(g_int2));
  dma.triggerAtCompletionOf(dma);
  delay(3000);
  dma.triggerManual();
}

void loop() {}
If you leave dma.begin(false, false) as it is, that DMA channel will hog all the operation time and the sine wave audio will cut out as soon as the 3 second delay ends. Changing it to dma.begin(false, true) lets the audio continue as expected.
 
After a long gap I'm coming back to this, and using it in anger! It's definitely made an improvement in my attempts to have audio and asynchronous TFT updates co-existing, but I'm not quite there for some reason.

If I can't figure it out I'll try to make a small example (forum rule :) ), but here's a quick description in case you're in a position to say "oh yes, the gotcha here is...".

Existing DMA seems fine for async SPI updates from the framebuffer. However, for clipped updates I think I need to copy an area of the framebuffer to an intermediate buffer, and then DMA that out to the TFT. In theory I can chain a memory-to-memory DMA to a memory-to-SPI one, and interrupt out after the latter in order to set up the next pair (or stop).

Without your prioritisation code, the first such update killed audio for ever; with it, audio stops during the update, but resumes once it's complete. The SPI DMA and interrupts are set to lower priority than audio. Apart from that, the clipped async update works fine.

I think it may be something to do with triggering: AFAIK one needs to do a triggerManual() for memory-to-memory DMA, but a triggerAtHardwareEvent() for SPI ... but scatter-gather doesn't provide for that to be changed on the fly (so why does it even work?!). I'm currently exploring doing two separate transactions and interrupting after both - that's not working yet.
 
Existing DMA seems fine for async SPI updates from the framebuffer. However, for clipped updates I think I need to copy an area of the framebuffer to an intermediate buffer, and then DMA that out to the TFT. In theory I can chain a memory-to-memory DMA to a memory-to-SPI one, and interrupt out after the latter in order to set up the next pair (or stop).
Is there some sort of conversion happening here, or reordering? I'm just wondering why if you can use DMA for memory-to-memory, what is preventing you doing a direct memory-to-SPI?

Without your prioritisation code, the first such update killed audio for ever; with it, audio stops during the update, but resumes once it's complete. The SPI DMA and interrupts are set to lower priority than audio. Apart from that, the clipped async update works fine.
Hmmm... Even if the low priority channel has a large minor loop that could potentially delay an audio update, it should get pre-empted as soon as the audio is triggered. Maybe the pre-emption part isn't working correctly and the improvements are only due to the channel priorities providing better arbitration.
I think it may be something to do with triggering: AFAIK one needs to do a triggerManual() for memory-to-memory DMA, but a triggerAtHardwareEvent() for SPI ... but scatter-gather doesn't provide for that to be changed on the fly (so why does it even work?!). I'm currently exploring doing two separate transactions and interrupting after both - that's not working yet.
If the channel is set to trigger based on SPI then you'd have to link to a different channel (set to run continuous) rather than use scatter-gather.

Edit: Actually, if the memory-to-memory transfer is executed first I think this can be done with one channel: the mem-to-mem transfer should use triggerAtTransfersOf() to trigger itself after each minor loop. Then when you call triggerManual() it will just keep running without requiring the hardware trigger.
 
Is there some sort of conversion happening here, or reordering? I'm just wondering why if you can use DMA for memory-to-memory, what is preventing you doing a direct memory-to-SPI?
The memory-to-memory is assembling pixel data from separate framebuffer lines into a continuous stream. Say the clip area is 80x80, the minor loop copies 80 uint16_t and updates the source pointer to the start of the next 80 pixels on the next line, with the major loop copying up to 80 such lines (usually less). I can't do this memory to SPI, because 80 words is more than the 16-word FIFO in the SPI.
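For anyone following along, the mem-to-mem step is equivalent to this plain-C++ clipped copy (my own sketch of what the TCD's minor/major loops are doing; the function name and signature are hypothetical):

```cpp
#include <cstdint>
#include <cstring>

// Software equivalent of the mem-to-mem DMA: each "minor loop" copies one
// row of the clip rectangle, then the source pointer advances by the full
// framebuffer stride to reach the same columns on the next line.
static void copy_clip(const uint16_t *fb, int fb_width,
                      uint16_t *out, int clip_x, int clip_y,
                      int clip_w, int clip_h) {
  const uint16_t *src = fb + clip_y * fb_width + clip_x;
  for (int row = 0; row < clip_h; row++) {        // major loop: one row per iteration
    memcpy(out, src, clip_w * sizeof(uint16_t));  // minor loop: clip_w pixels
    src += fb_width;                              // stride to next framebuffer line
    out += clip_w;                                // output stream stays contiguous
  }
}
```

In TCD terms the stride adjustment corresponds to the source offset applied after each minor loop, while the destination advances linearly into the intermediate buffer.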
Even if the low priority channel has a large minor loop that could potentially delay an audio update, it should get pre-empted as soon as the audio is triggered.
I think the minor loop can't be stopped or pre-empted. For my use-case that's fine, it'd never be more than 480 pixels. My reading of the Reference Manual suggests that pre-emption can happen at the end of a major loop, but again, even if that's wrong, I've sized the transactions such that I can free up the DMA and the SPI bus every millisecond or so (needed for audio playback from SD card); that mechanism has been working fine.

Maybe the pre-emption part isn't working correctly and the improvements are only due to the channel priorities providing better arbitration.
It's definitely doing some good, as audio output is only semi-broken with it, as opposed to totally broken!

If the channel is set to trigger based on SPI then you'd have to link to a different channel (set to run continuous) rather than use scatter-gather.
That's why I'm a bit puzzled that it's working as-is! I don't really want to add another DMA channel to the mix, there aren't that many of them. I can either interrupt and re-purpose the one channel, or just do the memory-to-memory copy in code - it won't take long.

Edit: Actually, if the memory-to-memory transfer is executed first I think this can be done with one channel: the mem-to-mem transfer should use triggerAtTransfersOf() to trigger itself after each minor loop. Then when you call triggerManual() it will just keep running without requiring the hardware trigger.
I'll have to read up on that trigger method - thanks.
 
That's why I'm a bit puzzled that it's working as-is! I don't really want to add another DMA channel to the mix, there aren't that many of them. I can either interrupt and re-purpose the one channel, or just do the memory-to-memory copy in code - it won't take long.
I imagine it would still work because when SPI triggers the channel, the mem-to-mem transfer doesn't do anything to "untrigger" it - it doesn't give any data to SPI - so the trigger just stays continuously active until the chained TCD gets loaded and fills up the SPI buffer. Maybe that's good enough without doing anything more complicated.

The reference manual says "After eDMA activates a channel, it runs until the minor loop completes, unless preempted by a higher priority channel." So the minor loop should be interruptible (as long as a pre-emption hasn't already taken place - only one at a time). That's how I presumed it worked, since channel arbitration (based on channel priority) takes place at the end of a minor loop.
 
Making progress, I think. I noticed that interrupts are shared between DMA channels X and X+16 ... and SPI was on channel 0, and Audio on 16. Creating a dummy DMAChannel in the right place pushes Audio to 17, and it all starts working.

So further investigation needed, but at least there's an avenue to explore :)
 
In theory that shouldn't be an issue, the necessary code to handle interrupt sharing is there and the example posted above uses channels 0 and 16.
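For reference, the sharing works out like this on the T4: 32 channels, 16 interrupt vectors, so the vector index is just the low four bits of the channel number (the helper name is mine, not the library's):

```cpp
// On the T4 the eDMA's 32 channels share 16 interrupt vectors:
// channel n and channel n+16 raise the same IRQ, so the vector index
// is the channel number modulo 16.
static int dma_irq_index(int channel) {
  return channel & 15;
}
```

That's why SPI on channel 0 and Audio on channel 16 end up contending over one vector (and one NVIC priority).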
 
Yes - I stared at that for quite a while trying to pick holes in it, to no avail ... then the penny dropped.
Code:
void DMAChannel::attachInterrupt(isr_t isr, uint8_t prio) {
    // since two channels share each interrupt, use whichever priority is lower
    uint8_t irq = CHANNEL_TO_IRQ(channel);
    if (_VectorsRam[IRQ_TO_EXCEPTION(irq)] == unused_interrupt_vector ||
      NVIC_GET_PRIORITY(irq) > prio) {
        NVIC_SET_PRIORITY(irq, prio);
    }
    attachInterrupt(isr);
}
Previously this would just set the requested priority, regardless; now it sets the "lower" priority of the current or requested. Or of course, actually, the higher priority, because numerically lower values are higher priority...

Unfortunately, for my application, I need the SPI interrupt priority to be lower than the Audio update interrupt (IRQ_SOFTWARE - I don't care about its DMA IRQ priority); I've set SPI to 224, where Audio is 208. But by the time I get to setting it, the Audio has set the vector and the priority is locked to at least 128.

I'll need to give the API a bit of thought, see what I can do "under the hood" or if I should make it very visible if there's a clash.

Thanks for your help, anyway ... not sure I could've got there without it. Certainly not as fast...
 
I don't think there's a "clean" solution for this... at best the channel assignment in DMAChannel::begin() could be tweaked to prioritize unshared channels first before using shared ones. Otherwise you could manually twiddle dma_channel_allocated_mask before calling begin() on your SPI channel to ensure it gets a non-shared channel (and adjusting the mask back afterwards).
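A sketch of the mask-twiddling idea (pure bit manipulation; `dma_channel_allocated_mask` is the real variable in the DMAChannel code, but the helper below is hypothetical):

```cpp
#include <cstdint>

// Find a channel whose shared-IRQ partner (channel +/- 16) is still free,
// scanning a 32-bit allocation mask (bit set = channel in use).
// Returns -1 if every free channel has an allocated partner.
static int find_unshared_channel(uint32_t mask) {
  for (int ch = 0; ch < 32; ch++) {
    int partner = ch ^ 16;  // the channel sharing our IRQ vector
    if (!(mask & (1u << ch)) && !(mask & (1u << partner)))
      return ch;
  }
  return -1;
}
```

Before calling begin() on the SPI channel, you could temporarily set the bits of channels whose partners are already taken so begin() skips them, then clear those bits again afterwards.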
 
you could manually twiddle dma_channel_allocated_mask before calling begin() on your SPI channel to ensure it gets a non-shared channel (and adjusting the mask back afterwards).
Now that's a clever idea!

Do you plan to make a PR for this? Apart from those of my own making, I haven't had any issues with it, and I can't see that there would be any.
 
I still need to find the time to do some "heavy" testing with audio and my old VGA FlexIO code combined, since in that case both of the DMA operations are time critical (4bpp VGA is roughly 12MB/s). Mostly I just don't want to breadboard all the resistors for the VGA R2R ladder again...
 
Having some issues changing the interrupt priority after it's been attached. Your code allows it to be set if the vector is unused or you're attempting to raise it, but doesn't allow you to "change your mind later". Would it be sensible to allow any change, provided the vector is unused or the ISR address supplied is the same as it's currently set to?
 
If the vector is unused the priority won't matter because the irq shouldn't be enabled.
For the case where attachInterrupt() has already attached a function and you're calling it again with the same function just for the purposes of changing the priority... I would be questioning why since dynamically juggling IRQ priorities isn't really sane, but I guess that case should be supported since the original code allowed it (at least without having to use detachInterrupt() first). But if the IRQ is actively shared I would still want the higher priority to be used.
 
It’s only being called a second time because there’s no clean way to set up the display with DMA and a specific IRQ priority, given the legacy API. Hence I’ve tried to graft on priority setting after DMA begin(), because we don’t know the channel assignment until that’s happened.

I’ve taken steps to ensure the display claims a channel where it can mark the other one as also used … which is a lie, but I believe in a good cause :unsure:. It’s still a work in progress, so giving the option not to do that is definitely on the cards, and should probably be the default.

I’ve also included a method to forcibly set the priority, if all else fails - very much intended as a last resort for the fully informed user.
 