Teensy 4.1 How to start using DMA?

Fiskusmati

Active member
Hello.
I want to use DMA on Teensy 4.1 but I found out that there is hardly any good documentation of how to start (tutorial), which libraries to use, how to configure DMA, and simple examples.
What I want to do now is to read 28 GPIO states into an array, at a speed of 2M samples per second.
I'm using Arduino-IDE.

Regards Mateusz
 
Sorry, I am not sure if there is any easy answer to this, nor any tutorials.

The way I have learned about doing DMA, is by looking at other code doing DMA... Plus looking through the reference manual, starting with chapters 4 and 5.

I personally have not done GPIO DMA input or output myself. I have mostly done some with SPI and Analog inputs...

Source wise, I usually again start off by looking at other code that has worked and model after that...
In the cores\teensy4 directory of your install are the files DMAChannel.h and DMAChannel.cpp
I use those a lot to help set things up, like a DMAChannel and DMASetting and the like. Information about the formats of these and what each field does is in the chapters I mentioned above.

Also the file imxrt.h includes a set of defines for what the source is for the DMA transfer (DMAMUX_SOURCE_...)

Also, to do GPIO this way you need to understand which pins are on which hardware port and where they sit on that port. Again you can figure that out from the header files. I keep my own spreadsheet with that information (attachment: T4.1-Cardlike.jpg),
which is in an Excel document up at: https://github.com/KurtE/TeensyDocuments

With GPIO there is an additional complication: each GPIO port actually has two versions. GPIO1 also maps to GPIO6, where GPIO6 accesses the pins faster than GPIO1, BUT it does not allow DMA. There is another set of hardware registers that select, pin by pin, whether the pin is on the normal GPIO or the fast GPIO...
There is code in the main startup.c which maps all of our pins into fast mode... So to do DMA on those GPIO pins you need to switch them back to low speed....

Probably the best example code for this is the OctoWS2811 code that Paul did to allow the OctoWS2811 board to work with T4.x... But his code is only doing outputs to GPIO, so you would need to reverse this.
But the sources again are part of Teensyduino in the OctoWS2811 library in OctoWS2811_imxrt.cpp

Hope that helps some.
 
Sadly, there aren't any easy tutorials. But you might read through the DMAChannel.h class. I put lots of comments in the source to explain how things work. As things are today, those comments are the closest thing we have to any sort of introduction to DMA.

Indeed the OctoWS2811 source code is probably the best place to start if you want GPIO to memory.

DMA is tricky to use, partly because you get pretty much no info when it doesn't work which makes troubleshooting very hard, but also because NXP's gigantic reference manual is sorely lacking in many important details that usually you can only discover by a lot of painful trial and error. Having done that for many hours to write the OctoWS2811 and other libs, here's a few specific tips to hopefully save you from some of those painful moments.

First you'll need to configure a timer to generate the DMA trigger events at the rate you want. While developing, I highly recommend using an extremely slow rate. If the timer is *very* slow, you can write code to print the actual memory locations and watch how they change as your DMA runs (make sure you use volatile keywords so the compiler doesn't assume memory can't spontaneously change). You can always just edit the timer config for the very high speeds once the rest is working.
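
To put a number on the timer rate mentioned above, here is a rough host-side sketch (plain C++, not Teensy code) of picking a divider for a given trigger rate. It assumes a 16-bit counter with power-of-two prescalers from 1 to 128 and a 150 MHz timer clock. Those figures and the names TimerDivider and pickTimerDivider are assumptions for illustration, so check them against the reference manual for the timer you actually use.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the divider settings for one trigger rate.
struct TimerDivider {
    uint32_t prescaler;  // assumed power of two, 1..128
    uint32_t ticks;      // timer counts per DMA trigger (assumed 16-bit limit)
    bool ok;             // false if no prescaler gives a usable count
};

// Find the smallest prescaler whose tick count fits in 16 bits.
TimerDivider pickTimerDivider(uint32_t timerClockHz, uint32_t triggerHz) {
    assert(triggerHz > 0);
    for (uint32_t prescaler = 1; prescaler <= 128; prescaler *= 2) {
        const uint32_t ticks = timerClockHz / prescaler / triggerHz;
        if (ticks >= 1 && ticks <= 65535) {
            return {prescaler, ticks, true};
        }
    }
    return {0, 0, false};  // rate is too slow or too fast for this timer model
}
```

Under those assumptions, 2 MHz sampling needs 75 ticks with no prescaling, while a slow 100 Hz debug rate needs a prescaler of 32 and 46875 ticks.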

While the timers can generate DMA events directly, that rarely works if you want to use the DMA to do something other than write to the timer's settings. The problem is the timer doesn't get any acknowledgement that its DMA request was serviced, so it keeps asserting the request signal and the DMA controller goes into an infinite loop running the DMA you requested (or if you configure DMA to halt rather than keep repeating the transfer, it does the whole thing as fast as the hardware can go). This DMA acknowledge process is the main thing that's sorely lacking in NXP's documentation. The rest of the manual does have all the needed info, though it's scattered and difficult to find across those thousands of pages!

You will probably need to route the timer's output pulse through the crossbar switch to 1 of its 4 DMA request generators. Those do auto-acknowledge when the DMA controller services your request.

The other alternative for DMA acknowledgement is to set up 2 DMA channels, one which does the GPIO operation you want, and another which does a dummy write to the timer so it won't keep requesting more DMA until the next timer pulse. But you can't use the DMAMUX to route a trigger event to more than 1 DMA channel. To get 2 channels to run, you need to have the hardware event trigger the first channel, then set that channel up to trigger the other channel as it completes each event. This way is more complicated, less efficient, and consumes an extra DMA channel. But the upside is it doesn't consume any of those finite DMA triggering resources in the crossbar switch.

GPIO also requires configuration. Each 32 bit GPIO port has 2 sets of registers. GPIO1-4 are on the normal (slow) peripheral bus. GPIO6-9 are the same ports on a fast bus, but the DMA controller can't access that bus. The fact that DMA can't access those registers is the other painful detail not mentioned anywhere in NXP's documentation. It's crazy frustrating if you don't know this and have to find out by experimentation while so many other things are also unclear! You have to use only GPIO1-4 for DMA. Each individual pin can be assigned to be accessed by either the fast or slow registers. By default they're all assigned to the fast registers. So you will need to write to the GPR registers to reassign the pins you wish to use. See the code in OctoWS2811 for an example.

Once you have the timer, crossbar trigger and GPIO ready to be used by DMA, then comes a lot of decisions about how to configure the DMA channel. If you only need a fairly simple configuration, maybe the DMAChannel.h functions are enough. But you can also use DMAChannel to write directly to the 32 bytes of TCD registers which control the actual transfer. Most of the libraries do this, which gives the advantage that the DMAChannel class dynamically allocates an available DMA channel for you, which means your code is far more likely to be able to work together with the various libraries which use DMA (as far as I know, all of them use DMAChannel to avoid conflicts with the others).

The OctoWS2811 code uses a pretty complex TCD setup, because it generates data "on the fly" in chunks. You're probably better off looking at how the audio library configures the TCD, at least for the simpler protocols like input_i2s.cpp, where the DMA runs continuously circling through a buffer and the interrupt from DMA just copies half of the received data to a non-DMA buffer while the DMA keeps running to fill up the other half. If you're just looking to acquire input signals at steady speed, that is probably a usage model much closer to your needs than the complex reloading of the DMA settings which OctoWS2811 does. If you can just set up the DMA channel once and let it run forever while you respond to its interrupts, things are much simpler. Of course the hardware can do very complex things as OctoWS2811 shows, but best to avoid that sort of use, especially as a first learning experience.
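
The half-buffer idea above can be sketched on a PC without any hardware. This is only a model of the bookkeeping, not code from the audio library: kBufferWords, completedHalf, copyCompletedHalf and the dmaWriteIndex parameter are made-up names, and on a real Teensy you would have to derive the write position from the DMA channel or from which interrupt fired.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kBufferWords = 8;              // hypothetical size, must be even
constexpr size_t kHalfWords = kBufferWords / 2;

// Which half is safe to copy: the one the DMA is NOT currently writing.
// Returns 0 for the first half, 1 for the second half.
size_t completedHalf(size_t dmaWriteIndex) {
    assert(dmaWriteIndex < kBufferWords);
    return dmaWriteIndex >= kHalfWords ? 0 : 1;
}

// Copy the finished half out of the DMA buffer into a normal buffer.
void copyCompletedHalf(const volatile uint32_t* dmaBuffer,
                       size_t dmaWriteIndex,
                       uint32_t* out) {
    const volatile uint32_t* src = dmaBuffer + completedHalf(dmaWriteIndex) * kHalfWords;
    for (size_t i = 0; i < kHalfWords; ++i) {
        out[i] = src[i];
    }
}
```

The copy has to finish before the DMA wraps around into that half again, which is why the interrupt handler should do nothing more than the copy.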

Just to keep this in perspective, you're going to configure the TCD source address to read from the GPIO register, and you want zeros in all the source address offset fields so the DMA channel always reads the same place. Of course set it up for 32 bit transfer, as the GPIO registers do not support 8 or 16 bit access. For the destination, you'll set the address to your buffer that will receive the incoming DMA writes, and you'll set the destination offset to 4 so it increments through the buffer as it writes. You'll probably also set the DLAST offset so the destination address automatically returns to the start of your buffer. Especially if you don't set the bit for the transfer to be done at the end, it will automatically restart, which is really convenient to just set the DMA up once and let it keep running forever. The main trick is getting the "last" destination offset right so the destination address resets back to the beginning automatically.
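
To make the address arithmetic above concrete, here is a small host-side simulation (it runs on a PC, not on the Teensy). SimTcd is NOT the real TCD register layout and makeCaptureTcd / simulateTrigger are hypothetical names. The sketch only mimics the fields discussed: source offset 0, destination offset 4, and a negative "last" destination adjustment equal to the buffer size in bytes.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the TCD fields discussed above (not the hardware layout).
struct SimTcd {
    const volatile uint32_t* saddr;  // source address (the GPIO register)
    int32_t soff;                    // bytes added to saddr after each read
    uint32_t* daddr;                 // destination address (the buffer)
    int32_t doff;                    // bytes added to daddr after each write
    uint32_t wordsPerLoop;           // transfers before the loop completes
    uint32_t wordsLeft;              // transfers remaining in this loop
    int32_t dlast;                   // bytes added to daddr when the loop completes
};

SimTcd makeCaptureTcd(const volatile uint32_t* gpioReg, uint32_t* buffer, uint32_t words) {
    assert(words > 0);
    SimTcd t;
    t.saddr = gpioReg;
    t.soff = 0;                                   // always read the same register
    t.daddr = buffer;
    t.doff = 4;                                   // step through the buffer, 32 bits at a time
    t.wordsPerLoop = words;
    t.wordsLeft = words;
    t.dlast = -static_cast<int32_t>(words * 4u);  // jump back to the buffer start
    return t;
}

// One trigger event: move one 32-bit word, then update the addresses.
void simulateTrigger(SimTcd& t) {
    *t.daddr = *t.saddr;
    t.saddr = reinterpret_cast<const volatile uint32_t*>(
        reinterpret_cast<const volatile char*>(t.saddr) + t.soff);
    t.daddr = reinterpret_cast<uint32_t*>(reinterpret_cast<char*>(t.daddr) + t.doff);
    if (--t.wordsLeft == 0) {
        t.daddr = reinterpret_cast<uint32_t*>(reinterpret_cast<char*>(t.daddr) + t.dlast);
        t.wordsLeft = t.wordsPerLoop;
    }
}
```

With a 4-word buffer, the fifth trigger lands back in element 0, which is the automatic restart behaviour described above.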

If possible, use DMAChannel.h functions to do the dirty work of setting up all those TCD registers. Especially if you're going to just set up a transfer that runs continuously and gives you interrupts as it does its work (so you can copy the data it's acquired before it gets even more and overwrites the data it already got), that TCD init is only done once at startup. Of course you can read all about the TCD registers in the reference manual, but the amount of raw capability and the huge number of possible ways to set it up can be pretty overwhelming. Best to keep things simple if you don't need those advanced features (and this sounds like one of the pretty simple uses).

Whew, that turned out longer than I imagined. Hopefully it helps?
 
Thank you both for your replies.
I don't know if it helps with complexity, but I already have a 2MHz square wave available that I use to clock external parallel ADCs.
So on every falling edge, I need to catch 28 GPIO states and put them as bits into a variable (and into an array of variables). When the first array is full, signal that to the rest of the code, and start filling up the second array. And so on.
Currently I'm doing this inside ISR and here is my code:

Code:
void myISR() //this will occur 2 000 000 times per second
{
  if (GPIO_array_sel)  //select second array
  {

    if (adc_sample_start == true)  //select if we really want adc samples or just need to max-out value to make short impulse that will indicate to further software that adc is about to start sampling
    {
      GPIO6_1.array[GPIO_array_pos] = 0xFFFFFFFF;
      GPIO7_1.array[GPIO_array_pos] = 0xFFFFFFFF;
      adc_sample_start = false;
    }
    else
    {
      GPIO6_1.array[GPIO_array_pos] = GPIO6_DR;
      GPIO7_1.array[GPIO_array_pos] = GPIO7_DR;
    }


  }
  else //select first array
  {

    if (adc_sample_start == true) //select if we really want adc samples or just need to max-out value to make short impulse that will indicate to further software that adc is about to start sampling
    {
      GPIO6_0.array[GPIO_array_pos] = 0xFFFFFFFF;
      GPIO7_0.array[GPIO_array_pos] = 0xFFFFFFFF;
      adc_sample_start = false;
    }
    else
    {
      GPIO6_0.array[GPIO_array_pos] = GPIO6_DR;
      GPIO7_0.array[GPIO_array_pos] = GPIO7_DR;
    }

  }

  GPIO_array_pos++;  //increment current position counter in array, so on next ISR we will write to next record


  if (GPIO_array_pos == 250)  //if array is full, select another array and reset position counter
  {
    GPIO_array_sel = !GPIO_array_sel;
    GPIO_array_pos = 0;
  }
}

If any array is full, rest of code in loop() prepares the data, makes 1000 byte array and sends that array via UDP ethernet using NativeEthernet library.
Code:
        Udp.beginPacket(Udp.remoteIP(), port);
        Udp.write(arrayToSend, 1000);
        Udp.endPacket();
And it works just fine most of the time, but sometimes when a UDP packet is being composed or transferred and the ISR occurs, the UDP packet contains distorted data.
The above 3 lines of code are blocking and take some time to execute, so I can't just put them between noInterrupts() and interrupts(). I could not find a better library for ethernet, so I thought I would get rid of this time-sensitive and continuous ISR.


Looking through your DMA libraries, I'm very, very sad to say that but I think that in this state, I'm just too stupid to use DMA in my projects :(
I am a "typical" user, I totally can not understand most of this library code that looks like this:
Code:
TMR4_SCTRL0 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE | TMR_SCTRL_MSTR;
	TMR4_CSCTRL0 = TMR_CSCTRL_CL1(1) | TMR_CSCTRL_TCF1EN;
	TMR4_CNTR0 = 0;
	TMR4_LOAD0 = 0;
	TMR4_COMP10 = comp1load[0];
	TMR4_CMPLD10 = comp1load[0];
	TMR4_CTRL0 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_LENGTH | TMR_CTRL_OUTMODE(3);
	TMR4_SCTRL1 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE;
	TMR4_CNTR1 = 0;
	TMR4_LOAD1 = 0;
	TMR4_COMP11 = comp1load[1]; // T0H
	TMR4_CMPLD11 = comp1load[1];
	TMR4_CTRL1 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_COINIT | TMR_CTRL_OUTMODE(3);
	TMR4_SCTRL2 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE;
	TMR4_CNTR2 = 0;
	TMR4_LOAD2 = 0;
	TMR4_COMP12 = comp1load[2]; // T1H
	TMR4_CMPLD12 = comp1load[2];
	TMR4_CTRL2 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_COINIT | TMR_CTRL_OUTMODE(3);
I don't blame anybody because I know writing a library and commenting every line would take a lot of extra time and effort.
Writing this thread I was hoping that there is simple and easy to use DMA library with good documentation that does not require to use "register names" or dive into NXP manual.

If you still have any thoughts or tips about my issue please reply. Thanks again for your contribution.
 
Hi @Paul and @Fiskusmati

As I mentioned in the previous posts, some of this DMA stuff can get real confusing. Up till now with T4.x I have mainly done DMA operations to SPI device, maybe logical UART for flexIO.

But I am playing with trying to get some DMA to one GPIO port to work to read in 8 pins, using an external clock from a camera... And I think I am getting pretty close. I believe I am getting the data clocked logically correctly and getting DMA data, but the data does not look correct...

Paul - there are some simple things associated with GPIO on IMXRT where at times I am not 100% sure what is correct when reading in multiple pins.
For example, the example sketch I started playing with to convert to doing DMA was doing something like:
Code:
static inline uint32_t cameraReadPixel() 
{
  uint32_t pword= GPIO6_DR >> 18;  // get the port bits. We want bits 18, 19 and 22 to 27
  return (pword&3) | ((pword&0x3f0)>>2);
}
Which appears to work, but I am wondering if it really should be: uint32_t pword= GPIO6_PSR >> 18;
As digitalRead does: return (*(p->reg + 2) & p->mask) ? 1 : 0;

where p->reg I am pretty sure points to the DR and the +2 (is + 8 bytes) gets to the PSR...
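
For anyone puzzling over the shifts in cameraReadPixel, here is the same bit extraction as a pure function that can be tested on a PC. The name extractCameraByte is made up for this sketch, and it takes the raw 32-bit port value as a parameter instead of reading GPIO6_DR or GPIO6_PSR, so it says nothing about which register is the right one to read.

```cpp
#include <cassert>
#include <cstdint>

// Pack port bits 18, 19 and 22..27 into one byte (bits 20 and 21 are skipped).
uint32_t extractCameraByte(uint32_t portValue) {
    const uint32_t pword = portValue >> 18;             // bit 18 becomes bit 0
    const uint32_t low2 = pword & 0x3u;                 // port bits 18..19 -> bits 0..1
    const uint32_t high6 = (pword & 0x3F0u) >> 2;       // port bits 22..27 -> bits 2..7
    const uint32_t value = low2 | high6;
    assert(value <= 0xFFu);                             // result always fits in a byte
    return value;
}
```

A port value with exactly the eight camera bits set (0x0FCC0000) comes out as 0xFF, and the two skipped bits (0x00300000) contribute nothing.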

Note: I have tried both in my conversion of library (https://github.com/arduino-libraries/Arduino_OV767X)
The WIP code is up in this area including sketch...


I am trying to remember where I read about DMA needs to go to GPIO1 instead of GPIO6...

The interesting thing is that with reading without DMA, My current code, appears to work with either GPIO6_PSR or GPIO1_PSR regardless of the state of
// Need to switch the IO pins back to GPIO1 from GPIO6
IOMUXC_GPR_GPR26 &= ~(0x0FCC0000u);


Note: to both: The DMA camera case again I am clocking the DMA (I think) by using external IO pin (clock from the camera).
The setting up for the clock/pin to drive the DMA is:
Code:
  // first see if we can convert the _pclk to be an XBAR Input pin...
  // OV7670_PLK   4
  *(portConfigRegister(_pclkPin)) = 3; // set to XBAR mode (xbar 8)

  // route the timer outputs through XBAR to edge trigger DMA request
  CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);
  xbar_connect(XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_DMA_CH_MUX_REQ30);
  digitalToggleFast(31);

  // Tell XBAR to do DMA on rising edge
  //attachInterruptVector(IRQ_XBAR1_01, &xbar01_isr);
  //NVIC_ENABLE_IRQ(IRQ_XBAR1_01);
  XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(1) | XBARA_CTRL_DEN0/* | XBARA_CTRL_IEN0 */ ;

  IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);  // Make sure it is input mode
  IOMUXC_XBAR1_IN08_SELECT_INPUT = 0; // Make sure this signal goes to this pin...
Note: This is using a pin(4) that is an XBar pin(XBARA1 8).
It sets the pin to mode 3(XBAR for that pin), I left the PAD alone which was previously configured for input, It enables XBAR, it sets up the connection of
XBARA1_IN_IOMUX_XBAR_INOUT08 to output to XBARA1_OUT_DMA_CH_MUX_REQ30, sets XBARA1_IN_IOMUX_XBAR_INOUT08 to be in input mode,
It also configures XBAR register to Do DMA on Edge 1 (Rising).

Note: I have a DMA chain setup that does 2 horizontal lines of the camera per DMASetting where it calls ISR to convert the raw 32 bit data into pixel data, and this does appear to be called at the right time.

But data does not look correct. So maybe/hopefully just missing some simple something, like need to tell the GPIO port to resample or ???

EDIT: Side Note: keep thinking in IMXRT we should create a structure for GPIO registers like we did for SPI, or UART or ... such that instead of the magical
things like: *(p->reg + 2) we might have something like p->reg.PSR. Wonder if that is worth doing at this point?

EDIT2:
Found a simple extracting bits from the data problem, which is a lot of the issue... Getting better now... Looks like I may have one extra byte that was read in at the start... When data was already high... can probably easily compensate if need be...
 
Hi @KurtE,

Like @Fiskusmati I am working on connecting an external A/D to the GPIO lines on a 4.1 and accessing it via DMA. It appears that your work on interfacing the OV767X would be really helpful to me (and others) as a starting place. Is the code available somewhere on GitHub? I browsed the link in P5 but only found the original OV767X library, not your adaptation for the 4.1.

Also, @Paul your overview on DMA in P3 was very helpful.

David
 
My DMA camera stuff is up at: https://github.com/KurtE/Arduino_OV767X/commits/Teensy_t4
Note: It is a PIP (Play In Progress) As I am just doing it for the fun of it.

The DMA version up there does get a frame... Working on more of a continuous update version, first attempt did not work (In different branch), Now going back to have the DMA only get one frame and stop and then have a pin change interrupt on VSYNC to start up DMA again for the next frame, it will probably skip frames this way, but again for me, I don't need it... So just doing it to learn a few more things.
 
Thanks Kurt.

I'll take a look. At this point I'm trying to work through as much example code as I can. DMA use on the 4.X is pretty complex and I'm hoping to get more comfortable with the triggering and data flow setups.
 
You are welcome. Question is what AtoD converter? Don't a lot of them use SPI or I2C to communicate? If so their communications using DMA will likely be different than this work. Probably closer to some of the display drivers, although in reverse...
 
I'm actually hoping to tie 2 parallel output A/Ds onto GPIO1. My desired bandwidth is quite high at ~50MHz, which may be unrealistic. I'm willing to use FIFO buffering between the A/Ds and the port if necessary, but right now I just want to see if I can use DMA to reliably read GPIO1, and what the max rate is. The DMA would be set to interrupt on buffer full, not to run continuously. So: ext. trigger - enable DMA w/ timer - xfer - completion interrupt - process [repeat]. My test bed is a couple of externally-clocked fast counters cascaded w/ parallel outputs connected to GPIO1 pins. This should let me look for missing or corrupt data in the received buffer.
 
@dgranger

I'm curious how fast you can go if you don't use DMA and just poll? Ie, wait for pin change, read/store GPIO, loop.
 
I made some non-DMA timings on the 4.1 at both 450 & 600 MHz

Note that the lib version of memcpy is pretty highly optimized w/ unrolled loops for word aligned data. Also note that the GPIO tests are using GPIO6 which uses fast bus access which is not available to DMA.

It's pretty clear that the test / branch overhead is significant for the transfer loops.

450 MHz:

Mem-to-Mem:

100x 10000 word (4 byte) memcpy: 2366 total uSec; 2.4 nSec (2 cy) per word; 422 MHz effective rate
100x 10000 *d++ = *s++ word copy loop: 18895 total uSec; 18.9 nSec (9 cy) per word; 52 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = *s++] word copy loop: 2538 total uSec; 2.5 nSec (2 cy) per word; 394 MHz effective rate

GPIO6-to-Mem:

100x 10000 *d++ = GPIO6_DR word copy loop: 34452 total uSec; 34.5 nSec (16 cy) per word; 29 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 20094 total uSec; 20.1 nSec (10 cy) per word; 49 MHz effective

600 MHz:

Mem-to-Mem:

100x 10000 word (4 byte) memcpy: 1774 total uSec; 1.8 nSec (2 cy) per word; 563 MHz effective rate
100x 10000 *d++ = *s++ word copy loop: 14171 total uSec; 14.2 nSec (9 cy) per word; 70 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = *s++] word copy loop: 1904 total uSec; 1.9 nSec (2 cy) per word; 525 MHz effective rate

GPIO6-to-Mem:

100x 10000 *d++ = GPIO6_DR word copy loop: 25838 total uSec; 25.8 nSec (16 cy) per word; 38 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 15070 total uSec; 15.1 nSec (10 cy) per word; 66 MHz effective
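
For reference, the per-word time and effective rate in the figures above follow directly from the total time and the word count. A small sketch of that arithmetic (the Throughput struct and summarize function are illustrative names, not the code that produced the measurements):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for one benchmark line.
struct Throughput {
    double nsPerWord;     // average time per 32-bit word, in nanoseconds
    double effectiveMHz;  // words transferred per microsecond
};

// totalMicros: measured time for all repetitions
// reps * wordsPerRep: total number of words moved
Throughput summarize(uint32_t totalMicros, uint32_t reps, uint32_t wordsPerRep) {
    assert(totalMicros > 0 && reps > 0 && wordsPerRep > 0);
    const double words = static_cast<double>(reps) * wordsPerRep;
    const double ns = totalMicros * 1000.0 / words;
    return {ns, 1000.0 / ns};
}
```

For example, 2366 total uSec for 100 x 10000 words works out to about 2.37 nSec per word and roughly 422 MHz, and 15070 uSec for 100 x 200 x 50 words to about 15.1 nSec and roughly 66 MHz, matching the posted lines after rounding.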
 
> 66 MHz effective

Interesting, slower than I thought. I'd guess that an unrolled loop of [*d = GPIO1_DR] is probably most similar to how fast DMA can run. If the DMA is triggered by a pin input, then delay between that pin changing and the input of the GPIO data could be another issue.

Worth trying, otherwise a hardware fifo.
 
I ran the test using GPIO1 as that's what DMA can connect to. The #'s are not too promising, only 21 MHz effective w/o overclocking. Looks like a FIFO will probably be required. I'm unclear as to why the cycle counts vary with the clock speed. In the first test the cycle counts are constant within an access method and only the clock period varies. Perhaps the peripheral clock divider is different for the overclocked run. Regardless, these numbers are a long way from my goal of 50 MHz.

450 MHz:

GPIO1-to-Mem:

100x 10000 *d++ = GPIO6_DR word copy loop: 63343 total uSec; 63.4 nSec (39 cy) per word; 15 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 46940 total uSec; 46.9 nSec (29 cy) per word; 21 MHz effective

600 MHz:

GPIO1-to-Mem:

100x 10000 *d++ = GPIO6_DR word copy loop: 56673 total uSec; 56.7 nSec (35 cy) per word; 17 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 46672 total uSec; 46.7 nSec (29 cy) per word; 21 MHz effective

816 MHz:

GPIO1-to-Mem:

100x 10000 *d++ = GPIO6_DR word copy loop: 41671 total uSec; 41.7 nSec (26 cy) per word; 23 MHz effective rate
100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 34318 total uSec; 34.3 nSec (21 cy) per word; 29 MHz effective
 
Hi everyone,

I have recently started looking into using some of the lower-level functionality of the Teensy 4.1/iMX RT 1060, including DMAs, register-level access to pins etc, and eventually landed here. Up until now, I have been only doing typical Arduino-like things, maybe with some simple ISR sprinkled around - much like people earlier in this thread - but it simply wasn't enough to achieve what I needed (pretty unoriginally, access camera sensor and stream the resulting video to a screen). Having a lot of experience with dealing with this sort of stuff in my day job, I knew pretty precisely what I wanted to do, I just needed to know "how" on this hardware.

As I was going through the reference, Paul's core library sources and KurtE's camera sources, I was writing most of my findings up, and I thought I'd post them here, to provide some entry level information for people who need to move from simple digitalWrites to more complex scenarios. Technically, it's all there in the reference manual (https://www.pjrc.com/store/teensy41.html#tech) but the problem is that the manual is riddled with a lot of domain-specific lingo, and (at least for me, coming to uControllers from a different field) it required some time to actually connect some dots and understand what the authors meant by certain things. Also the sources are well documented and are a great reference, but they can be totally overwhelming, when all you initially see is just tons of cryptic #defines.

All this is not really rocket science, but I believe that it might be useful as an introduction to actually reading the reference manual - going over a particular use case and extracting the information needed for that from the docs. And if you think any of it is incorrect (which is quite possible), just say and I'll correct it here.

Just for some background: I wanted to have a DMA transfer data from an external camera sensor, to a buffer in memory. DMA stands for "Direct Memory Access" and the idea is that the CPU is not involved in that transfer - so free to do other stuff. I wanted that because the data I get from the sensor are just raw 12bit values from the internal ADC converter. Every few scanlines, I would like an interrupt to happen, so I could demosaic that received data, perform gamma correction, format conversion and write it out to a frame buffer. A separate DMA would periodically transfer the data from a frame buffer (that would be double buffered) to a screen. But the heart of all this are DMA transfers, and I wanted to get a fairly deep understanding of how they are set up and what is needed for them to work. I wanted to get a setup where:
- I have a clock signal that's connected to one of the Teensy pins (I also wanted to see how fast that clock signal can actually be)
- at each rising clock edge, the data is presented to some number of other Teensy input pins
- DMA picks up that data and stores it in a memory buffer
- when that buffer is full, I get an interrupt and I can do something inside the handler
- I don't really care about what happens next, any other situations are more complex scenarios; when I know how to deal with the basic stuff I can build on it and get something more complex

First things first: the external pins of the chip are called "pads" in the documentation. The word "pin" is used for the input and output pins of the internal modules within the chip. And then there are the actual pins on the Teensy itself, which are yet another thing. The Teensy pins are connected to certain pads on the chip, and the schematic shows which pin goes where. The pads have names like AD_B1_10 etc.

The chip itself contains a number of modules serving different purposes - there's the DMA unit, the GPIO unit for dealing with General Purpose IO, timers, units to help with sensors, ADCs and tons of other stuff. A very important thing is that there are TONS of multiplexers (MUX). Their sole purpose is to connect certain pins of different units together. They are programmable and allow for tons of flexibility - so if you want the input from a certain pad to act as a DMA source, you set up a bunch of MUXes and voila. If you want to just read the input from that pad in your code, you set them up differently and it works. The problem is the sheer number of these multiplexers, as they are literally everywhere. And they are documented out of order, the diagrams are crap and it takes quite some time to actually decipher all that.

Configuration of all these systems is done by writing to different memory-mapped registers. This means that the configuration state is simply visible at an address somewhere in memory, which you can read or write. The address of each register described in the docs is given next to its description - for instance on page 500, the SW_MUX_CTL_PAD_GPIO_AD_B1_08 SW MUX Control Register (IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_08) is said to reside at the address: 401F_8000h base + 11Ch offset = 401F_811Ch. Since it's a 32 bit register, you can do volatile uint32_t* register_IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_08 = (volatile uint32_t*)(0x401F811C); and simply access it through that pointer (the volatile matters, otherwise the compiler may optimize the accesses away).
The Teensy core libraries, however, provide TONS of macros and helper structs for accessing these things, so you don't have to copy these numbers from the docs.
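
As a tiny illustration of the base + offset arithmetic and of accessing a register through a pointer, here is a sketch that compiles on a PC. The constant names and the readRegister / writeRegister helpers are made up for this example (the core libraries have their own macros). On a PC the helpers must only be pointed at an ordinary variable, never at the real address, which only exists on the chip.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted from the register description discussed above.
constexpr uint32_t kIomuxcBase = 0x401F8000u;
constexpr uint32_t kSwMuxCtlPadAdB1_08Offset = 0x11Cu;

// Register address = module base address + register offset.
constexpr uint32_t registerAddress(uint32_t base, uint32_t offset) {
    return base + offset;
}

static_assert(registerAddress(kIomuxcBase, kSwMuxCtlPadAdB1_08Offset) == 0x401F811Cu,
              "base + offset should match the address given in the manual");

// volatile tells the compiler that every access must really happen,
// because hardware can change or react to the value at any time.
inline uint32_t readRegister(const volatile uint32_t* reg) { return *reg; }
inline void writeRegister(volatile uint32_t* reg, uint32_t value) { *reg = value; }
```

On the Teensy itself the pointer would come from casting the computed address, for example reinterpret_cast<volatile uint32_t*>(uintptr_t{0x401F811C}), but in practice the core library macros already do this for you.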

Let's start with the simpler thing first - getting the signal from the Teensy input pins to where it can be accessed. The Teensy pins map to the microcontroller pads, and these are first routed through the IOMUX controller (Chapter 11 in the reference manual). It lets you route the signal from the pads to different subsystems. This is quite interesting, as treating them as input-output pins (so GPIO) is only one option - the pads don't have to actually be used this way! They can be hooked up to ADCs, some UARTs, the FLEXIO module and tons of other stuff - but it's not important at the moment. The important bit is that there is this flexibility here and we need to set them up to be routed to GPIO, as this is something we can access later on.

The registers controlling where the input of each pad goes are described in Chapter 11.7. For instance, Teensy 4.1 pin number 23 is connected to pad AD_B1_09 (see here https://www.pjrc.com/store/teensy41.html#tech) which is controlled by the register IOMUXC_SW_MUX_CTL_PAD_GPIO_AD_B1_09 described on page 501. You can select one out of the 10 modes for that pin by writing the corresponding value to that register - writing 6 there will for example route it to the USDHC2_CLK signal in the usdhc2 module (which, I'm sure, you can find details on in that manual too). We want our inputs to be set to the GPIO mode - which here would mean writing 5 to that register. The Teensy core libraries actually do that by default (or at least I think they do; quickly skimming through the code it didn't pop up, but the configuration is usually done through macros, and there's quite a few of them, so I probably just missed it), so you don't really need to change it if you just want your input signal as 0 or 1 somewhere. The code that simplifies all that setup is here https://github.com/PaulStoffregen/cores/blob/master/teensy4/core_pins.h. The *really* important bit here is the name of the input pin in the GPIO unit specified in the register description - for AD_B1_09 it's GPIO1_IO25. This means that that particular pad, AD_B1_09, gets connected to GPIO1 on its input number 25. KurtE actually has a brilliant diagram coalescing all this data into a readable form, here: https://github.com/KurtE/TeensyDocuments/blob/master/Teensy4.1 Pins.pdf. And another one, with all the MUX modes for every pin: https://github.com/KurtE/TeensyDocuments/blob/master/Teensy4.1 Pins Mux.pdf - which shows where the signal from each Teensy pin can be routed to.

We have the signal from our input pins routed to the GPIO unit by the IOMUX now. Next we need to look at the GPIO unit itself to see how we can access it. This is covered in Chapter 12, page 949. We can see that there are 9 GPIOs, the first five of them being standard speed, working on the IPG_CLK clock (it's the peripheral clock, described in Chapter 14; it runs at 1/4 of the speed of the ARM core, so if you have the Teensy running at 600MHz, it runs at 150MHz), and the next four being the "fast" ones, running at the same speed as the core (so 600MHz) *but* sharing the same pins as the first four of the regular ones, so GPIO6-9 mirror GPIO1-4 ("pins" as the input pins to this module, not the Teensy pins/pads - so we're talking about these signals called GPIO1_IO25). This means that a signal routed to GPIO1_IO25 can be read both through GPIO1 AND GPIO6. Of course, there's a MUX to pick which one we want to use. The switching is controlled *per pin*, via the IOMUXC_GPR_GPR26-29 registers (pages 375+).

Each GPIO gives access to the data through, again, memory-mapped registers. All the pins of a particular GPIO appear as individual bits of these registers - so in a single 32 bit memory access you can read the content of the entire GPIO unit. For instance, if you want to read the entire GPIO1, you simply read a 32 bit value from the address 0x401B8000 (or simply use the GPIO1_DR macro from the core libraries; the data registers for GPIO are described on page 962). This is why the mapping from Teensy pins/pads to GPIO units (the one on KurtE's diagram) is so important - because you want all your signals to be hooked up to Teensy pins that live on a single GPIO - so you can read them all in one go, instead of touching multiple memory locations and doing bit twiddling. The best GPIO for that is GPIO1 (GPIO6 in the fast version), which has 20 external Teensy pins that can be routed to it (and 16 of them being contiguous, so if you route your data in a particular order, you don't even have to mess with it much on the code side, just do a bit shift). Let's assume we want to use Teensy pins 14, 15, 40, 41, 17, 16, 22, 23, 20, 21. They correspond to bits 18-27 of GPIO1.

Another important bit here is the direction - the GPIO can be used both to read and to output external signals, and the mode in which each GPIO bit works is controlled by another register - the direction register GPIOx_GDIR. We want to read our data, so we want to set the pins to inputs - which means clearing the corresponding bits of the GDIR register:

Code:
GPIO1_GDIR &= ~(0x0FFC0000u);
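
In case the magic constant looks arbitrary: it is just ten consecutive one bits shifted up to bit 18. Here is a small host-side sketch (plain C++, nothing Teensy-specific, and the helper names bitMask and extractField are made up for illustration) that builds the mask and uses the same numbers to turn a raw 32 bit GPIO1 sample back into the data value:

```cpp
#include <cassert>
#include <cstdint>

// Build a mask with `count` consecutive one bits, starting at bit `first`.
// Hypothetical helper, not part of the Teensy core.
inline uint32_t bitMask(unsigned first, unsigned count)
{
	assert(count >= 1 && first + count <= 32);
	uint32_t ones = (count == 32) ? 0xFFFFFFFFu : ((1u << count) - 1u);
	return ones << first;
}

// Pull a `count`-bit field starting at bit `first` out of a raw 32 bit
// sample, e.g. a value read from GPIO1_DR or captured by the DMA.
inline uint32_t extractField(uint32_t sample, unsigned first, unsigned count)
{
	return (sample & bitMask(first, count)) >> first;
}
```

With these, bitMask(18, 10) gives 0x0FFC0000 (bits 18-27, the ten pins listed above) and bitMask(18, 8) gives 0x03FC0000, which is the 8 bit variant used in the complete test sketch further down.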

Now the tricky bit is that the "fast" GPIO cannot be read by the DMA. This isn't stated anywhere explicitly, but Paul mentions it somewhere here on the forum. It sort-of makes sense, as the DMA is clocked by the same clock as the "regular" GPIO, while the "fast" ones are clocked much higher, but an explicit note in the docs would be nice. To make our inputs readable to the DMA, we need to set them to be "regular" (so hooked up to GPIO1, not GPIO6) with the corresponding IOMUXC_GPR_GPR26-29 register. To switch our input pins to the "regular" mode we need to clear these bits in IOMUXC_GPR_GPR26:

Code:
IOMUXC_GPR_GPR26 &= ~(0x0FFC0000u);

With this setup, we have our input Teensy pins prepared to route the signals to the DMA. Now come the trickier bits: setting up the DMA and the XBAR (ugh).

The DMA subsystems are described in Chapters 5 and 6, with some really useful information also in Chapter 4, but we'll get to that. Long story short, the purpose of the DMA is to move data around memory, without the CPU being involved. Usually, when the data is transmitted, we want some notification (like an interrupt), but we can just as well simply poll for the completion in a busy loop. There are technically 32 DMA channels on the chip, but this is something I consider a bit of a lie. They are not independent DMA channels that run in parallel, there's in fact an arbitration unit and priorities that pick which channel should be serviced at a given moment - so in reality there's only one transaction being performed at a given time (well, through this general purpose DMA - there are other specialized DMA controllers in other units that in fact run in parallel).

The configuration of each channel is described by a Transfer Control Descriptor (TCD). It contains things like the source address, number of bytes to copy in each transfer, the destination address etc. That structure is precisely laid out on page 116, and individual fields are described on subsequent pages. They all live, as you might have guessed, in a global memory space - so again, to configure them, you need to write the desired things to certain places in memory.

DMA transfers happen in chunks, called minor loops. The minor loop is executed when the channel receives a transfer request - this is an important concept, and more on that later, but it's important to realize that it makes it pretty different from how DMAs behave in other systems. Again, for me, they've traditionally been fire-and-forget requests - I want some data copied from here to here, alternatively filled with some value or similar, don't bother me until you're done. Here it works very differently. When you set up a channel and enable it, it doesn't generally transfer anything until it gets that request (it *can* just transfer the data without any request too, but all this is something you set up). And when it gets that request it executes a *single* minor loop, not everything (and, of course, you can totally set it up to transfer everything, but again, you don't have to and that's something you set up). This is a way more flexible mechanism, but you need to be aware of that.

Each minor loop transfers some number of bytes (NBYTES field from the TCD). The bytes are fetched from the source address (SADDR field in the TCD) and after grabbing some number of them (the actual count is defined by the SSIZE field), an offset value is added to the source address (SOFF field in the TCD). Then some number of bytes (defined by the DSIZE field in the TCD) are written to the destination address (DADDR in the TCD) and the destination address is modified by its offset value (DOFF in the TCD). A TCD can be set up to execute such a minor loop multiple times, which is called a major loop - via the BITER and CITER fields. They define, respectively, the beginning iteration count for the major loop and the current iteration count for the major loop (they should generally start with the same value). There's also a bunch of other fields - you can add some value to the addresses on completion of the major loop, you can do channel linking, where major (or minor) loop completion triggers another channel etc - but these are more complex cases that we're not really interested in right now.

The important bit right now is that, since the data from the input Teensy pins is actually available at a certain address, we can read it with the DMA. For that, we set the source address to the memory mapped GPIO1 data register GPIO1_DR, and we set the source offset to not change (SOFF = 0). On the destination side, we put the address of our buffer in the destination address, and we set it to add 4 bytes to it on every transfer. We set the transfer size to 4 bytes, since GPIO registers allow only 32bit access. Then we set the major loop count to the number of 32bit words in our destination buffer. This setup gives us exactly what we want: at every clock cycle we will generate a request (how exactly is explained below), and at each request we will grab a single, 4 byte piece of data presented at the input pins and write it to the destination buffer. Then, at another clock cycle we will get another four bytes and so on, until we fill the entire buffer, which corresponds to finishing the major loop.
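
To make the minor/major loop mechanics a bit more concrete, here is a tiny software model of it that runs on a PC. It is only a sketch: TcdModel, serviceRequest and makeGpioCapture are names I made up, the struct is not the hardware TCD layout, and things like the last-address adjustment and channel linking are left out. Each call to serviceRequest stands for one DMA request, so it runs exactly one minor loop:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Software model of the handful of TCD fields discussed above.
// NOT the hardware layout, just the same names for illustration.
struct TcdModel {
	const uint8_t *saddr;  // SADDR: current source address
	int32_t        soff;   // SOFF: added to SADDR after each read
	uint8_t       *daddr;  // DADDR: current destination address
	int32_t        doff;   // DOFF: added to DADDR after each write
	uint32_t       nbytes; // NBYTES: bytes moved per minor loop
	uint32_t       size;   // SSIZE/DSIZE in bytes (same for both here)
	uint16_t       citer;  // CITER: minor loops left in this major loop
	uint16_t       biter;  // BITER: reload value for CITER
};

// Service one DMA request: run a single minor loop.
// Returns true when this request completed the major loop.
inline bool serviceRequest(TcdModel &tcd)
{
	assert(tcd.citer > 0 && tcd.size > 0 && tcd.nbytes % tcd.size == 0);
	for (uint32_t moved = 0; moved < tcd.nbytes; moved += tcd.size) {
		std::memcpy(tcd.daddr, tcd.saddr, tcd.size);
		tcd.saddr += tcd.soff;
		tcd.daddr += tcd.doff;
	}
	if (--tcd.citer == 0) {
		tcd.citer = tcd.biter; // CITER is reloaded from BITER
		return true;           // the completion interrupt would fire here
	}
	return false;
}

// Configure the model as described in the text: fixed 32 bit source
// register, destination advancing by 4 bytes, one word per request.
inline TcdModel makeGpioCapture(const uint32_t *reg, uint32_t *buffer, uint16_t words)
{
	return TcdModel{ reinterpret_cast<const uint8_t *>(reg), 0,
	                 reinterpret_cast<uint8_t *>(buffer), 4,
	                 4, 4, words, words };
}
```

If you point it at a variable standing in for GPIO1_DR, change that variable between calls, and call serviceRequest once per "clock edge", the buffer fills one word at a time and the call that writes the last word returns true.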

Teensy core libraries provide a really nice wrapper for all this functionality - the DMAChannel class here: https://github.com/PaulStoffregen/cores/blob/master/teensy4/DMAChannel.h, that lets you specify the interesting things without touching the actual registers. There are a couple of things to be careful about: first, the interface is constructed in a way that generally discourages explicitly defining which hardware channel we'll actually be using, they are by default just allocated sequentially. You can technically override it, but it's a bit cumbersome. In situations like this, I'm much more used to specifying these sorts of indices explicitly, so there are no surprises - especially since certain functionality is available only on certain channels (periodic triggering for instance). The other thing that was a bit of a surprise was inferring the number of bytes to transfer in each transaction from the data type of the buffer passed. Again, might be just personal background, but passing the buffer in a type-agnostic way and explicitly providing the size of a single transfer would seem more natural.

All the setup boils down to:

Code:
dmachannel.begin();
dmachannel.source( GPIO1_DR ); 
dmachannel.destinationBuffer( dmaBuffer, DMABUFFER_SIZE * 4 );

So on every request, the DMA will grab data from GPIO1 and write it to the destination buffer. We also want it to trigger an interrupt when the entire buffer is full, and this is simply done with:

Code:
dmachannel.interruptAtCompletion();  
dmachannel.attachInterrupt( dmaInterrupt );

The first call just sets a bit in the TCD, the other one writes the address of the interrupt handler to an interrupt handlers table.

We now need the final piece of the puzzle - a DMA request generated at every clock cycle on some Teensy input pin.

How the DMA requests work is described in Chapter 5, on DMAMUX. The purpose of this system is to route signals that generate DMA *requests*. There are a number of signals that can act as DMA requests, and each of them can be connected to any of the DMA channels - and DMAMUX is used to set up this connection. The confusing thing is that throughout the chapter the signals that act as DMA requests are referred to as "sources", while they don't have anything to do with the source of the *data* for the transfer - so just keep that in mind when reading all this. To make it even more confusing, the first four DMA channels have *trigger* capability - but it's not "triggering" the transfer, but rather periodic triggering on top of the actual request - when you have some source of the DMA request, you can additionally configure a periodic timer, and only when both the request source AND the periodic trigger are high, the DMA request is actually generated. Not using "triggers", just a regular source, also "triggers"/generates the request, it just doesn't have this additional, periodic gating on top. This is described on pages 79-80, and is actually pretty logical, as long as you remember that the "peripheral request" is the DMA request source signal, and the "trigger" is that periodic trigger that you set up somewhere else (in the Periodic Interval Timer system).

For our transfer we need a request on every clock cycle. The problem is that the DMA requests can only come from certain places. The full list is on page 52, and the GPIO is simply not there. So you cannot trigger DMA through the GPIO system, it needs to come from somewhere else. Closer inspection of the list reveals that it does contain XBAR, which can actually even act as a source of four independent DMA requests. XBAR is yet another multiplexer. That subsystem actually contains 3 different multiplexers, all pretty gigantic, with the first being able to connect 80-something inputs to 130-something outputs, but there's also XBAR2 and XBAR3 which can do additional connections (though for slightly different signals). The available inputs and outputs for each of the XBARs are listed in chapter 4, on pages 61+.

If you look at the input list for the XBAR, among other things you'll see IOMUX_XBAR_xxx which are outputs from IOMUX, which, as we've already seen before are the multiplexers governing the input pads, so Teensy input pins. Before, we had them set up to route the inputs to GPIO, but we mentioned that we can also redirect them to other subsystems. Now we want to redirect the input from the pin that will get the clock signal to the XBAR. Looking at KurtE's diagram, we can see which pads/Teensy pins can be routed to XBAR, and which XBAR input they end up as. Picking Teensy pin 4, we can read that it can get routed to XBAR input 8.

The register controlling Teensy pin 4/pad EMC_06 is called IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 and is described on page 435. According to the table, we need to set its mode to 3 to route it to XBAR (again, the more comprehensive version of all these assignments is here: https://github.com/KurtE/TeensyDocuments/blob/master/Teensy4.1 Pins Mux.pdf).

Code:
IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 3;


Then we connect input 8 of the XBAR to the XBAR DMA request output. Scrolling the list of the XBAR1 outputs (page 68), we can see that output 0 corresponds to the XBAR DMA request 0 in the DMAMUX (input 30 in the DMA MUX on page 54 - this table is a bit weird again, as it lists the sources of DMA requests as "channels", but they are not DMA channels... ugh, confusing). Configuring XBAR connections is done through registers described in Chapter 61, on pages 3235+, but Teensy libraries provide a function called xbar_connect and macros for all the XBAR inputs and outputs so it's all a bit more readable. All we need to do is call

Code:
xbar_connect( XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_DMA_CH_MUX_REQ30 );

(the xbar_connect function is defined in pwm.c, so you might need to either link the file to your project, or copy that function over).
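
What that function does is fairly simple: every XBAR output has an 8 bit "which input drives me" field, and two of those fields are packed into each 16 bit select register. Here is a sketch of the same packing that runs on a PC against a plain array instead of the real XBARA1_SEL registers. XbarModel, xbarModelConnect and xbarModelSource are made-up names, and the limits of 88 inputs and 132 outputs are simply taken from the copy of xbar_connect in the full listing below:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the XBARA1_SEL0.. register block: 132 outputs, two per
// 16 bit register. On real hardware these are memory-mapped registers.
struct XbarModel {
	uint16_t sel[66] = {};
};

// Same packing as xbar_connect() from pwm.c, but writing to the model.
// Returns false when the indices are out of range.
inline bool xbarModelConnect(XbarModel &xbar, unsigned input, unsigned output)
{
	if (input >= 88 || output >= 132) return false;
	uint16_t val = xbar.sel[output / 2];
	if ((output & 1) == 0) {
		// even output: selection lives in the low byte
		val = static_cast<uint16_t>((val & 0xFF00u) | input);
	} else {
		// odd output: selection lives in the high byte
		val = static_cast<uint16_t>((val & 0x00FFu) | (input << 8));
	}
	xbar.sel[output / 2] = val;
	return true;
}

// Read back which input currently drives a given output.
inline unsigned xbarModelSource(const XbarModel &xbar, unsigned output)
{
	assert(output < 132);
	uint16_t val = xbar.sel[output / 2];
	return (output & 1) ? static_cast<unsigned>(val >> 8)
	                    : static_cast<unsigned>(val & 0xFFu);
}
```

So connecting input 8 to output 0 just writes 8 into the low byte of the first select register, and the neighbouring output 1 keeps whatever it had in the high byte.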


Now comes a bit of more obscure stuff, that I'm sure I would have missed in the docs, but KurtE's sources do all this, so I had some reference point to look for. First, the XBAR1_INOUT8 is called "INOUT" for a reason - it can be both input and output. Since we want it to be an input, we need to set it this way: the IOMUXC_GPR_GPR6 register (page 344) sets the direction for a bunch of XBAR I/O pins. We want to clear the bit corresponding to INOUT8 to set it as input:

Code:
IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);

Then, certain XBAR inputs can be driven by two different input pads. This is called daisy-chaining, and for XBAR1_IN08 it is controlled by the register IOMUXC_XBAR1_IN08_SELECT_INPUT (page 906) - we can choose between it being driven by pad EMC_06 (so Teensy pin 4) or by pad SD_B0_04 (one of the SDIO pins). We want to use Teensy pin 4, so we set it to 0:

Code:
IOMUXC_XBAR1_IN08_SELECT_INPUT = 0;

We also want the XBAR to react to an edge of the input signal, not really the level, and we want to trigger the DMA on that edge. This is controlled by the XBARA1_CTRL0 register, see page 3271 for details. We want to enable edge detection on XBARA_OUTPUT00, detect rising edge, and generate a DMA request when it occurs:

Code:
XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(1) | XBARA_CTRL_DEN0;

This is tbh, at least to me, a bit confusing, as the XBAR1_OUTPUT00 is XBAR1_DMA - so why do we additionally need to enable the XBARA_CTRL_DEN0 bit here? What would it actually mean to have something routed to that output but without this bit set? Seems like it wouldn't generate these requests, but in that case what does this connection actually mean? I'm not really sure.

Anyway.... the only remaining thing is to actually set up our DMA channel to execute its minor copy loop when triggered by the XBAR request:

Code:
dmachannel.triggerAtHardwareEvent( DMAMUX_SOURCE_XBAR1_0 );

And done!

Well, not quite, because, surprise, surprise, XBAR doesn't seem to be clocked by default, so it just doesn't work if you don't enable its clock explicitly (clock gating register described on page 1086):

Code:
CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);

And you need to enable that clock *before* you start messing with XBAR settings, otherwise they won't take effect.


And now we're really really done. When we enable the DMA channel, it will wait for the clock signal on Teensy pin 4, and each rising edge will trigger a minor DMA loop, copying 4 bytes from GPIO1 to the output buffer. When the output buffer is full, it will trigger the interrupt. One useful thing to mention here is that you need to clear the interrupt flag in the interrupt handler:

Code:
dmachannel.clearInterrupt();	
asm("DSB");

The DSB instruction is a memory barrier, ensuring that the following code will not be executed before the previous memory operations complete (apparently it can be pretty common for the interrupt handler to finish executing before that clear actually gets through, due to differences in the clock rates between different systems).


Here's the complete code you can just build and try out. It uses the setup as described above, so you need a clock signal on pin 4. On the data pins I present a binary counter, which I then check for correctness - this is mainly to check how fast I can get this data from the input pins with the DMA (it only does 8 bit data, instead of 10, too many cables to connect ;-). The disappointing bit is that it's not really that fast. 10MHz seems to work fine, but anything above is just generating errors - some values are missed. I'm not entirely sure if this is a problem with the slowness of the RT1060 or my setup - I have the signal lines connected with ~15cm cables, so maybe there's a problem with the signal integrity by the time it gets to the input pins. But I don't really have either the equipment or the experience to reliably measure a >10MHz signal. So maybe it's possible to get it to work slightly faster too - but I wouldn't expect anything crazy - the DMA is only clocked at 1/4th of the core frequency.
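
If you want to play with the validation logic without the hardware, the counter check can be pulled out into a standalone function and run on a PC. This is just a sketch, not the code from the listing: firstSequenceError and expectedCaptureMicros are made-up names, and the check is slightly stricter than the one in the interrupt handler below, because it only accepts 0 right after 255 instead of accepting any 0:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Return the index of the first sample that breaks the 8 bit counting
// sequence, or -1 if the whole capture is consistent.
// `shift` is the position of the lowest data bit in the GPIO word (18 here).
inline long firstSequenceError(const uint32_t *samples, size_t count, unsigned shift)
{
	assert(samples != nullptr && shift <= 24);
	for (size_t i = 1; i < count; ++i) {
		uint32_t prev = (samples[i - 1] >> shift) & 0xFFu;
		uint32_t curr = (samples[i] >> shift) & 0xFFu;
		if (curr != ((prev + 1u) & 0xFFu)) {
			return static_cast<long>(i);
		}
	}
	return -1;
}

// Time one full buffer should take at a given sample clock, in microseconds.
inline double expectedCaptureMicros(size_t samples, double clockHz)
{
	assert(clockHz > 0.0);
	return static_cast<double>(samples) * 1e6 / clockHz;
}
```

The second helper is there because the time printed by the sketch is a cheap sanity check: 4096 samples at 2MHz should take 2048 microseconds and at 10MHz about 410, so a time that is way off suggests that not every clock edge produced a transfer.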


Code:
#include <DMAChannel.h>

DMAChannel dmachannel;

#define DMABUFFER_SIZE	4096
uint32_t dmaBuffer[DMABUFFER_SIZE];

int counter = 0;
unsigned long prevTime;
unsigned long currTime;
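volatile bool error = false;	// written in the DMA interrupt, read in loop()
volatile bool dmaDone = false;	// written in the DMA interrupt, read in loop()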
uint32_t errA, errB, errorIndex;

// copied from pwm.c
void xbar_connect(unsigned int input, unsigned int output)
{
	if (input >= 88) return;
	if (output >= 132) return;

	volatile uint16_t *xbar = &XBARA1_SEL0 + (output / 2);
	uint16_t val = *xbar;
	if (!(output & 1)) {
		val = (val & 0xFF00) | input;
	} else {
		val = (val & 0x00FF) | (input << 8);
	}
	*xbar = val;
}


void dmaInterrupt()
{
	dmachannel.clearInterrupt();	// tell system we processed it.
	asm("DSB");						// this is a memory barrier

	prevTime = currTime;
	currTime = micros();  

	error = false;

	uint32_t prev = ( dmaBuffer[0] >> 18 ) & 0xFF;
	for( int i=1; i<DMABUFFER_SIZE; ++i )
	{
		uint32_t curr = ( dmaBuffer[i] >> 18 ) & 0xFF;

		if ( ( curr != prev + 1 ) && ( curr != 0 ) )
		{
			error = true;
			errorIndex = i;
			errA = prev;
			errB = curr;
			break;
		}

		prev = curr;
	}

	dmaDone = true;
}


void kickOffDMA()
{
	prevTime = micros();
	currTime = prevTime;

	dmachannel.enable();	
}


void setup()
{
	Serial.begin(115200);	

	// set the GPIO1 pins to input
	GPIO1_GDIR &= ~(0x03FC0000u);

	// Need to switch the IO pins back to GPIO1 from GPIO6
	IOMUXC_GPR_GPR26 &= ~(0x03FC0000u);

	// configure DMA channels
	dmachannel.begin();
	dmachannel.source( GPIO1_DR ); 
	dmachannel.destinationBuffer( dmaBuffer, DMABUFFER_SIZE * 4 );  

	dmachannel.interruptAtCompletion();  
	dmachannel.attachInterrupt( dmaInterrupt );

	// clock XBAR - apparently not on by default!
	CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);

	// set the IOMUX mode to 3, to route it to XBAR
	IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 3;

	// set XBAR1_INOUT08 to INPUT
	IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);  // Make sure it is input mode

	// daisy chaining - select between EMC06 and SD_B0_04
	IOMUXC_XBAR1_IN08_SELECT_INPUT = 0;
	
	// Tell XBAR to generate a DMA request on the rising edge
	XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(1) | XBARA_CTRL_DEN0;

	// connect the IOMUX_XBAR_INOUT08 to DMA_CH_MUX_REQ30
	xbar_connect(XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_DMA_CH_MUX_REQ30);

	// trigger our DMA channel at the request from XBAR
	dmachannel.triggerAtHardwareEvent( DMAMUX_SOURCE_XBAR1_0 );

	kickOffDMA();
}


void loop()
{
	delay( 100 );

	if ( dmaDone )
	{
		Serial.printf( "Counter %8d Buffer 0x%08X time %8u  %s", counter, dmaBuffer[0], currTime - prevTime, error ?  "ERROR" : "no error"  );
	
		if ( error )
		{
			Serial.printf( " [%d] 0x%08X 0x%08X", errorIndex, errA, errB );
		}

		Serial.printf( "\n");

		dmaDone = false;
		delay( 1000 );

		Serial.printf( "Kicking off another \n" );

		kickOffDMA();
	}
	else
	{
		Serial.printf( "Waiting...\n" );
	}

	++counter;
}

Once you feel confident with all this, reading any more complex setups will be much easier, as they generally build on these basics. For instance KurtE's camera code has two DMA setups, that alternate between one another, each writing to a different buffer. The DMA controller has a bunch of interesting functionality (like triggering interrupt on half-completion, so you can probably get a similar setup to KurtE's with just a single channel/buffer), it can be hooked up to different systems etc. And when you get some familiarity with these docs, it's actually not that hard to navigate them, even though they are over 3000 pages long.

And thanks to everyone around here, especially Paul and KurtE, all your work has been extremely helpful.
 
Nice write up - Puts a lot of the details into one place which is great.

Note: another gotcha you may run into with DMA is with caching.

That is, in the example above you declare dmaBuffer like this:
Code:
DMAChannel dmachannel;

#define DMABUFFER_SIZE	4096
uint32_t dmaBuffer[DMABUFFER_SIZE];
This puts your buffer into the RAM1 section of memory (TCM - Tightly Coupled Memory) which is not cached...

However if instead you had:
Code:
DMAMEM uint32_t dmaBuffer[DMABUFFER_SIZE];
This puts the array into RAM2. Likewise, if you allocate it with something like dmaBuffer = malloc(DMABUFFER_SIZE), it also goes into RAM2,
and on a T4.1 with external PSRAM, buffers declared with EXTMEM or allocated with extmem_malloc() end up on your external memory chip.

Why is this important? Because these are in sections of memory which have a cache associated with them. Again why is this important?

DMA operations work directly to and from the physical memory. It knows nothing about the cache. However normal CPU instructions do use the cache. So the values you see may be different! Which is probably not at times what you want :D So what do you do about it... The system does have a few built in functions, to help deal with this:
The two I typically use are: arm_dcache_flush and arm_dcache_delete but there is also arm_dcache_flush_delete. These are defined in the imxrt.h file.

When to use them? Suppose you are doing DMA from memory to someplace, for example a display driver outputting a frame to the display. In this case I would call arm_dcache_flush(buffer, size); to make sure everything that is in the buffer is flushed out to the underlying memory, so the DMA will output the most recent data.

If I am doing DMA in from a device into memory, I call arm_dcache_delete, probably before the DMA operation. This removes the data from the cache such that the next time you do read operations from that area of memory it will read it from the actual memory which should have the results of the DMA operation.

Another minor note: DMA appears to really prefer that the memory addresses passed in are 32 byte aligned.

So you will see places in code that have defines like:
Code:
    uint8_t  MTPD::big_buffer_[BIG_BUFFER_SIZE] DMAMEM __attribute__ ((aligned(32)));
Or in other places in code that maybe use malloc something like:

Code:
      // Hack to start frame buffer on 32 byte boundary
      _we_allocated_buffer = (uint16_t *)malloc(CBALLOC + 32);
      if (_we_allocated_buffer == NULL)
        return 0; // failed
      _pfbtft = (uint16_t *)(((uintptr_t)_we_allocated_buffer + 32) &
                             ~((uintptr_t)(31)));
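
The rounding in that hack is easy to try out on a PC. Below is a small sketch of it, nothing Teensy-specific, and alignUp32, roundUpToLines and alignedStart are hypothetical helper names. It assumes 32 bytes, which is both the alignment mentioned above and the cache line size of the Cortex-M7, so it is also a sensible granularity for buffer sizes handed to the cache functions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kCacheLine = 32; // alignment discussed above

// Round an address up to the next 32 byte boundary
// (an address that is already aligned is returned unchanged).
inline std::uintptr_t alignUp32(std::uintptr_t address)
{
	return (address + (kCacheLine - 1)) & ~(kCacheLine - 1);
}

// Round a byte count up to a whole number of 32 byte lines.
inline std::size_t roundUpToLines(std::size_t bytes)
{
	const std::size_t line = static_cast<std::size_t>(kCacheLine);
	return (bytes + (line - 1)) & ~(line - 1);
}

// Pick an aligned start inside an over-allocated block.
// `raw` must point to at least (needed bytes + 31) bytes.
inline void *alignedStart(void *raw)
{
	assert(raw != nullptr);
	return reinterpret_cast<void *>(alignUp32(reinterpret_cast<std::uintptr_t>(raw)));
}
```

The only difference from the snippet above is that this adds 31 instead of 32 before masking, so a pointer that is already aligned stays where it is. Both variants give a 32 byte aligned start as long as the allocation has the extra room.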

Again nice write up!
 
@miciwan great post! Do you have any reference to information about XBAR besides the technical manual?
What other methods are there to trigger a DMA transfer? And how would one attach a clock signal to a DMA transfer (for driving an LCD in this case)
 
XBAR - for me it was reading the manual a few times.. (sometimes more than a few)... Note: So far when playing with DMA, I have mainly needed to have some understanding of XBAR for GPIO and for ADC... Note: ADC is a lot more complex as you have to go through another subsystem (ADC_ETC)...

DMA Trigger - Look in the DMA MUX chapter (chapter 4.4) - and/or guess by looking in the imxrt.h file for things that look like: DMAMUX_SOURCE_FLEXIO1_REQUEST0

But for each of these, you may need to look up more information in the chapters that deal with what subsystem you are looking at.

As for LCD - How is it connected? SPI? If so you can look at some of our drivers that do DMA, like ili9341_t3n for things about this. Or you can look at the SPI library at the asynchronous transfer code which uses DMA...

As for I2C, I have not done much with it on T4.x. But I will assume that it uses a source like: DMAMUX_SOURCE_LPI2C1

But again you then need to look at the subsystem like for LPI2C to see what registers need to be setup to trigger I2C.
For example with LPI2C you see there is a register: Master DMA Enable Register (MDER)
Sounds very likely you might have to set something here...
 
Hi Kurt,

I’m driving the lcd at the moment via 8/16bit 8080 on GPIO6 writing directly to the port.
It works quite fast but I am afraid that when I start to add more functions and interrupts it's going to impact the speed I see right now. That is why I am interested in trying to do 8 bit with DMA as there are just enough pins to achieve that on the same port.
So I need to provide a write strobe signal to the lcd when writing a command/data and from my understanding that needs to be triggered with the DMA transfer via a timer of some sort(?)

I’ll head into the chapters you referred to, although all of this is rather challenging to me 😅

I use your ILI9488 SPI library on my current project (with the gauge) but my new version with the T4.1 runs with LVGL on the UI and speed and/or DMA transfers to the display are needed to keep it all smooth
 
@KurtE - ah, great points. Properly flushing/invalidating caches is crucial when working with such systems, thanks for pointing that out. TBH, with this being my first T4.1 thing that I touched, I haven't even noticed that i.MX RT1060 has data cache ;) And great point about the alignment - this sort of stuff actually should be pointed out in the docs, it's pretty ridiculous that it isn't.

@Frank - I will! Again, being pretty fresh, I didn't even know there's a wiki! I will clean it up a bit and I'll put it there!

@Rezo - no, just the reference manual and the source, I'm afraid. It takes a while to get used to, it's not the best written documentation I've seen, but it does contain quite a lot.
There's actually a dedicated unit for driving LCDs, similar to CSI for reading the CMOS sensors, it's described in Chapter 35. I'm not sure if enough of its pins are available externally (that's one of the problems with CSI - it seems great, but when you look closer, if you want to get raw data, the only option is 10bits; and then only the pins for the high 8 bits are available - which is a bit of a shame if you have a 12bit sensor).

If you want to do it by hand, I would probably start with setting up a timer to generate DMA requests that will push the data out on GPIO pins, so pretty similar, just setting GPIO to outputs and having a timer generate the requests instead of an external clock signal (which should be simpler, I think, only setting up the timer). Then you would need to somehow push that clock signal outside, but it's probably possible to hook up that timer to some output pin with the XBAR too - but all this is just my rambling on how I would approach it, rather than an actual way of doing that. And ofc, you would need some buffering and a way of chaining the DMA requests so that the screen update is continuous. When you get it working, post some details!
 
I tried doing a similar setup for the output and it works fine too. I set up the FlexPWM module to output the clock signal on Teensy pin 4 (should be doable with QuadTimer too, but FlexPWM docs were easier to parse...) - chapter 55 has the details, but there's generally a counter incremented every clock cycle, and when it's between two values, the output is high

Code:
FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_CLDOK( 1 );

FLEXPWM2_SM0CTRL	 = FLEXPWM_SMCTRL_FULL | FLEXPWM_SMCTRL_PRSC(0);
FLEXPWM2_SM0CTRL2	 = FLEXPWM_SMCTRL2_INDEP | FLEXPWM_SMCTRL2_CLK_SEL(0);			// wait enable? debug enable?
FLEXPWM2_SM0INIT	 = 0;
FLEXPWM2_SM0VAL0	 = 0;									// midrange value to position the signal
FLEXPWM2_SM0VAL1	 = 66;								// max value
FLEXPWM2_SM0VAL2	 = 0;									// start A 
FLEXPWM2_SM0VAL3	 = 33;								// end A - should give roughly 2Mhz
FLEXPWM2_OUTEN		|= FLEXPWM_OUTEN_PWMA_EN( 1 );

FLEXPWM2_SM0TCTRL	 = FLEXPWM_SMTCTRL_PWAOT0;              // route the output waveform directly to the trigger 

FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_LDOK( 1 );
FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_RUN( 1 );

Then set the pin 4 mode to FlexPWM so the output wave goes directly outside.

Code:
IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 1;
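
As a sanity check on the numbers in the FlexPWM setup, here is a little sketch of the arithmetic that can be run on a PC. It rests on two assumptions of mine that you should double check against chapter 55: that the FlexPWM counter is clocked from the 150MHz peripheral clock, and that it counts from INIT up to VAL1 inclusive before reloading. It also ignores that these registers are really signed 16 bit values. The function names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Counts per PWM period, assuming the counter runs from INIT up to VAL1
// inclusive and then reloads.
inline uint32_t pwmPeriodCounts(uint16_t init, uint16_t val1)
{
	assert(val1 >= init);
	return static_cast<uint32_t>(val1) - init + 1u;
}

// Output frequency for a given module clock and INIT/VAL1 pair.
inline double pwmFrequencyHz(double moduleClockHz, uint16_t init, uint16_t val1)
{
	return moduleClockHz / pwmPeriodCounts(init, val1);
}

// Fraction of the period the A output is high (on at VAL2, off at VAL3).
inline double pwmDutyA(uint16_t init, uint16_t val1, uint16_t val2, uint16_t val3)
{
	assert(val3 >= val2);
	return static_cast<double>(val3 - val2) / pwmPeriodCounts(init, val1);
}
```

Under those assumptions, INIT = 0 and VAL1 = 66 give 67 counts per period, so 150MHz / 67 is about 2.24MHz, and VAL2 = 0 with VAL3 = 33 gives a duty cycle of 33/67, so just under 50%. That agrees with the "roughly 2Mhz" in the comment.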

Setting up the FLEXPWM2_SM0TCTRL register lets us route the output waveform directly to the "TRIGGER" output, which in turn can be hooked up to XBAR - where we can do edge detection and trigger the DMA as before. This time, I trigger the DMA on the falling edge, so on the rising edge the data is already set to the proper value and the client can read it:

Code:
// Tell XBAR to generate a DMA request on the falling edge
// DMA will output next data piece on falling edge
// so the client can receive it on rising
XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(2) | XBARA_CTRL_DEN0;
	
// connect the clock signal to DMA_CH_MUX_REQ30
xbar_connect( XBARA1_IN_FLEXPWM2_PWM1_OUT_TRIG0, XBARA1_OUT_DMA_CH_MUX_REQ30);

Complete thing:

Code:
#include <DMAChannel.h>

DMAChannel dmachannel;

#define DMABUFFER_SIZE	4096
uint32_t dmaBuffer[DMABUFFER_SIZE];

int counter = 0;
unsigned long prevTime;
unsigned long currTime;
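volatile bool error = false;	// written in the DMA interrupt, read in loop()
volatile bool dmaDone = false;	// written in the DMA interrupt, read in loop()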
uint32_t errA, errB, errorIndex;

void xbar_connect(unsigned int input, unsigned int output)
{
	if (input >= 88) return;
	if (output >= 132) return;

	volatile uint16_t *xbar = &XBARA1_SEL0 + (output / 2);
	uint16_t val = *xbar;
	if (!(output & 1)) {
		val = (val & 0xFF00) | input;
	} else {
		val = (val & 0x00FF) | (input << 8);
	}
	*xbar = val;
}


void outputDMAInterrupt()
{
	dmachannel.clearInterrupt();	// tell system we processed it.
	asm("DSB");						// this is a memory barrier

	prevTime = currTime;
	currTime = micros();  

	error = false;

	dmaDone = true;
}


void setupOutputDMA()
{
	// prepare the output buffer
	for( int i=0; i<DMABUFFER_SIZE; ++i )
	{
		dmaBuffer[i] = ( i & 0xFF ) << 18;
	}

	// set GPIO1 to output
	GPIO1_GDIR |= 0x03FC0000u;

	// Need to switch the IO pins back to GPIO1 from GPIO6
	IOMUXC_GPR_GPR26 &= ~(0x03FC0000u);

	// configure DMA channels
	dmachannel.begin();
	dmachannel.sourceBuffer( dmaBuffer, DMABUFFER_SIZE * 4 );  
	dmachannel.destination( GPIO1_DR ); 

	dmachannel.interruptAtCompletion();  
	dmachannel.attachInterrupt( outputDMAInterrupt );

	// set the IOMUX mode to 1, to route pin 4 to FlexPWM
	IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 1;

	// setup flexPWM to generate clock signal on pin 4
	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_CLDOK( 1 );

	FLEXPWM2_SM0CTRL	 = FLEXPWM_SMCTRL_FULL | FLEXPWM_SMCTRL_PRSC(0);
	FLEXPWM2_SM0CTRL2	 = FLEXPWM_SMCTRL2_INDEP | FLEXPWM_SMCTRL2_CLK_SEL(0);			// wait enable? debug enable?
	FLEXPWM2_SM0INIT	 = 0;
	FLEXPWM2_SM0VAL0	 = 0;									// midrange value to position the signal
	FLEXPWM2_SM0VAL1	 = 66;									// max value
	FLEXPWM2_SM0VAL2	 = 0;									// start A
	FLEXPWM2_SM0VAL3	 = 33;								// end A
	FLEXPWM2_OUTEN		|= FLEXPWM_OUTEN_PWMA_EN( 1 );

	FLEXPWM2_SM0TCTRL	 = FLEXPWM_SMTCTRL_PWAOT0;

	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_LDOK( 1 );
	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_RUN( 1 );

	// clock XBAR - apparently not on by default!
	CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);
	
	// Tell XBAR to generate a DMA request on the falling edge
	// DMA will output next data piece on falling edge
	// so the client can receive it on rising
	XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(2) | XBARA_CTRL_DEN0;
	
	// connect the clock signal to DMA_CH_MUX_REQ30
	xbar_connect( XBARA1_IN_FLEXPWM2_PWM1_OUT_TRIG0, XBARA1_OUT_DMA_CH_MUX_REQ30);
	
	// trigger our DMA channel at the request from XBAR
	dmachannel.triggerAtHardwareEvent( DMAMUX_SOURCE_XBAR1_0 );
}


void kickOffDMA()
{
	prevTime = micros();
	currTime = prevTime;

	dmachannel.enable();	
}


void setup()
{
	Serial.begin(115200);	

	setupOutputDMA();

	kickOffDMA();
}


void loop()
{
	delay( 100 );

	if ( dmaDone )
	{
		Serial.printf( "Counter %8d Buffer 0x%08X time %8u  %s", counter, dmaBuffer[0], currTime - prevTime, error ?  "ERROR" : "no error"  );
	
		if ( error )
		{
			Serial.printf( " [%d] 0x%08X 0x%08X", errorIndex, errA, errB );
		}

		Serial.printf( "\n");

		dmaDone = false;
		delay( 1000 );

		Serial.printf( "Kicking off another \n" );

		kickOffDMA();
	}
	else
	{
		Serial.printf( "Waiting...\n" );
	}


	++counter;
}

Seems to work:

Capture2.jpg

Interestingly, this lets you measure the time between the falling edge of the clock and the moment the DMA presents the data on the output pins - which is around 100ns.
 
This is an excellent writeup, thank you miciwan!

Even though I'm not using a Teensy (working on a custom iMXRT1062 board) it helped me immensely in getting DMA to work well with the ADC, ADC_ETC and PIT together. The reference manual leaves a bit to be desired; most of the info is there but it is spread out so much and written a bit esoterically.
 
Complete thing:

Code:
#include <DMAChannel.h>

DMAChannel dmachannel;

#define DMABUFFER_SIZE	4096
uint32_t dmaBuffer[DMABUFFER_SIZE];

int counter = 0;
unsigned long prevTime;
unsigned long currTime;
bool error = false;
bool dmaDone = false;
uint32_t errA, errB, errorIndex;

void xbar_connect(unsigned int input, unsigned int output)
{
	if (input >= 88) return;
	if (output >= 132) return;

	volatile uint16_t *xbar = &XBARA1_SEL0 + (output / 2);
	uint16_t val = *xbar;
	if (!(output & 1)) {
		val = (val & 0xFF00) | input;
	} else {
		val = (val & 0x00FF) | (input << 8);
	}
	*xbar = val;
}


void outputDMAInterrupt()
{
	dmachannel.clearInterrupt();	// tell system we processed it.
	asm("DSB");						// this is a memory barrier

	prevTime = currTime;
	currTime = micros();  

	error = false;

	dmaDone = true;
}


void setupOutputDMA()
{
	// prepare the output buffer
	for( int i=0; i<DMABUFFER_SIZE; ++i )
	{
		dmaBuffer[i] = ( i & 0xFF ) << 18;
	}

	// set GPIO1 to output
	GPIO1_GDIR |= 0x03FC0000u;

	// Need to switch the IO pins back to GPIO1 from GPIO6
	IOMUXC_GPR_GPR26 &= ~(0x03FC0000u);

	// configure DMA channels
	dmachannel.begin();
	dmachannel.sourceBuffer( dmaBuffer, DMABUFFER_SIZE * 4 );  
	dmachannel.destination( GPIO1_DR ); 

	dmachannel.interruptAtCompletion();  
	dmachannel.attachInterrupt( outputDMAInterrupt );

	// set the IOMUX mode to 1 (ALT1), to route pad EMC_06 (pin 4) to FlexPWM
	IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 1;

	// setup flexPWM to generate clock signal on pin 4
	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_CLDOK( 1 );

	FLEXPWM2_SM0CTRL	 = FLEXPWM_SMCTRL_FULL | FLEXPWM_SMCTRL_PRSC(0);
	FLEXPWM2_SM0CTRL2	 = FLEXPWM_SMCTRL2_INDEP | FLEXPWM_SMCTRL2_CLK_SEL(0);			// wait enable? debug enable?
	FLEXPWM2_SM0INIT	 = 0;
	FLEXPWM2_SM0VAL0	 = 0;									// midrange value to position the signal
	FLEXPWM2_SM0VAL1	 = 66;									// max value
	FLEXPWM2_SM0VAL2	 = 0;									// start A
	FLEXPWM2_SM0VAL3	 = 33;								// end A
	FLEXPWM2_OUTEN		|= FLEXPWM_OUTEN_PWMA_EN( 1 );

	FLEXPWM2_SM0TCTRL	 = FLEXPWM_SMTCTRL_PWAOT0;

	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_LDOK( 1 );
	FLEXPWM2_MCTRL		|= FLEXPWM_MCTRL_RUN( 1 );

	// clock XBAR - apparently not on by default!
	CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);
	
	// Tell XBAR to request DMA on the falling edge
	// DMA will output next data piece on falling edge
	// so the client can receive it on rising
	XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(2) | XBARA_CTRL_DEN0;
	
	// connect the clock signal to DMA_CH_MUX_REQ30
	xbar_connect( XBARA1_IN_FLEXPWM2_PWM1_OUT_TRIG0, XBARA1_OUT_DMA_CH_MUX_REQ30);
	
	// trigger our DMA channel at the request from XBAR
	dmachannel.triggerAtHardwareEvent( DMAMUX_SOURCE_XBAR1_0 );
}


void kickOffDMA()
{
	prevTime = micros();
	currTime = prevTime;

	dmachannel.enable();	
}


void setup()
{
	Serial.begin(115200);	

	setupOutputDMA();

	kickOffDMA();
}


void loop()
{
	delay( 100 );

	if ( dmaDone )
	{
		Serial.printf( "Counter %8d Buffer 0x%08X time %8u  %s", counter, dmaBuffer[0], currTime - prevTime, error ?  "ERROR" : "no error"  );
	
		if ( error )
		{
			Serial.printf( " [%d] 0x%08X 0x%08X", errorIndex, errA, errB );
		}

		Serial.printf( "\n");

		dmaDone = false;
		delay( 1000 );

		Serial.printf( "Kicking off another \n" );

		kickOffDMA();
	}
	else
	{
		Serial.printf( "Waiting...\n" );
	}


	++counter;
}
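
The `xbar_connect()` helper in the listing above does a read-modify-write on a 16-bit select register that is shared by two XBAR outputs. As a rough host-side sketch of just that packing step (no hardware access; the function names `packXbarSelect` and `xbarSelectIndex` and the output indices 4 and 5 are made up for illustration):

```cpp
#include <cstdint>

// Host-side model of the read-modify-write in xbar_connect(): each 16-bit
// XBARA1_SELn register holds the input select for two outputs.
// Even outputs use the low byte, odd outputs use the high byte.
constexpr uint16_t packXbarSelect(uint16_t current, unsigned output, unsigned input)
{
    return (output & 1u)
        ? static_cast<uint16_t>((current & 0x00FFu) | ((input & 0xFFu) << 8))
        : static_cast<uint16_t>((current & 0xFF00u) | (input & 0xFFu));
}

// Index of the SEL register that holds a given output (two outputs per register).
constexpr unsigned xbarSelectIndex(unsigned output)
{
    return output / 2u;
}

// Arbitrary example values, chosen only to make the byte replacement visible.
static_assert(xbarSelectIndex(5) == 2, "outputs 4 and 5 share SEL2");
static_assert(packXbarSelect(0x1234, 4, 0x28) == 0x1228, "even output: low byte replaced");
static_assert(packXbarSelect(0x1234, 5, 0x28) == 0x2834, "odd output: high byte replaced");
```

The point of keeping the other byte intact is that connecting one output must not disturb the routing of its neighbour in the same register.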



I finally have the guts to test this and start playing around, as I have been doing A LOT of reading (I still don't understand much, but I know more than I did a few weeks ago).
I want to thank miciwan for the guidance and information provided over PM.
I just want to confirm a few things to make sure I have understood what I have read so far :eek:

I want to set the upper 16 bits of GPIO1 to outputs, so I have done the following - is this correct?
Code:
// set GPIO1 to output
//GPIO1_GDIR |= 0x03FC0000u;
GPIO1_GDIR |= 0xFFFF0000u;

// Need to switch the IO pins back to GPIO1 from GPIO6
//IOMUXC_GPR_GPR26 &= ~(0x03FC0000u);
IOMUXC_GPR_GPR26 &= ~(0xFFFF0000u);
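
One way to double-check masks like these is to build them from the bit range they are meant to cover. A small sketch, assuming a hypothetical helper `bitRangeMask` (not part of the Teensy core):

```cpp
#include <cstdint>

// Mask with bits lowBit..highBit (inclusive) set, for a 32-bit GPIO register.
// Both arguments must be in 0..31 with lowBit <= highBit.
constexpr uint32_t bitRangeMask(unsigned lowBit, unsigned highBit)
{
    return (0xFFFFFFFFu >> (31u - highBit)) & (0xFFFFFFFFu << lowBit);
}

// The original example drives bits 18..25; the change above asks for 16..31.
static_assert(bitRangeMask(18, 25) == 0x03FC0000u, "original 8-bit window");
static_assert(bitRangeMask(16, 31) == 0xFFFF0000u, "upper 16 bits of the port");
```

This only confirms the arithmetic of the mask. Which Teensy pins those GPIO1 bits actually reach still has to be looked up in the pin mapping.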

I want to set the PWM output to 8-10 MHz. Based on the FlexPWM clock running at 108 MHz (per miciwan), I have set the following values:
Code:
FLEXPWM2_SM0VAL0   = 0;                 // midrange value to position the signal
FLEXPWM2_SM0VAL1   = 12; //66;                  // max value
FLEXPWM2_SM0VAL2   = 0;                 // start A
FLEXPWM2_SM0VAL3   = 6; //33;                // end A
108/12 = 9 MHz
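
A quick way to sanity-check that arithmetic: the sketch below assumes the submodule counter runs from INIT up to and including VAL1 before reloading, so one period is VAL1 - INIT + 1 counts rather than VAL1 counts. That assumption, and the 108 MHz clock figure quoted above, should both be verified against the reference manual and the actual clock setup. The helper names are made up for illustration.

```cpp
#include <cstdint>

// Counts per PWM period, ASSUMING the counter runs from INIT up to and
// including VAL1 before reloading (period = VAL1 - INIT + 1 counts).
constexpr uint32_t pwmPeriodCounts(uint16_t init, uint16_t val1)
{
    return static_cast<uint32_t>(val1) - init + 1u;
}

// Resulting PWM frequency in Hz (integer division, so rounded down).
constexpr uint32_t pwmFrequencyHz(uint32_t clockHz, uint16_t init, uint16_t val1)
{
    return clockHz / pwmPeriodCounts(init, val1);
}

// Under that assumption VAL1 = 12 gives 13 counts per period,
// and VAL1 = 11 is what yields exactly 108 MHz / 12.
static_assert(pwmPeriodCounts(0, 12) == 13, "VAL1 = 12 means 13 counts");
static_assert(pwmFrequencyHz(108000000u, 0, 11) == 9000000u, "VAL1 = 11 gives 9 MHz");
```

If the assumption holds, VAL1 = 12 lands a little above 8.3 MHz, which is still inside the 8-10 MHz target but is not the 9 MHz computed above.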


I want to use FlexPWM2:0 on alternate pin #33, did I get it right? :D

Code:
// set the IOMUX mode to 1 (ALT1), to route the pad to FlexPWM
//IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 1;
IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_07 = 1;

Last but not least: if I want to stop PWM2:0, can I just set the following register to 0?
Code:
FLEXPWM2_MCTRL    |= FLEXPWM_MCTRL_RUN( 0 );
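
One thing worth checking in that line: OR-ing a register with a zero value cannot clear a bit that is already set. A host-side sketch of the difference, using a stand-in `runField()` for `FLEXPWM_MCTRL_RUN(n)` (the shift of 8 and the 4-bit width are assumptions here; the real macro is in imxrt.h):

```cpp
#include <cstdint>

// Hypothetical stand-in for FLEXPWM_MCTRL_RUN(n): a 4-bit field, one bit per
// submodule. The field position is an ASSUMPTION; check imxrt.h for the real macro.
constexpr uint16_t runField(unsigned n)
{
    return static_cast<uint16_t>((n & 0x0Fu) << 8);
}

// Models "reg |= bits": can only set bits, never clear them.
constexpr uint16_t orInto(uint16_t reg, uint16_t bits)
{
    return static_cast<uint16_t>(reg | bits);
}

// Models "reg &= ~bits": clears exactly the given bits, leaves the rest alone.
constexpr uint16_t clearFrom(uint16_t reg, uint16_t bits)
{
    return static_cast<uint16_t>(reg & ~bits);
}

static_assert(orInto(runField(1), runField(0)) == runField(1), "OR with 0 leaves RUN set");
static_assert(clearFrom(runField(1), runField(1)) == 0, "AND with the inverted mask clears RUN");
```

On the hardware, the equivalent of the second form would presumably be `FLEXPWM2_MCTRL &= ~FLEXPWM_MCTRL_RUN( 1 );`, which clears the run bit of submodule 0 without touching the other submodules.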
 