
Thread: Teensy 4.1 How to start using DMA?

  1. #1

    Hello.
    I want to use DMA on the Teensy 4.1, but I found that there is hardly any good documentation on how to start: no tutorial, no guidance on which libraries to use or how to configure DMA, and no simple examples.
    What I want to do now is read the states of 28 GPIO pins into an array at a rate of 2M samples per second.
    I'm using Arduino-IDE.

    Regards Mateusz

  2. #2
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    7,681
    Sorry, I am not sure there is any easy answer to this, nor any tutorials.

    The way I have learned about doing DMA, is by looking at other code doing DMA... Plus looking through the reference manual, starting with chapters 4 and 5.

    I personally have not done GPIO DMA input or output myself. I have mostly done some with SPI and Analog inputs...

    Source-wise, I again usually start off by looking at other code that has worked and model mine after that...
    In the cores\teensy4 directory of your install are the files DMAChannel.h and DMAChannel.cpp,
    which I use a lot to help set things up, like a DMAChannel and DMASetting and the like. Information about the formats of these and what each field does is in the chapters I mentioned above.

    Also the file imxrt.h includes a set of defines for what the source is for the DMA transfer (DMAMUX_SOURCE_...)

    Also, to do GPIO this way you need to understand which pins are on which hardware port and where they sit within that port. Again, you can figure that out from the header files. I keep my own spreadsheet with that information:
    [Attached image: T4.1-Cardlike.jpg]
    It is in an Excel document up at: https://github.com/KurtE/TeensyDocuments

    With GPIO there is an additional complication: each GPIO port actually has two versions. GPIO1 also maps to GPIO6, where GPIO6 accesses the pins faster than GPIO1, BUT it does not allow DMA. There is another set of hardware registers that selects, pin by pin, whether the pin belongs to the normal GPIO port or the faster one...
    There is code in the main startup.c which maps all of our pins into fast mode... So to do DMA with those pins you need to switch them back to the slow port....

    Probably the best example code for this is the OctoWS2811 code that Paul did to allow the OctoWS2811 board to work with T4.x... But his code is only doing outputs to GPIO, so you would need to reverse this.
    But the sources again are part of Teensyduino in the OctoWS2811 library in OctoWS2811_imxrt.cpp

    Hope that helps some.

  3. #3
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    22,757
    Sadly, there aren't any easy tutorials. But you might read through the DMAChannel.h class. I put lots of comments in the source to explain how things work. As things are today, those comments are the closest thing we have to any sort of introduction to DMA.

    Indeed the OctoWS2811 source code is probably the best place to start if you want GPIO to memory.

    DMA is tricky to use, partly because you get pretty much no info when it doesn't work which makes troubleshooting very hard, but also because NXP's gigantic reference manual is sorely lacking in many important details that usually you can only discover by a lot of painful trial and error. Having done that for many hours to write the OctoWS2811 and other libs, here's a few specific tips to hopefully save you from some of those painful moments.

    First you'll need to configure a timer to generate the DMA trigger events at the rate you want. While developing, I highly recommend using an extremely slow rate. If the timer is *very* slow, you can write code to print the actual memory locations and watch how they change as your DMA runs (make sure you use the volatile keyword so the compiler doesn't assume memory can't spontaneously change). You can always just edit the timer config for the very high speeds once the rest is working.

    While the timers can generate DMA events directly, that rarely works if you want to use the DMA to do something other than write to the timer's settings. The problem is the timer doesn't get any acknowledgement that its DMA request was serviced, so it keeps asserting the request signal and the DMA controller goes into an infinite loop running the DMA you requested (or if you configure DMA to halt rather than keep repeating the transfer, it does the whole thing as fast as the hardware can go). This DMA acknowledge process is the main thing that's sorely lacking in NXP's documentation. The rest of the manual does have all the needed info, though it's scattered and difficult to find across those thousands of pages!

    You will probably need to route the timer's output pulse through the crossbar switch to 1 of its 4 DMA request generators. Those do auto-acknowledge when the DMA controller services your request.

    The other alternative for DMA acknowledgement is to set up 2 DMA channels, one which does the GPIO operation you want, and another which does a dummy write to the timer so it won't keep requesting more DMA until the next timer pulse. But you can't use the DMAMUX to route a trigger event to more than 1 DMA channel. To get 2 channels to run, you need to have the hardware event trigger the first channel, then set that channel up to trigger the other channel as it completes each event. This way is more complicated, less efficient, and consumes an extra DMA channel. But the upside is it doesn't consume any of those finite DMA triggering resources in the crossbar switch.

    GPIO also requires configuration. Each 32 bit GPIO port has 2 sets of registers. GPIO1-4 are on the normal (slow) peripheral bus. GPIO5-9 are on a fast bus, but the DMA controller can't access that bus. The fact that DMA can't access those registers is the other painful detail not mentioned anywhere in NXP's documentation. It's crazy frustrating if you don't know this and have to find out by experimentation while so many other things are also unclear! You have to use only GPIO1-4 for DMA. Each individual pin can be assigned to be accessed by either the fast or slow registers. By default they're all assigned to the fast registers. So you will need to write to the GPR registers to reassign the pins you wish to use. See the code in OctoWS2811 for an example.

    Once you have the timer, crossbar trigger and GPIO ready to be used by DMA, then comes a lot of decisions about how to configure the DMA channel. If you only need a fairly simple configuration, maybe the DMAChannel.h functions are enough. But you can also use DMAChannel to write directly to the 32 bytes of TCD registers which control the actual transfer. Most of the libraries do this, which gives the advantage that the DMAChannel class dynamically allocates an available DMA channel for you, which means your code is far more likely to be able to work together with the various libraries which use DMA (as far as I know, all of them use DMAChannel to avoid conflicts with the others).

    The OctoWS2811 code uses a pretty complex TCD setup, because it generates data "on the fly" in chunks. You're probably better off looking at how the audio library configures the TCD, at least for the simpler protocols like input_i2s.cpp, where the DMA runs continuously, circling through a buffer, and the interrupt from DMA just copies half of the received data to a non-DMA buffer while the DMA keeps running to fill up the other half. If you're just looking to acquire input signals at a steady speed, that is probably a usage model much closer to your needs than the complex reloading of the DMA settings which OctoWS2811 does. If you can just set up the DMA channel once and let it run forever while you respond to its interrupts, things are much simpler. Of course the hardware can do very complex things, as OctoWS2811 shows, but it is best to avoid that sort of use, especially as a first learning experience.
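To make the "circle through a buffer, copy half on each interrupt" pattern concrete, here is a minimal host-side sketch in plain C++ that runs on a PC, not on the Teensy. The DMA engine is simulated by an ordinary loop, and every name in it (`dmaBuffer`, `captured`, `onHalf`, `onComplete`, `simulateDma`, the 8-word buffer size) is hypothetical and chosen only for illustration. On real hardware the two handlers would be reached from the DMA interrupt at the half and end points of the transfer; how that is wired up is not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBufWords = 8;           // hypothetical size, must be even
constexpr std::size_t kHalf = kBufWords / 2;

// Buffer the (simulated) DMA writes into, circling forever.
std::array<uint32_t, kBufWords> dmaBuffer{};
// Non-DMA storage that the "interrupt" copies finished halves into.
std::vector<uint32_t> captured;

// First half is full; the DMA is now filling the second half.
void onHalf() {
  captured.insert(captured.end(), dmaBuffer.begin(), dmaBuffer.begin() + kHalf);
}

// Second half is full; the DMA has wrapped back to the first half.
void onComplete() {
  captured.insert(captured.end(), dmaBuffer.begin() + kHalf, dmaBuffer.end());
}

// Stand-in for the hardware: writes `count` samples (0, 1, 2, ...) and
// raises the half/complete "interrupts" at the matching buffer positions.
void simulateDma(std::size_t count) {
  assert(count % kHalf == 0);  // keep the model simple: whole halves only
  std::size_t pos = 0;
  for (std::size_t n = 0; n < count; ++n) {
    dmaBuffer[pos] = static_cast<uint32_t>(n);
    pos = (pos + 1) % kBufWords;
    if (pos == kHalf) onHalf();
    if (pos == 0) onComplete();
  }
}
```

Calling `simulateDma(16)` leaves the sixteen samples in `captured` in arrival order, which is the property you would want from a real interrupt handler as well: each half must be copied out before the DMA comes around to overwrite it.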

    Just to keep this in perspective, you're going to configure the TCD source address to read from the GPIO register, and you want zeros in all the source address offset fields so the DMA channel always reads the same place. Of course set it up for 32 bit transfer, as the GPIO registers do not support 8 or 16 bit access. For the destination, you'll set the address to your buffer that will receive the incoming DMA writes, and you'll set the destination offset to 4 so it increments through the buffer as it writes. You'll probably also set the DLAST offset so the destination address automatically returns to the start of your buffer. Especially if you don't set the bit for the transfer to be done at the end, it will automatically restart, which is really convenient to just set the DMA up once and let it keep running forever. The main trick is getting the "last" destination offset right so the destination address resets back to the beginning automatically.
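The address bookkeeping described above can be modeled on a PC. The sketch below is an assumption-laden model, not driver code: `TcdModel`, `makeGpioToBuffer`, and `onTrigger` are hypothetical names, the struct does not match the hardware register layout, the addresses are arbitrary, and it assumes one 32-bit transfer per trigger. It only shows how a zero source offset, a destination offset of 4, and a negative "last" destination adjustment make the write pointer wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model of the address fields discussed above.
// This is NOT the hardware TCD layout, just the same ideas.
struct TcdModel {
  uint32_t saddr;  // source address (stands in for the GPIO register)
  int32_t  soff;   // added to saddr after each read; 0 = always same place
  uint32_t daddr;  // destination address (current position in the buffer)
  int32_t  doff;   // added to daddr after each write; 4 = next 32-bit word
  int32_t  dlast;  // added to daddr when the whole buffer has been filled
  uint32_t citer;  // transfers remaining before the buffer is full
  uint32_t biter;  // reload value for citer
};

// Build a model for a buffer of `words` 32-bit entries.
TcdModel makeGpioToBuffer(uint32_t gpioReg, uint32_t buffer, uint32_t words) {
  assert(words > 0);
  const int32_t bytes = static_cast<int32_t>(words * sizeof(uint32_t));
  return TcdModel{gpioReg, 0, buffer, 4, -bytes, words, words};
}

// One trigger = one 32-bit transfer in this model.
void onTrigger(TcdModel& t) {
  t.saddr += t.soff;       // soff is 0, so the source never moves
  t.daddr += t.doff;       // step to the next word of the buffer
  if (--t.citer == 0) {    // buffer full
    t.daddr += t.dlast;    // jump back to the start of the buffer
    t.citer = t.biter;     // and start over without CPU help
  }
}
```

With a 4-word buffer, the destination address advances by 4 bytes on each of the first three triggers and is back at the buffer start after the fourth, while the source address never changes.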

    If possible, use DMAChannel.h functions to do the dirty work of setting up all those TCD registers. Especially if you're going to just set up a transfer that runs continuously and gives you interrupts as it does its work (so you can copy the data it has acquired before it gets even more and overwrites the data it already got), that TCD init is only done once at startup. Of course you can read all about the TCD registers in the reference manual, but the amount of raw capability and the huge number of possible ways to set it up can be pretty overwhelming. Best to keep things simple if you don't need those advanced features (and this sounds like one of the pretty simple uses).

    Whew, that turned out longer than I imagined. Hopefully it helps?
    Last edited by PaulStoffregen; 09-25-2020 at 08:55 PM. Reason: added more info...

  4. #4
    Thank you both for your replies.
    I don't know if it helps with complexity, but I already have a 2 MHz square-wave signal available that I use to clock external parallel ADCs.
    So on every falling edge, I need to catch 28 GPIO states and put them as bits into a variable (and into an array of variables). When the first array is full, signal that to the rest of the code and start filling the second array, and so on.
    Currently I'm doing this inside an ISR, and here is my code:

    Code:
    void myISR() //this will occur 2 000 000 times per second
    {
      if (GPIO_array_sel)  //select second array
      {
    
        if (adc_sample_start == true)  //select if we really want adc samples or just need to max-out value to make short impulse that will indicate to further software that adc is about to start sampling
        {
          GPIO6_1.array[GPIO_array_pos] = 0xFFFFFFFF;
          GPIO7_1.array[GPIO_array_pos] = 0xFFFFFFFF;
          adc_sample_start = false;
        }
        else
        {
          GPIO6_1.array[GPIO_array_pos] = GPIO6_DR;
          GPIO7_1.array[GPIO_array_pos] = GPIO7_DR;
        }
    
    
      }
      else //select first array
      {
    
        if (adc_sample_start == true) //select if we really want adc samples or just need to max-out value to make short impulse that will indicate to further software that adc is about to start sampling
        {
          GPIO6_0.array[GPIO_array_pos] = 0xFFFFFFFF;
          GPIO7_0.array[GPIO_array_pos] = 0xFFFFFFFF;
          adc_sample_start = false;
        }
        else
        {
          GPIO6_0.array[GPIO_array_pos] = GPIO6_DR;
          GPIO7_0.array[GPIO_array_pos] = GPIO7_DR;
        }
    
      }
    
      GPIO_array_pos++;  //increment current position counter in array, so on next ISR we will write to next record
    
    
      if (GPIO_array_pos == 250)  //if array is full, select another array and reset position counter
      {
        GPIO_array_sel = !GPIO_array_sel;
        GPIO_array_pos = 0;
      }
    }
    When either array is full, the rest of the code in loop() prepares the data, builds a 1000-byte array, and sends that array over Ethernet via UDP using the NativeEthernet library.
    Code:
      
            Udp.beginPacket(Udp.remoteIP(), port);
            Udp.write(arrayToSend, 1000);
            Udp.endPacket();
    And it works just fine most of the time, but sometimes, when the ISR fires while a UDP packet is being composed or transferred, the UDP packet contains distorted data.
    The above 3 lines of code are blocking and take some time to execute, so I can't just put them between noInterrupts() and interrupts(). I could not find a better Ethernet library, so I thought I would get rid of this time-sensitive, continuously firing ISR.


    Looking through your DMA libraries, I'm very, very sad to say it, but I think that at this point I'm just too stupid to use DMA in my projects.
    I am a "typical" user; I cannot understand most of the code in these libraries, which looks like this:
    Code:
    TMR4_SCTRL0 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE | TMR_SCTRL_MSTR;
    	TMR4_CSCTRL0 = TMR_CSCTRL_CL1(1) | TMR_CSCTRL_TCF1EN;
    	TMR4_CNTR0 = 0;
    	TMR4_LOAD0 = 0;
    	TMR4_COMP10 = comp1load[0];
    	TMR4_CMPLD10 = comp1load[0];
    	TMR4_CTRL0 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_LENGTH | TMR_CTRL_OUTMODE(3);
    	TMR4_SCTRL1 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE;
    	TMR4_CNTR1 = 0;
    	TMR4_LOAD1 = 0;
    	TMR4_COMP11 = comp1load[1]; // T0H
    	TMR4_CMPLD11 = comp1load[1];
    	TMR4_CTRL1 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_COINIT | TMR_CTRL_OUTMODE(3);
    	TMR4_SCTRL2 = TMR_SCTRL_OEN | TMR_SCTRL_FORCE;
    	TMR4_CNTR2 = 0;
    	TMR4_LOAD2 = 0;
    	TMR4_COMP12 = comp1load[2]; // T1H
    	TMR4_CMPLD12 = comp1load[2];
    	TMR4_CTRL2 = TMR_CTRL_CM(1) | TMR_CTRL_PCS(8) | TMR_CTRL_COINIT | TMR_CTRL_OUTMODE(3);
    I don't blame anybody, because I know that writing a library and commenting every line would take a lot of extra time and effort.
    Writing this thread, I was hoping that there was a simple, easy-to-use DMA library with good documentation that does not require using register names or diving into the NXP manual.

    If you still have any thoughts or tips about my issue please reply. Thanks again for your contribution.

  5. #5
    Senior Member+ KurtE's Avatar
    Hi @Paul and @Fiskusmati

    As I mentioned in the previous posts, some of this DMA stuff can get real confusing. Up till now with T4.x I have mainly done DMA operations to SPI devices, and maybe a logical UART on FlexIO.

    But I am playing with trying to get some DMA to one GPIO port to work to read in 8 pins, using an external clock from a camera... And I think I am getting pretty close. I believe I am getting the data clocked logically correctly and getting DMA data, but the data does not look correct...

    Paul - there are some simple things about GPIO on IMXRT where I am not 100% sure what is correct when reading multiple pins.
    For example, the example sketch I started converting to DMA was doing something like:
    Code:
    static inline uint32_t cameraReadPixel() 
    {
      uint32_t pword= GPIO6_DR >> 18;  // get the port bits. We want bits 18, 19 and 22 to 27
      return (pword&3) | ((pword&0x3f0)>>2);
    }
    Which appears to work, but I am wondering if it really should be: uint32_t pword= GPIO6_PSR >> 18;
    As digitalRead does: return (*(p->reg + 2) & p->mask) ? 1 : 0;

    where p->reg, I am pretty sure, points to the DR, and the +2 (which is +8 bytes) gets to the PSR...
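For anyone checking the bit shuffling in cameraReadPixel(), here is a small host-side version that takes the port word as a parameter so it can be exercised without hardware. `extractPixel` is a hypothetical name, and the sketch says nothing about whether DR or PSR is the right register to read; it only confirms which port bits land where in the result.

```cpp
#include <cassert>
#include <cstdint>

// Same bit manipulation as cameraReadPixel() above, but on a plain value.
constexpr uint32_t extractPixel(uint32_t portWord) {
  const uint32_t pword = portWord >> 18;  // port bit 18 is now bit 0
  return (pword & 0x3u)                   // port bits 18-19 -> result bits 0-1
       | ((pword & 0x3F0u) >> 2);         // port bits 22-27 -> result bits 2-7
}

// With all eight camera bits set (18, 19, 22..27) the result is a full byte.
static_assert(extractPixel(0x0FCC0000u) == 0xFFu, "all data bits set");
// Port bits 20 and 21 are not part of the pixel and must be ignored.
static_assert(extractPixel(0x00300000u) == 0x00u, "gap bits ignored");
```

The same function could be applied to each 32-bit word captured by DMA, since the DMA buffer holds the raw port value rather than the packed pixel.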

    Note: I have tried both in my conversion of library (https://github.com/arduino-libraries/Arduino_OV767X)
    The WIP code is up in this area including sketch...


    I am trying to remember where I read about DMA needs to go to GPIO1 instead of GPIO6...

    The interesting thing is that when reading without DMA, my current code appears to work with either GPIO6_PSR or GPIO1_PSR, regardless of the state of:
    // Need to switch the IO pins back to GPIO1 from GPIO6
    IOMUXC_GPR_GPR26 &= ~(0x0FCC0000u);
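The constant in that line is just one bit per pin being moved back to GPIO1. As a sketch of where it comes from, the helper below builds such a mask from a list of bit positions; `gpioBitMask` is a hypothetical name, and the code is plain C++ that runs on a PC. It assumes, as the line above implies, that clearing a bit in IOMUXC_GPR_GPR26 assigns the matching pin to GPIO1 rather than GPIO6.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Build a 32-bit mask with one bit set per listed GPIO bit position.
uint32_t gpioBitMask(std::initializer_list<unsigned> bits) {
  uint32_t mask = 0;
  for (unsigned b : bits) {
    assert(b < 32);     // a GPIO port has 32 bits
    mask |= (1u << b);
  }
  return mask;
}
```

For the camera pins discussed in this thread (bits 18, 19 and 22 to 27), `gpioBitMask({18, 19, 22, 23, 24, 25, 26, 27})` gives 0x0FCC0000, the value used above. On the Teensy the result would be used as `IOMUXC_GPR_GPR26 &= ~mask;`, which is not exercised by this host-side sketch.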


    Note to both: in the DMA camera case I am again clocking the DMA (I think) from an external IO pin (the clock from the camera).
    The setup for the clock pin to drive the DMA is:
    Code:
      // first see if we can convert the _pclk to be an XBAR Input pin...
      // OV7670_PLK   4
      *(portConfigRegister(_pclkPin)) = 3; // set to XBAR mode (xbar 8)
    
      // route the pixel clock input through XBAR to edge trigger DMA request
      CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);
      xbar_connect(XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_DMA_CH_MUX_REQ30);
      digitalToggleFast(31);
    
      // Tell XBAR to request DMA on the rising edge
      //attachInterruptVector(IRQ_XBAR1_01, &xbar01_isr);
      //NVIC_ENABLE_IRQ(IRQ_XBAR1_01);
      XBARA1_CTRL0 = XBARA_CTRL_STS0 | XBARA_CTRL_EDGE0(1) | XBARA_CTRL_DEN0/* | XBARA_CTRL_IEN0 */ ;
    
      IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);  // Make sure it is input mode
      IOMUXC_XBAR1_IN08_SELECT_INPUT = 0; // Make sure this signal goes to this pin...
    Note: This uses a pin (4) that is an XBAR pin (XBARA1 8).
    It sets the pin to mode 3 (XBAR for that pin); I left the PAD alone, which was previously configured for input. It enables XBAR, sets up the connection of
    XBARA1_IN_IOMUX_XBAR_INOUT08 to output to XBARA1_OUT_DMA_CH_MUX_REQ30, sets XBARA1_IN_IOMUX_XBAR_INOUT08 to input mode,
    and configures the XBAR register to request DMA on edge 1 (rising).

    Note: I have a DMA chain setup that does 2 horizontal lines of the camera per DMASetting where it calls ISR to convert the raw 32 bit data into pixel data, and this does appear to be called at the right time.

    But data does not look correct. So maybe/hopefully just missing some simple something, like need to tell the GPIO port to resample or ???

    EDIT: Side Note: keep thinking in IMXRT we should create a structure for GPIO registers like we did for SPI, or UART or ... such that instead of the magical
    things like: *(p->reg + 2) we might have something like p->reg.PSR. Wonder if that is worth doing at this point?
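A rough idea of what such a structure could look like, as a host-side sketch only: `GpioRegs` and `readPins` are hypothetical names, not something that exists in the core, and only the first three registers are modeled. The point is that a named field at byte offset 8 reads the same location as the `+ 2` word index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical struct covering only the first three GPIO registers.
struct GpioRegs {
  volatile uint32_t DR;    // data register, byte offset 0
  volatile uint32_t GDIR;  // direction register, byte offset 4
  volatile uint32_t PSR;   // pad status register, byte offset 8
};

// A uint32_t pointer advanced by 2 moves 8 bytes, which is where PSR sits.
static_assert(offsetof(GpioRegs, PSR) == 2 * sizeof(uint32_t),
              "PSR must line up with *(reg + 2)");

// Named-field read, the readable equivalent of *(p->reg + 2).
inline uint32_t readPins(const GpioRegs& port) { return port.PSR; }
```

A real version would need the remaining registers and would be mapped onto the hardware addresses rather than instantiated as an ordinary variable.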

    EDIT2:
    Found a simple problem in how I was extracting bits from the data, which was a lot of the issue... Getting better now... Looks like I may have one extra byte that was read in at the start, when the data was already high... I can probably easily compensate for that if need be...
    Last edited by KurtE; 10-10-2020 at 03:02 PM.

  6. #6
    Member
    Join Date
    Nov 2015
    Location
    colorado
    Posts
    51
    Hi @KurtE,

    Like @Fiskusmati I am working on connecting an external A/D to the GPIO lines on a 4.1 and accessing it via DMA. It appears that your work on interfacing the OV767X would be really helpful to me (and others) as a starting place. Is the code available somewhere on GitHub? I browsed the link in P5 but only found the original OV767X library, not your adaptation for the 4.1.

    Also, @Paul your overview on DMA in P3 was very helpful.

    David

  7. #7
    Senior Member+ KurtE's Avatar
    My DMA camera stuff is up at: https://github.com/KurtE/Arduino_OV7...mits/Teensy_t4
    Note: It is a PIP (Play In Progress), as I am just doing it for the fun of it.

    The DMA version up there does get a frame... I am working on more of a continuous update version; the first attempt did not work (it is in a different branch). Now I am going back to having the DMA get only one frame and stop, and then having a pin change interrupt on VSYNC start the DMA again for the next frame. It will probably skip frames this way, but again, for me that does not matter... So I am just doing it to learn a few more things.

  8. #8
    Member
    Thanks Kurt.

    I'll take a look. At this point I'm trying to work through as much example code as I can. DMA use on the 4.X is pretty complex and I'm hoping to get more comfortable with the triggering and data flow setups.

  9. #9
    Senior Member+ KurtE's Avatar
    You are welcome. Question is what AtoD converter? Don't a lot of them use SPI or I2C to communicate? If so their communications using DMA will likely be different than this work. Probably closer to some of the display drivers, although in reverse...

  10. #10
    Member
    I'm actually hoping to tie 2 parallel output A/Ds onto GPIO1. My desired bandwidth is quite high at ~50MHz, which may be unrealistic. I'm willing to use FIFO buffering between the A/Ds and the port if necessary, but right now I just want to see if I can use DMA to reliably read GPIO1, and what the max rate is. The DMA would be set to interrupt on buffer full, not to run continuously. So: ext. trigger - enable DMA w/ timer - xfer - completion interrupt - process [repeat]. My test bed is a couple of externally-clocked fast counters cascaded w/ parallel outputs connected to GPIO1 pins. This should let me look for missing or corrupt data in the received buffer.

  11. #11
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    647
    @dgranger

    I'm curious how fast you can go if you don't use DMA and just poll? Ie, wait for pin change, read/store GPIO, loop.

  12. #12
    Member
    I made some non-DMA timings on the 4.1 at both 450 & 600 MHz

    Note that the lib version of memcpy is pretty highly optimized w/ unrolled loops for word aligned data. Also note that the GPIO tests are using GPIO6 which uses fast bus access which is not available to DMA.

    It's pretty clear that the test / branch overhead is significant for the transfer loops.

    450 MHz:

    Mem-to-Mem:

    100x 10000 word (4 byte) memcpy: 2366 total uSec; 2.4 nSec (2 cy) per word; 422 MHz effective rate
    100x 10000 *d++ = *s++ word copy loop: 18895 total uSec; 18.9 nSec (9 cy) per word; 52 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = *s++] word copy loop: 2538 total uSec; 2.5 nSec (2 cy) per word; 394 MHz effective rate

    GPIO6-to-Mem:

    100x 10000 *d++ = GPIO6_DR word copy loop: 34452 total uSec; 34.5 nSec (16 cy) per word; 29 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 20094 total uSec; 20.1 nSec (10 cy) per word; 49 MHz effective

    600 MHz:

    Mem-to-Mem:

    100x 10000 word (4 byte) memcpy: 1774 total uSec; 1.8 nSec (2 cy) per word; 563 MHz effective rate
    100x 10000 *d++ = *s++ word copy loop: 14171 total uSec; 14.2 nSec (9 cy) per word; 70 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = *s++] word copy loop: 1904 total uSec; 1.9 nSec (2 cy) per word; 525 MHz effective rate

    GPIO6-to-Mem:

    100x 10000 *d++ = GPIO6_DR word copy loop: 25838 total uSec; 25.8 nSec (16 cy) per word; 38 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = GPIO6_DR] word copy loop: 15070 total uSec; 15.1 nSec (10 cy) per word; 66 MHz effective
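For readers who want to redo the arithmetic behind these figures, here is a small host-side sketch. The function names are hypothetical, and it assumes each test moves 1,000,000 words in total (100 x 10000, or 100 x 200 x 50 for the unrolled loops), which is my reading of the labels above. How the posted cycle counts and MHz values were rounded is not stated, so the sketch returns unrounded values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Average time per word in nanoseconds, from a total run time in microseconds.
double nsPerWord(double totalMicros, uint64_t words) {
  assert(words > 0);
  return totalMicros * 1000.0 / static_cast<double>(words);
}

// Effective transfer rate in millions of words per second.
double effectiveMHz(double nsWord) { return 1000.0 / nsWord; }

// CPU cycles per word at a given core clock in MHz (not rounded).
double cyclesPerWord(double nsWord, double cpuMHz) {
  return nsWord * cpuMHz / 1000.0;
}
```

As a check against the 450 MHz GPIO6 loop result, 34452 uSec over 1,000,000 words is 34.452 nSec per word, or roughly 29.0 million words per second, in line with the posted 34.5 nSec and 29 MHz.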

  13. #13
    Senior Member
    > 66 MHz effective

    Interesting, slower than I thought. I'd guess that an unrolled loop of [*d = GPIO1_DR] is probably most similar to how fast DMA can run. If the DMA is triggered by a pin input, then delay between that pin changing and the input of the GPIO data could be another issue.

    Worth trying, otherwise a hardware fifo.

  14. #14
    Member
    I ran the test using GPIO1 as that's what DMA can connect to. The #'s are not too promising, only 21 MHz effective w/o overclocking. Looks like a FIFO will probably be required. I'm unclear as to why the cycle counts vary with the clock speed. In the first test the cycle counts are constant within an access method and only the clock period varies. Perhaps the peripheral clock divider is different for the overclocked run. Regardless, these numbers are a long way from my goal of 50 MHz.

    450 MHz:

    GPIO1-to-Mem:

    100x 10000 *d++ = GPIO1_DR word copy loop: 63343 total uSec; 63.4 nSec (39 cy) per word; 15 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = GPIO1_DR] word copy loop: 46940 total uSec; 46.9 nSec (29 cy) per word; 21 MHz effective

    600 MHz:

    GPIO1-to-Mem:

    100x 10000 *d++ = GPIO1_DR word copy loop: 56673 total uSec; 56.7 nSec (35 cy) per word; 17 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = GPIO1_DR] word copy loop: 46672 total uSec; 46.7 nSec (29 cy) per word; 21 MHz effective

    816 MHz:

    GPIO1-to-Mem:

    100x 10000 *d++ = GPIO1_DR word copy loop: 41671 total uSec; 41.7 nSec (26 cy) per word; 23 MHz effective rate
    100x 200 Unrolled (rep 50x) [*d++ = GPIO1_DR] word copy loop: 34318 total uSec; 34.3 nSec (21 cy) per word; 29 MHz effective
