Teensy 4.1 How to start using DMA?

Just a quick update..
After a few hours of going through the documentation and experimenting, I finally figured out how to route the timer output to pin #33.

I started by setting pin 33 to FlexPWM_PWM2_B00 output

Set PWM2_B00 lower value

Then enabled PWM2_B00

The next step is to implement the 16-bit color transfer in the DMA kickoff function and callback.
Hopefully I can come back with some "good news" in the next few days when I find time to mess around with this a little more.
I want to stop the PWM in some cases to write simple 16 bit commands to the display, so I set the registers to stop the PWM and to set pin #33 back to a GPIO pin:

FLEXPWM2_OUTEN    |= FLEXPWM_OUTEN_PWMB_EN( 0 ); //Disable FlexPWM2_B00 output
IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_07 = 5; // set pin 33 back to GPIO4_IO07

I then tried to toggle the pin high and low using digitalWrite/digitalWriteFast, to strobe the WR line manually for a single write, but it does nothing.

Also, I noticed that setting FLEXPWM_MCTRL_RUN( 0 ) and/or FLEXPWM_OUTEN_PWMB_EN( 0 ) does not stop the PWM; only changing the pin assignment in the MUX does, which is only half of what I want/need.
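
One possible explanation, sketched on the host rather than on the Teensy: if `FLEXPWM_OUTEN_PWMB_EN( n )` expands to a submodule mask shifted into the PWMB field (an assumption about the Teensy core's imxrt.h, mirrored by the hypothetical `OUTEN_PWMB_EN` below), then `FLEXPWM_OUTEN_PWMB_EN( 0 )` is simply 0, and OR-ing a register with 0 leaves it unchanged. A plain variable stands in for the register here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the core macro: submodule mask placed in bits 7..4.
constexpr uint16_t OUTEN_PWMB_EN(uint16_t submodule_mask) {
    return static_cast<uint16_t>((submodule_mask & 0x0F) << 4);
}

// What the post does: OR with a zero mask, which cannot clear any bit.
constexpr uint16_t or_with_zero(uint16_t outen) {
    return static_cast<uint16_t>(outen | OUTEN_PWMB_EN(0));
}

// Clearing the PWMB enable bit of submodule 0 (mask bit 0) with AND-NOT.
constexpr uint16_t clear_pwmb0(uint16_t outen) {
    return static_cast<uint16_t>(outen & ~OUTEN_PWMB_EN(1));
}

// Small self-check: 0x0010 means "PWMB of submodule 0 enabled".
inline void self_check() {
    assert(or_with_zero(0x0010) == 0x0010);  // output stays enabled
    assert(clear_pwmb0(0x0010) == 0x0000);   // output is disabled
}
```

On the hardware this would correspond to something like `FLEXPWM2_OUTEN &= ~FLEXPWM_OUTEN_PWMB_EN( 1 );`. That line is untested here and rests on the macro assumption above.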

Can someone make sense of this for me?
The problem is that the manual is riddled with domain-specific lingo, and (at least for me, coming to microcontrollers from a different field) it took some time to connect the dots and understand what the authors meant by certain things. Also, the sources are well documented and are a great reference.
I believe that it might be useful as an introduction to actually reading the reference manual - going over a particular use case and extracting the information needed for that from the docs.

Miciwan, thank you for this. You provided a great contextual framework. Someone should promote you from "junior member" to "super helpful member".
...This puts your buffer into the RAM1 section of memory (TCM - Tightly Coupled Memory) which is not cached...

Kurt, if you're working with something that uses DMA with repeated/ongoing transfers, and you want to minimize address/data bus contention (because most of the 16 DMA channels are actively doing similar ongoing transfers), would using TCM (Tightly Coupled Memory) be more efficient than DMAMEM?
Working on a CCD line-array application, with external SPI ADCs.

I am trying to adapt this DMA code to initiate an SPI transmit with a variable for a 16bit SPI transaction.

The SPI receive is being dealt with elsewhere (probably a second Teensy 4), so I don't need to wait for the returned data.

Using flexPWM2, etc. as per miciwan's example.

Changed the setup to:

volatile uint32_t scan_count = 0;

// configure DMA channels
dmachannel.source( scan_count );
dmachannel.destination( LPSPI4_TDR );

I then want to start and stop the DMA/SPI at certain points in an indexed loop, a state machine called by the flexPWM2 interrupt at 500 kHz:

Start new loop

Count a number of loop cycles to get to starting point


For a given number of further counts, transmit data at each PWM trigger

SPI transmit

At end of active counts, stop DMA


Increment position


Loop until external trigger resets state-machine.
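
The steps above can be sketched as a small state machine. This is a host-side model under stated assumptions, not miciwan's code: `ScanStateMachine`, `start_count`, `active_count` and `words_sent` are all hypothetical names, and incrementing `words_sent` stands in for one 16-bit SPI transfer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the scan loop described in the post.
struct ScanStateMachine {
    uint32_t start_count;      // PWM ticks to skip before transmitting
    uint32_t active_count;     // PWM ticks during which data is transmitted
    uint32_t position = 0;     // tick index within the current scan
    uint32_t words_sent = 0;   // stand-in for SPI words handed to the DMA
    bool dma_enabled = false;  // whether the DMA/SPI path is active this tick

    ScanStateMachine(uint32_t start, uint32_t active)
        : start_count(start), active_count(active) {
        assert(active > 0);
    }

    // External trigger: start a new scan.
    void reset() {
        position = 0;
        dma_enabled = false;
    }

    // Called once per PWM tick (the 500 kHz interrupt in the post).
    void on_pwm_tick() {
        // Active window is [start_count, start_count + active_count).
        dma_enabled = position >= start_count &&
                      position < start_count + active_count;
        if (dma_enabled) {
            ++words_sent;  // one SPI transmit per tick inside the window
        }
        ++position;        // increment position on every tick
    }
};
```

On the Teensy, the points where `dma_enabled` changes would presumably be where the channel is enabled and disabled (for example with `dmachannel.enable()` and `dmachannel.disable()`); that mapping is a suggestion, not something tested here.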

I know I am missing something obvious here, any pointers would be appreciated.
Many thanks.
Hi, I'm trying to understand how the DMA's are used on the Teensy 4 in the OctoWS2811 library.
By looking at the following lines I can decipher that dma1 is used to set the pins high at the start of the WS2811 waveform, dma2 to set it high or low depending on the data in the bitdata buffer and dma3 is used to set it low at the end of the waveform.
I also understand that the bitdata buffer is continuously being updated during the transfer using an interrupt triggered by dma2 after every half of the buffer has been sent.
What I'm trying to figure out is how this interrupt is triggered before the transfer is completed and how the minor/major loops are exactly specified.
Since 4 32bit GPIO registers are targeted, I would expect DOFF to be 16 (4 x 4bytes), but in the code it is set to 16384 for all dma's and I can't figure out where this number comes from.
Can somebody help me make sense of this part of the code?
I'm trying to understand how the DMA's are used on the Teensy 4 in the OctoWS2811 library.

OctoWS2811 on Teensy 4 is probably the most complicated DMA of any library. You'd probably be better off reading the DMA code from a much simpler library, like WS2812Serial.

Since 4 32bit GPIO registers are targeted, I would expect DOFF to be 16 (4 x 4bytes), but in the code it is set to 16384 for all dma's and I can't figure out where this number comes from.

To answer this specific question, 16K is the distance between each of the 4 GPIO peripherals. Here's the relevant part of the memory map from the reference manual.


The minor loop copies four 32-bit words to the 4 GPIO peripherals. After it writes the first 32-bit word to the DR_SET register in GPIO1, the destination address is incremented by 16K so the next 32-bit word is written to the DR_SET register in GPIO2. This repeats for all 4 GPIO peripherals. You'll also see the DLASTSGA register is set to -65536, which causes the destination address to return to GPIO1 after all 4 are written. This way, each time the minor loop runs it can copy 128 bits to all the GPIO registers without the CPU having to rewrite the DMA TCD.
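
The address arithmetic can be checked with a short host-side sketch. The GPIO1 base address and the DR_SET offset below are assumptions taken from the i.MX RT1060 memory map and should be verified against the reference manual; `minor_loop_destinations` is a hypothetical helper. The sketch applies the -65536 adjustment after each group of four writes, which simplifies exactly when the hardware applies it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t GPIO1_BASE    = 0x401B8000u;  // assumed GPIO1 base address
constexpr uint32_t DR_SET_OFFSET = 0x84u;        // assumed DR_SET offset
constexpr int32_t  DOFF          = 16384;        // 16K between GPIO peripherals
constexpr int32_t  WRAP          = -65536;       // -(4 * 16384), back to GPIO1

// Returns the four destination addresses one minor loop writes to,
// then rewinds daddr so the next minor loop starts at GPIO1 again.
inline std::array<uint32_t, 4> minor_loop_destinations(uint32_t &daddr) {
    assert(4 * DOFF + WRAP == 0);  // the rewind must cancel the four steps
    std::array<uint32_t, 4> out{};
    for (auto &d : out) {
        d = daddr;
        daddr += static_cast<uint32_t>(DOFF);  // step to the next GPIO block
    }
    daddr += static_cast<uint32_t>(WRAP);      // unsigned wraparound subtracts 64K
    return out;
}
```

Starting from `GPIO1_BASE + DR_SET_OFFSET`, the four addresses are 16K apart and the pointer ends where it began, which is why four steps of 16384 pair with a rewind of -65536.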
Thanks, that helps a lot! (I was assuming the GPIO DR_SET registers were located right next to each other for some reason 🙄)

I'm trying to adapt the library in such a way that it controls WS2812 LEDs through shift registers to allow even more channels to be addressed in parallel.
Similar to what I did for the Teensy 3.6 with MultiWS2811 but now I'm trying to make it work for the Teensy 4.
With the T3.6 I was using a bunch of multiplexers on the output of the shift registers to create the "fixed" high and low parts of the WS2811 waveform. The Teensy 4 (when overclocked) seems to be fast enough to create the entire waveform directly through the shift registers, similar to what WS2812Serial does by including those parts of the waveform in the data. (Even though the resulting WS2811 frequency at the max OC is only 750 kHz instead of 800 kHz, the three different types of digital LEDs I tested still seem to handle it correctly.)

In my first tests I managed to control the LEDs successfully, but only up to a small number of LEDs per channel, since I had all the data preloaded in a buffer and used a single transfer. Also, I was only using a single GPIO, and I doubt it will work with multiple GPIOs using a minor loop, since it seems to be on the edge of what's stable regarding bandwidth (running it faster caused data to be skipped).
The next step would be to dynamically update the buffer during the transfer like how it's done in OctoWS2811, which is why I'm trying to understand the code.

I'll give it another look during the coming week and see how far I get. Could you in the meantime shed some light on how the ISR triggering mechanism on dma2 is set up? Is a new TCD prepared (dma2next) and automatically loaded once a transfer is finished? That would be very helpful!
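
As general background on how an eDMA channel can raise an interrupt before a transfer completes: the TCD has a half-complete interrupt option in addition to the major-loop-complete one, and the Teensy `DMAChannel` class exposes these as `interruptAtHalf()` and `interruptAtCompletion()`. The host-side model below illustrates only that counting behavior; `MajorLoopModel` is a hypothetical name. Whether OctoWS2811 uses this option or a scatter/gather TCD such as dma2next has not been checked here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one eDMA major loop made of `biter` minor loops.
struct MajorLoopModel {
    uint16_t biter;     // beginning iteration count (minor loops per major loop)
    uint16_t citer;     // current iteration count, counts down to 0
    int half_irqs = 0;  // half-complete interrupts raised so far
    int done_irqs = 0;  // major-loop-complete interrupts raised so far

    explicit MajorLoopModel(uint16_t n) : biter(n), citer(n) {
        assert(n >= 2);  // keep the half and done points distinct
    }

    // Called each time one minor loop (one DMA request) has been serviced.
    void minor_loop_done() {
        --citer;
        if (citer == (biter >> 1)) {
            ++half_irqs;    // first half of the buffer has been sent
        }
        if (citer == 0) {
            ++done_irqs;    // whole buffer has been sent
            citer = biter;  // counter reloads for the next major loop
        }
    }
};
```

In a double-buffering scheme, the half interrupt would be the moment to refill the first half of the buffer while the DMA sends the second half, and the completion interrupt the moment to refill the second half.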
That is definitely interesting, I didn't imagine shifting data out at those speeds was even possible :oops:
I have zero knowledge of the FlexIO module, so I'll look into that. I'm trying to build a system with at least 256 WS2811 channels. The library seems to be limited to one data pin feeding a series of 4 shift registers, resulting in 32 channels, with the possibility of increasing that to three pins for 96 channels according to the dev. I guess that's because there are only 3 FlexIO modules on the RT1060.

Anyway, I feel like I now understand how all the DMA code of the OctoWS2811 library works together and managed to adapt it to output the required signal for the shift registers.
It turns out that the DMA still skips a memory copy from time to time, so I had to reduce its trigger frequency even further, resulting in a WS2811 frequency of 700 kHz. While increasing the LEDs per channel I noticed the Teensy would sometimes freeze and reboot, which seemed to happen more often as the chip got hotter. This happened even with a cooling block installed, so I decreased the OC from 1008 MHz to 960 MHz. That seems to be stable, but it decreases the WS2811 frequency to 667 kHz, which starts to be problematic with some of the LED types I'm testing with.
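
The two frequencies quoted above are consistent with simple linear scaling. The sketch assumes the number of core clock cycles per WS2811 bit stays fixed when the clock changes (an assumption, not something stated in the post); `scaled_khz` is a hypothetical helper.

```cpp
#include <cassert>

// Output bit rate after a core clock change, assuming the cycles spent
// per WS2811 bit are unchanged, so the rate scales with the clock.
constexpr double scaled_khz(double khz_at_ref, double ref_mhz, double new_mhz) {
    return khz_at_ref * new_mhz / ref_mhz;
}

// Sanity check on the inputs used in the post: 700 kHz at 1008 MHz.
inline double ws2811_khz_at_960() {
    const double khz = scaled_khz(700.0, 1008.0, 960.0);
    assert(khz > 0.0);
    return khz;  // 700 * 960 / 1008, roughly 666.7 kHz
}
```

That rounds to the 667 kHz mentioned above, about 17% below the nominal 800 kHz.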

Also, copying data to multiple GPIOs with a minor loop, as OctoWS2811 does, introduces too much delay. That means only one GPIO block can be used, limiting it to 22 pins when using the GPIO with the most pins.
(22 x 8 =) 176 WS2811 channels isn't bad but since I need 256 channels I guess I'm back to using multiplexers after the shift registers like I was doing with the Teensy 3.6.
Unless... there is some other technique I'm missing?