Now I just wish I knew more to help the efforts in making a DMA version of the library or a softSPI library.
My idea, which I'm planning to actually work on after publishing beta 12, involves using 3 DMA channels triggered by the edges from a pair of PWM signals. It may be possible to trigger the DMA channels using the PIT and PDB timers, but for a first attempt I'm going to go the simpler (and observable on an oscilloscope) route that will consume a few pins.
The PWM will be 800 kHz on pins 3 and 4, which are controlled by the same timer. Both pins go high at the same instant. Pin 3 will stay high for 250 ns (the WS2811 waveform of sending all zero bits), and pin 4 will stay high for 600 ns (the WS2811 waveform of sending all one bits).
The rising edge of pin 3 will trigger DMA channel 0. That channel will be configured to read a fixed location and write a fixed location. The read will be a byte in memory containing 0xFF. The write will be the low 8 bits of I/O port D (which is pins 2, 5, 6, 7, 8, 14, 20 and 21).
The falling edge of pin 3 will trigger DMA channel 1. That channel will read an incrementing location (the RGB image data) and write to a fixed location (port D). For each bit that's 0, the WS2811 will see a 0, because the previously written 1 changes to 0 at this point, 250 ns after the rising edge. For each bit that's 1, the pin will remain a 1.
The falling edge of pin 4 will trigger DMA channel 2. That will be configured like channel 0, except the memory will contain 0x00. For the bits that remained high, they'll go low at the correct 600 ns timing so the WS2811 sees a 1 bit. For the bits that went low earlier, they'll remain low.
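To make the three-write sequence concrete, here is a rough host-side simulation (plain C++, not Teensy code and not the actual library). It only models what the low 8 bits of port D would look like during one 1250 ns bit period, using the timing numbers above. The names portValueAt and highTimeNs are made up for this sketch, and the 50 ns sampling step is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>

// Timing numbers taken from the post; everything else here is hypothetical.
constexpr int kZeroHighNs = 250;   // pin 3 falls: channel 1 writes the data byte
constexpr int kOneHighNs  = 600;   // pin 4 falls: channel 2 writes 0x00
constexpr int kPeriodNs   = 1250;  // one bit period at 800 kHz
constexpr int kStepNs     = 50;    // sampling step for the simulation (assumption)

// Value on the low 8 bits of port D at time tNs within one bit period,
// after applying whichever of the three DMA writes have happened by then.
uint8_t portValueAt(uint8_t dataByte, int tNs) {
    assert(tNs >= 0 && tNs < kPeriodNs);
    uint8_t port = 0xFF;                       // channel 0: all 8 pins go high
    if (tNs >= kZeroHighNs) port = dataByte;   // channel 1: 0 bits drop low here
    if (tNs >= kOneHighNs)  port = 0x00;       // channel 2: remaining pins drop low
    return port;
}

// Total time (ns) that one pin of the port stays high during the bit period.
int highTimeNs(uint8_t dataByte, int bit) {
    assert(bit >= 0 && bit < 8);
    int high = 0;
    for (int t = 0; t < kPeriodNs; t += kStepNs) {
        if (portValueAt(dataByte, t) & (1u << bit)) high += kStepNs;
    }
    return high;
}
```

With a data byte of 0xA5, the pins whose bit is 1 stay high for 600 ns and the pins whose bit is 0 stay high for 250 ns, so all 8 strips get their own bit from a single byte in RAM.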
This scheme should allow a large buffer to stream 8 strips of WS2811 LEDs without any CPU overhead. Of course, every 1250 ns, three single-byte DMA transfers occur, taking 6 bus cycles plus probably a few more for bus arbitration. During the 1250 ns period, the internal bus has 60 cycles available, so the DMA will consume about 10-15% of the available bus bandwidth. Fortunately, the chip has a switched mux matrix, and the RAM has 2 separate buses (one for the low half, another for the high half), so this activity should have minimal impact on code execution, which uses separate buses. If the RGB pixel buffer is placed in the lower half of RAM, it should also have minimal impact on the CPU's access to the stack (the other thing that matters for code performance).
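The bandwidth estimate can be checked with a few lines of arithmetic. This is only a sketch: the 48 MHz bus clock is my assumption (it is what 60 cycles per 1250 ns implies), 2 cycles per transfer is just the 6 cycles divided over 3 transfers, and the number of extra arbitration cycles is a guess you pass in. The function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Bus cycles available in one bit period. Integer math keeps the result exact.
constexpr int64_t busCyclesPerPeriod(int64_t busHz, int64_t periodNs) {
    return busHz * periodNs / 1000000000LL;
}

// Percentage of the bus consumed by DMA in one bit period.
// arbitrationCycles is the total extra cycles per period (a guess, not measured).
double dmaBusPercent(int transfers, int cyclesPerTransfer,
                     int arbitrationCycles, int64_t cyclesAvailable) {
    assert(cyclesAvailable > 0);
    const int used = transfers * cyclesPerTransfer + arbitrationCycles;
    return 100.0 * used / static_cast<double>(cyclesAvailable);
}
```

For example, busCyclesPerPeriod(48000000, 1250) gives 60. Three transfers at 2 cycles each with no arbitration overhead come to 10%, and allowing 3 extra cycles for arbitration brings it to 15%, which is where the 10-15% range comes from.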
At least that's my crazy plan.
There are a couple minor details about triggering on those edges, which will involve jumpering certain pins together. The RGB buffer will also have a somewhat strange layout, where each byte in RAM is actually a single bit for 8 different LEDs. A little crafty code can abstract that away for simple animations running on the chip. People who want to stream video from a PC will probably need to pre-arrange their data before transmitting it over USB.
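Here is a rough sketch of what that abstraction could look like. It is plain C++ that runs on a PC, not the eventual library code. The FrameBuffer, setPixel and getPixel names are hypothetical, and so are my layout assumptions: 24 bytes per LED position, most significant color bit first, and bit N of each byte belonging to strip N. The order of the color bytes inside the 24-bit value depends on the LEDs, so the color is treated as an opaque 24-bit number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kStrips = 8;       // one strip per bit of the port D byte
constexpr int kBitsPerLed = 24;  // 24 color bits per WS2811 LED

// Hypothetical frame buffer: each byte is one bit period for all 8 strips.
struct FrameBuffer {
    int ledsPerStrip;
    std::vector<uint8_t> bytes;
    explicit FrameBuffer(int n)
        : ledsPerStrip(n),
          bytes(static_cast<std::size_t>(n) * kBitsPerLed, 0) {}
};

// Scatter one 24-bit color into the 24 bytes belonging to this LED position,
// touching only the bit that corresponds to the given strip.
void setPixel(FrameBuffer& fb, int strip, int led, uint32_t color24) {
    assert(strip >= 0 && strip < kStrips);
    assert(led >= 0 && led < fb.ledsPerStrip);
    const uint8_t mask = static_cast<uint8_t>(1u << strip);
    uint8_t* p = &fb.bytes[static_cast<std::size_t>(led) * kBitsPerLed];
    for (int i = 0; i < kBitsPerLed; ++i) {
        const bool bit = (color24 >> (kBitsPerLed - 1 - i)) & 1u;  // MSB first
        if (bit) p[i] |= mask;
        else     p[i] &= static_cast<uint8_t>(~mask);
    }
}

// Gather the 24-bit color back out of the buffer (inverse of setPixel).
uint32_t getPixel(const FrameBuffer& fb, int strip, int led) {
    assert(strip >= 0 && strip < kStrips);
    assert(led >= 0 && led < fb.ledsPerStrip);
    const uint8_t* p = &fb.bytes[static_cast<std::size_t>(led) * kBitsPerLed];
    uint32_t color = 0;
    for (int i = 0; i < kBitsPerLed; ++i) {
        color = (color << 1) | ((p[i] >> strip) & 1u);
    }
    return color;
}
```

A PC streaming video could run the same kind of scatter loop before sending, so the bytes arriving over USB are already in the order the DMA reads them.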
I'm also considering a double buffering scheme to allow smooth animation.....