Here's a project I've been working on to drive 32 WS* channels of up to 1000 LEDs each with zero processor overhead, using FlexIO, DMA, and external shift registers. Some features:

1000 LEDs per channel at 30 FPS, though LED count and frame rate can be traded off against each other. 10 LEDs per channel at 3 kHz should be attainable.
Consumes three Teensy pins and zero processor time
Double-buffered to reduce tearing artifacts with video
Each channel is configurable for RGB, GRB, or GRBW
If you're really nuts, you could probably fit three of these on a Teensy, for 96 total output channels. Right now it's hard-coded to FlexIO 1.
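
To make the LED count versus frame rate trade-off concrete, here is a rough back-of-the-envelope sketch. It is not taken from the project code. It assumes the usual 800 kHz WS281x bit rate (1.25 us per bit) and a 50 us reset gap between frames; some LED variants want a longer gap, closer to 300 us, which would lower the numbers for short strings. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed WS281x timing, not read from the project source.
constexpr double kBitTimeUs = 1.25;    // one bit at 800 kHz
constexpr double kResetTimeUs = 50.0;  // latch gap; some parts need ~300 us

// Time to send one frame, in microseconds. All channels are clocked out
// in parallel, so the channel count does not enter into it.
double frameTimeUs(uint32_t ledsPerChannel, uint32_t bitsPerLed = 24) {
    assert(ledsPerChannel > 0);
    return ledsPerChannel * bitsPerLed * kBitTimeUs + kResetTimeUs;
}

// Upper bound on the refresh rate for a given string length.
// Use bitsPerLed = 32 for GRBW parts.
double maxFramesPerSecond(uint32_t ledsPerChannel, uint32_t bitsPerLed = 24) {
    return 1e6 / frameTimeUs(ledsPerChannel, bitsPerLed);
}
```

Under those assumptions, maxFramesPerSecond(1000) comes out a little over 33, and maxFramesPerSecond(10) a bit under 2900, which lines up with the 30 FPS and roughly 3 kHz figures above. GRBW strings send 32 bits per LED, so 1000 of those would land just under 25 FPS.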

This would also be a good starting point for any project that needs a lot of outputs updated relatively quickly (3.2 MHz), directly from RAM.
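
For anyone wondering what "directly from RAM" could look like for 32 parallel outputs, one common approach is to store the data bit-sliced, so that each 32-bit word holds one bit for every channel and a single DMA transfer updates all 32 outputs at once. The sketch below shows that idea only. The layout, the bit ordering, and the name transposeByte are assumptions for illustration and may not match what the library actually does; check the repository for the real buffer format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kChannels = 32;

// Hypothetical bit-slicing helper: takes one byte from each of 32 channels
// and produces 8 words, most significant bit first.
// Bit `ch` of out[b] is bit (7 - b) of in[ch].
void transposeByte(const uint8_t in[kChannels], uint32_t out[8]) {
    assert(in != nullptr && out != nullptr);
    for (size_t bit = 0; bit < 8; ++bit) {
        uint32_t word = 0;
        for (size_t ch = 0; ch < kChannels; ++ch) {
            if (in[ch] & (0x80u >> bit)) {
                word |= (uint32_t{1} << ch);
            }
        }
        out[bit] = word;
    }
}
```

Each word produced this way is one time slot on the wire: shifting it out to the external registers sets all 32 channel outputs for that bit at the same moment.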

Code and lots more details here:
https://github.com/wramsdell/TriantaduoWS2811