I have a (rather ambitious) project I would like to start that involves 512 APA106s (functionally equivalent to the WS2812) and 32 of SparkFun's 4x4 button pads. I've designed a PCB that holds an 8x8 grid of LEDs, and I'm planning to use a 4x2 grid of those PCBs. This creates a rather large 16x32 button matrix and 8 strips of 64 LEDs, all controlled by the Teensy. The Teensy would act as an intermediary, passing data between a Raspberry Pi and the hardware. Here's a block diagram of what I'm trying to achieve.
[Attached image: TbYJ9hO.jpg]

My first question is about the button matrix. Wiring it directly would require 48 Teensy pins (16 + 32), which is kind of ridiculous, so I am looking for an IC to bring that pin count down. Should I use shift registers, muxes/demuxes, or something else entirely?
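To put rough numbers on the pin problem, here is the arithmetic as a small Python sketch. The wiring schemes are only my assumptions about how each option would be hooked up (a binary-addressed demux on the 32-line side with the 16 lines read directly, versus daisy-chained serial-out and serial-in shift registers), not something I have built or tested.

```python
ROWS, COLS = 16, 32  # button matrix dimensions from my layout


def direct_pins(rows, cols):
    """One Teensy pin per row line and per column line."""
    return rows + cols


def demux_pins(rows, cols):
    """Assumption: a binary-addressed demux selects one of `cols` drive
    lines, and the `rows` sense lines still go straight to Teensy pins."""
    select_lines = (cols - 1).bit_length()  # address bits needed for `cols` outputs
    return select_lines + rows


def shift_register_pins():
    """Assumption: a daisy chain of serial-in/parallel-out registers drives
    the columns (data, clock, latch) and a chain of parallel-in/serial-out
    registers reads the rows (load, clock, data). The count stays the same
    no matter how large the matrix gets."""
    return 3 + 3


print(direct_pins(ROWS, COLS))   # 48
print(demux_pins(ROWS, COLS))    # 21
print(shift_register_pins())     # 6
```

If those assumptions hold, shift registers look like the biggest saving on paper, but I don't know how they compare on scan speed or wiring complexity.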

My next question is about the LEDs and how to communicate with them. The way I have the PCB laid out right now creates 8 separate lines of 64 LEDs each, so I figure the OctoWS2811 library would be perfect for that. I also gather that the Teensy's USB serial is the best way to move a lot of data to and from the Teensy. I would appreciate some input on the "protocol" to use. Should the Teensy "latch" in the data for all 512 LEDs from the Pi and then show the whole matrix (similar to the Glediator sketch), or should I use a protocol where I send an x, a y, and a color to update a single LED, with a separate "show" command? I would like to use the second option for flexibility (so I don't have to resend all 512 colors to update a single LED), but my concern is that it wouldn't be fast enough. I don't know what a reasonable refresh rate for such a setup is, but if I can get 60fps out of it that would be great. If 30fps is more realistic, that's fine too.
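To make the two options concrete, here is a rough Pi-side sketch in Python of what I'm picturing. The opcodes and byte layout are entirely made up for illustration (this is not an existing protocol, and it is not how the Glediator sketch frames its data), and the 800 kHz bit rate is my assumption based on the usual WS2812 timing.

```python
import struct

WIDTH, HEIGHT = 32, 16       # LED grid: 4x2 boards of 8x8 LEDs each
NUM_LEDS = WIDTH * HEIGHT    # 512

# Hypothetical one-byte opcodes, invented for this sketch.
CMD_SET_PIXEL = 0x01
CMD_FULL_FRAME = 0x02
CMD_SHOW = 0x03


def set_pixel(x, y, r, g, b):
    """Option 2: update a single LED (opcode + x + y + r + g + b)."""
    return struct.pack("6B", CMD_SET_PIXEL, x, y, r, g, b)


def full_frame(pixels):
    """Option 1: send every LED at once. `pixels` is a list of (r, g, b)."""
    if len(pixels) != NUM_LEDS:
        raise ValueError("expected one (r, g, b) tuple per LED")
    payload = bytes(channel for rgb in pixels for channel in rgb)
    return bytes([CMD_FULL_FRAME]) + payload


def show():
    """Tell the Teensy to push its buffer out to the LEDs."""
    return bytes([CMD_SHOW])


pixel_len = len(set_pixel(0, 0, 255, 0, 0))
frame_len = len(full_frame([(0, 0, 0)] * NUM_LEDS))

print(pixel_len)                        # 6 bytes per single-LED update
print(frame_len)                        # 1537 bytes per full frame
print(frame_len // pixel_len)           # 256 updates before a full frame is cheaper
print((frame_len + len(show())) * 60)   # 92280 bytes/s for full frames at 60fps

# Time to clock out one 64-LED strip, assuming 24 bits per LED at 800 kHz
# (the reset/latch gap between frames is not included).
print(64 * 24 * 1_000_000 // 800_000)   # 1920 microseconds
```

By this math the per-pixel commands only save bandwidth when fewer than about half the LEDs change in a frame, so I could see supporting both commands. What I can't judge is whether roughly 92 KB/s of full frames is comfortable over USB serial.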

Thanks for your time.