SPI slave is tough
I have just finished (this afternoon) writing a crude Teensy LC SPI slave for a board I designed. I now see why so few people do synchronous communication with a microcontroller as the slave: the timing requirements are tough to meet. I did manage to get my SPI interface working (at least, for communicating with the protocol tool of Digilent's Analog Discovery 2), but I had to add a dummy byte between a request from the master and the slave's response, and even then I could only clock the SPI communication at 1 MHz, taking 8 µs/byte, or 125,000 bytes/s. (If I had a series of single-byte responses to commands, each response would cost three bytes on the wire (command, dummy, response), so the transfer rate would be 1/3 of that, or about 41,667 bytes/s.) Luckily, the task I have only needs a 60 Hz update, and the information to transfer is only 7 or 8 bytes each time.
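As a rough check of the throughput figures above, here is a minimal sketch in C. The function names are made up for illustration and are not part of the slave code; the only inputs are the 1 MHz clock, the three-bytes-per-response overhead, and the 60 Hz update of up to 8 bytes.

```c
#include <stdint.h>

/* Useful bytes per second, given the SPI clock and how many bytes must
 * cross the wire for each useful byte (1 = back-to-back data,
 * 3 = command + dummy + response). Rounded to the nearest byte/s. */
static uint32_t spi_payload_rate(uint32_t sck_hz, uint32_t wire_bytes_per_payload_byte)
{
    uint32_t wire_rate = sck_hz / 8u;  /* 8 clock cycles per byte */
    return (wire_rate + wire_bytes_per_payload_byte / 2u) / wire_bytes_per_payload_byte;
}

/* Bytes per second the application actually needs. */
static uint32_t required_rate(uint32_t updates_per_second, uint32_t bytes_per_update)
{
    return updates_per_second * bytes_per_update;
}
```

With these, spi_payload_rate(1000000, 1) gives 125000 and spi_payload_rate(1000000, 3) gives 41667, while required_rate(60, 8) is only 480 bytes/s, so even the slow single-byte case leaves plenty of margin.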
I'm going to write a blog post about the code soon, after I've cleaned it up a little and gotten another Teensy board acting as a master talking to this slave.
The secret turned out to be keeping the transmit buffer always full, and refilling it only on transmit-empty interrupts (not on receive interrupts). This resulted in a predictable 1-byte delay between queuing a response to a command and the master receiving that response. I also had to give the SPI interrupt a higher priority than the other interrupts (which meant lowering the priority of all the rest, since the default is for all of them to have the highest priority).
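To show where the 1-byte delay comes from, here is a host-side model in C of that scheme. It is only a sketch under assumptions, not the Teensy code: it touches no hardware registers, and the names (spi_slave_t, transfer_byte, FILLER, NO_OP) and the "reply is the complement of the command" rule are all hypothetical. It assumes a double-buffered SPI peripheral whose shifter loads from a one-byte transmit buffer at the start of each byte.

```c
#include <stddef.h>
#include <stdint.h>

#define FILLER     0xFFu  /* hypothetical idle byte sent when nothing is queued */
#define NO_OP      0x00u  /* hypothetical command byte that needs no reply */
#define QUEUE_SIZE 16u

typedef struct {
    uint8_t tx_buffer;          /* models the one-byte transmit buffer */
    uint8_t queue[QUEUE_SIZE];  /* responses waiting for the transmit buffer */
    size_t head, tail;
} spi_slave_t;

static void slave_init(spi_slave_t *s)
{
    s->tx_buffer = FILLER;      /* start with the buffer already full */
    s->head = s->tail = 0;
}

static void queue_response(spi_slave_t *s, uint8_t byte)
{
    size_t next = (s->tail + 1u) % QUEUE_SIZE;
    if (next != s->head) {      /* drop the byte if the queue is full */
        s->queue[s->tail] = byte;
        s->tail = next;
    }
}

/* Transmit-empty handler: the only place the transmit buffer is refilled. */
static void on_tx_empty(spi_slave_t *s)
{
    if (s->head != s->tail) {
        s->tx_buffer = s->queue[s->head];
        s->head = (s->head + 1u) % QUEUE_SIZE;
    } else {
        s->tx_buffer = FILLER;
    }
}

/* Receive handler: queues a reply but never writes the transmit buffer. */
static void on_rx(spi_slave_t *s, uint8_t cmd)
{
    if (cmd != NO_OP) {
        queue_response(s, (uint8_t)~cmd);  /* made-up reply rule */
    }
}

/* One byte exchanged with the master; returns what the master reads. */
static uint8_t transfer_byte(spi_slave_t *s, uint8_t from_master)
{
    uint8_t to_master = s->tx_buffer;  /* shifter loads at start of the byte */
    on_tx_empty(s);                    /* buffer is now empty, refill at once */
    on_rx(s, from_master);             /* byte finished, handle the command */
    return to_master;
}
```

In this model, sending command 0x10 followed by two NO_OP bytes returns FILLER, FILLER, and then the reply 0xEF. When the command finishes arriving, the transmit buffer has already been refilled for the next byte, so the reply has to wait one more byte; that intervening byte is the dummy the master has to send.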
I don't know whether a general-purpose SPI slave library is a reasonable goal. I tried to make things flexible for my application, but I did not generalize very far. The library that others mentioned gets high data rates, but only by having very high latency: the block of data has to be set up before the command is given. That can be useful for exchanging large quantities of data, but it does not follow the usual command-then-response protocols for SPI communication.