Yeah, W5100 is really horrible over SPI, but it's actually pretty fast in 8 bit parallel bus mode, when used with a chip that does the chip selects and strobes and read/write in a single load/store instruction.
My guess is the original W5100 was probably designed by very good engineers a very long time ago with 8051 chips in mind, and a terrible SPI bus interface was bolted on by junior level people. The experts almost certainly wrote the original socket layer code with the parallel bus in mind. Since then, the microcontroller market moved away from memory buses, and Wiznet seems to have recycled the same basic design with the SPI interface. Judging by the chip's bugs with reset and MISO drive, I'm guessing these changes haven't been made by the highly skilled engineers who must have designed the original chip.
Since then, it seems nobody has really looked at the design decisions in that code in the context of the slower SPI bus. It reads many registers for every operation, which makes sense if you can access a register with a single instruction. But over SPI, reading those registers takes many hundreds of cycles. Especially for TCP with simple code that moves 1 byte at a time (as pretty much all the Arduino examples do), there's massive register reading overhead for every single byte! This is also true with W5200 & W5500. Wiznet's code doesn't even try to leverage the burst modes to read both bytes of 16 bit registers in one transfer. It's pretty amazing they put so little work into the software side, when better software could vastly improve their product's performance.
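To put a rough number on the 16 bit register point, here's a minimal sketch that just counts bytes clocked over the bus. It isn't Wiznet's code or the actual library; `MockSpiChip`, `readFrame`, `read16TwoFrames` and `read16Burst` are made-up names, and the 3 byte frame header is an assumption for illustration (the real header differs between chips).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using std::size_t;
using std::uint8_t;
using std::uint16_t;

// Assumed frame overhead: 2 address bytes + 1 opcode/control byte.
constexpr size_t kHeaderBytes = 3;

// Hypothetical stand-in for a Wiznet-style chip on SPI. It only counts the
// bytes clocked over the bus; it is not a driver for real hardware.
struct MockSpiChip {
    uint8_t regs[256] = {};
    size_t bytesClocked = 0;

    // Read len consecutive bytes in one SPI frame: one header, then the data.
    void readFrame(uint16_t addr, uint8_t *buf, size_t len) {
        assert(len > 0);
        bytesClocked += kHeaderBytes + len;
        for (size_t i = 0; i < len; i++) {
            buf[i] = regs[(addr + i) & 0xFF];
        }
    }
};

// Byte-at-a-time style: each half of the register pays for its own header.
uint16_t read16TwoFrames(MockSpiChip &chip, uint16_t addr) {
    uint8_t hi = 0, lo = 0;
    chip.readFrame(addr, &hi, 1);
    chip.readFrame(static_cast<uint16_t>(addr + 1), &lo, 1);
    return static_cast<uint16_t>((hi << 8) | lo);
}

// Burst style: one header, then both bytes in the same frame.
uint16_t read16Burst(MockSpiChip &chip, uint16_t addr) {
    uint8_t buf[2] = {0, 0};
    chip.readFrame(addr, buf, 2);
    return static_cast<uint16_t>((buf[0] << 8) | buf[1]);
}
```

Under that assumed header size, the two-frame read clocks 8 bytes and the burst read clocks 5 for the same 16 bit value. Multiply that by several register reads per socket operation and it adds up fast.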
A couple months ago, I did quite a bit of code restructuring so the socket state is stored on the Teensy side. That costs some extra RAM, but it allows skipping many slow register reads, because their values are remembered in variables on Teensy. It makes a pretty dramatic speedup for reading TCP data. UDP ends up about the same if you're reading the whole packet all at once, but it helps if you read the packet in small chunks. Unfortunately we don't yet have timer events from the Teensyduino core library, so transmitting TCP is harder to optimize, because I can't leave buffered data in place and have a reliable way to auto-flush it if the program doesn't keep calling the library's functions.
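The receive-side idea can be sketched roughly like this. It's a simplified model, not the actual library code: `MockSocketChip`, `naiveReadByte` and `CachedSocket` are hypothetical names, the "chip" is just a struct that counts register accesses, and the assumption is that each 1 byte read normally touches three registers (received size, read pointer, pointer write-back).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using std::size_t;
using std::uint8_t;
using std::uint16_t;

// Hypothetical model of one socket's RX side. Every register access is
// counted, because over SPI each one is a slow transfer.
struct MockSocketChip {
    uint8_t rxData[64] = {};
    uint16_t rxAvailable = 0;  // models the "received size" register
    uint16_t rxReadPtr = 0;    // models the RX read pointer register
    size_t registerAccesses = 0;

    uint16_t readReceivedSize() { registerAccesses++; return rxAvailable; }
    uint16_t readRxPointer() { registerAccesses++; return rxReadPtr; }
    void commitRxPointer(uint16_t ptr) {
        registerAccesses++;
        rxAvailable = static_cast<uint16_t>(rxAvailable - (ptr - rxReadPtr));
        rxReadPtr = ptr;
    }
    // Buffer data access is not counted as a register access here.
    uint8_t readBuffer(uint16_t ptr) const { return rxData[ptr % 64]; }
};

// Assumed naive pattern: every byte re-reads and re-writes the chip registers.
int naiveReadByte(MockSocketChip &chip) {
    if (chip.readReceivedSize() == 0) return -1;
    uint16_t ptr = chip.readRxPointer();
    uint8_t b = chip.readBuffer(ptr);
    chip.commitRxPointer(static_cast<uint16_t>(ptr + 1));
    return b;
}

// Cached pattern: remember size and pointer in RAM on the microcontroller and
// only touch the chip when the cached count runs out.
struct CachedSocket {
    MockSocketChip &chip;
    uint16_t cachedAvailable = 0;
    uint16_t cachedPtr = 0;
    uint16_t uncommitted = 0;

    explicit CachedSocket(MockSocketChip &c) : chip(c) {}

    // Tell the chip how far we have read, if there is anything to report.
    void flush() {
        if (uncommitted != 0) {
            chip.commitRxPointer(cachedPtr);
            uncommitted = 0;
        }
    }

    int readByte() {
        if (cachedAvailable == 0) {
            flush();  // commit consumed bytes before asking for the new size
            cachedAvailable = chip.readReceivedSize();
            if (cachedAvailable == 0) return -1;
            cachedPtr = chip.readRxPointer();
        }
        uint8_t b = chip.readBuffer(cachedPtr);
        cachedPtr++;
        cachedAvailable--;
        uncommitted++;
        return b;
    }
};
```

In this model, reading 8 buffered bytes one at a time costs 24 register accesses the naive way and 2 with the cached state, plus a commit once the cached count runs out. The tradeoff is a few extra variables per socket, and the chip doesn't learn the buffer space is free until the deferred commit happens.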
Eventually I'll put W5500 code into the lowest level and test with these boards. And someday in the more distant future, I'm going to add an event notification system to the core library. Then I'll be able to really optimize the TCP transmit. But that day is months, maybe even years away...