115200 baud is only 11.52 kbytes/sec, which really is not very fast for Teensy 3.0. I'll try to answer these questions, but first I'd like to just put this speed into perspective. Worrying about these UART specs probably does little good. Your time would be much better spent on the code that actually does whatever you're going to do with the data.
With that in mind, Serial1 and Serial2 run from the CPU clock (which you select from Tools > CPU Speed), and Serial3 runs from the bus clock. With the CPU at 96 MHz, the bus clock is 48 MHz. Otherwise, they're the same. The UART clock only really affects which baud rates you can generate, not how much CPU overhead the UART causes. But unless you're trying to generate a very fast baud rate, even that really doesn't matter much. Even at 24 MHz, the fractional divider creates 115200 with only -0.08% error. These speeds are just so slow compared to the hardware's capability that it really isn't worth worrying about. This page has a table of the baud rate errors; nearly every entry is essentially zero:
http://www.pjrc.com/teensy/td_uart.html
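If you're curious where that -0.08% comes from: the Kinetis UART derives its baud rate from a 13-bit integer divider (SBR) plus a 5-bit fine-adjust field (BRFA, in steps of 1/32). Here's a quick sketch of the arithmetic for a 24 MHz module clock; it's just an illustration you can run on a PC, not driver code:

```cpp
// Back-of-envelope check of the -0.08% figure, using the Kinetis UART's
// divider formula: baud = module_clock / (16 * (SBR + BRFA/32)).
#include <cstdio>
#include <cmath>

int main() {
    const double clock = 24e6, target = 115200;
    double ideal = clock / (16 * target);             // 13.0208... (can't hit exactly)
    int sbr  = (int)ideal;                            // integer divider: 13
    int brfa = (int)std::lround((ideal - sbr) * 32);  // nearest 1/32 step: 1
    double actual = clock / (16 * (sbr + brfa / 32.0));
    printf("actual baud = %.1f, error = %+.2f%%\n",   // ~115107.9, ~-0.08%
           actual, (actual - target) / target * 100);
    return 0;
}
```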
The FIFO does reduce the number of interrupts. But again, let's keep this in perspective. The interrupt takes about 1.5 us. A single byte in 8N1 format at 115200 baud takes about 87 us. So even without the FIFO, the interrupts are only taking approximately 2% of the CPU time at this slow baud rate. Pulling the bytes from the buffer with Serial1.read() or Serial3.read() takes about 1 more microsecond per byte.
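That 2% estimate is easy to check with back-of-envelope arithmetic; 8N1 means 1 start + 8 data + 1 stop = 10 bits on the wire per byte:

```cpp
// Quick sanity check of the ~2% interrupt overhead estimate.
#include <cstdio>

int main() {
    double byte_us = 10.0 / 115200 * 1e6;  // ~86.8 us for one 8N1 byte at 115200 baud
    double isr_us  = 1.5;                  // ~1.5 us per interrupt, one per byte without the FIFO
    printf("overhead: %.1f%%\n", isr_us / byte_us * 100);  // prints ~1.7%
    return 0;
}
```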
All the Teensy3 UARTs support DMA, but there is currently no code written to use it. While you could try to write DMA-based drivers, for such slow baud rates it would be pointless. The total overhead is under 3% of the CPU time. With DMA, you might get that down to almost zero, but it's tremendously complex, for only a tiny gain.
For only 115200 baud, I'd just use Serial3 as-is. If you're concerned about efficiency, write code that calls Serial3.available() only once and then calls Serial3.read() many times, depending on how many bytes are available (or how many you're prepared to use). If you're just copying data from one place to another, it's hard to burn lots of CPU time, but if you're parsing it, your optimization efforts should definitely focus on crafting efficient parsing code.
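For example, something along these lines; handleByte() is just a placeholder for whatever your program does with each byte, not a real API:

```cpp
void handleByte(byte b) {
  // placeholder: copy, parse, or otherwise use the byte here
}

void setup() {
  Serial3.begin(115200);
}

void loop() {
  int n = Serial3.available();    // check once; several bytes may be waiting
  while (n-- > 0) {
    handleByte(Serial3.read());   // each read() costs roughly 1 more microsecond
  }
  // ... other work can happen here without blocking on serial ...
}
```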
A good way to test your code is to have it parse some dummy data from an array. You can call micros() before and after digesting a block of data, to get an idea of how many microseconds your algorithm needs per byte. If you discover you really do need 80-some microseconds for each byte, then aggressively pursuing driver-level optimizations might free up that extra few percent of CPU time to enable your algorithm to keep up with the data rate. But 80 us is a very long time for a 32 bit ARM processor. Unless you're doing something really, really complex, it's very likely your code will need much less time per byte. If that's the case, like 40 us or less, I'd just use Serial3, knowing your program will spend about half of the CPU time just waiting for new data to arrive.
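A minimal harness for that measurement might look like this; testdata, its size, and parseByte() are all stand-ins for your own dummy data and parsing algorithm:

```cpp
byte testdata[512];               // fill with representative dummy data

void parseByte(byte b) {
  // placeholder: your actual parsing code goes here (if it's trivial,
  // the compiler may optimize the loop away, so test the real thing)
}

void setup() {
  Serial.begin(9600);             // USB serial, for printing the result
  delay(2000);                    // give the serial monitor time to connect
  unsigned long t0 = micros();
  for (int i = 0; i < 512; i++) {
    parseByte(testdata[i]);
  }
  unsigned long t1 = micros();
  Serial.print("approx us per byte: ");
  Serial.println((t1 - t0) / 512.0);
}

void loop() {
}
```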
For libraries that are reused for many projects, I'll put an incredible amount of work into optimizations. But for a specific application that has a fixed maximum speed, the very best thing you can do is measure the actual CPU time your specific code needs. If it's already much faster than the data can ever arrive, which it probably will be, I'd put my time into other things... like having a beer!