Wow, that's an impressive number of detail-oriented technical questions to pack into one message!
Before answering any, I'd like to repeat the notion that Arduino programming is somewhat like PC operating system programming, where normally you use higher-level functions and libraries. It's also like a PC operating system in that numerous software components come from many different people. Much like on Windows, Linux, and MacOS, you can't just ask about worst-case interrupt response latency for all cases, because there's a virtually limitless number of combinations of software components from numerous different sources, few of which specify such details.
1) Since the ISR vector takes care of the registers, is it a true save onto the stack where multiple interrupts within low and high priority can nest or is it similar to the old Z80 and other microcontrollers that only have a fixed set or fixed stack (even microchip stack has a limit on entering return addresses)?
Yes, it's a true RAM-based stack, limited in size only by the available RAM you're not using for other stuff.
Many of the old PIC chips use a small, fixed-size stack in dedicated registers. ARM uses RAM.
Similarly, Microchip's PIC architecture has numerous "features" which are efficient for small programs carefully crafted by a single programmer, usually in assembly, or in C with a very detailed focus on the specific hardware. It's been a long time since I've programmed a PIC, but I remember well the pain of the tiny stack, limited directly addressable RAM (and bank swapping), using a single pair of special registers for indirect addressing, running everything through the "W" accumulator, complex sequences of skip instructions needed for efficient 16 & 32 bit math, and lots of other stuff that makes sense for small, assembly-oriented development of fairly simple programs, but becomes rather horrible when scaled up to larger programs with complex algorithms.
ARM's architecture is generally the exact opposite. It's designed to scale. All registers are 32 bits. All registers can act as pointers to the entire 4 Gbyte memory space. All peripherals are memory mapped into that single 4 GB address space (no special memory regions). Internally, the CPU implements dual 32 bit buses, which feed into a multi-master switched bus matrix, allowing simultaneous access to both CPU buses, the DMA engine, and DMA-based peripherals like USB. All math operations use any combination of registers, push and pop can handle any combination of registers in a single instruction, special if/else handling allows common small conditionals to not incur extra cycles from branching. It's extremely well thought-out for C compilers. From a CPU design perspective, it's pretty much exactly the opposite of Microchip's earliest PIC designs. For small and simple code, there's extra overhead. For larger, complex software, it performs extremely well.
2) In timer ISR intended for timer ticks, normally you will need to turn off interrupts and take into consideration the negative number in the counter to correctly load the timer to minimize jitter and drift, is that capability possible in this timer/counter implementation?
Oh how this brings back long-forgotten PIC memories from the mid-1990s!
Technology has changed, quite a lot, since those days. Imagine someone from the 1960s seeing today's televisions. They're color, and flat, and hi-def, and stream digital video from a global computer network.
Likewise, as transistor density has increased and IC design has moved from transistors and gates to very high level synthesis languages, once-exotic timer features like automatic reload became standard fare. The modern microcontroller market is highly competitive, and with everyone licensing the same ARM CPU core, peripheral features are a place vendors can add extra value for little extra cost. Like TVs, hardware timers these days come with tons of advanced features.
Truly, implementing manual reload with careful cycle counting is a relic of the past, much like black and white television!
Clearly even with an extensive library and support, a good understanding of the internal resources and specifics yields proper software.
Certainly you can learn about all the hardware resources in the reference manual.
http://www.pjrc.com/teensy/K20P64M72SF1RM.pdf
Most of those peripherals already have good software support for the most common usages, so there's probably not a lot of value in studying the reference manual, unless you're using the timers directly for their advanced features. Well, the other case where you'd need to study the details is using the DMA engine in highly creative ways, to move data between peripherals.
Probably the more important reference, if you're really concerned about proper software design, would be regarding the ARM CPU. For that, you really want this book:
http://www.amazon.com/Definitive-Cortex®-M3-Cortex®-M4-Processors-Edition/dp/0124080820
If you read this book and compare to the old PIC chips, the level of capability and sophistication in the ARM Cortex-M chips is pretty astounding. In particular, the NVIC's automatic handling of prioritized interrupt nesting and hardware optimizations like tail chaining are features pretty much unimaginable in the old world of 8 bit PIC chips!
Do the various libraries that handle particular things like serial or other comms just poll or are they interrupt driven (as quality software would be).
Each peripheral is handled differently. Some have multiple libraries.
The 3 serial ports are interrupt driven, with RAM-based buffers. Two of those also use hardware FIFOs, to reduce the number of interrupts during continuous data flow.
The USB port is DMA-based for extremely efficient operations, double-buffered for packets at the hardware level, and extended to a larger pool of packet buffers by interrupts.
SPI is usually polled.
I2C uses a mix of polling and interrupts.
ADC usage by analogRead is polled.
I2S, ADC and DAC for audio (the audio library) use DMA and interrupts.
I see the fantastic power provided by these low cost boards and the high performance applications possible with some of the shields and such.
Yes, indeed.
Imagine trying to use an old PIC chip to play 30 Hz video from SD card to 4320 RGB LEDs (which requires a special, tight-timing 800 kbps waveform), while also streaming 44.1 kHz audio:
http://community.arm.com/groups/embedded/blog/2014/05/23/led-video-panel-at-maker-faire-2014
Teensy 3.1 can do it pretty easily!
Imagine trying to develop all that from scratch based only on the hardware registers, working out all the details of DMA transfers and interrupts. Leveraging existing libraries makes the software side of such projects fairly easy.