Yes it's these types of instructions that I intended to talk about; I did assume that it would generally pick those up automatically ...
No, the compiler pretty much never uses the special DSP extension instructions.
If you look at the audio library code, they're defined in utility/dspinst.h as inline functions with inline assembly. If you look around the rest of the source, you'll find those function names used in many locations.
You'll see names like "16b" and "16t", meaning the bottom or top 16 bits of a 32 bit variable are used as a signed integer. Despite all the talk of SIMD, the most common speedup these instructions offer isn't the actual math, it's allowing 16 bit signed integers to be packed into 32 bit data. This doubles the bandwidth of reading and writing samples from RAM, and it effectively doubles the amount of data you can keep in the CPU's registers. Well, only if all the stuff you want to do with the 16 bit numbers can be done with these instructions. As soon as you need to mask off 16 bits or shuffle samples around to get them into their own registers, you've lost the speedup.
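To make the "16b" / "16t" idea concrete, here is a minimal sketch in plain C of what the packed layout looks like. The function names are made up for illustration and are not the ones in utility/dspinst.h, and this is portable C rather than inline assembly, so it only models the arithmetic. On the real hardware the point is that a single instruction reads the half it wants directly, with no separate shift or mask.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only (not the audio library's
   functions). They model two signed 16 bit samples in one 32 bit word. */

/* Pack two samples: 'top' lands in bits 31..16, 'bottom' in bits 15..0. */
static inline uint32_t pack_pair(int16_t top, int16_t bottom)
{
    return ((uint32_t)(uint16_t)top << 16) | (uint16_t)bottom;
}

/* "16b": the bottom 16 bits, read as a signed integer.
   (The narrowing cast wraps to two's complement on common compilers.) */
static inline int16_t bottom16(uint32_t word)
{
    return (int16_t)(word & 0xFFFF);
}

/* "16t": the top 16 bits, read as a signed integer. */
static inline int16_t top16(uint32_t word)
{
    return (int16_t)(word >> 16);
}

/* Model of a 16b x 16b multiply: bottom halves of both words, 32 bit result. */
static inline int32_t mul_16b_16b(uint32_t a, uint32_t b)
{
    return (int32_t)bottom16(a) * (int32_t)bottom16(b);
}

/* Model of a 16t x 16t multiply: top halves of both words, 32 bit result. */
static inline int32_t mul_16t_16t(uint32_t a, uint32_t b)
{
    return (int32_t)top16(a) * (int32_t)top16(b);
}
```

With x = pack_pair(-3, 7) and y = pack_pair(5, -2), mul_16t_16t(x, y) multiplies -3 by 5 and mul_16b_16b(x, y) multiplies 7 by -2, and neither sample ever had to be moved into its own register.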
There's also a really useful 16x32 multiply which produces a 48 bit result, and then discards the low 16 bits (with or without roundoff). That turns out to be really handy, as you can see from its usage in many places.
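As a rough model of that multiply, here's what it computes written out in ordinary C. Again the names are hypothetical, and the rounded version assumes "roundoff" means adding half of the discarded range before the shift. Note that C needs a 64 bit temporary to express this, which is exactly the cost the single instruction avoids.

```c
#include <stdint.h>

/* Hypothetical names, plain C model only. The real thing is one
   instruction with a single 32 bit output register. */

/* 32x16 multiply: the product fits in 48 bits, keep the upper 32.
   (Right-shifting a negative value is arithmetic on common compilers.) */
static inline int32_t mul_32x16_keep_top32(int32_t a, int16_t b)
{
    int64_t product = (int64_t)a * b;   /* full 48 bit product */
    return (int32_t)(product >> 16);    /* discard the low 16 bits */
}

/* Same, but rounding to nearest instead of truncating
   (assumed meaning of "with roundoff"). */
static inline int32_t mul_32x16_keep_top32_rounded(int32_t a, int16_t b)
{
    int64_t product = (int64_t)a * b;
    return (int32_t)((product + 0x8000) >> 16);
}
```

For example, a 32 bit value of 1000000 times a 16 bit coefficient of 16384 (one quarter of 65536) gives 250000, so the 16 bit operand acts as a fractional gain on a 32 bit accumulator.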
Much of the optimization effort revolves around planning how much fits into the CPU's registers. Cortex-M4 also has a memory optimization, where the first read/write takes 2 cycles, but subsequent ones happen with only 1 cycle. So if you're able to dedicate 4 of the 13 available ARM registers to audio samples, you can bring in 8 samples with only 5 clock cycles (4 reads at 2+1+1+1 cycles, each holding 2 packed samples). It all becomes a game of doing the most you can with the registers, without forcing the compiler to spill anything onto the stack. That 16x32 multiply is particularly handy, since its result takes only a single output register and doesn't temporarily waste another register just to handle the intermediate 48 bit result and discard the low 16 bits.
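Here's a small sketch of the "4 registers, 8 samples" pattern, under the assumption that the samples are already packed two per 32 bit word. The function and helper names are made up. The four reads are grouped together so they can happen back to back (2+1+1+1 = 5 cycles by the numbers above), though plain C can't guarantee the compiler schedules them that way. The shifts and casts here are only how portable C spells out the halves; in real optimized code the DSP instructions would consume them directly.

```c
#include <stdint.h>

/* Hypothetical helpers: signed low and high halves of a packed word. */
static inline int32_t lo_half(uint32_t w) { return (int16_t)(w & 0xFFFF); }
static inline int32_t hi_half(uint32_t w) { return (int16_t)(w >> 16); }

/* Sum 8 samples stored as 4 packed 32 bit words. */
static int32_t sum8(const uint32_t *packed)
{
    /* Four word reads grouped together: 8 samples held in 4 locals,
       which the compiler can keep in 4 registers. */
    uint32_t w0 = packed[0];
    uint32_t w1 = packed[1];
    uint32_t w2 = packed[2];
    uint32_t w3 = packed[3];

    /* All the math happens on the locals, with no further memory access. */
    int32_t sum = 0;
    sum += lo_half(w0) + hi_half(w0);
    sum += lo_half(w1) + hi_half(w1);
    sum += lo_half(w2) + hi_half(w2);
    sum += lo_half(w3) + hi_half(w3);
    return sum;
}
```

The design choice is simply to separate "bring everything in" from "do the math", so the working set is known up front and can be sized to the registers you can spare.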
This sort of planning, where you shape the algorithm and data flow around the register set and around special instructions that don't fit C's semantics of 8, 16, 32 and 64 bit variables, is far beyond what the compiler can do when optimizing ordinary code.