Here's a link to the code.
https://github.com/PaulStoffregen/A...60dfecd95f42eebc7564a8a/utility/dspinst.h#L81
As you can see, it's just inline assembly that becomes a single SMULWT instruction.
To look up what SMULWT really does, you'll need the ARM Architecture v7M Reference Manual. Google "DDI0403E" to get the PDF. Turn to page 413 for the documentation you seek.
The required format is simply an array of "int16_t". You do not need to use these special functions.
These functions can allow code to run much faster, but in practice they are tricky to use, and actually achieving the higher performance can be quite difficult. For the sake of learning, I recommend writing simple, normal C-style code without these special functions.
With that in mind, most of the speed advantage comes from two things. First, memory access is done as 32 bit integers. Transferring 2 samples at once is twice as fast, since the CPU and memory are all 32 bits wide. Cortex-M4 also has an optimization where subsequent memory accesses through the same pointer take only a single cycle, rather than the usual 2 cycles. So reading four 32 bit numbers brings in 8 audio samples using only 5 cycles. That's much faster than the simple approach, which would use 16 cycles.
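To make the packed-access idea concrete, here's a rough sketch in plain C. The function name is mine, not from the audio library, and I'm using memcpy to express the 32 bit load (the compiler folds it into a single load instruction, and it avoids the strict-aliasing trouble a raw pointer cast invites). On little-endian Cortex-M, the low half of each word is the earlier sample.

```c
#include <stdint.h>
#include <string.h>

/* Sum int16_t samples, fetching two per 32-bit memory access.
   Assumes count is even. */
static int64_t sum_samples_packed(const int16_t *samples, int count)
{
	int64_t sum = 0;
	for (int i = 0; i < count; i += 2) {
		uint32_t pair;
		/* one 32-bit load brings in two 16-bit samples */
		memcpy(&pair, &samples[i], sizeof(pair));
		sum += (int16_t)(pair & 0xFFFF); /* first sample (low half) */
		sum += (int16_t)(pair >> 16);    /* second sample (high half) */
	}
	return sum;
}
```

The real payoff on Cortex-M4 comes from doing several of these loads back to back through the same pointer, so only the first costs 2 cycles.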
The other half of the speedup comes from those instructions which access half of a 32 bit register. That 32x16 one is really nice because it produces a 48 bit result, then discards the low 16 bits. That means the result takes only 1 of the ARM's registers. How many registers are used becomes a critically important concern when trying to use these special instructions to speed up your code. The general strategy is to bring in as many samples as you can, partly to gain the memory access speedup (less of a factor on Teensy 4.x) and partly to process more samples per loop iteration, so less time is spent on looping overhead (also less of a factor on Teensy 4, where we have branch prediction).
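For reference, the arithmetic SMULWT performs can be modeled in portable C like this. This is a sketch of the instruction's semantics as documented in the manual, not the actual inline-assembly function from dspinst.h, and the function name is mine.

```c
#include <stdint.h>

/* Model of SMULWT: signed 32x16 multiply using the TOP half of b.
   The full product would need 48 bits; the low 16 bits are dropped,
   so the result fits back into a single 32-bit register. */
static int32_t smulwt_model(int32_t a, uint32_t b)
{
	int16_t top = (int16_t)(b >> 16);   /* top 16 bits, treated as signed */
	int64_t product = (int64_t)a * top; /* significant result fits in 48 bits */
	return (int32_t)(product >> 16);    /* discard the low 16 bits */
}
```

Because the other operand comes from the top half of a register, the low half is free to hold the next sample, which is exactly why packing two samples per register pairs so well with these instructions.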
In practice, this sort of optimization work involves inspecting the assembly code the compiler generates, to check whether the registers are actually being used as you intended. When the compiler needs to "spill" local variables onto the stack, you lose any performance advantage. It's tedious work, and difficult if you're not familiar with ARM assembly language, all to get the same results computed in fewer clock cycles.
You don't need to do this sort of optimization. If you're still learning and developing your algorithm, writing ordinary C code is much simpler. I highly recommend learning and testing with ordinary code. Using those special instructions is usually only done after you've made the normal way work, so the normal code serves as a test case against which to compare the optimized results.