So my question is... How does this board function in terms of DSP performance?
As a very rough guideline, the DSP performance is approximately 11 times what you had with Teensy 3.2. So as a very crude guess, if you were able to synthesize 2 voices on Teensy 3.2, you'll probably be able to do 22 on Teensy 4.0.
And how big is the performance tax of using floating point numbers?
For "normal" C / C++ code, you might expect somewhere between 10 and 40% less. While 32 bit float operations are approximately the same speed as 32 bit integers, Cortex-M7 has 2 integer execution units but only 1 FPU. Benchmarking shows about 50% speedup relative to Cortex-M4 at the same clock speed. But not all of that 50% comes from the dual issue pipeline being able to sometimes execute 2 instructions in the same clock cycle. Some of it comes from branch prediction, which M7 has but M4 does not.
But some of the DSP code in the audio library is anything but normal. For highly optimized code taking advantage of the DSP extension instructions, especially the SIMD multiply-accumulate, and using tricks like packing 16 bit samples into the 32 bit registers (double the memory bandwidth), you could expect a huge hit in performance by converting to "normal" programming. Code using those sorts of intense optimizations does DSP on 16 bit samples much faster than more ordinary programming can accomplish.
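To make the packing trick concrete, here is a minimal portable sketch of a dual 16 bit multiply-accumulate on samples packed two per 32 bit word. This is not the audio library's actual code. The names pack16, dual_mac and dot_packed are made up for illustration, and the 64 bit accumulator is a choice made here to keep the portable version free of overflow. On Cortex-M4 / M7 the body of dual_mac corresponds to what a single DSP extension instruction such as SMLAD does (with a 32 bit accumulator), which is where the speed comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack two signed 16 bit samples into one 32 bit word (first sample in the low half).
inline uint32_t pack16(int16_t lo, int16_t hi) {
    return static_cast<uint32_t>(static_cast<uint16_t>(lo)) |
           (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16);
}

// Portable stand-in for a dual 16x16 multiply-accumulate:
//   acc + lo(a)*lo(b) + hi(a)*hi(b)
// Hypothetical helper; real DSP code would use the hardware instruction instead.
inline int64_t dual_mac(uint32_t a, uint32_t b, int64_t acc) {
    const int32_t a_lo = static_cast<int16_t>(a & 0xFFFFu);
    const int32_t a_hi = static_cast<int16_t>(a >> 16);
    const int32_t b_lo = static_cast<int16_t>(b & 0xFFFFu);
    const int32_t b_hi = static_cast<int16_t>(b >> 16);
    return acc + static_cast<int64_t>(a_lo) * b_lo
               + static_cast<int64_t>(a_hi) * b_hi;
}

// Dot product of two packed buffers: each loop iteration loads one word
// from each buffer and processes two samples.
inline int64_t dot_packed(const uint32_t* x, const uint32_t* h, size_t words) {
    assert(x != nullptr && h != nullptr);
    int64_t acc = 0;
    for (size_t i = 0; i < words; ++i) {
        acc = dual_mac(x[i], h[i], acc);
    }
    return acc;
}
```

A "normal" version would load each 16 bit sample separately and do one multiply per sample, so it needs roughly twice the loads and twice the multiply instructions for the same work.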
It starts sounding pretty crunchy with 4 voices playing at once. In some cases (lots of effects and stuff) even just 1 voice sounds crunchy.
...
If I am able to continue working on the synth, I think I'd like to fork the Audio library and have it use float all the way through (until the codec obviously). That should hopefully really help clean the sound up.
I believe Chip (the guy working on that hearing aid project) made a fork of the library using floats. Maybe that might help, or at least give you a head start?
Whether it actually helps is a good question. If you decide to publish your synth code and you can give a reproducible test case that demonstrates the "crunchy" sound, I'd be curious to take a look.
If the issue is 16-bits, using float will help a little, but you've only moved the goalpost out a little. Float has 23 bits of precision for the mantissa, plus the hidden bit, for 24 bits of effective precision.
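A quick way to see the 24 bit limit is to round-trip integers through float. The sketch below is only an illustration. The helper name survives_float_round_trip and the range limit in the assert are assumptions made here, not anything from the audio library.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Number of significand bits in float, counting the hidden bit.
// This is 24 on any platform using IEEE 754 single precision.
constexpr int kFloatPrecisionBits = std::numeric_limits<float>::digits;

// Returns true if the integer v is exactly representable as a float.
// Hypothetical helper for illustration only.
inline bool survives_float_round_trip(int32_t v) {
    // Keep |v| well inside int32 range so the float -> int conversion is defined.
    assert(v >= -(1 << 30) && v <= (1 << 30));
    return static_cast<int32_t>(static_cast<float>(v)) == v;
}
```

Every 16 bit sample survives the round trip, and so does 16777216 (2^24), but 16777217 does not. So float gives about 8 more bits than int16 at any one moment, not unlimited resolution.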
Floats can really help if something is clipping, or if some part of the algorithm is using relatively few bits, like the FreeVerb code might be doing. The automatic scaling by the exponent can be really nice.
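Here is a minimal sketch of the clipping case, assuming a made-up signal chain where two voices are mixed and the result is then attenuated by half. The function names are hypothetical and this is not how the audio library's mixer is written. It only shows why an intermediate stage that saturates at 16 bits loses information that a float intermediate keeps.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Clamp a 32 bit value into the int16 range.
inline int16_t saturate16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return static_cast<int16_t>(v);
}

// Integer path: the mix is stored as int16, so it clips before the gain stage.
inline int16_t mix_then_halve_int16(int16_t a, int16_t b) {
    const int16_t mixed = saturate16(static_cast<int32_t>(a) + b);  // may clip here
    return static_cast<int16_t>(mixed / 2);
}

// Float path: the intermediate may exceed +/-1.0 without harm;
// conversion back to int16 happens only once, at the end.
inline int16_t mix_then_halve_float(int16_t a, int16_t b) {
    const float fa = a / 32768.0f;
    const float fb = b / 32768.0f;
    const float out = (fa + fb) * 0.5f;
    assert(std::isfinite(out));
    return saturate16(static_cast<int32_t>(std::lround(out * 32768.0f)));
}
```

With two samples of 30000, the integer path clips the sum to 32767 and outputs 16383, while the float path outputs the correct 30000. For quiet inputs that never clip, both paths give the same result.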
But ultimately the DAC output is an integer with effectively 15 to at best 18 bits. DACs that really use more than 16 bits are rare. Sure, they all say 24 bits, but the noise floor makes those low 6 to 9 bits worthless. Like so much consumer and even pro audio stuff, it's all a bunch of gaming the numbers. Nearly all DACs have A-weighted specs, which lets them claim a somewhat higher number like 100 to 110 dB dynamic range (but rarely SINAD - the true measure of the DAC's perfection). By the time you take away the spectral weighting and divide by ~6 for effective number of bits, the reality is even the very best DACs are using barely more than 16 bits.
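For anyone who wants to run the numbers, here is a small sketch of the dB to bits conversion. The constants 6.02 and 1.76 come from the standard formula for an ideal quantizer driven by a full-scale sine wave (SNR of about 6.02 * N + 1.76 dB), not from any particular DAC datasheet, and the function names are made up for this example. Strictly, the formula expects an unweighted SINAD figure, so feeding it an A-weighted dynamic range number gives an optimistic answer.

```cpp
#include <cassert>

// Effective number of bits from an unweighted SINAD figure in dB,
// using the ideal-quantizer relation SNR = 6.02 * N + 1.76 dB.
inline double effective_bits(double sinad_db) {
    assert(sinad_db > 0.0);
    return (sinad_db - 1.76) / 6.02;
}

// The back-of-envelope version: about 6 dB per bit.
inline double rough_bits(double db) {
    return db / 6.0;
}
```

Plugging in 100 dB gives a bit over 16 bits and 110 dB gives about 18 bits, and that is before removing the A-weighting advantage.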