I'm really interested in what you are doing as well. If I may offer a suggestion: ideally, your module could output a 0-1 signal corresponding to pitch (i.e., a note-to-CV converter), which could then be "patched" to drive other filters and oscillators. Since you mentioned making a module, you could then process it as a control signal with filters and mixers without writing code.
I am a little confused about what you mean by the note-to-CV conversion and how that would enable this. Could you provide more details about exactly what you have in mind?
Also let me give more details about what I am doing since you are interested.
Let me break down my approach:
The algorithm can be separated into two parts: frequency detection and frequency shifting. The detection in this case is done by the notefreq module. In my main loop I check if it is available, and if so I take the notefreq output and determine the closest note (for now I am using the full 12-note chromatic scale, but I am working on adding specific scales). After determining the closest note, I take the ratio of the closest note's frequency (let's call this CF) to the actual frequency (F): ratio = CF/F. This value is then set as a parameter in my pitch-shifting code, and each block is shifted by this ratio.
The shifting is done using an STFT-based shifting algorithm sourced from here.
A few things I have noticed:
1) The algorithm requires that the pitch detection be very fast and stable. This is something Antares addressed specifically in their patent for the original Auto-Tune (which expired in 2018). Their approach is quite different from the Yin algorithm, but the original Yin paper claims, with supporting evidence, that Yin is more robust than the autocorrelation-based approach Antares approximates in their work. So I am assuming that the Yin algorithm should be suitable for this application, though I suspect the approximation method Antares uses may still be faster.
2) The pitch shift is generally in the range of ±10 Hz at most, and usually even less (this will depend on the user's vocal range, but for me it is generally around 3-5 Hz). That requires a very accurate pitch-shifting algorithm. The 128-sample block size used here is restrictive in this case because it limits the frequency resolution of the STFT. This is an issue I have yet to work out how to solve, but the obvious answer would be to increase the buffer size. By my estimate, I will need at least 1024 samples to come close to the accuracy I need, and even more would be better. This creates memory issues as well as latency issues, so I will need to put some serious thought into it. There are some limits at play here that are not related to the processing power of the Teensy (resolution vs. latency) that I am still trying to reconcile.
Again, the original Antares algorithm uses a different approach for both the detection and the shifting. I will likely need to implement these eventually, but I am not looking forward to that, so I am holding out hope that I can get things working without it.
Throughout this process I have gained a great deal of respect for the ingenuity of the Antares Auto-Tune algorithm developed by Harold A. Hildebrand. Interestingly enough, it was a result of his work in seismic signal processing.
Best wishes!