Constant-Q Transform on T4 w/ Audio Shield


Audio951

New member
Hello!

For the past few months I've been bouncing this idea around and scrounging for resources on how to implement a CQT on the T4, so that I can drive WS2812B LEDs on a chromatic scale rather than from the linear frequency bins provided by the default FFT implementation.
I'm curious if anyone is aware of any resources on this subject as it relates to the T4 specifically, and on integrating it as an extension of the AudioStream class.
I took a look at the AudioSDR project to see what might be useful there, but it's so massive it's hard to tell what the overall flow is and how the data is manipulated.
I also looked at the CMSIS DSP library to see if there was an implementation available but could not find anything.

I only need to drive 96 LEDs: 8 octaves of 12 half steps starting at E0.
Technically even that is a little overkill, but it's the absolute maximum I can foresee being useful, and it seems to be typical for a CQT implementation to support at a 44,100 Hz sample rate.


Any guidance or help would be massively appreciated - thanks in advance for your time!

-Audio
 
As Q is related to the time-frequency product, you "only" need to implement 96 DFTs, each with its own window size.
Edit: Not sure if there is a DFT object in the Audio library, but you can easily write one yourself
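For what it's worth, a minimal sketch of such a single-bin DFT in the Goertzel form, with the window length chosen per note as N_k = Q * fs / f_k (everything here is illustrative, not an existing Audio library object):

Code:
#include <math.h>

// One constant-Q bin as a single-bin DFT (Goertzel recurrence).
// Lower notes get a longer window, so every bin keeps the same Q.
struct CqtBin {
  float coeff;   // 2*cos(2*pi*f/fs) for the recurrence
  int   N;       // window length in samples
};

CqtBin makeBin(float noteHz, float sampleRate, float Q) {
  CqtBin b;
  b.N = (int)roundf(Q * sampleRate / noteHz);                     // N_k = Q*fs/f_k
  b.coeff = 2.0f * cosf(2.0f * (float)M_PI * noteHz / sampleRate);
  return b;
}

// Run the recurrence over one block of b.N samples; returns |X(f_k)|^2.
float cqtPower(const CqtBin &b, const float *x) {
  float s1 = 0.0f, s2 = 0.0f;
  for (int n = 0; n < b.N; n++) {
    float s0 = x[n] + b.coeff * s1 - s2;
    s2 = s1;
    s1 = s0;
  }
  return s1 * s1 + s2 * s2 - b.coeff * s1 * s2;
}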
 
The FFTs in the audio lib are 256 and 1024 point, locked to the sample rate.

One FFT can do the full frequency analysis; then you would apply different frequency windows to the results to sum power for each target frequency band, for an arbitrary set of output frequencies/Qs.
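For what it's worth, a rough sketch of that single-FFT, bin-summing idea with the stock AudioAnalyzeFFT1024 object (the E0 starting note is the OP's; line-in input, quarter-tone band edges, and the LED mapping are just assumptions):

Code:
#include <Audio.h>

// Illustrative sketch: one FFT, then per-note frequency windows summed from its bins.
AudioInputI2S        audioIn;
AudioAnalyzeFFT1024  fft;                 // 1024-point FFT, ~43 Hz per bin at 44.1 kHz
AudioConnection      patch(audioIn, 0, fft, 0);
AudioControlSGTL5000 codec;

const int   NUM_NOTES = 96;
const float BIN_HZ    = 44100.0f / 1024.0f;

void setup() {
  AudioMemory(12);
  codec.enable();
  codec.inputSelect(AUDIO_INPUT_LINEIN);
}

void loop() {
  if (!fft.available()) return;
  for (int k = 0; k < NUM_NOTES; k++) {
    float f  = 20.6f * powf(2.0f, k / 12.0f);       // half steps starting at E0 (~20.6 Hz)
    float lo = f * powf(2.0f, -1.0f / 24.0f);       // quarter-tone band edges
    float hi = f * powf(2.0f,  1.0f / 24.0f);
    int b0 = (int)(lo / BIN_HZ);
    int b1 = (int)(hi / BIN_HZ);
    float level = fft.read(b0, b1);                 // sums the magnitudes of bins b0..b1
    // map 'level' to the brightness of LED k here
  }
}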
 
+1 on combining bins. Say you downsample to 11 ksps, then run a 22,000-sample FFT. Each bin would represent 0.5 Hz.

The DD4WH code is an example of large FFTs.
 
>> +1 on combining bins. Say you downsample to 11 ksps, then run a 22,000-sample FFT. Each bin would represent 0.5 Hz.
>> The DD4WH code is an example of large FFTs.

While we do all agree that the FFT approach is probably superior to a bank of 96 parallel filters, do be aware that the FFT approach more clearly exposes the inherent trade between frequency resolution and time response.

For example, if you were to really want 0.5 Hz frequency resolution as suggested above, you need to wait for all the data to arrive, which in this case is: 22,000 samples / 11,000 samples/sec = 2.0 seconds of data. Assuming that you use a traditional tapered window (or whatever), the meat of the FFT is the middle of that 2-second sample. So, on average your display will lag your audio by 1.0 second. If you're watching this display along with your music, having your meters be a whole second behind the sound...that will be *very* dissatisfying.

Also, in addition to this latency, your meters will have a very slow update rate. The problem is that you need to wait *another* 1 second for enough fresh audio to move into the meat of your FFT for the FFT output to be appreciably different. Yes, you can do highly overlapping FFTs, but the results of successive FFTs will be highly correlated. So, you don't get more time resolution, you just get results that are better smoothed in time. Fast audio events are inherently smoothed out when doing an analysis with a high frequency resolution.

By contrast, you could back off on the frequency resolution to reclaim a faster update rate. If you did a 2048pt FFT at a sample rate of 22 kHz, your bins would be 11 Hz wide and you'd get fresh updates at 9 Hz. With bins that are 11 Hz apart, you won't get your 1/12th octave resolution until you get up to note G3 (it's at G3 that the chromatic notes start being separated by 11 Hz). While this lack of note resolution isn't great, updating the display 9 times a second is *way* more fun to watch.

So, you have to decide what you want for your display: superfine frequency resolution, fast update rates, or something dialed in between. Or, you could attempt doing multiple FFTs at different resolutions so that you get fast updates (with modest frequency resolution) for the mid and high frequencies while also getting high frequency resolution at low frequencies (though with a slow update rate). You have many choices!
 
Agreed, the typical person watching is likely to prefer larger bins and less delay. You can still use the same number of lights; most people probably won't notice the lack of accuracy on the low end.
 
The OP and title ask about constant Q, not "logarithmic" spectrograms.
Constant Q implies "logarithmic" spacing of frequencies AND therefore also logarithmic changes in window size.
Simply summing frequency bins does not give you constant Q.
 
>> The OP and title ask about constant Q, not "logarithmic" spectrograms.
>> Constant Q implies "logarithmic" spacing of frequencies AND therefore also logarithmic changes in window size.
>> Simply summing frequency bins does not give you constant Q.

The number of bins that you sum will be proportional to your center frequency. So, at G3, you might have just the one bin. At G4, you'd sum 2 bins. At G5, 4 bins; and so on. That's how one can approximate a constant Q filterbank given a linearly-spaced frequency analysis.

For sure, this is a cheesy approximation for constant-Q. Yes, it's totally a quick-n-dirty solution. But for analysis/visualization of music, it's still way better than a linear display. My opinion.

A well-designed bank of 96 parallel filters would indeed be much more accurate in having a constant Q response. If one has the CPU to do the 96 parallel filters (plus the 96 envelope detections), the results will be superior. But noticeably so? That's harder to know.
 
My first few attempts did indeed use the approximation method of summing bins, but after experimenting with many programs and examples I found the CQT approach to be exactly what I was looking for.

This implementation in particular was interesting: https://github.com/mfcc64/youtube-musical-spectrum

I have no need for distinct left/right channel analysis, but otherwise it was spot on. Porting to Teensy seemed doable, but would require many optimizations for both memory and speed.
ARM-specific optimizations are beyond my current skill set, but if there's no other way I can start digging into it.

I've found that aiming for ~20-30 updates a second provides enough time resolution to follow musical beats in lights, so that would be 33-50ms per loop, which seems possible on the T4 strictly based on intuition.

Edit: And to be clear, the main reason I like the CQT approach is the exact nature of the logarithmic bins, especially with regard to the lower registers.
It felt like there should be a better way to get C1 (32.7 Hz) to C9 (8372 Hz) with enough resolution in the bass region to visualize anything meaningful.
I could very well abandon the one-LED-per-chromatic-note idea and concede to having 3-5 LEDs instead to smooth out a logarithmic representation of a normal FFT, but I'd like to try CQT first just to know :)
 
I only looked at it briefly, but that code appears to be doing an FFT and then combining bins. Not clear why so many people write code without comments.
 
>> For sure, this is a cheesy approximation for constant-Q. Yes, it's totally a quick-n-dirty solution. But for analysis/visualization of music, it's still way better than a linear display. My opinion.

Chip,
I would not call it an "approximation to constant-Q", as a simple logarithmic frequency scale (logarithmically summing frequency bins) is equivalent to a logarithmic Q.
To what extent logarithmic summing of frequency bins is appropriate for an LED display is a different discussion.
 
Isn't logarithmically increasing bandwidth (to follow logarithmically increasing center frequencies) the same as constant Q? Or am I misunderstanding constant Q?

I characterize the FFT approach as an approximation because I can't easily sum fractional bins. Instead, I'd probably just sum an integer number of bins, which isn't fully accurate. IMO, prob good enough for a visualization. Maybe not good enough for scientific analysis.
 
Two questions:

I use partitioned fast convolution to greatly reduce filtering delays. Can't partitioning be used for this application?

If one does use 96 biquad bandpass filters (which seems feasible), does the high Q needed cause quantization problems? Or perhaps it is insignificant when it comes to an LED display.
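For scale, here's what a few bands of such a filterbank might look like with the stock AudioFilterBiquad and AudioAnalyzeRMS objects; the Q value and the four example notes are just placeholders, and a real build would repeat the pattern 96 times (probably generated by a script):

Code:
#include <Audio.h>

// Illustrative fragment: four bands of a parallel biquad bandpass filterbank,
// each followed by an RMS envelope detector for one LED.
AudioInputI2S        audioIn;
AudioFilterBiquad    bp0, bp1, bp2, bp3;
AudioAnalyzeRMS      rms0, rms1, rms2, rms3;
AudioConnection      c0(audioIn, 0, bp0, 0), c1(bp0, rms0);
AudioConnection      c2(audioIn, 0, bp1, 0), c3(bp1, rms1);
AudioConnection      c4(audioIn, 0, bp2, 0), c5(bp2, rms2);
AudioConnection      c6(audioIn, 0, bp3, 0), c7(bp3, rms3);
AudioControlSGTL5000 codec;

const float Q = 17.0f;   // roughly 1/12-octave bandwidth; value is illustrative

void setup() {
  AudioMemory(20);
  codec.enable();
  codec.inputSelect(AUDIO_INPUT_LINEIN);
  bp0.setBandpass(0, 82.41f, Q);   // E2
  bp1.setBandpass(0, 87.31f, Q);   // F2
  bp2.setBandpass(0, 92.50f, Q);   // F#2
  bp3.setBandpass(0, 98.00f, Q);   // G2
}

void loop() {
  if (rms0.available()) {
    float level = rms0.read();     // envelope for the E2 band
    // drive the corresponding LED from 'level'
  }
}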
 
So, yeah, if one is looking to see how many bins one should sum together, the constant-Q equation shuffles around to be:

N_k = (Q * f_s) * f_k [*** EDIT: WMXZ noticed my obvious math error! See his message below ***]

As the Q is constant for all filters (by definition of constant Q), and as the sample rate is constant, you're left with the statement that the number of bins for your summing window should be proportional to the center frequency. If you've got logarithmically spaced center frequencies, you end up needing logarithmically increasing bandwidth.

The difficulty hits when one realizes that this approach often requires you to sum a fractional number of bins. Like, you might find that you need to sum 2.3 bins. Not 2 bins, and not 3 bins, but 2.3 bins. How does one sum 2.3 bins?!? I guess that you sum just 2 bins? And, later, when you need to sum 2.7 bins, you'll round up and sum 3 bins for that case? Ugly. That's why this is an approximation. That's why this is quick-n-dirty. That's why an IIR or FIR filterbank approach would definitely provide a more accurate answer, if accuracy is important. But, this single-FFT approach is easier for those of us who are lazy.
 
Chip
you mean
N_k = (Q * f_s) / f_k

Anyhow, it does not matter, as one can use an FFT to approximate the spectral resolution (with summing of bins), and then one can sum the temporal bins to approximate the windowing.

To come back to the OP:
One could first do the summing of frequency bins; the fractional bins could be weighted sums. This would then be followed by a variable time average (long averages at lower frequencies and no averaging at the highest frequency). After all, only the spectral power or RMS value is visible on the LEDs.
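A minimal sketch of that weighting-and-averaging idea (the FFT size, the per-bin power array, and the helper names are assumptions for illustration):

Code:
#include <math.h>

// Sum FFT power bins for one note's band, weighting the two edge bins by how
// much of the band they cover, then smooth each note with its own time constant.
const float FS      = 44100.0f;
const int   FFT_LEN = 1024;
const float BIN_HZ  = FS / FFT_LEN;

float bandPower(const float *binPower, float loHz, float hiHz) {
  float lo = loHz / BIN_HZ, hi = hiHz / BIN_HZ;      // fractional bin indices
  int b0 = (int)floorf(lo), b1 = (int)floorf(hi);
  if (b0 == b1) return (hi - lo) * binPower[b0];     // band narrower than one bin
  float p = (b0 + 1 - lo) * binPower[b0] + (hi - b1) * binPower[b1];
  for (int b = b0 + 1; b < b1; b++) p += binPower[b];
  return p;
}

// One-pole smoothing per note: alpha near 1 gives the slow average for low notes,
// alpha = 0 gives no averaging at the top.
float smoothNote(float prev, float now, float alpha) {
  return alpha * prev + (1.0f - alpha) * now;
}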
 
Nice catch!

I didn't realize that N_k was the *time*-domain window length used in computing the DFT. I incorrectly assumed it was the frequency window used for frequency-bin summing. Oops!

Looking more closely at the wiki article, I see that they are computing an independent discrete Fourier transform for each of the constant-Q filters. As they are only doing a single DFT for a given filter, they adjust the bandwidth of that single DFT by adjusting the size of the time window: a longer window for more frequency resolution at the low frequencies and a shorter window for less frequency resolution at high frequencies. Hence, one sees the inverse relationship exhibited by your (absolutely correct) fix to my bad algebra.
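For reference, a bare-bones version of that per-filter DFT as the wiki formula reads to me (Hann taper, window length N_k = Q * fs / f_k; purely illustrative, not optimized for the T4):

Code:
#include <math.h>

// One constant-Q coefficient computed directly from the definition:
// a per-note DFT whose window length shrinks as the note frequency rises.
float cqtMagnitude(const float *x, float fs, float fk, float Q) {
  int N = (int)roundf(Q * fs / fk);       // N_k = Q*fs/f_k, longer for lower notes
  float re = 0.0f, im = 0.0f;
  for (int n = 0; n < N; n++) {
    float w  = 0.5f - 0.5f * cosf(2.0f * (float)M_PI * n / (N - 1));  // Hann taper
    float ph = 2.0f * (float)M_PI * Q * n / N;                        // = 2*pi*fk*n/fs
    re += w * x[n] * cosf(ph);
    im -= w * x[n] * sinf(ph);
  }
  return sqrtf(re * re + im * im) / N;
}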

[To me, this is exactly how you'd set up an FIR filterbank, too. But DFT and FIR are brother-and-sister, so I guess that's not a surprise.]

For the approach of using a single FFT and using it to approximate the constant Q filters, we don't have the freedom to adjust the time-domain window for each filter. So, you're stuck with choosing an FFT that is long enough to get the resolution that you need at the low frequencies. This long window ends up giving really slow time response at higher frequencies (as already discussed).

>> This would then be followed by a variable time average (long averages at lower frequencies and no averaging at the highest frequency). After all, only the spectral power or RMS value is visible on the LEDs.

With the FFT length being set by the resolution needed at the lowest frequencies, is time averaging adding anything? Or are you suggesting that short FFTs can be used and that the tight frequency resolution at low frequencies can somehow be achieved through time averaging? If so, I don't know that trick! I'm interested!
 