Here's a link to the code.
https://github.com/PaulStoffregen/A...60dfecd95f42eebc7564a8a/utility/dspinst.h#L81
As you can see, it's just inline assembly that becomes a single SMULWT instruction.
To look up what SMULWT really does, you'll need the ARM Architecture v7M Reference Manual. Google "DDI0403E" to get the PDF. Turn to page 413 for the documentation you seek.
The required format is simply an array of "int16_t". You do not need to use these special functions.
These functions can allow code to run much faster, but in practice they are tricky to use, and actually achieving the higher performance can be quite difficult. For the sake of learning, I recommend writing simple, normal C-style code without these special functions.
With that in mind, most of the speed advantage comes from two things. First, memory access is done as 32 bit integers. Transferring 2 samples at once is twice as fast, since the CPU and memory are all 32 bits wide. Cortex-M4 also has an optimization where subsequent memory accesses through the same pointer take only a single cycle, rather than the usual 2 cycles. So reading four 32 bit numbers brings in 8 audio samples using only 5 cycles. That's much faster than the simple approach, which would use 16 cycles.
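To make the packed-access idea concrete, here's a rough sketch in plain C. The function name is mine, not from the audio library, and I'm using memcpy to express the 32 bit load (the compiler folds it into a single load instruction, and it avoids the strict-aliasing trouble a raw pointer cast invites). On little-endian Cortex-M, the low half of each word is the earlier sample.

```c
#include <stdint.h>
#include <string.h>

/* Sum int16_t samples, fetching two per 32-bit memory access.
   Assumes count is even. */
static int64_t sum_samples_packed(const int16_t *samples, int count)
{
	int64_t sum = 0;
	for (int i = 0; i < count; i += 2) {
		uint32_t pair;
		/* one 32-bit load brings in two 16-bit samples */
		memcpy(&pair, &samples[i], sizeof(pair));
		sum += (int16_t)(pair & 0xFFFF); /* first sample (low half) */
		sum += (int16_t)(pair >> 16);    /* second sample (high half) */
	}
	return sum;
}
```

The real payoff on Cortex-M4 comes from doing several of these loads back to back through the same pointer, so only the first costs 2 cycles.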
The other half of the speedup comes from those instructions which access half of a 32 bit register. That 32x16 one is really nice because it produces a 48 bit result, then discards the low 16 bits. That means the result takes only 1 of the ARM's registers. How many registers are used becomes a critically important concern when trying to use these special instructions to speed up your code. The general strategy is to bring in as many samples as you can, partly to gain the memory access speedup (less of a factor on Teensy 4.x) and partly to process more samples per loop iteration, so less time is spent on looping overhead (also less of a factor on Teensy 4, where we have branch prediction).
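For reference, the arithmetic SMULWT performs can be modeled in portable C like this. This is a sketch of the instruction's semantics as documented in the manual, not the actual inline-assembly function from dspinst.h, and the function name is mine.

```c
#include <stdint.h>

/* Model of SMULWT: signed 32x16 multiply using the TOP half of b.
   The full product would need 48 bits; the low 16 bits are dropped,
   so the result fits back into a single 32-bit register. */
static int32_t smulwt_model(int32_t a, uint32_t b)
{
	int16_t top = (int16_t)(b >> 16);   /* top 16 bits, treated as signed */
	int64_t product = (int64_t)a * top; /* significant result fits in 48 bits */
	return (int32_t)(product >> 16);    /* discard the low 16 bits */
}
```

Because the other operand comes from the top half of a register, the low half is free to hold the next sample, which is exactly why packing two samples per register pairs so well with these instructions.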
In practice, this sort of optimization work involves inspecting the assembly code the compiler generates, to check whether the registers are actually being used as you intended. When the compiler needs to "spill" local variables onto the stack, you lose any performance advantage. It's tedious work, and difficult if you're not familiar with ARM assembly language, all to get the same results computed in fewer clock cycles.
You don't need to do this sort of optimization. If you're still learning and developing your algorithm, writing ordinary C code is much simpler. I highly recommend learning and testing with ordinary code. Using those special instructions is usually only done after you've made the normal way work, so the normal code serves as a test case against which to compare the optimized results.