Unsigned version of multiply_32x32_rshift32

Status
Not open for further replies.

neutron7

I am trying to do an unsigned version of this assembler code in the audio library, but I really have no idea how it works.

The code is:
Code:
// computes (((int64_t)a[31:0] * (int64_t)b[31:0]) >> 32)
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b) __attribute__((always_inline, unused));
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b)
{
#if defined (__ARM_ARCH_7EM__)
	int32_t out;
	asm volatile("smmul %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
	return out;
#elif defined(KINETISL)
	return ((int64_t)a * (int64_t)b) >> 32;
#endif
}

I do not need the Kinetis part.

Is it even worth doing? Or would it be just as fast to use a 64 bit variable and shift it?

I could be wrong, but I think I need to use umull instead of smmul.

What I naively tried:
change the declaration and inputs to uint32_t,
change smmul to umull,
change out to uint32_t

Of course that did not work :rolleyes:
What else am I missing here?
 
is it even worth doing?

Please allow me to answer this question with another question...

Are you regularly viewing the assembly listing to check how the compiler implemented your code and especially how it allocated the CPU registers to your variables?
 
As far as I know, there isn't an unsigned equivalent of SMMUL (which multiplies two 32 bit signed integers for a 64 bit product and gives only the top 32 bits).

I'm pretty sure the compiler will automatically use UMULL (which multiplies 32 bit unsigned integers for a 64 bit product) and then just use 1 of the 2 registers which receive the 64 bit result, if you write code like x = ((uint64_t)a * (uint64_t)b) >> 32. So unless you see the compiler doing an unnecessary register dance or something else horribly inefficient in those assembly listings, there's no reason to bother with inline assembly.
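For reference, a minimal sketch of that plain C approach could look like the following. The name multiply_u32x32_rshift32 is made up here to mirror the audio library's naming; it is not an existing library function. Whether the compiler really emits a single UMULL for it is something to confirm in your own assembly listing.

```c
#include <stdint.h>

// Hypothetical unsigned counterpart of multiply_32x32_rshift32.
// The name is an assumption for illustration, not part of the audio library.
// computes (((uint64_t)a[31:0] * (uint64_t)b[31:0]) >> 32)
static inline uint32_t multiply_u32x32_rshift32(uint32_t a, uint32_t b)
{
	// Widen both operands before multiplying so the full 64 bit
	// product is kept, then return only the upper 32 bits.
	return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
```

As a quick sanity check, multiply_u32x32_rshift32(0x80000000, 2) should return 1, since the product is exactly 2^32.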

The main benefit from SMMUL is use of only 3 registers rather than 4, in the case where you're doing groups of multiplies so the results can't overwrite the inputs. The math isn't done any faster. But 1 extra register available after each multiply can allow you to restructure your entire approach, perhaps bringing in a larger set of inputs to the tiny 12-13 register space and needing fewer loop iterations to work across whatever size data you have.
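To make the register point concrete, here is a rough sketch of what a hand-written UMULL version might look like. UMULL writes two destination registers (low and high halves), so the asm statement needs two outputs and four operands. That is also why swapping the mnemonic in the three-operand SMMUL statement does not assemble. The function name umull_hi32 is hypothetical, and the non-ARM branch is only there so the sketch builds on other targets.

```c
#include <stdint.h>

// Hypothetical helper, not part of the audio library.
// Returns the upper 32 bits of the unsigned 64 bit product a * b.
static inline uint32_t umull_hi32(uint32_t a, uint32_t b)
{
#if defined(__ARM_ARCH_7EM__)
	uint32_t lo, hi;
	// UMULL RdLo, RdHi, Rn, Rm: two destination registers, unlike
	// SMMUL which has only one. The low half is computed but unused.
	__asm__ volatile("umull %0, %1, %2, %3"
		: "=r" (lo), "=r" (hi)
		: "r" (a), "r" (b));
	(void)lo;
	return hi;
#else
	// Portable fallback for targets without this instruction.
	return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
#endif
}
```

The discarded lo output is the fourth register mentioned above. The compiler has to allocate it either way, so this version is unlikely to beat the plain C expression.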

If you care about performance, reading the assembly listing almost always gives you the insight to restructure your code in ways that allow the compiler to make the best use of the limited register set. Usually there is a lot of much lower hanging fruit than inline asm. But if you do enough of this and gain a good understanding of exactly what trade-offs your code is forcing the compiler's register allocator to make, you will almost always know whether a particular optimization is worthwhile.
 
Thank you. I did have a look at the .lst file in the past when I was working on a polynomial sine generator. There was also an asm example I was going to try, but I found the compiler had basically written the same thing.

As for this question, in the end I found that an unsigned multiply and shift was not even what I needed for what I wanted to do, but I have wondered about it before.
 