Kuba0040
Hello,
I am trying to write a fast function that computes (a*b>>32)+sum, where a, b and sum are uint32_t variables and a*b is the full 64-bit product. With help from people here I have already figured out the a*b>>32 part. Now I want to add the accumulation step. After checking the instruction set for the ARM Cortex-M7 CPU (Teensy 4.0) I came across the UMLAL instruction. According to the documentation, it multiplies Rm*Rs and adds the 64-bit product to the 64-bit value held in RdLo and RdHi. Perfect! So I tried to use it in my code, and unfortunately it doesn't work.
Code:
static inline uint32_t unsigned_multiply_accumulate_32x32_rshift32(uint32_t out, uint32_t a, uint32_t b)
{
    uint32_t junk; // Just a trash can register to throw in data we don't need
    asm volatile("umlal %[junk], %[out], %[a], %[b]\n\t"
                 : [junk] "=&r" (junk), [out] "=&r" (out)
                 : [a] "r" (a), [b] "r" (b));
    return out;
}
This may look weird, but the idea is this: first we set RdHi to the value we want to accumulate, then UMLAL multiplies a*b into a 64-bit result and adds it to the RdHi:RdLo pair. We then return just RdHi, which effectively shifts the result right by 32. That's also why the accumulate value goes into RdHi and not RdLo: so it doesn't get shifted away.
I hope that made sense.
Now the issue I am having is that I don't know how to tell the compiler (via the "r" and "=&r" constraints) that a register is used both as an input and, later, as an output. Is this even possible, or am I too far down the rabbit hole?
Thank you for the help.
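For reference, one way this could be expressed, offered as a sketch rather than a verified fix: in GCC extended asm, the "+r" constraint marks an operand as both read and written, whereas "=&r" declares it write-only, so the compiler is free to skip loading the incoming value. Because UMLAL also accumulates into RdLo, the low register should probably start at zero instead of holding whatever was left in it. The name umlal_hi32 is hypothetical, and the __ARM_ARCH_7EM__ guard is an assumption about the Teensy 4.0 toolchain; other targets take the plain C path, which computes the same value.

```c
#include <stdint.h>

/* Hypothetical helper: returns the low 32 bits of ((a * b) >> 32) + sum,
   where a * b is the full 64-bit product. */
static inline uint32_t umlal_hi32(uint32_t sum, uint32_t a, uint32_t b)
{
#if defined(__ARM_ARCH_7EM__) /* assumption: Cortex-M7 build with GCC */
    uint32_t lo = 0; /* RdLo is accumulated into as well, so start it at zero */
    __asm__("umlal %[lo], %[hi], %[a], %[b]"
            : [lo] "+r" (lo), [hi] "+r" (sum) /* "+r": read-write operands */
            : [a] "r" (a), [b] "r" (b));
    return sum; /* RdHi after the accumulate */
#else
    /* Portable equivalent: place sum in the high word, add the 64-bit product
       (wrapping modulo 2^64, like the instruction), keep the high word. */
    uint64_t acc = ((uint64_t)sum << 32) + (uint64_t)a * b;
    return (uint32_t)(acc >> 32);
#endif
}
```

A call such as umlal_hi32(sum, a, b) takes the same argument order as the function in the post. The volatile qualifier is left out here on the assumption that the asm has no side effects beyond its listed operands.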