Adding more DSP instructions to Arduino Code.

Kuba0040

Hello!
Recently, in one of my projects, I ran into a situation where I need to quickly multiply two unsigned 32-bit numbers (uint32_t) together and then shift the result 32 bits to the right. The ARM Cortex-M7 CPU (Teensy 4.0) has a DSP instruction called "smmul" which performs a similar operation, but on signed numbers. There is no direct instruction that computes (uint32_t*uint32_t)>>32. However, there is an instruction called "umaal" (you can learn more about it here). It multiplies two unsigned 32-bit values together and returns a 64-bit result. Because the Cortex-M7 is a 32-bit CPU, each register can only hold 32 bits, so the result is split into a low register (bits 0-31) and a high register (bits 32-63). If we read just the high register and ignore the low one, we get exactly what we are looking for: (uint32_t*uint32_t)>>32.

So here are my questions:
How do I implement this instruction in a similar style to how DSP instructions are integrated in the Audio library?
Basically, how do I set which assembly parameter corresponds to my variables? Also, what are the % signs here?
Example:
Code:
// computes (((int64_t)a[31:0] * (int64_t)b[31:0]) >> 32)
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b) __attribute__((always_inline, unused));
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b)
{
#if defined (__ARM_ARCH_7EM__)
	int32_t out;
	asm volatile("smmul %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
	return out;
#elif defined(KINETISL)
	return ((int64_t)a * (int64_t)b) >> 32;
#endif
}

Secondly, how do I ignore the low register? Are there any registers in the Cortex-M7 CPU that ignore writes, so that I could use one as a dummy destination and throw the unneeded data away, like a trashcan?

Thank you for the help.
 
I meddled with the FRACMUL_SHL inline function in dspinst.h and changed smull to umull and removed the extraneous instructions and parameter list.
I think this is correct.
Code:
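static inline int32_t UMULL_HI32(int32_t x, int32_t y)
{
    int32_t t, t2;
    // umull RdLo, RdHi, Rn, Rm: t receives the low 32 bits of x*y, t2 the high 32 bits.
    // "=&r" marks each output as early-clobber so it gets its own register.
    asm ("umull    %[t], %[t2], %[a], %[b]\n\t"
         : [t] "=&r" (t), [t2] "=&r" (t2)
         : [a] "r" (x), [b] "r" (y));
    return t2;
}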

void setup(void)
{
  Serial.begin(9600);
  while(!Serial);
  delay(1000);

  Serial.printf("%08X\n",UMULL_HI32(0xffffffff,0xffffffff));
}

void loop(void)
{
}

If you want to check the low-order 32 bits of the result, change "return t2;" to "return t;"

Pete
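
On the question of what the % signs mean: in GCC extended inline assembly, %0, %1, ... are placeholders for the operands listed after the colons, numbered in order with the outputs first and the inputs after them. The %[name] form used in the code above is the named equivalent. Below is a minimal sketch with positional operands. The function name umull_hi_positional is hypothetical, and the non-ARM branch is an assumption added only so the snippet also builds off-target. As far as I know, 32-bit ARM has no register that discards writes, so the unwanted low word simply goes into a dummy output variable and the compiler picks a scratch register for it.

```cpp
#include <cstdint>

// Hypothetical helper: returns the high 32 bits of the 64-bit product a * b.
static inline uint32_t umull_hi_positional(uint32_t a, uint32_t b)
{
#if defined(__ARM_ARCH_7EM__)
    uint32_t lo, hi;
    // %0 = lo (RdLo), %1 = hi (RdHi): outputs are numbered first.
    // %2 = a, %3 = b: inputs continue the numbering.
    // "=&r" marks lo and hi as early-clobber outputs, so they never
    // share a register with an input.
    asm("umull %0, %1, %2, %3"
        : "=&r"(lo), "=&r"(hi)
        : "r"(a), "r"(b));
    (void)lo;  // the low word is intentionally discarded
    return hi;
#else
    // Assumed fallback for non-ARM builds: same result in plain C++.
    return (uint32_t)(((uint64_t)a * b) >> 32);
#endif
}
```

Swapping the two output operands in the template would return the low word instead, which is the positional counterpart of changing "return t2;" to "return t;".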
 
You could also take a look at cores\Teensy4\arm_math.h, and search for SMMULR. I've never used these macros, but they appear to do what you want.
 
GCC is smart enough to do that without ASM....

Note: the mov and bx are only there because the function is not inlined. Don't underestimate the compiler. In general, we can assume that it is smarter than we are, in most cases.
Writing asm is useful only in very rare cases.

https://godbolt.org/z/r8rbosqPh
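
The linked Compiler Explorer example is not reproduced here. As a minimal sketch of the idea, the plain C++ below widens to 64 bits, multiplies, and keeps the upper half; on Cortex-M targets, optimizing GCC builds generally turn this pattern into a single umull. The name mul_hi_u32 is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: (a * b) >> 32 computed in 64-bit unsigned arithmetic.
constexpr uint32_t mul_hi_u32(uint32_t a, uint32_t b)
{
    // Widen one operand first so the multiply is done in 64 bits.
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// Compile-time checks of the arithmetic.
static_assert(mul_hi_u32(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu, "high word of max*max");
static_assert(mul_hi_u32(0x10000u, 0x10000u) == 1u, "2^16 * 2^16 = 2^32");
```

Because no inline asm is involved, the same function also builds for targets without the DSP extension, and the compiler remains free to schedule and inline it as it sees fit.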
 
A bad example is this:

Code:
short CLIPTOSHORT(int x){
    int sign; /* clip to [-32768, 32767] */
    sign = x >> 31;
    if (sign != (x >> 15)) x = sign ^ ((1 << 15) - 1);
    return (short)x;
}

GCC for ARM really does translate this into a whole bunch of instructions, even though the CPU has a single instruction that does this.
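
The single instruction meant here is presumably SSAT (signed saturate). Below is a minimal sketch of reaching it through inline asm, in the same style as the Audio library helper quoted earlier in the thread. The name clip_to_short_ssat is hypothetical, and the non-ARM branch is an assumed fallback so the snippet also builds on other targets.

```cpp
#include <cstdint>

// Hypothetical helper: clamp x to the int16_t range [-32768, 32767].
static inline int16_t clip_to_short_ssat(int32_t x)
{
#if defined(__ARM_ARCH_7EM__)
    int32_t out;
    // ssat Rd, #16, Rn saturates Rn to the signed 16-bit range.
    asm("ssat %0, #16, %1" : "=r"(out) : "r"(x));
    return (int16_t)out;
#else
    // Assumed fallback for non-ARM builds.
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
#endif
}
```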

On ESP32, GCC produces a single instruction for the code above.

However...
I could not find a way to get it to produce the correct instruction for a simple float=round(float) on ESP32. This CPU has a single instruction for this, too...
This is one of the rare cases I mentioned. Or maybe I have just been doing it wrong so far.




 
It works like a charm. I just changed the variables to be uint32_t instead of int32_t. The total execution time is only 1 CPU cycle. Perfect! Thank you very much, everyone.
 