Adding more DSP instructions to Arduino Code.

Kuba0040 · Nov 28, 2021

Hello!
Recently, in one of my projects, I ran into a situation where I need to quickly multiply two unsigned 32-bit numbers (uint32_t) and then shift the result 32 bits to the right. The ARM Cortex-M7 (Teensy 4.0) CPU has a DSP instruction called "smmul" that performs a similar operation, but on signed numbers. There is no direct instruction that computes (uint32_t*uint32_t)>>32. However, there is an instruction called "umaal" (you can learn more about it here). It multiplies two unsigned 32-bit values and returns a 64-bit result. Because the Cortex-M7 is a 32-bit CPU, each register can hold only 32 bits, so the result is split into a low register (bits 0-31) and a high register (bits 32-63). If we read just the high register and ignore the low one, we get exactly what we are looking for: (uint32_t*uint32_t)>>32.

So here are my questions:
How do I implement this instruction in a similar style to how DSP instructions are integrated in the Audio library?
Basically, how do I set which assembly parameter corresponds to my variables? Also, what are the % signs here?
Example:

Code:

// computes (((int64_t)a[31:0] * (int64_t)b[31:0]) >> 32)
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b) __attribute__((always_inline, unused));
static inline int32_t multiply_32x32_rshift32(int32_t a, int32_t b)
{
#if defined (__ARM_ARCH_7EM__)
	int32_t out;
	asm volatile("smmul %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
	return out;
#elif defined(KINETISL)
	return ((int64_t)a * (int64_t)b) >> 32;
#endif
}

Secondly, how do I ignore the low register? Does the ARM M7 CPU have any registers that ignore writes, so that I could use one as a dummy register and throw the useless data into it like a trashcan?

Thank you for the help.

el_supremo · Nov 28, 2021

I meddled with the FRACMUL_SHL inline function in dspinst.h and changed smull to umull and removed the extraneous instructions and parameter list.
I think this is correct.

Code:

static inline int32_t UMULL_HI32(int32_t x, int32_t y)
{
    int32_t t, t2;
    asm ("umull    %[t], %[t2], %[a], %[b]\n\t"
         : [t] "=&r" (t), [t2] "=&r" (t2)
         : [a] "r" (x), [b] "r" (y));
    return t2;
}

void setup(void)
{
  Serial.begin(9600);
  while(!Serial);
  delay(1000);

  Serial.printf("%08X\n",UMULL_HI32(0xffffffff,0xffffffff));
}

void loop(void)
{
}

If you want to check the low-order word of the result, change "return t2;" to "return t;".
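On the question about the % signs: in GCC extended asm, %0, %1, %2, ... refer to the operands in the order they are listed after the colons (outputs first, then inputs), and %[name] is the named form of the same thing. As far as I know there is no "trashcan" register on this CPU; the low word just goes into a scratch register that the compiler picks and is never read again. Here is a minimal sketch of an unsigned variant using positional operands. The name umull_hi32 is made up for illustration, and the #else branch is only an assumed portable fallback so the snippet also builds on non-ARM targets.

```c
#include <stdint.h>

// Hypothetical helper (name chosen for illustration only):
// returns the high 32 bits of the 64-bit product x * y.
static inline uint32_t umull_hi32(uint32_t x, uint32_t y)
{
#if defined(__ARM_ARCH_7EM__)
    uint32_t lo, hi;
    // Instruction form: umull RdLo, RdHi, Rn, Rm
    //   %0 -> lo (first output operand)
    //   %1 -> hi (second output operand)
    //   %2 -> x  (first input operand)
    //   %3 -> y  (second input operand)
    __asm__ ("umull %0, %1, %2, %3"
             : "=&r" (lo), "=&r" (hi)
             : "r" (x), "r" (y));
    (void)lo;   // low word is deliberately discarded
    return hi;
#else
    // Fallback for other targets: same result in plain C.
    return (uint32_t)(((uint64_t)x * y) >> 32);
#endif
}
```

For example, umull_hi32(0xFFFFFFFF, 0xFFFFFFFF) gives 0xFFFFFFFE, because the full product is 0xFFFFFFFE00000001.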

Pete

joepasquariello · Nov 28, 2021

You could also take a look at cores\Teensy4\arm_math.h, and search for SMMULR. I've never used these macros, but they appear to do what you want.

Frank B · Nov 28, 2021

GCC is smart enough to do that without ASM.

Note that the mov and bx are only there because the function is not inlined. Don't underestimate the compiler. In general, we can assume that it is smarter than we are, in most cases.
Writing asm is useful only in very rare cases.

https://godbolt.org/z/r8rbosqPh
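The contents of the link are not reproduced here, so the following is only an assumed sketch of the kind of plain C code meant: widen to 64 bits, multiply, and keep the upper half. The name mul_u32_hi is hypothetical. With optimization enabled, GCC for Cortex-M typically turns this into a single umull and returns the high register.

```c
#include <stdint.h>

// Hypothetical name; plain C, no inline assembly.
// Computes ((uint64_t)a * (uint64_t)b) >> 32.
static inline uint32_t mul_u32_hi(uint32_t a, uint32_t b)
{
    // The cast makes the multiplication happen in 64 bits,
    // so the upper 32 bits of the product are not lost.
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

Keeping it in C also lets the compiler inline and schedule the multiply together with the surrounding code, which an asm block can get in the way of.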

Frank B · Nov 28, 2021

A bad example is this:

Code:

short CLIPTOSHORT(int x){
    int sign; /* clip to [-32768, 32767] */
    sign = x >> 31;
    if (sign != (x >> 15)) x = sign ^ ((1 << 15) - 1);
    return (short)x;
}

GCC for ARM really translates this into a bunch of instructions, even though the CPU has a single instruction to do this.
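The instruction is not named above; presumably it is SSAT (signed saturate), and that is an assumption on my part. A minimal sketch of how it could be used from inline asm follows. The name clip_to_short is hypothetical, and the #else branch is just a portable fallback for targets without the instruction.

```c
#include <stdint.h>

// Hypothetical helper: clip a 32-bit value to [-32768, 32767].
static inline int16_t clip_to_short(int32_t x)
{
#if defined(__ARM_ARCH_7EM__)
    int32_t out;
    // ssat Rd, #16, Rm : saturate Rm to the signed 16-bit range.
    // Assumption: this is the single instruction referred to above.
    __asm__ ("ssat %0, #16, %1" : "=r" (out) : "r" (x));
    return (int16_t)out;
#else
    // Portable fallback with the same result.
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
#endif
}
```

This is the kind of case where a short asm wrapper can still pay off, since the compiler does not pick the instruction on its own for the C version.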

On ESP32, GCC produces a single instruction for the code above.

However...
I could not find a way to get it to produce the correct instruction for a simple float=round(float) on ESP32. This CPU has a single instruction for this, too...
This is one of the rare cases I mentioned. Or I just did it wrong so far.

Kuba0040 · Dec 4, 2021

It works like a charm. I've just changed the variables to be uint32_t instead of int32_t. The total execution time is only 1 CPU cycle. Perfect! Thank you very much, everyone.
