Truncated algebra optimisation

Xenoamor

Say I have the following:
Code:
uint32_t X = Z/Y;
uint32_t F = Z-(Y*X);
Can these two operations be optimised?
X and F are the unknowns I wish to determine.

I was thinking:
Code:
Substitute X into F:
F = Z-(Y*(Z/Y))
Algebraically this obviously equals 0, but here it doesn't, because Z/Y is rounded down (truncated), which is the desired behaviour
 
The first one gives F = 0 too... so the optimisation is indeed F = 0; :)
Try it with Excel or by hand... :)
 
Code:
Let:
Z = 5;
Y = 3;

5/3 = 1.66666...
Cast to uint32_t X = 1;

It may make more sense if I write it as:
Code:
uint32_t X = floor(Z/Y);
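
A minimal, compilable sketch of the same point (plain C, names as in the thread):
Code:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t Z = 5, Y = 3;
    uint32_t X = Z / Y;        /* truncates 1.666... down to 1 */
    uint32_t F = Z - (Y * X);  /* 5 - 3*1 = 2: the remainder, not 0 */
    printf("X=%u F=%u\n", (unsigned)X, (unsigned)F);
    return 0;
}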
 
Thanks, I'll pick apart the assembly tomorrow.

I think as I'm using fixed-point arithmetic I'm losing precision that can't be recovered
 
The assembler DIV takes only 7 cycles, if I remember correctly.

Let us know your results :)
 
I just searched the manual,

and wrote a short function (similar to the ones that Paul uses) - unsigned div in inline assembler:

Code:
static inline uint32_t udiv(uint32_t a, uint32_t b) __attribute__((always_inline, unused));
static inline uint32_t udiv(uint32_t a, uint32_t b)
{
  uint32_t out;
  asm volatile("udiv %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
  return out;
}

void setup() {
  // put your setup code here, to run once:
  delay(900);
  Serial.print( udiv (15,5));
}

void loop() {
  // put your main code here, to run repeatedly:

}
Is it faster?

from the manual:
Division operations terminate when the divide calculation completes, with the number of cycles required dependent on the values of the input operands. Division operations are interruptible, meaning that an operation can be abandoned when an interrupt occurs, with worst case latency of one cycle, and restarted when the interrupt completes.

It takes 2 to 12 cycles.
 
I've missed the obvious, I suppose...
Why an inline function calling asm to do an integer divide, when the C compiler knows there's a hardware integer divide for two's complement numbers?
 
Okay, I've just found this glorious website

So, using -O3 and ARM gcc 4.8.2. Note I've had to use volatile and statics to stop various optimisations:
Code:
#include <stdint.h>

volatile uint32_t Z = 312;
volatile uint32_t Y = 35;

// (despite the name, this just computes a/b)
static inline uint32_t maxArray(uint32_t a, uint32_t b) {
  uint32_t out = a/b;
  return out;
}

void test() {
  volatile uint32_t x = maxArray(Z,Y); 
}
creates:
Code:
test():
        movw    r3, #:lower16:.LANCHOR0
        movt    r3, #:upper16:.LANCHOR0
        push    {lr}
        sub     sp, sp, #12
        ldr     r0, [r3]
        ldr     r1, [r3, #4]
        bl      __aeabi_uidiv
        str     r0, [sp, #4]
        add     sp, sp, #12
        ldr     pc, [sp], #4
Z:
        .word   312
Y:
        .word   35

And finally:
Code:
#include <stdint.h>

volatile uint32_t Z = 312;
volatile uint32_t Y = 35;

static inline uint32_t udiv(uint32_t a, uint32_t b) __attribute__((always_inline, unused));
static inline uint32_t udiv(uint32_t a, uint32_t b)
{
  uint32_t out;
  asm volatile("udiv %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
  return out;
}

void test() {
  volatile uint32_t x = udiv(Z,Y);
}
Creates:
Code:
test():
        movw    r3, #:lower16:.LANCHOR0
        movt    r3, #:upper16:.LANCHOR0
        sub     sp, sp, #8
        ldr     r2, [r3]
        ldr     r3, [r3, #4]
        udiv r3, r2, r3
        str     r3, [sp, #4]
        add     sp, sp, #8
        bx      lr
Z:
        .word   312
Y:
        .word   35

I'm having a hard time finding __aeabi_uidiv.
I think it may well be a pre-compiled routine shipped in a library with ARM GCC. Quite possibly this?
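
A quick way to confirm where it comes from, assuming a standard arm-none-eabi toolchain on the PATH (the routine lives in libgcc):
Code:
arm-none-eabi-nm $(arm-none-eabi-gcc -print-libgcc-file-name) | grep uidiv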
 
You should use the options
Code:
-O3 -mthumb -mcpu=cortex-m4

...it turns out that both of your source files compile to exactly the same output :)
 
Excellent, thanks Frank!
I was wondering why it looked a little out of place. This'll be handy for analysing code snippets from here on out
 
Maybe N/A, but here's what IAR's compiler generated for the same source... it uses the ARM UDIV instruction.

[Screenshot: IAR disassembly showing UDIV]
 
You noted that the GCC code calls a function rather than using the ARM UDIV machine instruction?
 
You noted that that output was from a compilation not targeting the ARM Cortex-M4? And did you note my comment that after setting the right options the output is the same? (Which means with UDIV.)

;)
 
Say I have the following:
Code:
uint32_t X = Z/Y;
uint32_t F = Z-(Y*X);
Can these two be optimised.
A C89/C90/C99/C11 compiler will calculate X using integer division, and thus F = Z % Y (where % is the integer modulus operator in C).
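
In other words, the pair can be written directly as (a sketch using the variables from the quoted post):
Code:
uint32_t X = Z / Y;  /* quotient, truncated toward zero */
uint32_t F = Z % Y;  /* remainder; identical to Z - (Y * X) */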

Some architectures (i386 and AMD64, i.e. 32-bit and 64-bit Intel/AMD) have an integer divide instruction that yields the remainder as well, and many compilers have optimizations to combine a division and a remainder into one. The C89/C99 div() function is intended as an optimization for this particular case. It returns a div_t structure, which contains both the quotient (in the quot member) and the remainder (in the rem member).
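
A minimal usage sketch of div() (note that it operates on int; ldiv() and lldiv() cover the wider types):
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    div_t r = div(312, 35);  /* one call yields quotient and remainder */
    printf("quot=%d rem=%d\n", r.quot, r.rem);  /* prints quot=8 rem=32 */
    return 0;
}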

Unfortunately, while the Cortex-M4 (Teensy 3.x) does have an integer division operation as part of the Thumb instruction set, it only computes the quotient, rounding towards zero.

Comparing hand-written and C versions of udiv() and sdiv():
Code:
typedef struct {
    unsigned int quot;
    unsigned int rem;
} udiv_t;

typedef struct {
    int quot;
    int rem;
} sdiv_t;

static inline __attribute__((always_inline))
udiv_t udiv_c(const unsigned int num, const unsigned int denom)
{
    register udiv_t result;
    result.quot = num / denom;
    result.rem = num - result.quot * denom;
    return result;
}

static inline __attribute__((always_inline))
sdiv_t sdiv_c(const int num, const int denom)
{
    register sdiv_t result;
    result.quot = num / denom;
    result.rem = num - result.quot * denom;
    return result;
}

static inline __attribute__((always_inline))
udiv_t udiv_asm(const unsigned int num, const unsigned int denom)
{
    register udiv_t result;
    /* Both outputs are written before the inputs are read for the
       last time, so they need the early-clobber "=&r" constraint to
       keep them in registers distinct from num and denom. */
    __asm__ __volatile__ ( "udiv %0, %2, %3\n\t"
                           "mul %1, %0, %3\n\t"
                           "sub %1, %2, %1\n\t"
                         : "=&r" (result.quot) /* %0 */
                          ,"=&r" (result.rem)  /* %1 */
                         : "r" (num)           /* %2 */
                          ,"r" (denom)         /* %3 */
                         );
    return result;
}

static inline __attribute__((always_inline))
sdiv_t sdiv_asm(const int num, const int denom)
{
    register sdiv_t result;
    /* Same early-clobber reasoning as udiv_asm() above. */
    __asm__ __volatile__ ( "sdiv %0, %2, %3\n\t"
                           "mul %1, %0, %3\n\t"
                           "sub %1, %2, %1\n\t"
                         : "=&r" (result.quot) /* %0 */
                          ,"=&r" (result.rem)  /* %1 */
                         : "r" (num)           /* %2 */
                          ,"r" (denom)         /* %3 */
                         );
    return result;
}
Using arm-none-eabi-gcc-4.9.3 with -mthumb -mcpu=cortex-m4 -Os and with -mthumb -mcpu=cortex-m4 -O2, we find that the code generated from the C version is actually superior (due to a smaller function preamble/postamble).
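
A usage sketch with the values from the earlier posts:
Code:
udiv_t r = udiv_c(312u, 35u);
/* r.quot == 8, r.rem == 32; with the options above, GCC typically
   emits a single udiv followed by a multiply-subtract. */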

Further experimentation shows that as long as you keep the two operations nearby -- first computing the quotient, then soon after computing the remainder, either via the % modulus operator or by subtracting the product of the quotient and the denominator from the numerator (both compile to exactly the same machine code) -- GCC 4.9.3 generates very good code using -mthumb -mcpu=cortex-m4 -Os and -mthumb -mcpu=cortex-m4 -O2; certainly good enough for me not to write inline assembly variants!

The classic ARM (non-Thumb) instruction set does not have a division instruction at all, so if you do not allow Thumb instructions, GCC must use a helper function (__aeabi_uidiv() for unsigned int arguments) instead. The Teensy LC uses a Cortex-M0+ (the armv6-m architecture to GCC), which has no hardware divide instruction either; division by a variable is computed using such a library helper function instead.

In general, the code quoted from Xenoamor's post will compile to good code for the Teensy LC too, although due to the lack of a division instruction in its machine code, it'll have to call the library division function.

TL;DR: GCC will generate well-optimized code for those expressions, but you must tell it the exact features the Teensy supports via compiler options.

If you are using fixed-point arithmetic on Cortex-M4 (Teensy 3.x), you should note that it does have a machine instruction for multiplying two 32-bit values yielding a 64-bit result. For example,
Code:
unsigned int umulhi(const unsigned int a, const unsigned int b)
{
    return ((unsigned long long)a * (unsigned long long)b) >> 32;
}

unsigned int umulhiadd(const unsigned int a, const unsigned int b, const unsigned long long c)
{
    return ((unsigned long long)a * (unsigned long long)b + c) >> 32;
}

int smulhi(const int a, const int b)
{
    return ((long long)a * (long long)b) >> 32;
}

int smulhiadd(const int a, const int b, const long long c)
{
    return ((long long)a * (long long)b + c) >> 32;
}
compiles to (using -mthumb -mcpu=cortex-m4 -O2 or -Os):
Code:
umulhi:
    umull r0, r1, r0, r1
    mov   r0, r1
    bx    lr

umulhiadd:
    umlal r2, r3, r1, r0
    mov   r0, r3
    bx    lr

smulhi:
    smull r0, r1, r0, r1
    mov   r0, r1
    bx    lr

smulhiadd:
    smlal r2, r3, r1, r0
    mov   r0, r3
    bx    lr
which is optimal, as far as I can tell.

This means that instead of division, you should check if you can do the operation via multiplication (by 4294967296/divisor), optionally adding a 64-bit constant, and shifting the result down by 32 bits (logically the same as dividing by 4294967296). Among other code, this optimizes to just one umull, umlal, smull, or smlal Thumb instruction, taking only one clock cycle (according to Cortex-M4 technical reference manual). In typical code, the bit shift is "hidden" by consecutive code, by just using the high 32-bit word of the result, ignoring the low part.
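
For instance, here is a sketch of exact unsigned division by the constant 10 built on this idiom; udiv10() is a hypothetical helper, the magic multiplier is the well-known ceil(2^35/10) = 0xCCCCCCCD, and the extra shift by 3 completes the division by 2^35:
Code:
/* x/10 without a divide instruction: one umull plus shifts. */
static inline unsigned int udiv10(const unsigned int x)
{
    /* High 32 bits of the 64-bit product, then 3 more shift bits. */
    return (unsigned int)(((unsigned long long)x * 0xCCCCCCCDull) >> 32) >> 3;
}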

The types of the multiplicands are important: If and only if both are unsigned, the result is unsigned. At least one has to be cast to 64-bit (long long or [u]int64_t); the other will be automatically promoted, but I personally prefer to cast both, just to be explicit. You do want to use a cast instead of specifying the parameters as 64-bit, so that the compiler knows that the arguments are just 32-bit, and only the result is 64-bit.
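
To illustrate why the cast placement matters (a sketch; the cast must happen before the multiply):
Code:
unsigned int a = 0x10000, b = 0x10000;
unsigned long long wrong = a * b;                      /* 32-bit multiply wraps to 0 first */
unsigned long long right = (unsigned long long)a * b;  /* full 64-bit product 0x100000000 */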

Apologies for the excess verbosity. :eek:
 
You noted that that output was from a compilation not targeting the ARM Cortex-M4? And did you note my comment that after setting the right options the output is the same? (Which means with UDIV.)
The target was a Cortex-M4. One would expect the compiler options in use to permit exploiting the hardware SDIV and UDIV, of course.
 