Modulo division - how does it work?

Status
Not open for further replies.

TelephoneBill

Well-known member
Does anyone know how the algorithm works for performing modulo division in the Arduino/Teensyduino IDE?

I'm interested to know if the timing for this operation might depend on the size of the operands involved?

If this were performed by iterative division to find the remainder, then smaller "divisees" would get to the result quicker? That could affect the timing in a time critical loop.
 
Here https://godbolt.org/z/FdhuYq is what gcc makes out of
Code:
int test(int x, int y)
{  
   return x % y;  
}

Not much to optimize I guess. Anyway, you can change the code in the compiler explorer and see if your ideas work.
 
Last edited:
Does anyone know how the algorithm works for performing modulo division in the Arduino/Teensyduino IDE?

I'm interested to know if the timing for this operation might depend on the size of the operands involved?

If this were performed by iterative division to find the remainder, then smaller "divisees" would get to the result quicker? That could affect the timing in a time critical loop.

It really depends from chipset to chipset (and of course the amount of effort put in by the compiler engineers, particularly ones that target a particular processor).

Some chipsets do not have a modulus operation in hardware, and you do modulus by division, truncation, multiplication, and subtraction. Some chipsets do not have divide in hardware, and the compiler has to do an iterative process to do the division (and then multiplication, subtraction). Even if a chipset internally has support for divide and modulus instructions, the chip may still do an iterative process. There may or may not be an early out for small divisors.

If you are doing modulus on unsigned values by a constant power of two, the compiler will transform this into an AND operation. If you are doing a division instead of a module, the compiler will do a logical SHIFT right. If it is a signed value, the compiler will have to do some instructions to correct for the sign after doing AND.

If you are doing division by a constant, and the machine has a special multiply instruction that either gives you double the number of bits (i.e. multiply two 32 bit numbers together to get a 64 bit number) or just the high bits, the compiler can transform the division into a multiply high operation, depending on whether the compiler things this is faster (it depends on the speed of the multiplier).

Talking about the speed of multiply, some chipsets don't have multiply, and those that have a multiply operation might do it internally via a serial shift and add method. IIRC, the M4 processors in the Teensy 3.x boards have a fast 1 cycle integer multiply operation.

I have once ran into a particular chipset where integer division was actually faster than multiply, but this is rare.

On some chipsets, you have a combined operation that does both division and modulus, leaving the results in different registers. The compiler may optimize code (typically used in print statements to get each digit) where if you are doing both divide and modulus to use the combined instruction. On some machines with separate divide and modulus operations, the compiler may decide not to use the hardware modulus operation in the case where you are dividing and doing modulus of the same value, and instead fall back to doing divide, multiply, subtract.

If you are doing division on floating point and you indicate you are using the -ffast-math option, the compiler may choose to use a reciprocal approximation instruction (if the chipset has one) and do a few rounds of Newton-Rhapson iteration (multiply-adds) to get most of the bits of accuracy back (down to 2-3 ULP typically). Some machines also have a reciprocal square root approximation instruction that combines 1.0/sqrt(x) (as an aside some numerical codes spend almost all of their time doing floating point 1.0/sqrt(x) operations).

Typically hardware square root shares a lot of silicon real estate with the divide function, so often times hardware square root is added to the instruction set.

Note, while I have worked on GCC support for something like 20 different machines, I have never worked on the ARM environment, so I can't say what the behavior of the M0 (LC), M4 (3.x), and M7 (4.0) chipsets are in terms of division and modulus.
 
Last edited:
So, in summary, the C code modulo operation "A % B" is realised by two assembler instructions - "sdiv" (signed division) followed by "mls" (multiply and subtract). Sdiv finds the integer result R of the division of A by B. Then mls multiplies R by B and subtracts it from A to leave the remainder as the answer.

Sdiv takes between 2 and 12 instruction cycles, with "early termination" (nearer 2 than 12 - this depends on the leading zeroes and ones of the operands). Mls takes 2 cycles.
 
Status
Not open for further replies.
Back
Top