Binary Pattern for branch decision

Status
Not open for further replies.

Manu

Well-known member
Hi,

In "The definitive guide to arm cortex-m3 and cortex-m4 processors" by Joseph Yiu (third edition) there is a trick about branch decision (if then) and binary patterns that lead to smaller code.

Chapter 23.1, page 738 :

Code:
void is_prime_number(unsigned int i)
{
  if ((i == 2) || (i = 3) || (i = 5) || (i == 7) ||
      (i == 11) || (i = 13) || (i = 17) || (i == 19) ||
      (i == 23) || (i = 29) || (i = 31))
  {
  printf ('- %D\n",i);
  }
  return;
}

Can also be :

Code:
void is_prime_number(unsigned int i)
{
  // bit pattern is 
  // 31:0 - 1010 0000 1000 1010 0010 1000 1010 1100 = 0xA08A28AC
  if ((1<<i) & (0xA08A28AC))
  {
  printf ('- %D\n",i);
  }
  return;
}


I can't figure how to calculate this pattern and I don't find any good information over the net for this. Can someone please explain haw to calculate it ?

There is an article about this here :
https://www.sciencedirect.com/topics/engineering/conditional-branch

Thank you,
Manu

Edit : typo & link
 
I don't know how to calculate this magic number but the compiler does. It compiles both of your functions to more or less the same code including your magic number. These days it is difficult to outsmart a compiler :)

See here for the disassembly: https://godbolt.org/z/shTad4v56 (see labels L6 and L15 which both are your number)
 
i is assumed to be a number between 0 and 31. Each possible value corresponds to a single bit, bit i.
For example i=0 corresponds to the least significant bit (bit 0), and i=31 to the most significant bit (bit 31).
You construct a mask with zeros except at the bits corresponding to the acceptable values of i.

For example, if you want to allow 1, 5, and 13, the mask is (1<<1) | (1<<5) | (1<<13) = 0b10000000100010 = 0010 0000 0010 0010 = 0x2022 = 8226.
 
Thank you, but I still not understand the way to build the mask. In your example you use 1 and shift it per the number, and then look at the bit position. Right

(1<<1) | (1<<5) | (1<<13) = 0b10000000100010 = 0010 0000 0010 0010 = 0x2022 = 8226

So this mean that if I want to check if condition is "1", I have to shift 1 per 1bit than check a mask with the shifted bit mask ?

There is really something I can't figure out..

@luni : I'm absolutely not sure about the compiler behaviour, even if it may be good. This is why I want to confirm using the 2 methods and looking at the disassembly for teensy ;-)
 
So, if I understand :

if ((i == 1) || (i = 2) || (i = 4) || (i == 6) || (i == 8) || (i = 9) || (i = 11))

(1<<1) | (1<<2) | (1<<4) | (1<<6) | (1<<8) | (1<<9) | (1<<11) = 0b 101101010110 = 0000 1011 0101 0110 = 0xB56 = 2902

Correct ?
 
@luni => thank you, I just NOW understand your link. It this absolutely true with VScode and teensy ?
 
Manu: Correct.

The test is ((1 << i) & mask) and literally means, "if the i'th bit in mask is set, then...".

To construct the mask, you take each acceptable value v, shift 1 (value one) left by v number of bits –– this being (1 << v) ––, and OR those together.

To deconstruct a mask to the values it allows, just read the bits in the pattern; if bit b is set (b = 0..31, inclusive), then value b is allowed by the mask.
Mask 0xA08A28AC = 1010 0000 1000 1010 0010 1000 1010 1100 has bits 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31 set, and therefore accepts i that has any of those values.

I'm in agreement with Luni: this is best left to the compiler to optimize.

However, if you do insist, there is a way to construct the mask at compile time (run time if you use variables instead of literal numbers), using a set of preprocessor macros, one that evaluates to one of 32 macros based on the number of arguments, and those macros all boiling down to the above for each argument. It essentially looks like if (IS_ONE_OF(i, 2, 3, 5, 7)) (to accept i having value 2, 3, 5, or 7; up to 32 values accepted, limited to range 0..31).

I'm not sure if I should show it here, because it seriously isn't the way I'd want to see code implemented. On the other hand, it is an interesting trick, that could come in handy; and I myself have used such macros in a slightly different situation (at EEVBlog forums, for expanding a function-like macro to a different function depending on whether it has 2, 3, or 4 arguments).

Should I? Or is it a bad example, because one really should be careful when using such? "Tricky" code belongs in Obfuscated C code contest, not in anything you intend to use or sell.
 
Well, thank to both of you. I was interested in code optimisation for a project in which I have 4 "big if". I was curious about the potential gain with pattern since I had read that it can save many instructions (the link I provide in the first post demonstrate it).

Both of you teach me one thing. One teach me how to compose and decompose patterns, one teach me that it's useless.

Thank you.
Manu

P.S. : Question => why "trustable" literature is so weak ?

@luni : we have enough power in recent processors, but this save 2 instructions : https://godbolt.org/z/n5P37eP3M
 
@luni => thank you, I just NOW understand your link. It this absolutely true with VScode and teensy ?

Yes, it is the same compiler and if you give it the correct options it will generate the same code. Here https://godbolt.org/z/WT5jz7T1W with options for a T3.2 optimization level O2

And this is what the actual "Teensy" compiler generates (cut out from the list file) I'd say this is exactly what the compiler explorer shows but IMHO the compiler explorer output is more easy to read.

Code:
void   __attribute__ ((noinline)) is_prime_number1(unsigned int i)
{
     46c:	cmp	r0, #31
     46e:	bls.n	472 <is_prime_number1(unsigned int)+0x6>
     470:	bx	lr
     472:	ldr	r3, [pc, #16]	; (484 <is_prime_number1(unsigned int)+0x18>)
     474:	lsrs	r3, r0
     476:	lsls	r3, r3, #31
     478:	bpl.n	470 <is_prime_number1(unsigned int)+0x4>
     47a:	mov	r2, r0
     47c:	ldr	r1, [pc, #8]	; (488 <is_prime_number1(unsigned int)+0x1c>)
     47e:	ldr	r0, [pc, #12]	; (48c <is_prime_number1(unsigned int)+0x20>)
     480:	b.w	13cc <Print::printf(char const*, ...)>
     484:	.word	0xa08a28ac
     488:	.word	0x00006640
     48c:	.word	0x1fff8728

void  __attribute__ ((noinline)) is_prime_number2(unsigned int i)
{
     490:	movs	r1, #1
     492:	ldr	r3, [pc, #20]	; (4a8 <is_prime_number2(unsigned int)+0x18>)
     494:	lsls	r1, r0
     496:	ands	r3, r1
     498:	cbnz	r3, 49c <is_prime_number2(unsigned int)+0xc>
     49a:	bx	lr
     49c:	mov	r2, r0
     49e:	ldr	r1, [pc, #12]	; (4ac <is_prime_number2(unsigned int)+0x1c>)
     4a0:	ldr	r0, [pc, #12]	; (4b0 <is_prime_number2(unsigned int)+0x20>)
     4a2:	b.w	13cc <Print::printf(char const*, ...)>
     4a6:	nop
     4a8:	.word	0xa08a28ac
     4ac:	.word	0x00006640
     4b0:	.word	0x1fff8728

I used this code:

Code:
#include "Arduino.h"

void   __attribute__ ((noinline)) is_prime_number1(unsigned int i)
{
    if ((i == 2) || (i == 3) || (i == 5) || (i == 7) ||
      (i == 11) || (i == 13) || (i == 17) || (i == 19) ||
      (i == 23) || (i == 29) || (i == 31))
    {
        Serial.printf("- %D\n", i);
    }
    return;
}

void  __attribute__ ((noinline)) is_prime_number2(unsigned int i)
{
    // bit pattern is
    // 31:0 - 1010 0000 1000 1010 0010 1000 1010 1100 = 0xA08A28AC
    if ((1 << i) & (0xA08A28AC))
    {
        Serial.printf("- %D\n", i);
    }
    return;
}

void setup()
{
    pinMode(LED_BUILTIN, OUTPUT);

    is_prime_number1(17);
    is_prime_number2(17);
}

void loop()
{
    digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
    delay(500);
}

BTW: In vsCode one can install the compiler explorer locally then it operates on exactly your code including all the Arduino/Teensyduino includes. It updates the assembly directly in a vsCode window while you type which is very useful if you are interested in such stuff or need to optimize something.
 
@luni : we have enough power in recent processors, but this save 2 instructions : https://godbolt.org/z/n5P37eP3M

Yes, if all numbers are smaller than 11 the second code is more efficient. For i > 11 the first one wins since the conditional expressions (those ending LS) are not executed. Anyway, my experience is, most time this kind of optimization is not worth the effort and makes the code more difficult to read. Modern compilers are amazingly good...

Code:
foo1(unsigned int):
        cmp     r0, #11
        ldrls   r3, .L5
        movls   r0, r3, lsr r0
        andls   r0, r0, #1
        movhi   r0, #0
        bx      lr
.L5:
        .word   2902

foo2(unsigned int):
        ldr     r3, .L8
        mov     r0, r3, asr r0
        and     r0, r0, #1
        bx      lr
.L8:
        .word   2902
 
I'd also use the real switches like Arduino does:

-mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -O3

especially you need -mthumb -mcpu=cortex-m7 to generate the same code. the other both are useful to see floating point.
without the cpu switches, it generates ARM code, not ARM-Thumb. The Teensys don't speak the full featured "ARM" code.

If I do that, the output looks very different: https://godbolt.org/z/vYxPTbene
 
Frank B: The difference is that foo1() verifies i is between 1 and 11, inclusive, and subtracts one from it so that the mask always has the least significant bit set. 2×1451 == 2902 == 0xb56, so no oddities in the implementation.

Both use form ((mask >> i) & 1) instead, which is functionally equivalent to !!((1<<i)&mask). In Thumb, if i is in rA register, it only takes three instructions to convert it to the result: ldr rB, address_of_mask then lsr rA, rB, rA and finally and rA, rA, #1 for a total of fourteen bytes of Flash: 10 for code, 4 for the constant.

Compiler optimization in C and C++ has evolved a lot in the last two decades, so it's no wonder literature has trouble keeping apace. This is also one of the easier patterns to detect in the abstract syntax trees, so no wonder the compilers do optimize it. When I first learned C in the early nineties, the C compilers I had access to didn't do much optimization at all.
 
Frank B: The difference is that foo1() v
@Nominal: If you re-read the posts, you will notice that I did not write a single word about the code.
I said, that if you use godbolt, you may want to use the correct compiler settings. Otherwise you will see output for a different CPU.
And before you complain: The floating-points settings are added only to show show them, not because they are needed in this case.
 
No, I just misunderstood you. I thought that by "output looks different", you meant between the two functions in that listing.
I didn't realize you meant compared to when using different compiler options. That's all.
 
Status
Not open for further replies.
Back
Top