Forum Rule: Always post complete source code & details to reproduce any issue!
Results 1 to 18 of 18

Thread: Binary Pattern for branch decision

  1. #1
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160

    Binary Pattern for branch decision

    Hi,

    In "The definitive guide to arm cortex-m3 and cortex-m4 processors" by Joseph Yiu (third edition) there is a trick about branch decision (if then) and binary patterns that lead to smaller code.

    Chapter 23.1, page 738 :

    Code:
    void is_prime_number(unsigned int i)
    {
      if ((i == 2) || (i = 3) || (i = 5) || (i == 7) ||
          (i == 11) || (i = 13) || (i = 17) || (i == 19) ||
          (i == 23) || (i = 29) || (i = 31))
      {
      printf ('- %D\n",i);
      }
      return;
    }
    Can also be :

    Code:
    void is_prime_number(unsigned int i)
    {
      // bit pattern is 
      // 31:0 - 1010 0000 1000 1010 0010 1000 1010 1100 = 0xA08A28AC
      if ((1<<i) & (0xA08A28AC))
      {
      printf ('- %D\n",i);
      }
      return;
    }

    I can't figure how to calculate this pattern and I don't find any good information over the net for this. Can someone please explain haw to calculate it ?

    There is an article about this here :
    https://www.sciencedirect.com/topics...itional-branch

    Thank you,
    Manu

    Edit : typo & link

  2. #2
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,643
    I don't know how to calculate this magic number but the compiler does. It compiles both of your functions to more or less the same code including your magic number. These days it is difficult to outsmart a compiler :-)

    See here for the disassembly: https://godbolt.org/z/shTad4v56 (see labels L6 and L15 which both are your number)

  3. #3
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    239
    i is assumed to be a number between 0 and 31. Each possible value corresponds to a single bit, bit i.
    For example i=0 corresponds to the least significant bit (bit 0), and i=31 to the most significant bit (bit 31).
    You construct a mask with zeros except at the bits corresponding to the acceptable values of i.

    For example, if you want to allow 1, 5, and 13, the mask is (1<<1) | (1<<5) | (1<<13) = 0b10000000100010 = 0010 0000 0010 0010 = 0x2022 = 8226.

  4. #4
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    Thank you, but I still not understand the way to build the mask. In your example you use 1 and shift it per the number, and then look at the bit position. Right

    (1<<1) | (1<<5) | (1<<13) = 0b10000000100010 = 0010 0000 0010 0010 = 0x2022 = 8226

    So this mean that if I want to check if condition is "1", I have to shift 1 per 1bit than check a mask with the shifted bit mask ?

    There is really something I can't figure out..

    @luni : I'm absolutely not sure about the compiler behaviour, even if it may be good. This is why I want to confirm using the 2 methods and looking at the disassembly for teensy ;-)

  5. #5
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    So, if I understand :

    if ((i == 1) || (i = 2) || (i = 4) || (i == 6) || (i == 8) || (i = 9) || (i = 11))

    (1<<1) | (1<<2) | (1<<4) | (1<<6) | (1<<8) | (1<<9) | (1<<11) = 0b 101101010110 = 0000 1011 0101 0110 = 0xB56 = 2902

    Correct ?

  6. #6
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    @luni => thank you, I just NOW understand your link. It this absolutely true with VScode and teensy ?

  7. #7
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    239
    Manu: Correct.

    The test is ((1 << i) & mask) and literally means, "if the i'th bit in mask is set, then...".

    To construct the mask, you take each acceptable value v, shift 1 (value one) left by v number of bits –– this being (1 << v) ––, and OR those together.

    To deconstruct a mask to the values it allows, just read the bits in the pattern; if bit b is set (b = 0..31, inclusive), then value b is allowed by the mask.
    Mask 0xA08A28AC = 1010 0000 1000 1010 0010 1000 1010 1100 has bits 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31 set, and therefore accepts i that has any of those values.

    I'm in agreement with Luni: this is best left to the compiler to optimize.

    However, if you do insist, there is a way to construct the mask at compile time (run time if you use variables instead of literal numbers), using a set of preprocessor macros, one that evaluates to one of 32 macros based on the number of arguments, and those macros all boiling down to the above for each argument. It essentially looks like if (IS_ONE_OF(i, 2, 3, 5, 7)) (to accept i having value 2, 3, 5, or 7; up to 32 values accepted, limited to range 0..31).

    I'm not sure if I should show it here, because it seriously isn't the way I'd want to see code implemented. On the other hand, it is an interesting trick, that could come in handy; and I myself have used such macros in a slightly different situation (at EEVBlog forums, for expanding a function-like macro to a different function depending on whether it has 2, 3, or 4 arguments).

    Should I? Or is it a bad example, because one really should be careful when using such? "Tricky" code belongs in Obfuscated C code contest, not in anything you intend to use or sell.

  8. #8
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    Well, thank to both of you. I was interested in code optimisation for a project in which I have 4 "big if". I was curious about the potential gain with pattern since I had read that it can save many instructions (the link I provide in the first post demonstrate it).

    Both of you teach me one thing. One teach me how to compose and decompose patterns, one teach me that it's useless.

    Thank you.
    Manu

    P.S. : Question => why "trustable" literature is so weak ?

    @luni : we have enough power in recent processors, but this save 2 instructions : https://godbolt.org/z/n5P37eP3M

  9. #9
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,643
    Quote Originally Posted by Manu View Post
    @luni => thank you, I just NOW understand your link. It this absolutely true with VScode and teensy ?
    Yes, it is the same compiler and if you give it the correct options it will generate the same code. Here https://godbolt.org/z/WT5jz7T1W with options for a T3.2 optimization level O2

    And this is what the actual "Teensy" compiler generates (cut out from the list file) I'd say this is exactly what the compiler explorer shows but IMHO the compiler explorer output is more easy to read.

    Code:
    void   __attribute__ ((noinline)) is_prime_number1(unsigned int i)
    {
         46c:	cmp	r0, #31
         46e:	bls.n	472 <is_prime_number1(unsigned int)+0x6>
         470:	bx	lr
         472:	ldr	r3, [pc, #16]	; (484 <is_prime_number1(unsigned int)+0x18>)
         474:	lsrs	r3, r0
         476:	lsls	r3, r3, #31
         478:	bpl.n	470 <is_prime_number1(unsigned int)+0x4>
         47a:	mov	r2, r0
         47c:	ldr	r1, [pc, #8]	; (488 <is_prime_number1(unsigned int)+0x1c>)
         47e:	ldr	r0, [pc, #12]	; (48c <is_prime_number1(unsigned int)+0x20>)
         480:	b.w	13cc <Print::printf(char const*, ...)>
         484:	.word	0xa08a28ac
         488:	.word	0x00006640
         48c:	.word	0x1fff8728
    
    void  __attribute__ ((noinline)) is_prime_number2(unsigned int i)
    {
         490:	movs	r1, #1
         492:	ldr	r3, [pc, #20]	; (4a8 <is_prime_number2(unsigned int)+0x18>)
         494:	lsls	r1, r0
         496:	ands	r3, r1
         498:	cbnz	r3, 49c <is_prime_number2(unsigned int)+0xc>
         49a:	bx	lr
         49c:	mov	r2, r0
         49e:	ldr	r1, [pc, #12]	; (4ac <is_prime_number2(unsigned int)+0x1c>)
         4a0:	ldr	r0, [pc, #12]	; (4b0 <is_prime_number2(unsigned int)+0x20>)
         4a2:	b.w	13cc <Print::printf(char const*, ...)>
         4a6:	nop
         4a8:	.word	0xa08a28ac
         4ac:	.word	0x00006640
         4b0:	.word	0x1fff8728
    I used this code:

    Code:
    #include "Arduino.h"
    
    void   __attribute__ ((noinline)) is_prime_number1(unsigned int i)
    {
        if ((i == 2) || (i == 3) || (i == 5) || (i == 7) ||
          (i == 11) || (i == 13) || (i == 17) || (i == 19) ||
          (i == 23) || (i == 29) || (i == 31))
        {
            Serial.printf("- %D\n", i);
        }
        return;
    }
    
    void  __attribute__ ((noinline)) is_prime_number2(unsigned int i)
    {
        // bit pattern is
        // 31:0 - 1010 0000 1000 1010 0010 1000 1010 1100 = 0xA08A28AC
        if ((1 << i) & (0xA08A28AC))
        {
            Serial.printf("- %D\n", i);
        }
        return;
    }
    
    void setup()
    {
        pinMode(LED_BUILTIN, OUTPUT);
    
        is_prime_number1(17);
        is_prime_number2(17);
    }
    
    void loop()
    {
        digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
        delay(500);
    }
    BTW: In vsCode one can install the compiler explorer locally then it operates on exactly your code including all the Arduino/Teensyduino includes. It updates the assembly directly in a vsCode window while you type which is very useful if you are interested in such stuff or need to optimize something.

  10. #10
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    I still have long to learn...

  11. #11
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,643
    @luni : we have enough power in recent processors, but this save 2 instructions : https://godbolt.org/z/n5P37eP3M
    Yes, if all numbers are smaller than 11 the second code is more efficient. For i > 11 the first one wins since the conditional expressions (those ending LS) are not executed. Anyway, my experience is, most time this kind of optimization is not worth the effort and makes the code more difficult to read. Modern compilers are amazingly good...

    Code:
    foo1(unsigned int):
            cmp     r0, #11
            ldrls   r3, .L5
            movls   r0, r3, lsr r0
            andls   r0, r0, #1
            movhi   r0, #0
            bx      lr
    .L5:
            .word   2902
    
    foo2(unsigned int):
            ldr     r3, .L8
            mov     r0, r3, asr r0
            and     r0, r0, #1
            bx      lr
    .L8:
            .word   2902

  12. #12
    Senior Member
    Join Date
    Jan 2015
    Location
    France
    Posts
    160
    thank you both ;-)

  13. #13
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    9,407
    Quote Originally Posted by luni View Post
    the conditional expressions (those ending LS) are not executed.
    Almost. They are executed as NOP -means, they need one cycle (however, don't know how this behaves on CM7 re: dual issue)

  14. #14
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    1,643
    Thanks Frank, wasn't aware of that. Here a link if someone is interested in details: https://developer.arm.com/documentat...0/d/ch05s04s03

  15. #15
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    9,407
    I'd also use the real switches like Arduino does:

    -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -O3

    especially you need -mthumb -mcpu=cortex-m7 to generate the same code. the other both are useful to see floating point.
    without the cpu switches, it generates ARM code, not ARM-Thumb. The Teensys don't speak the full featured "ARM" code.

    If I do that, the output looks very different: https://godbolt.org/z/vYxPTbene

  16. #16
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    239
    Frank B: The difference is that foo1() verifies i is between 1 and 11, inclusive, and subtracts one from it so that the mask always has the least significant bit set. 21451 == 2902 == 0xb56, so no oddities in the implementation.

    Both use form ((mask >> i) & 1) instead, which is functionally equivalent to !!((1<<i)&mask). In Thumb, if i is in rA register, it only takes three instructions to convert it to the result: ldr rB, address_of_mask then lsr rA, rB, rA and finally and rA, rA, #1 for a total of fourteen bytes of Flash: 10 for code, 4 for the constant.

    Compiler optimization in C and C++ has evolved a lot in the last two decades, so it's no wonder literature has trouble keeping apace. This is also one of the easier patterns to detect in the abstract syntax trees, so no wonder the compilers do optimize it. When I first learned C in the early nineties, the C compilers I had access to didn't do much optimization at all.

  17. #17
    Senior Member
    Join Date
    Apr 2014
    Location
    Germany
    Posts
    9,407
    Quote Originally Posted by Nominal Animal View Post
    Frank B: The difference is that foo1() v
    @Nominal: If you re-read the posts, you will notice that I did not write a single word about the code.
    I said, that if you use godbolt, you may want to use the correct compiler settings. Otherwise you will see output for a different CPU.
    And before you complain: The floating-points settings are added only to show show them, not because they are needed in this case.

  18. #18
    Senior Member
    Join Date
    Feb 2015
    Location
    Finland
    Posts
    239
    No, I just misunderstood you. I thought that by "output looks different", you meant between the two functions in that listing.
    I didn't realize you meant compared to when using different compiler options. That's all.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •