
Thread: code generation bug? multiply affects later shift? (Teensy 4.0) can loop?

  1. #1
    Senior Member
    Join Date
    Jul 2020
    Posts
    398

    code generation bug? multiply affects later shift? can loop? (Teensy 4.0)

    Board Teensy 4.0
    System MacOS 10.15.5 / Macbook Pro
    Arduino 1.8.12 / Teensyduino 1.52

    I whittled down to a simpler testcase:
    Code:
    #define START 4  // values less than 4 work, more than 4 hang after first line of output
    #define END 8
    
    
    void setup() 
    {
      Serial.begin (115200) ;
      for (int n = START ; n <= END ; n++)
      {
        // value of 1<<n prints correct here:
        Serial.print (n) ; Serial.print (" shifted=") ; Serial.print (1<<n) ;  Serial.print (" ... ") ;
        int foo;
        for (int i = 0 ; i < (1<<n) ; i++)
        {
          foo = i * 0x08000000 ;  
        }
        // The value of 1<<n is printed as 16 whatever n is at this point
        Serial.print (n) ; Serial.print (" shifted=") ; Serial.print (1<<n) ; Serial.print ("   foo=0x") ; Serial.println (foo, HEX) ;
        delay (50) ;
      }
    }
    
    void loop() 
    {
    }
    The output is
    Code:
    4 shifted=16 ... 4 shifted=16   foo=0x78000000
    5 shifted=32 ... 5 shifted=16   foo=0xF8000000
    6 shifted=64 ... 6 shifted=16   foo=0xF8000000
    7 shifted=128 ... 7 shifted=16   foo=0xF8000000
    8 shifted=256 ... 8 shifted=16   foo=0xF8000000
    So after the crucial multiplication(s) in the loop the attempt to do a left shift of 1 by n always seems to produce 16.
    The multiplication seems to have to overflow (perhaps in a specific way) to trigger the issue. This feels like a
    code-generator bug to do with modelling the kill set of such instructions.
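
    To pin down where the overflow starts, the same product can be formed in 64-bit arithmetic and compared against the 32-bit signed limit. This is a minimal sketch, not code from the thread; the helper name first_overflowing_index is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): returns the smallest i in
   [0, limit) for which i * factor no longer fits in a 32-bit signed int,
   or -1 if every product fits. The product is formed in 64 bits, so the
   check itself never overflows. */
int32_t first_overflowing_index(int32_t factor, int32_t limit)
{
    assert(factor >= 0 && limit >= 0);
    for (int32_t i = 0; i < limit; i++) {
        int64_t wide = (int64_t)i * (int64_t)factor;
        if (wide > INT32_MAX) {
            return i;
        }
    }
    return -1;
}
```

    With factor 0x08000000 the first index that overflows is 16, so the n=4 pass (i up to 15) stays in range and every later pass does not. That lines up with 16 being the value that keeps appearing in the bad output.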

    If START is changed to less than 4, it seems to work OK though:
    Code:
    3 shifted=8 ... 3 shifted=8   foo=0x38000000
    4 shifted=16 ... 4 shifted=16   foo=0x78000000
    5 shifted=32 ... 5 shifted=32   foo=0xF8000000
    6 shifted=64 ... 6 shifted=64   foo=0xF8000000
    7 shifted=128 ... 7 shifted=128   foo=0xF8000000
    8 shifted=256 ... 8 shifted=256   foo=0xF8000000
    Which is very counter-intuitive.

    Or if START is more than 4 it appears to jam the processor:
    Code:
    5 shifted=32 ...
    Last edited by MarkT; 07-04-2020 at 11:48 PM. Reason: Improve subject line

  2. #2
    Senior Member
    Join Date
    Nov 2012
    Posts
    1,454
    It appears to be a problem with optimizations. The default Optimize option of "Faster" uses -O2 and this produces the strange result, as does "Smallest Code", which uses -Os. Changing Optimize to "Fast" uses -O1 and produces the expected (correct) result, as does "Fastest", which uses -O3.

    Pete

  3. #3
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    644
    Interesting. Changing i to unsigned seems to fix it.
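
    A rough sketch of that change, reduced to plain C so it can be tried off-target. The function name last_foo_unsigned is an assumption for illustration; the loop body and constant are taken from the test case.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the inner loop of the test case with an unsigned counter.
   Unsigned multiplication is defined to wrap modulo 2^32, so the compiler
   cannot assume the product stays in range. */
uint32_t last_foo_unsigned(unsigned n)
{
    assert(n < 32);                 /* keep the shift itself well defined */
    uint32_t foo = 0;
    for (uint32_t i = 0; i < (UINT32_C(1) << n); i++) {
        foo = i * UINT32_C(0x08000000);
    }
    return foo;
}
```

    For n = 4 this returns 0x78000000 and for n = 5 through 8 it returns 0xF8000000, the same foo values the original sketch printed.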

  4. #4
    Senior Member
    Join Date
    Jul 2020
    Posts
    398
    Neither of those is a 'fix'; they are workarounds. Which compiler is being used, and is there a bug report site for it?

    I'm pretty sure it's a code-generator bug rather than an optimizer bug, even though changing the optimization level may hide it. It has the
    feel of mis-modelling the instruction set semantics (I've worked on compilers before, and super-scalar architectures make this
    area much more complex for the compiler). -O3 no doubt hides it by caching the value of 1<<n so it's not recalculated after
    the multiply. Lower optimization probably doesn't attempt any fancy dual-issue trickery in the code generator.

  5. #5
    Member
    Join Date
    Apr 2020
    Location
    Germany, NRW
    Posts
    86
    Try to put the code without the serial setup stuff in some testfn. Call testfn from setup. Compile and objdump -D firmware.elf, look at testfn.

    Here it is like this:

    Code:
    0000008c <_Z6testfnv>:
          8c:   b508        push    {r3, lr}
          8e:   2205        movs    r2, #5                                     ; start
          90:   2320        movs    r3, #32                                   ; end
          92:   48024902    stmdami r2, {r1, r8, fp, lr}
          96:   f005 fd21   bl  5adc <_ZN5Print6printfEPKcz>
          9a:   0240e7fe    subeq   lr, r0, #66584576   ; 0x3f80000
          9e:   2000        movs    r0, #0
          a0:   20000f4c    andcs   r0, r0, ip, asr #30
    ; hm, no function epilog?
    000000a4 <setup>:
    Which looks to me like only some part of the fn is generated.

    If you change the constant from 0x0800_0000 to 0x0080_0000, this gets generated:

    Code:
    0000008c <_Z6testfnv>:
          8c:   b530        push    {r4, r5, lr}
          8e:   2405        movs    r4, #5
          90:   b083        sub sp, #12
          92:   46222501    strtmi  r2, [r2], -r1, lsl #10
          96:   490a        ldr r1, [pc, #40]   ; (c0 <_Z6testfnv+0x34>)
          98:   480a40a5    stmdami sl, {r0, r2, r5, r7, lr}
          9c:   462b        mov r3, r5
          9e:   fd3bf005    ldc2    0, cr15, [fp, #-20]!    ; 0xffffffec
          a2:   1e68        subs    r0, r5, #1
          a4:   34014622    strcc   r4, [r1], #-1570    ; 0xfffff9de
          a8:   05c0        lsls    r0, r0, #23
          aa:   4906462b    stmdbmi r6, {r0, r1, r3, r5, r9, sl, lr}
          ae:   9000        str r0, [sp, #0]
          b0:   f0054804            ; <UNDEFINED> instruction: 0xf0054804 ; mul?
          b4:   fd31 2c09   ldc2    12, cr2, [r1, #-36]!    ; 0xffffffdc
          b8:   b003d1eb    andlt   sp, r3, fp, ror #3
          bc:   bd30        pop {r4, r5, pc}                       ; return
          be:   bf00        nop
          c0:   0240        lsls    r0, r0, #9                       ; pointers to data (0x2000240, 0x20000f6c, 0x20000258)
          c2:   2000        movs    r0, #0
          c4:   0f6c        lsrs    r4, r5, #29
          c6:   2000        movs    r0, #0
          c8:   0258        lsls    r0, r3, #9
          ca:   2000        movs    r0, #0
    Which works.

    Changed to START=4 this gets generated:

    Code:
    0000008c <_Z6testfnv>:
          8c:   b538        push    {r3, r4, r5, lr}
          8e:   2504        movs    r5, #4
          90:   2401        movs    r4, #1
          92:   48184629    ldmdami r8, {r0, r3, r5, r9, sl, lr}
          96:   40ac        lsls    r4, r5
          98:   fda4f005    stc2    0, cr15, [r4, #20]!
          9c:   2109        movs    r1, #9
          9e:   f0074817            ; <UNDEFINED> instruction: 0xf0074817
          a2:   fb56 4621           ; <UNDEFINED> instruction: 0xfb564621
          a6:   f0054814            ; <UNDEFINED> instruction: 0xf0054814
          aa:   fd9c 2105   ldc2    1, cr2, [ip, #20]
          ae:   f0074814            ; <UNDEFINED> instruction: 0xf0074814
          b2:   fb4e 4629           ; <UNDEFINED> instruction: 0xfb4e4629
          b6:   35014810    strcc   r4, [r1, #-2064]    ; 0xfffff7f0
          ba:   f005 fd93   bl  5be4 <_ZN5Print5printEl>
          be:   480e2109    stmdami lr, {r0, r3, r8, sp}
          c2:   f007 fb45   bl  7750 <usb_serial_write>
          c6:   2110        movs    r1, #16
          c8:   480b        ldr r0, [pc, #44]   ; (f8 <_Z6testfnv+0x6c>)
          ca:   f005 fd8b   bl  5be4 <_ZN5Print5printEl>
          ce:   2109        movs    r1, #9
          d0:   480c        ldr r0, [pc, #48]   ; (104 <_Z6testfnv+0x78>)
          d2:   f007 fb3d   bl  7750 <usb_serial_write>
          d6:   1e61        subs    r1, r4, #1
          d8:   22102300    andscs  r2, r0, #0, 6
          dc:   06c9        lsls    r1, r1, #27
          de:   f0054806            ; <UNDEFINED> instruction: 0xf0054806
          e2:   fd40 4804   stc2l   8, cr4, [r0, #-16]
          e6:   f005 fd15   bl  5b14 <_ZN5Print7printlnEv>
          ea:   f0052032            ; <UNDEFINED> instruction: 0xf0052032
          ee:   ff90 2d09           ; <UNDEFINED> instruction: 0xff902d09
          f2:   bd38d1cd    ldfltd  f5, [r8, #-820]!    ; 0xfffffccc
          f6:   bf00        nop
          f8:   0f4c        lsrs    r4, r1, #29
          fa:   2000        movs    r0, #0
          fc:   00e4        lsls    r4, r4, #3
          fe:   2000        movs    r0, #0
         100:   00f0        lsls    r0, r6, #3
         102:   2000        movs    r0, #0
         104:   200000f8    strdcs  r0, [r0], -r8
    Funny that the objdump from the toolchain is not able to disassemble the code the toolchain generated.

    This is gcc version 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204] (GNU Tools for Arm Embedded Processors 7-2017-q4-major)

  6. #6
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    644
    Same problem, but slightly clearer. Bad result is:

    1<<4 = 16 vs 1616161616161616161616161616161616

    1<<5 = 32 vs 3232...3216 <-- 16 here is wrong

    1<<6 = 64 vs 64...6416

    Code:
    #define START 4  // values less than 4 work, more than 4 hang after first line of output
    #define END 6
    
    int foo;
    
    void setup()
    {
      Serial.begin (115200) ;
      delay(1000);
      setup2();
    }
    
    void setup2()
    {
      for (unsigned n = START ; n <= END ; n++)
      {
        int j = 1 << n;
    
        // value of j prints correct here:
        Serial.print ("1<<"); Serial.print (n) ; Serial.print (" = ") ; Serial.print (j) ;  Serial.print (" vs ");
    
        for (int i = 0 ; i < j ; i++)    // 0 through 16,32 ...
        { 
          foo = i * 0x08000000;
          Serial.print (j) ;
        }
    
        // The value of j is incorrectly printed as 16 at this point
        Serial.println (j) ; 
      }
    }
    
    void loop()
    {
    }

  7. #7
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,920
    It happens on a Teensy LC, too (Cortex-M0+), and still with GCC 9.

    Seems to be a serious bug somewhere in GCC when optimizing.

  8. #8
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    644
    It also occurs on gcc 7.5.0, x86_64-linux-gnu with -O2 or -O3.

    Good for looking at gcc output: https://godbolt.org/

    Code:
    #include <stdio.h>
    void
    setup2 ()
    {
      for (int n = 4; n <= 6; n++)
        {
          int j = 1 << n;
          int foo;
    
          printf ("%d ", j);	// correct here
    
          for (int i = 0; i < j; i++)	// 0 through 16,32 ...
    	{
    	  foo = i * 0x08000000;
    	  printf ("%d ", j);	// correct here
    	}
    
          printf ("<> %d\n", j);	// incorrect here
        }
    }
    
    int
    main ()
    {
      setup2 ();
    }

  9. #9
    Senior Member
    Join Date
    May 2015
    Location
    USA
    Posts
    644
    Looked into it more. Signed integer overflow is undefined, meaning that when it occurs, the compiler is free to generate any "bug" it wants. There is no problem with gcc here.
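
    One way to keep the multiply out of undefined territory is to detect the overflow before using the result. The sketch below assumes GCC or Clang, since __builtin_mul_overflow is a compiler extension rather than standard C; the wrapper name checked_scale is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true and stores i * 0x08000000 in *out when the product fits in
   an int32_t; returns false when it would overflow. __builtin_mul_overflow
   is a GCC/Clang extension, not standard C. */
bool checked_scale(int32_t i, int32_t *out)
{
    assert(out != NULL);
    return !__builtin_mul_overflow(i, (int32_t)0x08000000, out);
}
```

    Called with i = 15 it succeeds and stores 0x78000000; called with i = 16 it reports failure, which is the point where the original loop leaves defined behaviour.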

  10. #10
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,920
    Quote Originally Posted by jonr View Post
    Signed integer overflow is undefined, meaning that when it occurs, the compiler is free to generate any "bug" it wants.
    Hm, I don't think so. The overflow does not influence the loop, and with optimization the whole "foo" should be optimized away - overflow or not.

  11. #11
    Senior Member
    Join Date
    Jul 2020
    Posts
    398
    Actually, after a quick bit of research, it turns out signed overflow gives undefined program semantics: the behaviour of any C or C++ program
    that ever overflows a signed operation is undefined from then on.

    I think the workaround for this wretched state of affairs in the language definition is to use gcc/g++'s -fwrapv flag.
    It fixes this problem, and I can't think of a good reason to ever not use this... Just added it to the boards.txt
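
    A narrower alternative to a global flag is to request wrapping only where it is wanted, by doing the arithmetic in unsigned and converting back. This is a sketch of that idea, not something proposed in the thread; wrapping_mul_i32 is a hypothetical name, and the final conversion is implementation-defined in C (GCC documents it as reduction modulo 2^32).

```c
#include <assert.h>
#include <stdint.h>

/* Guard against platforms where uint32_t is narrower than unsigned int and
   would be promoted to a signed type before the multiply. */
static_assert(sizeof(unsigned int) == sizeof(uint32_t),
              "uint32_t must not be promoted to a wider signed int");

/* Multiplies two 32-bit signed values with two's-complement wraparound.
   The multiply happens in unsigned arithmetic, which is defined to wrap. */
int32_t wrapping_mul_i32(int32_t a, int32_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    return (int32_t)product;  /* implementation-defined; GCC wraps modulo 2^32 */
}
```

    Here wrapping_mul_i32(16, 0x08000000) yields INT32_MIN instead of invoking undefined behaviour, and the rest of the program keeps the default overflow rules.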

  12. #12
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    6,920
    Quote Originally Posted by MarkT View Post
    It fixes this problem, and I can't think of a good reason to ever not use this... Just added it to the boards.txt
    It would be the default if it were that good.
    But it isn't.
    Programs that work perfectly now can behave differently.
    It affects all existing software (even if only through optimization), and I can't think of a good reason to enable it.
    The buggy program above is not a good reason.

    Would be better to fix GCC optimization.
    Is there a GCC BUG Report for this?
