code generation bug? multiply affects later shift? (Teensy 4.0) can loop?

MarkT


Board Teensy 4.0
System MacOS 10.15.5 / Macbook Pro
Arduino 1.8.12 / Teensyduino 1.52

I whittled it down to a simpler test case:
Code:
#define START 4  // values less than 4 work, more than 4 hang after first line of output
#define END 8


void setup() 
{
  Serial.begin (115200) ;
  for (int n = START ; n <= END ; n++)
  {
    // value of 1<<n prints correct here:
    Serial.print (n) ; Serial.print (" shifted=") ; Serial.print (1<<n) ;  Serial.print (" ... ") ;
    int foo;
    for (int i = 0 ; i < (1<<n) ; i++)
    {
      foo = i * 0x08000000 ;  
    }
    // The value of 1<<n is printed as 16 whatever n is at this point
    Serial.print (n) ; Serial.print (" shifted=") ; Serial.print (1<<n) ; Serial.print ("   foo=0x") ; Serial.println (foo, HEX) ;
    delay (50) ;
  }
}

void loop() 
{
}

The output is
Code:
4 shifted=16 ... 4 shifted=16   foo=0x78000000
5 shifted=32 ... 5 shifted=16   foo=0xF8000000
6 shifted=64 ... 6 shifted=16   foo=0xF8000000
7 shifted=128 ... 7 shifted=16   foo=0xF8000000
8 shifted=256 ... 8 shifted=16   foo=0xF8000000

So after the crucial multiplication(s) in the loop the attempt to do a left shift of 1 by n always seems to produce 16.
The multiplication seems to have to overflow (perhaps in a specific way) to trigger the issue. This feels like a
code-generator bug to do with modelling the kill set of such instructions.
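To pin down where the multiplication starts to overflow, here is a minimal sketch in plain C. The helper name first_overflowing_index is made up for illustration and is not part of the test case; it assumes a 32-bit signed int, as on the Teensy 4.0, and does the arithmetic in 64 bits so the check itself cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original sketch): returns the smallest
   non-negative i for which i * factor no longer fits in a 32-bit signed int.
   The division is done in 64 bits, so the helper itself never overflows.
   Returns -1 when factor is zero or negative (case not handled here). */
int64_t first_overflowing_index(int32_t factor)
{
    if (factor <= 0)
        return -1;
    /* The largest i with i * factor <= INT32_MAX is INT32_MAX / factor. */
    return (int64_t)INT32_MAX / factor + 1;
}
```

For the factor 0x08000000 this gives 16: with n = 4 the loop stops at i = 15 (15 * 0x08000000 = 0x78000000, the foo value in the first output line), while any n of 5 or more multiplies past the signed range. That 16 coincides with the constant printed after the loop, which may or may not be a coincidence at this point in the thread.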

If START is changed to less than 4, it seems to work OK though:
Code:
3 shifted=8 ... 3 shifted=8   foo=0x38000000
4 shifted=16 ... 4 shifted=16   foo=0x78000000
5 shifted=32 ... 5 shifted=32   foo=0xF8000000
6 shifted=64 ... 6 shifted=64   foo=0xF8000000
7 shifted=128 ... 7 shifted=128   foo=0xF8000000
8 shifted=256 ... 8 shifted=256   foo=0xF8000000
Which is very counter-intuitive :)

Or if START is more than 4 it appears to jam the processor:
Code:
5 shifted=32 ...
 
It appears to be a problem with optimizations. The default Optimize option of "Faster" uses -O2 and this produces the strange result, as does "Smallest Code", which uses -Os. Changing Optimize to "Fast" uses -O1 and produces the expected (correct) result, as does "Fastest", which uses -O3.

Pete
 
Neither of those are 'fixes', they are workarounds :) Which compiler is being used - is there a bug report site for it?

I'm pretty sure it's a code-generator bug rather than an optimizer bug. Even though changing the optimization level may hide it, it has the
feel of mis-modelling the instruction set semantics (I've worked in compilers before, and super-scalar architectures make this
area much more complex for the compiler). -O3 no doubt hides it by caching the value of 1<<n so it's not recalculated after
the multiply. Lower optimization probably doesn't attempt any fancy dual-issue trickery in the code generator.
 
Try to put the code without the serial setup stuff in some testfn. Call testfn from setup. Compile and objdump -D firmware.elf, look at testfn.

Here it is like this:

Code:
0000008c <_Z6testfnv>:
      8c:   b508        push    {r3, lr}
      8e:   2205        movs    r2, #5                                     ; start
      90:   2320        movs    r3, #32                                   ; end
      92:   48024902    stmdami r2, {r1, r8, fp, lr}
      96:   f005 fd21   bl  5adc <_ZN5Print6printfEPKcz>
      9a:   0240e7fe    subeq   lr, r0, #66584576   ; 0x3f80000
      9e:   2000        movs    r0, #0
      a0:   20000f4c    andcs   r0, r0, ip, asr #30
; hm, no function epilog?
000000a4 <setup>:

To me this looks like only part of the function is generated.

If you change the constant from 0x0800_0000 to 0x0080_0000 (note the shift by 23 in the listing), this gets generated:

Code:
0000008c <_Z6testfnv>:
      8c:   b530        push    {r4, r5, lr}
      8e:   2405        movs    r4, #5
      90:   b083        sub sp, #12
      92:   46222501    strtmi  r2, [r2], -r1, lsl #10
      96:   490a        ldr r1, [pc, #40]   ; (c0 <_Z6testfnv+0x34>)
      98:   480a40a5    stmdami sl, {r0, r2, r5, r7, lr}
      9c:   462b        mov r3, r5
      9e:   fd3bf005    ldc2    0, cr15, [fp, #-20]!    ; 0xffffffec
      a2:   1e68        subs    r0, r5, #1
      a4:   34014622    strcc   r4, [r1], #-1570    ; 0xfffff9de
      a8:   05c0        lsls    r0, r0, #23
      aa:   4906462b    stmdbmi r6, {r0, r1, r3, r5, r9, sl, lr}
      ae:   9000        str r0, [sp, #0]
      b0:   f0054804            ; <UNDEFINED> instruction: 0xf0054804 ; mul?
      b4:   fd31 2c09   ldc2    12, cr2, [r1, #-36]!    ; 0xffffffdc
      b8:   b003d1eb    andlt   sp, r3, fp, ror #3
      bc:   bd30        pop {r4, r5, pc}                       ; return
      be:   bf00        nop
      c0:   0240        lsls    r0, r0, #9                       ; pointers to data (0x2000240, 0x20000f6c, 0x20000258)
      c2:   2000        movs    r0, #0
      c4:   0f6c        lsrs    r4, r5, #29
      c6:   2000        movs    r0, #0
      c8:   0258        lsls    r0, r3, #9
      ca:   2000        movs    r0, #0

Which works.

Changed to START=4 this gets generated:

Code:
0000008c <_Z6testfnv>:
      8c:   b538        push    {r3, r4, r5, lr}
      8e:   2504        movs    r5, #4
      90:   2401        movs    r4, #1
      92:   48184629    ldmdami r8, {r0, r3, r5, r9, sl, lr}
      96:   40ac        lsls    r4, r5
      98:   fda4f005    stc2    0, cr15, [r4, #20]!
      9c:   2109        movs    r1, #9
      9e:   f0074817            ; <UNDEFINED> instruction: 0xf0074817
      a2:   fb56 4621           ; <UNDEFINED> instruction: 0xfb564621
      a6:   f0054814            ; <UNDEFINED> instruction: 0xf0054814
      aa:   fd9c 2105   ldc2    1, cr2, [ip, #20]
      ae:   f0074814            ; <UNDEFINED> instruction: 0xf0074814
      b2:   fb4e 4629           ; <UNDEFINED> instruction: 0xfb4e4629
      b6:   35014810    strcc   r4, [r1, #-2064]    ; 0xfffff7f0
      ba:   f005 fd93   bl  5be4 <_ZN5Print5printEl>
      be:   480e2109    stmdami lr, {r0, r3, r8, sp}
      c2:   f007 fb45   bl  7750 <usb_serial_write>
      c6:   2110        movs    r1, #16
      c8:   480b        ldr r0, [pc, #44]   ; (f8 <_Z6testfnv+0x6c>)
      ca:   f005 fd8b   bl  5be4 <_ZN5Print5printEl>
      ce:   2109        movs    r1, #9
      d0:   480c        ldr r0, [pc, #48]   ; (104 <_Z6testfnv+0x78>)
      d2:   f007 fb3d   bl  7750 <usb_serial_write>
      d6:   1e61        subs    r1, r4, #1
      d8:   22102300    andscs  r2, r0, #0, 6
      dc:   06c9        lsls    r1, r1, #27
      de:   f0054806            ; <UNDEFINED> instruction: 0xf0054806
      e2:   fd40 4804   stc2l   8, cr4, [r0, #-16]
      e6:   f005 fd15   bl  5b14 <_ZN5Print7printlnEv>
      ea:   f0052032            ; <UNDEFINED> instruction: 0xf0052032
      ee:   ff90 2d09           ; <UNDEFINED> instruction: 0xff902d09
      f2:   bd38d1cd    ldfltd  f5, [r8, #-820]!    ; 0xfffffccc
      f6:   bf00        nop
      f8:   0f4c        lsrs    r4, r1, #29
      fa:   2000        movs    r0, #0
      fc:   00e4        lsls    r4, r4, #3
      fe:   2000        movs    r0, #0
     100:   00f0        lsls    r0, r6, #3
     102:   2000        movs    r0, #0
     104:   200000f8    strdcs  r0, [r0], -r8

Funny that the objdump from the toolchain is not able to disassemble the code the toolchain generated.

This is gcc version 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204] (GNU Tools for Arm Embedded Processors 7-2017-q4-major)
 
Same problem, but slightly clearer. Bad result is:

1<<4 = 16 vs 1616161616161616161616161616161616

1<<5 = 32 vs 3232...3216 <-- 16 here is wrong

1<<6 = 64 vs 64...6416

Code:
#define START 4  // values less than 4 work, more than 4 hang after first line of output
#define END 6

int foo;

void setup()
{
  Serial.begin (115200) ;
  delay(1000);
  setup2();
}

void setup2()
{
  for (unsigned n = START ; n <= END ; n++)
  {
    int j = 1 << n;

    // value of j prints correct here:
    Serial.print ("1<<"); Serial.print (n) ; Serial.print (" = ") ; Serial.print (j) ;  Serial.print (" vs ");

    for (int i = 0 ; i < j ; i++)    // 0 through 16,32 ...
    { 
      foo = i * 0x08000000;
      Serial.print (j) ;
    }

    // The value of j is incorrectly printed as 16 at this point
    Serial.println (j) ; 
  }
}

void loop()
{
}
 
It happens on a Teensy LC too (Cortex-M0+), and still with GCC 9.

Seems to be a serious bug somewhere in GCC when optimizing.
 
It also occurs on gcc 7.5.0, x86_64-linux-gnu with -O2 or -O3.

Good for looking at gcc output: https://godbolt.org/

Code:
#include <stdio.h>
void
setup2 ()
{
  for (int n = 4; n <= 6; n++)
    {
      int j = 1 << n;
      int foo;

      printf ("%d ", j);	// correct here

      for (int i = 0; i < j; i++)	// 0 through 16,32 ...
	{
	  foo = i * 0x08000000;
	  printf ("%d ", j);	// correct here
	}

      printf ("<> %d\n", j);	// incorrect here
    }
}

int
main ()
{
  setup2 ();
}
 
Looked into it more. Signed integer overflow is undefined, meaning that when it occurs, the compiler is free to generate any "bug" it wants. There is no problem with gcc here.
 
Signed integer overflow is undefined, meaning that when it occurs, the compiler is free to generate any "bug" it wants.
Hm, I don't think so. The overflow does not influence the loop, and with optimization the whole "foo" should be optimized away - overflow or not.
 
Actually, after a quick bit of research, it turns out signed overflow gives undefined program semantics - any C or C++ program
that ever overflows a signed operation is undefined thereafter.
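For comparison, unsigned arithmetic in C wraps modulo 2^32 by definition, so the same loop can be written without any undefined behaviour. A minimal sketch, assuming a 32-bit int and n below 32; the function name last_foo is invented for illustration and does not appear in the test case:

```c
#include <stdint.h>

/* Hypothetical rewrite of the inner loop using unsigned arithmetic.
   Unsigned multiplication wraps modulo 2^32, which is well-defined,
   so the optimizer cannot assume that the product never overflows.
   Returns the last value assigned to foo for a given n (n must be < 32). */
uint32_t last_foo(unsigned n)
{
    uint32_t foo = 0;
    for (uint32_t i = 0; i < (UINT32_C(1) << n); i++)
    {
        foo = i * UINT32_C(0x08000000);  /* wraps instead of overflowing */
    }
    return foo;
}
```

With this version last_foo(4) is 0x78000000 and last_foo(5) through last_foo(8) are 0xF8000000, the same foo values seen in the output earlier in the thread, but now guaranteed by the language rather than left to the compiler.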

I think the workaround for this wretched state of affairs in the language definition is to use gcc/g++'s -fwrapv flag.
It fixes this problem, and I can't think of a good reason to ever not use this... Just added it to the boards.txt
 
It fixes this problem, and I can't think of a good reason to ever not use this... Just added it to the boards.txt

It would be default, if it was that good.
But it isn't.
Programs which work perfectly now can behave differently.
It influences all existing software (even if only through optimization), and I can't think of a good reason to enable it.
The buggy program above is not a good reason.
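If the goal is to fix the program rather than change compiler flags, the overflow can be detected before it happens. A minimal sketch, assuming GCC 5 or later (or Clang), where the __builtin_mul_overflow builtin is available; it is a compiler extension, not standard C, and the wrapper name mul_overflows_i32 is made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper around the GCC/Clang builtin __builtin_mul_overflow.
   Returns true when a * b does not fit in a 32-bit signed int.
   The builtin performs the check without invoking undefined behaviour. */
bool mul_overflows_i32(int32_t a, int32_t b)
{
    int32_t wrapped;  /* receives the wrapped product; unused here */
    return __builtin_mul_overflow(a, b, &wrapped);
}
```

For the test case above, mul_overflows_i32(15, 0x08000000) is false and mul_overflows_i32(16, 0x08000000) is true, so a guard like this around the multiplication would flag the first bad iteration instead of letting the loop run into undefined behaviour.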

Would be better to fix GCC optimization.
Is there a GCC BUG Report for this?
 