
Thread: Teensy 4.0 compiler needs 30 minutes for 40000 statements

  1. #1

    I use the program below to measure the runtime of certain statements. The macro multiExeMM generates code for MM 2x2 matrix-vector multiplications. In the example below, MM=10000, and the compiler needs 30 minutes (!) to compile the resulting 40000 statements. For MM=1000 it needs only 15 seconds. Why does compilation take so long for MM=10000? Are there options I can change to make it faster?

    Code:
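    int led = 13;
    #define NN 10      // number of passes through the timed loop
    #define MM 10000   // unrolled multiplications per pass; must match multiExe10000 below
     
    // One 2x2 matrix-vector product: (x1,x2) <- A*(x1,x2)
    #define multiExe1 {\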
      yy1=a11*x1+a12*x2 ; \
      yy2=a21*x1+a22*x2 ;\
      x1=yy1 ;\
      x2=yy2 ;\
      }
     
    #define multiExe5 {\
      multiExe1 \
      multiExe1 \
      multiExe1 \
      multiExe1 \
      multiExe1 \
      }  
    
    #define multiExe25 {\
      multiExe5 \
      multiExe5 \
      multiExe5 \
      multiExe5 \
      multiExe5 \
      }    
    
    #define multiExe100 {\
      multiExe25 \
      multiExe25 \
      multiExe25 \
      multiExe25 \
      }      
    
    #define multiExe500 {\
      multiExe100 \
      multiExe100 \
      multiExe100 \
      multiExe100 \
      multiExe100 \
      }
    
    #define multiExe1000 {\
      multiExe500 \
      multiExe500 \
      }
    
    #define multiExe10000 {\
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      multiExe1000 \
      }  
      
    float x1,x2,yy1,yy2 ;
    float phi1,phiNNMM ;
    float a11,a12,a21,a22 ;
    float b11,b12,b21,b22 ;
    float xshb1,xshb2 ;
    
    void setup() {
      pinMode(led, OUTPUT);
      Serial.begin(115200);  
      delay(500) ;
      Serial.println("teensy40timingMat22Mul1...") ;
      Serial.print("F_CPU=") ;  Serial.println(F_CPU) ;
      Serial.print("MM   =") ;  Serial.println(MM) ;
     
      phi1=PI/7/NN ;
      phiNNMM=MM*NN*phi1 ;
      a11=cos(phi1) ;
      a12=sin(phi1) ;
      a21=-sin(phi1) ;
      a22=cos(phi1) ; 
      if(a11==1.0){ Serial.println("a11==1 !") ; }
      if(a12==0.0){ Serial.println("a12==0 !") ; }
         
      b11=cos(phiNNMM) ;
      b12=sin(phiNNMM) ;
      b21=-sin(phiNNMM) ;
      b22=cos(phiNNMM) ; 
     
      x1=1.0 ; x2=0.0 ;
      xshb1=b11*x1+b12*x2 ;
      xshb2=b21*x1+b22*x2 ;
      x1=1.0 ; x2=0.0 ; 
      
      uint32_t microsStart=micros() ;
      for(int k=0 ; k<NN ; k++){
        multiExe10000
        }
      uint32_t microsStop=micros() ;
      
      int32_t  microsCount=microsStop-microsStart ;
      float microsTime=microsCount*1e-6 ;
      
      Serial.printf("micros: start=%10i stop=%10i count=%5i  time=%8.3f us cyclesPerBlock= %10.5f\n",microsStart,microsStop,microsCount,microsTime/1e-6,(microsTime/(NN*MM)*F_CPU)) ;
      Serial.printf("x1   =%15.10f  x2   =%15.10f\n",x1,x2) ;
      Serial.printf("xshb1=%15.10f  xshb2=%15.10f\n",xshb1,xshb2) ;
      }
          
    void loop() {  }

  2. #2
    defragster (Senior Member+, joined Feb 2015, 10,565 posts)
    Built here in 7 min 21 s. That is a long time, but I saw no excessive CPU, RAM, or disk usage. An i7 with 4 cores (8 threads) showed about 20% load with no core pegged, just jagged activity on all 8 graphs. Everything builds on SSD drive T:, which showed a few quick blips up front and then almost nothing; drive C: showed a few blips over 50%, but with pauses longer than the spikes.

    Perhaps this is just a weak point in the compiler, where it is kept busy waiting?

    teensy40timingMat22Mul1...
    F_CPU=600000000
    MM =10000
    micros: start= 800002 stop= 801337 count= 1335 time=1335.000 us cyclesPerBlock= 8.01000
    x1 = -0.2219638228 x2 = -0.9722617865
    xshb1= -0.2222798169 xshb2= -0.9749829173

  3. #3
    Senior Member (joined Apr 2014, Germany, 693 posts)
    I assume that the preprocessor has a hard time generating those 40000 statements. Is there a reason you don't put a single 'multiExe100' inside a large for loop? The loop overhead is definitely negligible.
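
A minimal host-side sketch of that suggestion, assuming standard C++ rather than the Arduino environment: only 100 multiplications are unrolled by the preprocessor and a for loop supplies the rest. The macros follow the original post; `Vec2` and `rotateUnrolled100` are hypothetical names introduced here for illustration.

```cpp
#include <cmath>

// One 2x2 matrix-vector product, as in the thread's multiExe1.
#define multiExe1 { \
  yy1 = a11 * x1 + a12 * x2; \
  yy2 = a21 * x1 + a22 * x2; \
  x1 = yy1; \
  x2 = yy2; \
  }

#define multiExe5   { multiExe1 multiExe1 multiExe1 multiExe1 multiExe1 }
#define multiExe25  { multiExe5 multiExe5 multiExe5 multiExe5 multiExe5 }
#define multiExe100 { multiExe25 multiExe25 multiExe25 multiExe25 }

// Hypothetical result type, not from the thread.
struct Vec2 { float x1, x2; };

// Applies the rotation matrix (blocks100 * 100) times to (1, 0),
// with 100 multiplications unrolled per loop pass.
Vec2 rotateUnrolled100(float phi, int blocks100) {
  const float a11 = std::cos(phi), a12 = std::sin(phi);
  const float a21 = -std::sin(phi), a22 = std::cos(phi);
  float x1 = 1.0f, x2 = 0.0f, yy1, yy2;
  for (int k = 0; k < blocks100; k++) {
    multiExe100
  }
  return {x1, x2};
}
```

Calling `rotateUnrolled100(phi, 1000)` performs the same 100000 multiplications as the original NN*MM arrangement, while the compiler only sees 400 statements inside the loop body.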

  4. #4
    Senior Member (joined Feb 2016, Australia, 251 posts)
    original-
    compile time: 4 mins 20 seconds
    time:1335 us
    cyclesPerBlock:8.01

    for loop of multiExe100-
    compile time: 6.45 seconds
    time:1336 us
    cyclesPerBlock: 8.016

    for loop of multiExe1-
    compile time: 4 seconds
    time:1501 us
    cyclesPerBlock: 9.006

  5. #5
    It seems my PC is rather slow. I am doing this only out of curiosity, and I did not want to have any loop overhead.

  6. #6
    KurtE (Senior Member+, joined Jan 2014, 6,075 posts)
    Do you have any antivirus software running on your machine? I wonder how large the temporary files are and how much disk space you have...

    I also wonder what type of PC you are running, in particular how much RAM it has. Did the compile exceed it and start a lot of swapping?

  7. #7
    I have a 3.2 GHz AMD CPU with 16 GB RAM and 240 GB of free hard disk space. I don't see any unusual hard disk activity during compilation (so not much swapping).

  8. #8
    PaulStoffregen (Senior Member, joined Nov 2012, 21,279 posts)
    Temporarily disable Windows Defender.

  9. #9
    Senior Member (joined Apr 2014, Germany, 693 posts)
    ... or set it up to exclude the build folders (which is a good idea in any case).

    Anyway, I still don't see why you want to torture the compiler when Gibbedy has already shown in #4 that there is no measurable overhead due to the loop...

  10. #10
    @Paul: Windows Defender is disabled.
    @luni: I just want to make sure that no other instructions enter the pipeline, and I want to know what happens. It is entirely clear to me that this makes no sense under normal circumstances.

  11. #11
    Senior Member (joined Jul 2014, 2,504 posts)
    Quote Originally Posted by ossi
    It's just that I want that no other instructions enter the pipeline.
    But you only instruct the preprocessor. The compiler does what it wants (except in debug mode), right? It may reorder the statements to optimize register usage.

  12. #12
    I keep an eye on the generated code...

  13. #13
    Senior Member (joined Feb 2016, Australia, 251 posts)
    for loop of multiExe1 *debug mode*-
    time:5501 us
    cyclesPerBlock: 33.006

    original *debug mode*-
    time:1870 us
    cyclesPerBlock: 11.22

    All previous tests used the "faster" option.
    original *fastest*-
    time:1334 us
    cyclesPerBlock: 8.004

    I'm going to get rid of the NN loop too, compile in debug mode, and see if I can hit the leading 1334 us.
    Oooh this is fun.

  14. #14
    Senior Member (joined Feb 2016, Australia, 251 posts)
    Code:
    cc1plus.exe: out of memory allocating 1846519080 bytes
    one moment,...

    edit: I give up. Can't get past this error.
    Last edited by Gibbedy; 01-18-2020 at 05:35 PM.

  15. #15
    What is the purpose of the "debug"-compile-option?

  16. #16
    defragster (Senior Member+, joined Feb 2015, 10,565 posts)
    Quote Originally Posted by ossi
    What is the purpose of the "debug"-compile-option?
    The general idea, AFAIK, is that the code is compiled more in the order written, so that stepping through the source matches the execution order, and some optimizations that would obscure walking through the code during debugging may be skipped. That might result in larger code, but with T4 code running from full-speed RAM that wouldn't hurt as much as overflowing the caches when running from flash.
