Teensy 4.0 compiler needs 30 minutes for 40000 statements

Status
Not open for further replies.

ossi

Well-known member
I use the program given below to examine the runtime of certain statements. The macro multiExeMM generates code for MM matrix multiplications. In the example below MM=10000. The compiler then needs 30 minutes (!) for compilation of the 40000 statements. For MM=1000 the compiler only needs 15 seconds. Why is compilation for MM=10000 so long? Can I change some options to make it faster?

Code:
int led = 13;
#define NN 10
#define MM 10000
 
#define multiExe1 {\
  yy1=a11*x1+a12*x2 ; \
  yy2=a21*x1+a22*x2 ;\
  x1=yy1 ;\
  x2=yy2 ;\
  }
 
#define multiExe5 {\
  multiExe1 \
  multiExe1 \
  multiExe1 \
  multiExe1 \
  multiExe1 \
  }  

#define multiExe25 {\
  multiExe5 \
  multiExe5 \
  multiExe5 \
  multiExe5 \
  multiExe5 \
  }    

#define multiExe100 {\
  multiExe25 \
  multiExe25 \
  multiExe25 \
  multiExe25 \
  }      

#define multiExe500 {\
  multiExe100 \
  multiExe100 \
  multiExe100 \
  multiExe100 \
  multiExe100 \
  }

#define multiExe1000 {\
  multiExe500 \
  multiExe500 \
  }

#define multiExe10000 {\
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  multiExe1000 \
  }  
  
float x1,x2,yy1,yy2 ;
float phi1,phiNNMM ;
float a11,a12,a21,a22 ;
float b11,b12,b21,b22 ;
float xshb1,xshb2 ;

void setup() {
  pinMode(led, OUTPUT);
  Serial.begin(115200);  
  delay(500) ;
  Serial.println("teensy40timingMat22Mul1...") ;
  Serial.print("F_CPU=") ;  Serial.println(F_CPU) ;
  Serial.print("MM   =") ;  Serial.println(MM) ;
 
  phi1=PI/7/NN ;
  phiNNMM=MM*NN*phi1 ;
  a11=cos(phi1) ;
  a12=sin(phi1) ;
  a21=-sin(phi1) ;
  a22=cos(phi1) ; 
  if(a11==1.0){ Serial.println("a11==1 !") ; }
  if(a12==0.0){ Serial.println("a12==0 !") ; }
     
  b11=cos(phiNNMM) ;
  b12=sin(phiNNMM) ;
  b21=-sin(phiNNMM) ;
  b22=cos(phiNNMM) ; 
 
  x1=1.0 ; x2=0.0 ;
  xshb1=b11*x1+b12*x2 ;
  xshb2=b21*x1+b22*x2 ;
  x1=1.0 ; x2=0.0 ; 
  
  uint32_t microsStart=micros() ;
  for(int k=0 ; k<NN ; k++){
    multiExe10000
    }
  uint32_t microsStop=micros() ;
  
  int32_t  microsCount=microsStop-microsStart ;
  float microsTime=microsCount*1e-6 ;
  
  Serial.printf("micros: start=%10i stop=%10i count=%5i  time=%8.3f us cyclesPerBlock= %10.5f\n",microsStart,microsStop,microsCount,microsTime/1e-6,(microsTime/(NN*MM)*F_CPU)) ;
  Serial.printf("x1   =%15.10f  x2   =%15.10f\n",x1,x2) ;
  Serial.printf("xshb1=%15.10f  xshb2=%15.10f\n",xshb1,xshb2) ;
  }
      
void loop() {  }
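
As a quick sanity check on the numbers in the question, the nesting factors of the macros can be multiplied out. This is only a host-side C++ illustration, not part of the Teensy program; the names `kNestingFactors`, `expandedBlocks` and `expandedStatements` are made up for this example, and the factors are read off the #define chain above.

```cpp
// Repetition factor of each macro level, in order:
// multiExe5, multiExe25, multiExe100, multiExe500, multiExe1000, multiExe10000.
constexpr long kNestingFactors[] = {5, 5, 4, 5, 2, 10};

// multiExe1 contains four assignments.
constexpr long kStatementsPerBlock = 4;

// Number of multiExe1 blocks produced by one multiExe10000.
constexpr long expandedBlocks() {
    long blocks = 1;
    for (long factor : kNestingFactors) {
        blocks *= factor;
    }
    return blocks;
}

// Number of assignment statements the compiler sees after expansion.
constexpr long expandedStatements() {
    return expandedBlocks() * kStatementsPerBlock;
}

static_assert(expandedBlocks() == 10000, "one multiExe10000 is 10000 blocks");
static_assert(expandedStatements() == 40000, "matches the 40000 in the title");
```

So the compiler proper is handed 10000 blocks, i.e. 40000 assignments, as a single straight-line sequence inside the NN loop.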
 
Built here in 7 min 21 secs. That is a long time, but I saw no abusive-looking CPU, RAM or disk usage. An i7 with 4*2 cores showed 20% load with no core pegged, just 8 sets of jaggies. Everything builds on SSD drive T:, which showed a few quick blips up front and then almost nothing; the C: drive showed a few blips over 50%, but with pauses longer than the spikes.

Must just be a weak point in the compiler that keeps it busy waiting?

teensy40timingMat22Mul1...
F_CPU=600000000
MM =10000
micros: start= 800002 stop= 801337 count= 1335 time=1335.000 us cyclesPerBlock= 8.01000
x1 = -0.2219638228 x2 = -0.9722617865
xshb1= -0.2222798169 xshb2= -0.9749829173
 
I assume that the preprocessor has a hard time generating those 40000 statements. Is there any reason why you don't pack one 'multiExe100' into a large for loop? The overhead of the loop is definitely negligible.
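
The loop variant suggested here could look roughly like the following. This is a minimal host-side sketch, not the code anyone in the thread actually ran: `Vec2`, `runLooped` and `runPlain` are hypothetical helpers, and `runLooped` assumes the block count is a multiple of 100. The macros repeat the same statements as in the original sketch, only up to multiExe100.

```cpp
// One 2x2 matrix-vector step, same statements as multiExe1 in the original.
#define multiExe1 { \
  yy1 = a11*x1 + a12*x2; \
  yy2 = a21*x1 + a22*x2; \
  x1 = yy1; \
  x2 = yy2; \
  }
#define multiExe5   { multiExe1 multiExe1 multiExe1 multiExe1 multiExe1 }
#define multiExe25  { multiExe5 multiExe5 multiExe5 multiExe5 multiExe5 }
#define multiExe100 { multiExe25 multiExe25 multiExe25 multiExe25 }

struct Vec2 {
    float x1;
    float x2;
};

// Applies `blocks` steps, unrolled 100 at a time inside a for loop.
// Assumption: `blocks` is a multiple of 100.
Vec2 runLooped(Vec2 v, float a11, float a12, float a21, float a22, int blocks) {
    float x1 = v.x1, x2 = v.x2, yy1, yy2;
    for (int i = 0; i < blocks / 100; ++i) {
        multiExe100
    }
    return {x1, x2};
}

// Reference version: one step per loop iteration, no unrolling by hand.
Vec2 runPlain(Vec2 v, float a11, float a12, float a21, float a22, int blocks) {
    float x1 = v.x1, x2 = v.x2, yy1, yy2;
    for (int i = 0; i < blocks; ++i) {
        multiExe1
    }
    return {x1, x2};
}
```

With this layout the compiler only ever sees 100 blocks (400 assignments) of straight-line code, while the number of executed steps is set by the loop bound.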
 
original-
compile time: 4 mins 20 seconds
time:1335 us
cyclesPerBlock:8.01

for loop of multiExe100-
compile time: 6.45 seconds
time:1336 us
cyclesPerBlock: 8.016

for loop of multiExe1-
compile time: 4 seconds
time:1501 us
cyclesPerBlock: 9.006
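
For anyone who wants to reproduce the cyclesPerBlock figures by hand: they follow from the expression in the sketch's printf, (microsTime/(NN*MM))*F_CPU. Below is a small host-side C++ version of that arithmetic; the function name `cyclesPerBlock` and its parameter names are made up for this example.

```cpp
// Cycles per 2x2 multiply block, computed the same way as in the sketch:
// elapsed seconds divided by the number of executed blocks, times the CPU clock.
constexpr double cyclesPerBlock(double elapsedMicros, double blocks, double cpuHz) {
    return elapsedMicros * 1e-6 / blocks * cpuHz;
}

// Values taken from the thread: NN*MM = 10 * 10000 blocks, F_CPU = 600 MHz.
constexpr double kBlocks = 10.0 * 10000.0;
constexpr double kCpuHz = 600000000.0;
```

For example, `cyclesPerBlock(1335, kBlocks, kCpuHz)` comes out at about 8.01 and `cyclesPerBlock(1501, kBlocks, kCpuHz)` at about 9.006, so the plain loop of multiExe1 costs roughly one extra cycle per block.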
 
Seems my PC is rather slow. I am doing this just because I am curious and I did not want to have any (loop-) overhead.
 
Do you have any anti-virus software running on your machine? I also wonder about the size of the temporary files created and how much disk space you have...

Wonder what type of PC you are running, things like how much RAM... Did the compile exceed it and start a lot of swapping? ...
 
I have an AMD 3.2GHz CPU with 16G Ram and 240G harddisk-space left. I don't see any extraordinary harddisk activity during compilation (so not too much swapping).
 
.... or set it up to exclude the build folders. (Which is a good idea in any case)

Anyway, I still don't see why you want to torture the compiler when Gibbedy has already shown in #4 that there is no measurable overhead due to the loop...
 
@Paul: Windows Defender is disabled.
@luni: It's just that I want that no other instructions enter the pipeline. It's just that I want to know what happens. It's totally clear to me that under normal circumstances that makes no sense.
 
It's just that I want that no other instructions enter the pipeline.
But you only instruct the preprocessor. The compiler does what it wants (except in debug mode), right? It may reorder the statements to optimize register usage.
 
for loop of multiExe1 *debug mode*-
time:5501 us
cyclesPerBlock: 33.006

original *debug mode*-
time:1870 us
cyclesPerBlock: 11.22

All previous tests were with "faster" option.
original *fastest*-
time:1334 us
cyclesPerBlock: 8.004

I'm going to get rid of the NN loop too, compile in debug mode, and see if I can hit the leading 1334 us.
Oooh this is fun.
 
Code:
cc1plus.exe: out of memory allocating 1846519080 bytes
one moment,...

edit: I give up. Can't get past this error.
 
What is the purpose of the "debug"-compile-option?

The general idea, AFAIK, is that the code is compiled more in the order it was written, so stepping through the source matches the execution order. Some optimizations that would obfuscate walking through the code during debugging may also be avoided. That might result in fatter code, but with T4 code running from full-speed RAM that wouldn't hurt as much as blowing the caches when running from FLASH.
 