Timing of nop-delayloops on Teensy4.0

Status
Not open for further replies.

ossi

Well-known member
I am currently measuring the execution times of simple delay loops. I toggle pin 13 and measure the execution times using an oscilloscope. Clock frequency is 600 MHz. The function nopLoopn() contains n NOP instructions (n=0..7). The complete program is:

Code:
int led = 13;

void setup() {
  pinMode(led, OUTPUT);
  delay(2000) ;
  Serial.println("teensy40nopLoops1...") ;
  delay(200) ;
  nopLoop5() ;
  }

void nopLoop0() { // 150MHz = 4 cycles
  Serial.println("nopLoop0()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }

void nopLoop1() { // 150MHz = 4 cycles
  Serial.println("nopLoop1()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");    
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }  

void nopLoop2() { // 150MHz = 4 cycles
  Serial.println("nopLoop2()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");
    asm volatile("nop");        
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }    

void nopLoop3() { // 150MHz = 4 cycles
  Serial.println("nopLoop3()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");    
    asm volatile("nop");
    asm volatile("nop");        
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }      

void nopLoop4() { // 150MHz = 4 cycles
  Serial.println("nopLoop4()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");    
    asm volatile("nop");    
    asm volatile("nop");
    asm volatile("nop");        
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }      

void nopLoop5() { // 150MHz = 4 cycles
  Serial.println("nopLoop5()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    asm volatile("nop");
    asm volatile("nop");    
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }  

void nopLoop6() { // mix 4/5 cycles ?
  Serial.println("nopLoop6()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }  

void nopLoop7() { // 120 MHz = 5 cycles
  Serial.println("nopLoop7()...") ;
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    asm volatile("nop");    
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  }  

  
void loop(){
 }

The compiler inserts the NOP instructions as given by the source code. The generated code for nopLoop5() for example looks like this:

Code:
void nopLoop5() { // 150MHz = 4 cycles
      a0:	b508      	push	{r3, lr}
  Serial.println("nopLoop5()...") ;
      a2:	4908      	ldr	r1, [pc, #32]	; (c4 <nopLoop5()+0x24>)
      a4:	4808      	ldr	r0, [pc, #32]	; (c8 <nopLoop5()+0x28>)
      a6:	f7ff ffe9 	bl	7c <Print::println(char const*)>
  while(1){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
      aa:	4a08      	ldr	r2, [pc, #32]	; (cc <nopLoop5()+0x2c>)
      ac:	2308      	movs	r3, #8
      ae:	f8c2 3084 	str.w	r3, [r2, #132]	; 0x84
    asm volatile("nop");
      b2:	bf00      	nop
    asm volatile("nop");
      b4:	bf00      	nop
    asm volatile("nop");    
      b6:	bf00      	nop
    asm volatile("nop");
      b8:	bf00      	nop
    asm volatile("nop");    
      ba:	bf00      	nop
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
      bc:	f8c2 3088 	str.w	r3, [r2, #136]	; 0x88
      c0:	e7f5      	b.n	ae <nopLoop5()+0xe>
      c2:	bf00      	nop
      c4:	20000050 	.word	0x20000050
      c8:	20000464 	.word	0x20000464
      cc:	42004000 	.word	0x42004000
Interesting are the loop execution times:
Code:
n     frequency     cycles per loop
0     150 MHz       4
1     150 MHz       4
2     150 MHz       4
3     150 MHz       4
4     150 MHz       4
5     150 MHz       4
6     -             4/5 (mixed)
7     120 MHz       5
Interestingly, the NOPs seem not to be executed at all for n=1..5. For n=6 a NOP is executed only some of the time, and for n=7 exactly one NOP always appears to be executed.
Is there a simple explanation for this behaviour?
 
Is there a simple explanation for this behaviour?

No, there is no single simple explanation. But there are many complicated overlapping ones!

I believe most of the effect you're seeing is due to the GPIO peripheral taking a few cycles to perform the previously written (implied) read-modify-write operation. If the CPU writes another one before the prior has completed, the STR.W instruction is forced to wait.

NXP's documentation on this stuff is scant at best. In many cases, there simply isn't conclusive documentation and a fair amount of guesswork is needed.

Also keep in mind you're measuring a best case scenario (unless you're using a single trigger on your scope to capture the very first usage). Code is executing from ITCM and the branch prediction hardware is already primed. This code is also simple enough (low register pressure) that the compiler optimizes common sub-expressions. You can do much worse, where those other complicated explanations also come into play and overlap on top of the bus bridge & peripheral timing effects.
 
I find it rather interesting that in the case of nopLoop5() the compiler inserts the NOPs into the generated code, yet pipelining or some other runtime mechanism eliminates them in the long run.
 
But is it really eliminating those nop instructions at runtime?

Or is it executing them and simply arriving later at an access to a resource that would not have been ready that early?

Remember, LDR and STR instructions can take any number of cycles (possibly hundreds or thousands) if there is bus contention. It's also possible to get 2 of them to happen in the same cycle, if they both access DTCM in just the right way.
 
If I let nopLoop5() run, I sometimes see short pauses in the pin 13 waveform, but those seem to be caused by interrupts from the Teensy system. If I insert a cli() to disable those interrupts, I see a clean 150 MHz waveform on pin 13. No NOP cycles show up later on either, so in the long run the NOPs seem to be eliminated somehow.
 
If you run the attached program on a Teensy 4.0, the result is 4 cycles per loop. That means the pin toggle plus the loop counting is done in 4 cycles, and the NOPs seem to cost nothing. If you insert one more NOP into the loop, the mean execution time becomes 4.5 cycles, so a NOP seems to cost a cycle every other iteration.

Code:
int led = 13;

void setup() {
  pinMode(led, OUTPUT);
  delay(2000) ;
  Serial.println("teensy40nopLoops1...") ;
  }

elapsedMicros timer ;

void loop() { 
  Serial.print("loop()  ") ;
  int N=100000000 ;
  double fCPU=600e6 ;
  timer=0 ;
  for(int k=0 ; k<N ; k++){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("nop");
    asm volatile("nop");
    asm volatile("nop");    
    asm volatile("nop");
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  double timePerLoop=timer*1e-6/N ;  
  double cyclesPerLoop=fCPU*timePerLoop ;  
  Serial.printf("time per loop=%15.10f ns = %15.10f cycles \n",timePerLoop/1e-9,cyclesPerLoop) ;
  }
 
I got the e-mail that Frank B. posted a reply to this thread:
Frank B has just replied to a thread you have subscribed to entitled - Timing of nop-delayloops on Teensy4.0 - in the Technical Support & Questions forum of PJRC (Teensy) Forum.

But I don't see his post on this thread. Where is it?
 
...to add at least something useful to this thread, a quote from here:

NOP does nothing. NOP is not necessarily a time-consuming NOP. The processor might remove it from the pipeline before it reaches the execution stage. Use NOP for padding, for example to place the following instruction on a 64-bit boundary.

 
..and this, from here:

... What this really means is that attempts to do small (1-3) cycle delays have fragile dependencies on the surrounding instructions, which in turn depend on the compiler and its optimization flags. If you’re getting a hard fault because you manipulate a module register too quickly after enabling the module, insert a __NOP() or two and see if it works. If the exact cycle count of the code you write is critical, you’re going to have to analyze it in context.
 
The following program is also interesting:
Code:
int led = 13;

int k ;

void setup() {
  pinMode(led, OUTPUT);
  delay(2000) ;
  Serial.println("teensy40nopLoops2a...") ;
  k=0 ;
  }

elapsedMicros timer ;

void loop() { 
  Serial.print("loop()  ") ;
  int N=100000000 ;
  double fCPU=600e6 ;
  timer=0 ;
  for(int k=0 ; k<N ; k++){
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    asm volatile("subs  r3, #2");
    asm volatile("adds  r3, #2");
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    }
  double timePerLoop=timer*1e-6/N ;  
  double cyclesPerLoop=fCPU*timePerLoop ;  
  Serial.printf("k=%8i time per loop=%15.10f ns = %15.10f cycles \n",k,timePerLoop/1e-9,cyclesPerLoop) ;
  k++ ;
  delay(100) ;
  }

The instructions within the for loop probably cannot be eliminated. The execution time for the loop is still 4 cycles, which shows that the "dual issue" CPU executes instructions in parallel. The compiler generates the following code for the loop:

Code:
  d2:	f8c1 2084 	str.w	r2, [r1, #132]	; 0x84
    asm volatile("subs  r3, #2");
      d6:	3b02      	subs	r3, #2
    asm volatile("adds  r3, #2");
      d8:	3302      	adds	r3, #2
  for(int k=0 ; k<N ; k++){
      da:	3b01      	subs	r3, #1
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
      dc:	f8c1 2088 	str.w	r2, [r1, #136]	; 0x88
  for(int k=0 ; k<N ; k++){
      e0:	d1f7      	bne.n	d2 <loop+0x1a>
 
A better way to test would involve replicating that asm code many times, so you have 100 or more instructions between the 2 I/O operations. You already established there are several cycles of uncertainty regarding the timing of those STR instructions which write to the GPIO register. So you should design your test to have a high ratio of measured CPU cycles to that measurement uncertainty.
 
A nop that can't be ignored by the T4 is asm volatile ("mov r0,r0");

I wrote a mini library that can do cycle-exact measuring.

Example:
Code:
#include "Teensy_perf.h"

void f() {
  asm volatile ("mov r0,r0"); //A NOP that can't be ignored by the pipeline
  asm volatile ("mov r1,r1");
  asm volatile ("mov r2,r2");
  asm volatile ("mov r3,r3");

  asm volatile ("mov r4,r4");
  asm volatile ("mov r5,r5");
  asm volatile ("mov r6,r6");
  asm volatile ("mov r7,r7");
};

void setup() {
  while (!Serial && millis() < 4000 );
  Serial.printf("Cycles: %d", measure(f) );
}

void loop() {}
This takes 4 cycles on a T4 (because of "dual issue")

On a T3.6 it takes ~20 cycles because it runs from flash, and 8 cycles if you add "FASTRUN" to f().
The T3.x count might not be exact because I still have to find a way to invalidate the cache on a 3.x.
That means if you add a second line "Serial.printf("Cycles: %d", measure(f));", it will show 8 cycles because of the cache.


On T4, the lib invalidates i-cache and d-cache before measuring.
Interrupts get disabled on all Teensys.

(Code parts from CMSIS)

It gives interesting results about the T4 pipeline. Just replace all the different movs with "mov r0,r0" and the result is different.
The pipeline recognizes that the movs can no longer be parallelized.

As Goethe said "Grau ist alle Theorie" (All theory is gray). It is better to do it to get a good understanding.

Something is wrong if you choose "fastest" as the optimization. I don't know exactly what the problem is; I'll have to look at it on the weekend.

https://github.com/FrankBoesing/Teensy_perf
 
p.s.: If someone knows how to invalidate the cache on a Cortex M4, i would be very interested to know. i haven't found anything yet.
 
p.s.: If someone knows how to invalidate the cache on a Cortex M4, i would be very interested to know. i haven't found anything yet.

For Teensy 3.6's main 8K cache, look at the LMEM_PCCCR register, which is documented on page 683 of the K66 reference manual. Also look at FMC_PFB01CR on page 712 for the small cache built into the flash memory controller.
 
Great !
it is
Code:
  //push and invalidate cache on T3.x:
  LMEM_PCCCR |= (1<<27) | (1<<26) | (1<<25) | (1<<24);
  LMEM_PCCCR |= (1<<31);
  while (LMEM_PCCCR & (1<<31)) {;}
  FMC_PFB0CR |= (1<<23) | (1<<22) | (1<<21) | (1<<20) | (1<<19);
I'll upgrade the lib. It shows negative cycles now with FASTRUN on T3.6. The reason is clear: it measures against an empty function which is not in RAM.
Have to live with that until I have a solution (I have an Idea...)
 
Ok, just pushed the final version for today :) Sorry for all the updates. Every time I thought that was it, a couple of tests later it turned out it was not.
But now it's stable for T3.6.
Re: the "optimize faster" problem: I added optimize -O1 to the functions.
The 3.6 still showed 4 cycles difference between run1 and run2. I think, and that's just a guess, that USB DMA took 4 cycles, so I changed the delay at the beginning to delay(250).
Maybe not the best solution, but enough for today :)
It shows now stable 18 cycles for the demo in #14.
 
It shows negative cycles now with FASTRUN on T3.6. The reason is clear: it measures against an empty function which is not in RAM.
Have to live with that until I have a solution (I have an Idea...)

I've added a second function for this and other cases:
Code:
uint32_t measure(void (*func)(void), void (*compensate)(void));
It compares against a 2nd function.
Usage is like this:
Code:
FASTRUN void emptyf() {};
FASTRUN
void f() {
  asm volatile ("mov r0,r0"); //A NOP that can't be ignored by the pipeline
  asm volatile ("mov r1,r1");
  asm volatile ("mov r2,r2");
  asm volatile ("mov r3,r3");

  asm volatile ("mov r4,r4");
  asm volatile ("mov r5,r5");
  asm volatile ("mov r6,r6");
  asm volatile ("mov r7,r7");
};

...
 Serial.printf("Cycles: %d\n", measure(f, emptyf) );

So, on a T3.x it can show now the exact cycles without influence of the FLASH.
You can use it on T4, too... a posting follows.
 
I always wanted to know how many cycles a DIV needs. The manual says it can take 2..12 cycles.
A mult needs 1 cycle, so we can measure both, subtract the mult time from the div time, and add 1 cycle:
Code:
#include "Teensy_perf.h"
volatile int a = 0, b = 1;
void f_mult() {
  volatile int result = a * b;
};
void f_div() {
  volatile int result = a / b;
};
void setup() {
  while (!Serial && millis() < 4000 );
  unsigned mult = measure(f_mult);
  Serial.printf("Cycles mult (+ variable loads): %d\n", mult);
  for (int i = 0; i < 256; i++) {
    a = i;
    unsigned div =  measure(f_div);
    Serial.printf("%03d / %03d cycles: %d\n", a, b, div - mult  + 1);
  }
}
void loop() {}
Code:
Cycles mult (+ variable loads): 7
000 / 001 cycles: 8
001 / 001 cycles: 5
002 / 001 cycles: 5
003 / 001 cycles: 5
004 / 001 cycles: 6
005 / 001 cycles: 6
006 / 001 cycles: 6
007 / 001 cycles: 6
008 / 001 cycles: 6
009 / 001 cycles: 6
010 / 001 cycles: 6
011 / 001 cycles: 6
012 / 001 cycles: 6
013 / 001 cycles: 6
014 / 001 cycles: 6
015 / 001 cycles: 6
016 / 001 cycles: 7
017 / 001 cycles: 7
018 / 001 cycles: 7
019 / 001 cycles: 7
020 / 001 cycles: 7
021 / 001 cycles: 7
022 / 001 cycles: 7
023 / 001 cycles: 7
024 / 001 cycles: 7
025 / 001 cycles: 7
026 / 001 cycles: 7
027 / 001 cycles: 7
028 / 001 cycles: 7
029 / 001 cycles: 7
030 / 001 cycles: 7
031 / 001 cycles: 7
032 / 001 cycles: 7
033 / 001 cycles: 7
034 / 001 cycles: 7
035 / 001 cycles: 7
036 / 001 cycles: 7
037 / 001 cycles: 7
038 / 001 cycles: 7
039 / 001 cycles: 7
040 / 001 cycles: 7
041 / 001 cycles: 7
042 / 001 cycles: 7
043 / 001 cycles: 7
044 / 001 cycles: 7
045 / 001 cycles: 7
046 / 001 cycles: 7
047 / 001 cycles: 7
048 / 001 cycles: 7
049 / 001 cycles: 7
050 / 001 cycles: 7
051 / 001 cycles: 7
052 / 001 cycles: 7
053 / 001 cycles: 7
054 / 001 cycles: 7
055 / 001 cycles: 7
056 / 001 cycles: 7
057 / 001 cycles: 7
058 / 001 cycles: 7
059 / 001 cycles: 7
060 / 001 cycles: 7
061 / 001 cycles: 7
062 / 001 cycles: 7
063 / 001 cycles: 7
064 / 001 cycles: 8
065 / 001 cycles: 8
066 / 001 cycles: 8
067 / 001 cycles: 8
068 / 001 cycles: 8
069 / 001 cycles: 8
070 / 001 cycles: 8
071 / 001 cycles: 8
072 / 001 cycles: 8
073 / 001 cycles: 8
074 / 001 cycles: 8
075 / 001 cycles: 8
076 / 001 cycles: 8
077 / 001 cycles: 8
078 / 001 cycles: 8
079 / 001 cycles: 8
080 / 001 cycles: 8
081 / 001 cycles: 8
082 / 001 cycles: 8
083 / 001 cycles: 8
084 / 001 cycles: 8
085 / 001 cycles: 8
086 / 001 cycles: 8
087 / 001 cycles: 8
088 / 001 cycles: 8
089 / 001 cycles: 8
090 / 001 cycles: 8
091 / 001 cycles: 8
092 / 001 cycles: 8
093 / 001 cycles: 8
094 / 001 cycles: 8
095 / 001 cycles: 8
096 / 001 cycles: 8
097 / 001 cycles: 8
098 / 001 cycles: 8
099 / 001 cycles: 8
100 / 001 cycles: 8
101 / 001 cycles: 8
102 / 001 cycles: 8
103 / 001 cycles: 8
104 / 001 cycles: 8
105 / 001 cycles: 8
106 / 001 cycles: 8
107 / 001 cycles: 8
108 / 001 cycles: 8
109 / 001 cycles: 8
110 / 001 cycles: 8
111 / 001 cycles: 8
112 / 001 cycles: 8
113 / 001 cycles: 8
114 / 001 cycles: 8
115 / 001 cycles: 8
116 / 001 cycles: 8
117 / 001 cycles: 8
118 / 001 cycles: 8
119 / 001 cycles: 8
120 / 001 cycles: 8
121 / 001 cycles: 8
122 / 001 cycles: 8
123 / 001 cycles: 8
124 / 001 cycles: 8
125 / 001 cycles: 8
126 / 001 cycles: 8
127 / 001 cycles: 8
128 / 001 cycles: 8
129 / 001 cycles: 8
130 / 001 cycles: 8
131 / 001 cycles: 8
132 / 001 cycles: 8
133 / 001 cycles: 8
134 / 001 cycles: 8
135 / 001 cycles: 8
136 / 001 cycles: 8
137 / 001 cycles: 8
138 / 001 cycles: 8
139 / 001 cycles: 8
140 / 001 cycles: 8
141 / 001 cycles: 8
142 / 001 cycles: 8
143 / 001 cycles: 8
144 / 001 cycles: 8
145 / 001 cycles: 8
146 / 001 cycles: 8
147 / 001 cycles: 8
148 / 001 cycles: 8
149 / 001 cycles: 8
150 / 001 cycles: 8
151 / 001 cycles: 8
152 / 001 cycles: 8
153 / 001 cycles: 8
154 / 001 cycles: 8
155 / 001 cycles: 8
156 / 001 cycles: 8
157 / 001 cycles: 8
158 / 001 cycles: 8
159 / 001 cycles: 8
160 / 001 cycles: 8
161 / 001 cycles: 8
162 / 001 cycles: 8
163 / 001 cycles: 8
164 / 001 cycles: 8
165 / 001 cycles: 8
166 / 001 cycles: 8
167 / 001 cycles: 8
168 / 001 cycles: 8
169 / 001 cycles: 8
170 / 001 cycles: 8
171 / 001 cycles: 8
172 / 001 cycles: 8
173 / 001 cycles: 8
174 / 001 cycles: 8
175 / 001 cycles: 8
176 / 001 cycles: 8
177 / 001 cycles: 8
178 / 001 cycles: 8
179 / 001 cycles: 8
180 / 001 cycles: 8
181 / 001 cycles: 8
182 / 001 cycles: 8
183 / 001 cycles: 8
184 / 001 cycles: 8
185 / 001 cycles: 8
186 / 001 cycles: 8
187 / 001 cycles: 8
188 / 001 cycles: 8
189 / 001 cycles: 8
190 / 001 cycles: 8
191 / 001 cycles: 8
192 / 001 cycles: 8
193 / 001 cycles: 8
194 / 001 cycles: 8
195 / 001 cycles: 8
196 / 001 cycles: 8
197 / 001 cycles: 8
198 / 001 cycles: 8
199 / 001 cycles: 8
200 / 001 cycles: 8
201 / 001 cycles: 8
202 / 001 cycles: 8
203 / 001 cycles: 8
204 / 001 cycles: 8
205 / 001 cycles: 8
206 / 001 cycles: 8
207 / 001 cycles: 8
208 / 001 cycles: 8
209 / 001 cycles: 8
210 / 001 cycles: 8
211 / 001 cycles: 8
212 / 001 cycles: 8
213 / 001 cycles: 8
214 / 001 cycles: 8
215 / 001 cycles: 8
216 / 001 cycles: 8
217 / 001 cycles: 8
218 / 001 cycles: 8
219 / 001 cycles: 8
220 / 001 cycles: 8
221 / 001 cycles: 8
222 / 001 cycles: 8
223 / 001 cycles: 8
224 / 001 cycles: 8
225 / 001 cycles: 8
226 / 001 cycles: 8
227 / 001 cycles: 8
228 / 001 cycles: 8
229 / 001 cycles: 8
230 / 001 cycles: 8
231 / 001 cycles: 8
232 / 001 cycles: 8
233 / 001 cycles: 8
234 / 001 cycles: 8
235 / 001 cycles: 8
236 / 001 cycles: 8
237 / 001 cycles: 8
238 / 001 cycles: 8
239 / 001 cycles: 8
240 / 001 cycles: 8
241 / 001 cycles: 8
242 / 001 cycles: 8
243 / 001 cycles: 8
244 / 001 cycles: 8
245 / 001 cycles: 8
246 / 001 cycles: 8
247 / 001 cycles: 8
248 / 001 cycles: 8
249 / 001 cycles: 8
250 / 001 cycles: 8
251 / 001 cycles: 8
252 / 001 cycles: 8
253 / 001 cycles: 8
254 / 001 cycles: 8
255 / 001 cycles: 8
A div /0 takes 4 cycles.
For variable b:
Code:
Cycles mult (+ variable loads): 7
137 / 000 cycles: 4
137 / 001 cycles: 8
137 / 002 cycles: 8
137 / 003 cycles: 8
137 / 004 cycles: 7
137 / 005 cycles: 7
137 / 006 cycles: 7
137 / 007 cycles: 7
137 / 008 cycles: 7
137 / 009 cycles: 7
137 / 010 cycles: 7
137 / 011 cycles: 7
137 / 012 cycles: 7
137 / 013 cycles: 7
137 / 014 cycles: 7
137 / 015 cycles: 7
137 / 016 cycles: 6
137 / 017 cycles: 6
137 / 018 cycles: 6
137 / 019 cycles: 6
137 / 020 cycles: 6
137 / 021 cycles: 6
137 / 022 cycles: 6
137 / 023 cycles: 6
137 / 024 cycles: 6
137 / 025 cycles: 6
137 / 026 cycles: 6
137 / 027 cycles: 6
137 / 028 cycles: 6
137 / 029 cycles: 6
137 / 030 cycles: 6
137 / 031 cycles: 6
137 / 032 cycles: 6
137 / 033 cycles: 6
137 / 034 cycles: 6
137 / 035 cycles: 6
137 / 036 cycles: 6
137 / 037 cycles: 6
137 / 038 cycles: 6
137 / 039 cycles: 6
137 / 040 cycles: 6
137 / 041 cycles: 6
137 / 042 cycles: 6
137 / 043 cycles: 6
137 / 044 cycles: 6
137 / 045 cycles: 6
137 / 046 cycles: 6
137 / 047 cycles: 6
137 / 048 cycles: 6
137 / 049 cycles: 6
137 / 050 cycles: 6
137 / 051 cycles: 6
137 / 052 cycles: 6
137 / 053 cycles: 6
137 / 054 cycles: 6
137 / 055 cycles: 6
137 / 056 cycles: 6
137 / 057 cycles: 6
137 / 058 cycles: 6
137 / 059 cycles: 6
137 / 060 cycles: 6
137 / 061 cycles: 6
137 / 062 cycles: 6
137 / 063 cycles: 6
137 / 064 cycles: 5
137 / 065 cycles: 5
137 / 066 cycles: 5
137 / 067 cycles: 5
137 / 068 cycles: 5
137 / 069 cycles: 5
137 / 070 cycles: 5
137 / 071 cycles: 5
137 / 072 cycles: 5
137 / 073 cycles: 5
137 / 074 cycles: 5
137 / 075 cycles: 5
137 / 076 cycles: 5
137 / 077 cycles: 5
137 / 078 cycles: 5
137 / 079 cycles: 5
137 / 080 cycles: 5
137 / 081 cycles: 5
137 / 082 cycles: 5
137 / 083 cycles: 5
137 / 084 cycles: 5
137 / 085 cycles: 5
137 / 086 cycles: 5
137 / 087 cycles: 5
137 / 088 cycles: 5
137 / 089 cycles: 5
137 / 090 cycles: 5
137 / 091 cycles: 5
137 / 092 cycles: 5
137 / 093 cycles: 5
137 / 094 cycles: 5
137 / 095 cycles: 5
137 / 096 cycles: 5
137 / 097 cycles: 5
137 / 098 cycles: 5
137 / 099 cycles: 5
137 / 100 cycles: 5
137 / 101 cycles: 5
137 / 102 cycles: 5
137 / 103 cycles: 5
137 / 104 cycles: 5
137 / 105 cycles: 5
137 / 106 cycles: 5
137 / 107 cycles: 5
137 / 108 cycles: 5
137 / 109 cycles: 5
137 / 110 cycles: 5
137 / 111 cycles: 5
137 / 112 cycles: 5
137 / 113 cycles: 5
137 / 114 cycles: 5
137 / 115 cycles: 5
137 / 116 cycles: 5
137 / 117 cycles: 5
137 / 118 cycles: 5
137 / 119 cycles: 5
137 / 120 cycles: 5
137 / 121 cycles: 5
137 / 122 cycles: 5
137 / 123 cycles: 5
137 / 124 cycles: 5
137 / 125 cycles: 5
137 / 126 cycles: 5
137 / 127 cycles: 5
137 / 128 cycles: 5
137 / 129 cycles: 5
137 / 130 cycles: 5
137 / 131 cycles: 5
137 / 132 cycles: 5
137 / 133 cycles: 5
137 / 134 cycles: 5
137 / 135 cycles: 5
137 / 136 cycles: 5
137 / 137 cycles: 5
137 / 138 cycles: 5
137 / 139 cycles: 5
The closer the values of a and b are, the faster "div" is. It may sound trivial, but now I know it.
And there is no shortcut for /1. ARM could optimize this ;)

For a /10, which is often used to print numbers or for decimal->BCD conversion, we can expect 4..7 cycles per div for 3-digit numbers.
Not too bad.
 