Can you explain the runtime of this simple Teensy4 loop?

Status
Not open for further replies.

ossi

Well-known member
I am just playing around with the Teensy 4.0 and trying to understand how it works. I am currently running the following program:

Code:
int led = 13;

void setup() {
  pinMode(led, OUTPUT);
  Serial.begin(115200);  
  while(!Serial){} ;
  Serial.println("teensy40dualIssue5nops1a...") ;
  delay(200) ;
  }

void loop() { 
  cli()  ;  // following loop should not be interrupted
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
    asm volatile("nop"); // nop-1
    asm volatile("nop"); // nop-2
    asm volatile("nop"); // nop-3
    asm volatile("nop"); // nop-4
    asm volatile("nop"); // nop-5
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
    }
  }
I watch the led-pin with an oscilloscope and see a 150MHz squarewave, so one loop execution costs (only) 4 cycles. If I run the program with no nops it runs also with 4 cycles per loop. Can you explain how the teensy "removes" the nops at runtime? Generated code is as follows:
Code:
void loop() { 
  cli()  ;  // following loop should not be interrupted
      dc:	b672      	cpsid	i
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
      de:	4a06      	ldr	r2, [pc, #24]	; (f8 <loop+0x1c>)
      e0:	2308      	movs	r3, #8
      e2:	f8c2 3084 	str.w	r3, [r2, #132]	; 0x84
    asm volatile("nop"); // nop-1
      e6:	bf00      	nop
    asm volatile("nop"); // nop-2
      e8:	bf00      	nop
    asm volatile("nop"); // nop-3
      ea:	bf00      	nop
    asm volatile("nop"); // nop-4
      ec:	bf00      	nop
    asm volatile("nop"); // nop-5
      ee:	bf00      	nop
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
      f0:	f8c2 3088 	str.w	r3, [r2, #136]	; 0x88
      f4:	e7f5      	b.n	e2 <loop+0x6>
      f6:	bf00      	nop
      f8:	42004000 	.word	0x42004000
So the code really does contain the nops, which means the CPU must get rid of them at runtime. Can you explain this behaviour? If I use 7 nops, the loop runtime increases to 5 cycles, so not all nops are ignored any more. Very interesting behaviour.

And I do this all just because I am curious. It's no real need for me to understand but it would be nice.
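
For anyone checking the arithmetic in the post above: each loop iteration sets and clears the pin once, so one iteration is one full period of the square wave. A minimal sketch of the conversion, assuming a 600 MHz core clock as stated in the code comment (the helper names `cyclesPerLoop` and `pinFrequencyHz` are made up for illustration and are not part of the Teensy core):

```cpp
#include <cstdint>

// Hypothetical helper: core cycles per loop iteration, given the core clock
// and the square-wave frequency observed on the pin (one period per iteration).
constexpr uint32_t cyclesPerLoop(uint32_t coreClockHz, uint32_t pinFreqHz) {
  return coreClockHz / pinFreqHz;
}

// Hypothetical helper: the pin frequency expected for a given cycle count.
constexpr uint32_t pinFrequencyHz(uint32_t coreClockHz, uint32_t cycles) {
  return coreClockHz / cycles;
}

// 600 MHz core clock and a 150 MHz square wave give 4 cycles per iteration.
static_assert(cyclesPerLoop(600000000u, 150000000u) == 4, "4 cycles per loop");

// The 5-cycle loop reported for 7 nops would show up as 120 MHz on the scope.
static_assert(pinFrequencyHz(600000000u, 5u) == 120000000u, "120 MHz");
```

The integer division is exact for these values; for a frequency that does not divide the clock evenly it would round down.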
 
Interestingly, a Cortex-M4 shows the same behaviour, so it's not a CM7 feature.
Code:
#include "Teensy_perf.h"

void fnops4() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops5() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops6() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops7() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops8() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void setup() {
  while (!Serial && millis() < 4000 );
  Serial.printf("Cycles 4nops: %d\n", measure(fnops4));
  Serial.printf("Cycles 5nops: %d\n", measure(fnops5));
  Serial.printf("Cycles 6nops: %d\n", measure(fnops6));
  Serial.printf("Cycles 7nops: %d\n", measure(fnops7));
  Serial.printf("Cycles 8nops: %d\n", measure(fnops8));
}
  void loop() {
  }

Code:
Cycles 4nops: 2
Cycles 5nops: 2
Cycles 6nops: 3
Cycles 7nops: 3
Cycles 8nops: 4
The pipeline loads two thumb instructions at once. Not more.
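
A quick way to read the table above: the five printed counts happen to match "number of nops divided by two, rounded down". The sketch below is only a fit to those five numbers, not a description of the real pipeline; `modelCycles` is a made-up name.

```cpp
#include <cstdint>

// Hypothetical model, fitted only to the five measurements printed above:
// two nops are handled per cycle, and a trailing odd nop adds no extra cycle.
constexpr uint32_t modelCycles(uint32_t nops) {
  return nops / 2;  // integer division rounds down
}

// Compare the model against the printed results.
static_assert(modelCycles(4) == 2, "4 nops");
static_assert(modelCycles(5) == 2, "5 nops");
static_assert(modelCycles(6) == 3, "6 nops");
static_assert(modelCycles(7) == 3, "7 nops");
static_assert(modelCycles(8) == 4, "8 nops");
```

Whether the rounding comes from the pipeline itself or from how the measurement overhead is accounted for cannot be told from these numbers alone.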
 

Which Teensy is that ? 3.2 or 3.6 or either one?
 
I measured on a T3.6 and a T4.0.
Since all 3.x boards are Cortex-M4, I suspect they need the same number of cycles, but I have not measured a 3.2 so far.
 
This shows for T_3.2:
Code:
Cycles 4nops: 4
Cycles 5nops: 4
Cycles 6nops: 5
Cycles 7nops: 8
Cycles 8nops: 8

Thought the T_3.6 K66 had upgraded processing.
 
...oh... indeed! I wonder what exactly it is.
Edit: ehh... did you add "FASTRUN"? Otherwise you are seeing the flash wait states.
You need the extended measure(a, b) function mentioned in the other thread.
 
Oops.
I measured again. Tim, you're right: the 3.6 does NOT have this feature. It shows 4 cycles for 4 nops and 8 cycles for 8 nops. Edit: this applies to both the T3.2 and the T3.6.
Seems I was a bit too fast with my copy-and-paste actions... sorry.
Shit happens.
 
Didn't see measure(a,b) in other thread? Just got the current teensy-Perf from github

TSet failed on T_3.2 - it made >> "T:\temp\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.TEENSY32.hex"
but TyComm was told to upload :: >> File 'T:\TEMP\\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.teensy31.hex' does not exist
 
Welcome to my world of trying to figure out why complex modern hardware is not easy to benchmark. :confused:

Note, I know nothing about ARM so I can't speculate on the particulars.

Since the T4 uses caching, etc., one thing to consider is code that crosses cache boundaries. We had one of the spec benchmarks that would routinely differ by something like 10%. We eventually traced it down to what else was loaded into the program: depending on that, the hot loop of just a few instructions was either completely contained within one cache boundary, or it started in one cache boundary and ended in another, and that would throw off the timing. Normally it wouldn't matter, but since that one loop was the hot loop in the program, in this case it did. I just started ignoring that benchmark and concentrated only on more repeatable benchmarks.
 
Tim, use this:
https://forum.pjrc.com/threads/5903...s-on-Teensy4-0?p=228385&viewfull=1#post228385
@Michael: no, it was just a copy-and-paste error and I was a bit too fast.
For me, it's not a benchmark; I'd like to know how the CPU works.
GCC has much more influence than these cycle measurements, so trying to micro-optimize is pretty senseless, especially for "nop" :)
More interesting, for me, was the influence of the flash wait states, the cache, and my measurement of "div" in the other thread.
In the next few days, I want to test the buses and see whether it makes sense to rearrange lines of code around bus transfers (peripheral accesses).

The influence of the cache is eliminated in these tests: it gets invalidated by the measure() function.
 
Looks like TSET needs this update to CMD files : if "%model%"=="teensy31" set model=teensy32.
> After build but before upload. The BUILD process accepts only model=teensy31 - but it creates the file above as teensy32

<edit> : Good thing for TSET these defines all come from same build params as before now with TD_1.50b1:: -D__MK20DX256__ -DTEENSYDUINO=150 -DARDUINO=10600 -DARDUINO_TEENSY32


I see this:
Code:
// BEFORE
Cycles 4nops: 4
Cycles 5nops: 4
Cycles 6nops: 5
Cycles 7nops: 8
Cycles 8nops: 8
// AFTER with code below
Cycles 4nops: 4
Cycles 5nops: 5
Cycles 6nops: 6
Cycles 7nops: 7
Cycles 8nops: 8

Using this based on linked thread???::
Code:
#include "Teensy_perf.h"

FASTRUN void emptyf() {};
FASTRUN
void fnops4() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops5() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops6() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops7() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops8() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void setup() {
  while (!Serial && millis() < 4000 );
  Serial.printf("Cycles 4nops: %d\n", measure(fnops4, emptyf));
  Serial.printf("Cycles 5nops: %d\n", measure(fnops5, emptyf));
  Serial.printf("Cycles 6nops: %d\n", measure(fnops6, emptyf));
  Serial.printf("Cycles 7nops: %d\n", measure(fnops7, emptyf));
  Serial.printf("Cycles 8nops: %d\n", measure(fnops8, emptyf));
}
  void loop() {
  }
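
The two-argument measure(fn, emptyf) calls above suggest a baseline subtraction: time the function under test, time an empty function, and report the difference. Here is a minimal sketch of that idea. It is not the Teensy_perf.h implementation; all names are made up, it does no cache invalidation, and the counter is passed in as a function so the sketch also runs on a PC with a simulated counter. On a Teensy one would presumably pass a function that returns ARM_DWT_CYCCNT.

```cpp
#include <cstdint>

using CounterFn = uint32_t (*)();  // returns the current cycle count
using VoidFn = void (*)();         // function under test

// Cycles spent calling f(), including call and return overhead.
uint32_t elapsedCycles(CounterFn now, VoidFn f) {
  const uint32_t start = now();
  f();
  return now() - start;  // unsigned subtraction tolerates counter wraparound
}

// Net cycles of f(): gross time minus the time of an empty baseline function.
uint32_t measureNet(CounterFn now, VoidFn f, VoidFn baseline) {
  const uint32_t gross = elapsedCycles(now, f);
  const uint32_t overhead = elapsedCycles(now, baseline);
  return gross - overhead;
}

// Simulated counter for running on a PC (assumption: a call costs 3 cycles).
static uint32_t fakeCycles = 0;
uint32_t fakeNow() { return fakeCycles; }
void fakeEmpty() { fakeCycles += 3; }      // call overhead only
void fakeNops8() { fakeCycles += 3 + 8; }  // call overhead plus 8 cycles of work
```

With the simulated counter, measureNet(fakeNow, fakeNops8, fakeEmpty) yields the 8 cycles of work with the call overhead removed, which is the same kind of correction that turns the "BEFORE" numbers into the "AFTER" numbers, together with FASTRUN.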
 
Yes, the second set are the right cycle counts: 4, 5, 6, 7, 8.
As I said, it was my fault when I said the CM4 pipeline can ignore nops, too. That was just wrong.

But I'm happy that I now have an easy-to-use tool to measure such things.
It was a pain to use micros() and never be sure whether it was used correctly.
And I now know how to achieve more deterministic performance if I ever need it: with the now stable code in measure() it's easy to invalidate the caches, so the runtime will not vary much (taking DMA and flash out of the equation).
 
Frank: Just confirmed Build.txt changed this in TD_1.49 and prior
teensy31.build.board=TEENSY31
to this in TD 1.50:
teensy31.build.board=TEENSY32

Paul I suppose you did this for your clarity/plan?

It broke the command line build Upload as noted in post #10:
TSet failed on T_3.2 - it made >> "T:\temp\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.TEENSY32.hex"
but TyComm was told to upload :: >> File 'T:\TEMP\\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.teensy31.hex' does not exist

Building with TD 1.50 and later requires something like the line below. With this line in place, builds on prior versions will fail the other way around, because the HEX file will then have the wrong name:
> if "%model%"=="teensy31" set model=teensy32
 

My theory is that the code gets executed during the bus accesses that are needed to access the GPIO peripheral.
So it makes sense to interleave bus accesses with calculations.

So the nops are "for free", and you can try to execute other code there. BUT: nops are somewhat special, as the pipeline can remove them, so they are not good for tests like this.
The manual says they are used to align the following code. Well... exactly what the name suggests: no operation!


So it is better to test with "mov r0,r0", which cannot be removed. But attention: if you use two of them, the second cannot be executed in parallel ("dual issue"),
so use a sequence like "mov r0,r0 - mov r1,r1".


I'll do further tests in the next few days. Understanding this in detail would help to write and optimize libraries which access peripherals. But only where it's important to save some cycles... not many use cases!
 
Hm, interesting. So, now with your fix it works with 1.50?

Yes, Just updated :: github.com/Defragster/Tset/blob/master/TSet.cmd2#L35

TSet.cmd2 now ends like this. It is a quick fix that breaks on older TD versions because of the change to %model%:
Code:
rem Comment line below to build prior to TeensyDuino 1.50
if "%model%"=="teensy31" set model=teensy32
if not "%1"=="0" (
  if "%errorlevel%"=="0" "%TyTools%\TyCommanderC.exe" upload --autostart --wait --multi "%temp1%\%sketchname%.%model%.hex"
  REM "%arduino%\hardware\tools\arm\bin\arm-none-eabi-gcc-nm.exe" -n "%temp1%\%sketchname%.elf" | "%tools%\imxrt_size.exe"
)

AFAIK only affects the T_3.2/3.1. T4 works - including the USB descriptors I've tried: Serial, MTP {when boards.txt uncommented}, Keyboard
 
It seems a lot of discussion took place late yesterday evening when I was not online. Thank you all for your contributions. Exact execution timing seems to be very difficult to determine on these high-performance CPUs. Is it right that ARM did not publish any execution timings for the Cortex-M7 instructions?
 
It seems the instruction execution time should be known. What isn't known is how the T4's dual execution core will process instructions, and the speed of the instruction feeding into the processor from Flash if not from ITCM RAM.
 
Can someone tell me how to disable the instruction cache? I think Frank B. comes close to it in his measure concept.
As far as I understand now I should invalidate the I-cache by
SCB_CACHE_ICIALLU = 0; // invalidate I-Cache
and then probably disable the cache by clearing the appropriate bit in CCR. Thereafter DSB and ISB probably have to be issued. Am I right?
 

I keep the cache on purpose, because I want reproducible timing.
Search in CMSIS; there are functions for almost everything (Core_cm7.h, in the Teensy 4 core directory, but you'll have to edit all the names).

As long as you're not using FLASHMEM you don't need to disable the cache, because everything is in RAM anyway (zero wait states, like cache).

edit: found it
Code:
__STATIC_INLINE void SCB_DisableICache (void)
{
  #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U)
    __DSB();
    __ISB();
    SCB->CCR &= ~(uint32_t)SCB_CCR_IC_Msk;  /* disable I-Cache */
    SCB->ICIALLU = 0UL;                     /* invalidate I-Cache */
    __DSB();
    __ISB();
  #endif
}
 