Can you explain the runtime of this simple Teensy4 loop?

ossi · Jan 31, 2020

I am just playing around with the Teensy 4.0 and trying to understand how it works. I am currently executing the following program:

Code:

int led = 13;

void setup() {
  pinMode(led, OUTPUT);
  Serial.begin(115200);  
  while(!Serial){} ;
  Serial.println("teensy40dualIssue5nops1a...") ;
  delay(200) ;
  }

void loop() { 
  cli()  ;  // following loop should not be interrupted
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
    asm volatile("nop"); // nop-1
    asm volatile("nop"); // nop-2
    asm volatile("nop"); // nop-3
    asm volatile("nop"); // nop-4
    asm volatile("nop"); // nop-5
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
    }
  }

I watch the LED pin with an oscilloscope and see a 150MHz square wave, so one loop execution costs (only) 4 cycles. If I run the program with no nops, it also runs at 4 cycles per loop. Can you explain how the Teensy "removes" the nops at runtime? The generated code is as follows:

Code:

void loop() { 
  cli()  ;  // following loop should not be interrupted
      dc:	b672      	cpsid	i
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
      de:	4a06      	ldr	r2, [pc, #24]	; (f8 <loop+0x1c>)
      e0:	2308      	movs	r3, #8
      e2:	f8c2 3084 	str.w	r3, [r2, #132]	; 0x84
    asm volatile("nop"); // nop-1
      e6:	bf00      	nop
    asm volatile("nop"); // nop-2
      e8:	bf00      	nop
    asm volatile("nop"); // nop-3
      ea:	bf00      	nop
    asm volatile("nop"); // nop-4
      ec:	bf00      	nop
    asm volatile("nop"); // nop-5
      ee:	bf00      	nop
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
      f0:	f8c2 3088 	str.w	r3, [r2, #136]	; 0x88
      f4:	e7f5      	b.n	e2 <loop+0x6>
      f6:	bf00      	nop
      f8:	42004000 	.word	0x42004000

So the code really contains the nops, which means the CPU must get rid of them at runtime. Can you explain the behaviour? If I use 7 nops, the loop runtime increases to 5 cycles, so not all nops are ignored any more. Very interesting behaviour.

I am doing all this just because I am curious. I have no real need to understand it, but it would be nice.
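
The cycle counts quoted in this post follow from the scope reading by simple division, because one loop iteration (one set plus one clear) produces exactly one period of the square wave. A minimal sketch of that arithmetic in host-side C++ is given here; the function names `cyclesPerLoop` and `pinHzForCycles` are made up for illustration and are not part of any Teensy library.

```cpp
#include <cassert>
#include <cstdint>

// Core clock cycles per loop iteration, assuming one iteration produces
// exactly one period of the measured square wave. Rounds to the nearest
// whole cycle to absorb small scope reading errors.
inline std::uint32_t cyclesPerLoop(std::uint64_t cpuHz, std::uint64_t signalHz) {
    assert(signalHz > 0);
    return static_cast<std::uint32_t>((cpuHz + signalHz / 2) / signalHz);
}

// Inverse relation: the pin frequency to expect for a given cycle count.
inline std::uint64_t pinHzForCycles(std::uint64_t cpuHz, std::uint32_t cycles) {
    assert(cycles > 0);
    return cpuHz / cycles;
}
```

With a 600MHz clock and a 150MHz square wave this gives 4 cycles per loop, and a 5-cycle loop (the 7-nop case) would correspond to a 120MHz signal on the pin.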

KurtE · Jan 31, 2020

You might do a quick search on NOP...

There have been several threads talking about this, including the recent one:
https://forum.pjrc.com/threads/59033-Timing-of-nop-delayloops-on-Teensy4-0?highlight=nop

Frank B · Jan 31, 2020

ossi said:
I am doing all this just because I am curious. I have no real need to understand it, but it would be nice.

I think something (the pipeline) still has to load all these nops, and that takes time if there are too many of them.

Frank B · Jan 31, 2020

Interestingly, a Cortex-M4 shows the same behaviour, so it's not a CM7 feature.

Code:

#include "Teensy_perf.h"

void fnops4() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops5() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops6() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops7() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void fnops8() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void setup() {
  while (!Serial && millis() < 4000 );
  Serial.printf("Cycles 4nops: %d\n", measure(fnops4));
  Serial.printf("Cycles 5nops: %d\n", measure(fnops5));
  Serial.printf("Cycles 6nops: %d\n", measure(fnops6));
  Serial.printf("Cycles 7nops: %d\n", measure(fnops7));
  Serial.printf("Cycles 8nops: %d\n", measure(fnops8));
}
void loop() {
}

Code:

Cycles 4nops: 2
Cycles 5nops: 2
Cycles 6nops: 3
Cycles 7nops: 3
Cycles 8nops: 4

The pipeline loads two Thumb instructions at once, not more.

defragster · Jan 31, 2020

Frank B said:
Interestingly, a Cortex-M4 shows the same behaviour, so it's not a CM7 feature.
...

Code:

Cycles 4nops: 2
Cycles 5nops: 2
Cycles 6nops: 3
Cycles 7nops: 3
Cycles 8nops: 4

The pipeline loads two Thumb instructions at once, not more.

Which Teensy is that? 3.2, 3.6, or either one?

Frank B · Jan 31, 2020

I measured on a T3.6 and a 4.0.
Since all 3.x are Cortex-M4, I suspect they need the same number of cycles, but I have not measured a 3.2 so far.

defragster · Jan 31, 2020

This shows for T_3.2:

Code:

Cycles 4nops: 4
Cycles 5nops: 4
Cycles 6nops: 5
Cycles 7nops: 8
Cycles 8nops: 8

Thought the T_3.6 K66 had upgraded processing.

Frank B · Jan 31, 2020

..oh.. indeed! I wonder what exactly it is.
Edit: ehh... did you add "FASTRUN"? Otherwise you see the flash wait states.
You need the extended measure(a, b) function mentioned in the other thread.

Frank B · Jan 31, 2020

Oops.
I measured again. Tim, you're right: the 3.6 does NOT have this feature. It shows 4 cycles for 4 nops, up to 8 cycles for 8 nops. Edit: this is the case for both the T3.2 and the T3.6.
It seems I was a bit too fast with my copy-and-paste actions... sorry.
shit happens.

defragster · Jan 31, 2020

I didn't see measure(a,b) in the other thread. I just got the current teensy-Perf from GitHub.

TSet failed on T_3.2 - it made >> "T:\temp\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.TEENSY32.hex"
but TyComm was told to upload :: >> File 'T:\TEMP\\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.teensy31.hex' does not exist

MichaelMeissner · Jan 31, 2020

Welcome to my world of trying to figure out why complex modern hardware is not easy to benchmark.

Note, I know nothing about ARM so I can't speculate on the particulars.

Since the T4 uses caching, etc., one thing to consider is code that crosses cache-line boundaries. We had one of the spec benchmarks whose results would routinely differ by something like 10%. We eventually traced it to this: depending on what else was loaded into the program, the hot loop of just a few instructions was pushed from being completely contained in one cache line to starting in one cache line and ending in another, and that would throw off the timing. Normally it wouldn't matter, but since that one loop was the hot loop in the program, in this case it did. I just started ignoring that benchmark and concentrated on more repeatable ones.
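
Whether a small hot loop straddles a cache line can be checked from its start address and size. The sketch below is host-side C++ and only does the address arithmetic; the helper name `crossesCacheLine` is hypothetical, and the 32-byte default line size is an assumption about the Cortex-M7 L1 cache that should be checked against the reference manual. Code that runs from tightly coupled RAM rather than flash is not affected by this at all.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the byte range [start, start + sizeBytes) touches more
// than one cache line. lineBytes must be a power of two; the default of 32
// is an assumed value, not taken from this thread.
inline bool crossesCacheLine(std::uint32_t start, std::uint32_t sizeBytes,
                             std::uint32_t lineBytes = 32) {
    assert(sizeBytes > 0);
    assert(lineBytes != 0 && (lineBytes & (lineBytes - 1)) == 0);
    const std::uint32_t firstLine = start / lineBytes;
    const std::uint32_t lastLine = (start + sizeBytes - 1) / lineBytes;
    return firstLine != lastLine;
}
```

For the loop body in the disassembly at the top of the thread (offsets e2 to f5, 20 bytes), `crossesCacheLine(0xe2, 20)` is false under the 32-byte assumption, while the same 20 bytes starting at offset f0 would straddle two lines.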

Frank B · Jan 31, 2020

Tim, use this:
https://forum.pjrc.com/threads/5903...s-on-Teensy4-0?p=228385&viewfull=1#post228385
@Michael: no, it was just a copy-and-paste error and I was a bit too fast.
For me it's not a benchmark; I'd like to know how the CPU works.
GCC has much more influence than these cycle measurements or attempts at micro-optimization. Those are pretty senseless.
Especially with "nop".

More interesting, for me, was the influence of the flash wait states, the cache, and my measurement of "div" in the other thread.
In the next few days I want to test the buses and see whether it makes sense to rearrange lines of code around bus transfers (peripheral accesses).

The influence of the cache is eliminated in these tests. It gets invalidated by the measure() function.

defragster · Jan 31, 2020

Looks like TSET needs this update to CMD files : if "%model%"=="teensy31" set model=teensy32.
> After build but before upload. The BUILD process accepts only model=teensy31 - but it creates the file above as teensy32

<edit> : Good thing for TSET these defines all come from same build params as before now with TD_1.50b1:: -D__MK20DX256__ -DTEENSYDUINO=150 -DARDUINO=10600 -DARDUINO_TEENSY32

I see this:

Code:

// BEFORE
Cycles 4nops: 4
Cycles 5nops: 4
Cycles 6nops: 5
Cycles 7nops: 8
Cycles 8nops: 8
// AFTER with code below
Cycles 4nops: 4
Cycles 5nops: 5
Cycles 6nops: 6
Cycles 7nops: 7
Cycles 8nops: 8

Using this, based on the linked thread (?):

Code:

#include "Teensy_perf.h"

FASTRUN void emptyf() {};
FASTRUN
void fnops4() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops5() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops6() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops7() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
FASTRUN
void fnops8() {
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
  asm volatile ("nop");
};
void setup() {
  while (!Serial && millis() < 4000 );
  Serial.printf("Cycles 4nops: %d\n", measure(fnops4, emptyf));
  Serial.printf("Cycles 5nops: %d\n", measure(fnops5, emptyf));
  Serial.printf("Cycles 6nops: %d\n", measure(fnops6, emptyf));
  Serial.printf("Cycles 7nops: %d\n", measure(fnops7, emptyf));
  Serial.printf("Cycles 8nops: %d\n", measure(fnops8, emptyf));
}
void loop() {
}

Frank B · Jan 31, 2020

Yes, the second set are the right cycle counts: 4, 5, 6, 7, 8.
As I said, it was my mistake when I said the CM4 pipeline can ignore nops, too. That was just wrong.

But I'm happy that I now have an easy-to-use tool to measure such things.
It was a pain to use micros() and never be sure whether it was used correctly.
And I now know how to achieve more deterministic performance if I ever need it: with the now-stable code in measure() it's easy to invalidate the caches, so the runtime will not vary much (taking DMA and flash out of the equation).

defragster · Jan 31, 2020

Frank: Just confirmed Build.txt changed this in TD_1.49 and prior

teensy31.build.board=TEENSY31
to this in TD 1.50:
teensy31.build.board=TEENSY32

Paul, I suppose you did this for your clarity/plan?

It broke the command line build Upload as noted in post #10:

TSet failed on T_3.2 - it made >> "T:\temp\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.TEENSY32.hex"
but TyComm was told to upload :: >> File 'T:\TEMP\\arduino_build_FB_Perf_NOP.ino\FB_Perf_NOP.ino.teensy31.hex' does not exist

Building in TD 1.50 and later requires 'something like' this. Building with this change won't work on prior versions, where the HEX file will have the wrong name the other way around:
> if "%model%"=="teensy31" set model=teensy32

Frank B · Jan 31, 2020

Hm, interesting. So, now with your fix it works with 1.50?

Frank B · Jan 31, 2020

ossi said:
I am just playing around with the Teensy 4.0 and trying to understand how it works. I am currently executing the following program:

Code:

int led = 13;

void setup() {
  pinMode(led, OUTPUT);
  Serial.begin(115200);
  while(!Serial){} ;
  Serial.println("teensy40dualIssue5nops1a...") ;
  delay(200) ;
  }

void loop() {
  cli()  ;  // following loop should not be interrupted
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
    asm volatile("nop"); // nop-1
    asm volatile("nop"); // nop-2
    asm volatile("nop"); // nop-3
    asm volatile("nop"); // nop-4
    asm volatile("nop"); // nop-5
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
    }
  }

I watch the LED pin with an oscilloscope and see a 150MHz square wave, so one loop execution costs (only) 4 cycles. If I run the program with no nops, it also runs at 4 cycles per loop. Can you explain how the Teensy "removes" the nops at runtime? The generated code is as follows:

Code:

void loop() {
  cli()  ;  // following loop should not be interrupted
      dc:	b672      	cpsid	i
  while(1){ // loop executes with 150MHz at 600MHz clock
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK; // led on
      de:	4a06      	ldr	r2, [pc, #24]	; (f8 <loop+0x1c>)
      e0:	2308      	movs	r3, #8
      e2:	f8c2 3084 	str.w	r3, [r2, #132]	; 0x84
    asm volatile("nop"); // nop-1
      e6:	bf00      	nop
    asm volatile("nop"); // nop-2
      e8:	bf00      	nop
    asm volatile("nop"); // nop-3
      ea:	bf00      	nop
    asm volatile("nop"); // nop-4
      ec:	bf00      	nop
    asm volatile("nop"); // nop-5
      ee:	bf00      	nop
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK; // led off
      f0:	f8c2 3088 	str.w	r3, [r2, #136]	; 0x88
      f4:	e7f5      	b.n	e2 <loop+0x6>
      f6:	bf00      	nop
      f8:	42004000 	.word	0x42004000

So the code really contains the nops, which means the CPU must get rid of them at runtime. Can you explain the behaviour? If I use 7 nops, the loop runtime increases to 5 cycles, so not all nops are ignored any more. Very interesting behaviour.

I am doing all this just because I am curious. I have no real need to understand it, but it would be nice.

My theory is that the code gets executed during the bus accesses which are needed to access the "GPIO" peripheral.
So it makes sense to interleave bus accesses with calculations.

So the "nops" are "for free", and you can try to execute other code there. BUT: nops are somewhat special, as the pipeline can remove them, so they are not good for tests like this.
The manual says they are used to align the following code. Well, exactly what the name suggests: no operation!

So it is better to test with "mov r0,r0", which cannot be removed. But attention: if you use two of them, the second cannot be executed in parallel ("dual issue"),
so use a sequence like "mov r0,r0 - mov r1,r1".

I'll do further tests in the next few days. Understanding this in detail would help to write and optimize libraries which access peripherals. But only where it's important to save some cycles... not many use cases!
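
The pairing idea in this post can be written down as a toy model. The sketch below is host-side C++, not Teensy code, and it is not ARM's documented issue logic: it simply assumes that two adjacent register moves can issue in the same cycle only when the second one neither reads nor writes the register written by the first. The names `Mov`, `canPair` and `issueCycles` are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One register-to-register move, e.g. "mov r1,r0" is {dst = 1, src = 0}.
struct Mov {
    int dst;
    int src;
};

// Toy pairing rule (an assumption, not taken from ARM documentation):
// the second move may issue alongside the first only if it neither reads
// nor writes the register that the first one writes.
inline bool canPair(const Mov& first, const Mov& second) {
    return second.src != first.dst && second.dst != first.dst;
}

// Issue cycles for a straight-line sequence under the toy rule: every
// cycle issues one instruction, or two adjacent ones if they can pair.
inline std::size_t issueCycles(const std::vector<Mov>& seq) {
    std::size_t cycles = 0;
    std::size_t i = 0;
    while (i < seq.size()) {
        const bool pair = (i + 1 < seq.size()) && canPair(seq[i], seq[i + 1]);
        i += pair ? 2 : 1;
        ++cycles;
    }
    assert(cycles <= seq.size());  // never more cycles than instructions
    return cycles;
}
```

Under this model "mov r0,r0 - mov r0,r0" takes two issue cycles and "mov r0,r0 - mov r1,r1" takes one, which is the reason given above for alternating registers. Real hardware has further constraints, so measured numbers remain the reference.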

defragster · Jan 31, 2020

Frank B said:
Hm, interesting. So, now with your fix it works with 1.50?

Yes, Just updated :: github.com/Defragster/Tset/blob/master/TSet.cmd2#L35

TSet.cmd2 now ends like this, as a quick fix that breaks on older TD versions because of the change to %model%:

Code:

rem Comment line below to build prior to TeensyDuino 1.50
if "%model%"=="teensy31" set model=teensy32
if not "%1"=="0" (
  if "%errorlevel%"=="0" "%TyTools%\TyCommanderC.exe" upload --autostart --wait --multi "%temp1%\%sketchname%.%model%.hex"
  REM "%arduino%\hardware\tools\arm\bin\arm-none-eabi-gcc-nm.exe" -n "%temp1%\%sketchname%.elf" | "%tools%\imxrt_size.exe"
)

AFAIK only affects the T_3.2/3.1. T4 works - including the USB descriptors I've tried: Serial, MTP {when boards.txt uncommented}, Keyboard

ossi · Jan 31, 2020

It seems a lot of discussion took place late yesterday evening when I was not online. Thank you all for your contributions. Exact execution timing seems to be very difficult to determine on these high-power CPUs. Is it right that ARM did not publish any execution timings for the Cortex-M7 instructions?

defragster · Jan 31, 2020

ossi said:
It seems a lot of discussion took place late yesterday evening when I was not online. Thank you all for your contributions. Exact execution timing seems to be very difficult to determine on these high-power CPUs. Is it right that ARM did not publish any execution timings for the Cortex-M7 instructions?

It seems the instruction execution time should be known. What isn't known is how the T4's dual execution core will process instructions, and the speed of the instruction feeding into the processor from Flash if not from ITCM RAM.

ossi · Feb 1, 2020

Can someone tell me how to disable the instruction cache? I think Frank B. comes close to it in his measure concept.
As far as I understand now I should invalidate the I-cache by
SCB_CACHE_ICIALLU = 0; // invalidate I-Cache
and then probably disable the cache by clearing the appropriate bit in CCR. Thereafter, DSB and ISB probably have to be issued. Am I right?

Frank B · Feb 1, 2020

ossi said:
Can someone tell me how to disable the instruction cache? I think Frank B. comes close to it in his measure concept.
As far as I understand now I should invalidate the I-cache by
SCB_CACHE_ICIALLU = 0; // invalidate I-Cache
and then probably disable the cache by clearing the appropriate bit in CCR. Thereafter, DSB and ISB probably have to be issued. Am I right?

I keep the cache on purpose, because I want reproducible timing.
Search in CMSIS; there are functions for almost everything. (Core_cm7.h, in the Teensy 4 core directory, but you'll have to edit all the names.)

As long as you're not using FLASHMEM you don't need to disable the cache, because everything is in RAM anyway (zero wait states, like cache).

edit: found it

Code:

__STATIC_INLINE void SCB_DisableICache (void)
{
  #if defined (__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U)
    __DSB();
    __ISB();
    SCB->CCR &= ~(uint32_t)SCB_CCR_IC_Msk;  /* disable I-Cache */
    SCB->ICIALLU = 0UL;                     /* invalidate I-Cache */
    __DSB();
    __ISB();
  #endif
}
