Teensy 4.0 First Beta Test

Status
Not open for further replies.
@Paul that the new, additional values can be used everywhere.
some change with the arm-clockspeed.

Ok, I'll wait for your clock-api.
@luni, I'll send you some code at the weekend, then you can at least patch your IntervalTimer and clockspeed.c to make your library work with T4.
 
@defragster: Oh indeed -
@Luni, sorry, seems not to work.
apart from the typo here:
constexpr IntervalTimer() {
CCM_CSCMR1 &= ~CCM_CSCMR1_PERCLK_CLK_SEL;
...

Made that edit and it now works the same with and without the change:

Beta7 _isr's and 'ii' passes through loop with deluge USB printing:
ITcnt=500003 loop() ii=13635
ITcnt=499984 loop() ii=14107

Edited "~" FB header _isr's and 'ii' passes through loop with deluge USB printing:
ITcnt=500197 loop() ii=14098
ITcnt=501055 loop() ii=16832

The loop count flux seems to be 13K to 17K on both - though an avg USB out might be a bit higher on the FB edit?
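For anyone following the "~" edit: a minimal host-side sketch of why the "~" matters in a read-modify-write like the CCM_CSCMR1 line quoted above. With the "~" only the masked bit is cleared; without it everything except that bit is wiped. The bit position used for PERCLK_CLK_SEL here (bit 6) and the helper names are assumptions for illustration, not taken from the core headers.

```cpp
#include <cstdint>

// ASSUMPTION for illustration only: PERCLK_CLK_SEL is treated as bit 6.
// Check the real definition in the core headers before relying on it.
constexpr uint32_t PERCLK_CLK_SEL = 1u << 6;

// reg &= ~mask : clears only the masked bit(s), all other bits keep their value.
constexpr uint32_t clear_bits(uint32_t reg, uint32_t mask) {
  return reg & ~mask;
}

// reg &= mask : keeps ONLY the masked bit(s) and zeroes everything else,
// including any neighbouring divider field in the same register.
constexpr uint32_t and_without_invert(uint32_t reg, uint32_t mask) {
  return reg & mask;
}

static_assert(clear_bits(0x0000007Fu, PERCLK_CLK_SEL) == 0x0000003Fu,
              "bit 6 cleared, lower bits untouched");
static_assert(and_without_invert(0x0000007Fu, PERCLK_CLK_SEL) == 0x00000040u,
              "only bit 6 survives");
```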
 
I know I am jumping into this late. I've been busy all day: my wife is going for knee surgery (not a replacement though) tomorrow, so I had things to do today, including trying to get one of my hard drives working again, with no luck.

Has anybody gone over the NXP App Note on measuring interrupt latency, https://www.nxp.com/docs/en/application-note/AN12078.pdf? Or does this not apply?

There is one warning in there though:
Set i.MX RT1050 IPG clock frequency to 150 MHz, and updated the results. It was set to 300 MHz before which is not allowed.


@defragster - is that the only change you made to your test sketch?
 
Looks like more things are unexpectedly slow on the T4
Code:
start = ARM_DWT_CYCCNT;

digitalWriteFast(LED_BUILTIN,!digitalReadFast(LED_BUILTIN));

end = ARM_DWT_CYCCNT;
Serial.println(end - start);
takes 119 cycles...
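Side note on the measurement pattern used here: subtracting two reads of a free-running 32-bit counter with unsigned arithmetic stays correct even if the counter rolls over once between the reads. A small host-side sketch of that, with hypothetical helper names; the 600 MHz core clock in the wrap-time check is an assumption.

```cpp
#include <cstdint>

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Unsigned subtraction wraps modulo 2^32, so the difference is still right
// when the counter overflowed once between `start` and `end`.
constexpr uint32_t elapsed_cycles(uint32_t start, uint32_t end) {
  return end - start;
}

// Whole seconds until a 32-bit cycle counter wraps at a given core clock.
constexpr uint32_t wrap_seconds(uint32_t core_hz) {
  return static_cast<uint32_t>(0x100000000ull / core_hz);
}

static_assert(elapsed_cycles(100u, 219u) == 119u, "plain case");
static_assert(elapsed_cycles(0xFFFFFFF0u, 0x00000010u) == 0x20u, "across a rollover");
// ASSUMPTION: 600 MHz core clock, giving a wrap roughly every 7 seconds.
static_assert(wrap_seconds(600000000u) == 7u, "wrap period at 600 MHz");
```

So short measurements like the ones in this thread are safe; anything longer than one wrap period needs a different approach.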

Can't reproduce 119 cycles.

I'm seeing 42 to 51 cycles with this:

Code:
void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  uint32_t start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
  uint32_t end = ARM_DWT_CYCCNT;
  Serial.println(end - start);
  delay(1000);
}

Looks like digitalReadFast() is much of the problem. It's only 2 cycles with this:

Code:
void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  uint32_t start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, HIGH);
  uint32_t end = ARM_DWT_CYCCNT;
  Serial.println(end - start);
  delay(1000);
  digitalWriteFast(LED_BUILTIN, LOW);
  delay(1000);
}
 
...

@defragster - is that the only change you made to your test sketch?

Mike - yes - just had GitHub confirm Sketch Blink_IntvTime.ino - only these lines were changed/added:
#define HW_SERIAL Serial1 // pin 17 debug Tx only
HW_SERIAL.begin( 115200 );

Also it runs faster on the loop() counts but the same on the IntervalTimer counts without DSB:
void TimeSome() {
jj++;
// asm("dsb");

Best wishes for Happy Healing and good 'edits' on the knee.

NOTE: Some time before anyone made note of 'double interrupts' I tried a WFI and it was not waking at all - that is why I went to the IntervalTimer. Early on, for reasons I'm not sure of, the WFI acted like it was seeing double interrupts, so I moved on; as I noted then, the NXP doc (12085 pg 10) for WFI suggests a proper WFI for low power has a few other steps. I posted a question at the time that didn't get answered. There is now a systick_isr that didn't seem to be active in the earliest betas?
 
@defragster
Evening Tim,
Sure everything will be fine tomorrow, have to be there at 9am so will probably be most of the day out of it, surgery itself should take less than an hour - so just minor bug needs fixing :)
 
Have you tried with and without the SION bit for reading the pin? Perhaps the SION uses more cycles as opposed to reading an input pin without it?
 
Has anybody gone over the NXP App Note on measuring interrupt latency, https://www.nxp.com/docs/en/application-note/AN12078.pdf? Or does this not apply?

Too bad we can't reproduce the example in that pdf. None of the (6) GPT output pins are available on Teensy 4 (1050).
Code:
         GPT1_COMPARE1 GPIO_EMC_35 ALT2
         GPT1_COMPARE2 GPIO_EMC_36 ALT2
         GPT1_COMPARE3 GPIO_EMC_37 ALT2
         GPT2_COMPARE1 GPIO_AD_B0_06 ALT1
         GPT2_COMPARE2 GPIO_AD_B0_07 ALT1
         GPT2_COMPARE3 GPIO_AD_B0_08 ALT1

I had a FlexPWM timer test sketch that I modified to enable the interrupt and toggle pin 13 in the ISR. Interestingly, the scope shows the PWM interrupt occurs in the middle of the high PWM pulse. So to get the interrupt to occur when the PWM pin goes high I had to change the pulse to 1 cycle (VAL3 and VAL4).
pwmisr.png
The blue pulse is PWM pin, the yellow pin 13. Timings are close to what the pdf reported.
 
Have you tried with and without the SION bit for reading the pin? Perhaps the SION uses more cycles as opposed to reading an input pin without it?

Tried just now. Exactly the same speed with SION = 0. But of course reading is also zero when the pin is configured for output mode.

Also tried adding this in setup(), to check the IPG clock.

Code:
  CCM_CCOSR = CCM_CCOSR_CLKO1_EN | CCM_CCOSR_CLKO1_DIV(1) | 
    CCM_CCOSR_CLKO1_SEL(12);
  IOMUXC_SW_MUX_CTL_PAD_GPIO_SD_B0_04 = 6;

Indeed a 75 MHz waveform appears (since div by 2) on the pad on the bottom of the board (the one approx underneath the bootloader chip).

Their docs in the GPIO chapter say 2 cycles of IPG, but it really looks like it's taking 4 of those cycles to read the pin. :(

Also tried overclocking IPG. It does indeed scale almost linearly with IPG clock speed.
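The clock arithmetic in this post can be sketched on the host as below. It only shows how the 75 MHz figure falls out of a 150 MHz IPG clock with CLKO1_DIV(1), and how many core cycles a given number of IPG cycles corresponds to. The 600 MHz core clock and the helper names are assumptions, and the result is plain arithmetic, not a measurement.

```cpp
#include <cstdint>

// CLKO1 output frequency. The divider field holds (divide ratio - 1),
// so a field value of 1 divides the selected source by 2.
constexpr uint32_t clko1_hz(uint32_t source_hz, uint32_t div_field) {
  return source_hz / (div_field + 1u);
}

// Core cycles that elapse while a peripheral access takes `ipg_cycles`
// IPG clocks (64-bit intermediate to avoid overflow, result truncated).
constexpr uint32_t ipg_to_core_cycles(uint32_t ipg_cycles,
                                      uint32_t core_hz, uint32_t ipg_hz) {
  return static_cast<uint32_t>(
      static_cast<uint64_t>(ipg_cycles) * core_hz / ipg_hz);
}

// 150 MHz IPG on CLKO1 with DIV(1) gives the 75 MHz seen on the pad.
static_assert(clko1_hz(150000000u, 1u) == 75000000u, "CLKO1 = IPG / 2");
// ASSUMPTION: core at 600 MHz, IPG at 150 MHz.
static_assert(ipg_to_core_cycles(2u, 600000000u, 150000000u) == 8u, "2 IPG cycles");
static_assert(ipg_to_core_cycles(4u, 600000000u, 150000000u) == 16u, "4 IPG cycles");
```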
 
Too bad we can't reproduce the example in that pdf. None of the (6) GPT output pins are available on Teensy 4.
Code:
         GPT1_COMPARE1 GPIO_EMC_35 ALT2
         GPT1_COMPARE2 GPIO_EMC_36 ALT2
         GPT1_COMPARE3 GPIO_EMC_37 ALT2
         GPT2_COMPARE1 GPIO_AD_B0_06 ALT1
         GPT2_COMPARE2 GPIO_AD_B0_07 ALT1
         GPT2_COMPARE3 GPIO_AD_B0_08 ALT1

Yeah, and there are no other pins with those available as alternates. But take a look at section 5.3, GPIO interrupt latency. For those measurements you don't need a "compare".

The other thing I am wondering is if using pin 13 (LED) is impacting the timing?
 
Simple sketch - started with Paul's #1205 last simple write - saw that using global or static versus local var versus high/low in func call changed the time - then noticed it was different whether the write was HIGH or LOW?

T_Loader verbose at this point save and cleared: View attachment DWtestLost.zip - had to ZIP ...

Now my T4 is again in the ODD state: even holding the button does nothing. After 20 secs it does a quick blink, but it won't do the ON and Reset at 15 seconds?

Uploaded a few variations as this grew and evolved, to the point where I was going to try a #define to replace WriteFast with Write, but the upload failed and here I am. Had a T_3.1 on, watching the debug Serial port; no code uploads went to it this time.

Auto upload fails and the button on the T4 is wholly ignored. Just realized the other T_3.1 was of course on Serial1, from the same hub. Unplugged it and then the T4 LED blinked some, but no upload.

Red LED on (with a pulse near the off side) a short second then off about 2 secs?

SerMon won't connect. Closed the IDE; now there are no T_ports under Ports, and the only two Serial ports in the IDE are the two disconnected ones. The LED is flashing like it is running the code now, but the board isn't visible. TyComm doesn't show the device online - it is lost somewhere ...

Held button - saw flash - seemed over 15 secs - it did the long red flash? Did not present the 'unknown USB' - back to the flash 1 on 2 off cycle?

Okay, pulled it off the PC and put it on a USB battery pack; at 15 seconds it blinked and then reset. Put it back on the PC on the same port and nothing. Moved it to another port and it came up and is now working again.

Here is the T_loader log from that with RESET : View attachment DWtestLost_RESET.txt

Wow - non FAST write is really SLOW - 187 to 197 cycles for 3 writes. Versus 5 to 24 cycles for WriteFast.

With var versus LED_BUILTIN it adds a couple cycles to both write and write fast, still takes longer for the ones that flip in some combination - but same with constant or non changing var.


Code:
#define DiWr digitalWriteFast
//#define DiWr digitalWrite

uint32_t PinVar = LED_BUILTIN;
#define PIN_NUM LED_BUILTIN
//#define PIN_NUM PinVar

void setup() {
  pinMode(PIN_NUM, OUTPUT);
}

bool Gflip = false;
void loop() {
  bool flip = Gflip;
  uint32_t start, end;
  delay(1000);
  start = ARM_DWT_CYCCNT;
  DiWr(PIN_NUM, flip);
  flip = !flip;
  DiWr(PIN_NUM, flip);
  flip = !flip;
  DiWr(PIN_NUM, flip);
  flip = !flip;
  end = ARM_DWT_CYCCNT;
  Serial.print("flipped write >");
  Serial.println(end - start);
  delay(1000);
  start = ARM_DWT_CYCCNT;
  DiWr(PIN_NUM, Gflip);
  DiWr(PIN_NUM, !Gflip);
  DiWr(PIN_NUM, Gflip);
  end = ARM_DWT_CYCCNT;
  Serial.print("Global ! flip write >");
  Serial.println(end - start);
  delay(1000);
  start = ARM_DWT_CYCCNT;
  DiWr(PIN_NUM, HIGH);
  DiWr(PIN_NUM, LOW);
  DiWr(PIN_NUM, HIGH);
  end = ARM_DWT_CYCCNT;
  Serial.print("H L H write >");
  Serial.println(end - start);
  delay(1000);
  start = ARM_DWT_CYCCNT;
  DiWr(PIN_NUM, LOW);
  DiWr(PIN_NUM, HIGH);
  DiWr(PIN_NUM, LOW);
  end = ARM_DWT_CYCCNT;
  Serial.print("L H L write >");
  Serial.println(end - start);
  delay(1000);
  start = ARM_DWT_CYCCNT;
  DiWr(PIN_NUM, flip);
  DiWr(PIN_NUM, !flip);
  DiWr(PIN_NUM, flip);
  end = ARM_DWT_CYCCNT;
  Serial.print("! flip write >");
  Serial.println(end - start);
  Serial.println();
  delay(1000);
  Gflip = flip;
}

Yeah, and there are no other pins with those available as alternates. But take a look at section 5.3, GPIO interrupt latency. For those measurements you don't need a "compare".

The other thing I am wondering is if using pin 13 (LED) is impacting the timing?

Just edited to do write to pin 12 instead of LED_BUILTIN and the numbers are the same.

Code:
uint32_t PinVar = LED_BUILTIN;
#define NO_LED 12
#define PIN_NUM NO_LED
//#define PIN_NUM LED_BUILTIN
//#define PIN_NUM PinVar
 
Can't reproduce 119 cycles.

This gets even weirder. First call takes about 120 cycles, repeated calls only 50 cycles. Can it be that there is some caching/buffering active there? -> Writing would be quick (at least from the cycle counter's point of view), but it obviously can't cache the read, so digitalReadFast takes more cycles than digitalWriteFast?

Code below produces:
Code:
First: 135
Second: 52
loop: 44
loop: 50
loop: 40
loop: 50
loop: 40

Code:
#include "Arduino.h"

unsigned start, end;

void setup()
{
  delay(500);
  pinMode(LED_BUILTIN, OUTPUT);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
  end = ARM_DWT_CYCCNT;
  Serial.printf("First: %d\n", end - start);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
  end = ARM_DWT_CYCCNT;
  Serial.printf("Second: %d\n", end - start);
}

void loop()
{ 
  uint32_t start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));
  uint32_t end = ARM_DWT_CYCCNT;
  Serial.printf("loop: %d\n",end - start);
  delay(1000);
}
 
Tim
Just tried it myself and was going to update the post but yeah, got the same values. Interesting that just flipping using digitalWriteFast was only 5 cycles. Think luni just posted it was 50? On cell now and heading for my bed.
 
Can it be that there is some caching/buffering active there?

If the writing to peripherals is really buffered, the buffer needs to be full eventually and the required cycles for a write will get larger. Tested this by doing a growing number of consecutive writes and timed it. Code below, output here:

Code:
First: cycles 20
1 write: cycles 4
2 writes: total 5, mean:2
6 writes: total 101, mean:16
20 writes: total 549, mean:27

It would be interesting now to know how long it really takes to write. Unfortunately my scope and LA are too slow to reasonably measure this. Maybe somebody with better equipment can have a quick look?


Code:
#include "Arduino.h"

unsigned start, end;

void setup()
{
  delay(500);
  pinMode(LED_BUILTIN, OUTPUT);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, HIGH);
  end = ARM_DWT_CYCCNT;
  Serial.printf("First: cycles %d\n", end - start);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, LOW);
  end = ARM_DWT_CYCCNT;
  Serial.printf("1 write: cycles %d\n", end - start);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  end = ARM_DWT_CYCCNT;
  Serial.printf("2 writes: total %d, mean:%d\n", end - start, (end - start) / 2);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  end = ARM_DWT_CYCCNT;
  Serial.printf("6 writes: total %d, mean:%d\n", end - start, (end - start) / 6);

  start = ARM_DWT_CYCCNT;
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);  
  digitalWriteFast(LED_BUILTIN, LOW);
  digitalWriteFast(LED_BUILTIN, HIGH);
  digitalWriteFast(LED_BUILTIN, LOW);  
  end = ARM_DWT_CYCCNT;
  Serial.printf("20 writes: total %d, mean:%d\n", end - start, (end - start) / 20);
}

void loop()
{
}
 
NOTE: My T4 won't start if the two T_3.1's are plugged into their hub - when T4 goes in on a separate port now. T4 alone okay - or T_3's after - three times repeated ...


Is all the code running from RAM at this point? I was finding on my micros() test for 1us increments that the first loop would typically miss one somewhere in the first thousands - so I added a permissive ignore of the error in loop #1. Then it would run hours after that with no trouble.

Seems like something is getting cached or getting the pipeline filled with the right guesses?

My last sketch seems to cycle the same - except for the difference in the toggle order with the FLIP writes:
flipped write >16
Global ! flip write >17
H L H 'const' write >5
H H H var write >5
L L L var write >5
L H L 'const' write >5
! flip write >24

flipped write >7
Global ! flip write >8
H L H 'const' write >5
H H H var write >5
L L L var write >5
L H L 'const' write >5
! flip write >19

flipped write >16
 
Would maybe help to have a look at the assembler output. Maybe it's just a missed optimization?
Then, counting cycles is a bit misleading - the cycles are much shorter on the T4 :)
The ARM core is running much faster now - but not all peripherals
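To put numbers on "the cycles are much shorter": a rough host-side conversion from cycle counts to time. The clock values plugged in (600 MHz for the T4, 180 MHz for a T3.6) are assumptions for the comparison, and the helper name is made up for this sketch.

```cpp
#include <cstdint>

// Duration of `cycles` core cycles in picoseconds at `core_mhz` MHz.
// Integer math, truncated towards zero.
constexpr uint64_t cycles_to_ps(uint64_t cycles, uint64_t core_mhz) {
  return cycles * 1000000ull / core_mhz;
}

// ASSUMED clocks: 50 cycles is about 83 ns at 600 MHz,
// but about 278 ns at 180 MHz.
static_assert(cycles_to_ps(50u, 600u) == 83333u, "50 cycles at 600 MHz");
static_assert(cycles_to_ps(50u, 180u) == 277777u, "50 cycles at 180 MHz");
```

So the same cycle count is more than three times shorter in wall-clock time on the faster core, which is worth keeping in mind when comparing counts across boards.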
 
The ARM core is running much faster now - but not all peripherals
Hm, seems to be a Nerd-Processor, highly intelligent inside but reluctant to communicate with the outside world :)
 
It would be interesting now to know how long it really takes to write. Unfortunately my scope and LA are too slow to reasonably measure this. Maybe somebody with better equipment can have a quick look?

Borrowed one from work. The following pictures show a simple sequence of digitalWriteFast(HIGH) and digitalWriteFast(LOW). The different measured cycle times obviously do not influence the output frequency. Also there is no long first pulse and no pulse time jitter. Pulse width is about 50 ns for the T4.0 (left picture) and about 6 ns for the T3.6 @ 180 MHz.

Newfile1.jpeg Newfile2.jpeg
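For comparison with the cycle counts earlier in the thread, the scope pulse widths can be turned back into core cycles. This is only a back-of-the-envelope sketch: the 600 MHz T4 clock is an assumption and the helper name is made up.

```cpp
#include <cstdint>

// Number of core cycles (rounded to nearest) that fit into `ns`
// nanoseconds at a core clock of `core_mhz` MHz.
constexpr uint32_t ns_to_cycles(uint32_t ns, uint32_t core_mhz) {
  return (ns * core_mhz + 500u) / 1000u;
}

// ASSUMPTION: T4 core at 600 MHz. A 50 ns pulse is then about 30 cycles.
static_assert(ns_to_cycles(50u, 600u) == 30u, "T4 pulse in core cycles");
// T3.6 at 180 MHz: a 6 ns pulse is about 1 cycle.
static_assert(ns_to_cycles(6u, 180u) == 1u, "T3.6 pulse in core cycles");
```

Under that assumption, roughly 30 cycles per pulse on the T4 is in the same range as the mean of 27 cycles per write reported for the 20-write run, and far from the 2 to 5 cycles seen for one or two isolated writes.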
 
Just made this pull request #345 for review

Based on Code from :: community.nxp.com/thread/389002 - does this mean it is free for use as modified?

It changes the Fault handler response and for now at least prints out the labels on the fault dump info.
Sets clock to 300 MHz - drops voltage and TEMP showing 40-42°C - and won't be left at/over 600 MHz - but leaves IPG showing at 150 MHz. So F_BUS is unchanged.

Not sure if anyone is faulting the T4 and whether this would be helpful just now? Output currently goes to the Serial4 Tx pin 17. I've captured the WEAK func() below in a user sketch and tested printing to Serial or Serial1 in some fashion, so this seems stable and usable.

HELP: looking for easy ways to fault the T4 - post or PM would be welcome - just to make sure they all perform the same.

This would be exposed for user handling of faults:
Code:
__attribute__((weak))
void HardFault_HandlerC(unsigned int *hardfault_args) {

A user sketch is here - there is also a startup.c with the needed cores changes for the teensy4 dir. It is the 'GPIOwriteSpeed' sketch I had open and working; it forces a fault at the end of loop() unless the #define at the top is commented out.

Here is the named output.
Fault irq 3
stacked_r0 :: 20000988
stacked_r1 :: 0000016E
stacked_r2 :: 00000000
stacked_r3 :: FFFFFE92
stacked_r12 :: 2E4F70BA
stacked_lr :: 00001BD7
stacked_pc :: 00000A90
stacked_psr :: 01000000
_CFSR :: 00000400
_HFSR :: 40000000
_DFSR :: 00000000
_AFSR :: 00000000
_BFAR :: 00000000
_MMAR :: 00000000
need to switch to alternate clock during reconfigure of ARM PLL
USB PLL is running, so we can use 120 MHz
Freq: 12 MHz * 75 / 3 / 1
ARM PLL=80002064
ARM PLL needs reconfigure
ARM PLL=8000204B
New Frequency: ARM=300000000, IPG=150000000
Decreasing voltage to 1150 mV

IRQ 3 faults:
// GPT1_CNT = 5;
// int* bar = 0; *bar=7;
// Serial.print(*((uint16_t*)0));
 
HELP: looking for easy ways to fault the T4 - post or PM would be welcome - just to make sure they all perform the same.
Code:
void setup() {
  delay(1000);
  Serial.begin(9600);   
  Serial.print(*((uint16_t*)0));
}
void loop() {}

Gives:
in Sketch ... Fault irq 3
stacked_r0 :: 20000c10
stacked_r1 :: 2bb
stacked_r2 :: 3e8
stacked_r3 :: 0
stacked_r12 :: 72459fc7
stacked_lr :: 9600000
stacked_pc :: 260
stacked_psr :: 61000000
_CFSR :: 10000
_HFSR :: 40000000
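The _CFSR and _HFSR values in these dumps can be decoded by hand against the ARMv7-M fault status register layout. Below is a small host-side sketch that does so for the two dumps posted above; only a few cause bits are listed, and the helper name is hypothetical (it is not part of the pull request).

```cpp
#include <cstdint>

// ARMv7-M CFSR layout: MemManage status in bits 0-7,
// BusFault status in bits 8-15, UsageFault status in bits 16-31.
constexpr uint32_t CFSR_MEMMANAGE_MASK = 0x000000FFu;
constexpr uint32_t CFSR_BUSFAULT_MASK  = 0x0000FF00u;
constexpr uint32_t CFSR_USAGE_MASK     = 0xFFFF0000u;

// A few individual cause bits (not the complete list).
constexpr uint32_t CFSR_PRECISERR   = 1u << 9;   // precise data bus error
constexpr uint32_t CFSR_IMPRECISERR = 1u << 10;  // imprecise data bus error
constexpr uint32_t CFSR_UNDEFINSTR  = 1u << 16;  // undefined instruction
constexpr uint32_t CFSR_DIVBYZERO   = 1u << 25;  // divide by zero

// HFSR bit 30: a configurable fault was escalated to HardFault.
constexpr uint32_t HFSR_FORCED = 1u << 30;

// Hypothetical helper: true if any bit of `mask` is set in `reg`.
constexpr bool any_set(uint32_t reg, uint32_t mask) {
  return (reg & mask) != 0u;
}

// First dump: _CFSR 00000400, _HFSR 40000000
static_assert(any_set(0x00000400u, CFSR_BUSFAULT_MASK), "bus fault group");
static_assert(any_set(0x00000400u, CFSR_IMPRECISERR), "imprecise data bus error");
// Second dump: _CFSR 10000, _HFSR 40000000
static_assert(any_set(0x00010000u, CFSR_USAGE_MASK), "usage fault group");
static_assert(any_set(0x00010000u, CFSR_UNDEFINSTR), "undefined instruction");
// Both dumps report a forced (escalated) HardFault.
static_assert(any_set(0x40000000u, HFSR_FORCED), "escalated to HardFault");
```

Read this way, the first dump reports an imprecise bus error and the second an undefined-instruction usage fault, both escalated to HardFault.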
 
Some more playing with Flexio - Start of SPI support.

I have started hacking up a version of SPI support using flex pins. Currently it requires all 4 pins to be defined and CS also happens... I don't have anything in place yet to set the transfer speed, although I have some pieces in place to be able to do so. Currently doing transfers at about 7.5 MHz.

Example code in WIP github project: https://github.com/KurtE/FlexIO_t4

Code:
#include <FlexIO_t4.h>
#include <FlexSPI.h>

FlexSPI SPIFLEX(2, 3, 4, 5); // Setup on (int mosiPin, int sckPin, int misoPin, int csPin=-1) :


void setup() {
  pinMode(13, OUTPUT);
  while (!Serial && millis() < 4000);
  Serial.begin(115200);
  delay(500);
  SPIFLEX.begin();

  Serial.println("End Setup");
}
uint8_t buf[] = "abcdefghijklmnopqrstuvwxyz";
uint16_t ret_buf[256];
uint8_t ch_out = 0;

void loop() {
	for (uint8_t ch_out = 0; ch_out < 64; ch_out++) {
		ret_buf[ch_out] = SPIFLEX.transfer(ch_out);
	}
//	Serial.println();
	delay(25);

	uint8_t index = 0;
	for (uint16_t ch_out = 0; ch_out < 500; ch_out+=25) {
	  // Store by index: ch_out runs up to 475, past the end of ret_buf[256]
	  ret_buf[index++] = SPIFLEX.transfer16(ch_out);	  
	}
	delay(25);
	SPIFLEX.transfer(buf, NULL, sizeof(buf));
	delay(500);
}
I found calling Serial.printf was pretty slow to call in the loop, so commented out...

Again showing some progress:
screenshot.jpg
Probably lots of optimizations can go into some of this, like how the timing for CS pins is hard coded. Might try to see how hard to remove.
Also currently using default timer setup for FlexIO:
Uses the 480 MHz clock. It has two dividers that can be set, with defaults of 2 and 8, so the FlexIO timer is at 30 MHz. Clock setup is (value+1)*2, so I believe I can set value to 0 and get a max of 15 MHz.

But the 2 and 8 have the options of being FLEXIO1_CLK_PRED(1, *2, 3, 8) (1 should not be used at high input frequency), and FLEXIO1_CLK_PODF (1, 8).
So might try something like (3, 1) and see if that works...
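The divider arithmetic described above can be checked on the host with a short sketch. The function names are made up for this example, and the formulas are only the ones stated in this post (480 MHz source, two dividers, then (value+1)*2); whether the hardware actually runs at the higher settings is untested here.

```cpp
#include <cstdint>

// FlexIO clock after the two dividers described in the post.
constexpr uint32_t flexio_clock_hz(uint32_t source_hz,
                                   uint32_t pred, uint32_t podf) {
  return source_hz / pred / podf;
}

// SPI clock for a given timer value, using the (value + 1) * 2 relation.
constexpr uint32_t spi_clock_hz(uint32_t flexio_hz, uint32_t value) {
  return flexio_hz / ((value + 1u) * 2u);
}

// Defaults of 2 and 8 give a 30 MHz FlexIO clock.
static_assert(flexio_clock_hz(480000000u, 2u, 8u) == 30000000u, "default dividers");
// value = 0 gives the 15 MHz maximum mentioned above.
static_assert(spi_clock_hz(30000000u, 0u) == 15000000u, "max at defaults");
// value = 1 would give 7.5 MHz, matching the transfer rate reported above.
static_assert(spi_clock_hz(30000000u, 1u) == 7500000u, "value 1");
// The (3, 1) idea would give a 160 MHz FlexIO clock on paper.
static_assert(flexio_clock_hz(480000000u, 3u, 1u) == 160000000u, "dividers 3 and 1");
```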
 