K66 Beta Test

Status
Not open for further replies.
Works 'well enough' ... For TYQT to get the serial # it has to pass through bootloader mode - which is what is needed to program. TYQT does talk serial - and the serial # should be there, but without a change in startup I was wondering how that could be.
t36ser103.PNG
This at 240 MHz on a T_3.6 with fresh IDE 1.6.12 and TD_1.31b1 with no hacks. I saw it working well enough after the bootloader update.

Given my time with EEPROM and HSRUN OFF/ON I just did this and it works - HighSpeed T_3.6 serial number known at USB connect- interrupts are already disabled so it should be safe:
<EDIT> : Pull request below

@KurtE - please test this - it has worked for HOURS doing EEPROM I/O much more extensive - I think this is a good answer.
Note: My latest testing did plenty of USB SPEW before and after repeated continuous EEPROM writes over days and USB was never broken.

Paul - hopefully you get to go non-stop - running through spare airports is no fun.

Indeed @MM - there are lots of ways to get proto boards - but that one was new to me and seemed a perfect fit to usable mount devices and headers. With the pre-connected bottom sets most through hole connects on the bottom would be clean with less manual soldering, though both sides plated is more flexible. I made an AdaFruit Perma Proto 1/2 sized as the adapter for the top socketed PROTO and Beta units and it was very handy with pins soldered down and sockets soldered up.
 
WOW - did I just make Github work for me on the web with a patch thingie? <edit>: not at first no :(

https://github.com/Defragster/cores/pull/1

Updated #ifdef code
Code:
#elif defined(HAS_KINETIS_FLASH_FTFE) 
-	// TODO: does not work in HSRUN mode 
+#if F_CPU > 120000000	// Disable HSRUN mode across ser# read 
+	SMC_PMCTRL = SMC_PMCTRL_RUNM(0); // exit HSRUN mode 
+	while (SMC_PMSTAT == SMC_PMSTAT_HSRUN) ; // wait for !(HSRUN) 
+#endif 
 	FTFL_FSTAT = FTFL_FSTAT_RDCOLERR | FTFL_FSTAT_ACCERR | FTFL_FSTAT_FPVIOL; 
 	*(uint32_t *)&FTFL_FCCOB3 = 0x41070000; 
 	FTFL_FSTAT = FTFL_FSTAT_CCIF; 
 	while (!(FTFL_FSTAT & FTFL_FSTAT_CCIF)) ; // wait 
 	num = *(uint32_t *)&FTFL_FCCOBB; 
+#if F_CPU > 120000000 
+	SMC_PMCTRL = SMC_PMCTRL_RUNM(3); // enter HSRUN mode 
+	while (SMC_PMSTAT != SMC_PMSTAT_HSRUN) ; // wait for HSRUN 
+#endif 
 #endif 
 	__enable_irq();

NOTE: Tested and now TYQT gets serial number on USB connect when opened or closed or when T_3.6 moved to a new cable end.
 
Pull Request for #ifdef usage - same link https://github.com/Defragster/cores/pull/1

My prior discovery was that once the Serial # was read across the HSRUN barrier it continued to be returned - even if in HSRUN ( as far as my testing saw ) so here is a USER sketch doing the same read code and it works, because the pull request code has already brought the value in:

CPU is T_3.6 F_CPU =240000000
Serial # =2056390
F_PLL =240000000 F_BUS =120000000 F_MEM =30000000

CPU is T_3.6 F_CPU =216000000
Serial # =2056390
F_PLL =216000000 F_BUS =54000000 F_MEM =27000000

CPU is T_3.6 F_CPU =192000000
Serial # =2056390
F_PLL =192000000 F_BUS =48000000 F_MEM =27428571

CPU is T_3.6 F_CPU =180000000
Serial # =2056390
F_PLL =180000000 F_BUS =90000000 F_MEM =25714286

CPU is T_3.6 F_CPU =168000000
Serial # =2056390
F_PLL =168000000 F_BUS =56000000 F_MEM =28000000

CPU is T_3.6 F_CPU =144000000
Serial # =2056390
F_PLL =144000000 F_BUS =48000000 F_MEM =28800000

CPU is T_3.6 F_CPU =120000000
Serial # =2056390
F_PLL =120000000 F_BUS =60000000 F_MEM =24000000
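The F_BUS and F_MEM values in the log above all look like F_PLL divided by a small integer. Here is a quick host-side check of that reading, assuming the results are rounded to the nearest Hz. `impliedDivider` is a made-up helper name for this illustration, not something from the Teensy core, and the 1..16 search range is also an assumption.

```cpp
#include <cstdint>

// Smallest integer divider d in 1..16 for which fPll / d, rounded to the
// nearest Hz, equals fOut. Returns 0 when no divider in that range fits.
inline unsigned impliedDivider(uint32_t fPll, uint32_t fOut) {
    for (uint32_t d = 1; d <= 16; ++d) {
        if ((fPll + d / 2) / d == fOut) {
            return d;
        }
    }
    return 0;
}
```

With the numbers printed above, `impliedDivider(240000000, 30000000)` gives 8 and `impliedDivider(192000000, 27428571)` gives 7, so F_MEM stays under 30 MHz or so at every speed in the log.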

Code:
void setup() {
  // put your setup code here, to run once:
  while (!Serial && (millis ()  <= 8000)) {}
  CPUspecs();
}

void loop() {
  // put your main code here, to run repeatedly:
}

uint32_t getSerialNum()
{
  uint32_t num;
  __disable_irq();
#if defined(HAS_KINETIS_FLASH_FTFA) || defined(HAS_KINETIS_FLASH_FTFL)
  FTFL_FSTAT = FTFL_FSTAT_RDCOLERR | FTFL_FSTAT_ACCERR | FTFL_FSTAT_FPVIOL;
  FTFL_FCCOB0 = 0x41;
  FTFL_FCCOB1 = 15;
  FTFL_FSTAT = FTFL_FSTAT_CCIF;
  while (!(FTFL_FSTAT & FTFL_FSTAT_CCIF)) ; // wait
  num = *(uint32_t *)&FTFL_FCCOB7;
#elif defined(HAS_KINETIS_FLASH_FTFE)
  FTFL_FSTAT = FTFL_FSTAT_RDCOLERR | FTFL_FSTAT_ACCERR | FTFL_FSTAT_FPVIOL;
  *(uint32_t *)&FTFL_FCCOB3 = 0x41070000;
  FTFL_FSTAT = FTFL_FSTAT_CCIF;
  while (!(FTFL_FSTAT & FTFL_FSTAT_CCIF)) ; // wait
  num = *(uint32_t *)&FTFL_FCCOBB;
#endif
  __enable_irq();
  // add extra zero to work around OS-X CDC-ACM driver bug
  if (num < 10000000) num = num * 10;
  return num;
}

void CPUspecs() {
#if defined(__MKL26Z64__)
  Serial.print( "CPU is T_LC");
#elif defined(__MK20DX256__)
  Serial.print( "CPU is T_3.1/3.2");
#elif defined(__MK20DX128__)
  Serial.print( "CPU is T_3.0");
#elif defined(__MK64FX512__)
  Serial.print( "CPU is T_3.5");
#elif defined(__MK66FX1M0__)
  Serial.print( "CPU is T_3.6");
#endif
  Serial.print( "  F_CPU =");   Serial.println( F_CPU );
  Serial.print( "Serial # =");   Serial.println( getSerialNum() );
  Serial.print( "F_PLL =");   Serial.print( F_PLL );
  Serial.print( "  F_BUS =");   Serial.print( F_BUS );
  Serial.print( "  F_MEM =");   Serial.println( F_MEM );
}
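A host-side sketch of the last step of getSerialNum() above, for anyone wondering why the printed number ends in a zero. It assumes only what the sketch shows: values below 10000000 are multiplied by 10. `padSerial` is a made-up name for this illustration.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the tail of getSerialNum() in the sketch above:
// serial numbers below 10,000,000 get one extra trailing zero.
constexpr uint32_t padSerial(uint32_t num) {
    return (num < 10000000u) ? num * 10u : num;
}

// The serial printed in the runs above is 2056390. That is itself below
// 10,000,000, so it can only be the padded form of a raw value of 205639.
static_assert(padSerial(205639u) == 2056390u, "raw value gains a trailing zero");
static_assert(padSerial(12345678u) == 12345678u, "8-digit values pass through");
```

So the value the flash command returned on this board would have been 205639, and the zero is the OS-X workaround, not part of the stored number.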
 
Github Fail - It seems my PULL request was against my own clone, not a fork?

It's near the top of my todo list. So is basic SDIO support in the SD library.

Whether I get it in tomorrow or early next week is a good question. Friday to Monday will be consumed with travel for NY Maker Faire.

Paul: the fix in post #1452 works. It uses the same HSRUN drop that I've seen run without issue for EEPROM writes on a T_3.6 under HSRUN.

Other than doing this drop of HSRUN, it seems the only other way is to get it before going to HSRUN. In either case once it has been read subsequent reads using the unchanged method pull back that cached value as shown in post #1453.
 
Although to me, traveling all the way across the country does not sound overly relaxing!

Yeah, probably won't be relaxing.

Really, I'm most looking forward to everything returning to normal, and getting some time to really work on stuff I've *really* wanted to do for a long time, like more features in the Audio lib.
 
EEPROM write during HSRUN works with four function edits in ..\teensy3\eeprom.c calling this #ifdef code as needed when in HSRUN. I have a test sketch doing all the 'CPP' operators { which devolve into this "C" code } and .put/.get write and test of 1 to 16 bytes in 8 groups (even and odd) on even and odd boundaries as it walks through EEPROM. Just started a test to run 4 passes with 50 ms wait between group hits on each byte ... will know soon . . . report later . . .

I put interrupt status check in to be sure it can be called from anywhere and not munge them as it drops HSRUN when needed and conditionally restore on completion. I'm not sure this code changed much from what I last posted on the other thread - but now I think I am testing all the entry points for change? Code compiles out when not set for T_3.6 HSRUN. No change in caller code or behavior - it just works - will check on time overhead { 120 .vs.144 } will be close enough where 144 will be slowed and 120 is slower but no delays.

Code:
#if F_CPU > 120000000 && defined(__MK66FX1M0__)
// - THIS WILL DISABLE this code :: #if F_CPU > 260000000 && defined(__MK66FX1M0__)
static volatile uint16_t c_intrestore = 0;
void c_enable_irq( void );
void c_disable_irq( void );
static __inline__ uint32_t __get_primask(void)
{
	uint32_t primask = 0;
	__asm__ volatile ("MRS %[result], PRIMASK\n\t":[result]"=r"(primask)::);
	return primask; // returns 0 if interrupts enabled, 1 if disabled
}
void c_enable_irq( void ){
	if ( c_intrestore ) {
		c_intrestore =0;
		__enable_irq( );
	}
}
void c_disable_irq( void ){
	if ( !__get_primask() ) { // Returns 0 if they are enabled, or non-zero if disabled 
		c_intrestore = 1;
		__disable_irq( );
	}
}
static volatile uint8_t restore_hsrun = 0;
static void hsrun_off(void)
{
	if (SMC_PMSTAT == SMC_PMSTAT_HSRUN) {
		c_disable_irq( ); // Turn off interrupts for the DURATION !!!!
		SMC_PMCTRL = SMC_PMCTRL_RUNM(0); // exit HSRUN mode
		while (SMC_PMSTAT == SMC_PMSTAT_HSRUN) ; // wait for !HSRUN
		restore_hsrun = 1;
	}
}

static void hsrun_on(void)
{
	if (restore_hsrun) {
		SMC_PMCTRL = SMC_PMCTRL_RUNM(3); // enter HSRUN mode
		while (SMC_PMSTAT != SMC_PMSTAT_HSRUN) ; // wait for HSRUN
		restore_hsrun = 0;
		c_enable_irq( ); // Restore interrupts only when HSRUN restored
	}
}
#else
#define hsrun_off()
#define hsrun_on()
#endif
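To show what the code above is meant to guarantee, here is a host-side model of the same bookkeeping. It is only a sketch: `FakeCpu`, `Restore`, `hsrunOff`, `hsrunOn` and `withHsrunDropped` are names made up for this illustration, and the two bools stand in for PRIMASK and the SMC power mode status rather than touching any real register.

```cpp
// Host-side stand-in for the two bits of hardware state the code above touches.
// Assumption: irqMasked plays PRIMASK, hsrun plays SMC_PMSTAT == SMC_PMSTAT_HSRUN.
struct FakeCpu {
    bool irqMasked;
    bool hsrun;
};

// Bookkeeping equivalent to c_intrestore and restore_hsrun.
struct Restore {
    bool irq = false;
    bool hsrun = false;
};

inline void hsrunOff(FakeCpu &cpu, Restore &r) {
    if (!cpu.hsrun) return;      // not in HSRUN: leave everything alone
    if (!cpu.irqMasked) {        // mask only if the caller had interrupts on
        cpu.irqMasked = true;
        r.irq = true;
    }
    cpu.hsrun = false;           // drop to normal RUN
    r.hsrun = true;
}

inline void hsrunOn(FakeCpu &cpu, Restore &r) {
    if (!r.hsrun) return;        // nothing was dropped, nothing to restore
    cpu.hsrun = true;            // back to HSRUN first
    r.hsrun = false;
    if (r.irq) {                 // then unmask, only if we did the masking
        r.irq = false;
        cpu.irqMasked = false;
    }
}

// Hypothetical wrapper: run flashOp with HSRUN dropped, then restore.
template <typename F>
void withHsrunDropped(FakeCpu &cpu, F flashOp) {
    Restore r;
    hsrunOff(cpu, r);
    flashOp();
    hsrunOn(cpu, r);
}
```

The point of the model is that a caller who already had interrupts disabled gets them back still disabled, and a caller not in HSRUN sees no change at all.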

Cool - done 4 passes - stops at 4081 to fit the 16 char string without detecting read error. [funny the NULL chars are printed so cut and paste of Put/Get line was piecemeal to drop them]:
Code:
 __ EEPROM[ 4081 ] =98
 = 100	__ EEPROM[ 4081 ] =100
 ++	__ EEPROM[ 4081 ] =101
 --	__ EEPROM[ 4081 ] =100
 += 5	__ EEPROM[ 4081 ] =105
 -= 5	__ EEPROM[ 4081 ] =100
 *= 2	__ EEPROM[ 4081 ] =200
 /= 2	__ EEPROM[ 4081 ] =100
 ^= 48	__ EEPROM[ 4081 ] =84
 %= 8	__ EEPROM[ 4081 ] =4
 |= 7	__ EEPROM[ 4081 ] =7
 &=4	__ EEPROM[ 4081 ] =4
 <<= 2	__ EEPROM[ 4081 ] =16
 >>= 2	__ EEPROM[ 4081 ] =4
 =-=-=-=-=-=- 
 <M> <yz> <654321> <!@#$%^&*> <Z> <ABC> <1234567> <abcdefghijklmno> :: No Put/Get Errors
 __
 DONE - NO ERRORS !
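The values in that log can be cross-checked off the board. This host-side sketch replays the same operator sequence on a plain byte, assuming one EEPROM cell behaves like an 8-bit unsigned value. `replayOperators` is a made-up name and is not part of the attached test sketch.

```cpp
#include <cstdint>
#include <vector>

// Replays the compound-assignment sequence from the log above on a uint8_t
// and records the value after every step, starting from the initial 98.
inline std::vector<int> replayOperators() {
    std::vector<int> seen;
    uint8_t cell = 98;  seen.push_back(cell);
    cell = 100;         seen.push_back(cell);
    ++cell;             seen.push_back(cell);
    --cell;             seen.push_back(cell);
    cell += 5;          seen.push_back(cell);
    cell -= 5;          seen.push_back(cell);
    cell *= 2;          seen.push_back(cell);
    cell /= 2;          seen.push_back(cell);
    cell ^= 48;         seen.push_back(cell);
    cell %= 8;          seen.push_back(cell);
    cell |= 7;          seen.push_back(cell);
    cell &= 4;          seen.push_back(cell);
    cell <<= 2;         seen.push_back(cell);
    cell >>= 2;         seen.push_back(cell);
    return seen;
}
```

The returned sequence is 98, 100, 101, 100, 105, 100, 200, 100, 84, 4, 7, 4, 16, 4, which matches the EEPROM[ 4081 ] column line for line.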
 
Perhaps "4.0" will be meaningful when we get a Cortex-M7 in 65 or 40 nm silicon.

The processor exists in a 10mmx10mm form factor. Does that mean a Teensy 4.0 is on the horizon? ;)

I copied the following from the datasheet. The only downside is there is no 5V version.

Atmel® | SMART SAM S70 is a high-performance Flash microcontroller (MCU) based on the 32-bit ARM® Cortex®-M7 RISC (5.04 CoreMark/MHz) processor with floating point unit (FPU). The device operates at a maximum speed of 300 MHz, features up to 2048 Kbytes of Flash, dual 16 Kbytes of cache memory, up to 384 Kbytes of SRAM and is available in 64-, 100- and 144-pin packages. The Atmel | SMART SAM S70 offers an extensive peripheral set, including High-speed USB Host and Device plus PHY, up to 8 UARTs, I2S, SD/MMC interface, a CMOS camera interface, system control and a 12-bit 2 Msps ADC, as well as high-performance crypto-processors AES, SHA and TRNG.
 
Defragster: Sounds great that you are getting the EEPROM writes to work without too much disruption to the code.

As for pull requests - I am guessing from your posts that you are using GitHub for Windows and probably created the PR from there. I have made that mistake before, where it was comparing not to PaulStoffregen/master but to <yourgithub>/master, in my case KurtE/master. If you create the pull request and it gives you some low number like #1 or #2, that is a sure clue, as one of my last PRs to the main repository is #173.

What you can do then is click on the #1 type number, which opens your browser to your pull request on GitHub, and click to close the request. But don't delete your branch. Then go back to GitHub for Windows and do a Sync. At times I needed to choose a different branch or a different project and go back to have it not show the OLD PR. Then click again to create the PR and make sure to choose the Paul.../master.

Kurt
 
HSRUN DROP and RESTORE EEPROM WRITE

Pull request for HSRUN EEPROM writes on T_3.6 :: https://github.com/PaulStoffregen/cores/pull/177

My test sample - with the above eeprom.c edit this will work at any HSRUN T_3.6 speed - code is #ifdef'd to compile out on any other Teensy/speed :: View attachment EE_HSRUN_minCoreCPP.ino

With update of "...\cores\teensy3\eeprom.c" - should be able to run any EEPROM write code you want anywhere.

I'm not sure this tests the eeprom_write_word and eeprom_write_dword cases?

Here are speed comparisons two runs each and 240, 180, 144, 120 MHz as uploaded. These tests are just the larger test (activation notes below ) with no printing.
Code:
--- Hello World ---
 14 CPP manipulation loops time  ==102011 at F_CPU == 240000000
 Put / Get manipulation loops time  ==102145 at F_CPU == 240000000

 14 CPP manipulation loops time  ==104714 at F_CPU == 240000000
 Put / Get manipulation loops time  ==105268 at F_CPU == 240000000

--- Hello World ---
 14 CPP manipulation loops time  ==103299 at F_CPU == 180000000
 Put / Get manipulation loops time  ==105836 at F_CPU == 180000000

 14 CPP manipulation loops time  ==102152 at F_CPU == 180000000
 Put / Get manipulation loops time  ==105803 at F_CPU == 180000000

--- Hello World ---
 14 CPP manipulation loops time  ==104355 at F_CPU == 144000000
 Put / Get manipulation loops time  ==102455 at F_CPU == 144000000

 14 CPP manipulation loops time  ==104602 at F_CPU == 144000000
 Put / Get manipulation loops time  ==103432 at F_CPU == 144000000

--- Hello World ---
 14 CPP manipulation loops time  ==100955 at F_CPU == 120000000
 Put / Get manipulation loops time  ==197743 at F_CPU == 120000000

 14 CPP manipulation loops time  ==106738 at F_CPU == 120000000
 Put / Get manipulation loops time  ==196173 at F_CPU == 120000000

As posted code above runs silent speed test only. For real test and EEPROM abuse:

Set this to ZERO :: uint32_t iterations = MAX_ITERATIONS; // 0;

Set this to ZERO :: #define EE_BASE 3400 // 0
Set this to ( E2END -14 ) :: #define EE_MAX 200 //( E2END -14 ) // NOTE: Writes beyond E2END fail silently - but the wipe of the read array will catch it if 15 bytes won't fit.

NOTE: posted results show the writes are relatively constant in speed - so the added overhead for doing it under HSRUN is lost in the noise versus not being able to write to the EEPROM.
also:: EE_BASE is the start address and EE_MAX is the end for testing should anyone look - the code as uploaded was just edited to make the timing test work without the spew of the other test.
 
2xTeensy-3.5 arrived today, booted & blinked, and I was able to compile & upload sketches after installing the new Teensy Loader 1.31-boot-update and replacing boards.txt in the Arduino-1.6.9 installation directory.

The linker config file (mk64fx512.ld) says there's only 192K of RAM, but spec sheet claims 256K?

Currently looking for the initialization code (e.g. crt0.c) so I can build bare-metal blinky.

Anybody know if the 3.5/3.6 schematics are available somewhere?
 
Boards.txt comes with the TeensyInstaller? Schematics not posted yet.

from KS:
Features specific to Teensy 3.5:
120 MHz ARM Cortex-M4 with Floating Point Unit
512K Flash, 192K RAM, 4K EEPROM
 
re: 192k/256k: my bad, I was looking at the wrong device flavor in the manual. Not blinking (bare metal) yet, but I'll get there eventually...
 
I've updated my pull request for T_3.6 EEPROM writes over 120 MHz - includes the change to get serial # to USB at those speeds too.

PaulStoffregen/cores/pull/177

The new changes were to EEPROM timing (EEPROM-writing-on-T_3-6-in-HSRUN-mode?) - I added some gratuitous delays and other paranoia to feel it was safer, but in so doing I got the idea the processor was already doing this. I've done most of my EEPROM testing on my months old K66 PROTO beta board - with hundreds of high speed writes to each EEPROM location and it is running error free right now.

Paul - Any thoughts on this method. Empirically it works and no short term problems are apparent. I've posted my test sketches - is there any more telling test or likely problem area?
 
Just got my Teensy 3.6's today. I wanted to do some benchmarking of the FPU sqrtf() for stepper pulse acceleration calculations.

I loaded up Arduino 1.6.11 + Teensyduino 1.30 and compiled your code example as a starting point and played around a bit. I get similar results to yours. My teensy's run stable at 240 MHz too.

I was wondering if I can force the compiler to use the FPU or force use of soft floating point for comparisons - any ideas?

I would like to be able to turn FPU off and on as a sanity check.
 
Code:
Speed test                 :: T_3.1        :: T_3.5        :: T_3.5        :: T_3.6        :: T_3.6
----------                 ----------      ----------      ----------      ----------      ----------
F_CPU =                    96000000 Hz     120000000 Hz    168000000 Hz    180000000 Hz    240000000 Hz
1/F_CPU =                  0.0104 us       0.0083 us       0.0060 us       0.0056 us       0.0042 us
The next tests are runtime c
  nop                      0.010 us        0.008 us        0.006 us        0.006 us        0.004 us
  avr gcc I/O              0.103 us        0.084 us        0.072 us        0.066 us        0.049 us
  Arduino digitalRead      0.158 us        0.126 us        0.089 us        0.084 us        0.062 us
  Teensy digitalReadFast   -               0.009 us        0.007 us        0.004 us        0.004 us

Source Post: Teensy-3-1-overclock-to-168MHz?

Missing the digital FAST calls where I found it - just replicated and renamed . . .
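For reading the table, it can help to turn the times into CPU cycles. A small host-side sketch of that conversion follows. `cyclePeriodUs` and `cyclesFor` are made-up helper names, and nothing here comes from the benchmark sketch itself.

```cpp
#include <cstdint>

// Length of one CPU clock cycle in microseconds, i.e. the "1/F_CPU" row.
inline double cyclePeriodUs(uint32_t fCpuHz) {
    return 1.0e6 / static_cast<double>(fCpuHz);
}

// How many CPU cycles a measured time (in microseconds) corresponds to.
inline double cyclesFor(double timeUs, uint32_t fCpuHz) {
    return timeUs * (static_cast<double>(fCpuHz) / 1.0e6);
}
```

For example `cyclesFor(0.158, 96000000)` is about 15.2, so the digitalRead figure in the T_3.1 column is roughly 15 cycles, while 0.004 us at 240 MHz is just under one cycle. With only three printed decimals the fast rows are already at the limit of what the table can resolve.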
This digitalReadFast test is pretty bogus. Since the result doesn't get used, half of it gets optimized away. So it results in this:
Code:
ldr	r0, [r3, #0]
This is missing the bit masking for the pin. This will never ever happen in real code where the result is not thrown away without looking at it. DigitalReadFast reads the entire port register and must thus mask out the other pins.

If I assign the result to a volatile bool in the test, it looks like this:
Code:
ldr	r1, [r6, #0]
ubfx	r1, r1, #5, #1
strb.w	r1, [sp, #9]
The port register is read, bit 5 is extracted, the result is put into a bool on the stack. This takes 7x longer than the test above (this 7x is a bit unfair, since the stack store probably takes half the time).
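In C++ terms the three instructions above correspond to something like the following host-side sketch. The names `extractPinBit` and `countHighReads` are made up, and the `volatile` pointer stands in for a real port register. The `volatile bool sink` is what forces the compiler to keep the mask and the store instead of discarding them.

```cpp
#include <cassert>
#include <cstdint>

// What `ubfx r1, r1, #5, #1` computes when bit == 5: a 1-bit field of the port value.
inline bool extractPinBit(uint32_t portValue, unsigned bit) {
    assert(bit < 32);
    return ((portValue >> bit) & 1u) != 0;
}

// Hypothetical benchmark body: every read is masked and stored to a volatile,
// so none of the work can be optimized away as an unused result.
inline unsigned countHighReads(const volatile uint32_t *port, unsigned bit, unsigned n) {
    volatile bool sink = false;
    unsigned highs = 0;
    for (unsigned i = 0; i < n; ++i) {
        sink = extractPinBit(*port, bit);  // load, extract, store
        if (sink) {
            ++highs;
        }
    }
    return highs;
}
```

On a host build a plain `volatile uint32_t` variable can play the port, e.g. one holding 0x20 reads as high on bit 5 every time.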

I have experimented with using bitbanding, where you can extract the pin value in a single instruction (which does happen), but GCC generates code that is worse than digitalReadFast if you don't know exactly what you are doing.
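For readers who have not met bit-banding: on Cortex-M3/M4 parts that implement it, each bit of a word in the peripheral region starting at 0x40000000 also appears as its own 32-bit alias word starting at 0x42000000. A minimal sketch of the address arithmetic, checked on a host; `bitBandAlias` is a made-up name, and the 1 MB region size checked here is the architectural one.

```cpp
#include <cassert>
#include <cstdint>

// Cortex-M3/M4 peripheral bit-band mapping.
constexpr uint32_t kPeriphBase  = 0x40000000u;  // start of the bit-band region
constexpr uint32_t kBitBandBase = 0x42000000u;  // start of the alias region

// Address of the alias word for one bit of the register at regAddr.
inline uint32_t bitBandAlias(uint32_t regAddr, unsigned bit) {
    assert(bit < 32);
    assert(regAddr >= kPeriphBase && regAddr < kPeriphBase + 0x00100000u);
    return kBitBandBase + (regAddr - kPeriphBase) * 32u + bit * 4u;
}
```

As an example, `bitBandAlias(0x400FF090, 5)` gives 0x43FE1214. The register address 0x400FF090 is assumed here to be the port C input register, so check it against the reference manual before relying on it. On the target, a single load from the alias word returns 0 or 1 for that pin, which is the one-instruction read mentioned above.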
 
This digitalReadFast test is pretty bogus. Since the result doesn't get used, half of it gets optimized away. ...
I have experimented with using bitbanding, where you can extract the pin value in a single instruction (which does happen), but GCC generates code that is worse than digitalReadFast if you don't know exactly what you are doing.

So you have a faster digitalReadFast in the works? :)

It just seemed missing that so I added it to whatever that code was - cloned what was there for digitalRead to digitalReadFast - digitalRead isn't at least partly optimized away with no use of the result?
 
But for a comparison with the same compiler and the same target (M4), it might be ok, that half of it gets optimized away? The only differences are a bit more cache and different F_CPU.
 
So you have a faster digitalReadFast in the works? :)
Maybe. The issue is I haven't figured out a way to not make it worse than the existing digitalRead in a lot of contexts.
It just seemed missing that so I added it to whatever that code was - cloned what was there for digitalRead to digitalReadFast - digitalRead isn't at least partly optimized away with no use of the result?
It's not. It's a function call that is not inlined. It's defined in a .c file, digitalReadFast is defined in a header file.
 
But for a comparison with the same compiler and the same target (M4), it might be ok, that half of it gets optimized away? The only differences are a bit more cache and different F_CPU.
The issue I have with that is that it doesn't measure something relevant. It's definitely not a digitalReadFast test.
 
pre-read note from @tni - I asked partly <edit> - the result is pulled from the bit and sent back: yes since it isn't inline code, but the result wouldn't be assigned anywhere there either?

I was going to say rotten apples to rotten apples comparison. Certainly one of the more artificial benchmarks - but with impressive results on your side . . .

Also that only listed Teensy units - no AVR was harmed in making this table- I just didn't bother to plug in the T_3.1 for the first column after I made the edit - it would have just run that one instruction slower.

The real questions are what was the next digit in the battle of 180 .vs. 240 MHz (is the port read/write speed limited to .004?) - and is 168 really that much slower - and where did I have F_BUS set?

My searches for K66 High Speed Run don't yield much as far as working around the failure to write EEPROM or read the internal areas like the one time write where the serial number comes from. Is there any better solution than what Frank dared me to try? I just found this June 2015 question - that may be out of context - but it suggests the processor won't assume speed over 120 unless/until HSRUN is set - but I've been unsetting it after the clock is that high with no apparent problem (though limiting it to code needed to write EEPROM by disabling interrupts for the duration)- IIRC Frank tried just not setting it on the K66 and then wholly failed to run so it may have been this Frank?:

https://community.nxp.com/thread/356766::
I am using KDS 3.0.0 and KSDK 1.2.0 and when using Processor Expert a desired warning is missing.
*
When selecting 180 (or above 120) MHz clock then Cpu setting for power modes needs attentions. High speed run mode needs to allowed, otherwise clock init gets stuck when waiting for new run mode. It should not be possible to generate code without error in this case.

We could read the manual - I think I scanned and saw nothing - but what do they know - they state the K66 Max 'Device clock specifications' is 180 MHz.

This is a K64 question but where does the K64 stand on OC? How usable are 144 and 168 - I edited in 168 but see 144 and 180 and 192 have specs but no menu entry? It seems I tried one other first and the compiler said that 'speed' wasn't defined.
 
pre-read note from @tni - I asked partly <edit> - the result is pulled from the bit and sent back: yes since it isn't inline code, but the result wouldn't be assigned anywhere there either?
In the test digitalRead just makes the function call and ignores the result (left unused in the return register).
 
For comparing different architectures:

Yes. But storing it to a "volatile" does not help much. No real application consists of a loop that only reads (are the interrupts disabled?)
If you REALLY want to measure the MAX speed, you have to use the fastest way the MCU is able to. Perhaps DMA? Would it make sense? No.
However, the result is very artificial and does not help to estimate the overall speed for a hypothetical application!

Only trust the benchmarks (statistics) you faked yourself...

And, BTW; the compiler has a BIG influence.
What do you want to measure ? The compiler or the hardware ? Both ? Which -Optimization level ?
Did you use the newest or fastest version of GCC? <- that's a difference, esp. for Cortex-M4...
 