Anybody have experience overclocking Teensy 3.5 to 240 MHz?

Status
Not open for further replies.

wild_hare

Member
I am working on a signal processing project to replace a current FPGA + level shifter design with a Teensy 3.5 in order to reduce parts count.
The input signals are +5V, but the few output signals can be 3v3 levels.
The current FPGA (and now Teensy) design must respond within 85 nanoseconds of an input signal change to assert an appropriate output signal.
The Teensy ARM code and hardware (using input voltage level-shifter buffers) work successfully with a Teensy 3.6 at [F_CPU] 240 MHz and [F_BUS] 120 MHz, but I would like to eliminate the 5 extra 5V --> 3v3 level-shifter chips, which the 5V-tolerant inputs of the Teensy 3.5 would make unnecessary.

'Scoping/logic analyzing the Teensy 3.5 at 120, 144, and 168 MHz reveals that the 3.5 is too slow, so overclocking to 240 MHz might be an option.
Does anybody have any [successful] experience with overclocking the '3.5 to this speed, or have other suggestions or options?

Bruce Ray
Wild Hare Computer Systems, Inc.
Boulder, Colorado USA
 
What do you call safe overclocking? If it is for a gadget on your workbench, without serious consequences in case of overheating or another hardware failure, I'd say yes.
But when it comes to a commercial product with MTBF and other reliability specifications, going beyond the manufacturer's limits is simply irresponsible.
Are you sure that you can't optimize your code, if needed with inline ASM, to meet the required response time? 85 ns is tough in every case. It's about 12 CPU cycles @ 144 MHz and about 20 @ 240 MHz...
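
The cycle counts quoted above come from multiplying the response window by the clock rate. A minimal C sketch of that arithmetic follows; the helper name cycles_in_window is made up for illustration and is not part of any Teensy library.

```c
#include <stdint.h>

/* Whole CPU cycles that fit inside a response window.
 *   window_ns : response window in nanoseconds (85 in this thread)
 *   f_cpu_hz  : core clock in Hz
 * The result is rounded down, since a partial cycle cannot be used. */
static uint32_t cycles_in_window(uint32_t window_ns, uint32_t f_cpu_hz)
{
    /* 64-bit intermediate: 85 * 240000000 does not fit in 32 bits. */
    return (uint32_t)(((uint64_t)window_ns * f_cpu_hz) / 1000000000u);
}
```

For the speeds discussed in this thread this gives 10 cycles at 120 MHz, 12 at 144 MHz, 14 at 168 MHz, and 20 at 240 MHz, before any latency in reading or writing the GPIO registers is subtracted.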
 
Many people have reported success with 240 MHz. Frank is probably the one who has done the most and shared info.

One thing we know doesn't work well at 240 MHz is the I2S MCLK mult/div circuitry. If you're using I2S master mode, you probably won't be able to reliably go beyond 192 MHz.

Of course all the usual caveats about overclocking apply. Everything I've personally tried and as far as I know everything that I've heard has been done at room temperature.
 
This is not a commercial project, and power and heat are not primary concerns. The Teensy is placed on another PCB which has a +5V power supply with many amps, and the PCB is located next to a fan in a chassis. So I consider this 'experimentation' at the moment, but I may release it as open source if there is some future interest.

The signal input and output code is compact, and all variables are pre-calculated and placed in registers (in both hand-written and GCC optimized code). For example, compiler code with 'fastest' option is pretty tight, with only 3 instructions used for testing for input signal transition change, and only 1 instruction used for outputting the output data when the input signal transition is detected. Registers are selected for almost all variables by the compiler's register color mapping optimization, and the 'smallest' compiler optimization option produces equally good quality code.


while ( ((n1 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
820: 6825 ldr r5, [r4, #0]
822: 056f lsls r7, r5, #21
824: d4fc bmi.n 820 <test()+0x110> <--- (branches back to x820 to check signal pin again)
MEM_OUT( n2 ) // send next nibble to output MEM bus
826: 6033 str r3, [r6, #0]
while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)
828: 6823 ldr r3, [r4, #0]
82a: 055b lsls r3, r3, #21
82c: d5fc bpl.n 828 <test()+0x118>

while ( ((n2 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
82e: 6823 ldr r3, [r4, #0]
830: 055f lsls r7, r3, #21
832: d4fc bmi.n 82e <test()+0x11e>
834: 469a mov sl, r3
MEM_OUT( n3 ) // send next nibble to output MEM bus
836: 6032 str r2, [r6, #0]
while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)
838: 6823 ldr r3, [r4, #0]
83a: 055b lsls r3, r3, #21
83c: d5fc bpl.n 838 <test()+0x128>

while ( ((n3 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
83e: 6822 ldr r2, [r4, #0]
840: 0557 lsls r7, r2, #21
842: d4fc bmi.n 83e <test()+0x12e>
MEM_OUT( n4 ) // send next nibble to output MEM bus
844: f8c6 b000 str.w fp, [r6]
while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)
848: 6823 ldr r3, [r4, #0]
84a: 055b lsls r3, r3, #21
84c: d5fc bpl.n 848 <test()+0x138>

while ( ((n4 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
84e: 6823 ldr r3, [r4, #0]
850: 055f lsls r7, r3, #21
852: d4fc bmi.n 84e <test()+0x13e>
MEM_OUT( 0x000F0000 ) // send next nibble to output MEM bus
854: f44f 2770 mov.w r7, #983040 ; 0xf0000
858: 6037 str r7, [r6, #0]
85a: 469b mov fp, r3
while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)
85c: 6823 ldr r3, [r4, #0]
85e: 055b lsls r3, r3, #21
860: d5fc bpl.n 85c <test()+0x14c>

Tight hand-written assembly could yield some minor optimizations, but the performance gain would be small compared to the step change that comes from raw clock speed. Such ASM code could still be used for 'production' or stability once the proof-of-concept code is "close" to meeting the required performance...
 
I should have mentioned that the previous listing was the GCC compiler output with the Teensy 3.6, 240 MHz, and 'fastest' options specified.
The application code is dedicated to monitoring the input signals on GPIO pins and output signals on GPIO output pins; no serial, USB, I2C, timer, or other peripherals/devices are used. The initialization code includes configuring the GPIO pins appropriately and disabling interrupts and timers and such just before the tight application monitoring loop. The setup code listing includes:

cli() ;
74e: b672 cpsid i

// disable PIT clock
SIM_SCGC6 &= ~SIM_SCGC6_PIT;
750: f5a0 2037 sub.w r0, r0, #749568 ; 0xb7000
754: 3894 subs r0, #148 ; 0x94
756: 6803 ldr r3, [r0, #0]

// Disable SysTick Exception - delay() will not work while disabled
SYST_CSR &= ~SYST_CSR_TICKINT;
758: 4955 ldr r1, [pc, #340] ; (8b0 <test()+0x1a0>)

// Mask USB interrupt
NVIC_DISABLE_IRQ (IRQ_USBOTG);
75a: 4c56 ldr r4, [pc, #344] ; (8b4 <test()+0x1a4>)
 
It seems that most overclocking modifications in various projects target the Teensy 3.6. Is there a project that targets the Teensy 3.5 with a 240 MHz F_CPU clock and a 120 or 240 MHz F_BUS setup?
 
Teensy 3.6 has a special high speed run mode which increases the core voltage inside the chip. There is a large 8K cache memory which runs at the full CPU speed, in addition to the small cache built into the flash memory controller. You also get a different PLL clock generator, designed to reach higher speeds. The USB port on 3.6 also has the ability to sync to the USB data, which means you're not limited to using only frequencies that are multiples of 24 MHz for the USB to work.

You don't get any of that special stuff in Teensy 3.5. The code in mk20dx128.c has options for configuring the clock to 144 and 168 MHz. If you want to try faster, you'll need to edit that file. Because you're only able to use multiples of 24 MHz, the next step would be all the way up at 192 MHz. I believe setting MCG_C6_VDIV0 in MCG_C6 to 24 should do it. You'll also need to adjust SIM_CLKDIV1 & SIM_CLKDIV2. The one part of the chip which doesn't overclock well is the flash memory, so getting SIM_CLKDIV1_OUTDIV4 correct and a legal ratio with the other settings is really important.
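
To make the arithmetic behind that suggestion concrete, here is a rough host-side C sketch. It assumes the PLL reference is 4 MHz (16 MHz crystal divided by 4) and that the multiplier equals the VDIV0 field value plus 24. Both are assumptions to verify against the reference manual and mk20dx128.c before touching real hardware. The helper names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTIONS (verify against the reference manual and mk20dx128.c):
 *  - 16 MHz crystal / 4 gives a 4 MHz PLL reference
 *  - PLL multiplier = VDIV0 field value + 24, field is 5 bits wide */
#define PLL_REF_HZ    4000000u
#define VDIV0_OFFSET  24u
#define VDIV0_MAX     31u

/* VDIV0 field value for a target PLL output, or -1 if unreachable. */
static int vdiv0_for(uint32_t target_hz)
{
    if (target_hz % PLL_REF_HZ != 0)
        return -1;                          /* not a multiple of the reference */
    uint32_t mult = target_hz / PLL_REF_HZ;
    if (mult < VDIV0_OFFSET || mult > VDIV0_OFFSET + VDIV0_MAX)
        return -1;                          /* outside the multiplier range */
    return (int)(mult - VDIV0_OFFSET);
}

/* Smallest OUTDIVn field value (divider = field + 1, 4-bit field) that
 * keeps src_hz / divider at or below limit_hz, or -1 if none does. */
static int outdiv_field_for(uint32_t src_hz, uint32_t limit_hz)
{
    for (uint32_t div = 1; div <= 16; ++div) {
        /* Compare without division so rounding cannot hide an overshoot. */
        if ((uint64_t)src_hz <= (uint64_t)limit_hz * div)
            return (int)(div - 1);
    }
    return -1;
}
```

Under these assumptions vdiv0_for(192000000) returns 24, matching the value suggested above. With placeholder limits of 28 MHz for the flash clock and 60 MHz for the bus clock (illustrative numbers, not datasheet quotes), outdiv_field_for gives a flash field of 6 (192 / 7, about 27.4 MHz) and a bus field of 3 (192 / 4 = 48 MHz).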

Without the cache memory, you really do get diminishing returns for code running out of flash. Plan on using "FASTRUN" (copies the code to RAM) on any functions you want to really use the full speed if you're running faster than 120 MHz. Also keep in mind only the bottom 64K of the RAM in Teensy 3.5 connects to the Cortex-M4 code bus. There's an extra 1 cycle wait for code executing at any address above 0x1FFFFFFE, since the bus to that memory is optimized for data, so you can only have a relatively small amount of code running without the latency & tiny cache of the flash.
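
As a small illustration of the address boundary described above, the following C sketch classifies an address as being in the lower RAM bank or not. The base address 0x1FFF0000 is an assumption derived from "bottom 64K" ending just below 0x20000000, and on_code_bus is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: the lower RAM bank is the 64K that ends just below
 * 0x20000000, so it starts at 0x1FFF0000. */
#define SRAM_L_BASE 0x1FFF0000u
#define SRAM_U_BASE 0x20000000u

/* True if addr lies in the lower RAM bank, which is the part described
 * above as connected to the Cortex-M4 code bus. */
static bool on_code_bus(uint32_t addr)
{
    return addr >= SRAM_L_BASE && addr < SRAM_U_BASE;
}
```

On real hardware, one way to use such a check would be to pass it the address of a FASTRUN function (cast to uint32_t) and confirm that the linker placed it in the lower bank.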

If you try really hard you can probably get some code to run faster, but the hardware inside the chip on Teensy 3.5 just isn't in the same league as the faster chip on Teensy 3.6.
 
Paul - Thank you very much for the additional information.

That gives me better ideas to guide my investigations. Fortunately, the code required for this project is extremely simple and small, and the GCC compiler generates correspondingly very good code - making code hand-tweaking virtually unnecessary.

I'll focus on the clock area and let you know what I find...
 
Further testing shows that the Teensy 3.5 does indeed not have enough performance to handle the signal processing, but the Teensy 3.6 using the 'standard' overclocked 240 MHz option with any of the compiler optimization options works.

However, I am getting an error in a test program after a random amount of time (from 3 seconds to 3 minutes) and need to track this down.

Since this is a dedicated system I need to disable all interrupts (and anything else) that might affect the timing and determinism. I do this by using a call to

noInterrupts() ;

before the main signal waiting/response loop. I am trying to review the MCU data info and other forum posts to determine if this is sufficient to disable all external conditions attacking the CPU, or is something more detailed needed, i.e.:

noInterrupts() ;

// disable PIT clock
SIM_SCGC6 &= ~SIM_SCGC6_PIT;

// disable the PIT module (turn off disable) $$$ No, clobbers output!!!!!!!
//### PIT_MCR |= PIT_MCR_MDIS;

// Disable SysTick Exception - delay() will not work while disabled
SYST_CSR &= ~SYST_CSR_TICKINT;

// Mask USB interrupt
NVIC_DISABLE_IRQ (IRQ_USBOTG);

// Disable timer
//### TIMER_CONTROL_REGISTER = 0; $$$ No, clobbers ability to output to GPIO signals !!!!!!!

noInterrupts() ;


Any suggestions or definitive answer from the experts?
 
Can you post your test program?

I'm running the 3.6 with overclocked bus and cpu without any problems in my c64 emu for days - at room temperature. But it is not using many hardware devices. Just spi, dma, both usb, sd, gpio and dac.
 
(Apologies if this is a repeat post - I seem to have an e.group problem today.)

The main immediate objective is to determine if the

noInterrupts() ;

procedure does inhibit all external interrupts from interrupting the Teensy 3.6, or if a more 'complete' initialization sequence is needed (like the commented-out setup code in the main loop listing below).

If the noInterrupts() call does indeed inhibit all external interrupts, then my next investigation would be in other areas, such as signal line noise and/or cross-talk, capacitance, and ground issues.
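
For reference while investigating, here is a simplified C model of what masking with cpsid i does on an ARMv7-M core, assuming noInterrupts() compiles to the same cpsid i instruction that cli() produced in the earlier listing. Setting PRIMASK raises the execution priority to 0, so every exception with a configurable priority (SysTick, USB, PIT and other peripheral interrupts) is held off, while the fixed negative-priority exceptions (Reset, NMI, HardFault) are not. The function name is hypothetical, and the model ignores BASEPRI and FAULTMASK.

```c
#include <stdbool.h>

/* Fixed exception priorities defined by the ARMv7-M architecture.
 * Lower numbers are more urgent. */
enum { PRIO_RESET = -3, PRIO_NMI = -2, PRIO_HARDFAULT = -1 };

/* Simplified model: can a pending, enabled exception with priority
 * 'prio' preempt thread-mode code?
 * With PRIMASK set (after cpsid i) only negative priorities get through.
 * BASEPRI and FAULTMASK are deliberately left out of this model. */
static bool can_preempt_thread(int prio, bool primask_set)
{
    if (primask_set)
        return prio < 0;
    return true;   /* no masking: any enabled exception can preempt */
}
```

This model covers CPU exceptions only. Anything that is not an exception, such as DMA traffic or wait states on the peripheral bus, is outside its scope and would have to be ruled out separately.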

update: It seems there is a temperature sensitivity involved in this situation. If I leave the computer system (to which the Teensy 3.6 is attached) powered on for several hours, the run time before first failure increases to the range of an hour rather than seconds (from starting the system when it is cold [temperature]). Also, swapping one Teensy 3.6 for another Teensy 3.6 module produces the same results - no difference noticed between the two modules.

So investigation into other areas seems warranted...



--------------------------------------------

A snapshot of the current main loop is provided below for reference:


/*------------------------------------------------------*/
/* test */
/*------------------------------------------------------*/

FASTRUN void test( void )
{
    uint32_t mb ;
    uint32_t ma ;
    register uint32_t n1, n2, n3, n4 ;

    init_octal() ;
    init_ram() ;

    init_fix01() ;

    mb = ~0 ;
    ma = 0 ;

    PORTA_OUT = 0 ;
    PORTB_OUT = 0 ;
    PORTC_OUT = 0 ;
    PORTD_OUT = 0 ;
    PORTE_OUT = 0 ;

    PORTA_IN = 0 ;
    PORTB_IN = 0 ;
    PORTC_IN = 0 ;
    PORTD_IN = 0 ;
    PORTE_IN = 0 ;


    /*$$$ comment out the following setup code unless tests show it is needed...

    // PIT timer registers

    #define TIMER_CONTROL_REGISTER PIT_TCTRL0
    #define TIMER_FLAG_REGISTER PIT_TFLG0
    #define TIMER_LOAD_VALUE_REGISTER PIT_LDVAL0

    noInterrupts() ;
    // disable PIT clock
    SIM_SCGC6 &= ~SIM_SCGC6_PIT;

    // disable the PIT module (turn off disable) $$$ No, clobbers output!!!!!!!
    //### PIT_MCR |= PIT_MCR_MDIS;

    // Disable SysTick Exception - delay() will not work while disabled
    SYST_CSR &= ~SYST_CSR_TICKINT;

    // Mask USB interrupt
    NVIC_DISABLE_IRQ (IRQ_USBOTG);

    // Disable timer
    //### TIMER_CONTROL_REGISTER = 0; $$$ No, clobbers output!!!!!!!
    $$$*/
    noInterrupts() ;

    while ( (PORTB_IN & B_MEM_CLR) ) ; // wait for MEM_CLR to be de-asserted

    while ( 1 )
    {
        while ( ! (PORTB_IN & B_MEM_CLR) ) ; // wait for MEM_CLR to be asserted (low --> high)

        // get memory address from various input pins

        ma = ((PORTE_IN >> 12) & 070000) // MA[ 1: 3]
           | (((PORTA_IN >> 6) & 006000) | ((PORTA_IN << 4) & 001000)) // MA[ 4: 6]
           | (octal_bits[ (((PORTC_IN >> 8) & 0x000F) | ((PORTA_IN >> 8) & 0x00F0)) ] << 6) // MA[ 7: 9]
           | octal_bits[ (PORTD_IN & 0x00FF) ] // MA[13:15]
           | (octal_bits[ (PORTC_IN & 0x00FF) ] << 3) // MA[10:12]
           ;
        ma = (~ma) & 077777 ;

        mb = (uint32_t)(uint16_t) ram[ ma ] ;

        n1 = mb ; // data[12:15] (attempt crude optimizations here)
        n2 = (mb >> 4) ; // data[ 8:11] (is masking even needed?)
        n3 = (mb >> 8) ; // data[ 4: 7]
        n4 = (mb >> 12) ; // data[ 0: 3]

        MEM_OUT( n1 )
        while ( ((n1 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
        MEM_OUT( n2 ) // send next nibble to output MEM bus
        while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)

        while ( ((n2 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
        MEM_OUT( n3 ) // send next nibble to output MEM bus
        while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)

        while ( ((n3 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
        MEM_OUT( n4 ) // send next nibble to output MEM bus
        while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)

        while ( ((n4 = PORTB_IN) & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be asserted (high --> low)
        MEM_OUT( 0x000F0000 ) // send next nibble to output MEM bus
        while ( ! (PORTB_IN & B_MEM_CLOCK_N) ) ; // wait for MEM_CLK_N to be deasserted (low --> high)

        //$$$ MEM_OUT( ~(0x0000) ) //### this could be used here for debug

        mb = ((n1 & 0x0F0000) >> 16) // data[12:15]
           | ((n2 & 0x0F0000) >> 12) // data[ 8:11]
           | ((n3 & 0x0F0000) >> 8) // data[ 4: 7]
           | ((n4 & 0x0F0000) >> 4) // data[ 0: 3]
           ;

        if ( ma <= MEM_MAX )
        {
            ram[ma] = (uint16_t) (mb) ; // store (complemented) data into memory
        }
    } // end of 'while(1)' [forever] loop
} // test()



/*------------------------------------------------------*/
/* loop */
/*------------------------------------------------------*/

void loop( void )
{
test() ;
} // 'loop'
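
The reassembly step near the end of the loop, where four port samples are combined into one 16-bit word, can be checked on a PC without any Teensy hardware. The sketch below copies the mask and shifts from the listing. The helper nibble_to_port is hypothetical and only exists to build test inputs, since the real MEM_OUT macro is not shown in this thread.

```c
#include <stdint.h>

#define NIBBLE_MASK 0x000F0000u   /* port bits 16..19, as in the listing */

/* Combine four port samples into a 16-bit word, using the same mask
 * and shifts as the main loop. n1 carries the least significant nibble. */
static uint16_t pack_nibbles(uint32_t n1, uint32_t n2,
                             uint32_t n3, uint32_t n4)
{
    return (uint16_t)( ((n1 & NIBBLE_MASK) >> 16)
                     | ((n2 & NIBBLE_MASK) >> 12)
                     | ((n3 & NIBBLE_MASK) >>  8)
                     | ((n4 & NIBBLE_MASK) >>  4) );
}

/* HYPOTHETICAL test helper: place nibble 'index' of 'word'
 * (0 = least significant) on port bits 16..19. */
static uint32_t nibble_to_port(uint16_t word, unsigned index)
{
    return ((uint32_t)(word >> (4u * index)) & 0xFu) << 16;
}
```

Feeding the four nibbles of any 16-bit value through nibble_to_port and then pack_nibbles should return the original value, and bits outside 16..19 in a sample should have no effect on the result.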
 