Teensyduino 1.49 Beta #3

Status
Not open for further replies.
@neurofun - @KurtE

Curious - made the change to SPI.cpp in post #25 and it is still showing 272 ms/page @18Mhz SPI. Makes it more curious that @manitou showed in post #24 SPI clock is actually faster in beta #3.

BTW: what spi clock are you running at?

EDIT: If I run at 22mhz in beta#3 is taking me 161,280 ms/page. Which makes sense that since 22mhz setting would equate to something like 24-25mhz in actuality with beta3. In beta2 22Mhz would probably below the speed limit. You are over the speed limit for the RA875.

@22Mhz - 161.280 ms/page
@20Mhz - 421 ms/page
@18Mhz - 259 ms/page
@16Mhz - 259 ms/page
@14Mhz - 271 ms/page
@12Mhz - 271 ms/page

This is the warning from the usersetting.h file:
After som mail exchange with RAiO I solved the dilemma behind SPI speed limit:
The RA8875 has limitation of 12Mhz SPI but this has been set because not all internal macros
can run over that speed, the library automatically deal with this so I was able to go over 20Mhz!
At that speed you need to short cables as much you can, provide clean supply and good decoupling!
DO NOT Exceed 23Mhz for RA8875! It will result in garbage on screen or run very slow.
So looks like the problem is with the RA8875 not beta3.

EDIT2: can you run your display at about 18 or 19Mhz with Beta3
 
Last edited:
@mjs513
beta3 with changes from post #25
running at the default 22MHz showing 284ms
changed to 18MHz and now showing 275ms

unmodified beta3
@17MHz -> 246ms
@18MHz -> 248ms
@19MHz -> 256ms

Problem resolved, thank you very much!!!

It might be a good idea to change the MAXSPISPEED for T4 in RA8875UserSettings.h from 22MHz to 18MHz.
 
@neurofun

Unfortunately the speed limit with the RA8875 didn't register until @manitou posted his measurements and I went in to retest. Going to change to 18mhz
 
As far as I know, no bugs have been found with millis().

However, micros() had at least 2 bugs. Neither is new. These have been with us since Teensy 4.0 release.

Functions like delay() which use micros() internally are also affected by these micros() bugs.

Bug #1: (fixed) When configured for 24 MHz speed, we were actually running at 30 MHz, but F_CPU_ACTUAL was set to 24000000. When F_CPU_ACTUAL is wrong, micros() computes incorrect output.

Bug #2: (not fixed) At certain slower CPU speeds, micros() gives wrong results just before millis() increments. For example, if millis() is currently 499 and about to turn to 500, the bug in micros() might result in output like 500001, even though micros() should never report more than 499999 while millis() is still 499.

So far, neither of these bugs has been found while running at 600 or 528 MHz.

Indeed - looking at delay() it relies on micros() - rewrote it to work with millis() and it shows proper function. So the millis systick isn't the trouble.

This line in micros() gets the partial microsecond since the last clock tick using ARM_DWT_CYCCNT expressed in the ccdelta :: usec = 1000*smc + (ccdelta/(F_CPU_ACTUAL/1000000));

Fixed Bug #1 impact from wrong F_CPU_ACTUAL is apparent where CYCCNT is not in sync with cycle counts/second.

Re Bug #2: Looking around not seeing what could be behind this. Anytime the sysTick recorded ARM_DWT_CYCCNT gets out of sync with actual tick or the F_CPU_ACTUAL it can get messy. On set_arm_clock() there will be change in rate, between ticks - but that should reset on the next sysTick.

On T_3.6 the actual ARM_DWT_CYCCNT was under F_CPU based on GPS PPS. Not sure what happens from alternate clock on T4 "At certain slower CPU speeds" - if the cycles run past expected count it could round UP before sysTick resets? But that should be a small window - and only systick_isr() controls the clock values, micros() is read only with code to try to get atomic reads of tick data.
 
The max SPI clock for Teensyduino 1.48 was 37.7 mhz (528/7/2) with an effective data rate of 30.8 mbs. With LPSPI clock changed in 1.49-beta 3, max SPI clock is 120mhz. I ran some 1024-byte SPI transfers with MOSI jumpered to MISO, I ran some tests with scope on SPI CLK (pin 13). Tests would hang with SPI CLK set at 120mhz, but for 80 mhz and lower, tests ran successfully with no errors betwixt MISO and MOSI. The max effective data rate was about 54 mbs. I printed out SPI CCR register to confirm prescale (SCKDIV and DBT)

Code:
SPICLK   CCR  datarate(mbs)  CCR MHz 1.8.10 1.48
   3     C18    2.86            2.9
   4     811    3.89            4
   5     70E    4.61            4.7
   8     408    7.27            7.5
  14     204   11.82           12.6
  16     103   14.0            15.1
  20     102   17.2            18.9
  40       0   30.8            37.7

1.49-beta3 results:
Code:
SPICLK   CCR  datarate(mbs)  scopeMHz
   4    1D3A    3.97            4
   8     E1C    7.9             8
  24     408   23.1            24
  40     204   37.4            40
  48     103   42.7            47.6
  60     102   54.2            60
  80     001   54.2            79.2

spiperf sketch
For 1.49-beta3, you would need to change print to read
Code:
Serial.printf("SPICLOCK %d MHz   CCR freq %.1f MHz\n", SPICLOCK / 1000000, 720. / 3 / ((0xff & LPSPI4_CCR) + 2));
 
Last edited:
The max SPI clock for Teensyduino 1.48 was 37.5 mhz with an effective data rate of 29.2 mbs. With LPSPI clock changed in 1.49-beta 3, max SPI clock is 120mhz. I ran some 1024-byte SPI transfers with MOSI jumpered to MISO, I ran some tests with scope on SPI CLK (pin 13). Tests would hang with SPI CLK set at 120mhz, but for 80 mhz and lower, tests ran successfully with no errors betwixt MISO and MOSI. The max effective data rate was about 54 mbs. I printed out SPI CCR register to confirm prescale (SCKDIV and DBT)

1.49-beta3 results:
Code:
SPICLK   CCR  datarate(mbs)  scopeMHz
   4    1D3A    3.97            4
   8     E1C    7.9             8
  24     408   23.1            24
  40     204   37.4            40
  48     103   42.7            47.6
  60     102   54.2            60
  80     001   54.2            79.2

spiperf sketch
For 1.49-beta3, you would need to change print to read
Code:
Serial.printf("SPICLOCK %d MHz   CCR freq %.1f MHz\n", SPICLOCK / 1000000, 720. / 3 / ((0xff & LPSPI4_CCR) + 2));

@manitou
That's good info to keep handy. Thank you for testing.
 
Just got to test beta 3, tested the MIDI USB Type with large Sysex dumps and worked like a charm! the only thing i noticed that i found odd was that for teensy 4.0 the IDE says it uses 10 times more dynamic memory than on teensy 3.6

This is for a blank sketch just a setup and loop function

t4.0 blank sketch Serial 600MHz Optimize "Faster"
Sketch uses 12448 bytes (0%) of program storage space. Maximum is 2031616 bytes.
Global variables use 41660 bytes (7%) of dynamic memory, leaving 482628 bytes for local variables. Maximum is 524288 bytes.


t3.6 blank sketch Serial 180MHz Optimize "Faster"
Sketch uses 9944 bytes (0%) of program storage space. Maximum is 1048576 bytes.
Global variables use 3828 bytes (1%) of dynamic memory, leaving 258316 bytes for local variables. Maximum is 262144 bytes.
 
Just got to test beta 3, tested the MIDI USB Type with large Sysex dumps and worked like a charm! the only thing i noticed that i found odd was that for teensy 4.0 the IDE says it uses 10 times more dynamic memory than on teensy 3.6

This is for a blank sketch just a setup and loop function

t4.0 blank sketch Serial 600MHz Optimize "Faster"
Sketch uses 12448 bytes (0%) of program storage space. Maximum is 2031616 bytes.
Global variables use 41660 bytes (7%) of dynamic memory, leaving 482628 bytes for local variables. Maximum is 524288 bytes.


t3.6 blank sketch Serial 180MHz Optimize "Faster"
Sketch uses 9944 bytes (0%) of program storage space. Maximum is 1048576 bytes.
Global variables use 3828 bytes (1%) of dynamic memory, leaving 258316 bytes for local variables. Maximum is 262144 bytes.

The T4 pulls code from FLASH to RAM to execute - the minimum allocation block for that is 32KB. That allows that code to run at zero wait states in sync with the processor. Code can be marked as FLASHMEM to keep it in flash where it will run slower unless it gets into and stays in cache.

With code from flash moved to RAM up to 32KB in your sketch - then program data is allocated after that in the remainder of the full speed RAM.

More details on memory here : https://www.pjrc.com/store/teensy40.html

That should detail that big first allocation and lead to more details on the T4's memory.
 
the only thing i noticed that i found odd was that for teensy 4.0 the IDE says it uses 10 times more dynamic memory than on teensy 3.6

That's normal, even though Arduino's description "Global variables" doesn't really apply to the much more complex memory structure on this chip.

The larger memory use comes mostly from 2 places.

1: Code is copied to RAM. That's the only way to run at the full 600 MHz speed. It's assigned in 32K chunks, so 32768 of those bytes are code, not actually variables.

2: USB uses much larger buffers to support 480 Mbit speed. On Teensy 3.x, the USB is limited to 12 Mbit speed (except the USB host port on Teensy 3.6). Also on those boards I designed the USB code to use a shared pool of buffers, which conserves memory, but it can also hurt performance in the some cases of sustained large bidirectional data flow, especially if done simultaneously on multiple interfaces (a very rare case, but it has come up a few times over the years). It was originally designed for Teensy 3.0 which had 16K or RAM, and now supports Teensy LC with 8K of RAM. On Teensy 4.0, since there's so much more memory, I went with dedicated buffers per endpoint. It's a trade-off which puts performance first, but does cost more memory for buffers. Even though the CPU is ~5.5X faster than Teensy 3.6, the USB is 20X faster, so code optimizations & buffers are crucial to giving you access to the power of 480 Mbit USB.
 
Last edited:
Re Bug #2: Looking around not seeing what could be behind this.

It's a subtle timing problem. When the elapsed time between handling milliseconds and cycle count is different in the systick interrupt than what the main program does, the result is a small error by the amount of that difference. The interrupt almost always handles these with minimum delay, but the micros() code does not, especially when it suffers being interrupted.

When running at very high CPU speeds, the error is always (or almost always - we simply haven't found ways to stress it enough with other interrupts) a tiny amount less than 1 microsecond, which gets discarded when the division operation rounds down. At slower speeds, the error can be 1 microsecond or more.

We can't fully eliminate this error, especially with interrupts remaining enabled, but the results can at least be forced to always be monotonic by simply checking that the factional part added to the milliseconds is not more than 1 millisecond. Hopefully that will be good enough.

I pushed a couple commits which hopefully will be a permanent fix.

https://github.com/PaulStoffregen/cores/commit/bd4a9e2cfb75ff4c8872cb69a545cea578c87d23

https://github.com/PaulStoffregen/cores/commit/041bfbe9ed85b13f1a833b6c81fe2cd54d1b1ac6

This was a particularly tough problem. It's been quite a while since I've had to dig so much into the compiler's generated asm to see...
 
Ok that makes sense, even i've had the 4.0 for a month i never really used because there was no USB MIDI so i was patiently waiting for it, last night when i went to bed i checked the forum and saw the announcement so i rushed back to my desk to download the beta and test MIDI, so far large SysEx dumps back and forth worked like a charm i'm going to test setting the Teensy 4.0 as a clock master with USB MIDI and see how it performs but otherwise i'm not sure what other tests need to be done for USB MIDI, all the common uses seem to work like a charm!

Also i forgot to mention USB MID is insanly fast!
 
I pushed a couple commits which hopefully will be a permanent fix.

https://github.com/PaulStoffregen/cores/commit/bd4a9e2cfb75ff4c8872cb69a545cea578c87d23

https://github.com/PaulStoffregen/cores/commit/041bfbe9ed85b13f1a833b6c81fe2cd54d1b1ac6

This was a particularly tough problem. It's been quite a while since I've had to dig so much into the compiler's generated asm to see...

Paul
I reran the ClocksT4 sketch as modified by @defragster and looks like the fix resolved the issue associated the clocks as well - no bugbug reports for the speeds tested. For the clock of 129600000:
Code:
 F_CPU=129600000	50ms delay:: 50001 us and 50 ms
System Clock: 129600000
IPG Clock: 129600000
Semc Clock: 43200000
RTC Clock: 32768
USB1pll Clock: 480000000
Peripheral Clock: 24000000
Osc Clock: 24000000
Arm Clock: 648000000
...
 
@mjs513 (me and ...) - I think we probably need to look at some of these changes...

I thought I would try out an ILI9341_t3 display on the current stuff, so I tried breadboard and not getting any display out of graphictest (either ili9341_t3 or _t3n)...
Maybe display bad, tried different one... Maybe wiring wrong so moved to another T4... Still nothing...

Brought up Arduino 1.8.9 with 1.47 and works...

Argh...
 
Update, I then went back and reran the Beta3 on 1.8.10 and it worked? Not sure what the difference is now? Lose wire?
Logic analyzer is saying that it is trying to run SPI > 31mhz...

Or maybe it likes logic analyzer hooked up?

Maybe false alarm?

Update: although a couple test cases using ili9341_t3n are not working (not sure if that our SPIN...)

Edit2 ILI9341_t3n - Asked beginTransaction to run at 144mhz so it got over 100... which did not work... So backed it down again to 30mhz...
 
Last edited:
Update, I then went back and reran the Beta3 on 1.8.10 and it worked? Not sure what the difference is now? Lose wire?
Logic analyzer is saying that it is trying to run SPI > 31mhz...

Or maybe it likes logic analyzer hooked up?

Maybe false alarm?

Update: although a couple test cases using ili9341_t3n are not working (not sure if that our SPIN...)

Edit2 ILI9341_t3n - Asked beginTransaction to run at 144mhz so it got over 100... which did not work... So backed it down again to 30mhz...

@KurtE
Sorry just saw this.

Just tested Buddhabrot, graphicstest and demosauce (had to fix font name) they all seemed to working at 30Mhz. Didn't do any tests with increased speed. Do what to test ST7735_t3 at some point but keep getting distracted.
 
@KurtE

Just tested ILI9488 = seems busted - nothing on running on my Display. Tried changing spi but still not working. Have to go back to make sure I have the right version.

EDIT: False alarm - it worked at 30 and 72Mhz - demosauce and graphicstest
 
It's a subtle timing problem. When the elapsed time between handling milliseconds and cycle count is different in the systick interrupt than what the main program does, the result is a small error by the amount of that difference. The interrupt almost always handles these with minimum delay, but the micros() code does not, especially when it suffers being interrupted.

When running at very high CPU speeds, the error is always (or almost always - we simply haven't found ways to stress it enough with other interrupts) a tiny amount less than 1 microsecond, which gets discarded when the division operation rounds down. At slower speeds, the error can be 1 microsecond or more.

We can't fully eliminate this error, especially with interrupts remaining enabled, but the results can at least be forced to always be monotonic by simply checking that the factional part added to the milliseconds is not more than 1 millisecond. Hopefully that will be good enough.

I pushed a couple commits which hopefully will be a permanent fix.

https://github.com/PaulStoffregen/cores/commit/bd4a9e2cfb75ff4c8872cb69a545cea578c87d23

https://github.com/PaulStoffregen/cores/commit/041bfbe9ed85b13f1a833b6c81fe2cd54d1b1ac6

This was a particularly tough problem. It's been quite a while since I've had to dig so much into the compiler's generated asm to see...

<Good timing was just coming to ask about your test code?>

Yes, when that Frac could jump a us in advance was the only way I saw it generating a problem - but wasn't sure adding code to limit was the answer.

I'll pull those changes and see what I see in a test sketch I've been working with 24, 36, 600 MHz.

With 1.49 Beta 3 code:: In doing 32K samples ( counted while() with 10 inline doing array index increment on each call ) it looks like 39 cycles per iteration average - so having 2 us between calls is to be expected at 24 MHz. Last run shows zero over 1us between calls at 600 and 396 MHz. I'll have to add ZERO us delta to stats to see new code variance.

In my sketch I'm seeing some USB garbage and output order misqueues/missing in output dumping results of 32K successive runs in inline groups of 10 - didn't help boiling it down :( Even added a delay 10us on each newline.

Also finding when the sketch is running that I have to push the button to reprogram when it runs tight {non delay} code
 
@Paul: re p#36 / p#43 using the updated github cores in the sketch here at 24 MHz:

Short Summary: You did good with the ASM and rework - micros() is running 3 cycles faster by this measure. And net effect looks like a reduction of 24% of sequential calls to micros() returning 2 us or more later, and 37% more returning in 1 us.
And no indication of resulting side effects.

The data for that using github updated with 'frac':
Code:
	 [B]15529 [/B]at 1us    [COLOR="#FF0000"]// 37% more here[/COLOR]
	 17186 at 2us
	     9 at 3us
	    40 at 4us
	     2 at 5us
	     1 at 16us

Test [U]cycles per iteration 36[/U]	Test Cnt over 1us [B]17238 [/B]of 32767  [COLOR="#FF0000"]// 24% fewer here[/COLOR]

And testing with 1.49 b3 code:
Code:
	 [COLOR="#FF0000"]11334 [/COLOR]at 1us
	 21377 at 2us
	     1 at 3us
	    40 at 4us
	    14 at 5us
	     1 at 16us

Test [U]cycles per iteration 39[/U]	Test Cnt over 1us [COLOR="#FF0000"]21433 [/COLOR]of 32767


Same results at 36 MHz are more dramatically better with FRAC:
Code:
	 32198 at 1us
	   535 at 2us
	    34 at 3us

Test cycles per iteration 36	Test Cnt over 1us 569 of 32767

td1.49b3 at 36 MHz:
Code:
	     3 at 0us
	 29653 at 1us
	  3081 at 2us
	    30 at 3us

Test cycles per iteration 39	Test Cnt over 1us 3111 of 32767
 
Last edited:
More speedup in micros() might be possible if the 3 state variables are put into a struct. That way the compiler only has to load a single address to access all 3. But that might break other code which has "systick_millis_count", so some sort of alias special linker symbol might be needed if that becomes a problem.

I'm not planning to try this now, but it's an idea for future optimization...
 
More speedup in micros() might be possible if the 3 state variables are put into a struct. That way the compiler only has to load a single address to access all 3. But that might break other code which has "systick_millis_count", so some sort of alias special linker symbol might be needed if that becomes a problem.

I'm not planning to try this now, but it's an idea for future optimization...

If it helped the two 'new to t4' values are easy - the systick_millis_count would be bad without reference leaving the old behind - looks like reference& is a .cpp feature?

/\ cross posted p#44 edit with 36 MHz showing 540% reduction in over 1us as tested.
 
Status
Not open for further replies.
Back
Top