CircuitPython on Teensy 4!

I may consider a way for future bootloaders to preserve the filesystem portion of CircuitPython. But hacking the bootloader is a very serious matter, as there is a very real risk of bricked boards if things go wrong. That's why much of the design goal of the bootloader is to keep things simple & low-risk. Whether I put engineering time into this and other ways to support CircuitPython will really depend upon whether PJRC sees significant Teensy sales to people using CircuitPython.
 
The latest (2nd) firmware from @tannewt is a lot faster than the 1st one :
- performanceTest with firmware CircuitPython 5.0.0-beta.3-6-g926375d99-dirty on 2020-01-10 @ Teensy 4.0 : 300871
- performanceTest with firmware CircuitPython 5.0.0-beta.3-69-g1c3960634 on 2020-01-18 @ Teensy 4.0 : 4005428 # 13.3x, greater is better
- hsquare with firmware CircuitPython 5.0.0-beta.3-6-g926375d99-dirty on 2020-01-10 @ Teensy 4.0 : 155.3999519348145 us
- hsquare with firmware CircuitPython 5.0.0-beta.3-69-g1c3960634 on 2020-01-18 @ Teensy 4.0 : 10.7399976253510 us # 6.91% of the earlier time (about 14.5x faster), lower is better
Using benchmark scripts adapted from the "Benchmark comparison of MicroPython boards" topic in the MicroPython forum.

But this 2nd firmware has bugs in the REPL and, sometimes, when reading files such as the pystone benchmark.
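
For anyone who wants to check the ratios quoted in the list above, here is a minimal sketch in plain Python. The variable names are mine; the numbers are copied from the list.

```python
# Figures copied from the benchmark list above (Teensy 4.0).
perf_old = 300871                  # performanceTest count, 2020-01-10 build
perf_new = 4005428                 # performanceTest count, 2020-01-18 build
hsq_old_us = 155.3999519348145     # hsquare time in microseconds, 2020-01-10 build
hsq_new_us = 10.7399976253510      # hsquare time in microseconds, 2020-01-18 build

perf_speedup = perf_new / perf_old       # a higher count is better
hsq_fraction = hsq_new_us / hsq_old_us   # a lower time is better
hsq_speedup = hsq_old_us / hsq_new_us    # the same comparison as a speed-up factor

print("performanceTest: %.1fx" % perf_speedup)
print("hsquare: %.2f%% of the old time, %.1fx faster"
      % (hsq_fraction * 100, hsq_speedup))
```

Note that the two comments in the list use different conventions: the first is a speed-up factor, the second is the new time as a fraction of the old one.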
 
I may consider a way for future bootloaders to preserve the filesystem portion of CircuitPython. But hacking the bootloader is a very serious matter, as there is a very real risk of bricked boards if things go wrong. That's why much of the design goal of the bootloader is to keep things simple & low-risk. Whether I put engineering time into this and other ways to support CircuitPython will really depend upon whether PJRC sees significant Teensy sales to people using CircuitPython.

It's not my intent to hack the Teensy Bootloader at all. My intention would be to provide another bootloader alongside CircuitPython. How to enter the bootloader may be an issue though. Does the button on Teensy reset the main chip on every press? We usually recognize a double reset as meant to enter the bootloader.
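
To make the double-reset idea concrete, here is a rough simulation in plain Python. It is not Teensy or CircuitPython code: `RetainedRam`, `MAGIC` and `boot()` are hypothetical names, and the sketch assumes there is a RAM word that keeps its value across a reset and that the button really does reset the chip, which is the open question above.

```python
# Hypothetical simulation of double-reset detection; nothing here is a real API.
MAGIC = 0xB007B007  # arbitrary marker value chosen for this sketch


class RetainedRam:
    """Stands in for a RAM word that keeps its value across a reset."""

    def __init__(self):
        self.word = 0


def boot(ram, reset_again_within_window):
    """Simulate one pass through the early boot code and return what happens next."""
    if ram.word == MAGIC:
        # The previous boot was interrupted by a second reset: stay in the bootloader.
        ram.word = 0
        return "bootloader"
    ram.word = MAGIC              # arm the marker, then wait for a short window
    if reset_again_within_window:
        return "reset"            # the chip resets while the marker is still set
    ram.word = 0                  # window expired: disarm and carry on
    return "application"


ram = RetainedRam()
print(boot(ram, False))  # single press
print(boot(ram, True))   # first press of a double press
print(boot(ram, False))  # second press finds the marker
```

If the button does not reset the main chip on every press, the marker is never seen twice and some other entry mechanism would be needed.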

I don't expect you to put time into it until you feel it is worthwhile.

Thanks for testing this! I believe the bugs are due to the DCache interacting with TinyUSB. I've turned it off in my PR and attached the build from the GitHub Actions CI. (All boards are built for pending PRs and the files can be downloaded from GitHub. To do so, click the red x or green checkmark next to a commit, get details on a build and then click artifacts. The dropdown will have every board and it'll have a zip of all languages for that board.) I'd expect it to be similar speed and less buggy.
 

Attachments

  • adafruit-circuitpython-teensy40-en_US-20200119-7f96015.hex.zip
    266.5 KB · Views: 92
Thanks for the CircuitPython for Teensy 4.0 v2020-01-19. The benchmarks show that it's as fast as v2020-01-18.

The REPL is better (no bugs so far).

But the same pystone benchmark (see attached file) works OK on firmware v2020-01-10 :
Code:
Adafruit CircuitPython 5.0.0-beta.3-6-g926375d99-dirty on 2020-01-10; Teensy 4.0 with IMXRT1062DVJ6A

>>> import pystone_lowmem_monotonic
>>> pystone_lowmem_monotonic.main()
Pystone(1.2) time for 500 passes = 761000000 ms
This machine benchmarks at 657 pystones/second
But not on v2020-01-18 and v2020-01-19 :
Code:
Adafruit CircuitPython 5.0.0-beta.3-70-g7f960151b on 2020-01-19; Teensy 4.0 with IMXRT1062DVJ6A

>>> import pystone_lowmem_monotonic
>>> pystone_lowmem_monotonic.main()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'main'
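
A side note on the v2020-01-10 output above: the printed time and rate can be cross-checked with a few lines of plain Python. The variable names are mine, and reading the time as nanoseconds is only my inference from the arithmetic, not something stated in the thread.

```python
# Numbers copied from the v2020-01-10 pystone output above.
passes = 500
pystones_per_second = 657
reported_time = 761000000   # printed with an "ms" label

# Run time implied by the reported rate.
implied_seconds = passes / pystones_per_second
print("implied run time: %.3f s" % implied_seconds)

# The same reported figure, read as nanoseconds instead of milliseconds.
print("reported value as ns: %.3f s" % (reported_time / 1e9))
```

The two values agree to within a millisecond, so the pystones/second figure looks trustworthy even though the unit label on the time does not.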
 

Attachments

  • pystone_lowmem_monotonic.zip
    2.6 KB · Views: 98
@tannewt your latest hex file in post #53 only shows a build date of 20200119, i was hoping for something newer. That .hex file fails on T4 as before (post #45) when porting my longer .py scripts (llutm.py raytrace.py pystone.py wator.py) :( T4 does run hsquare.py faster (11.2 us) and 10 s counting script reaches 3247698. I still have the most success with your first .hex (2020-01-10). I've added some more performance numbers for that .hex in post #14
 
Will be cool to see the numbers with improved build in the table.
@manitou - can you note if High/Low is better for each subset.

Would be interesting to see something of the same benchmark using a cpp build to gauge overhead of interpreter.

As for adding bootloader-reserved space in FLASH: it would be nice for an Arduino sketch to have reserved space for rarely updated 'config' info that might be too big for EEPROM. One user wanted 4K, IIRC. I suppose that may fall out of a future Teensy 4.1 with larger FLASH, where space for a filesystem may be reserved if the T_4.0's 2MB doesn't justify it. Or altered EEPROM logic and numbers could free up reserved space there, at the cost of overall EEPROM size or the backing rewrite ratio. But an EEPROM change doesn't help this thread with storing Python code.
 
yeah, the high/low is left to the reader. most of early columns are time (us or s) but there are pystones and counter values where bigger is better.

to post #14 i've added a "T4 C" line for a rough comparison of interpreted python vs compiled C. The sketches can take advantage of hardware float, i think the python core uses double everywhere. The interpreter takes a lot of space too. On the M0+ circuit playground express with circuitpython, there is little room left for user scripts.

https://forum.micropython.org/viewtopic.php?t=2659
 
C looks like an unfair comparison :) - I didn't expect it to be that extreme. Will be interesting to see how T4 comes along. I saw that the linked table includes T_3.x's - odd that they don't stand out there.

'left to the reader' … okay forum isn't always good for tables anyhow ...
 
Thanks for the CircuitPython for Teensy 4.0 v2020-01-19. The benchmarks show that it's as fast as v2020-01-18.

The REPL is better (no bugs so far).

But the same pystone benchmark (see attached file) works OK on firmware v2020-01-10 :
Code:
Adafruit CircuitPython 5.0.0-beta.3-6-g926375d99-dirty on 2020-01-10; Teensy 4.0 with IMXRT1062DVJ6A

>>> import pystone_lowmem_monotonic
>>> pystone_lowmem_monotonic.main()
Pystone(1.2) time for 500 passes = 761000000 ms
This machine benchmarks at 657 pystones/second
But not on v2020-01-18 and v2020-01-19 :
Code:
Adafruit CircuitPython 5.0.0-beta.3-70-g7f960151b on 2020-01-19; Teensy 4.0 with IMXRT1062DVJ6A

>>> import pystone_lowmem_monotonic
>>> pystone_lowmem_monotonic.main()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'main'

Interesting! What does `dir(pystone_lowmem_monotonic)` show? That will list all names within the module.
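
A small sketch of that check, written for desktop Python so it can be tried without a board. `describe_module` and the two stand-in modules are hypothetical; on the board you would simply pass the imported `pystone_lowmem_monotonic` module (or call `dir()` on it at the REPL).

```python
import types


def describe_module(module, expected=("main",)):
    """Return (public_names, missing) for an already imported module."""
    public = sorted(name for name in dir(module) if not name.startswith("_"))
    missing = [name for name in expected if not hasattr(module, name)]
    return public, missing


# Stand-in modules built in memory: one complete, one that stops before main().
complete = types.ModuleType("complete_demo")
exec("def main():\n    return 'ok'\n", complete.__dict__)

truncated = types.ModuleType("truncated_demo")
exec("LOOPS = 500\n", truncated.__dict__)

print(describe_module(complete))
print(describe_module(truncated))
```

If the real module lists only the names defined near the top of the file, that would point at the file being cut short on the drive rather than at the interpreter.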

@tannewt your latest hex file in post #53 only shows a build date of 20200119, i was hoping for something newer. That .hex file fails on T4 as before (post #45) when porting my longer .py scripts (llutm.py raytrace.py pystone.py wator.py) :( T4 does run hsquare.py faster (11.2 us) and 10 s counting script reaches 3247698. I still have the most success with your first .hex (2020-01-10). I've added some more performance numbers for that .hex in post #14

There are no newer changes. This port isn't my top priority currently. (My top priority is fixing bugs for 5.0.0 stable and making sure Bluetooth on the nRF52840 is solid.) The syntax error problem sounds like the file didn't transfer successfully. I'm surprised you are still hitting it with the DCache turned off.

C looks like an unfair comparison :) - I didn't expect it to be that extreme. Will be interesting to see how T4 comes along. I saw that the linked table includes T_3.x's - odd that they don't stand out there.

'left to the reader' … okay forum isn't always good for tables anyhow ...

C will always be much faster to run than Python. However, Python is much faster to write and iterate on. CircuitPython/MicroPython gives you the best of both worlds: use Python to connect together C bits that are fast.
 
@tannewt I cloned your latest teensy4-dev branch and was able to run all of my benchmark CircuitPython scripts. I updated the results in post #14. As I understand it, DCache is disabled? So things could get faster with additional memory/cache tuning.

ref https://github.com/adafruit/circuitpython/pull/2532

Thanks for testing it out! I've merged that PR in so any changes can go on the master adafruit branch now. It does have the DCache disabled so we could get a bit of a speed boost from turning it back on. The build isn't using link time optimization yet either which should be able to speed things up as well.

Will let you know when I circle back around to it. Thanks!
 
Hack alert:

OK, I've updated post #14 again. I added SCB_EnableDCache() to cpu.voltage, and then invoke cpu.voltage at the start of each of my .py tests, so I should have DCache enabled, and performance (post #14) did improve. To get fact.py times i used GPT micros (also hacked into cpu.voltage), other tests are using time.monotonic()
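
For readers who have not seen the test scripts, a minimal sketch of the two timing styles mentioned in this thread, using only `time.monotonic()`. This is not the actual benchmark code; `count_for` and `time_call` are names I made up for illustration.

```python
import time


def count_for(seconds):
    """Count loop iterations completed in roughly `seconds` of wall time."""
    count = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        count += 1
    return count


def time_call(func, repeats=100):
    """Return the average time of one call to `func`, in microseconds."""
    start = time.monotonic()
    for _ in range(repeats):
        func()
    return (time.monotonic() - start) / repeats * 1e6


print(count_for(0.1))                      # bigger is better
print(time_call(lambda: sum(range(100))))  # smaller is better
```

Averaging over many repeats matters when the clock resolution is coarse compared with a single call, which is presumably why a microsecond hardware timer was used for the shortest test.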

Are you using ITCM/DTCM? The Teensy 4 core copies all of FLASH to RAM, runs instructions out of ITCM (0-delay), and keeps most of the data in DTCM.
ref: https://www.pjrc.com/store/teensy40.html
 
Whoa! Those numbers do look better! It'd be great to hunt down the TinyUSB bug so we can leave it on.

My PR adds basic support for the ITCM and the DTCM. I allocate 32k to each. Our first board will be the 1010 so I wanted to focus on that limited amount first. I tried to move all of the core VM stuff there to speed things up. The stack also lives in the DTCM now. You can play with adding things to the TCMs using the macros here: https://github.com/adafruit/circuitpython/blob/master/supervisor/linker.h#L32 It'd be worth validating that we're using the hardware float support too. I haven't looked into it yet.
 
It'd be worth validating that we're using the hardware float support too. I haven't looked into it yet.

i'm pretty sure FPU is configured (by SDK SystemInit()) and compiler switches are correct, and that double and float are working for your Teensy 4 branch. Many of my .py tests were float intensive (raytrace hsquare llutm)

i also embedded a C floating point test in cpu.temperature and speeds suggest FPU is working in circuitpython core
Code:
        ddot 2.12534e+10 93 ms   8.60 Mflops   double
        daxpy 2.12534e+10 61 ms  13.11 Mflops
         
        sdot 2.12578e+10 31 ms  25.81 Mflops   float
        saxpy 2.12578e+10 34 ms  23.53 Mflops

        cache enabled   SCB_EnableDCache();
        ddot 2.12534e+10 67 ms  11.94 Mflops
        daxpy 2.12534e+10 33 ms  24.24 Mflops

        sdot 2.12578e+10 4 ms 200.00 Mflops
        saxpy 2.12578e+10 7 ms 114.29 Mflops
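
The Mflops column in the listing above can be reproduced from the millisecond column. A short sketch in plain Python: the operation count of 800000 per run is an assumption I inferred from ms x Mflops in the listing, not a number taken from the test source.

```python
# (kernel, ms with DCache off, ms with DCache on), copied from the listing above.
runs = [("ddot", 93, 67), ("daxpy", 61, 33), ("sdot", 31, 4), ("saxpy", 34, 7)]

FLOPS_PER_RUN = 800000  # assumption: inferred from the listing, not from the C code


def mflops(ms):
    """Convert a run time in milliseconds to millions of operations per second."""
    return FLOPS_PER_RUN / (ms / 1000) / 1e6


for name, off_ms, on_ms in runs:
    print("%-5s %6.2f -> %6.2f Mflops, %.2fx with DCache"
          % (name, mflops(off_ms), mflops(on_ms), off_ms / on_ms))
```

With only 4 ms and 7 ms measured at millisecond resolution, the cached single-precision figures carry a large rounding error, so the 200 and 114 Mflops values are rough.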
 
Very cool! It looks like we always use float internally as well: https://github.com/adafruit/circuitpython/blob/master/py/circuitpy_mpconfig.h#L69
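
As an illustration of what single-precision floats mean for script results, here is a sketch for desktop Python that rounds values to 32 bits with `struct`. It assumes plain IEEE 754 single precision; I have not checked whether the CircuitPython object representation keeps all 32 bits, so treat it as an upper bound on precision. `as_float32` is a helper name of my own.

```python
import struct


def as_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(as_float32(0.1))          # close to 0.1, but not the same value as the double
print(as_float32(16777217.0))   # 2**24 + 1 cannot be represented in 32 bits
print(as_float32(0.5))          # powers of two survive unchanged
```

This is worth keeping in mind when comparing float-heavy benchmark output between a board and a desktop interpreter, since small differences in the last digits are expected.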
 
Micropython and Teensy 4.0

I know this thread is CircuitPython specific, but I just wanted to share something I came across on the MicroPython forum regarding Teensy 4.0: https://forum.micropython.org/viewt...sid=9f8c60ed902c44334fce34fa90a00c80&start=10

Someone seems to have started a port of MicroPython to the Teensy 4.0 (https://forum.micropython.org/viewt...c60ed902c44334fce34fa90a00c80&start=10#p43549). The MicroPython PR for the T4.0 is at https://github.com/micropython/micropython/pull/5558.

Looks kind of interesting in case anyone is interested.
 
I also haven't had time to do much with this. Focusing on hardware stuff right now. Hoping to get some serious time to work with Python in by mid-March...
 
All the drivers you get with CircuitPython sure make it seem easier for a newbie. I haven't started my embedded Python journey yet but will very soon. The T4 is the first microcontroller I feel is fast enough to run almost any script for almost any project.
 