Coremark

But then it will no longer be called CoreMark I guess.

Yes, that's correct. The CoreMark code comes with a lot of guidelines, which nobody seems to fully follow. But they are very specific that you may not change anything within the actual code and still call the result "CoreMark".
 
Both single core and multicore are important. Although, if I read the comments correctly, the code doesn't seem to exercise memory access much.

Most current code will be single core and needs to be adapted for multicore. It's nice to know how close you are.

BTW, I don't think the digits after the decimal point are relevant. They may change more than that from run to run.
 
I've added CoreMark results for Teensy 4 / Teensy 3.6; both are a lot better now (newer compiler?) (see GitHub).

Q: Is -O2 or -O3 still the best? T4 seems faster with -O2, T3.6 faster with -O3. Should I add some switches/flags?
 

correct, however, here comes the pile-on-phil-train choo choo, that train is never late :)

anyhoo, for the chill nice people: limor was feeding the baby at the same time as making the commits and rounded down for simplicity; the changes were made with Claude Code over speech-to-text. we can re-run and commit with decimal points if that will get the PR (https://github.com/PaulStoffregen/CoreMark/pull/17) merged in
 
I'm not maintaining that CoreMark code, as you might have suspected from seeing the last commit was 7 years ago.

I really should do something like put that repo, and several others I will never touch again, into archive mode. Over the last 7 years I've had a tremendous number of demands on my time. Updating or even archiving old GitHub repos has been pretty much at the absolute bottom of my priority list.
 

totally fair, and thanks for clarifying. we have over 1900 repos; as you can imagine, folks have tons of demands that we support and optimize for their hardware in our libraries (even if we don't own that hardware!). if the repo is effectively abandoned, we will treat it as such and move forward in an active fork where results can be reproduced, reviewed, and improved. coremark compliance is fine.

folks feel free to PR to the @ladyada fork, we've already merged one request with Teensy 4.0 numbers earlier this morning https://github.com/ladyada/CoreMark/pull/2
 
It was clear to me from the start that this would not be merged. No, it would be too tedious to get the code back into a form that you would merge. Of course, you can merge it if you want. Also, the numbers in the readme... I updated Teensy 4 and 3.6 (much higher numbers). I've already lost interest; I was just curious to see how other dual cores perform.
 
About Overclocking and Temperature

Normally overclocking leads to increased temperature, not only in the chip but also in the surrounding parts.

Temperature of electronic parts can have two different effects:
a short-term, reversible effect on reliability, and an effect on the ageing of parts, which leads to reduced life. While it is rather easy to find data about the ageing of electrolytic capacitors, I would be interested to know more about temperature effects on the endurance life of other parts.

What I did find is some data about the endurance life of flash storage, but there the main problems arise from high storage temperature while the flash is not in use and therefore cannot be refreshed: https://www.ni.com/de/support/docum...tanding-life-expectancy-of-flash-storage.html

A general paper about endurance life from TI: https://www.ti.com/lit/an/snoa994/snoa994.pdf? They state that a semiconductor device which has a "useful life" of 10 years at 105°C (at the chip!) will have a life of 2 years at 125°C. I think the paper indicates that you have to be careful if you use calibrated parts like an ADC.
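TI's figures are consistent with the standard Arrhenius acceleration model for temperature-driven ageing. A quick back-of-envelope check (the activation energy Ea = 1.0 eV below is back-solved so the model reproduces TI's "10 years at 105°C, 2 years at 125°C"; it is an assumption, not a value taken from the app note):

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def acceleration_factor(t_use_c, t_stress_c, ea_ev=1.0):
    """How much faster parts age at t_stress_c than at t_use_c (both in °C)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use_k - 1.0 / t_stress_k))

af = acceleration_factor(105, 125)
print(f"Ageing acceleration from 105°C to 125°C: {af:.2f}x")
print(f"A 10-year life shrinks to about {10 / af:.1f} years")
```

A 20°C rise roughly quintuples the ageing rate under these assumptions, which is why even modest overclocking headroom matters for parts expected to run 24/7.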

I have an ESP32-based board at a sunny spot where I grow my tomatoes; it monitors light intensity, humidity and temperature to calculate irrigation water demand. The parts are shaded all the time, but the place gets hot in summer and can reach ~45°C in the afternoon. The OLED display failed first and has been totally unreadable since the 2nd summer. The DHT-22 humidity sensor failed second, in its 2nd summer; its temperature part is still working. The controller, including flash, WLAN, ..., has survived so far.

Of course these thoughts about endurance life are not relevant for many hobby usages, where an HDMI picture is wanted for a few hours and nothing is lost if the device fails. But even for most of my hobby usages, it is not a good idea to risk failure due to increased temperature. My oldest item running 24/7 is a clock with a 68HC11 (the first with on-board EEPROM), now for 31 years, I think. My Teensy 3.5 based clock has already been running permanently without any problems since 2018: https://forum.pjrc.com/index.php?threads/making-a-cuckoo-clock-with-teensy.53938/

Cheers Christof
 
running CoreMark here - from the PJRC copy? - last used three years ago, now with the newest TD 1.60b6

T_4.1 at 600 MHz FASTER:
CoreMark 1.0 : 2406.93 / GCC11.3.1 20220712 (flags unknown) / STACK

T_4.1 at 528 MHz FASTER:
CoreMark 1.0 : 2118.20 / GCC11.3.1 20220712 (flags unknown) / STACK

A linear ~13% difference, matching the 600/528 clock ratio.

quick OC run no heat sink at 816 MHz:
CoreMark 1.0 : 3273.59 / GCC11.3.1 20220712 (flags unknown) / STACK
 
Thanks. I could have guessed it, given the TCM memory.
 
encrypting 21504 bytes to T:\TEMP\arduino_build_106015/CoreMark.ino.ehex

Memory Usage on Teensy 4.1:
FLASH: code:16708, data:4040, headers:8944 free for files:8096772
RAM1: variables:4864, code:14192, padding:18576 free for local variables:486656
RAM2: variables:12416 free for malloc/new:511872

TCM - though the code is also small enough to run from the 32 KB cache after the first read.
 
Hm???
I have ported ueForth (ESP32Forth) to the Pico 2 W and found a way to start a second Forth instance on core1. Together with the WLAN code, the Forth system compiles to "Sketch uses 467604 bytes (44%) of program storage space." with Arduino. I am then using the old Byte Sieve benchmark.
While CoreMark reportedly more or less doubles throughput when using the second core, in my application the same task running on core1 sometimes (not always) runs at only half speed: 64 ms on core1 for one iteration finding 1899 primes, instead of 29 ms on core0.
There is clearly an influence on core1's speed depending on what core0 is doing at the time.
There are a lot of "obscure" things like USB-serial and WLAN going on in the background, so I have no idea about the reasons. But I wonder if the size of the code and caching effects might throttle the system?

Yes, you always have to be aware of what is being measured, and how, in order to obtain a reasonably reliable answer. CoreMark also has many critics.
For example, “floats” and “doubles” are not measured (so you don't see an existing FPU in the result – that would push the T4 even higher).
But that's not really that important. I think that an MCU with a poor score will not execute any program significantly faster than an MCU with a better rating. In addition, the absolute numbers tempt you to take them too seriously. Actually, they should only be used to compare the same MCU repeatedly, e.g., with different compilers or optimization flags. In all other cases, you can only roughly estimate how good the performance is.

This brings me to a point of criticism regarding Lady Ada's variant. It waits until both cores are finished. This is ideal for the RP2040, of course, because 1. both cores are equally fast, and 2. the Pico's multithreading has no overhead.

However, there are MCUs where one core is much slower than the other. There, too, the program would wait until both are finished – so the slow core would greatly distort the measurement result.

Question: I think this needs to be changed. The measurement should be stopped when the first core is finished. What do you think?

I think it's reality that it is difficult to split up work. It is not often that you have a task that can easily be done independently of other tasks, even if you have identical cores. For the Parallax P2 processor, I once attempted to split everything that could be done at a given moment into very tiny chunks of work and put the chunks into a queue. Each core watches the queue and takes a chunk when it is idling.
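The chunk-queue scheme can be sketched like this (a minimal illustration with threads standing in for cores; the workload, summing ranges of numbers, is a trivial stand-in, not the Parallax P2 code):

```python
import queue
import threading

work = queue.Queue()
results = queue.Queue()

def worker():
    # Each "core" pulls the next chunk whenever it is idle.
    while True:
        chunk = work.get()
        if chunk is None:          # sentinel: queue drained, stop
            break
        results.put(sum(chunk))    # do one small unit of work

chunks = [range(i, i + 100) for i in range(0, 1000, 100)]
for c in chunks:
    work.put(c)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for _ in threads:
    work.put(None)                 # one stop sentinel per worker
for t in threads:
    t.join()

total = sum(results.get() for _ in chunks)
print(total)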

(For fun, I just asked Google AI. It quotes some "Brooks's law" and says that for human work, throughput increases with the number of persons n according to the formula P(n) = P1 * n * (1 - c)^(n - 1), with c = 0.05...0.15. So for 2 people instead of one: 2 * 0.85 = 1.7.
When asked about computer cores, it gives "Gunther's law", which says that too many cores might even yield less power than one. For database applications it gave P(n) = P1 * sqrt(n) - this sounds like a useful conservative estimate.)
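The two quoted formulas side by side, for anyone who wants to play with the parameters. The "Gunther's law" the AI referred to is presumably the Universal Scalability Law (USL); the sigma (contention) and kappa (coherency) values below are illustrative, not measured on any real MCU:

```python
def team_throughput(n, p1=1.0, c=0.15):
    """Quoted human-team formula: P(n) = P1 * n * (1 - c)^(n - 1)."""
    return p1 * n * (1.0 - c) ** (n - 1)

def usl_capacity(n, sigma=0.05, kappa=0.01):
    """Universal Scalability Law: relative capacity of n cores."""
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 2, 4, 8):
    print(n, round(team_throughput(n), 2), round(usl_capacity(n), 2))
```

With c = 0.15 the team formula gives 1.7 for two workers, matching the example above; the USL curve rises sub-linearly and eventually turns down as the coherency term dominates.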
 
This is quite normal. CoreMark will probably run entirely inside the core's cache and thus only exercise the core itself. Your test is probably using main memory, and that is shared.
 
Going to jump in here again about running on multi-core processors and using the multi-thread setting. Reason -- I was poking around the web and asking AI :)

CoreMark Multi Core Reporting Requirement

According to the official EEMBC CoreMark Run and Reporting Rules
(`docs/run_and_report.pdf`, Section 1.2), each CoreMark instance must run
on exactly one CPU core, and all instances must be reported separately:

“Each CoreMark instance must execute on a single core. When running
multiple instances, each instance must be reported separately.”

This project follows the EEMBC requirement by pinning each CoreMark
process to a dedicated core and reporting per core results individually.

Multi-threading:
Multi-threading:
🧵 What the CoreMark Multi Thread Setting Actually Does

CoreMark does not implement real multithreading in the sense of shared memory parallelism, mutexes, or work stealing. Instead, the “multi thread” setting simply creates multiple independent CoreMark contexts, each of which:

•    Runs the full CoreMark workload
•    Has its own data structures
•    Does not share memory with other contexts
•    Must run on one CPU core only (per EEMBC rules)

Think of it as launching N separate CoreMark instances inside one process.

🧩 What It Does Internally
When you enable multi thread mode:
•    CoreMark allocates N copies of the benchmark state
•    It spawns N threads, each running one context
•    Each thread executes the benchmark loop independently
•    There is no synchronization between threads
•    There is no parallel speedup unless the OS schedules each thread on a different core

This is why EEMBC calls them contexts, not threads.
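The behaviour described above can be sketched like this (threads standing in for the contexts; the workload is a toy loop, not the real CoreMark kernels, and all names are made up):

```python
import threading
import time

def workload(iterations):
    state = 0xFFFF                        # per-context state, never shared
    for i in range(iterations):
        state = (state * 31 + i) & 0xFFFF
    return state

def run_context(ctx_id, iterations, scores):
    # Each context runs the full workload independently and records
    # its own score; there is no synchronization with other contexts.
    t0 = time.perf_counter()
    workload(iterations)
    scores[ctx_id] = iterations / (time.perf_counter() - t0)

N = 4
scores = [0.0] * N
threads = [threading.Thread(target=run_context, args=(i, 100_000, scores))
           for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Per the EEMBC rules, each context is reported separately:
for i, s in enumerate(scores):
    print(f"context {i}: {s:.0f} iterations/s")
```

Note that whether the contexts actually run in parallel is entirely up to the scheduler; the benchmark itself only launches them.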

🧠 What It Does Not Do

The multi thread setting does not:

•    Split the workload across cores
•    Parallelize the benchmark
•    Measure shared memory performance
•    Measure thread scalability
•    Measure OS thread scheduling performance
•    Combine results into a single “multi core score”

CoreMark is intentionally not a system benchmark.

📊 How Results Must Be Reported

EEMBC requires:

•    Each thread/context = one CoreMark instance
•    Each instance must be reported separately
•    You may optionally show a “total throughput” number, but only after listing each instance individually

This is why vendors report results like:

Core 0: 3.21 CoreMark/MHz
Core 1: 3.20 CoreMark/MHz
Core 2: 3.21 CoreMark/MHz
Core 3: 3.20 CoreMark/MHz
Total: 12.82 CoreMark/MHz

🛠️ Practical Meaning for You

If you set MULTITHREAD=4, CoreMark will:

•    Spawn 4 threads
•    Run 4 independent benchmarks
•    Rely on the OS to schedule them
•    Produce 4 separate scores

If you want true per core correctness, you must:
•    Pin each thread to a specific core
•    Or run 4 separate processes, each pinned to a core
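The "separate processes, each pinned to a core" option can be sketched like this on a hosted OS (using the Linux-only `os.sched_setaffinity` call; on other platforms the pinning step differs and is skipped gracefully here; the workload is a stand-in):

```python
import multiprocessing as mp
import os

def pinned_instance(core, out):
    try:
        os.sched_setaffinity(0, {core})   # pin this process to one core
    except (AttributeError, OSError):
        pass                              # no affinity API: run unpinned
    out.put((core, sum(i * i for i in range(100_000))))

if __name__ == "__main__":
    cores = range(min(4, os.cpu_count() or 1))
    out = mp.Queue()
    procs = [mp.Process(target=pinned_instance, args=(c, out)) for c in cores]
    for p in procs:
        p.start()
    results = sorted(out.get() for _ in procs)
    for p in procs:
        p.join()
    for core, value in results:
        print(f"core {core}: {value}")
```

On a bare-metal MCU there is no OS to pin anything, of course; there the equivalent is launching the benchmark explicitly on each core, as the RP2040 SDK's core1 entry point does.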

So the question for everyone is: does this make sense, and when dealing with multi-core processors do we really have to show the results for each core separately?
 
That looks pretty good and doable. Are there any volunteers? :)
 
I vaguely remember an attempt to port FreeRTOS to Teensy. Was that actually successful? Can you download it somewhere and does it run “out of the box”?
 
I had a play with this one (link to dev branch with a “better” implementation of yield()). It worked fine, but I didn’t explore deeply enough to find out whether the thread-naïve Teensy libraries cause serious issues.
 