Coremark

Mcu32

Well-known member
In the other thread, CoreMark was brought up again. Lady Ada seems to have expanded it for the RP2350 so that it takes both cores into account. I have extended the code further, and it now also runs with FreeRTOS, i.e. with "real" preemptive multitasking on the ESP32.


Unfortunately, no one has recorded the GCC flags yet, so please take the old results with a grain of salt.

Of course, T4 is the fastest.

--------------------------

Measures the number of times per second your processor can perform a variety of common tasks: linked-list management, matrix multiplication, and state-machine execution.

  • RP2350 (SMP): cooperative scheduling
  • FreeRTOS: preemptive scheduling
The RP2350 runs bare-metal tasks on each core with no scheduler, maximizing per-core throughput for benchmarks like CoreMark.

ESP32 with FreeRTOS uses preemptive multitasking, adding context-switch overhead that slightly lowers peak performance but improves responsiveness for concurrent tasks.

| Board | CoreMark | GCC switches | Cores used |
|---|---|---|---|
| Teensy 4.0 | 2313.57 | n/a | 1 |
| RP2350 Dual Core (276MHz overclock, -O3) | 1437.00 | n/a | 2 |
| RP2350 Dual Core (200MHz overclock, -O3) | 1041.00 | n/a | 2 |
| ESP32 WROOM 32 xtensa 240MHz | 1032.62 | -O3 -fjump-tables -ftree-switch-conversion | 2 |
| RP2350 Dual Core (150MHz) | 600.00 | n/a | 2 |
| Adafruit Metro M4 (200MHz overclock, 'dragons' optimization) | 536.35 | n/a | 1 |
| ESP32 WROOM 32 xtensa 240MHz | 519.75 | -O3 -fjump-tables -ftree-switch-conversion | 1 |
| Adafruit Metro M4 (180MHz overclock, faster optimizations) | 458.19 | n/a | 1 |
| Teensy 3.6 | 440.72 | n/a | 1 |
| ESP32-C3 160MHz | 409.72 | -O3 -fjump-tables -ftree-switch-conversion | 1 |
| Sparkfun ESP32 Thing | 351.33 | n/a | 1 |
| Adafruit HUZZAH 32 | 351.35 | n/a | 1 |
| Teensy 3.5 | 265.50 | n/a | 1 |
| Teensy 3.2 (96MHz overclock, faster optimizations) | 218.26 | n/a | 1 |
| Adafruit Metro M4 (120MHz, smaller code) | 214.85 | n/a | 1 |
| Teensy 3.2 (72MHz) | 168.62 | n/a | 1 |
| Teensy 3.2 (72MHz, smaller code) | 126.76 | n/a | 1 |
| Arduino Due | 94.95 | n/a | 1 |
| Arduino Zero | 56.86 | n/a | 1 |
| Arduino Nano Every | 8.20 | n/a | 1 |
| Arduino Mega | 7.03 | n/a | 1 |
(larger numbers are better)
 
Are you using PlatformIO by chance? If you have an ESP32-S3 code/project I can run on the newest ESP32-P4 and post the results.
 
PlatformIO, yes (see link above). You have to add the P4 yourself - I can't test it.
 
Single core ESP32-P4 @360MHz result (-DMULTITHREAD=1)

CoreMark Performance Benchmark

CoreMark measures how quickly your processor can manage linked
lists, compute matrix multiply, and execute state machine code.

Iterations/Sec is the main benchmark result, higher numbers are better
Running.... (usually requires 12 to 20 seconds)

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 19066
Total time (secs): 19.07
Iterations/Sec : 1048.99
Iterations : 20000
Compiler version : GCC14.2.0
Compiler flags : -O3 -fjump-tables -ftree-switch-conversion
Memory location : STACK
seedcrc : 0xE9F5
[0]crclist : 0xE714
[0]crcmatrix : 0x1FD7
[0]crcstate : 0x8E3A
[0]crcfinal : 0x382F
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 1048.99 / GCC14.2.0 / STACK


Dual core ESP32-P4 @360MHz ( -DMULTITHREAD=2)
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 19249
Total time (secs): 19.25
Iterations/Sec : 2078.03
Iterations : 40000
Compiler version : GCC14.2.0
Compiler flags : -O3 -fjump-tables -ftree-switch-conversion
CoreMark 1.0 : 2078.03 / GCC14.2.0 / STACK
 
There is a #define MULTITHREAD that must be defined via platformio build_flags to allow multithreading. For two threads:

build_flags=
-DMULTITHREAD=2
 
Hey guys
How do you run coremark on dual core - run it on each core?
Lady Ada added the dual-core code for the RP2350, and I took her code and added support for ESP32/FreeRTOS.
There is a #define MULTITHREAD that must be defined via platformio build_flags to allow multithreading. For two threads:

build_flags=
-DMULTITHREAD=2
Good idea. I'll make some adjustments to the code to improve automatic architecture detection.
 
Thank you, Tomas. I've added your benchmark results to the README and made the detection code more robust.

Indeed,

Code:
build_flags=
-DMULTITHREAD=2

(or 1) in platformio.ini is enough.
 
I'm wondering why all the "RP2350 Dual Core" results end in .00?


 
I noticed a lot of the results are with overclocking enabled. As another data point, a couple years ago I ran CoreMark in an infinite loop on Teensy 4.1 running overclocked at 1.008GHz to conduct thermal stress testing using a heatsink with active cooling. It came in at 3887.

Just ran it again using the same CoreMark code out of curiosity and got a higher number, perhaps due to compiler optimizations since the original test?
CoreMark Size : 666
Total ticks : 14838
Total time (secs): 14.84
Iterations/Sec : 4043.67
Iterations : 60000
Compiler version : GCC11.3.1 20220712
Compiler flags : (flags unknown)
Memory location : STACK
seedcrc : 0xE9F5
[0]crclist : 0xE714
[0]crcmatrix : 0x1FD7
[0]crcstate : 0x8E3A
[0]crcfinal : 0xBD59
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 4043.67 / GCC11.3.1 20220712 (flags unknown) / STACK
 
Yes, the version of the compiler has a major impact. Optimizations are constantly being worked on.
 
Good test @KenHahn, and if OC'd CPU results are in the master list, it would be good to throw in some OC'd Teensy ones, too, for reference. When my T41s are operational, they run at 816MHz with passive cooling (a small copper finned heatsink); they've been doing that for several years without any issues.
 
In my opinion it does not make sense to use overclocked results (at least for the sake of comparing different processors), because overclocking is subject to the silicon lottery and reliability is questionable. If the comparison is meant to be fair, stock speeds should be used.
Of course they might be interesting anyway just out of pure curiosity.
 
That depends on the definition and how individual manufacturers define the safety margin. It could be that some want a large safety margin and others don't. “Overclocking” is a very vague term. We can't know that. And yes, I agree, I would say that the official MHz figure should be the one to go by. But everyone is entitled to their own opinion.
 
That depends on the definition and how individual manufacturers define the safety margin.
That is very true. In fact, the ESP32-P4 does not seem to have any safety margin at all: their 400MHz part only runs stably at 360MHz, at least with the currently available chips.
 
All the numbers I published were based on simply running the CoreMark program on each board using Arduino IDE with all settings at the defaults that board's Arduino core library / platform uses. The idea was to show the relative performance if you just use each board as it's provided.

Even that turned out to be controversial, since we've always defaulted Teensy 3.2 to 96 MHz, even though Freescale (now NXP) rated the part for 72 MHz. Some people had pretty strong opinions (and words) about Teensy 3.2's score, even though it was a comparison to show how much faster Teensy 4.0 would be... and lowering Teensy 3.2's speed would have shown an even larger gap.

ARM wasn't the only email I got. Turns out a lot of people have really strong brand loyalty to particular chips and will endlessly argue that the benchmark should be run on their preferred chip in some way that raises its score. Some even argued for changing the code in various ways. It's exhausting. I haven't touched CoreMark for any sort of public publishing since.
 
Also, benchmarks don't tell the whole story. The FPU on ARM M7 (Teensy 4.x) seems to be pipelined better than on RISC-V.
 
Cache size also really matters. An 8K instruction cache is the reason why Teensy 3.6 scores 66% faster than Teensy 3.5, even though it has the same Cortex-M4 processor running at only a 50% higher clock speed.

If instruction and data caches are large enough to fit all the CoreMark code and data, you'll see a much better score. I suppose that's fair, in that a real program that fits in the cache will also benefit. But a lot of real programs are larger or manipulate larger data sets. A small benchmark (small enough to run on Arduino Mega with 8K RAM but not tiny enough for Arduino Uno with 2K RAM) doesn't necessarily give results that extrapolate up to larger programs.
 
Yes, you always have to be aware of what is being measured and how in order to obtain a reasonably reliable answer. Coremark also has many critics.
For example, floats and doubles are not measured, so an existing FPU doesn't show up in the result; that would push the T4 even higher.
But that's not really that important. I think that an MCU with a poor score will not execute any program significantly faster than an MCU with a better rating. In addition, the absolute numbers tempt you to take them too seriously. Actually, they should only be used to compare the same MCU repeatedly, e.g., with different compilers or optimization flags. In all other cases, you can only roughly estimate how good the performance is.

This brings me to a point of criticism regarding Lady Ada's variant. It waits until both cores are finished. This is ideal for the RP2040, of course, because 1. both cores are equally fast, and 2. the pico's multithreading has no overhead.

However, there are MCUs where one core is much slower than the other. There, too, the program would wait until both are finished – so the slow core would greatly distort the measurement result.

Question: I think this needs to be changed. The measurement should be stopped when the first core is finished. What do you think?
 
The other approach would be to fix measurement time and each core should be allowed to do as many iterations as it can within given time. Then the sum of number of iterations done on each core should be used to calculate total performance. This would address multiple cores of unequal speed. But then it will no longer be called CoreMark I guess.
 