MIMXRT1170-EVK: 1170 benchmarks (I'll update this post as additional tests are performed)
Just received my MIMXRT1170-EVK board from Mouser ($208). Using NXP MCUXpresso and the NXP SDK, I have run a few of the examples on both cores, cm7 (M7@1 GHz) and cm4 (M4@400MHz). Peripheral IO (including fast GPIO, XBAR, daisy) and timers (GPT 6, PIT 2x4, quad 4x4, flex PWM 4x8) look a lot like T4. I haven't figured out multicore usage yet.
Code:
NXP 1170 memory
512 Mbit SDRAM memory
512 Mbit Octal Flash
128 Mbit QSPI Flash
2 Gbit Raw NAND Flash
64 Mbit LPSPI Flash
cm7 location size
BOARD_FLASH 0x30000000 0x1000000
SRAM_DTC_cm7 0x20000000 0x40000
SRAM_ITC_cm7 0x0 0x40000
SRAM_OC1 0x20240000 0x80000
SRAM_OC2 0x202c0000 0x40000
NCACHE_REGION 0x20300000 0x40000
SRAM_OC_ECC1 0x20340000 0x10000
SRAM_OC_ECC2 0x20350000 0x10000
BOARD_SDRAM 0x80000000 0x4000000
cm4
BOARD_FLASH 0x8000000 0x1000000
OCRAM_DTCM_ALIAS 0x20220000 0x20000
SRAM_ITC_cm4 0x1ffe0000 0x20000
SRAM_OC1 0x20240000 0x40000
NCACHE_REGION 0x20280000 0x40000
SRAM_OC2 0x202c0000 0x80000
SRAM_OC_ECC1 0x20340000 0x10000
SRAM_OC_ECC2 0x20350000 0x10000
BOARD_SDRAM 0x80000000 0x4000000
cm7 stack 2003ff58 ext 200014b8 const 3000a21c fcn 30003fe1 malloc 200015f0
cm4 stack 2023ff50 ext 2022a748 const 20227fd8 fcn 20221d49 malloc 2022a880
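To read the address line for the cm7 (stack, ext, const, fcn, malloc), it helps to look each address up in the cm7 region table. Here is a minimal sketch of such a lookup. The region names, bases and sizes are copied from the cm7 listing above; the struct and function names are made up for illustration and are not part of the NXP SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* One linker region: name, base address, size in bytes. */
typedef struct {
    const char *name;
    uint32_t base;
    uint32_t size;
} region_t;

/* cm7 regions copied from the memory listing (illustrative table only). */
const region_t cm7_map[] = {
    {"BOARD_FLASH",   0x30000000u, 0x1000000u},
    {"SRAM_DTC_cm7",  0x20000000u, 0x40000u},
    {"SRAM_ITC_cm7",  0x00000000u, 0x40000u},
    {"SRAM_OC1",      0x20240000u, 0x80000u},
    {"SRAM_OC2",      0x202c0000u, 0x40000u},
    {"NCACHE_REGION", 0x20300000u, 0x40000u},
    {"SRAM_OC_ECC1",  0x20340000u, 0x10000u},
    {"SRAM_OC_ECC2",  0x20350000u, 0x10000u},
    {"BOARD_SDRAM",   0x80000000u, 0x4000000u},
};

/* Return the name of the cm7 region containing addr, or "unmapped". */
const char *cm7_region_of(uint32_t addr)
{
    for (size_t i = 0; i < sizeof cm7_map / sizeof cm7_map[0]; i++) {
        /* Unsigned subtraction wraps, so this is a single range check. */
        if (addr - cm7_map[i].base < cm7_map[i].size) {
            return cm7_map[i].name;
        }
    }
    return "unmapped";
}
```

With this table, the cm7 stack (2003ff58), ext (200014b8) and malloc (200015f0) addresses all fall in SRAM_DTC_cm7, while const (3000a21c) and fcn (30003fe1) fall in BOARD_FLASH.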
Code:
coremark gcc 9.3.1 -O3
cm7@996 4073 iterations/sec 275 ma
cm4@400 728 iterations/sec
Datasheet power set points
MHz
set point cm7 cm4 DCDC_IN(ma)
1 996 400 132.4
0 700 240 79.3
5 240 120 42.2
7 200 100 19.7
9 0 100 11.8
11 0 200 24.2
Other MCU coremark results at end of perf.txt.
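For comparing cores that run at different clocks, a common normalization is CoreMark per MHz. A tiny sketch of that arithmetic, using the two results above (the function name is just for illustration):

```c
/* Score per MHz; returns 0 for a non-positive clock to avoid dividing by zero. */
double score_per_mhz(double score, double mhz)
{
    if (mhz <= 0.0) {
        return 0.0;
    }
    return score / mhz;
}
```

With the numbers above this gives roughly 4.09 CoreMark/MHz for cm7@996 (4073/996) and 1.82 CoreMark/MHz for cm4@400 (728/400).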
Old dhrystone.c v2.1
Code:
Dhrystone 2.1 (DMIPS)
1170@996mhz 2277 SDK -O3
T4@600mhz 2175 gcc -O3
M7@600mhz 2033 ARM CC -O3
T3.6@256mhz 1120 Fastest+pure+LTO
T3.6@180mhz 287 Faster
T3.6@120mhz 191 Faster
T3.5@120mhz 138 Faster
T3.2@120mhz 106 Faster
ESP32@240mhz 255 -O2
adaM4F@120mhz 168 -O2 SAMD51
STM32L4@80mhz 63
STM32F405@168mhz 198 -O2
F767ZI@216mhz 773 -O3
F446RE@180mhz 351 -O3
pico@125mhz 20
maple@72mhz 48
DUE@84mhz 49
ZERO@48mhz 24
UNO@16mhz 6
1170 crypto acceleration (CAAM)
TRNG and crypto accel for AES, SHA, DES, and asymmetric cryptography (RSA) supported in mbedtls lib.
mbedtls SDK benchmarks -O3
Code:
cm7
mbedTLS version 2.16.6
fsys=996000000
Using following implementations:
SHA: CAAM HW accelerated
AES: CAAM HW accelerated
AES GCM: CAAM HW accelerated
DES: CAAM HW accelerated
Asymmetric cryptography: CAAM HW accelerated
MD5 : 5834.39 KB/s, 145.64 cycles/byte
SHA-1 : 24142.75 KB/s, 29.24 cycles/byte
SHA-256 : 22746.25 KB/s, 27.73 cycles/byte
SHA-512 : 813.76 KB/s, 1188.33 cycles/byte
3DES : 11885.05 KB/s, 19.72 cycles/byte
DES : 43668.04 KB/s, 11.38 cycles/byte
AES-CBC-128 : 34505.66 KB/s, 17.85 cycles/byte
AES-CBC-192 : 32308.14 KB/s, 19.94 cycles/byte
AES-CBC-256 : 30238.82 KB/s, 22.00 cycles/byte
AES-GCM-128 : 32428.12 KB/s, 19.85 cycles/byte
AES-GCM-192 : 30358.25 KB/s, 21.82 cycles/byte
AES-GCM-256 : 27575.46 KB/s, 24.01 cycles/byte
AES-CCM-128 : 22045.39 KB/s, 33.92 cycles/byte
AES-CCM-192 : 20150.00 KB/s, 38.10 cycles/byte
AES-CCM-256 : 18526.98 KB/s, 42.31 cycles/byte
CTR_DRBG (NOPR) : 1956.96 KB/s, 486.43 cycles/byte
CTR_DRBG (PR) : 1349.88 KB/s, 722.21 cycles/byte
HMAC_DRBG SHA-1 (NOPR) : 570.32 KB/s, 1697.48 cycles/byte
HMAC_DRBG SHA-1 (PR) : 527.95 KB/s, 1835.51 cycles/byte
HMAC_DRBG SHA-256 (NOPR) : 797.10 KB/s, 1210.97 cycles/byte
HMAC_DRBG SHA-256 (PR) : 797.12 KB/s, 1211.01 cycles/byte
RSA-1024 : 4545.33 public/s
RSA-1024 : 240.00 private/s
RSA-2048 : 1794.33 public/s
RSA-2048 : 94.67 private/s
DHE-2048 : 26.00 handshake/s
DH-2048 : 48.00 handshake/s
ECDSA-secp256r1 : 236.33 sign/s
ECDSA-secp256r1 : 165.00 verify/s
ECDHE-secp256r1 : 187.00 handshake/s
ECDH-secp256r1 : 351.00 handshake/s
cm4
MD5 : 1850.37 KB/s, 205.67 cycles/byte
SHA-1 : 5603.39 KB/s, 66.75 cycles/byte
SHA-256 : 5618.67 KB/s, 66.55 cycles/byte
SHA-512 : 162.21 KB/s, 2377.27 cycles/byte
3DES : 23137.90 KB/s, 14.78 cycles/byte
DES : 31683.35 KB/s, 10.36 cycles/byte
AES-CBC-128 : 26819.54 KB/s, 12.55 cycles/byte
AES-CBC-192 : 25331.83 KB/s, 13.39 cycles/byte
AES-CBC-256 : 23907.89 KB/s, 14.29 cycles/byte
AES-GCM-128 : 22565.69 KB/s, 15.15 cycles/byte
AES-GCM-192 : 21500.71 KB/s, 15.99 cycles/byte
AES-GCM-256 : 20493.28 KB/s, 16.87 cycles/byte
AES-CCM-128 : 13030.69 KB/s, 27.51 cycles/byte
AES-CCM-192 : 12365.10 KB/s, 29.10 cycles/byte
AES-CCM-256 : 11697.15 KB/s, 30.87 cycles/byte
CTR_DRBG (NOPR) : 731.82 KB/s, 522.87 cycles/byte
CTR_DRBG (PR) : 483.29 KB/s, 793.33 cycles/byte
HMAC_DRBG SHA-1 (NOPR) : 126.43 KB/s, 3055.65 cycles/byte
HMAC_DRBG SHA-1 (PR) : 116.22 KB/s, 3326.44 cycles/byte
HMAC_DRBG SHA-256 (NOPR) : 171.17 KB/s, 2251.73 cycles/byte
HMAC_DRBG SHA-256 (PR) : 171.18 KB/s, 2251.73 cycles/byte
RSA-1024 : 1037.33 public/s
RSA-1024 : 48.33 private/s
RSA-2048 : 1794.33 public/s
RSA-2048 : 94.67 private/s
DHE-2048 : 23.00 handshake/s
DH-2048 : 37.00 handshake/s
ECDSA-secp256r1 : 59.00 sign/s
ECDSA-secp256r1 : 43.67 verify/s
ECDHE-secp256r1 : 49.67 handshake/s
ECDH-secp256r1 : 120.33 handshake/s
mbedtls without and with crypto acceleration on 1170@996MHz. Acceleration disables DCache.
Code:
no accel crypto accel
100! 324 us 11780 us
DH 27316 us 2849 us
RSA private 195755 us 19114 us
RSA pub 2376 us 486 us
RSA CRT 54820 us 6178 us
SHA256 114 us 8982 KBs 8 us 128000 KBs
wolfssl performance on cm7@996mhz, -O3 SDK, no crypto accel
Code:
100! 85 us 933262154439441526816992388562...
N 2048 bits
DH 19627 us
RSA private 129262 us
RSA pub 4563 us comp 0
RSA CRT 40022 us comp 0
MD5 111 us 9225 KBs
SHA256 110 us 9309 KBs
RC4 17 us 60235 KBs
AESCBC 64 3 us 21333 KBs
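The KBs columns above are throughput computed from the elapsed microseconds. The buffer size is not printed, so it is an assumption here: the numbers are consistent with a 1024-byte buffer (64 bytes for the AESCBC row) and KB meaning 1000 bytes. A minimal sketch of that conversion, with a made-up function name:

```c
#include <stdint.h>

/* Throughput in KB/s (KB = 1000 bytes) for nbytes processed in elapsed_us.
 * bytes per microsecond equals MB/s, so multiply by 1000 to get KB/s. */
uint32_t throughput_kbs(uint32_t nbytes, uint32_t elapsed_us)
{
    if (elapsed_us == 0) {
        return 0; /* timer resolution too coarse to measure */
    }
    return (uint32_t)(((uint64_t)nbytes * 1000u) / elapsed_us);
}
```

Under those assumptions, 1024 bytes in 110 us gives 9309 KBs (the wolfssl SHA256 row) and 1024 bytes in 8 us gives 128000 KBs (the accelerated mbedtls SHA256 row).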
mini-gmp performance cm7@996mhz, no crypto accel
Code:
100! 57 us 20 chars 93326215443944152681
DH 75439 us 1024 bits
RSA priv 373023 us
RSA pub 2932 us compare 0
RSA CRT 104778 us compare 0
See other gmp performance numbers
RSAsign
I measured the performance of the 1170 using Paul's RSA-2048 signature benchmark, RSAsign.
Code:
RSAsign seconds
T3.6@180MHz 0.474
T4@600MHz 0.085
F767ZI@216 0.332 mbed -O3 0.203 with mbedtls
1170@996MHz 0.0577 paul's tls, 32KB heap, -O3 NXP SDK
1170@996MHz 0.0069 NXP mbedtls +crypto accel (8x)
cm4@400MHz 0.344 mbedtls+accel 0.0134
The signature failed under the NXP SDK (MCUXpresso) because the heap was only 4KB. In the IDE I increased the heap to 32KB, and the signature was good. The NXP SDK has an mbedtls library that supports the crypto acceleration hardware, so I tested RSA-2048 signature with the NXP lib and accelerated crypto. The crypto acceleration improved performance by a factor of 8 on the 1170. More RSAsign results.
Random numbers (CAAM)
The 1170 crypto unit (CAAM) can generate hardware random numbers (CAAM_RNG_GetRandomData()). The NXP SDK mbedtls lib utilizes the hardware random number generator (DCache disabled). Whether you ask for one random byte or 1000, the generator takes 125 ms, slower than the Teensy 4 TRNG. There is little public documentation (an NDA is needed), so there may be speed optimizations.
You could use the 1170 hardware random number generator to get a good initial seed, then use your favorite PRNG/hashing function (MD5, SHA, RC4, Mersenne Twister, LFSR, LCG ...) to generate subsequent random bits.
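A minimal sketch of that seed-then-PRNG idea, using Marsaglia's xorshift32 (a simple LFSR-style generator) as the software PRNG. The hw_seed_stub() function is a hypothetical placeholder that returns a fixed value so the sketch runs anywhere; on the board you would fill the seed from the CAAM generator instead (call details not shown here).

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit seed read from the hardware RNG.
 * It returns a constant here; it is NOT an SDK function. */
uint32_t hw_seed_stub(void)
{
    return 0x1234ABCDu;
}

typedef struct {
    uint32_t state;
} xorshift32_t;

/* Seed the generator; xorshift state must never be zero. */
void xorshift32_seed(xorshift32_t *g, uint32_t seed)
{
    g->state = seed ? seed : 0x9E3779B9u;
}

/* Marsaglia xorshift32 step with shifts 13, 17, 5. */
uint32_t xorshift32_next(xorshift32_t *g)
{
    uint32_t x = g->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    g->state = x;
    return x;
}
```

The slow hardware call then happens once at startup (xorshift32_seed(&g, hw_seed_stub())), and every later number costs only a few shifts and XORs. xorshift32 is not cryptographically secure, so for keys you would want a hash- or cipher-based generator instead.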
Mersenne PRNG and TinyMT, 1000 32-bit random numbers
Code:
1000 32-bit random numbers (microseconds)
mersenne TinyMT
NXP 1170@996MHz 41 us 18 us
T4@600MHz 67 61
T3.6@180MHz 462 349
T3.5@120MHz 694 526
T3.2@120MHz 697 527
LC@48MHz 2341 1864
T2++@16MHz 38680 20636
ESP32@240MHz 349 288
F767ZI@216MHz 210 83
F446RE@180MHz 417 130
32F405@168MHz 388 411
32L476RE@80MHz 982 812 dragonfly
pico@125MHz 797 344
M4@120MHz 519 502 SAMD51
artemis@96MHz 748 851
DUE@84MHz 1519 1204 SAMD21
maple@72MHz 1443 1114
ZERO@48MHz 2522 2084
cpx@48MHz 2390 2017
Here are some DSP performance results:
Code:
DSP FFT benchmark 1024 radix4 REVERSEBITS 0 (microseconds)
q15 q31 f32 opt arm_math.h
NXP 1170 1GHz 44.3 89.4 66.9 gcc -O3 v1.6.0 SDK
T4@600mhz 77.4 147.0 87.0 gcc -O2 v1.5.1
M7@600mhz 77.4 147.8 88.0 gcc -O3 v1.5.1 SDK
M7@600mhz 74.5 126.9 95.6 ARM GCC -O3 v1.5.1 mbed
T3.6@256mhz 291.7 720.4 424.7 Faster v1.5.3
T3.6@240mhz 311.2 768.8 453.0 Faster v1.5.3
T3.6@180mhz 463.1 1215.2 703.7 Faster v1.1.0
T3.6@180mhz 414.7 1010.7 598.2 Faster v1.5.3
T3.5@120mhz 784.7 1947.9 1079.8 Faster v1.1.0
T3.5@120mhz 658.5 1577.9 919.5 Faster v1.5.3
K64F@120mhz 635.7 1273.8 827.2 ARM GCC -O3 v1.4.5 mbed
T3.2@120mhz 869.8 2498.5 18182.5 Faster v1.1.0 no FPU
adaM4F@120mhz 701.3 1756.1 781.0 Faster v1.1.0 SAMD51
STM32L4@80mhz 917.3 1953.8 1150.4 Faster v1.4.5
STM32F405@168 466.5 1135.1 556.1 gcc -O2 v1.6.0
F767ZI@216mhz 206.9 352.7 262.7 arm gcc -O3 v1.5.1
and CMSIS-NN (neural network, CIFAR10)
Code:
1170@996mhz 13818 us SDK -O3 arm_math.h 1.6.0
T4.0@600mhz 71102 us Faster arm_math.h 1.5.1
T3.6@180mhz 445994 us Faster arm_math.h 1.5.3
T3.5@120mhz 669922 us
float/double linear algebra
Code:
Linpack 100x100 mflops
double float
1170@996mhz 120.3 289 NXP SDK -O3 10/19/21
cm4@400mhz 2.7 46.2 NXP SDK -O3
T4@600mhz 71.4 166.3 gcc -O3
M7@600mhz 66.97 125.5 ARM CC -O3
T3.6@256mhz 2.85 41.1 Fastest
T3.6@180mhz 2.13 28.4 Faster
T3.5@120mhz 0.88 19.2 Faster
T3.2@120mhz 0.65 1.0 Faster no FPU
ESP32@240mhz 2.8 44.5
adaM4F@120mhz 1.4 20.1 SAMD51
STM32L4@80mhz 0.88 15.4 dragonfly -O2
STM32F405@168 1.8 28.3 -O2 adafruit
F767ZI@216mhz 24.1 47.5 ARM CC -O3
Floating point interpolation, raytrace, and finite difference:
Code:
float interpolate (us) 8x8 to 70x70
bilinear bicubic
T3.2@120mhz 18773 223109 no FPU
T3.5@120mhz 1944 26618
T3.6@180mhz 1294 16712
T4@600mhz 255 6406
1170@996mhz 158 2048 SDK -O3
adaM4F@120 1905 207326
ESP32@240mhz 1983 114813
STM32L4@80mhz 2897 37962
STM32F405@168mhz 1356 157633 -O2
F446RE@180mhz 1692 161939
F767ZI@216mhz 875 20149 -O3
raytrace 8x8 float -O2
microseconds
1170@996mhz 28960 NXP SDK -O3
T4@600mhz 45372
T3.6@180mhz 186409
T3.5@120mhz 301454
T3.2@120mhz 6634437 no FPU
ESP32@240mhz 204252
adaM4F@120mhz 328686
STM32F405@168mhz 225093
F446RE@180mhz 213244
STM32L4@80mhz 546230
F767ZI@216mhz 134194 -O3
finite difference 51x21 float -O2 fd
microseconds
T4@600mhz 42305 double 84134
T3.6@180mhz 169559 3662799
T3.5@120mhz 257085
T3.2@120mhz 4541625 no FPU
1170@996mhz 24672 double 49391
ESP32@240mhz 521806
adaM4F@120mhz 250132
STM32F405@168mhz 179313
F446RE@180 227017
STM32L4@80mhz 424934
F767ZI@216 130689 double 25987
See stochastic simulation performance and Cortex M7 superscalar speedup
FastCRC benchmark, table-driven
Code:
CRC Benchmark length: 16384 bytes
Maxim (iButton) FastCRC: Value:0x 000000f6 89 us 1472.719101 mbs
Maxim (iButton) builtin: Value:0x 000000f6 871 us 150.484501 mbs
MODBUS FastCRC: Value:0x 00007029 121 us 1083.239669 mbs
MODBUS builtin: Value:0x 00007029 803 us 163.227895 mbs
XMODEM FastCRC: Value:0x 000098d9 109 us 1202.495413 mbs
XMODEM builtin: Value:0x 000098d9 919 us 142.624592 mbs
MCRF4XX FastCRC: Value:0x 00004a29 132 us 992.969697 mbs
MCRF4XX builtin: Value:0x 00004a29 165 us 794.375758 mbs
KERMIT FastCRC: Value:0x 0000b259 49 us 2674.938776 mbs
Ethernet FastCRC: Value:0x 1271457f 444 us 295.207207 mbs
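For anyone unfamiliar with what "table-driven" means here, below is a minimal sketch of a table-driven CRC-8 for the Maxim (iButton) polynomial. It is not the FastCRC library's code, just the general technique: precompute the CRC of every possible byte once, then process the data one table lookup per byte. The parameters are the standard CRC-8/MAXIM ones (polynomial 0x31, reflected form 0x8C, init 0); the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

uint8_t crc8_table[256];

/* Build the lookup table for the reflected Maxim polynomial 0x8C. */
void crc8_maxim_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t crc = (uint8_t)i;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (uint8_t)((crc >> 1) ^ 0x8Cu)
                             : (uint8_t)(crc >> 1);
        }
        crc8_table[i] = crc;
    }
}

/* One table lookup per input byte; call crc8_maxim_init() first. */
uint8_t crc8_maxim(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc = crc8_table[crc ^ *data++];
    }
    return crc;
}
```

The standard check string "123456789" gives 0xA1 with these parameters. The mbs column in the benchmark above is megabits per second: 16384 bytes * 8 bits / elapsed microseconds, e.g. 131072 / 89 us = 1472.7.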
Notes
- In the NXP SDK, GPT timer clock sources are only 24 MHz, and probably RC based: drift is 980 ppm measured against GPS PPS. Tested drift with quad timer PWM and measured 34 ppm. Quad timer and PIT timer use the 240 MHz bus clock. Also tested the 24 MHz crystal using the 64-bit PIT timer (34.67 ppm). GPT FIX: one can configure GPTx clocks with the IDE's clock tool or hack clock_config.c to make GPT2 use kCLOCK_GPT2_ClockRoot_MuxOsc24MOut (or GPT_SetClockSource(GPTx,2)); then the 24 MHz crystal drift is 34.67 ppm. The GPT 32 kHz clock source (GPT_SetClockSource(GPTx,4)) measures -47 ppm.
- Still not clear how memory banks are shared/protected between the cm4 and cm7 for a multicore app.
- 12-bit DAC (1.8v, 1 ma, 4 us settle time) is available on test pad (TP18) on EVK board. 8-bit internal DAC can be routed internally to ADC or comparator/ACMP.
- max ADC voltage is 1.8v on EVK board. 1.2us/sample with 24mhz ADC clock (12-bit resolution, average 1)
- EVK power set points in SDK example power_mode_switch, running cm7 coremark: 275 ma (meter J38 1-2). Compare: T4 106 ma, 1060 EVK 184 ma
- NXP SDK mbedtls benchmark example disables DCache with SCB_DisableDCache(), and uses SysTick for timing. Test harness ensures code being timed runs at least 3 seconds. I've also done timing with GPT micros()
- 1170 eval board has Gig and 100T ethernet jacks. Tested 100T UDP, TCP, and ping on 1170. Uses lwIP (v2.1.1) polling and callbacks. See comparative performance table. To improve TCP receive performance, edit lwipopts.h to set TCP_WND to (6 * TCP_MSS).
- tested onboard microSD, read rate 39 mbs (2048-byte reads)
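The ppm drift figures in the timer note come down to comparing the number of timer ticks counted between GPS PPS edges against the nominal count. A minimal sketch of that arithmetic follows. The function name and the sign convention (positive means the timer clock runs fast) are assumptions for illustration, and the tick counts in the usage note are hypothetical, not captured data.

```c
#include <stdint.h>

/* Drift in parts per million of a timer clock measured against a reference.
 * counted_ticks: ticks counted over the reference interval
 * nominal_hz:    nominal timer clock frequency
 * seconds:       length of the reference interval (e.g. number of PPS periods)
 * Positive result: the timer clock runs fast (assumed convention). */
double drift_ppm(uint64_t counted_ticks, double nominal_hz, double seconds)
{
    double expected = nominal_hz * seconds;
    if (expected <= 0.0) {
        return 0.0;
    }
    return ((double)counted_ticks - expected) / expected * 1e6;
}
```

For example, a hypothetical count of 24000832 ticks of a 24 MHz clock in one PPS second works out to 832/24 = 34.67 ppm. Counting over more PPS periods reduces the effect of the one-tick quantization.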
References
NXP
MIMXRT1170-EVK: i.MX RT1170 Evaluation Kit
1170 datasheet
1060 to 1170 migration guide
Cortex-M7 instruction cycle counts, timings, and dual-issue combinations
I'll add other results of 1170 experiments to this post ....