PaulStoffregen
That's true, we need a better benchmark.
Agreed, we do need a better benchmark. I'm open to ideas. Well, ideas that don't involve me developing one from scratch....
But sure, it's fun to know the rank of the Teensy in this list (?):
I request that benchmarkers re-run this new, corrected version of Dhrystone, turning off or bypassing optimizers which perform more than peephole optimization.
*----------------DHRYSTONE VERSION 1.1 RESULTS BEGIN--------------------------
*
* MACHINE MICROPROCESSOR OPERATING COMPILER DHRYSTONES/SEC.
* TYPE SYSTEM NO REG REGS
* -------------------------- ------------ ----------- ---------------
* Apple IIe 65C02-1.02Mhz DOS 3.3 Aztec CII v1.05i 37 37
* - Z80-2.5Mhz CPM-80 v2.2 Aztec CII v1.05g 91 91
* - 8086-8Mhz RMX86 V6 Intel C-86 V2.0 197 203LM??
* IBM PC/XT 8088-4.77Mhz COHERENT 2.3.43 Mark Wiiliams 259 275
* - 8086-8Mhz RMX86 V6 Intel C-86 V2.0 287 304 ??
* Fortune 32:16 68000-6Mhz V7+sys3+4.1BSD cc 360 346
* PDP-11/34A w/FP-11C UNIX V7m cc 406 449
* Macintosh512 68000-7.7Mhz Mac ROM O/S DeSmet(C ware) 625 625
* VAX-11/750 w/FPA UNIX 4.2BSD cc 831 852
* DataMedia 932 68000-10Mhz UNIX sysV cc 837 888
* Plexus P35 68000-12.5Mhz UNIX sysIII cc 835 894
* ATT PC7300 68010-10Mhz UNIX 5.0.3 cc 973 1034
* Compaq II 80286-8Mhz MSDOS 3.1 MS C 3.0 1086 1140 LM
* IBM PC/AT 80286-7.5Mhz Venix/286 SVR2 cc 1159 1254 *15
* Compaq II 80286-8Mhz MSDOS 3.1 MS C 3.0 1190 1282 MM
* MicroVAX II - Mach/4.3 cc 1361 1385
* DEC uVAX II - Ultrix-32m v1.1 cc 1385 1399
* Compaq II 80286-8Mhz MSDOS 3.1 MS C 3.0 1351 1428
* VAX 11/780 - UNIX 4.2BSD cc 1417 1441
* VAX-780/MA780 Mach/4.3 cc 1428 1470
* VAX 11/780 - UNIX 5.0.1 cc 4.1.1.31 1650 1640
* Ridge 32C V1 - ROS 3.3 Ridge C (older) 1628 1695
* Gould PN6005 - UTX 1.1c+ (4.2) cc 1732 1884
* Gould PN9080 custom ECL UTX-32 1.1C cc 4745 4992
* VAX-784 - Mach/4.3 cc 5263 5555 &4
* VAX 8600 - 4.3 BSD cc 6329 6423
* Amdahl 5860 - UTS sysV cc 1.22 28735 28846
* IBM3090/200 - ? ? 31250 31250
*
*
*----------------DHRYSTONE VERSION 1.0 RESULTS BEGIN--------------------------
*
* MACHINE MICROPROCESSOR OPERATING COMPILER DHRYSTONES/SEC.
* TYPE SYSTEM NO REG REGS
* -------------------------- ------------ ----------- ---------------
* Commodore 64 6510-1MHz C64 ROM C Power 2.8 36 36
* HP-110 8086-5.33Mhz MSDOS 2.11 Lattice 2.14 284 284
* IBM PC/XT 8088-4.77Mhz PC/IX cc 271 294
* CCC 3205 - Xelos(SVR2) cc 558 592
* Perq-II 2901 bitslice Accent S5c cc (CMU) 301 301
* IBM PC/XT 8088-4.77Mhz COHERENT 2.3.43 MarkWilliams cc 296 317
* Cosmos 68000-8Mhz UniSoft cc 305 322
* IBM PC/XT 8088-4.77Mhz Venix/86 2.0 cc 297 324
* DEC PRO 350 11/23 Venix/PRO SVR2 cc 299 325
* IBM PC 8088-4.77Mhz MSDOS 2.0 b16cc 2.0 310 340
* PDP11/23 11/23 Venix (V7) cc 320 358
* Commodore Amiga ? Lattice 3.02 368 371
* PC/XT 8088-4.77Mhz Venix/86 SYS V cc 339 377
* IBM PC 8088-4.77Mhz MSDOS 2.0 CI-C86 2.20M 390 390
* IBM PC/XT 8088-4.77Mhz PCDOS 2.1 Wizard 2.1 367 403
* IBM PC/XT 8088-4.77Mhz PCDOS 3.1 Lattice 2.15 403 403 @
* Colex DM-6 68010-8Mhz Unisoft SYSV cc 378 410
* IBM PC 8088-4.77Mhz PCDOS 3.1 Datalight 1.10 416 416
* IBM PC NEC V20-4.77Mhz MSDOS 3.1 MS 3.1 387 420
* IBM PC/XT 8088-4.77Mhz PCDOS 2.1 Microsoft 3.0 390 427
* IBM PC NEC V20-4.77Mhz MSDOS 3.1 MS 3.1 (186) 393 427
* PDP-11/34 - UNIX V7M cc 387 438
* IBM PC 8088, 4.77mhz PC-DOS 2.1 Aztec C v3.2d 423 454
* Tandy 1000 V20, 4.77mhz MS-DOS 2.11 Aztec C v3.2d 423 458
* Tandy TRS-16B 68000-6Mhz Xenix 1.3.5 cc 438 458
* PDP-11/34 - RSTS/E decus c 438 495
* Onyx C8002 Z8000-4Mhz IS/1 1.1 (V7) cc 476 511
* Tandy TRS-16B 68000-6Mhz Xenix 1.3.5 Green Hills 609 617
* DEC PRO 380 11/73 Venix/PRO SVR2 cc 577 628
* FHL QT+ 68000-10Mhz Os9/68000 version 1.3 603 649 FH
* Apollo DN550 68010-?Mhz AegisSR9/IX cc 3.12 666 666
* HP-110 8086-5.33Mhz MSDOS 2.11 Aztec-C 641 676
* ATT PC6300 8086-8Mhz MSDOS 2.11 b16cc 2.0 632 684
* IBM PC/AT 80286-6Mhz PCDOS 3.0 CI-C86 2.1 666 684
* Tandy 6000 68000-8Mhz Xenix 3.0 cc 694 694
* IBM PC/AT 80286-6Mhz Xenix 3.0 cc 684 704 MM
* Macintosh 68000-7.8Mhz 2M Mac Rom Mac C 32 bit int 694 704
* Macintosh 68000-7.7Mhz - MegaMax C 2.0 661 709
* Macintosh512 68000-7.7Mhz Mac ROM O/S DeSmet(C ware) 714 714
* IBM PC/AT 80286-6Mhz Xenix 3.0 cc 704 714 LM
* Codata 3300 68000-8Mhz UniPlus+ (v7) cc 678 725
* WICAT MB 68000-8Mhz System V WICAT C 4.1 585 731 ~
* Cadmus 9000 68010-10Mhz UNIX cc 714 735
* AT&T 6300 8086-8Mhz Venix/86 SVR2 cc 668 743
* Cadmus 9790 68010-10Mhz 1MB SVR0,Cadmus3.7 cc 720 747
* NEC PC9801F 8086-8Mhz PCDOS 2.11 Lattice 2.15 768 - @
* ATT PC6300 8086-8Mhz MSDOS 2.11 CI-C86 2.20M 769 769
* Burroughs XE550 68010-10Mhz Centix 2.10 cc 769 769 CT1
* EAGLE/TURBO 8086-8Mhz Venix/86 SVR2 cc 696 779
* ALTOS 586 8086-10Mhz Xenix 3.0b cc 724 793
* DEC 11/73 J-11 micro Ultrix-11 V3.0 cc 735 793
* ATT 3B2/300 WE32000-?Mhz UNIX 5.0.2 cc 735 806
* Apollo DN320 68010-?Mhz AegisSR9/IX cc 3.12 806 806
* IRIS-2400 68010-10Mhz UNIX System V cc 772 829
* Atari 520ST 68000-8Mhz TOS DigResearch 839 846
* IBM PC/AT 80286-6Mhz PCDOS 3.0 MS 3.0(large) 833 847 LM
* WICAT MB 68000-8Mhz System V WICAT C 4.1 675 853 S~
* VAX 11/750 - Ultrix 1.1 4.2BSD cc 781 862
* CCC 7350A 68000-8MHz UniSoft V.2 cc 821 875
* VAX 11/750 - UNIX 4.2bsd cc 862 877
* Fast Mac 68000-7.7Mhz - MegaMax C 2.0 839 904 +
* IBM PC/XT 8086-9.54Mhz PCDOS 3.1 Microsoft 3.0 833 909 C1
* DEC 11/44 Ultrix-11 V3.0 cc 862 909
* Macintosh 68000-7.8Mhz 2M Mac Rom Mac C 16 bit int 877 909 S
* CCC 3210 - Xelos R01(SVR2) cc 849 924
* CCC 3220 - Ed. 7 v2.3 cc 892 925
* IBM PC/AT 80286-6Mhz Xenix 3.0 cc -i 909 925
* AT&T 6300 8086, 8mhz MS-DOS 2.11 Aztec C v3.2d 862 943
* IBM PC/AT 80286-6Mhz Xenix 3.0 cc 892 961
* VAX 11/750 w/FPA Eunice 3.2 cc 914 976
* IBM PC/XT 8086-9.54Mhz PCDOS 3.1 Wizard 2.1 892 980 C1
* IBM PC/XT 8086-9.54Mhz PCDOS 3.1 Lattice 2.15 980 980 C1
* Plexus P35 68000-10Mhz UNIX System III cc 984 980
* PDP-11/73 KDJ11-AA 15Mhz UNIX V7M 2.1 cc 862 981
* VAX 11/750 w/FPA UNIX 4.3bsd cc 994 997
* IRIS-1400 68010-10Mhz UNIX System V cc 909 1000
* IBM PC/AT 80286-6Mhz Venix/86 2.1 cc 961 1000
* IBM PC/AT 80286-6Mhz PCDOS 3.0 b16cc 2.0 943 1063
* Zilog S8000/11 Z8001-5.5Mhz Zeus 3.2 cc 1011 1084
* NSC ICM-3216 NSC 32016-10Mhz UNIX SVR2 cc 1041 1084
* IBM PC/AT 80286-6Mhz PCDOS 3.0 MS 3.0(small) 1063 1086
* VAX 11/750 w/FPA VMS VAX-11 C 2.0 958 1091
* Stride 68000-10Mhz System-V/68 cc 1041 1111
* Plexus P/60 MC68000-12.5Mhz UNIX SYSIII Plexus 1111 1111
* ATT PC7300 68010-10Mhz UNIX 5.0.2 cc 1041 1111
* CCC 3230 - Xelos R01(SVR2) cc 1040 1126
* Stride 68000-12Mhz System-V/68 cc 1063 1136
* IBM PC/AT 80286-6Mhz Venix/286 SVR2 cc 1056 1149
* Plexus P/60 MC68000-12.5Mhz UNIX SYSIII Plexus 1111 1163 T
* IBM PC/AT 80286-6Mhz PCDOS 3.0 Datalight 1.10 1190 1190
* ATT PC6300+ 80286-6Mhz MSDOS 3.1 b16cc 2.0 1111 1219
* IBM PC/AT 80286-6Mhz PCDOS 3.1 Wizard 2.1 1136 1219
* Sun2/120 68010-10Mhz Sun 4.2BSD cc 1136 1219
* IBM PC/AT 80286-6Mhz PCDOS 3.0 CI-C86 2.20M 1219 1219
* WICAT PB 68000-8Mhz System V WICAT C 4.1 998 1226 ~
* MASSCOMP 500 68010-10MHz RTU V3.0 cc (V3.2) 1156 1238
* Alliant FX/8 IP (68012-12Mhz) Concentrix cc -ip;exec -i 1170 1243 FX
* Cyb DataMate 68010-12.5Mhz Uniplus 5.0 Unisoft cc 1162 1250
* PDP 11/70 - UNIX 5.2 cc 1162 1250
* IBM PC/AT 80286-6Mhz PCDOS 3.1 Lattice 2.15 1250 1250
* IBM PC/AT 80286-7.5Mhz Venix/86 2.1 cc 1190 1315 *15
* Sun2/120 68010-10Mhz Standalone cc 1219 1315
* Intel 380 80286-8Mhz Xenix R3.0up1 cc 1250 1315 *16
* Sequent Balance 8000 NS32032-10MHz Dynix 2.0 cc 1250 1315 N12
* IBM PC/DSI-32 32032-10Mhz MSDOS 3.1 GreenHills 2.14 1282 1315 C3
* ATT 3B2/400 WE32100-?Mhz UNIX 5.2 cc 1315 1315
* CCC 3250XP - Xelos R01(SVR2) cc 1215 1318
* IBM PC/RT 032 RISC(801?)?Mhz BSD 4.2 cc 1248 1333 RT
* DG MV4000 - AOS/VS 5.00 cc 1333 1333
* IBM PC/AT 80286-8Mhz Venix/86 2.1 cc 1275 1380 *16
* IBM PC/AT 80286-6Mhz MSDOS 3.0 Microsoft 3.0 1250 1388
* ATT PC6300+ 80286-6Mhz MSDOS 3.1 CI-C86 2.20M 1428 1428
* COMPAQ/286 80286-8Mhz Venix/286 SVR2 cc 1326 1443
* IBM PC/AT 80286-7.5Mhz Venix/286 SVR2 cc 1333 1449 *15
* WICAT PB 68000-8Mhz System V WICAT C 4.1 1169 1464 S~
* Tandy II/6000 68000-8Mhz Xenix 3.0 cc 1384 1477
* MicroVAX II - Mach/4.3 cc 1513 1536
* WICAT MB 68000-12.5Mhz System V WICAT C 4.1 1246 1537 ~
* IBM PC/AT 80286-9Mhz SCO Xenix V cc 1540 1556 *18
* Cyb DataMate 68010-12.5Mhz Uniplus 5.0 Unisoft cc 1470 1562 S
* VAX 11/780 - UNIX 5.2 cc 1515 1562
* MicroVAX-II - - - 1562 1612
* VAX-780/MA780 Mach/4.3 cc 1587 1612
* VAX 11/780 - UNIX 4.3bsd cc 1646 1662
* Apollo DN660 - AegisSR9/IX cc 3.12 1666 1666
* ATT 3B20 - UNIX 5.2 cc 1515 1724
* NEC PC-98XA 80286-8Mhz PCDOS 3.1 Lattice 2.15 1724 1724 @
* HP9000-500 B series CPU HP-UX 4.02 cc 1724 -
* Ridge 32C V1 - ROS 3.3 Ridge C (older) 1776 -
* IBM PC/STD 80286-8Mhz MSDOS 3.0 Microsoft 3.0 1724 1785 C2
* WICAT MB 68000-12.5Mhz System V WICAT C 4.1 1450 1814 S~
* WICAT PB 68000-12.5Mhz System V WICAT C 4.1 1530 1898 ~
* DEC-2065 KL10-Model B TOPS-20 6.1FT5 Port. C Comp. 1937 1946
* Gould PN6005 - UTX 1.1(4.2BSD) cc 1675 1964
* DEC2060 KL-10 TOPS-20 cc 2000 2000 NM
* Intel 310AP 80286-8Mhz Xenix 3.0 cc 1893 2009
* VAX 11/785 - UNIX 5.2 cc 2083 2083
* VAX 11/785 - VMS VAX-11 C 2.0 2083 2083
* VAX 11/785 - UNIX SVR2 cc 2123 2083
* VAX 11/785 - ULTRIX-32 1.1 cc 2083 2091
* VAX 11/785 - UNIX 4.3bsd cc 2135 2136
* WICAT PB 68000-12.5Mhz System V WICAT C 4.1 1780 2233 S~
* Pyramid 90x - OSx 2.3 cc 2272 2272
* Pyramid 90x FPA,cache,4Mb OSx 2.5 cc no -O 2777 2777
* Pyramid 90x w/cache OSx 2.5 cc w/-O 3333 3333
* IBM-4341-II - VM/SP3 Waterloo C 1.2 3333 3333
* IRIS-2400T 68020-16.67Mhz UNIX System V cc 3105 3401
* Celerity C-1200 ? UNIX 4.2BSD cc 3485 3468
* SUN 3/75 68020-16.67Mhz SUN 4.2 V3 cc 3333 3571
* IBM-4341 Model 12 UTS 5.0 ? 3685 3685
* SUN-3/160 68020-16.67Mhz Sun 4.2 V3.0A cc 3381 3764
* Sun 3/180 68020-16.67Mhz Sun 4.2 cc 3333 3846
* IBM-4341 Model 12 UTS 5.0 ? 3910 3910 MN
* MC 5400 68020-16.67MHz RTU V3.0 cc (V4.0) 3952 4054
* Intel 386/20 80386-12.5Mhz PMON debugger Intel C386v0.2 4149 4386
* NCR Tower32 68020-16.67Mhz SYS 5.0 Rel 2.0 cc 3846 4545
* MC 5600/5700 68020-16.67MHz RTU V3.0 cc (V4.0) 4504 4746 %
* Intel 386/20 80386-12.5Mhz PMON debugger Intel C386v0.2 4534 4794 i1
* Intel 386/20 80386-16Mhz PMON debugger Intel C386v0.2 5304 5607
* Gould PN9080 custom ECL UTX-32 1.1C cc 5369 5676
* Gould 1460-342 ECL proc UTX/32 1.1/c cc 5342 5677 G1
* VAX-784 - Mach/4.3 cc 5882 5882 &4
* Intel 386/20 80386-16Mhz PMON debugger Intel C386v0.2 5801 6133 i1
* VAX 8600 - UNIX 4.3bsd cc 7024 7088
* VAX 8600 - VMS VAX-11 C 2.0 7142 7142
* Alliant FX/8 CE Concentrix cc -ce;exec -c 6952 7655 FX
* CCI POWER 6/32 COS(SV+4.2) cc 7500 7800
* CCI POWER 6/32 POWER 6 UNIX/V cc 8236 8498
* CCI POWER 6/32 4.2 Rel. 1.2b cc 8963 9544
* Sperry (CCI Power 6) 4.2BSD cc 9345 10000
* CRAY-X-MP/12 105Mhz COS 1.14 Cray C 10204 10204
* IBM-3083 - UTS 5.0 Rel 1 cc 16666 12500
* CRAY-1A 80Mhz CTSS Cray C 2.0 12100 13888
* IBM-3083 - VM/CMS HPO 3.4 Waterloo C 1.2 13889 13889
* Amdahl 470 V/8 UTS/V 5.2 cc v1.23 15560 15560
* CRAY-X-MP/48 105Mhz CTSS Cray C 2.0 15625 17857
* Amdahl 580 - UTS 5.0 Rel 1.2 cc v1.5 23076 23076
* Amdahl 5860 UTS/V 5.2 cc v1.23 28970 28970
Proc1(PtrParIn)
REG RecordPtr PtrParIn;
{
Then, I haven't seen any example for ARM so far that is significantly slower with -O2. Perhaps someone can show one?
flops
-----
-O
Float (4 bytes) multiplications per second: (888746)
-O2
Float (4 bytes) multiplications per second: (780809)
ILI9341_t3 graphicstest example
-------------------------------
-O
Screen fill 280099
Text 16621
Lines 73153
Horiz/Vert Lines 22943
Rectangles (outline) 14599
Rectangles (filled) 581627
Circles (filled) 85967
Circles (outline) 73601
Triangles (outline) 17685
Triangles (filled) 191978
Rounded rects (outline) 33292
Rounded rects (filled) 634488
-O2
Benchmark Time (microseconds)
Screen fill 280101
Text 15912
Lines 73083
Horiz/Vert Lines 22940
Rectangles (outline) 14598
Rectangles (filled) 581630
Circles (filled) 86116
Circles (outline) 73238
Triangles (outline) 17676
Triangles (filled) 191458
Rounded rects (outline) 33046
Rounded rects (filled) 634447
fft8
----
-O
1024 2946
512 1371
256 632
128 291
64 133
32 61
16 28
-O2
1024 2924
512 1361
256 630
128 290
64 133
32 61
16 29
benchmark_memcpy
----------------
-O
memcpy32 0
memset32 44
loop 75
memcpy 51
memset 25
set loop 53
-O2
memcpy32 0
memset32 45
loop 96
memcpy 51
memset 25
set loop 53
5aa: f000 f93b bl 824 <micros>
5ae: 2400 movs r4, #0
5b0: 9000 str r0, [sp, #0]
5b2: 5933 ldr r3, [r6, r4]
5b4: 512b str r3, [r5, r4]
5b6: 3404 adds r4, #4
5b8: f5b4 5f80 cmp.w r4, #4096 ; 0x1000
5bc: 4f3a ldr r7, [pc, #232] ; (6a8 <loop+0x18c>)
5be: d1f8 bne.n 5b2 <loop+0x96>
5c0: f000 f930 bl 824 <micros>
Here are 4 quick benchmarks. 3 are attached files below. The other is File > Examples > ILI9341_t3 > graphicstest.
I ran all 4 of these, with -O and -O2. Results are copied below.
In many cases, -O2 is about 0.5% to 1.0% faster than -O. But in a couple, -O2 is dramatically slower. Look at the flops test, and the benchmark_memcpy "loop" test.
benchmark_memcpy
----------------
-O2
memcpy32 0
memset32 45
loop 54
memcpy 51
memset 34
set loop 43
-O1
memcpy32 1
memset32 44
loop 66
memcpy 51
memset 33
set loop 54
5b8: f000 f93e bl 838 <micros>
5bc: 4942 ldr r1, [pc, #264] ; (6c8 <loop+0x19c>)
5be: 4b3c ldr r3, [pc, #240] ; (6b0 <loop+0x184>)
5c0: 9000 str r0, [sp, #0]
5c2: f856 2f04 ldr.w r2, [r6, #4]!
5c6: f843 2f04 str.w r2, [r3, #4]!
5ca: 428e cmp r6, r1
5cc: d1f9 bne.n 5c2 <loop+0x96>
5ce: f000 f933 bl 838 <micros>
Hm, I think we can calculate 3 here, because the M4 is able to do the second cycle of ldr in parallel with the first cycle of str.
2 for load, 2 for store
Obviously, they found the bug and fixed it: -O2 is now faster, as expected.
char * ultoa(unsigned long val, char *buf, int radix);
char * ltoa(long val, char *buf, int radix);
//static inline char * utoa(unsigned int val, char *buf, int radix) __attribute__((always_inline, unused));
//static inline char * utoa(unsigned int val, char *buf, int radix) { return ultoa(val, buf, radix); }
//static inline char * itoa(int val, char *buf, int radix) __attribute__((always_inline, unused));
//static inline char * itoa(int val, char *buf, int radix) { return ltoa(val, buf, radix); }
char * dtostrf(float val, int width, unsigned int precision, char *buf);