Teensyduino 1.22 Features

Agreed, we do need a better benchmark. I'm open to ideas. Well, ideas that don't involve me developing one from scratch....

Not fully on-topic, but I want to try to get the Dhrystones of the Teensy:
https://classes.soe.ucsc.edu/cmpe202/benchmarks/standard/dhrystone.c

It's not what we want:
I request that benchmarkers re-run this new, corrected version of Dhrystone, turning off or bypassing optimizers which perform more than peephole optimization
But sure, it's fun to know the rank of the Teensy in this list (?):
Code:
 *----------------DHRYSTONE VERSION 1.1 RESULTS BEGIN--------------------------
 *
 * MACHINE	MICROPROCESSOR	OPERATING	COMPILER	DHRYSTONES/SEC.
 * TYPE				SYSTEM				NO REG	REGS
 * --------------------------	------------	-----------	---------------
 * Apple IIe	65C02-1.02Mhz	DOS 3.3		Aztec CII v1.05i  37	  37
 * -		Z80-2.5Mhz	CPM-80 v2.2	Aztec CII v1.05g  91	  91
 * -		8086-8Mhz	RMX86 V6	Intel C-86 V2.0	 197	 203LM??
 * IBM PC/XT	8088-4.77Mhz	COHERENT 2.3.43	Mark Wiiliams	 259	 275
 * -		8086-8Mhz	RMX86 V6	Intel C-86 V2.0	 287	 304 ??
 * Fortune 32:16 68000-6Mhz	V7+sys3+4.1BSD  cc		 360	 346
 * PDP-11/34A	w/FP-11C	UNIX V7m	cc		 406	 449
 * Macintosh512	68000-7.7Mhz	Mac ROM O/S	DeSmet(C ware)	 625	 625
 * VAX-11/750	w/FPA		UNIX 4.2BSD	cc		 831	 852
 * DataMedia 932 68000-10Mhz	UNIX sysV	cc		 837	 888
 * Plexus P35	68000-12.5Mhz	UNIX sysIII	cc		 835	 894
 * ATT PC7300	68010-10Mhz	UNIX 5.0.3	cc		 973	1034
 * Compaq II	80286-8Mhz	MSDOS 3.1	MS C 3.0 	1086	1140 LM
 * IBM PC/AT    80286-7.5Mhz    Venix/286 SVR2  cc              1159    1254 *15
 * Compaq II	80286-8Mhz	MSDOS 3.1	MS C 3.0 	1190	1282 MM
 * MicroVAX II	-		Mach/4.3	cc		1361	1385
 * DEC uVAX II	-		Ultrix-32m v1.1	cc		1385	1399
 * Compaq II	80286-8Mhz	MSDOS 3.1	MS C 3.0 	1351	1428
 * VAX 11/780	-		UNIX 4.2BSD	cc		1417	1441
 * VAX-780/MA780		Mach/4.3	cc		1428	1470
 * VAX 11/780	-		UNIX 5.0.1	cc 4.1.1.31	1650	1640
 * Ridge 32C V1	-		ROS 3.3		Ridge C (older)	1628	1695
 * Gould PN6005	-		UTX 1.1c+ (4.2)	cc		1732	1884
 * Gould PN9080	custom ECL	UTX-32 1.1C	cc		4745	4992
 * VAX-784	-		Mach/4.3	cc		5263	5555 &4
 * VAX 8600	-		4.3 BSD		cc		6329	6423
 * Amdahl 5860	-		UTS sysV	cc 1.22	       28735   28846
 * IBM3090/200	-		?		?	       31250   31250
 *
 *
 *----------------DHRYSTONE VERSION 1.0 RESULTS BEGIN--------------------------
 *
 * MACHINE	MICROPROCESSOR	OPERATING	COMPILER	DHRYSTONES/SEC.
 * TYPE				SYSTEM				NO REG	REGS
 * --------------------------	------------	-----------	---------------
 * Commodore 64	6510-1MHz	C64 ROM		C Power 2.8	  36	  36
 * HP-110	8086-5.33Mhz	MSDOS 2.11	Lattice 2.14	 284	 284
 * IBM PC/XT	8088-4.77Mhz	PC/IX		cc		 271	 294
 * CCC 3205	-		Xelos(SVR2) 	cc		 558	 592
 * Perq-II	2901 bitslice	Accent S5c 	cc (CMU)	 301	 301
 * IBM PC/XT	8088-4.77Mhz	COHERENT 2.3.43	MarkWilliams cc  296	 317
 * Cosmos	68000-8Mhz	UniSoft		cc		 305	 322
 * IBM PC/XT	8088-4.77Mhz	Venix/86 2.0	cc		 297	 324
 * DEC PRO 350  11/23           Venix/PRO SVR2  cc               299     325
 * IBM PC	8088-4.77Mhz	MSDOS 2.0	b16cc 2.0	 310	 340
 * PDP11/23	11/23           Venix (V7)      cc               320     358
 * Commodore Amiga		?		Lattice 3.02	 368	 371
 * PC/XT        8088-4.77Mhz    Venix/86 SYS V  cc               339     377
 * IBM PC	8088-4.77Mhz	MSDOS 2.0	CI-C86 2.20M	 390	 390
 * IBM PC/XT	8088-4.77Mhz	PCDOS 2.1	Wizard 2.1	 367	 403
 * IBM PC/XT	8088-4.77Mhz	PCDOS 3.1	Lattice 2.15	 403	 403 @
 * Colex DM-6	68010-8Mhz	Unisoft SYSV	cc		 378	 410
 * IBM PC	8088-4.77Mhz	PCDOS 3.1	Datalight 1.10	 416	 416
 * IBM PC	NEC V20-4.77Mhz	MSDOS 3.1	MS 3.1 		 387	 420
 * IBM PC/XT	8088-4.77Mhz	PCDOS 2.1	Microsoft 3.0	 390	 427
 * IBM PC	NEC V20-4.77Mhz	MSDOS 3.1	MS 3.1 (186) 	 393	 427
 * PDP-11/34	-		UNIX V7M	cc		 387	 438
 * IBM PC	8088, 4.77mhz	PC-DOS 2.1	Aztec C v3.2d	 423	 454
 * Tandy 1000	V20, 4.77mhz	MS-DOS 2.11	Aztec C v3.2d	 423	 458
 * Tandy TRS-16B 68000-6Mhz	Xenix 1.3.5	cc		 438	 458
 * PDP-11/34	-		RSTS/E		decus c		 438	 495
 * Onyx C8002	Z8000-4Mhz	IS/1 1.1 (V7)	cc		 476	 511
 * Tandy TRS-16B 68000-6Mhz	Xenix 1.3.5	Green Hills	 609	 617
 * DEC PRO 380  11/73           Venix/PRO SVR2  cc               577     628
 * FHL QT+	68000-10Mhz	Os9/68000	version 1.3	 603	 649 FH
 * Apollo DN550	68010-?Mhz	AegisSR9/IX	cc 3.12		 666	 666
 * HP-110	8086-5.33Mhz	MSDOS 2.11	Aztec-C		 641	 676 
 * ATT PC6300	8086-8Mhz	MSDOS 2.11	b16cc 2.0	 632	 684
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	CI-C86 2.1	 666	 684
 * Tandy 6000	68000-8Mhz	Xenix 3.0	cc		 694	 694
 * IBM PC/AT	80286-6Mhz	Xenix 3.0	cc		 684	 704 MM
 * Macintosh	68000-7.8Mhz 2M	Mac Rom		Mac C 32 bit int 694	 704
 * Macintosh	68000-7.7Mhz	-		MegaMax C 2.0	 661	 709
 * Macintosh512	68000-7.7Mhz	Mac ROM O/S	DeSmet(C ware)	 714	 714
 * IBM PC/AT	80286-6Mhz	Xenix 3.0	cc		 704	 714 LM
 * Codata 3300	68000-8Mhz	UniPlus+ (v7)	cc		 678	 725
 * WICAT MB	68000-8Mhz	System V	WICAT C 4.1	 585	 731 ~
 * Cadmus 9000	68010-10Mhz	UNIX		cc		 714	 735
 * AT&T 6300    8086-8Mhz       Venix/86 SVR2   cc               668     743
 * Cadmus 9790	68010-10Mhz 1MB	SVR0,Cadmus3.7	cc		 720	 747
 * NEC PC9801F	8086-8Mhz	PCDOS 2.11	Lattice 2.15	 768	  -  @
 * ATT PC6300	8086-8Mhz	MSDOS 2.11	CI-C86 2.20M	 769	 769
 * Burroughs XE550 68010-10Mhz	Centix 2.10	cc		 769	 769 CT1
 * EAGLE/TURBO  8086-8Mhz       Venix/86 SVR2   cc               696     779
 * ALTOS 586	8086-10Mhz	Xenix 3.0b	cc 		 724	 793
 * DEC 11/73	J-11 micro	Ultrix-11 V3.0	cc		 735	 793
 * ATT 3B2/300	WE32000-?Mhz	UNIX 5.0.2	cc		 735	 806
 * Apollo DN320	68010-?Mhz	AegisSR9/IX	cc 3.12		 806	 806
 * IRIS-2400	68010-10Mhz	UNIX System V	cc		 772	 829
 * Atari 520ST  68000-8Mhz      TOS             DigResearch      839     846
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	MS 3.0(large)	 833	 847 LM
 * WICAT MB	68000-8Mhz	System V	WICAT C 4.1	 675	 853 S~
 * VAX 11/750	-		Ultrix 1.1	4.2BSD cc	 781	 862
 * CCC  7350A	68000-8MHz	UniSoft V.2	cc		 821	 875
 * VAX 11/750	-		UNIX 4.2bsd	cc		 862	 877
 * Fast Mac	68000-7.7Mhz	-		MegaMax C 2.0	 839	 904 +
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Microsoft 3.0	 833	 909 C1
 * DEC 11/44			Ultrix-11 V3.0	cc		 862	 909
 * Macintosh	68000-7.8Mhz 2M	Mac Rom		Mac C 16 bit int 877	 909 S
 * CCC 3210	-		Xelos R01(SVR2)	cc		 849	 924
 * CCC 3220	-               Ed. 7 v2.3      cc		 892	 925
 * IBM PC/AT	80286-6Mhz	Xenix 3.0	cc -i		 909	 925
 * AT&T 6300	8086, 8mhz	MS-DOS 2.11	Aztec C v3.2d	 862	 943
 * IBM PC/AT	80286-6Mhz	Xenix 3.0	cc		 892	 961
 * VAX 11/750	w/FPA		Eunice 3.2	cc		 914	 976
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Wizard 2.1	 892	 980 C1
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Lattice 2.15	 980	 980 C1
 * Plexus P35	68000-10Mhz	UNIX System III cc		 984	 980
 * PDP-11/73	KDJ11-AA 15Mhz	UNIX V7M 2.1	cc		 862     981
 * VAX 11/750	w/FPA		UNIX 4.3bsd	cc		 994	 997
 * IRIS-1400	68010-10Mhz	UNIX System V	cc		 909	1000
 * IBM PC/AT	80286-6Mhz	Venix/86 2.1	cc		 961	1000
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	b16cc 2.0	 943	1063
 * Zilog S8000/11 Z8001-5.5Mhz	Zeus 3.2	cc		1011	1084
 * NSC ICM-3216 NSC 32016-10Mhz	UNIX SVR2	cc		1041	1084
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	MS 3.0(small)	1063	1086
 * VAX 11/750	w/FPA		VMS		VAX-11 C 2.0	 958	1091
 * Stride	68000-10Mhz	System-V/68	cc		1041	1111
 * Plexus P/60  MC68000-12.5Mhz	UNIX SYSIII	Plexus		1111	1111
 * ATT PC7300	68010-10Mhz	UNIX 5.0.2	cc		1041	1111
 * CCC 3230	-		Xelos R01(SVR2)	cc		1040	1126
 * Stride	68000-12Mhz	System-V/68	cc		1063	1136
 * IBM PC/AT    80286-6Mhz      Venix/286 SVR2  cc              1056    1149
 * Plexus P/60  MC68000-12.5Mhz	UNIX SYSIII	Plexus		1111	1163 T
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	Datalight 1.10	1190	1190
 * ATT PC6300+	80286-6Mhz	MSDOS 3.1	b16cc 2.0	1111	1219
 * IBM PC/AT	80286-6Mhz	PCDOS 3.1	Wizard 2.1	1136	1219
 * Sun2/120	68010-10Mhz	Sun 4.2BSD	cc		1136	1219
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	CI-C86 2.20M	1219	1219
 * WICAT PB	68000-8Mhz	System V	WICAT C 4.1	 998	1226 ~
 * MASSCOMP 500	68010-10MHz	RTU V3.0	cc (V3.2)	1156	1238
 * Alliant FX/8 IP (68012-12Mhz) Concentrix	cc -ip;exec -i 	1170	1243 FX
 * Cyb DataMate	68010-12.5Mhz	Uniplus 5.0	Unisoft cc	1162	1250
 * PDP 11/70	-		UNIX 5.2	cc		1162	1250
 * IBM PC/AT	80286-6Mhz	PCDOS 3.1	Lattice 2.15	1250	1250
 * IBM PC/AT	80286-7.5Mhz	Venix/86 2.1	cc		1190	1315 *15
 * Sun2/120	68010-10Mhz	Standalone	cc		1219	1315
 * Intel 380	80286-8Mhz	Xenix R3.0up1	cc		1250	1315 *16
 * Sequent Balance 8000	NS32032-10MHz	Dynix 2.0	cc	1250	1315 N12
 * IBM PC/DSI-32 32032-10Mhz	MSDOS 3.1	GreenHills 2.14	1282	1315 C3
 * ATT 3B2/400	WE32100-?Mhz	UNIX 5.2	cc		1315	1315
 * CCC 3250XP	-		Xelos R01(SVR2)	cc		1215	1318
 * IBM PC/RT 032 RISC(801?)?Mhz BSD 4.2         cc              1248    1333 RT
 * DG MV4000	-		AOS/VS 5.00	cc		1333	1333
 * IBM PC/AT	80286-8Mhz	Venix/86 2.1	cc		1275	1380 *16
 * IBM PC/AT	80286-6Mhz	MSDOS 3.0	Microsoft 3.0	1250	1388
 * ATT PC6300+	80286-6Mhz	MSDOS 3.1	CI-C86 2.20M	1428	1428
 * COMPAQ/286   80286-8Mhz      Venix/286 SVR2  cc              1326    1443
 * IBM PC/AT    80286-7.5Mhz    Venix/286 SVR2  cc              1333    1449 *15
 * WICAT PB	68000-8Mhz	System V	WICAT C 4.1	1169	1464 S~
 * Tandy II/6000 68000-8Mhz	Xenix 3.0	cc      	1384	1477
 * MicroVAX II	-		Mach/4.3	cc		1513	1536
 * WICAT MB	68000-12.5Mhz	System V	WICAT C 4.1	1246	1537 ~
 * IBM PC/AT    80286-9Mhz      SCO Xenix V     cc              1540    1556 *18
 * Cyb DataMate	68010-12.5Mhz	Uniplus 5.0	Unisoft cc	1470	1562 S
 * VAX 11/780	-		UNIX 5.2	cc		1515	1562
 * MicroVAX-II	-		-		-		1562	1612
 * VAX-780/MA780		Mach/4.3	cc		1587	1612
 * VAX 11/780	-		UNIX 4.3bsd	cc		1646	1662
 * Apollo DN660	-		AegisSR9/IX	cc 3.12		1666	1666
 * ATT 3B20	-		UNIX 5.2	cc		1515	1724
 * NEC PC-98XA	80286-8Mhz	PCDOS 3.1	Lattice 2.15	1724	1724 @
 * HP9000-500	B series CPU	HP-UX 4.02	cc		1724	-
 * Ridge 32C V1	-		ROS 3.3		Ridge C (older)	1776	-
 * IBM PC/STD	80286-8Mhz	MSDOS 3.0 	Microsoft 3.0	1724	1785 C2
 * WICAT MB	68000-12.5Mhz	System V	WICAT C 4.1	1450	1814 S~
 * WICAT PB	68000-12.5Mhz	System V	WICAT C 4.1	1530	1898 ~
 * DEC-2065	KL10-Model B	TOPS-20 6.1FT5	Port. C Comp.	1937	1946
 * Gould PN6005	-		UTX 1.1(4.2BSD)	cc		1675	1964
 * DEC2060	KL-10		TOPS-20		cc		2000	2000 NM
 * Intel 310AP	80286-8Mhz	Xenix 3.0	cc		1893	2009
 * VAX 11/785	-		UNIX 5.2	cc		2083	2083
 * VAX 11/785	-		VMS		VAX-11 C 2.0	2083	2083
 * VAX 11/785	-		UNIX SVR2	cc		2123	2083
 * VAX 11/785   -               ULTRIX-32 1.1   cc		2083    2091 
 * VAX 11/785	-		UNIX 4.3bsd	cc		2135	2136
 * WICAT PB	68000-12.5Mhz	System V	WICAT C 4.1	1780	2233 S~
 * Pyramid 90x	-		OSx 2.3		cc		2272	2272
 * Pyramid 90x	FPA,cache,4Mb	OSx 2.5		cc no -O	2777	2777
 * Pyramid 90x	w/cache		OSx 2.5		cc w/-O		3333	3333
 * IBM-4341-II	-		VM/SP3		Waterloo C 1.2  3333	3333
 * IRIS-2400T	68020-16.67Mhz	UNIX System V	cc		3105	3401
 * Celerity C-1200 ?		UNIX 4.2BSD	cc		3485	3468
 * SUN 3/75	68020-16.67Mhz	SUN 4.2 V3	cc		3333	3571
 * IBM-4341	Model 12	UTS 5.0		?		3685	3685
 * SUN-3/160    68020-16.67Mhz  Sun 4.2 V3.0A   cc		3381    3764
 * Sun 3/180	68020-16.67Mhz	Sun 4.2		cc		3333	3846
 * IBM-4341	Model 12	UTS 5.0		?		3910	3910 MN
 * MC 5400	68020-16.67MHz	RTU V3.0	cc (V4.0)	3952	4054
 * Intel 386/20	80386-12.5Mhz	PMON debugger	Intel C386v0.2	4149	4386
 * NCR Tower32  68020-16.67Mhz  SYS 5.0 Rel 2.0 cc              3846	4545
 * MC 5600/5700	68020-16.67MHz	RTU V3.0	cc (V4.0)	4504	4746 %
 * Intel 386/20	80386-12.5Mhz	PMON debugger	Intel C386v0.2	4534	4794 i1
 * Intel 386/20	80386-16Mhz	PMON debugger	Intel C386v0.2	5304	5607
 * Gould PN9080	custom ECL	UTX-32 1.1C	cc		5369	5676
 * Gould 1460-342 ECL proc      UTX/32 1.1/c    cc              5342    5677 G1
 * VAX-784	-		Mach/4.3	cc		5882	5882 &4
 * Intel 386/20	80386-16Mhz	PMON debugger	Intel C386v0.2	5801	6133 i1
 * VAX 8600	-		UNIX 4.3bsd	cc		7024	7088
 * VAX 8600	-		VMS		VAX-11 C 2.0	7142	7142
 * Alliant FX/8 CE		Concentrix	cc -ce;exec -c 	6952	7655 FX
 * CCI POWER 6/32		COS(SV+4.2)	cc		7500	7800
 * CCI POWER 6/32		POWER 6 UNIX/V	cc		8236	8498
 * CCI POWER 6/32		4.2 Rel. 1.2b	cc		8963	9544
 * Sperry (CCI Power 6)		4.2BSD		cc		9345   10000
 * CRAY-X-MP/12	   105Mhz	COS 1.14	Cray C         10204   10204
 * IBM-3083	-		UTS 5.0 Rel 1	cc	       16666   12500
 * CRAY-1A	    80Mhz	CTSS		Cray C 2.0     12100   13888
 * IBM-3083	-		VM/CMS HPO 3.4	Waterloo C 1.2 13889   13889
 * Amdahl 470 V/8 		UTS/V 5.2       cc v1.23       15560   15560
 * CRAY-X-MP/48	   105Mhz	CTSS		Cray C 2.0     15625   17857
 * Amdahl 580	-		UTS 5.0 Rel 1.2	cc v1.5        23076   23076
 * Amdahl 5860	 		UTS/V 5.2       cc v1.23       28970   28970

Wow.. the Cray had 105 MHz?? Did not know this..
I remember photos .. they were round and had seats on them :)
 
Thank you, Paul. I'll order my LC right away. It'll take the usual 4+ weeks to get here to Brazil, but I'll keep you updated.

If you want me to send a test version of the Mini54 implementation, please don't hesitate.

I'll update the debugger thread as soon as I have it working.
 
My input on optimization would be to make sure it fits the core library and example use cases and works with the built-in 3.1 [LC] cache constraints/features.

MichaelM mentioned in another thread that most user code sits waiting for device returns. Those devices ideally run on interrupts so anything that would push their code from cache (?) or interfere with FIFO/DMA efficiency would be bad for net throughput.

Below are two notes that suggest finding standard background tasks to run while looking for benchmark and optimization metrics. If the measurement suggests you are getting something from nothing, check for compromising effects on the rest of the system (Serial I/O, SPI, USB, PWM, I2C, ADC, DAC ???). [Optimization may affect these, but the idea was just to have them running at perhaps 25-50% CPU, functioning normally as background tasks, while testing whatever benchmark is chosen in the other 50+%]:

The one thing I saw was the @KPC work on FFT_1024 speedup - the reported Audio processing usage dropped from over 50 to under 15 in my usage of the FFT Sinewave sample - but the number of extra times I saw my loop() code entered (waiting on fft available) only went up by 6%, if I was measuring what I thought I was. The guess was added latency elsewhere, outside the Audio library counter. My loop() was empty when !available() - so it seemed the system would just have more free time to cycle through loop() - but I didn't find it doing that 35% more.

If I saw today what I think I did, a scope picture of a PWM pin got choppy during the hyper refresh of a TFT display. This example backlights the TFT with PWM, and the light wasn't steady with unlimited refresh - which is what that O-scope image might be showing, or what might make a servo jitter. In this case an unending SPI queue might explain that - but so might the wrong optimization.
 
@defragster - That's a pretty incredible number of loosely related topics to cram into just 1 message!

Regarding overall I/O interrupt strategy, please look at this thread. In particular, look around #46-47 and the following conversation. Prioritizing interrupts is still very much a work in progress. So is a nice API and dynamic allocation scheme for software interrupts....

The audio library FFT issue is a matter of peak CPU usage during 2.9 ms update intervals. That's quite different than the ordinary concept of overall CPU usage, as in conventional computing. The current FFT code does the entire FFT every 4th update. That takes about 1.5 ms of the 2.9 ms update. In terms of the audio library meeting its real-time goals, that's 52% CPU usage. But in terms of traditional overall CPU usage, it's 1.5 ms of work done every 11.6 ms, or about 13% overall. That code (which I admittedly haven't used yet... it's on my lengthy TO-DO list) attempts to spread the work out more evenly over the 4 updates. I believe it's using the radix-2 algorithm, which requires slightly more work overall. But again, actually digging into that code and merging it with the official audio library is on my list.
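
The two percentages come from dividing the same 1.5 ms of FFT work by two different time windows. A minimal sketch of that arithmetic, using only the numbers quoted in this post (the function names are made up for illustration and are not part of the audio library):

```cpp
#include <cstdio>

// Peak usage: share of ONE audio update interval consumed by the work.
double peak_usage_percent(double work_ms, double update_ms) {
    return 100.0 * work_ms / update_ms;
}

// Overall usage: the same work, but it only runs once every `every_n` updates,
// so it is averaged over every_n update intervals.
double overall_usage_percent(double work_ms, double update_ms, int every_n) {
    return 100.0 * work_ms / (update_ms * every_n);
}

// Prints both figures for the FFT case described above:
// 1.5 ms of work, 2.9 ms update interval, FFT done every 4th update.
void print_fft_usage() {
    std::printf("peak: %.0f%%  overall: %.0f%%\n",
                peak_usage_percent(1.5, 2.9),
                overall_usage_percent(1.5, 2.9, 4));
}
```

The peak figure is the one that matters for meeting the real-time deadline; the overall figure is what a conventional CPU load meter would report.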

That scope trace is probably an analog effect, unrelated to software optimization. Probably?

One thing I have learned about optimization is it's often counterintuitive. Almost every time I've put a lot of work into carefully benchmarking and studying software and hardware performance, as I've indeed done this many times, the result is almost always that my initial assumptions weren't totally right. Usually not totally wrong either. But often, it turns out something I had believed was a major factor turned out to be relatively minor, and some other thing I'd discounted as unimportant, well, wasn't, and more often than not some unanticipated thing turns up.

So I try to resist the urge to do optimization and plan abstract optimization strategies based on assumptions and ideas only. Real benchmarking and truly digging deep into the real performance issues in an actual application takes a LOT more work, but it almost always brings unanticipated issues to light, and puts the relative merit of different things into clear perspective. If I seem unwilling or resistant to optimization stuff, please understand my resistance isn't lack of interest in optimizing. It's only disinterest in abstract & general optimization ideas. I believe in talking about real, well-run benchmarks and deeply analyzing actual performance in realistic use cases.
 
Yeah, I've seen that over and over again in my 35+ years working on compilers. We've seen some cases, where we removed a slow instruction that saved/restored results nobody used anymore, and in doing so, it made one benchmark even slower.
 
Thanks Paul - As usual, longer than intended, and apparently I missed the mark. I wasn't looking for answers or unification - more exactly the opposite: saying that having benchmarks running in a system with a timer interrupting to do casual stuff might point out losses from changes that would go unseen without a metric. Changing compiler flags is global - not like fixing a hot spot - so I agree with your last paragraph. *Stopping now - this is where I usually edit in*
 
Hm, I don't want a big discussion here, but I believe you all missed my point a bit, or perhaps I didn't explain it:
Michael, Paul and perhaps some others here are able to write code that is fast, even without any help from the compiler. They don't even think about this, they do it automatically.
They know that certain constructs are fast and others are not.

But not the average Arduino beginner or user. They usually don't think about inlining, loop unrolling, moving constants outside of loops, jump tables, optimized switches, lookup tables, integer arithmetic and so on.
The compiler can't help with all of these topics either, that's correct (the best optimizer is the user), but in some cases the compiler optimizations are really helpful. The more, the better.
Look at my example "unexperienced user" in the "benchmark". This little inlining improves the speed drastically, but -O1 does not help here.
I know that most sketches don't need any optimization; OK, use -Os then. But in some cases you want it as fast as possible.
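
To make one item from that list concrete, here is a small sketch of "moving constants outside of loops". It is not the "unexperienced user" example from the benchmark (that source is not shown in this thread); the function names and the num/den gain are purely hypothetical. Whether a given -O level performs this hoisting on its own has to be checked in the generated code.

```cpp
#include <cstddef>
#include <cstdint>

// Beginner-style version: the gain (num / den) never changes inside the loop,
// but it is written so that it is recomputed on every pass.
void scale_naive(std::int32_t* data, std::size_t n,
                 std::int32_t num, std::int32_t den) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = data[i] * (num / den);
    }
}

// Hand-optimized version: the loop-invariant division is done once, up front.
void scale_hoisted(std::int32_t* data, std::size_t n,
                   std::int32_t num, std::int32_t den) {
    const std::int32_t gain = num / den;  // computed a single time
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= gain;
    }
}
```

Both functions produce the same results; the only difference is how much work the source asks for inside the loop, which is exactly the part an optimizer may or may not clean up for you.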

Then, I haven't seen any example for ARM so far that is significantly slower with -O2. Perhaps someone can show one?

Edit:
Sorry, my English is not good, I have difficulty saying exactly what I want.
 
The answer is it really depends. In general, we in GCC land try to make -O1/-O2/-O3 give you better performance for higher optimization levels at a cost of slower compilation times. But compilers (and OSes, libraries, etc.) are all complex pieces of code.

Sometimes, two different optimizations interfere with each other, so it makes things slower.

Sometimes the compiler writer does not have all of the information (look, I work for the company that designs the chips, and even there we continually find new things about the hardware that are not documented in the internal documentation, let alone the external documentation).

Sometimes when you have multiple chip makers making chips with the same instruction set, one might speed some things up and slow other things down. If the compiler was tuned for one, it might make the other slower.

Sometimes, as I mentioned, when you do optimizations it makes things bigger, so that code that fit in a cache before no longer fits, and executing the program means you have more loads and stores.

Sometimes there is just a bug in the compiler.

I spend a good deal of my work life tracking down why things are slower. So I wouldn't be surprised if there are programs that run faster at -O1 rather than -O2. But it takes a lot of skull sweat to figure out what is really going on. The takeaway should be try features on the real hardware. Sometimes the magic works, and sometimes it does not. But until you actually test and measure it, it is just speculation.
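
As a rough illustration of "test and measure", here is a minimal timing harness. It is only a sketch in portable C++ using std::chrono; on a Teensy you would more likely bracket the code with micros(), as the benchmark sketches in this thread appear to do. The name fastest_run_ns is made up, and reporting the fastest of several runs is just one common choice, not the only reasonable one.

```cpp
#include <chrono>
#include <cstdint>

// Runs `work` `repeats` times and returns the fastest run in nanoseconds.
// Returns -1 if repeats is zero or negative.
template <typename Work>
std::int64_t fastest_run_ns(Work work, int repeats) {
    std::int64_t best = -1;
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();  // the code under test
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count();
        if (best < 0 || ns < best) {
            best = ns;  // keep the shortest observed time
        }
    }
    return best;
}
```

Build the same harness and workload once per optimization level and compare the numbers, rather than reasoning about which level ought to win.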

In the server/desktop space, CPUs have a bunch of counters that, when you enable the appropriate ones, can tell you if there is a particular slowdown. Similarly, in the higher-end embedded world, you have simulators and such that allow you to get a detailed view of what the issues are. Since I don't work on ARM processors for my day job, I have not looked to see what type of support there is. Paul has the rather thick datasheet for the processors used in the Teensys and that can be a starting point.
 
Yes, I understand this, I have known it longer than you would think... :)
Indeed, on AVR -Os was the fastest (if I remember correctly) - but that was because AVR is 8-bit and GCC is not written for that use case.

But I promised not to talk about this too much, so this will be my last post regarding "-O1/-O2".

- Compilation time doesn't matter much today. Perhaps if you have thousands of lines of code, okay. But not with Teensy (show me the sketch?). Then the question is what is more important - the user knows this, and can choose between the different options.

- Size (?) Can you show me some Teensy 3.1 project that uses the whole flash? Or half of it? (OK, my web radio with mp/aac/flac/TFT-Display/WLAN/flashplayer/IR-Remote/SD currently needs much less than 200 KB with -O3 .. what a waste :) !! (hehe, I'll find a way to use the free space *g*)) Where's the point in having as much unused flash as possible? If this is important for whatever reason, use -Os then.

- You rely on tests and benchmarks... but I still haven't seen one that is significantly slower with -O2. (For Teensy 3 please, not any other architecture.)

- Cache? Let me ask (I really don't know): how much cache does the Teensy have? Does more than one tight loop fit in it? How much is it? 128 bytes? 256? Sorry to take up your arguments :) - this is too theoretical and abstract. Show a benchmark where -O2 is slower than -O1 - perhaps because of the cache?

-O2 is pretty much standard. Any idea why?

So.. don't get me wrong: it was and is only a suggestion to use -O2 instead of -O1. Nothing more. It's in no way important for me.
For me, it plays absolutely NO role, because I'm using my own settings anyway :) So... do what YOU think is better.


Edit: ok, one last, from the documentation only, but perhaps someone should inform them that they are wrong in most cases?
-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
 
Then, I haven't seen any example for ARM so far that is significantly slower with -O2. Perhaps someone can show one?

Here's 4 quick benchmarks. 3 are attached files below. The other is File > Examples > ILI9341_t3 > graphicstest.

I ran all 4 of these, with -O and -O2. Results are copied below.

In many cases, -O2 is about 0.5% to 1.0% faster than -O. But in a couple, -O2 is dramatically slower. Look at the flops test, and the benchmark_memcpy "loop" test.


Code:
flops
-----
-O
Float (4 bytes)  multiplications per second:    (888746)

-O2
Float (4 bytes)  multiplications per second:    (780809)


ILI9341_t3 graphicstest example
-------------------------------
-O
Screen fill              280099
Text                     16621
Lines                    73153
Horiz/Vert Lines         22943
Rectangles (outline)     14599
Rectangles (filled)      581627
Circles (filled)         85967
Circles (outline)        73601
Triangles (outline)      17685
Triangles (filled)       191978
Rounded rects (outline)  33292
Rounded rects (filled)   634488

-O2
Benchmark                Time (microseconds)
Screen fill              280101
Text                     15912
Lines                    73083
Horiz/Vert Lines         22940
Rectangles (outline)     14598
Rectangles (filled)      581630
Circles (filled)         86116
Circles (outline)        73238
Triangles (outline)      17676
Triangles (filled)       191458
Rounded rects (outline)  33046
Rounded rects (filled)   634447


fft8
----
-O
1024  2946
 512  1371
 256 632
 128 291
  64 133
  32 61
  16 28

-O2
1024  2924
 512  1361
 256 630
 128 290
  64 133
  32 61
  16 29


benchmark_memcpy
----------------
-O
memcpy32 0
memset32 44
loop 75
memcpy 51
memset 25
set loop 53

-O2
memcpy32 0
memset32 45
loop 96
memcpy 51
memset 25
set loop 53
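
To put numbers on "dramatically slower", the relative change between the two builds can be computed directly from the results above. This is only a sketch: the helper name is made up, and it assumes the flops figure is a rate (higher is better) while the benchmark_memcpy figures are elapsed times (lower is better); the units of the latter are not stated in the output.

```cpp
#include <cstdio>

// Relative change of `other` versus `baseline`, in percent.
// Positive means `other` is larger than `baseline`.
double percent_change(double baseline, double other) {
    return 100.0 * (other - baseline) / baseline;
}

// Prints the change from -O to -O2 for the two outliers in the results.
void print_outliers() {
    // flops: multiplications per second, so a negative change is a slowdown.
    std::printf("flops rate: %+.1f%%\n", percent_change(888746.0, 780809.0));
    // memcpy "loop": treated as elapsed time, so a positive change is a slowdown.
    std::printf("loop time:  %+.1f%%\n", percent_change(75.0, 96.0));
}
```

Under those assumptions the flops rate drops by roughly 12% and the "loop" time grows by 28%, against gains of around 1% or less in most of the other rows.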
 

Attachments: benchmark_memcpy.ino, fft8.ino, flops.ino
loop:
That's indeed interesting.
For -O3 it is as fast as memcpy, that's damn good!
But -O2?? There's something going wrong....
 
This is the assembler:
Code:
     5aa:	f000 f93b 	bl	824 <micros>
     5ae:	2400      	movs	r4, #0
     5b0:	9000      	str	r0, [sp, #0]
     5b2:	5933      	ldr	r3, [r6, r4]
     5b4:	512b      	str	r3, [r5, r4]
     5b6:	3404      	adds	r4, #4
     5b8:	f5b4 5f80 	cmp.w	r4, #4096	; 0x1000
     5bc:	4f3a      	ldr	r7, [pc, #232]	; (6a8 <loop+0x18c>)
     5be:	d1f8      	bne.n	5b2 <loop+0x96>
     5c0:	f000 f930 	bl	824 <micros>

Looks not too bad (?), but I don't understand the ldr r7.. it loads a constant from flash, perhaps the address of the array:

6a8: 1fff9068 .word 0x1fff9068
6ac: 1fffa068 .word 0x1fffa068
 
I also don't understand the r7; it is not used in the loop, so why is it there?
As for the assembly, the loop probably uses 8 cycles (excluding the r7 line). By using post-increment pointers, you could save one cycle (without unrolling).
Edit: sorry, I don't know the original source; the assembly looks like a memcpy of 4096 bytes. Maybe the r7 address is declared volatile, in which case the compiler sometimes won't optimize the load away. For example, imagine that a read of a special-purpose address can trigger some event.
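
Since the sketch source is not shown here, the following is only a hypothetical reconstruction of what such a loop might look like in C++: a word-by-word copy of 4096 bytes, written once with an index and once with incrementing pointers. The names are invented, and the compiler is free to generate the same code for both forms.

```cpp
#include <cstddef>
#include <cstdint>

// 4096 bytes copied as 32-bit words gives 1024 loop iterations,
// matching the "adds r4, #4" / "cmp.w r4, #4096" pair in the listing.
constexpr std::size_t kWords = 4096 / sizeof(std::uint32_t);

// Indexed form: base pointers stay fixed and a separate offset is advanced.
void copy_indexed(std::uint32_t* dst, const std::uint32_t* src) {
    for (std::size_t i = 0; i < kWords; ++i) {
        dst[i] = src[i];
    }
}

// Pointer form: both pointers advance, which gives the compiler the chance
// to fold the increment into the load and store instructions.
void copy_pointers(std::uint32_t* dst, const std::uint32_t* src) {
    const std::uint32_t* const end = src + kWords;
    while (src != end) {
        *dst++ = *src++;
    }
}
```

Comparing the disassembly of both forms at each optimization level would show whether the extra add instruction comes from the source style or from the optimizer.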
 
My thought is adding a 'load blink' button to the teensy loader software would help the support side.

The idea is to help those people popping up on the forum struggling with apparently bricked Teensys to bypass all IDE/compiler/install related issues and just get a blink.hex, chosen based on the detected device, successfully downloaded to confirm the hardware is still alive.

Happy to be told this is a silly idea given it'll bloat the loader out with 'features', but I do see a lot of posts on 'have you tried loading blink' which could be 'have you pressed the "is it alive" button'.
 

They fixed the "loop" bug, but now the second memset (the one from newlib!) is slower(!?) But perhaps they changed some defines there..
Results with 4.9-2015-q1-update:
benchmark_memcpy
----------------

-O2
memcpy32 0
memset32 45
loop 54
memcpy 51
memset 34
set loop 43

-O1
memcpy32 1
memset32 44
loop 66
memcpy 51
memset 33
set loop 54

Edit: Added results for -O1
Obviously, they found the bug and fixed it :) -O2 is now faster, as expected.

Edit: Disassembled:
Code:
     5b8:	f000 f93e 	bl	838 <micros>
     5bc:	4942      	ldr	r1, [pc, #264]	; (6c8 <loop+0x19c>)
     5be:	4b3c      	ldr	r3, [pc, #240]	; (6b0 <loop+0x184>)
     5c0:	9000      	str	r0, [sp, #0]
     5c2:	f856 2f04 	ldr.w	r2, [r6, #4]!
     5c6:	f843 2f04 	str.w	r2, [r3, #4]!
     5ca:	428e      	cmp	r6, r1
     5cc:	d1f9      	bne.n	5c2 <loop+0x96>
     5ce:	f000 f933 	bl	838 <micros>
 
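This disassembly looks like 7 cycles (2 for the load, 2 for the store, 1 for the compare and 2 for the branch), which I think is the optimum. As said, the pointer increment is now folded into both the load and the store (strictly speaking, the [r6, #4]! form is pre-indexed with writeback rather than post-increment). I don't think you can get better without unrolling.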
 
Obviously, they found the bug and fixed it :) -O2 is now faster, as expected.

Good news! :)

I want to wait until at least June to make a toolchain upgrade. By then, the new trend & pace of Arduino releases will become much clearer. It'll also give me an opportunity to speak in person with Massimo Banzi in May.

We may try to get Teensy and Arduino Due (and Zero, if they officially release) synced to use the same toolchain version. Or maybe not. It's been talked about briefly, but nothing is clear yet.

I also want to focus on other stuff for a few months, so for all these reasons I want to leave the toolchain as it is for the next few months.
 
Yes, a good decision, but don't wait years instead of months :)
A hint (I think I have mentioned it in another thread): the launchpad team switched to a newer newlib.
Some edits in avr_functions.h are needed:
Code:
char * ultoa(unsigned long val, char *buf, int radix);
char * ltoa(long val, char *buf, int radix);
//static inline char * utoa(unsigned int val, char *buf, int radix) __attribute__((always_inline, unused));
//static inline char * utoa(unsigned int val, char *buf, int radix) { return ultoa(val, buf, radix); }
//static inline char * itoa(int val, char *buf, int radix) __attribute__((always_inline, unused));
//static inline char * itoa(int val, char *buf, int radix) { return ltoa(val, buf, radix); }
char * dtostrf(float val, int width, unsigned int precision, char *buf);
(around line 97)
 
@Frank B, The cortex M4 keeps on amazing me. Depending on number of bytes copied, the loop is either 5 or 6 cycles. Probably due to caching
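
One way to cross-check cycle estimates like these against the benchmark output is to convert an elapsed time back into cycles per iteration. The sketch below rests on assumptions that are not stated in this thread: that the benchmark_memcpy figures are microseconds, that the CPU runs at 96 MHz, and that the loop makes 1024 passes (4096 bytes, 4 bytes per pass, as in the first disassembly). The function name is made up, and the result also includes the overhead of the timing calls.

```cpp
#include <cstdint>

// Average CPU cycles spent per loop iteration.
// elapsed_us : measured time in microseconds (assumed unit)
// cpu_mhz    : CPU clock in MHz (96 is an assumption, not from the thread)
// iterations : number of loop passes (1024 for a 4096-byte word copy)
double cycles_per_iteration(double elapsed_us, double cpu_mhz,
                            std::uint32_t iterations) {
    return elapsed_us * cpu_mhz / static_cast<double>(iterations);
}
```

Under those assumptions, the "loop" results of 96, 75 and 54 work out to about 9, 7 and 5 cycles per pass respectively, which is at least in the same range as the cycle counts discussed above. With a different clock setting the numbers scale accordingly.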
 