Teensyduino 1.34 Beta #1 (ARM Toolchain Update)



Paul
12-16-2016, 01:50 PM
Here is a first beta test for Teensyduino 1.34.

This version updates the ARM toolchain (https://forum.pjrc.com/threads/40829-Toolchain-update-will-need-help-testing) used to compile for Teensy LC and 3.x.


Old beta download links removed. Please use the latest version:
https://www.pjrc.com/teensy/td_download.html


Changes since Teensyduino 1.33:

Update ARM Toolchain to gcc 5.4 (was gcc 4.8)
Add Tools > Optimize menu
Fix driver install/update on Windows 7 & 8
Prevent "might not have installed correctly" message on Windows

Paul
12-16-2016, 01:58 PM
This beta adds a Tools > Optimize menu, so you can easily experiment with 10 different optimization settings.

[Attachment 9163: screenshot of the Tools > Optimize menu]

LTO is Link Time Optimization. It can really reduce the size and increase the speed of many programs. But it might come with compatibility issues.

Fastest, Faster and Fast are optimization flags -O3, -O2 and -O. Debug is -Og, and Smallest is -Os plus use of the nano C library.

Please give these various settings a try. We probably have a couple months before Arduino releases a new version. If this toolchain update or LTO breaks too many programs & libraries, I can always revert to the 4.8 toolchain. But hopefully we can play with this quite a bit in the coming weeks and build enough confidence to use it in a non-beta release.
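
For reference, the menu-to-flag mapping described above can be written down as a small lookup. This is only a sketch: the menu names and -O levels come from the post, while "-flto" and "--specs=nano.specs" are the usual GCC/newlib spellings and are an assumption about what the beta actually passes to the compiler. The function name optimize_flags is made up for illustration.

```cpp
#include <map>
#include <string>

// Menu names and -O levels are taken from the post above.
// ASSUMPTION: "-flto" and "--specs=nano.specs" are the standard GCC/newlib
// spellings; the exact flags Teensyduino passes are not shown in this thread.
std::string optimize_flags(const std::string& menu, bool lto) {
    static const std::map<std::string, std::string> levels = {
        {"Fastest", "-O3"},
        {"Faster", "-O2"},
        {"Fast", "-O"},
        {"Debug", "-Og"},
        {"Smallest", "-Os --specs=nano.specs"},
    };
    auto it = levels.find(menu);
    if (it == levels.end()) return "";  // unknown menu entry
    return lto ? it->second + " -flto" : it->second;
}
```

Five levels, each with and without LTO, gives the 10 settings mentioned above.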

KurtE
12-16-2016, 02:02 PM
Here is a first beta test for Teensyduino 1.34.
Linux ARM: (still struggling to build the new toolchain...)
Now downloading windows version...

Curious, what are you using to build the Arm version? RPI? If so you would probably get much faster response using something like an Odroid XU4, especially with an EMMC.
http://ameridroid.com/products/odroid-xu4

PaulStoffregen
12-16-2016, 02:07 PM
Does Odroid XU4 use the same ARM ABI? All the info I've found says RPi is v6 and upwards compatible with v7 or v8, but programs compiled on those boards can't run on v6.

But if there is a way, I'd love to use faster hardware! So far I've been using the original model B. Ordered a RPi3. It's supposed to arrive early next week.

MichaelMeissner
12-16-2016, 02:08 PM
Now downloading windows version...

Curious, what are you using to build the Arm version? RPI? If so you would probably get much faster response using something like an Odroid XU4, especially with an EMMC.
http://ameridroid.com/products/odroid-xu4

I believe the Odroid uses Arm in 64-bit mode, and Raspberry Pi uses Arm 32-bit chips. I don't know if you can run/build 32-bit binaries on a 64-bit system.

KurtE
12-16-2016, 02:19 PM
I believe the Odroid uses Arm in 64-bit mode, and Raspberry Pi uses Arm 32-bit chips. I don't know if you can run/build 32-bit binaries on a 64-bit system.
Actually Odroid XU4 and C1 are 32 bit. It is the C2 which is 64 bit.

defragster
12-16-2016, 04:43 PM
PURGED - wow <57 MB downloaded! TD_1.33 was 72 MB.

Question - would it be easy for 'somebody' to make a makefile that would casually build multiple binaries with a single command?

-O3, -O2 and -O. Debug is -Og, and Smallest is -Os plus use of the nano C library

<edit>: TD_1.34 Installed fine on copy of 1.6.13, first few default compiles of open sketches no problems. Just this one accurate warning I see that may not be new:

TimedStartBlink_SerEv:55: warning: array subscript has type 'char'
That added 'optimize' Menu does look nice.
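
Line 55 of that sketch isn't shown here, so the following is only a guess at the usual cause of this warning: a plain char used directly as an array index. A minimal sketch of the pattern and the common fix (the function name count_bytes is hypothetical):

```cpp
#include <cstddef>

// Counts how often each byte value occurs in a C string.
// 'counts' must have room for 256 entries.
void count_bytes(const char* text, unsigned counts[256]) {
    for (std::size_t i = 0; i < 256; i++) counts[i] = 0;
    for (; *text; ++text) {
        // counts[*text] would trigger "array subscript has type 'char'",
        // and on targets where char is signed a byte >= 0x80 would index
        // before the start of the array. The cast makes the index 0..255.
        counts[static_cast<unsigned char>(*text)]++;
    }
}
```

The warning is accurate in the sense that the index type is ambiguous across platforms; the cast states the intent explicitly.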

manitou
12-16-2016, 04:49 PM
I believe the Odroid uses Arm in 64-bit mode, and Raspberry Pi uses Arm 32-bit chips. I don't know if you can run/build 32-bit binaries on a 64-bit system.

raspberry pi 3 is 64-bit, gcc 4.9.2,debian(jessie)

I like the "optimize window"

defragster
12-16-2016, 04:51 PM
raspberry pi 3 is 64-bit, gcc 4.9.2,debian(jessie)

I like the "optimize window"

I got a pi 3 months back - unpowered yet - but I understood it was still using 32 bit Jessie for compatibility to all existing code/usage?

Frank B
12-16-2016, 05:49 PM
I'm getting the following errors in some of my sketches:

- error: 'strncmpi' was not declared in this scope
- error: 'strcasestr' was not declared in this scope

(not from libs, but from my own code)



Edit: They should be in string.h - i've included that lib (and it worked with older teensyduino-versions)

Frank B
12-16-2016, 06:17 PM
string.h:


#if __GNU_VISIBLE
char *_EXFUN(strcasestr,(const char *, const char *));


I added "#define __GNU_VISIBLE 1" to my Sketch, and now the compiler complains:



c:\arduino\hardware\tools\arm\arm-none-eabi\include\sys\features.h:256:0: note: this is the location of the previous definition

#define __GNU_VISIBLE 0


features.h :


#ifdef _GNU_SOURCE
#define __GNU_VISIBLE 1
#else
#define __GNU_VISIBLE 0
#endif


, so _GNU_SOURCE is not defined.

Is there a compiler-switch missing ?
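
Going by the features.h excerpt above, the declaration only becomes visible when _GNU_SOURCE is defined before the header is first included; whether the Arduino build gives a sketch a chance to do that is not established in this thread. Until that is sorted out, a sketch can carry its own fallback. The version below is a minimal illustration, and find_ignore_case is a hypothetical name, not a library function:

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical stand-in for strcasestr(): returns a pointer to the first
// case-insensitive occurrence of 'needle' in 'haystack', or nullptr.
const char* find_ignore_case(const char* haystack, const char* needle) {
    if (*needle == '\0') return haystack;
    for (; *haystack; ++haystack) {
        std::size_t i = 0;
        while (needle[i] != '\0' && haystack[i] != '\0' &&
               std::tolower(static_cast<unsigned char>(haystack[i])) ==
               std::tolower(static_cast<unsigned char>(needle[i]))) {
            ++i;
        }
        if (needle[i] == '\0') return haystack;  // matched the whole needle
    }
    return nullptr;
}
```

It is a plain O(n*m) search, which should be fine for the short strings typical in sketches.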

duff
12-16-2016, 06:25 PM
Audio FFT example doesn't work with any Optimize menu options. I'm using Arduino 1.6.13

manitou
12-16-2016, 07:47 PM
you're right, RPI3 jessie is still running in 32 bit mode.

i tested 1.6.12 with 1.34beta1 on mac os,
coremark:
previously T3.2@96mhz -O2 189.4 iterations/sec | with LTO fastest 207.29
previously T3.6@180mhz -O2 384.0 | with LTO fastest 447.7
... so many optimization choices ...


T3.6@180mhz coremark
fastest LTO 447.676389
fastest 463.692033
faster LTO 437.121360
faster 434.528617
fast LTO 333.619557
fast 333.032915
small LTO 323.248789 no float printf
small 320.692182

manitou
12-16-2016, 07:52 PM
is compiler doing hardware floating point? my linpack benchmark with LTO fastest is giving < 1 megaflop, should be 30 megaflops

is it really using gcc 5.4, if i look at version of /Applications/Arduino.app//Contents/Java/hardware/tools/arm/arm-none-eabi/bin/gcc
it still says 4.8.4
maybe i'm looking in wrong place (i don't use mac that much)

EDIT found it /Applications/Arduino.app//Contents/Java/hardware/tools/arm/bin/arm-none-eabi-gcc-5.4.1

EDIT 2, restarted IDE floating point seems OK now ... pilot error i guess

brtaylor
12-16-2016, 08:04 PM
Installed just fine on Fedora 25 and Arduino 1.6.13. I'll let you know if I run into anything as I use it...

defragster
12-17-2016, 09:45 AM
One simple sketch: TD_1.34 beats TD_1.33, and using LTO on TD_1.34 makes a real difference

I created a sketch watching pin 3 with FreqMeasure.countToFrequency().

Then in the sketch under test with pin 12 as output in loop() I do :: #define q12() {GPIOC_PTOR=128;} // Toggle pin 12

That pin 12 feeds the first T_3.6 pin 3 and shows the cycle rate each sec.

So far that - with a not quite empty loop() on a T_3.6 at 240 MHz shows:



TD_1.33 [ using an empty yield() ]:: == 1,153,846.13
TD_1.33 [ using PJRC yield() ]:: == 451,127.81



TD_1.34 Without LTO [ using an empty yield() ]:: FASTER == FASTEST == 1,304,347.88 Hz

TD_1.34 With LTO [ using an empty yield() ]::
FASTER = 1,935,483.88 Hz
FASTEST = 1,818,181.88 Hz

TD_1.34 With LTO [ using PJRC yield() ]::
FASTER = 857,142.88
FASTEST = 800,000.00

PaulStoffregen
12-17-2016, 10:33 AM
Raspberry Pi 3 is up and running here, after a few false starts with weaker power supplies. It really does draw over 2 amps when running at full load! (edit: or maybe not... could have been just crappy wall wart power supplies...) Building the toolchain now. Seems to be going a *lot* faster. The Broadcom chip gets scorching hot. I've got a 120mm fan cooling the whole thing.

Edit: "vcgencmd measure_temp" says "temp=51.0'C", and that's with a big 120mm fan blowing down.


KurtE
12-17-2016, 01:11 PM
Hi Paul,

That sounds HOT! It is interesting that last week I purchased a RPI3 as it looks like Trossen Robotics will be using them in some of their products... Mine came with two small heat sinks to stick on the two chips and with the package I purchased from Amazon a 2.5 amp power supply.

For the fun of it, I pulled my spare Odroid Xu3-lite out from my cabinet (my 2 XU4's are mounted in two robots) and I downloaded that latest Ubuntu 16.04 image and updated the 32gb EMMC...

I saw a project up in your github: https://github.com/PaulStoffregen/ARM_Toolchain_2016q3_Source And it looked like maybe it was the one you are trying to build?

So I downloaded it to Odroid and tried to follow your steps. I did not create virtual memory yet, but went through some of the steps.

It looked like it completed: ./build-prerequisites.sh
But it died somewhere in: ./build-toolchain.sh

If this is what you are building, it will be interesting to see the differences in time building between using this and RPI3. But some of this may depend on if the build process can use more of the cores of the 8 cores of the processor...

If it works, I could probably loan it to you.

PaulStoffregen
12-17-2016, 02:14 PM
Yup, that's the new toolchain.

I almost got through ./build-toolchain.sh, until the USB hub I was using for power died. It ran for approx 5 hours to get into 2nd gcc compile stage.

On the old Pi (original version 1) running wheezy (from 2014), it complains about a gcc bug after about a day compiling.

This one is the old toolchain. It builds in approx 47 hours on the old Pi running wheezy.

https://github.com/PaulStoffregen/ARM_Toolchain_2014q3_Source

I'm restarting the new toolchain build again on the Pi3 with jessie. This time I've added a small heatsink on the chip and a lab bench power supply capable of 10 amps.

KurtE
12-17-2016, 03:07 PM
I have restarted the build on the Odroid 3 times now, as it failed with missing some different things I need to apt-get install... It is using one of their 4amp power supplies. And I don't think it is overheating too badly as the automatic fan is only on part of the time. So I don't think the automatic throttling of the Cores down to 900mhz is happening due to heat... Could be wrong.

It was good I added 2GB of virtual memory using the information in: https://www.digitalocean.com/community/tutorials/how-to-add-swap-on-ubuntu-14-04

As the htop command has shown all 8 cores doing something and at times the SWP being used.

Not sure how long this one will run.

It does run reasonably warm by looking at:

odroid@odroid:~$ cat /sys/devices/10060000.tmu/temp
sensor0 : 77000
sensor1 : 69000
sensor2 : 83000
sensor3 : 77000
sensor4 : 65000
I think it throttles when it gets to 100C so here shows maybe 83C high


Frank B
12-17-2016, 09:35 PM
Can you add something like the defs.h solution (https://forum.pjrc.com/threads/38533-HOWTO-Store-Projects-settings-(like-F_CPU-USB-Keyboard-layout)) ? It really makes life easier... :)

PaulStoffregen
12-18-2016, 12:27 AM
I've completed building the new toolchain for Raspberry Pi and (hopefully) other ARM-based machines.

Edited the first post. The installer for linuxarm is now available.

If anyone tests on non-RPI boards, or anything other than Raspberry Pi 3, please let me know if it worked?

PaulStoffregen
12-18-2016, 01:07 AM
Has anyone tried other stuff depending on the arm_math.h? Does it work with and/or without LTO?

My guess is the library probably needs to be rebuilt for LTO. Earlier this year I started this work, in anticipation of someday needing to be able to rebuild it....

https://github.com/PaulStoffregen/arm_math

defragster
12-18-2016, 07:24 AM
Talkie compiles and runs and sounds right with PJRC PROP on T_3.6 @240 MHz Faster and Fastest +/- LTO : no warnings or errors. Same at 180 Fastest and Smallest with LTO.

<edit>: Verify compiled for T_3.2 - no warn/err - but not uploaded tested.

Frank B
12-18-2016, 12:58 PM
Do you have plans to use a newer version ? arm_math is from 2012

PaulStoffregen
12-18-2016, 01:12 PM
The newer arm_math versions have a huge FFT bit reversal table which doesn't fit into Teensy's flash memory.

I'm considering merging some or all new features, but those FFT tables need to be restructured (a unique table for each FFT length) for compatibility with Teensy.

KurtE
12-18-2016, 03:04 PM
I've completed building the new toolchain for Raspberry Pi and (hopefully) other ARM-based machines.

Edited the first post. The installer for linuxarm is now available.

If anyone tests on non-RPI boards, or anything other than Raspberry Pi 3, please let me know if it worked?
I installed it on the Odroid XU3-lite (on which I also finally completed the build of the toolchain, on the 4th attempt due to missing things).

I installed it on the XU3 and programmed two different T3.6s, including one of my boards that backs the touch screen. I then tried it with my version of the ILI9341_t3n library, with SPIN, using my example version of the touch paint sketch that is for the right touch controller, and it worked.

KurtE
12-18-2016, 04:50 PM
Not sure If I should mention it here or different thread, but thought I would try downloading the FreeRTOS to try out the float print issue.

On my machine with 1.6.13, this beta on Windows 10 64 bit, when I try to compile the app in that thread or the frBlink Example it fails for T3.6


C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: In function 'systick_isr':
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: warning: implicit declaration of function 'xPortSysTickHandler' [-Wimplicit-function-declaration]
if (sysTickEnabled) xPortSysTickHandler();
^
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: At top level:
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:176:6: warning: conflicting types for 'xPortSysTickHandler'
void xPortSysTickHandler( void );
^
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: note: previous implicit declaration of 'xPortSysTickHandler' was here
if (sysTickEnabled) xPortSysTickHandler();
^
C:\Users\Kurt\AppData\Local\Temp\ccBigtf2.ltrans11.ltrans.o: In function `pendablesrvreq_isr':
<artificial>:(.text+0x1e): undefined reference to `vTaskSwitchContext'
collect2.exe: error: ld returned 1 exit status

Error compiling for board Teensy 3.6.

The frBlink built on Arduino 1.6.9 with TD1.31 (still had the same warnings)

PaulStoffregen
12-18-2016, 05:38 PM
I just ran all the arm_math examples. They all seem to work with the new toolchain.

The linear interpolation example (https://github.com/PaulStoffregen/arm_math/tree/master/tests/linear_interpolation_f32) is odd. I get slightly different results depending on the optimization used. Amazingly, the fastest with LTO seems to completely optimize away all the computations and the massive 736 kbyte data table!

Frank B
12-18-2016, 06:10 PM
? Without looking at/running the example - does the compiler (with lto) detect that the result is always the same and just outputs consts ?
That would be amazing.

Edit: Did you try "fastmath", too ? I wonder, how good/bad the results are.

PaulStoffregen
12-18-2016, 06:53 PM
Yes. I just looked at the generated code. It's completely replacing arm_linear_interp_f32() and the huge testInputSin_f32[] array with pre-computed results. But arm_linear_interp_f32() is a static inline function completely defined within arm_math.h, so I guess this isn't too surprising. None of the other optimization setting eliminate the huge const array.

It's also doing some pretty substantial optimization with accessing lots of variables from the same base address register using indexed addressing mode, which makes the generated assembly much harder to read.

defragster
12-18-2016, 07:14 PM
Found a <edit> REMINDER. Looking back to the code I did for : Teensy3-alternative-for-dtostrf() (https://forum.pjrc.com/threads/1227-Teensy3-alternative-for-dtostrf()?p=86723&viewfull=1#post86723)

I pulled the sketch sources from there - combined two posts: 9183
{ Output prints only once so uses :: while (!Serial ); }

TD_1.34b1 works as it did in TD_1.33 and before when compiled Fast or above, it seems, but going to "SMALLEST" doesn't produce the output for printf or sprintf, as follows. A short snip [LTO doesn't affect output]:

Fast [ no compile warn/error ]show this as expected::

dtostrf(y, 10, 5, cbuf); == -0.00001
sprintf(cbuf, "%10.5f", y); == -0.00001
printf("%f", y);-0.000012

... // clipped

------
-0.00001
-0.00001234567935171071439981460
-0.00001234567935171071439981460
-0.000012
-0.00001234567890123456780746176
-0.000012
-0.00001234567935171071439981461
... // clipped


Smallest [ no compile warn/error ] only shows this, where the sprintf and printf output shown above (red in the original post) is missing::

dtostrf(y, 10, 5, cbuf); == -0.00001
sprintf(cbuf, "%10.5f", y); ==
printf("%f", y);

... // clipped

------
-0.00001
-0.00001234567935171071439981460
-0.00001234567935171071439981460
... // clipped

PaulStoffregen
12-18-2016, 07:20 PM
Smallest uses the nano libc which doesn't support printing floats.

You should see the same in 1.33 if you select the CPU Speed option for smallest code size.
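
One way to keep float output when the C library's printf cannot format floats is to split the value into integer parts and print those. The helper below is a hedged sketch, not something from Teensyduino: format_fixed is a made-up name, it assumes at least one decimal place, and it assumes the value's integer part fits in a long.

```cpp
#include <cstdio>
#include <string>

// Formats 'value' with a fixed number of decimals (decimals >= 1) using only
// integer conversions, so it does not depend on "%f" support in printf.
std::string format_fixed(double value, int decimals) {
    bool negative = value < 0;
    if (negative) value = -value;

    long scale = 1;
    for (int i = 0; i < decimals; i++) scale *= 10;

    long whole = static_cast<long>(value);
    // Round the fractional part to the requested number of digits.
    long frac = static_cast<long>((value - whole) * scale + 0.5);
    if (frac >= scale) {  // rounding carried into the integer part
        whole += 1;
        frac -= scale;
    }

    char buf[48];
    std::snprintf(buf, sizeof buf, "%s%ld.%0*ld",
                  negative ? "-" : "", whole, decimals, frac);
    return buf;
}
```

On a Teensy the existing dtostrf() serves the same purpose; the sketch just shows the idea in portable C++.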

defragster
12-18-2016, 07:28 PM
Smallest uses the nano libc which doesn't support printing floats.

You should see the same in 1.33 if you select the CPU Speed option for smallest code size.

doh - actually - well that is a mere restatement of FACT then :) Demonstrating that it fails silently. And of course showing that the dtostrf() code edits are holding up.

I saw missing output last night and had to quit before I could see it as my Teensy was not restarting after programming - I'm not seeing that now with TYQT removed - and a couple of programming retries. It was preventing me from putting T_3.1 code on a T_3.1 saying the MCU was wrong. I just repeated with Teensy.exe and the T_3.1 is running well.

<edit>Retested: TYQT won't push this sample compiled with TD_1.34 to a T_3.1: posted ''dtostrf_TEST_b.ino.TEENSY31.hex' is not compatible' on TYQT thread (https://forum.pjrc.com/threads/27825-Teensy-Qt?p=127822&viewfull=1#post127822).

manitou
12-18-2016, 08:41 PM
I just ran all the arm_math examples. They all seem to work with the new toolchain.



I had a couple of small arm_math examples, they ran fine at -Os and fastest LTO. Also CAU assembly stuff worked, and the libcau.a version worked (modifying boards.txt and adding the lib to toolchain lib).

old school cosmetic request: how about adding -O3 -O2 -Os etc to menu descriptions (at least the ones without LTO), e.g. Fastest -O3

Theremingenieur
12-19-2016, 05:52 AM
I've spent my Sunday porting parts of my current (for more than a year now *sigh*) project from Q31(starting in the time on T3.2) to F32 arithmetics on the T3.6 and compiling with the new toolchain. I'm amazed with the speed and precision, especially since I can't use the audio lib or other arm_math libs because of different sampling rate needs and sample-for-sample processing to avoid latency. "Building" quadrature oscillators, state variable filters, and nonlinear filters in software has really become fun with the new Teensy and the new compiler!

PaulStoffregen
12-19-2016, 03:36 PM
I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.



// https://forum.pjrc.com/threads/27959-FLOPS-not-scaling-to-F_CPU
#include <math.h>
#include <string.h>

//FASTRUN
void float_MatMult(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
        C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}

void setup() {

  while (!Serial) ;

  // variables for timing
  int i=0;
  int dt;

  // variables for calculation
  #define N 16
  float A[N][N];
  float B[N][N];
  float C[N][N];

  memset(A,3.1415,sizeof(A));
  memset(B,8.1415,sizeof(B));
  memset(C,0.0,sizeof(C));

  int tbegin = micros();
#if 1
  for (i=1;;i++) {
    // do calculation
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
#else
  for (i=0; i < 200; i++) {
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
  }
  dt = micros() - tbegin;
#endif

  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  float total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
}

void loop() {
}


Here is how it's performing now, on Teensy 3.2 and 3.6:



flops: (Teensy 3.2, 96 MHz)

O3 lto: Float (4 bytes) multiplications per second: (820928)
O3: Float (4 bytes) multiplications per second: (781485)

O2 lto: Float (4 bytes) multiplications per second: (839266)
O2: Float (4 bytes) multiplications per second: (884790)

O1 lto: Float (4 bytes) multiplications per second: (746603)
O1: Float (4 bytes) multiplications per second: (759376)

Og lto: Float (4 bytes) multiplications per second: (704953)
Og: Float (4 bytes) multiplications per second: (697528)

Os lto: Float (4 bytes) multiplications per second: (738290)
Os: Float (4 bytes) multiplications per second: (627845)

flops: (Teensy 3.6)




O3 lto: Float (4 bytes) multiplications per second: (32080536)
O3: Float (4 bytes) multiplications per second: (32275866)

O2 lto: Float (4 bytes) multiplications per second: (14277215)
O2: Float (4 bytes) multiplications per second: (14094865)

O1 lto: Float (4 bytes) multiplications per second: (13105863)
O1: Float (4 bytes) multiplications per second: (12236621)

Og lto: Float (4 bytes) multiplications per second: (8205052)
Og: Float (4 bytes) multiplications per second: (8205233)

Os lto: Float (4 bytes) multiplications per second: (13295244)
Os: Float (4 bytes) multiplications per second: (11327418)

MichaelMeissner
12-19-2016, 04:12 PM
I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.



// variables for calculation
#define N 16
float A[N][N];
float B[N][N];
float C[N][N];

memset(A,3.1415,sizeof(A));
memset(B,8.1415,sizeof(B));
memset(C,0.0,sizeof(C));


Ummm, this doesn't do what the benchmark thinks it does. memset() converts its fill value to a single byte, so it assigns 0x03030303 to every float in A, 0x08080808 to every float in B, and 0 to every float in C. This means that every A will be 3.85009e-37 and every B will be 4.09355e-34. The multiplication underflows to 0; both inputs are tiny but still normal floats, and only the product falls below the denormal range, so it may/may not be slower than normal multiplies and/or issue a trap.
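
The effect is easy to check on any machine with 32-bit IEEE floats. The sketch below reproduces what the memset() calls do to one float and exposes the resulting bit pattern; the helper names float_after_memset and float_bits are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Reproduces what memset(A, 3.1415, sizeof(A)) does to a single float:
// the fill value is converted to int (3), then to one byte (0x03),
// and that byte is copied into every byte of the float.
float float_after_memset(double fill) {
    float f;
    std::memset(&f, static_cast<int>(fill), sizeof f);
    return f;
}

// Returns the raw 32-bit pattern of a float.
std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Assigning the values element by element in a loop is the straightforward way to get real 3.1415 and 8.1415 entries.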

manitou
12-19-2016, 05:19 PM
here are linpack megaflops for 100x100 single-precision float on T3.6 @180mhz and T3.2 (1.6.12 1.34beta1)

               K66@180mhz   T3.2@96mhz
fastest LTO    28.92        0.998
fastest        28.92        1.002
faster LTO     28.44        0.996
faster         28.45        0.999
fast LTO       28.75        1.009
fast           28.75        1.166
debug LTO      20.7         0.986
debug          20.7         0.987
smallest LTO   26.65        0.973
smallest       26.34        0.984

interestingly, older 1.6.12 with 1.32 (-O), K66 linpack runs at 30.02 mflops (faster than 1.34beta1)

Frank B
12-19-2016, 07:22 PM
Ummm

In addition, this "Benchmark" does more or less only one thing.
It's very questionable to use its results for anything other than testing exactly this floatmult.

This benchmark is way too simple.

Frank B
12-19-2016, 07:53 PM
Indeed, this, instead of memset, shows even higher numbers:


for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    A[i][j]=3.1415f;
    B[i][j]=8.1415f;
    C[i][j]=0.0f;
  }
}

manitou
12-19-2016, 08:20 PM
N = 16 must be a sweet spot. Even with Frank's init, with N = 20 performance drops to 14368672 mults/sec

Frank B
12-19-2016, 08:25 PM
For N=2 you get 62564256 mults/sec

That's ~3 cycles per multiplication (including overhead...) ?
I guess the compiler is good at optimizing... ;-)
Then the cache plays a role for higher N, and may not be sufficient for N>16

manitou
12-19-2016, 08:57 PM
it's actually twice that fast because the total op count is 2*N*N*N*i if you count the float add in the inner loop. the label would change to flops:
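
Spelled out, the two ways of counting look like this. It is just the arithmetic from the benchmark's last lines, written as plain functions; the names mults_per_second and flops are made up for illustration.

```cpp
// Multiplications per second, as the benchmark prints it:
// N*N*N multiplies per matrix product, 'iterations' products in 'usec'.
double mults_per_second(int n, long iterations, long usec) {
    return static_cast<double>(n) * n * n * iterations * 1e6 / usec;
}

// Counting the add in "C += A*B" as well gives two operations per
// inner-loop step, i.e. 2*N*N*N*i in total.
double flops(int n, long iterations, long usec) {
    return 2.0 * mults_per_second(n, iterations, usec);
}
```

For example, 100 products of 16x16 matrices in one second is 409,600 multiplications per second, or 819,200 flops by this count.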

PaulStoffregen
12-19-2016, 10:51 PM
I ran several more tests, which probably aren't great benchmarks, but they are based on things commonly done with Arduino.

In some cases, -O1 seems to beat -O2 and -O3. For some specific tasks, -O3 with LTO works wonders.

I'm still debating what the default should be for each board. Obviously Teensy LC needs -Os. In most the tests I've tried, -O3 adds significantly to the program size, so I'm a bit reluctant to make it the default for Teensy 3.0, 3.1, 3.2 where so many programs already exist.

duff
12-19-2016, 11:43 PM
Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:



000018b8 <L_783_delayMicroseconds>:
18b8: 3b01 subs r3, #1
18ba: d1fd bne.n 18b8 <L_783_delayMicroseconds>
18bc: f892 3200 ldrb.w r3, [r2, #512] ; 0x200
18c0: f892 1280 ldrb.w r1, [r2, #640] ; 0x280
18c4: b2db uxtb r3, r3
18c6: 2900 cmp r1, #0
18c8: d1f2 bne.n 18b0 <main+0x28>
18ca: b13b cbz r3, 18dc <L_783_delayMicroseconds+0x24>
18cc: 6803 ldr r3, [r0, #0]
18ce: f023 0302 bic.w r3, r3, #2
18d2: 6003 str r3, [r0, #0]
18d4: e7ef b.n 18b6 <main+0x2e>
18d6: f882 5100 strb.w r5, [r2, #256] ; 0x100
18da: e7ec b.n 18b6 <main+0x2e>
18dc: 6803 ldr r3, [r0, #0]
18de: f043 0303 orr.w r3, r3, #3
18e2: 6003 str r3, [r0, #0]
18e4: e7e7 b.n 18b6 <main+0x2e>
18e6: f8df 8078 ldr.w r8, [pc, #120] ; 1960 <L_869_delayMicroseconds+0x58>
18ea: f8df c078 ldr.w ip, [pc, #120] ; 1964 <L_869_delayMicroseconds+0x5c>
18ee: f8df e078 ldr.w lr, [pc, #120] ; 1968 <L_869_delayMicroseconds+0x60>
18f2: 4f18 ldr r7, [pc, #96] ; (1954 <L_869_delayMicroseconds+0x4c>)
18f4: 4e18 ldr r6, [pc, #96] ; (1958 <L_869_delayMicroseconds+0x50>)
18f6: 4d19 ldr r5, [pc, #100] ; (195c <L_869_delayMicroseconds+0x54>)
18f8: 4c15 ldr r4, [pc, #84] ; (1950 <L_869_delayMicroseconds+0x48>)
18fa: e010 b.n 191e <L_869_delayMicroseconds+0x16>
18fc: b1eb cbz r3, 193a <L_869_delayMicroseconds+0x32>
18fe: 6803 ldr r3, [r0, #0]
1900: f023 0302 bic.w r3, r3, #2
1904: 6003 str r3, [r0, #0]
1906: 4623 mov r3, r4


00001908 <L_869_delayMicroseconds>:
1908: 3b01 subs r3, #1
190a: d1fd bne.n 1908 <L_869_delayMicroseconds>
190c: f898 3000 ldrb.w r3, [r8]
1910: f89c 3000 ldrb.w r3, [ip]
1914: f89e 3000 ldrb.w r3, [lr]
1918: 783b ldrb r3, [r7, #0]
191a: 7833 ldrb r3, [r6, #0]
191c: 782b ldrb r3, [r5, #0]
191e: f892 3200 ldrb.w r3, [r2, #512] ; 0x200
1922: f892 1280 ldrb.w r1, [r2, #640] ; 0x280
1926: b2db uxtb r3, r3
1928: 2900 cmp r1, #0
192a: d0e7 beq.n 18fc <L_783_delayMicroseconds+0x44>
192c: b113 cbz r3, 1934 <L_869_delayMicroseconds+0x2c>
192e: f882 9100 strb.w r9, [r2, #256] ; 0x100
1932: e7e8 b.n 1906 <L_783_delayMicroseconds+0x4e>
1934: f882 9080 strb.w r9, [r2, #128] ; 0x80
1938: e7e5 b.n 1906 <L_783_delayMicroseconds+0x4e>
193a: 6803 ldr r3, [r0, #0]
193c: f043 0303 orr.w r3, r3, #3
1940: 6003 str r3, [r0, #0]
1942: e7e0 b.n 1906 <L_783_delayMicroseconds+0x4e>
1944: 4004b014 andmi fp, r4, r4, lsl r0
1948: 43fe1014 mvnsmi r1, #20
194c: 1fff8e08 svcne 0x00ff8e08
1950: 00f42400 rscseq r2, r4, r0, lsl #8
1954: 1fff8e0c svcne 0x00ff8e0c
1958: 1fff8e00 svcne 0x00ff8e00
195c: 1fff8dff svcne 0x00ff8dff
1960: 1fff8e09 svcne 0x00ff8e09
1964: 1fff8e0a svcne 0x00ff8e0a
1968: 1fff8e0b svcne 0x00ff8e0b

Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

For reference, here is delayMicroseconds using -O3 without LTO:


0000048a <L_36_delayMicroseconds>:
48a: 3b01 subs r3, #1
48c: d1fd bne.n 48a <L_36_delayMicroseconds>
48e: bd08 pop {r3, pc}
490: 00f42400 rscseq r2, r4, r0, lsl #8

defragster
12-20-2016, 01:13 AM
Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:
...
Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

For reference, here is delayMicroseconds using -O3 without LTO:


I was going to make a note on Zilch - the FASTEST & FASTER LTO won't run the samples. Fastest and Faster seem to work - with simple as I have changed it - adding in use of FrankB's vaporized <pending refinement?> WFI based monitoring 'processorUsage.h'.

duff
12-20-2016, 04:32 AM
How do you stop LTO from touching my inline assembly code?

Frank B
12-20-2016, 05:00 AM
Does "asm volatile" help?

duff
12-20-2016, 08:21 AM
Does "asm volatile" help?
Unfortunately not, tried to find some attribute but nothing? delayMicroseconds seems to work though but I'll see if the timing is good on my scope tomorrow.

PaulStoffregen
12-20-2016, 11:31 AM
How do you stop LTO from touching my inline assembly code?

What is it doing to your inline asm?

manitou
12-20-2016, 11:45 AM
How do you stop LTO from touching my inline assembly code?

does barrier work with enclosing asm volatile("" ::: "memory");
see https://forum.pjrc.com/threads/17469-millis()-on-teensy-3?p=22279&viewfull=1#post22279

duff
12-20-2016, 04:51 PM
What is it doing to your inline asm?
here you can see without downloading anything, just compile this sketch with and without LTO and look at disassembly of the sketch using objdump: (I used Fastest with LTO, Fastest menu options for this test)


void setup() {

}

void loop() {
delayMicroseconds(100000);
}

While 'delayMicroseconds' seems to work, it looks like LTO is inlining everything; the disassembly for 'delayMicroseconds' is not the same as without LTO. Also, with LTO I don't see a call to the 'setup' or 'loop' functions in 'main' either. This example is just to show what LTO is doing to the assembly, not to point to a problem with delayMicroseconds!

I will check with my scope to see if the delay times are right today.


does barrier work with enclosing asm volatile("" ::: "memory");
see https://forum.pjrc.com/threads/17469-millis()-on-teensy-3?p=22279&viewfull=1#post22279
LTO still seems to muck with the inline assembly even when the memory barrier is added.

For my particular problem, in my Zilch (https://github.com/duff2013/Zilch_Beta) library I redefine yield which does the context switch (https://github.com/duff2013/Zilch_Beta/blob/master/zilch.cpp#L269) and the compiled assembly needs to be exactly what inline assembly is or it does not work. With LTO enabled it is really not even close. I tried to stop inlining my yield function and putting in memory barriers to the inline assembly part but to no avail.

I think if this toolchain update does get adopted there should be menu control for using or not using LTO?

Sorry for all the LTO references, might give some a headache.:p


edit: I'm using a Teensy 3.2 for this.

manitou
12-20-2016, 05:09 PM
LTO still seems to muck with the inline assembly even when the memory barrier is added.


Since it is a Link-Time Optimization I guess it makes sense that a compile-time memory-barrier would have no effect

Frank B
12-20-2016, 08:48 PM
I'm still debating what the default should be for each board. Obviously Teensy LC needs -Os. In most the tests I've tried, -O3 adds significantly to the program size, so I'm a bit reluctant to make it the default for Teensy 3.0, 3.1, 3.2 where so many programs already exist.

Yes, why not stay with -O1 :-)
The other added options are great. If one wants to use them, they are available now.

@Duff: why not print a #warning that your code does not work with LTO

MichaelMeissner
12-20-2016, 09:02 PM
It might be the LTO bug/feature has been fixed. However, unless somebody sends in a bug report to the proper channels, it may never be fixed (https://gcc.gnu.org/bugs/)

Frank B
12-20-2016, 09:06 PM
Yes, @Duff should create a minimal example and report the bug ...

duff
12-20-2016, 10:14 PM
It might be the LTO bug/feature has been fixed. However, unless somebody sends in a bug report to the proper channels, it may never be fixed (https://gcc.gnu.org/bugs/)

Is this bug or feature, I don't know but LTO seems to really turn your code upside down.



@Duff: why not print a #warning that your code does not work with LTO

That could work, I'll see what I can do with that.



Yes, @Duff should create a minimal example and report the bug ...

Before I go that far I would want to see if other people's code is affected; I really don't know enough right now to say one way or the other. Frank, did you look at the memcpy_audio.S disassembly? Does it work?

MichaelMeissner
12-20-2016, 10:22 PM
LTO (or link time optimization) essentially combines all of your modules together, and then optimizes the whole program. So if you have a function in a module, say foo.cpp:



float add_flt (float a, float b)
{
    return a+b;
}


And you have a call to add_flt in a different module, say bar.cpp or bar.pde.

If LTO is not used, then it will always generate a call instruction, because it does not know what add_flt does.

If LTO is used, then the compiler can see that it is a simple function, and it should inline the function, instead of generating a function call, floating point add, and return.
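
To make the inlining point concrete, here is a minimal single-file sketch. It is only an illustration under some assumptions: in a real project add_flt would live in foo.cpp and the caller in bar.cpp, and the names add_flt_kept and sum3 are hypothetical, not from any library. The second function uses GCC's noinline attribute, which asks the compiler to keep a real call even when it can see the body:

```cpp
// Plain version: whenever the optimizer can see this body (same file, or
// any module once LTO is on), it is free to inline it into the caller.
float add_flt(float a, float b)
{
    return a + b;
}

// Hypothetical variant for illustration: GCC's noinline attribute asks the
// compiler to keep an actual call to this function instead of inlining it.
__attribute__((noinline)) float add_flt_kept(float a, float b)
{
    return a + b;
}

// Caller, standing in for code that would normally sit in bar.cpp.
float sum3(float a, float b, float c)
{
    return add_flt_kept(add_flt(a, b), c);
}
```

Both functions return the same value; the difference should only show up in the disassembly, where the call to add_flt_kept is expected to remain a call.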

<edit>
It really depends on the code whether LTO is a win or not. Since I'm more focused on individual optimizations for PowerPC (specifically adding support for the forthcoming power9 processor), I don't tend to use LTO for my tests. Other people in IBM run SPEC with LTO and make pronouncements about the speed of the machine. I don't speak for IBM, but as a generalization, roughly half of the benchmarks were performance neutral, one benchmark was 3% slower, and the rest were faster (ranging from 2-24% faster). Of course SPEC code is much different than most of the code that runs on Teensys, so your mileage will vary.

duff
12-20-2016, 10:58 PM
Is LTO tied to the optimization level at all? Is there some way to stop it from inlining a function? I found a pragma "no-lto" but it gives all types of warnings about "plugin needed to handle lto object".

Just to update everyone: delayMicroseconds works the same with and without LTO. I checked with my scope just now.

Frank B
12-21-2016, 05:54 PM
Is this a bug or a feature? I don't know, but LTO seems to really turn your code upside down.

Duff!! Indeed !



Frank did you look at the memcpy_audio.S disassembly? Does it work?

Yes, it works, but, really it is totally different

Edit: err.. no my fault, it is identical :) I compared the wrong parts..lol..
sorry

PaulStoffregen
12-21-2016, 06:05 PM
Just discovered -O3 with LTO optimizes away all the data from the audio lib sample player example. Not good.

Frank B
12-21-2016, 06:27 PM
My mp3 codecs work. It is quite complex code, with lots of tables and inline assembler, too.

duff
12-21-2016, 06:54 PM
So I found at least with my code this works:


#pragma GCC push_options
#pragma GCC optimize ("no-lto")

void funct() {

}


#pragma GCC pop_options

It does give all types of warnings when I tried it for delayMicroseconds though.

KurtE
12-22-2016, 09:31 PM
Not sure how important this is, but I have had the Teensy loader app fault if its window is still open when I try to shut down Windows.

This is Windows 10 64 bit, with this beta...

It happened today and yesterday when I did a shutdown.

bmillier
12-22-2016, 10:37 PM
@KurtE. I have had the same fault display (at shutdown every time I have used the loader) on my win 10 64-bit system since I updated my teensyduino to the latest version. However it doesn't seem to affect anything that I can see.

PaulStoffregen
12-23-2016, 08:52 AM
Arduino just released version 1.8.0. The version increase appears to be related to unifying Arduino.org and Arduino.cc boards into a single software release.

I'm going to revert to the old toolchain and publish a new beta. We'll probably do a week or so of testing and merging little last-minute updates. Then this gcc 5.4 toolchain testing can resume in January. Or anyone who really wants to keep playing with gcc 5.4 can still use 1.34-beta1, just not with the new Arduino 1.8.0 release.

Frank B
12-24-2016, 01:12 PM
Or anyone who really wants to keep playing with gcc 5.4 can still use 1.34-beta1, just not with the new Arduino 1.8.0 release.

I'll stay with 1.34-beta1

KurtE
12-24-2016, 01:39 PM
I have both on my Windows machine

PaulStoffregen
12-24-2016, 02:32 PM
I found a fix for the crash on Windows 10 restart problem.

Right now I'm looking into compiler warnings with several libraries. Some happen with -O2 or -O3 optimization, even gcc 4.8. Many others are just sloppy library code. I'm trying to clean as much of this up as I can. Debating whether to make 1.34-beta3, or just do a normal release.

PaulStoffregen
12-24-2016, 05:51 PM
Before I lose this... here's my list of libraries known to have errors with the new toolchain:




Adafruit_CC3000: buildtest example
ks0108: error compiling
LowPower: fails on all boards, even Teensy 2.0
PS2Keyboard: errors
ST7565: error, C++ overload on srandom()

These libraries have warnings. Probably harmless, and most probably also happen with the old toolchain.



FlexCAN: CANtest warning
OSC: many warnings
RadioHead: warnings
teensy_ssd1351: warnings
TinyGPS: test_with_gps_device warning
VirtualWire: warnings, unused stuff
X10: many warnings - ancient arduino stuff
Adafruit_SleepyDog: warning on Teensy 3.x
AppleMidi: warnings
Eigen313: warnings
MFRC522: warnings
EthernetBonjour: many warnings

Frank B
01-21-2017, 04:38 PM
With Teensy 3.2, "fastest with LTO" i get

"upload@1679610-Teensy Firmware 'print_mac.ino.TEENSY31.hex' is not compatible with '1679610-Teensy'"

with TYQT :-)

Might be a TYQT issue ??

defragster
01-21-2017, 04:48 PM
With Teensy 3.2, "fastest with LTO" i get

"upload@1679610-Teensy Firmware 'print_mac.ino.TEENSY31.hex' is not compatible with '1679610-Teensy'"

with TYQT :-)

Might be a TYQT issue ??

I posted a note to Koromix last year and he posted a fix shortly after. It is working [on T_3.0] with the latest version, as I confirmed last night. See this post (https://forum.pjrc.com/threads/27825-Teensy-Qt?p=130948&viewfull=1#post130948)

Frank B
01-21-2017, 05:01 PM
Confirmed :-)

defragster
01-21-2017, 05:03 PM
Confirmed :-)

Good - I posted a note and link on the new 1.36 beta thread

Frank B
01-22-2017, 06:18 PM
you're right, RPI3 jessie is still running in 32 bit mode.

i tested 1.6.12 with 1.34beta1 on mac os,
coremark:
previously T3.2@96mhz -O2 189.4 iterations/sec | with LTO fastest 207.29
previously T3.6@180mhz -O2 384.0 | with LTO fastest 447.7
... so many optimization choices ...


T3.6@180mhz coremark (iterations/sec)
fastest LTO   447.676389
fastest       463.692033
faster LTO    437.121360
faster        434.528617
fast LTO      333.619557
fast          333.032915
small LTO     323.248789 (no float printf)
small         320.692182


GCC6 :

- 180MHz fastest with LTO: Compiler crashes with this sketch ("lto1.exe: internal compiler error: Segmentation fault")
- 180MHz fastest without LTO:


Start
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 13072
Total time (secs): 13.072000
Iterations/Sec : 458.996328
Iterations : 6000
Compiler version : GCC6.2.1 20161205 (release) [ARM/embedded-6-branch revision 243739]
Compiler flags :
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 458.996328 / GCC6.2.1 20161205 (release) [ARM/embedded-6-branch revision 243739] / STACK


So, again 10 points faster, even without LTO (more than twice as fast as T3.2 @ 96MHz)

240MHz: 612.199466 ... no comment ..wow..
My benchmarks were done with "-mpure-code" - seems to be a little bit faster.

wizard69
01-26-2017, 08:50 PM
I got a pi 3 months back - unpowered yet - but I understood it was still using 32 bit Jessie for compatibility to all existing code/usage?

I was under the impression that it is now 64 bit. I know Odroid is trying to transition to 64 bit Linux on the C2.

KurtE
01-26-2017, 10:10 PM
My guess is that there are some alternate 64 bit setups for RPI3, but I don't know of any mainline ones yet. Although I have not looked yet.

Yes, the main Odroid C2 setup is 64 bit. There are still issues with it, for example trying to run Arduino on it. I played around enough to get the main parts of the compiler and downloads to work, but have not gotten the Serial monitor to work. More details in the thread: http://forum.odroid.com/viewtopic.php?f=136&t=21249

Started trying to see about building a 64 bit version from sources. But then ran into issues where pieces of the build are from zip files or the like that have components for the different distros and there is not one for ARM 64 bits.... So I punted

PaulStoffregen
01-26-2017, 10:20 PM
In case anyone's wondering, I am not eager to expand Linux's portion of the Teensyduino release process from 3 of 5 to 4 of 6 files built.

Even if I was, my position is the same as before the 32 bit linuxarm build: I will officially support whatever architectures Arduino.cc officially supports with their non-beta releases. Until Arduino.cc adds a 64 bit linuxarm build, I will not do it. I know that's probably not the answer some Odroid enthusiasts want to hear, but hopefully a clear answer is better than uncertainty?

KurtE
01-26-2017, 10:32 PM
Thanks Paul,

Actually I would be happy with the 32 bit stuff working fine. And for me just having the compiler and upload is fine... There are obviously alternatives to using the terminal monitor.

Actually I would be even happier to be able to do all of it from the command line. The current Arduino added Linux support for command line to work without GUI, which I tried and it worked all the way up to upload, which failed... But that is another story...

tni
03-04-2017, 12:39 PM
Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:



000018b8 <L_783_delayMicroseconds>:
18b8: 3b01 subs r3, #1
18ba: d1fd bne.n 18b8 <L_783_delayMicroseconds>
18bc: f892 3200 ldrb.w r3, [r2, #512] ; 0x200
18c0: f892 1280 ldrb.w r1, [r2, #640] ; 0x280
18c4: b2db uxtb r3, r3
18c6: 2900 cmp r1, #0
18c8: d1f2 bne.n 18b0 <main+0x28>
18ca: b13b cbz r3, 18dc <L_783_delayMicroseconds+0x24>
18cc: 6803 ldr r3, [r0, #0]
18ce: f023 0302 bic.w r3, r3, #2
18d2: 6003 str r3, [r0, #0]
18d4: e7ef b.n 18b6 <main+0x2e>
18d6: f882 5100 strb.w r5, [r2, #256] ; 0x100
18da: e7ec b.n 18b6 <main+0x2e>
18dc: 6803 ldr r3, [r0, #0]
18de: f043 0303 orr.w r3, r3, #3
18e2: 6003 str r3, [r0, #0]
18e4: e7e7 b.n 18b6 <main+0x2e>
18e6: f8df 8078 ldr.w r8, [pc, #120] ; 1960 <L_869_delayMicroseconds+0x58>
18ea: f8df c078 ldr.w ip, [pc, #120] ; 1964 <L_869_delayMicroseconds+0x5c>
18ee: f8df e078 ldr.w lr, [pc, #120] ; 1968 <L_869_delayMicroseconds+0x60>
18f2: 4f18 ldr r7, [pc, #96] ; (1954 <L_869_delayMicroseconds+0x4c>)
18f4: 4e18 ldr r6, [pc, #96] ; (1958 <L_869_delayMicroseconds+0x50>)
18f6: 4d19 ldr r5, [pc, #100] ; (195c <L_869_delayMicroseconds+0x54>)
18f8: 4c15 ldr r4, [pc, #84] ; (1950 <L_869_delayMicroseconds+0x48>)
18fa: e010 b.n 191e <L_869_delayMicroseconds+0x16>
18fc: b1eb cbz r3, 193a <L_869_delayMicroseconds+0x32>
18fe: 6803 ldr r3, [r0, #0]
1900: f023 0302 bic.w r3, r3, #2
1904: 6003 str r3, [r0, #0]
1906: 4623 mov r3, r4


00001908 <L_869_delayMicroseconds>:
1908: 3b01 subs r3, #1
190a: d1fd bne.n 1908 <L_869_delayMicroseconds>
190c: f898 3000 ldrb.w r3, [r8]
1910: f89c 3000 ldrb.w r3, [ip]
1914: f89e 3000 ldrb.w r3, [lr]
1918: 783b ldrb r3, [r7, #0]
191a: 7833 ldrb r3, [r6, #0]
191c: 782b ldrb r3, [r5, #0]
191e: f892 3200 ldrb.w r3, [r2, #512] ; 0x200
1922: f892 1280 ldrb.w r1, [r2, #640] ; 0x280
1926: b2db uxtb r3, r3
1928: 2900 cmp r1, #0
192a: d0e7 beq.n 18fc <L_783_delayMicroseconds+0x44>
192c: b113 cbz r3, 1934 <L_869_delayMicroseconds+0x2c>
192e: f882 9100 strb.w r9, [r2, #256] ; 0x100
1932: e7e8 b.n 1906 <L_783_delayMicroseconds+0x4e>
1934: f882 9080 strb.w r9, [r2, #128] ; 0x80
1938: e7e5 b.n 1906 <L_783_delayMicroseconds+0x4e>
193a: 6803 ldr r3, [r0, #0]
193c: f043 0303 orr.w r3, r3, #3
1940: 6003 str r3, [r0, #0]
1942: e7e0 b.n 1906 <L_783_delayMicroseconds+0x4e>
1944: 4004b014 andmi fp, r4, r4, lsl r0
1948: 43fe1014 mvnsmi r1, #20
194c: 1fff8e08 svcne 0x00ff8e08
1950: 00f42400 rscseq r2, r4, r0, lsl #8
1954: 1fff8e0c svcne 0x00ff8e0c
1958: 1fff8e00 svcne 0x00ff8e00
195c: 1fff8dff svcne 0x00ff8dff
1960: 1fff8e09 svcne 0x00ff8e09
1964: 1fff8e0a svcne 0x00ff8e0a
1968: 1fff8e0b svcne 0x00ff8e0b

Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

For reference here is delayMicroseconds using -O3 without LTO:


0000048a <L_36_delayMicroseconds>:
48a: 3b01 subs r3, #1
48c: d1fd bne.n 48a <L_36_delayMicroseconds>
48e: bd08 pop {r3, pc}
490: 00f42400 rscseq r2, r4, r0, lsl #8

No, it doesn't. The inline assembly part is just the 'subs ...; bne.n ...' which is identical in both cases.

BTW, Zilch Simple_Task works with higher optimization levels, if I change zilch.cpp:


void task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {


to either:


void __attribute__ ((noinline)) task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {

or:


void __attribute__ ((naked)) task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {


I think there is a GCC bug here, since simply adding a proper clobber list to the asm statement doesn't work.

In general, GCC has no idea what the inline assembly does and assumes it doesn't change memory or registers. You need to add proper clobber lists, which you don't have for the Zilch inline assembly.
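
As a rough illustration of the syntax (this is not the Zilch code, and the function name spin is made up), here is a sketch of GCC extended asm with an operand constraint and a clobber list. The asm template is deliberately empty so it builds on any GCC target; a real context switch would contain actual instructions and would also have to name every register it modifies in the clobber list.

```cpp
#include <stdint.h>

// Busy-wait sketch. The empty asm template emits no instructions; it only
// demonstrates the operand and clobber syntax of GCC extended asm.
//   "+r"(n)  : n is both read and written by the asm, so the compiler
//              cannot assume it knows n's value afterwards
//   "memory" : the asm may read or write memory, so the compiler must not
//              cache values in registers or reorder memory accesses across it
uint32_t spin(uint32_t n)
{
    uint32_t loops = 0;
    while (n != 0) {
        __asm__ volatile("" : "+r"(n) : : "memory");
        n = n - 1;
        loops = loops + 1;
    }
    return loops;
}
```

Because of the "+r" operand the compiler has to keep the loop and cannot fold it into a constant, even at high optimization levels; spin(5) still returns 5.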

duff
03-14-2017, 05:59 PM
My latest version of Zlich (https://github.com/duff2013/Zilch) currently only supports T3.2 and works with all optimizations except Fastest w/ LTO.

markonian
09-14-2017, 07:25 PM
My latest version of Zlich (https://github.com/duff2013/Zilch_Beta) currently only supports T3.2 and works with all optimizations except Fastest w/ LTO.

@duff, FYI, your link (https://github.com/duff2013/Zilch_Beta) is broken and Zilch is misspelled in the above post. However, I DID find the Zilch project here: https://github.com/duff2013/Zilch

Is that the correct one?

duff
09-14-2017, 08:09 PM
@duff, FYI, your link (https://github.com/duff2013/Zilch_Beta) is broken and Zilch is misspelled in the above post. However, I DID find the Zilch project here: https://github.com/duff2013/Zilch

Is that the correct one?
:o Thanks, I'll fix that!