Teensyduino 1.34 Beta #1 (ARM Toolchain Update)

Status
Not open for further replies.
The newer arm_math versions have a huge FFT bit reversal table which doesn't fit into Teensy's flash memory.

I'm considering merging some or all new features, but those FFT tables need to be restructured (a unique table for each FFT length) for compatibility with Teensy.
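
To illustrate what "a unique table for each FFT length" could mean, here is a minimal sketch that builds a bit-reversal permutation table sized for one FFT length only. The function name `makeBitReversalTable` is hypothetical, and the layout is a plain index permutation, not the table format arm_math actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a bit-reversal permutation table for ONE FFT length.
// Assumption: fftLen is a power of two and at most 65536.
// This is an illustrative layout, not the CMSIS arm_math table format.
std::vector<uint16_t> makeBitReversalTable(std::size_t fftLen) {
  std::vector<uint16_t> table(fftLen);

  // Number of index bits needed for this length (e.g. 3 for fftLen = 8).
  unsigned bits = 0;
  while ((std::size_t{1} << bits) < fftLen) bits++;

  for (std::size_t i = 0; i < fftLen; i++) {
    std::size_t reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
      // Mirror bit b of i into bit (bits - 1 - b) of the result.
      if (i & (std::size_t{1} << b)) reversed |= std::size_t{1} << (bits - 1 - b);
    }
    table[i] = static_cast<uint16_t>(reversed);
  }
  return table;
}
```

A table built this way has exactly fftLen entries, so a sketch that only uses one FFT length would only need the flash for that one table.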
 
I've completed building the new toolchain for Raspberry Pi and (hopefully) other ARM-based machines.

Edited the first post. The installer for linuxarm is now available.

If anyone tests on non-RPi boards, or on anything other than a Raspberry Pi 3, please let me know whether it worked.
I installed it on the Odroid XU3-lite, where I also finally finished building the toolchain (on the 4th attempt, due to missing things).

I installed it on the XU3 and programmed two different T3.6s, including one of my boards that backs the touch screen. I then tried it with my version of the ILI9341_t3n library with SPIN, using my example version of the touch paint sketch for the right touch controller, and it worked.
 
Not sure if I should mention it here or in a different thread, but I thought I would try downloading FreeRTOS to try out the float print issue.

On my machine with Arduino 1.6.13 and this beta on Windows 10 64-bit, compiling the app in that thread or the frBlink example fails for T3.6:
Code:
:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: In function 'systick_isr':
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: warning: implicit declaration of function 'xPortSysTickHandler' [-Wimplicit-function-declaration]
   if (sysTickEnabled) xPortSysTickHandler();
                       ^
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: At top level:
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:176:6: warning: conflicting types for 'xPortSysTickHandler'
 void xPortSysTickHandler( void );
      ^
C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: note: previous implicit declaration of 'xPortSysTickHandler' was here
   if (sysTickEnabled) xPortSysTickHandler();
                       ^
C:\Users\Kurt\AppData\Local\Temp\ccBigtf2.ltrans11.ltrans.o: In function `pendablesrvreq_isr':
<artificial>:(.text+0x1e): undefined reference to `vTaskSwitchContext'
collect2.exe: error: ld returned 1 exit status

Error compiling for board Teensy 3.6.
frBlink did build on Arduino 1.6.9 with TD 1.31, though with the same warnings.
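
The two warnings in that log come from xPortSysTickHandler() being called on line 86 before any declaration of it is visible, so the C compiler assumes an implicit signature that later conflicts with the real one. A minimal sketch of the usual remedy follows, with stub bodies standing in for the real FreeRTOS_ARM code; `tickCount` is a hypothetical counter added only for illustration. This addresses the warnings only, not the undefined reference to vTaskSwitchContext reported by the linker.

```cpp
// Stubs standing in for FreeRTOS_ARM's port.c; only the declaration order matters here.
volatile bool sysTickEnabled = true;
volatile unsigned long tickCount = 0;  // hypothetical counter, for illustration only

// Prototype placed before the first call, so the compiler never has to
// assume an implicit declaration for xPortSysTickHandler.
void xPortSysTickHandler(void);

void systick_isr(void) {
  if (sysTickEnabled) xPortSysTickHandler();
}

void xPortSysTickHandler(void) {
  tickCount = tickCount + 1;  // stub body; the real handler advances the RTOS tick
}
```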
 
I just ran all the arm_math examples. They all seem to work with the new toolchain.

The linear interpolation example is odd. I get slightly different results depending on the optimization used. Amazingly, the fastest with LTO seems to completely optimize away all the computations and the massive 736 kbyte data table!
 
Without looking at or running the example: does the compiler (with LTO) detect that the result is always the same and just output constants?
That would be amazing.

Edit: Did you try "fastmath", too? I wonder how good or bad the results are.
 
Yes. I just looked at the generated code. It's completely replacing arm_linear_interp_f32() and the huge testInputSin_f32[] array with pre-computed results. But arm_linear_interp_f32() is a static inline function completely defined within arm_math.h, so I guess this isn't too surprising. None of the other optimization settings eliminates the huge const array.

It's also doing some pretty substantial optimization with accessing lots of variables from the same base address register using indexed addressing mode, which makes the generated assembly much harder to read.
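
A small sketch of the pattern that makes this kind of folding possible: a static inline function applied to a const table with a constant argument. `kTable`, `linearInterp`, and `interpAtFixedPoint` are hypothetical stand-ins, not the arm_math code; whether a given compiler actually folds the call away depends on its optimization settings.

```cpp
#include <cstddef>

// Hypothetical stand-in for the example's large const data table: y = x^2 at x = 0..4.
static const float kTable[] = {0.0f, 1.0f, 4.0f, 9.0f, 16.0f};

// Hypothetical stand-in for a static inline interpolation routine.
// Samples are assumed to sit at integer x positions 0..n-1.
static inline float linearInterp(const float *table, std::size_t n, float x) {
  if (x <= 0.0f) return table[0];
  if (x >= static_cast<float>(n - 1)) return table[n - 1];
  std::size_t i = static_cast<std::size_t>(x);  // index of the sample at or below x
  float frac = x - static_cast<float>(i);       // fractional distance to the next sample
  return table[i] + frac * (table[i + 1] - table[i]);
}

// Every input is known at compile time, so an optimizer that inlines
// linearInterp() is free to replace this whole call with a constant
// and then drop kTable if nothing else references it.
float interpAtFixedPoint() {
  return linearInterp(kTable, sizeof(kTable) / sizeof(kTable[0]), 2.5f);
}
```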
 
Smallest Code (nano): missing s/printf output

Found a <edit> REMINDER. Looking back to the code I did for: Teensy3-alternative-for-dtostrf()

I pulled the sketch sources from there and combined two posts: View attachment dtostrf_TEST_b.ino
{ Output prints only once, so it uses: while (!Serial); }

TD_1.34b1 works as it did in TD_1.33 and before when compiled Fast or above, it seems, but going to "SMALLEST" loses the output for printf and sprintf, as in this short snip [LTO doesn't affect output]:

Fast [no compile warn/error] shows this, as expected:
dtostrf(y, 10, 5, cbuf); == -0.00001
sprintf(cbuf, "%10.5f", y); == -0.00001
printf("%f", y);-0.000012

... // clipped

------
-0.00001
-0.00001234567935171071439981460
-0.00001234567935171071439981460
-0.000012
-0.00001234567890123456780746176
-0.000012
-0.00001234567935171071439981461
... // clipped

Smallest [no compile warn/error] shows only this; the sprintf and printf output seen above is missing:
dtostrf(y, 10, 5, cbuf); == -0.00001
sprintf(cbuf, "%10.5f", y); ==
printf("%f", y);

... // clipped

------
-0.00001
-0.00001234567935171071439981460
-0.00001234567935171071439981460
... // clipped
 
Smallest uses the nano libc which doesn't support printing floats.

You should see the same in 1.33 if you select the CPU Speed option for smallest code size.
 
Doh, actually that is a mere restatement of FACT then :) It demonstrates that it fails silently, and of course shows that the dtostrf() code edits are holding up.

I saw missing output last night and had to quit before I could look into it, as my Teensy was not restarting after programming. I'm not seeing that now with TYQT removed, after a couple of programming retries. It was preventing me from putting T_3.1 code on a T_3.1, saying the MCU was wrong. I just repeated with Teensy.exe and the T_3.1 is running well.

<edit> Retested: TYQT won't push this sample compiled with TD_1.34 to a T_3.1. I posted the "'dtostrf_TEST_b.ino.TEENSY31.hex' is not compatible" message on the TYQT thread.
 
I had a couple of small arm_math examples; they ran fine at -Os and fastest LTO. The CAU assembly stuff also worked, and so did the libcau.a version (modifying boards.txt and adding the lib to the toolchain lib).

Old-school cosmetic request: how about adding -O3, -O2, -Os, etc. to the menu descriptions (at least the ones without LTO), e.g. "Fastest -O3"?
 
I've spent my Sunday porting parts of my current project (running for more than a year now *sigh*) from Q31 (started back in the T3.2 days) to F32 arithmetic on the T3.6, compiling with the new toolchain. I'm amazed by the speed and precision, especially since I can't use the audio lib or other arm_math libs because of different sampling rate needs and sample-by-sample processing to avoid latency. "Building" quadrature oscillators, state variable filters, and nonlinear filters in software has really become fun with the new Teensy and the new compiler!
 
I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.

Code:
// https://forum.pjrc.com/threads/27959-FLOPS-not-scaling-to-F_CPU
#include <math.h>
#include <string.h>

//FASTRUN
void float_MatMult(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
	C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}

void setup() {
  
  while (!Serial) ;
  
  // variables for timing
  int i=0;
  int dt;
  
  // variables for calculation
#define N 16
  float A[N][N];
  float B[N][N];
  float C[N][N];
  
  memset(A,3.1415,sizeof(A));
  memset(B,8.1415,sizeof(B));
  memset(C,0.0,sizeof(C));
  
  int tbegin = micros();
#if 1
  for (i=1;;i++) {
    // do calculation
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
#else
  for (i=0; i < 200; i++) {
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
  }
  dt = micros() - tbegin;
#endif

  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  float total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
}

void loop() {
}

Here is how it's performing now, on Teensy 3.2 and 3.6:

Code:
flops:  (Teensy 3.2, 96 MHz)

O3 lto: Float (4 bytes)  multiplications per second:    (820928)
O3:     Float (4 bytes)  multiplications per second:    (781485)

O2 lto: Float (4 bytes)  multiplications per second:    (839266)
O2:     Float (4 bytes)  multiplications per second:    (884790)

O1 lto: Float (4 bytes)  multiplications per second:    (746603)
O1:     Float (4 bytes)  multiplications per second:    (759376)

Og lto: Float (4 bytes)  multiplications per second:    (704953)
Og:     Float (4 bytes)  multiplications per second:    (697528)

Os lto: Float (4 bytes)  multiplications per second:    (738290)
Os:     Float (4 bytes)  multiplications per second:    (627845)

Code:
O3 lto: Float (4 bytes)  multiplications per second:    (32080536)
O3:     Float (4 bytes)  multiplications per second:    (32275866)

O2 lto: Float (4 bytes)  multiplications per second:    (14277215)
O2:     Float (4 bytes)  multiplications per second:    (14094865)

O1 lto: Float (4 bytes)  multiplications per second:    (13105863)
O1:     Float (4 bytes)  multiplications per second:    (12236621)

Og lto: Float (4 bytes)  multiplications per second:    (8205052)
Og:     Float (4 bytes)  multiplications per second:    (8205233)

Os lto: Float (4 bytes)  multiplications per second:    (13295244)
Os:     Float (4 bytes)  multiplications per second:    (11327418)
 
I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.

Code:
  // variables for calculation
#define N 16
  float A[N][N];
  float B[N][N];
  float C[N][N];
  
  memset(A,3.1415,sizeof(A));
  memset(B,8.1415,sizeof(B));
  memset(C,0.0,sizeof(C));

Ummm, this doesn't do what the benchmark thinks it does. memset() converts its fill value to a single byte, so it writes 0x03030303 into every float in A, 0x08080808 into every float in B, and 0 into every float in C. This means that every A will be 3.85009e-37 and every B will be 4.09355e-34. These inputs are tiny but still normalized; it is their product (about 1.6e-70) that underflows to 0, so the multiply may or may not be slower than a normal one and/or issue a trap.
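
The byte-fill behavior is easy to check on any machine. A minimal sketch, with hypothetical helper names `floatFromMemsetByte` and `bitsOf`:

```cpp
#include <cstdint>
#include <cstring>

// Fill one float the way memset(A, 3.1415, sizeof(A)) fills each element:
// the fill argument is an int (3.1415 is truncated to 3), which memset
// converts to unsigned char and copies into every byte.
float floatFromMemsetByte(int fill) {
  float f;
  std::memset(&f, fill, sizeof f);
  return f;
}

// Return the raw 32-bit pattern of a float (assumes a 32-bit IEEE 754 float).
uint32_t bitsOf(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}
```

Because every byte of the float is the same, the result does not depend on endianness: a fill of 3 gives the bit pattern 0x03030303 and a fill of 8 gives 0x08080808.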
 
Here are linpack megaflops for 100x100 single-precision float on T3.6 @ 180 MHz and T3.2 @ 96 MHz (Arduino 1.6.12 with 1.34beta1):
Code:
                  K66 @ 180 MHz   T3.2 @ 96 MHz
  fastest LTO     28.92           0.998
  fastest         28.92           1.002
  faster LTO      28.44           0.996
  faster          28.45           0.999
  fast LTO        28.75           1.009
  fast            28.75           1.166
  debug LTO       20.7            0.986
  debug           20.7            0.987
  smallest LTO    26.65           0.973
  smallest        26.34           0.984

Interestingly, with the older 1.6.12 plus 1.32 (-O), K66 linpack runs at 30.02 mflops (faster than 1.34beta1).
 
Indeed, initializing like this instead of with memset shows even higher numbers:
Code:
  for (int i=0; i<N; i++) {
   for (int j=0; j<N; j++) {
    A[i][j]=3.1415f;
    B[i][j]=8.1415f;
    C[i][j]=0.0f;
   }
  }
 
N = 16 must be a sweet spot. Even with Frank's init, with N = 20 performance drops to 14368672 mults/sec
 
For N=2 you get 62564256 mults/sec.

That's ~3 cycles per multiplication (including overhead...)?
I guess the compiler is good at optimizing... ;-)
Then the cache plays a role for higher N, and may not be sufficient for N>16.
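
The "~3 cycles" estimate is just the clock rate divided by the reported rate. A small sketch of that arithmetic, assuming the T3.6 was running at 180 MHz (the post does not state the clock for this run); the helper names are hypothetical.

```cpp
// Average CPU cycles spent per operation, loop overhead included.
double cyclesPerOp(double cpuHz, double opsPerSecond) {
  return cpuHz / opsPerSecond;
}

// Assumption: 180 MHz CPU clock; 62564256 mults/sec is the N = 2 figure quoted above.
double cyclesForN2() {
  return cyclesPerOp(180e6, 62564256.0);
}
```

Under that assumption the result comes out a little under 3 cycles per multiplication.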
 
It's actually twice that fast, because the total op count is 2*N*N*N*i if you count the float add in the inner loop. The label would then change to flops.
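
A sketch of the two ways of counting, using the same quantities as the benchmark sketch (matrix size N, iteration count i, elapsed microseconds dt). The function names are hypothetical.

```cpp
// Multiplications per second as the benchmark reports them:
// N*N*N multiplies per matrix product, `iterations` products in `usec` microseconds.
double multsPerSecond(int n, long iterations, long usec) {
  return static_cast<double>(n) * n * n * iterations * 1e6 / static_cast<double>(usec);
}

// Counting the add in the inner loop as well doubles the operation count,
// which turns the figure into floating-point operations per second.
double flopsPerSecond(int n, long iterations, long usec) {
  return 2.0 * multsPerSecond(n, iterations, usec);
}
```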
 
I ran several more tests, which probably aren't great benchmarks, but they are based on things commonly done with Arduino.

In some cases, -O1 seems to beat -O2 and -O3. For some specific tasks, -O3 with LTO works wonders.

I'm still debating what the default should be for each board. Obviously Teensy LC needs -Os. In most of the tests I've tried, -O3 adds significantly to the program size, so I'm a bit reluctant to make it the default for Teensy 3.0, 3.1, and 3.2, where so many programs already exist.
 
Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:

Code:
000018b8 <L_783_delayMicroseconds>:
    18b8:    3b01          subs    r3, #1
    18ba:    d1fd          bne.n    18b8 <L_783_delayMicroseconds>
    18bc:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
    18c0:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
    18c4:    b2db          uxtb    r3, r3
    18c6:    2900          cmp    r1, #0
    18c8:    d1f2          bne.n    18b0 <main+0x28>
    18ca:    b13b          cbz    r3, 18dc <L_783_delayMicroseconds+0x24>
    18cc:    6803          ldr    r3, [r0, #0]
    18ce:    f023 0302     bic.w    r3, r3, #2
    18d2:    6003          str    r3, [r0, #0]
    18d4:    e7ef          b.n    18b6 <main+0x2e>
    18d6:    f882 5100     strb.w    r5, [r2, #256]    ; 0x100
    18da:    e7ec          b.n    18b6 <main+0x2e>
    18dc:    6803          ldr    r3, [r0, #0]
    18de:    f043 0303     orr.w    r3, r3, #3
    18e2:    6003          str    r3, [r0, #0]
    18e4:    e7e7          b.n    18b6 <main+0x2e>
    18e6:    f8df 8078     ldr.w    r8, [pc, #120]    ; 1960 <L_869_delayMicroseconds+0x58>
    18ea:    f8df c078     ldr.w    ip, [pc, #120]    ; 1964 <L_869_delayMicroseconds+0x5c>
    18ee:    f8df e078     ldr.w    lr, [pc, #120]    ; 1968 <L_869_delayMicroseconds+0x60>
    18f2:    4f18          ldr    r7, [pc, #96]    ; (1954 <L_869_delayMicroseconds+0x4c>)
    18f4:    4e18          ldr    r6, [pc, #96]    ; (1958 <L_869_delayMicroseconds+0x50>)
    18f6:    4d19          ldr    r5, [pc, #100]    ; (195c <L_869_delayMicroseconds+0x54>)
    18f8:    4c15          ldr    r4, [pc, #84]    ; (1950 <L_869_delayMicroseconds+0x48>)
    18fa:    e010          b.n    191e <L_869_delayMicroseconds+0x16>
    18fc:    b1eb          cbz    r3, 193a <L_869_delayMicroseconds+0x32>
    18fe:    6803          ldr    r3, [r0, #0]
    1900:    f023 0302     bic.w    r3, r3, #2
    1904:    6003          str    r3, [r0, #0]
    1906:    4623          mov    r3, r4


00001908 <L_869_delayMicroseconds>:
    1908:    3b01          subs    r3, #1
    190a:    d1fd          bne.n    1908 <L_869_delayMicroseconds>
    190c:    f898 3000     ldrb.w    r3, [r8]
    1910:    f89c 3000     ldrb.w    r3, [ip]
    1914:    f89e 3000     ldrb.w    r3, [lr]
    1918:    783b          ldrb    r3, [r7, #0]
    191a:    7833          ldrb    r3, [r6, #0]
    191c:    782b          ldrb    r3, [r5, #0]
    191e:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
    1922:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
    1926:    b2db          uxtb    r3, r3
    1928:    2900          cmp    r1, #0
    192a:    d0e7          beq.n    18fc <L_783_delayMicroseconds+0x44>
    192c:    b113          cbz    r3, 1934 <L_869_delayMicroseconds+0x2c>
    192e:    f882 9100     strb.w    r9, [r2, #256]    ; 0x100
    1932:    e7e8          b.n    1906 <L_783_delayMicroseconds+0x4e>
    1934:    f882 9080     strb.w    r9, [r2, #128]    ; 0x80
    1938:    e7e5          b.n    1906 <L_783_delayMicroseconds+0x4e>
    193a:    6803          ldr    r3, [r0, #0]
    193c:    f043 0303     orr.w    r3, r3, #3
    1940:    6003          str    r3, [r0, #0]
    1942:    e7e0          b.n    1906 <L_783_delayMicroseconds+0x4e>
    1944:    4004b014     andmi    fp, r4, r4, lsl r0
    1948:    43fe1014     mvnsmi    r1, #20
    194c:    1fff8e08     svcne    0x00ff8e08
    1950:    00f42400     rscseq    r2, r4, r0, lsl #8
    1954:    1fff8e0c     svcne    0x00ff8e0c
    1958:    1fff8e00     svcne    0x00ff8e00
    195c:    1fff8dff     svcne    0x00ff8dff
    1960:    1fff8e09     svcne    0x00ff8e09
    1964:    1fff8e0a     svcne    0x00ff8e0a
    1968:    1fff8e0b     svcne    0x00ff8e0b
Ouch, it seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library, which uses inline assembly heavily.

For reference, here is delayMicroseconds using -O3 without LTO:
Code:
0000048a <L_36_delayMicroseconds>:
     48a:    3b01          subs    r3, #1
     48c:    d1fd          bne.n    48a <L_36_delayMicroseconds>
     48e:    bd08          pop    {r3, pc}
     490:    00f42400     rscseq    r2, r4, r0, lsl #8
 
I was going to make a note on Zilch: the samples won't run with FASTEST LTO or FASTER LTO. Fastest and Faster seem to work, at least with the simple sample as I have changed it, adding in use of FrankB's vaporized <pending refinement?> WFI-based monitoring 'processorUsage.h'.
 