Thread: Teensyduino 1.34 Beta #1 (ARM Toolchain Update)

  1. #26
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    The newer arm_math versions have a huge FFT bit reversal table which doesn't fit into Teensy's flash memory.

    I'm considering merging some or all new features, but those FFT tables need to be restructured (a unique table for each FFT length) for compatibility with Teensy.

  2. #27
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    3,534
    Quote Originally Posted by PaulStoffregen View Post
    I've completed building the new toolchain for Raspberry Pi and (hopefully) other ARM-based machines.

    Edited the first post. The installer for linuxarm is now available.

    If anyone tests on non-RPI boards, or anything other than Raspberry Pi 3, please let me know if it worked?
    I installed it on the Odroid XU3-lite (on which I also finally completed the toolchain build, on the 4th attempt, due to missing things).

    I installed it on the XU3 and programmed two different T3.6s, including one of my boards that backs the touch screen. I then tried it with my version of the ILI9341_t3n library with SPIN, using my version of the touch paint example for the right touch controller, and it worked.

  3. #28
    Senior Member+ KurtE's Avatar
    Join Date
    Jan 2014
    Posts
    3,534
    Not sure if I should mention it here or in a different thread, but I thought I would try downloading FreeRTOS to try out the float print issue.

    On my machine, with Arduino 1.6.13 and this beta on Windows 10 64-bit, compiling the app in that thread or the frBlink example fails for T3.6:
    Code:
    :\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: In function 'systick_isr':
    
    C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: warning: implicit declaration of function 'xPortSysTickHandler' [-Wimplicit-function-declaration]
    
       if (sysTickEnabled) xPortSysTickHandler();
    
                           ^
    
    C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c: At top level:
    
    C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:176:6: warning: conflicting types for 'xPortSysTickHandler'
    
     void xPortSysTickHandler( void );
    
          ^
    
    C:\Users\Kurt\Documents\Arduino\libraries\FreeRTOS_ARM\utility\port.c:86:23: note: previous implicit declaration of 'xPortSysTickHandler' was here
    
       if (sysTickEnabled) xPortSysTickHandler();
    
                           ^
    
    C:\Users\Kurt\AppData\Local\Temp\ccBigtf2.ltrans11.ltrans.o: In function `pendablesrvreq_isr':
    
    <artificial>:(.text+0x1e): undefined reference to `vTaskSwitchContext'
    
    collect2.exe: error: ld returned 1 exit status
    
    Error compiling for board Teensy 3.6.
    frBlink did build on Arduino 1.6.9 with TD1.31 (with the same warnings).
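    The two warnings in that log come from port.c calling xPortSysTickHandler() at line 86 before the compiler has seen its declaration at line 176. A minimal sketch of the usual C fix, declaring the function before its first use, follows. The function names are taken from the log; the bodies, the type of sysTickEnabled, tick_count and ticks_after() are stand-ins invented for illustration. This only concerns the warnings, not the separate undefined reference to vTaskSwitchContext at link time.

```c
#include <assert.h>

/* Stand-in state; the real port.c has its own definitions. */
static int sysTickEnabled = 1;
static int tick_count = 0;   /* hypothetical counter, for illustration only */

/* Prototype placed BEFORE the first call, so the compiler never has to
   guess an implicit "int xPortSysTickHandler()" declaration. */
void xPortSysTickHandler(void);

void systick_isr(void) {
  if (sysTickEnabled) xPortSysTickHandler();
}

/* Stand-in body; the real handler lives in the FreeRTOS port. */
void xPortSysTickHandler(void) {
  tick_count++;
}

/* Hypothetical helper for checking the sketch on a desktop compiler. */
int ticks_after(int calls) {
  assert(calls >= 0);
  tick_count = 0;
  for (int i = 0; i < calls; i++) systick_isr();
  return tick_count;
}
```

    With the prototype ahead of systick_isr(), neither the implicit-declaration warning nor the conflicting-types warning should appear.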

  4. #29
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    I just ran all the arm_math examples. They all seem to work with the new toolchain.

    The linear interpolation example is odd. I get slightly different results depending on the optimization used. Amazingly, the fastest with LTO seems to completely optimize away all the computations and the massive 736 kbyte data table!

  5. #30
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    4,438
    Without looking at or running the example: does the compiler (with LTO) detect that the result is always the same and just output constants?
    That would be amazing.

    Edit: Did you try "fastmath", too? I wonder how good or bad the results are.
    Last edited by Frank B; 12-18-2016 at 06:21 PM.

  6. #31
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    Yes. I just looked at the generated code. It's completely replacing arm_linear_interp_f32() and the huge testInputSin_f32[] array with pre-computed results. But arm_linear_interp_f32() is a static inline function completely defined within arm_math.h, so I guess this isn't too surprising. None of the other optimization settings eliminate the huge const array.

    It's also doing some pretty substantial optimization with accessing lots of variables from the same base address register using indexed addressing mode, which makes the generated assembly much harder to read.
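    To illustrate why a static inline function over a const table is a good candidate for this kind of folding, here is a minimal sketch. It is not the CMSIS arm_linear_interp_f32() code; interp(), kTable and sample() are hypothetical names with a tiny made-up table. Because every input to sample() is known at compile time, an optimizing compiler is allowed to reduce it to a single constant and discard the table, although whether it actually does depends on the compiler and settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lookup table: y = x*x sampled at x = 0..4. */
static const float kTable[5] = {0.0f, 1.0f, 4.0f, 9.0f, 16.0f};

/* Linear interpolation between table entries spaced 1.0 apart.
   Fully visible to the compiler at every call site. */
static inline float interp(const float *y, size_t n, float x) {
  assert(n >= 2);
  if (x <= 0.0f) return y[0];                 /* clamp below the table */
  if (x >= (float)(n - 1)) return y[n - 1];   /* clamp above the table */
  size_t i = (size_t)x;                       /* index of the left neighbour */
  float frac = x - (float)i;                  /* position between neighbours */
  return y[i] + frac * (y[i + 1] - y[i]);
}

/* All arguments are compile-time constants, so the whole call
   can in principle be evaluated at build time. */
float sample(void) {
  return interp(kTable, 5, 2.5f);
}
```

    Here sample() evaluates to 4 + 0.5 * (9 - 4) = 6.5. Comparing the disassembly of sample() at different optimization levels is one way to see whether the table survives.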

  7. #32
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    6,441

    Smallest Code - (nano) missing s/printf output

    Found a <edit> REMINDER. Looking back to the code I did for : Teensy3-alternative-for-dtostrf()

    I pulled the sketch sources from there - combined two posts: dtostrf_TEST_b.ino
    { Output prints only once so uses :: while (!Serial ); }

    TD_1.34b1 works as it did in TD_1.33 and before when compiled Fast or above, it seems, but going to "SMALLEST" doesn't produce the output for printf or sprintf, as follows - a short snip [LTO doesn't affect output]:

    Fast [no compile warn/error] shows this as expected:
    dtostrf(y, 10, 5, cbuf); == -0.00001
    sprintf(cbuf, "%10.5f", y); == -0.00001
    printf("%f", y);-0.000012

    ... // clipped

    ------
    -0.00001
    -0.00001234567935171071439981460
    -0.00001234567935171071439981460
    -0.000012
    -0.00001234567890123456780746176
    -0.000012
    -0.00001234567935171071439981461
    ... // clipped
    Smallest [no compile warn/error] only shows this, where the above RED output is missing:
    dtostrf(y, 10, 5, cbuf); == -0.00001
    sprintf(cbuf, "%10.5f", y); ==
    printf("%f", y);

    ... // clipped

    ------
    -0.00001
    -0.00001234567935171071439981460
    -0.00001234567935171071439981460
    ... // clipped
    Last edited by defragster; 12-18-2016 at 08:00 PM. Reason: nano build smaller by losing float print

  8. #33
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    Smallest uses the nano libc which doesn't support printing floats.

    You should see the same in 1.33 if you select the CPU Speed option for smallest code size.
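    For code that must print a float while built against a libc without float support in printf, one common workaround is to split the value into integer parts and print those. The sketch below is an assumption-laden example, not code from Teensyduino: format_fixed() is a hypothetical helper, it relies only on the %s and %lu conversions, and it handles only magnitudes whose scaled value fits in 32 bits.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes value with a fixed number of decimals
   using only integer and string conversions. Returns snprintf's result. */
int format_fixed(char *buf, size_t len, double value, int decimals) {
  assert(buf != NULL && decimals >= 0 && decimals <= 9);

  unsigned long scale = 1;
  for (int i = 0; i < decimals; i++) scale *= 10UL;

  int negative = value < 0.0;
  if (negative) value = -value;

  /* Round half up at the requested precision. */
  double scaled_d = value * (double)scale + 0.5;
  assert(scaled_d < 4294967295.0);   /* must fit in a 32-bit unsigned long */
  unsigned long scaled = (unsigned long)scaled_d;

  /* Build the zero-padded fractional digits by hand. */
  char frac[10];
  unsigned long f = scaled % scale;
  for (int i = decimals - 1; i >= 0; i--) {
    frac[i] = (char)('0' + (int)(f % 10UL));
    f /= 10UL;
  }
  frac[decimals] = '\0';

  return snprintf(buf, len, "%s%lu%s%s",
                  negative ? "-" : "", scaled / scale,
                  decimals ? "." : "", frac);
}
```

    Called with -0.0000123456789 and 5 decimals, this writes -0.00001 into the buffer, the same text the dtostrf() line prints in the output above.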

  9. #34
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    6,441
    Quote Originally Posted by PaulStoffregen View Post
    Smallest uses the nano libc which doesn't support printing floats.

    You should see the same in 1.33 if you select the CPU Speed option for smallest code size.
    doh - actually - well that is a mere restatement of FACT then Demonstrating that it fails silently. And of course showing that the dtostrf() code edits are holding up.

    I saw missing output last night and had to quit before I could see it as my Teensy was not restarting after programming - I'm not seeing that now with TYQT removed - and a couple of programming retries. It was preventing me from putting T_3.1 code on a T_3.1 saying the MCU was wrong. I just repeated with Teensy.exe and the T_3.1 is running well.

    <edit>Retested: TYQT won't push this sample compiled with TD_1.34 to a T_3.1: posted ''dtostrf_TEST_b.ino.TEENSY31.hex' is not compatible' on TYQT thread.
    Last edited by defragster; 12-18-2016 at 08:57 PM.

  10. #35
    Senior Member+ manitou's Avatar
    Join Date
    Jan 2013
    Posts
    1,476
    Quote Originally Posted by PaulStoffregen View Post
    I just ran all the arm_math examples. They all seem to work with the new toolchain.
    I had a couple of small arm_math examples; they ran fine at -Os and fastest LTO. The CAU assembly stuff also worked, as did the libcau.a version (modifying boards.txt and adding the lib to the toolchain lib).

    old school cosmetic request: how about adding -O3 -O2 -Os etc to menu descriptions (at least the ones without LTO), e.g. Fastest -O3

  11. #36
    Senior Member+ Theremingenieur's Avatar
    Join Date
    Feb 2014
    Location
    Colmar, France
    Posts
    1,626
    I've spent my Sunday porting parts of my current project (ongoing for more than a year now *sigh*) from Q31 (started back when I was on the T3.2) to F32 arithmetic on the T3.6, compiling with the new toolchain. I'm amazed by the speed and precision, especially since I can't use the audio lib or other arm_math libs because of different sampling rate needs and sample-by-sample processing to avoid latency. "Building" quadrature oscillators, state variable filters, and nonlinear filters in software has really become fun with the new Teensy and the new compiler!

  12. #37
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.

    Code:
    // https://forum.pjrc.com/threads/27959-FLOPS-not-scaling-to-F_CPU
    #include <math.h>
    #include <string.h>
    
    //FASTRUN
    void float_MatMult(float* A, float* B, int m, int p, int n, float* C) {
      // A = input matrix (m x p)
      // B = input matrix (p x n)
      // m = number of rows in A
      // p = number of columns in A = number of rows in B
      // n = number of columns in B
      // C = output matrix = A*B (m x n)
      int i, j, k;
      for ( i = 0; i < m; i++ )
        for ( j = 0; j < n; j++ ){
          C[i*n+j] = 0;
          for( k = 0; k < p; k++ )
            C[i*n+j] += A[i*p+k]*B[k*n+j];
        }
    }
    
    void setup() {
      
      while (!Serial) ;
      
      // variables for timing
      int i=0;
      int dt;
      
      // variables for calculation
    #define N 16
      float A[N][N];
      float B[N][N];
      float C[N][N];
      
      memset(A,3.1415,sizeof(A));
      memset(B,8.1415,sizeof(B));
      memset(C,0.0,sizeof(C));
      
      int tbegin = micros();
    #if 1
      for (i=1;;i++) {
        // do calculation
        float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
        // check if t_delay has passed
        dt = micros() - tbegin;
        if (dt > 1000000) break;
      }
    #else
      for (i=0; i < 200; i++) {
        float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
      }
      dt = micros() - tbegin;
    #endif
    
      Serial.printf("(%dx%d) matrices: ", N, N);
      Serial.printf("%d matrices in %d usec: ", i, dt);
      Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
      Serial.printf("Float (%d bytes) ", sizeof(float));
      float total = N*N*N*i*1e6 / (float)dt;
      Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
    }
    
    void loop() {
    }
    Here is how it's performing now, on Teensy 3.2 and 3.6:

    Code:
    flops:  (Teensy 3.2, 96 MHz)
    
    O3 lto: Float (4 bytes)  multiplications per second:    (820928)
    O3:     Float (4 bytes)  multiplications per second:    (781485)
    
    O2 lto: Float (4 bytes)  multiplications per second:    (839266)
    O2:     Float (4 bytes)  multiplications per second:    (884790)
    
    O1 lto: Float (4 bytes)  multiplications per second:    (746603)
    O1:     Float (4 bytes)  multiplications per second:    (759376)
    
    Og lto: Float (4 bytes)  multiplications per second:    (704953)
    Og:     Float (4 bytes)  multiplications per second:    (697528)
    
    Os lto: Float (4 bytes)  multiplications per second:    (738290)
    Os:     Float (4 bytes)  multiplications per second:    (627845)
    Code:
    flops:  (Teensy 3.6)
    
    O3 lto: Float (4 bytes)  multiplications per second:    (32080536)
    O3:     Float (4 bytes)  multiplications per second:    (32275866)
    
    O2 lto: Float (4 bytes)  multiplications per second:    (14277215)
    O2:     Float (4 bytes)  multiplications per second:    (14094865)
    
    O1 lto: Float (4 bytes)  multiplications per second:    (13105863)
    O1:     Float (4 bytes)  multiplications per second:    (12236621)
    
    Og lto: Float (4 bytes)  multiplications per second:    (8205052)
    Og:     Float (4 bytes)  multiplications per second:    (8205233)
    
    Os lto: Float (4 bytes)  multiplications per second:    (13295244)
    Os:     Float (4 bytes)  multiplications per second:    (11327418)

  13. #38
    Senior Member+ MichaelMeissner's Avatar
    Join Date
    Nov 2012
    Location
    Ayer Massachussetts
    Posts
    2,718
    Quote Originally Posted by PaulStoffregen View Post
    I dug up the flops benchmark... the one which convinced me to use -O1 instead of -O2 as the default for use with gcc 4.8.

    Code:
      // variables for calculation
    #define N 16
      float A[N][N];
      float B[N][N];
      float C[N][N];
      
      memset(A,3.1415,sizeof(A));
      memset(B,8.1415,sizeof(B));
      memset(C,0.0,sizeof(C));
    Ummm, this doesn't do what the benchmark thinks it does. memset() converts its fill value to a single byte, so it writes 0x03030303 into every float in A, 0x08080808 into every float in B, and 0 into every float in C. This means that every element of A will be 3.85009e-37 and every element of B will be 4.09355e-34. Each product underflows to 0, and because the exact result is far below even the denormal range, the multiplies may or may not be slower than normal multiplies and/or issue a trap.
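    The byte-fill behaviour is easy to check on any machine. A minimal sketch, assuming a 32-bit IEEE-754 float; float_bits_after_memset() is a hypothetical helper written for this illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fills one float the way the benchmark does
   and returns the resulting bit pattern. */
uint32_t float_bits_after_memset(double fill) {
  float f;
  uint32_t bits;
  assert(sizeof f == sizeof bits);

  /* memset takes an int and stores it as an unsigned char in every byte.
     The explicit cast mirrors the implicit double-to-int conversion
     that happens in memset(A, 3.1415, sizeof(A)). */
  memset(&f, (int)fill, sizeof f);

  /* Copy the bytes out to inspect them without aliasing problems. */
  memcpy(&bits, &f, sizeof bits);
  return bits;
}
```

    A fill value of 3.1415 is truncated to 3, so every byte becomes 0x03 and the float holds the bit pattern 0x03030303 rather than the value 3.1415.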

  14. #39
    Senior Member+ manitou's Avatar
    Join Date
    Jan 2013
    Posts
    1,476
    here are linpack megaflops for 100x100 single-precision float on T3.6 @180mhz and T3.2 (1.6.12 1.34beta1)
    Code:
                      K66@180mhz   T3.2@96mhz
        fastest LTO   28.92        0.998
        fastest       28.92        1.002
        faster LTO    28.44        0.996
        faster        28.45        0.999
        fast LTO      28.75        1.009
        fast          28.75        1.166
        debug LTO     20.7         0.986
        debug         20.7         0.987
        smallest LTO  26.65        0.973
        smallest      26.34        0.984
    interestingly, older 1.6.12 with 1.32 (-O), K66 linpack runs at 30.02 mflops (faster than 1.34beta1)
    Last edited by manitou; 12-19-2016 at 07:04 PM.

  15. #40
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    4,438
    Quote Originally Posted by MichaelMeissner View Post
    Ummm
    In addition, this "benchmark" does more or less only one thing.
    It's very questionable to use its results for anything other than testing exactly this float multiply.

    This benchmark is way too simple.
    Last edited by Frank B; 12-19-2016 at 07:26 PM.

  16. #41
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    4,438
    Indeed, this initialization, used instead of memset, shows even higher numbers:
    Code:
      for (int i=0; i<N; i++) {
       for (int j=0; j<N; j++) {
        A[i][j]=3.1415f;
        B[i][j]=8.1415f;
        C[i][j]=0.0f;
       }
      }

  17. #42
    Senior Member+ manitou's Avatar
    Join Date
    Jan 2013
    Posts
    1,476
    N = 16 must be a sweet spot. Even with Frank's init, with N = 20 performance drops to 14368672 mults/sec
    Last edited by manitou; 12-20-2016 at 10:09 AM.

  18. #43
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    4,438
    For N=2 you get 62564256 mults/sec

    That's ~3 cycles per multiplication (including overhead...)?
    I guess the compiler is good at optimizing... ;-)
    Then the cache plays a role for higher N, and may not be sufficient for N>16.
    Last edited by Frank B; 12-19-2016 at 08:29 PM.

  19. #44
    Senior Member+ manitou's Avatar
    Join Date
    Jan 2013
    Posts
    1,476
    It's actually twice that fast, because the total op count is 2*N*N*N*i if you count the float add in the inner loop. The label would then change to flops:
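    The arithmetic in the last few posts can be sketched as follows. The multiply count is the formula used in the benchmark, the factor of 2 counts the add in the inner loop, and the cycles-per-multiply figure assumes a 180 MHz clock, which is an assumption here since the post giving the N=2 number does not state the board or speed. All function names are hypothetical.

```c
#include <assert.h>

/* Multiplications per second, as computed by the benchmark:
   N*N*N multiplies per matrix product, 'iterations' products in dt_usec. */
double mults_per_second(int n, long iterations, long dt_usec) {
  assert(n > 0 && iterations >= 0 && dt_usec > 0);
  return (double)n * n * n * (double)iterations * 1e6 / (double)dt_usec;
}

/* Counting the accumulate in the inner loop as a second operation. */
double flops_per_second(int n, long iterations, long dt_usec) {
  return 2.0 * mults_per_second(n, iterations, dt_usec);
}

/* CPU cycles spent per multiplication, loop overhead included. */
double cycles_per_mult(double cpu_hz, double mults_per_sec) {
  assert(mults_per_sec > 0.0);
  return cpu_hz / mults_per_sec;
}
```

    With the N=2 figure of 62564256 mults/sec and the assumed 180 MHz clock, cycles_per_mult() gives about 2.88, which matches the "~3 cycles" estimate.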
    Last edited by manitou; 12-19-2016 at 10:26 PM.

  20. #45
    Senior Member PaulStoffregen's Avatar
    Join Date
    Nov 2012
    Posts
    17,615
    I ran several more tests, which probably aren't great benchmarks, but they are based on things commonly done with Arduino.

    In some cases, -O1 seems to beat -O2 and -O3. For some specific tasks, -O3 with LTO works wonders.

    I'm still debating what the default should be for each board. Obviously Teensy LC needs -Os. In most of the tests I've tried, -O3 adds significantly to the program size, so I'm a bit reluctant to make it the default for Teensy 3.0, 3.1, 3.2 where so many programs already exist.

  21. #46
    Senior Member duff's Avatar
    Join Date
    Jan 2013
    Location
    Las Vegas
    Posts
    904
    Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:

    Code:
    000018b8 <L_783_delayMicroseconds>:
        18b8:    3b01          subs    r3, #1
        18ba:    d1fd          bne.n    18b8 <L_783_delayMicroseconds>
        18bc:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
        18c0:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
        18c4:    b2db          uxtb    r3, r3
        18c6:    2900          cmp    r1, #0
        18c8:    d1f2          bne.n    18b0 <main+0x28>
        18ca:    b13b          cbz    r3, 18dc <L_783_delayMicroseconds+0x24>
        18cc:    6803          ldr    r3, [r0, #0]
        18ce:    f023 0302     bic.w    r3, r3, #2
        18d2:    6003          str    r3, [r0, #0]
        18d4:    e7ef          b.n    18b6 <main+0x2e>
        18d6:    f882 5100     strb.w    r5, [r2, #256]    ; 0x100
        18da:    e7ec          b.n    18b6 <main+0x2e>
        18dc:    6803          ldr    r3, [r0, #0]
        18de:    f043 0303     orr.w    r3, r3, #3
        18e2:    6003          str    r3, [r0, #0]
        18e4:    e7e7          b.n    18b6 <main+0x2e>
        18e6:    f8df 8078     ldr.w    r8, [pc, #120]    ; 1960 <L_869_delayMicroseconds+0x58>
        18ea:    f8df c078     ldr.w    ip, [pc, #120]    ; 1964 <L_869_delayMicroseconds+0x5c>
        18ee:    f8df e078     ldr.w    lr, [pc, #120]    ; 1968 <L_869_delayMicroseconds+0x60>
        18f2:    4f18          ldr    r7, [pc, #96]    ; (1954 <L_869_delayMicroseconds+0x4c>)
        18f4:    4e18          ldr    r6, [pc, #96]    ; (1958 <L_869_delayMicroseconds+0x50>)
        18f6:    4d19          ldr    r5, [pc, #100]    ; (195c <L_869_delayMicroseconds+0x54>)
        18f8:    4c15          ldr    r4, [pc, #84]    ; (1950 <L_869_delayMicroseconds+0x48>)
        18fa:    e010          b.n    191e <L_869_delayMicroseconds+0x16>
        18fc:    b1eb          cbz    r3, 193a <L_869_delayMicroseconds+0x32>
        18fe:    6803          ldr    r3, [r0, #0]
        1900:    f023 0302     bic.w    r3, r3, #2
        1904:    6003          str    r3, [r0, #0]
        1906:    4623          mov    r3, r4
    
    
    00001908 <L_869_delayMicroseconds>:
        1908:    3b01          subs    r3, #1
        190a:    d1fd          bne.n    1908 <L_869_delayMicroseconds>
        190c:    f898 3000     ldrb.w    r3, [r8]
        1910:    f89c 3000     ldrb.w    r3, [ip]
        1914:    f89e 3000     ldrb.w    r3, [lr]
        1918:    783b          ldrb    r3, [r7, #0]
        191a:    7833          ldrb    r3, [r6, #0]
        191c:    782b          ldrb    r3, [r5, #0]
        191e:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
        1922:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
        1926:    b2db          uxtb    r3, r3
        1928:    2900          cmp    r1, #0
        192a:    d0e7          beq.n    18fc <L_783_delayMicroseconds+0x44>
        192c:    b113          cbz    r3, 1934 <L_869_delayMicroseconds+0x2c>
        192e:    f882 9100     strb.w    r9, [r2, #256]    ; 0x100
        1932:    e7e8          b.n    1906 <L_783_delayMicroseconds+0x4e>
        1934:    f882 9080     strb.w    r9, [r2, #128]    ; 0x80
        1938:    e7e5          b.n    1906 <L_783_delayMicroseconds+0x4e>
        193a:    6803          ldr    r3, [r0, #0]
        193c:    f043 0303     orr.w    r3, r3, #3
        1940:    6003          str    r3, [r0, #0]
        1942:    e7e0          b.n    1906 <L_783_delayMicroseconds+0x4e>
        1944:    4004b014     andmi    fp, r4, r4, lsl r0
        1948:    43fe1014     mvnsmi    r1, #20
        194c:    1fff8e08     svcne    0x00ff8e08
        1950:    00f42400     rscseq    r2, r4, r0, lsl #8
        1954:    1fff8e0c     svcne    0x00ff8e0c
        1958:    1fff8e00     svcne    0x00ff8e00
        195c:    1fff8dff     svcne    0x00ff8dff
        1960:    1fff8e09     svcne    0x00ff8e09
        1964:    1fff8e0a     svcne    0x00ff8e0a
        1968:    1fff8e0b     svcne    0x00ff8e0b
    Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

    For reference, here is delayMicroseconds using -O3 without LTO:
    Code:
    0000048a <L_36_delayMicroseconds>:
         48a:    3b01          subs    r3, #1
         48c:    d1fd          bne.n    48a <L_36_delayMicroseconds>
         48e:    bd08          pop    {r3, pc}
         490:    00f42400     rscseq    r2, r4, r0, lsl #8

  22. #47
    Senior Member+ defragster's Avatar
    Join Date
    Feb 2015
    Posts
    6,441
    Quote Originally Posted by duff View Post
    Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:
    ...
    Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

    For reference, here is delayMicroseconds using -O3 without LTO:
    I was going to make a note on Zilch - the FASTEST & FASTER LTO won't run the samples. Fastest and Faster seem to work - with simple as I have changed it - adding in use of FrankB's vaporized <pending refinement?> WFI based monitoring 'processorUsage.h'.
    Last edited by defragster; 12-20-2016 at 05:04 AM.

  23. #48
    Senior Member duff's Avatar
    Join Date
    Jan 2013
    Location
    Las Vegas
    Posts
    904
    How do you stop LTO from touching my inline assembly code?

  24. #49
    Senior Member+ Frank B's Avatar
    Join Date
    Apr 2014
    Location
    Germany NRW
    Posts
    4,438
    Does "asm volatile" help?

  25. #50
    Senior Member duff's Avatar
    Join Date
    Jan 2013
    Location
    Las Vegas
    Posts
    904
    Quote Originally Posted by Frank B View Post
    Does "asm volatile" help?
    Unfortunately not. I tried to find some attribute, but found nothing. delayMicroseconds seems to work, though; I'll check whether the timing is good on my scope tomorrow.
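    One thing that could be tried, offered here only as a sketch and not as a confirmed fix for the Zilch problem, is GCC's noinline function attribute. It keeps the function containing the inline assembly as a separate function even under LTO, so its disassembly can be compared directly with the non-LTO build. delay_loop() is a hypothetical stand-in modelled on the subs/bne loop in the listings above, assuming a Thumb-2 target such as Teensy 3.x; the non-ARM branch exists only so the sketch also runs on a desktop.

```c
#include <assert.h>
#include <stdint.h>

/* noinline asks GCC not to merge this function into its callers,
   which also applies when link-time optimization is enabled. */
__attribute__((noinline))
uint32_t delay_loop(uint32_t n) {
  assert(n > 0);
#if defined(__arm__) || defined(__thumb__)
  /* Count n down to zero; "+r" tells the compiler n is read and written,
     and "cc" that the condition flags are clobbered. */
  __asm__ volatile(
    "1: subs %0, #1 \n"
    "   bne 1b      \n"
    : "+r"(n)
    :
    : "cc");
#else
  /* Portable fallback; the empty asm stops the loop from being removed. */
  while (n != 0) {
    n--;
    __asm__ volatile("" : "+r"(n));
  }
#endif
  return n;   /* always 0 once the loop has finished */
}
```

    Another option along the same lines would be to build only the file containing the assembly without -flto. Neither approach has been tested against the libraries discussed in this thread.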
