Teensyduino 1.34 Beta #1 (ARM Toolchain Update)

Status
Not open for further replies.
You're right, RPI3 Jessie is still running in 32 bit mode.

I tested 1.6.12 with 1.34beta1 on Mac OS,
coremark:
previously T3.2@96mhz -O2 189.4 iterations/sec | with LTO fastest 207.29
previously T3.6@180mhz -O2 384.0 | with LTO fastest 447.7
... so many optimization choices ...

Code:
T3.6@180mhz coremark
        fastest LTO 447.676389
        fastest     463.692033
        faster LTO  437.121360
        faster      434.528617
        fast  LTO   333.619557
        fast        333.032915
        small LTO   323.248789  no float printf
        small       320.692182

GCC6 :

- 180MHz fastest with LTO: Compiler crashes with this sketch ("lto1.exe: internal compiler error: Segmentation fault")
- 180MHz fastest without LTO:
Code:
Start
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 13072
Total time (secs): 13.072000
Iterations/Sec   : 458.996328
Iterations       : 6000
Compiler version : GCC6.2.1 20161205 (release) [ARM/embedded-6-branch revision 243739]
Compiler flags   : 
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 458.996328 / GCC6.2.1 20161205 (release) [ARM/embedded-6-branch revision 243739]  / STACK

So, again 10 points faster, even without LTO (more than twice as fast as T3.2 @ 96MHz)
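
For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic. It assumes the "10 points" comparison is against the earlier 447.676389 fastest-LTO figure and that the T3.2 reference is the 207.29 LTO result; the constant and function names are made up for illustration.

```cpp
#include <cstdio>

// Figures copied from the posts above; the names are hypothetical.
constexpr double kIterations        = 6000.0;      // "Iterations"
constexpr double kTotalSeconds      = 13.072;      // "Total time (secs)"
constexpr double kEarlierFastestLto = 447.676389;  // T3.6@180MHz, fastest LTO (assumed baseline)
constexpr double kT32FastestLto     = 207.29;      // T3.2@96MHz, fastest LTO (assumed baseline)

// CoreMark reports iterations/sec as iterations divided by elapsed seconds.
double iterations_per_sec(double iterations, double seconds)
{
    return iterations / seconds;
}

// Prints the rate, the absolute gain, and the ratio against the T3.2 result.
void print_summary()
{
    const double rate = iterations_per_sec(kIterations, kTotalSeconds);
    std::printf("iterations/sec      : %.6f\n", rate);
    std::printf("gain over earlier   : %.2f\n", rate - kEarlierFastestLto);
    std::printf("ratio vs T3.2@96MHz : %.2fx\n", rate / kT32FastestLto);
}
```

With these baselines the gain comes out a bit above 11 iterations/sec and the ratio a bit above 2.2x, which matches the rough figures quoted.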

240MHz: 612.199466 ... no comment ..wow..
My benchmarks were done with "-mpure-code" - seems to be a little bit faster.
 
I got a Pi 3 months back (not powered up yet), but I understood it was still using 32 bit Jessie for compatibility with all existing code/usage?

I was under the impression that it is now 64 bit. I know Odroid is trying to transition to 64 bit Linux on the C2.
 
My guess is that there are some alternate 64 bit setups for RPI3, but I don't know of any mainline ones yet. Although I have not looked yet.

Yes, the Odroid C2's main setup is 64 bits. There are still issues with it, for example trying to run Arduino on it. I played around enough to get the main parts of the compiler and downloads to work, but have not gotten the Serial monitor to work. More details in the thread: http://forum.odroid.com/viewtopic.php?f=136&t=21249

I started looking into building a 64 bit version from sources, but then ran into issues where pieces of the build come from zip files or the like that have components for the different distros, and there is not one for ARM 64 bits.... So I punted.
 
In case anyone's wondering, I am not eager to expand Linux's portion of the Teensyduino release process from 3 of 5 to 4 of 6 files built.

Even if I was, my position is the same as before the 32 bit linuxarm build: I will officially support whatever architectures Arduino.cc officially supports with their non-beta releases. Until Arduino.cc adds a 64 bit linuxarm build, I will not do it. I know that's probably not the answer some Odroid enthusiasts want to hear, but hopefully a clear answer is better than uncertainty?
 
Thanks Paul,

Actually I would be happy with the 32 bit stuff working fine. And for me just having the compiler and upload is fine... There are obviously alternatives to using the terminal monitor.

Actually I would be even happier to be able to do all of it from the command line. The current Arduino added Linux support for command line to work without GUI, which I tried and it worked all the way up to upload, which failed... But that is another story...
 
Here is what the delayMicroseconds disassembly looks like with -O3 and LTO enabled on a T3:

Code:
000018b8 <L_783_delayMicroseconds>:
    18b8:    3b01          subs    r3, #1
    18ba:    d1fd          bne.n    18b8 <L_783_delayMicroseconds>
    18bc:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
    18c0:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
    18c4:    b2db          uxtb    r3, r3
    18c6:    2900          cmp    r1, #0
    18c8:    d1f2          bne.n    18b0 <main+0x28>
    18ca:    b13b          cbz    r3, 18dc <L_783_delayMicroseconds+0x24>
    18cc:    6803          ldr    r3, [r0, #0]
    18ce:    f023 0302     bic.w    r3, r3, #2
    18d2:    6003          str    r3, [r0, #0]
    18d4:    e7ef          b.n    18b6 <main+0x2e>
    18d6:    f882 5100     strb.w    r5, [r2, #256]    ; 0x100
    18da:    e7ec          b.n    18b6 <main+0x2e>
    18dc:    6803          ldr    r3, [r0, #0]
    18de:    f043 0303     orr.w    r3, r3, #3
    18e2:    6003          str    r3, [r0, #0]
    18e4:    e7e7          b.n    18b6 <main+0x2e>
    18e6:    f8df 8078     ldr.w    r8, [pc, #120]    ; 1960 <L_869_delayMicroseconds+0x58>
    18ea:    f8df c078     ldr.w    ip, [pc, #120]    ; 1964 <L_869_delayMicroseconds+0x5c>
    18ee:    f8df e078     ldr.w    lr, [pc, #120]    ; 1968 <L_869_delayMicroseconds+0x60>
    18f2:    4f18          ldr    r7, [pc, #96]    ; (1954 <L_869_delayMicroseconds+0x4c>)
    18f4:    4e18          ldr    r6, [pc, #96]    ; (1958 <L_869_delayMicroseconds+0x50>)
    18f6:    4d19          ldr    r5, [pc, #100]    ; (195c <L_869_delayMicroseconds+0x54>)
    18f8:    4c15          ldr    r4, [pc, #84]    ; (1950 <L_869_delayMicroseconds+0x48>)
    18fa:    e010          b.n    191e <L_869_delayMicroseconds+0x16>
    18fc:    b1eb          cbz    r3, 193a <L_869_delayMicroseconds+0x32>
    18fe:    6803          ldr    r3, [r0, #0]
    1900:    f023 0302     bic.w    r3, r3, #2
    1904:    6003          str    r3, [r0, #0]
    1906:    4623          mov    r3, r4


00001908 <L_869_delayMicroseconds>:
    1908:    3b01          subs    r3, #1
    190a:    d1fd          bne.n    1908 <L_869_delayMicroseconds>
    190c:    f898 3000     ldrb.w    r3, [r8]
    1910:    f89c 3000     ldrb.w    r3, [ip]
    1914:    f89e 3000     ldrb.w    r3, [lr]
    1918:    783b          ldrb    r3, [r7, #0]
    191a:    7833          ldrb    r3, [r6, #0]
    191c:    782b          ldrb    r3, [r5, #0]
    191e:    f892 3200     ldrb.w    r3, [r2, #512]    ; 0x200
    1922:    f892 1280     ldrb.w    r1, [r2, #640]    ; 0x280
    1926:    b2db          uxtb    r3, r3
    1928:    2900          cmp    r1, #0
    192a:    d0e7          beq.n    18fc <L_783_delayMicroseconds+0x44>
    192c:    b113          cbz    r3, 1934 <L_869_delayMicroseconds+0x2c>
    192e:    f882 9100     strb.w    r9, [r2, #256]    ; 0x100
    1932:    e7e8          b.n    1906 <L_783_delayMicroseconds+0x4e>
    1934:    f882 9080     strb.w    r9, [r2, #128]    ; 0x80
    1938:    e7e5          b.n    1906 <L_783_delayMicroseconds+0x4e>
    193a:    6803          ldr    r3, [r0, #0]
    193c:    f043 0303     orr.w    r3, r3, #3
    1940:    6003          str    r3, [r0, #0]
    1942:    e7e0          b.n    1906 <L_783_delayMicroseconds+0x4e>
    1944:    4004b014     andmi    fp, r4, r4, lsl r0
    1948:    43fe1014     mvnsmi    r1, #20
    194c:    1fff8e08     svcne    0x00ff8e08
    1950:    00f42400     rscseq    r2, r4, r0, lsl #8
    1954:    1fff8e0c     svcne    0x00ff8e0c
    1958:    1fff8e00     svcne    0x00ff8e00
    195c:    1fff8dff     svcne    0x00ff8dff
    1960:    1fff8e09     svcne    0x00ff8e09
    1964:    1fff8e0a     svcne    0x00ff8e0a
    1968:    1fff8e0b     svcne    0x00ff8e0b
Ouch, seems like all the inline assembly gets mucked up with LTO enabled! I found this out with my Zilch library which heavily uses inline assembly.

For reference, here is delayMicroseconds using -O3 without LTO:
Code:
0000048a <L_36_delayMicroseconds>:
     48a:    3b01          subs    r3, #1
     48c:    d1fd          bne.n    48a <L_36_delayMicroseconds>
     48e:    bd08          pop    {r3, pc}
     490:    00f42400     rscseq    r2, r4, r0, lsl #8
No, it doesn't. The inline assembly part is just the 'subs ...; bne.n ...' which is identical in both cases.

BTW, Zilch Simple_Task works with higher optimization levels, if I change zilch.cpp:
Code:
void task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {

to either:
Code:
void __attribute__ ((noinline)) task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {
or:
Code:
void __attribute__ ((naked)) task_swap( volatile stack_frame_t *prevframe, volatile stack_frame_t *nextframe ) {

I think there is a GCC bug here, since simply adding a proper clobber list to the asm statement doesn't work.
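
One plausible reason noinline helps, assuming the asm body relies on the arguments arriving the way a real function call delivers them: at higher optimization levels (and more so with LTO) GCC may inline the function into its caller, so there is no call boundary any more. Here is a minimal sketch of the attribute on a made-up function. `frame_t` and `swap_frames` are hypothetical stand-ins, not the Zilch types or code.

```cpp
#include <cstdint>

// Hypothetical stand-in for a saved task frame; not Zilch's stack_frame_t.
struct frame_t {
    uint32_t sp;
};

// noinline asks GCC to always emit a real call to this function, so the
// body is never merged into the caller, even at -O3 or with LTO.
__attribute__((noinline))
void swap_frames(volatile frame_t *prev, volatile frame_t *next)
{
    const uint32_t tmp = prev->sp;
    prev->sp = next->sp;
    next->sp = tmp;
    // Empty asm with a "memory" clobber: a compiler-level barrier only,
    // it emits no instructions.
    __asm__ volatile("" ::: "memory");
}
```

The naked attribute goes further and drops the compiler-generated prologue and epilogue, so it is only appropriate when the body is written entirely in assembly.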


In general, GCC has no idea what the inline assembly does and assumes it doesn't change memory or registers. You need to add proper clobber lists, which you don't have for the Zilch inline assembly.
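
To make that concrete, here is a minimal sketch of a busy-wait loop with its operands and clobbers spelled out. `spin_delay`, `bump_after_delay` and `shared_counter` are hypothetical names for illustration; this is not the Teensy core's delayMicroseconds or the Zilch code. The Thumb-2 branch only builds for ARM targets, so there is a plain fallback for building on a PC.

```cpp
#include <cstdint>

// Hypothetical busy-wait helper, for illustration only.
static inline void spin_delay(uint32_t n)
{
    if (n == 0) return;
#if defined(__thumb2__)
    __asm__ volatile(
        "1: subs %0, #1 \n\t"
        "   bne  1b     \n"
        : "+r"(n)   // output: n is both read and written by the loop
        :           // no input-only operands
        : "cc"      // clobber: subs changes the condition flags
    );
#else
    // Portable fallback so the sketch also builds on a host PC.
    while (n--) {
        __asm__ volatile("" ::: "memory");  // empty asm, compiler barrier only
    }
#endif
}

// Hypothetical variable touched before and after the delay.
static uint32_t shared_counter = 0;

uint32_t bump_after_delay(uint32_t n)
{
    shared_counter += 1;
    spin_delay(n);
    // The "memory" clobber tells GCC the asm may read or write memory, so
    // values cached in registers are written back and reloaded around it.
    __asm__ volatile("" ::: "memory");
    shared_counter += 1;
    return shared_counter;
}
```

Any register the asm writes that is not listed as an output belongs in the clobber list too (for example "r4"), otherwise GCC is free to keep a live value in it across the statement.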
 
My latest version of Zilch currently only supports T3.2 and works with all optimizations except Fastest w/ LTO.
 