Teensyduino 1.22 Features

PaulStoffregen

Many items on this list were originally planned for 1.21, and have been discussed on that older thread.


  • Emulate Arduino Uno restart behavior upon serial monitor open. Teensy 2.0 has this, 3.x needs it.
  • Fix pin 33 bug (driving this pin low disrupts code upload and causes all sorts of other problems)
  • Breakpoint instruction to request software debug mode & disable C_DEBUGEN bit
  • Method to set the MK20 lock bit
  • AnalogRead should support differential inputs and using PGA
  • Ethernet should support UDP multicast
  • Command line loader update for Teensy 3.1 and include with installer
  • Wire library should define Wire1 for other I2C port on Teensy 3.1
  • Wire library should have setSDA(pin) and setSCL(pin), to control which pins are used
  • Serial1,2,3 should have setRX(pin) and setTX(pin), to control which pins are used
  • SD library should release SPI bus while SD card is busy
  • Support MacOS 10.9.5 & 10.x new app signature requirements (very difficult with lib installs)
  • SoftwareSerial support for any pins on Teensy 3.x, using CMT & DMA
  • detachInterruptVector()
  • pinMode to support more modes: INPUT_PULLDOWN, INPUT_DISABLE, INPUT_ANALOG
  • Fix IRremote at 72 MHz & other speeds
  • Library install/update/uninstall manager
  • Libraries to include or update:
    • USB Host Shield
    • ADC
    • SmartMatrix
    • i2c_t3
    • SdFat (Teensy-LC support and long file names)

This thread is the place to propose and discuss any new features. Please understand not everything on this list will necessarily make it into 1.22. My general goal is to release every 3 to 4 months, rolling up all the new features and libraries that are stable at the time, and also as necessary to stay in sync with Arduino's releases.
 
  • Command line loader update for Teensy 3.1 and include with installer

I found the following features useful for my multi Teensy application, so I added them to the PJRC command line loader:

-Upload same program to all (or selected) Teensies attached to USB
-Fast command line reset and reboot of all (or selected) Teensies attached to USB

While the mods are straightforward, I would like to see a PJRC-supported version.

I would also appreciate:
  • A non-Arduino distribution of 'tools' and support functions

Ceterum censeo:
I look forward to a Teensy4
 
  • Serial1,2,3 should have setRX(pin) and setTX(pin), to control which pins are used
In the same light, the reason I was asking about knowing which hardware Serial ports are defined is that I use that in my AX servo support.
In particular, I use it to make use of the half-duplex mode of the serial ports.
For example, to enable half duplex on Serial1, I do:
Code:
    if (s_paxStream == (Stream*)&Serial1) {
        Serial1.begin(baud);
#if defined(__MK20DX256__) || defined(__MKL26Z64__)
        UART0_C1 |= UART_C1_LOOPS | UART_C1_RSRC;
        CORE_PIN1_CONFIG |= PORT_PCR_PE | PORT_PCR_PS; // pullup on output pin
#endif
    }
To switch into transmit mode:
Code:
        UART0_C3 |=  UART_C3_TXDIR;
And likewise, to switch back to receive mode:
Code:
        UART0_C3 &= ~UART_C3_TXDIR;

And likewise for Serial2 or Serial3... It would be great if the Serial objects had the half-duplex support built into them.

Kurt
 
Based on public statements Massimo and other Arduino devs have made recently, it seems very likely Arduino will release either a new product or new software next week, on March 28, "Arduino Day". If it's hardware, a new software release would likely follow a few days later.

Limor & Phil (Adafruit) showed an official Arduino-brand Gemma (ATtiny-based) bare circuit board on their video 2 days ago. That may very well be the new product they'll announce on the 28th.

So it's looking like I'll be making a new Teensyduino release to support whatever new version they publish. My hope is to get some of the smaller, simpler features on the above list done over the next several days. With such a short time, I could really use help testing!

If anyone knows of libraries that should be added or updated or other things that really should be addressed quickly before "Arduino Day", now's the time to speak up. We only have a matter of several days.
 
I'll run what I can - or if you see suspect areas of change I can get to, please post.

Re: the first item:
  • Emulate Arduino Uno restart behavior upon serial monitor open. Teensy 2.0 has this, 3.x needs it.

Please make sure this is somehow optional. It can have great value to start fresh and in sync with the monitor, but it can also ruin an ongoing test: say the monitor gets closed or redirected to another Teensy, and then you want to pop in on 'this' ongoing device to monitor it - only to have it restart would be 'rude'. Picture a battery-powered remote unit coming back after a day's work to be reviewed and recharged - popped on USB and then - BAM! Restart.
 
One item to add would be the ability to stop auto-upload from the IDE. With two Teensies active, you cannot choose where the upload goes; if it waited for a button press, this would be addressed. When you have a master/slave setup and are alternately changing code on both, having the wrong code overwrite a running system is not right. And it is really ugly when you mix an LC and a 3.1 and the upload is refused for the wrong device.

The TeensyD does have UI space for 'IDE Auto' right beside the (button) 'Auto'.

*As noted before, being 'PC' centric, that is what I expected the TeensyD 'Auto' button to do, but alas, history and Teensy-centricity precede me.

Question: Under 1.6.0, could you have multiple IDE monitors open? You cannot under 1.6.1, and I'm sure I could under 1.6.0.
 
In 1.0.6 and earlier, Verify used to tell Teensy Loader about your freshly compiled file, but not attempt to reboot.

We lost that in 1.6.0 and 1.6.1. It didn't seem to have any way to work into platform.txt. I should really add a patch to bring this back.
 
In 1.0.6 and earlier, Verify used to tell Teensy Loader about your freshly compiled file, but not attempt to reboot.

We lost that in 1.6.0 and 1.6.1. It didn't seem to have any way to work into platform.txt. I should really add a patch to bring this back.

That would be good and explains the disconnect. I wondered why TeensyD told me to 'Verify' at the end of Install.

{I posted that the IDE told me Teensy Loader could not find the file TEST_tft_Ring.cpp - I'm wondering if this break on Verify might run into Upload too?}
 
I'm wondering if this break on Verify might run into Upload too?

Upload always informs Teensy Loader of the new file - of course, only if the compile process succeeds in creating it. You can easily check this by using Help > Verbose Info in Teensy Loader. You'll see the communication from Arduino, where it gives Teensy Loader the new directory and file name.
 
"Optimized" in the menu results in -O (= -O1), which is almost no optimization(?)

"With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. "

Edit: Clarification - perhaps -O2 is a better choice, since it gives better results and there is almost no difference in compilation time.

"Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. "
...so with -Os, most of the -O2 options are already switched on. Going from -Os down to -O, you disable them. Does it make sense to call this "optimized"?

Edit: My choice would be -O3 or -Ofast... but perhaps -O2 is ok...
 
When I got that IDE error message, I'm assuming it didn't tell Teensy Loader, since it had no file to forward? From that post #8, I scanned the verbose output, and in 22 instances (in 9 minutes) of '...Ring.cpp.hex', the verbose output never said it missed the file. Perhaps the Loader already had the file open, so the IDE was denied access and reported 'not find'?
 
Edit: Clarification - perhaps -O2 is a better choice, since it gives better results and there is almost no difference in compilation time.

Did you test this with any benchmarks? Or is this statement based only on the general gcc documentation?
 
Edit: My choice would be -O3 or -Ofast... but perhaps -O2 is ok...

You have to be careful before making this kind of assumption. Bigger does not mean better. -O2 tends to be a safe bet for desktop applications, and in most cases you don't need the potentially very limited improvements -O3 will get you, especially since it can make performance worse (from I-cache thrashing, among other things). Profiling and optimizing the hot code paths will get you much more than a magic flag.

But embedded microcontrollers have very different performance characteristics. Maybe -O2 is better, but you should prove it first. For -O3, honestly it's very doubtful. I think all it will get you is bloated binaries (because of aggressive inlining), but I could be wrong.

Another thing to keep in mind is that more optimization results in more undefined behavior wonkiness. Again, not a big problem for most desktop code, but the combination of tricky compiler optimizations with tricky low-level code magic should make you wary.

You can find benchmarks testing the various optimization levels on the web. Here on Phoronix (I know, Phoronix is evil, but Michael's test suite is quite good): http://www.phoronix.com/scan.php?page=article&item=gcc_47_optimizations&num=2
 
Paul, Koromix:
Yeah, I know that sometimes O1 is faster and O3 is not faster than O2 in every case.
Basically that's what I wrote: O2 is OK.
Take a look at the link above :)

P.S. I know that it is arguable to compare different architectures.

I don't want to spend time on this discussion; it was only a suggestion. Perhaps file some bug reports if O1 is faster than O2 most of the time on ARM? That would be a bug!

Edit:
There's no need to prove that O2 is faster than O1. But it would be bad if it was generally slower than O1.

P.S. The newlib is compiled with -O2 (GCC ARM toolchain).
 
I ran just a few benchmarks. I'll admit, nothing comprehensive, but at least some tests.

-O1 was the clear winner. -O2 and -O3 were significantly slower on Teensy 3.1, and they usually produced slightly larger code. On Teensy-LC, all the results were very similar.

I'm certainly interested to hear any more (well-practiced) benchmarking. But I don't want to talk about -O2 & -O3 in abstract, theoretical terms. I believe in real benchmarks.
 
I ran just a few benchmarks. I'll admit, nothing comprehensive, but at least some tests.

-O1 was the clear winner. -O2 and -O3 were significantly slower on Teensy 3.1, and they usually produced slightly larger code. On Teensy-LC, all the results were very similar.

I'm certainly interested to hear any more (well-practiced) benchmarking. But I don't want to talk about -O2 & -O3 in abstract, theoretical terms. I believe in real benchmarks.

Perhaps try to write some which take advantage of inlining small functions, loop unrolling, automatic inlining, or loop unswitching, for example?

I don't want to argue further; it's your decision.
But can you give a link to your benchmarks, please? And perhaps they are of interest for the ARM guys at Launchpad, too?
 
Here's one from a recent thread.

Code:
// https://forum.pjrc.com/threads/27959-FLOPS-not-scaling-to-F_CPU
#include <math.h>
#include <string.h>

//FASTRUN
void float_MatMult(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
        C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}

void setup() {

  while (!Serial) ;

  // variables for timing
  int i=0;
  int dt;

  // variables for calculation
#define N 16
  float A[N][N];
  float B[N][N];
  float C[N][N];

  // note: memset fills each byte with (int)value, so A and B are filled
  // with repeating byte patterns here, not the float values shown
  memset(A,3.1415,sizeof(A));
  memset(B,8.1415,sizeof(B));
  memset(C,0.0,sizeof(C));

  int tbegin = micros();
#if 1
  for (i=1;;i++) {
    // do calculation
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
#else
  for (i=0; i < 200; i++) {
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
  }
  dt = micros() - tbegin;
#endif

  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  float total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
}

void loop() {
}
 
My decision on optimize flags is flexible. If real benchmarking can prove other settings are better, I will be happy to change the settings for Teensyduino 1.22 or other future versions.
 
OK, I'm back from work now and have a little more time.

I can't follow: O1 is not faster than O2 here. A little bit, but not "significantly".
To be fair, one should mention that there are several thousand interrupts from SysTick during these loops, so this cannot give fully reliable results.
Also, your example is not very optimizable by the compiler, because it's already quite good.

The best optimizer is the one sitting in front of the computer. With well-written routines, GCC has not much left to optimize...
But most Arduino users are not very experienced programmers, and not as good as you, Paul. Arduino is for beginners.

Let's look at this modified example:
Code:
#include <math.h>
#include <string.h>

// Compile with "Optimized" from the Arduino menu


FASTRUN
void float_MatMult(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
        C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}

FASTRUN __attribute__((optimize("O2")))
void float_MatMultO2(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
        C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}


FASTRUN __attribute__((optimize("O3")))
void float_MatMultO3(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ )
        C[i*n+j] += A[i*p+k]*B[k*n+j];
    }
}

float multiply(float* A, float* B, int i, int p, int k, int n, int j) {
  return A[i*p+k]*B[k*n+j];
}  

void float_MatMult_noexperienceduser(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ ) {
        C[i*n+j] += multiply(A, B, i, p, k, n, j);
      }
    }
}

__attribute__((optimize("O2")))
float multiplyO2(float* A, float* B, int i, int p, int k, int n, int j) {
  return A[i*p+k]*B[k*n+j];
}  

__attribute__((optimize("O2")))
void float_MatMult_noexperienceduserO2(float* A, float* B, int m, int p, int n, float* C) {
  // A = input matrix (m x p)
  // B = input matrix (p x n)
  // m = number of rows in A
  // p = number of columns in A = number of rows in B
  // n = number of columns in B
  // C = output matrix = A*B (m x n)
  int i, j, k;
  for ( i = 0; i < m; i++ )
    for ( j = 0; j < n; j++ ){
      C[i*n+j] = 0;
      for( k = 0; k < p; k++ ) {
        C[i*n+j] += multiplyO2(A, B, i, p, k, n, j);
      }
    }
}





void setup() {

  while (!Serial) ;

  // variables for timing
  int i=0;
  int dt;

  // variables for calculation
#define N 16
  float A[N][N];
  float B[N][N];
  float C[N][N];

  memset(A,3.1415,sizeof(A));
  memset(B,8.1415,sizeof(B));
  memset(C,0.0,sizeof(C));

  int tbegin = micros();

  for (i=1;;i++) {
    // do calculation
    float_MatMult((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
  Serial.println("Optimized");
  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  float total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
  
  delay(500);
  
  tbegin = micros();

  for (i=1;;i++) {
    // do calculation
    float_MatMultO2((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
  Serial.println("Optimized O2");
  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);

  delay(500);
  
  tbegin = micros();

  for (i=1;;i++) {
    // do calculation
    float_MatMultO3((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
  Serial.println("Optimized O3");
  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);

  Serial.println();

  delay(500);
  
  tbegin = micros();

  for (i=1;;i++) {
    // do calculation
    float_MatMult_noexperienceduser((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }
  
  Serial.println("Optimized unexperienced");
  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);

  delay(500);
  
  tbegin = micros();

  for (i=1;;i++) {
    // do calculation
    float_MatMult_noexperienceduserO2((float*) A, (float*)B, N,N,N, (float*)C);
    // check if t_delay has passed
    dt = micros() - tbegin;
    if (dt > 1000000) break;
  }

  Serial.println("Optimized unexperienced O2");
  Serial.printf("(%dx%d) matrices: ", N, N);
  Serial.printf("%d matrices in %d usec: ", i, dt);
  Serial.printf("%d matrices/second\n", (int)((float)i*1000000/dt));
  Serial.printf("Float (%d bytes) ", sizeof(float));
  total = N*N*N*i*1e6 / (float)dt;
  Serial.printf(" multiplications per second:\t(%u)\n", (unsigned int)total);
}

void loop() {
}

I added two variants which show what the compiler can achieve for less experienced users.

Here are the results:
Code:
Optimized
(16x16) matrices: 194 matrices in 1001789 usec: 193 matrices/second
Float (4 bytes)  multiplications per second:	(793204)
Optimized O2
(16x16) matrices: 194 matrices in 1001794 usec: 193 matrices/second
Float (4 bytes)  multiplications per second:	(793201)
Optimized O3
(16x16) matrices: 194 matrices in 1001776 usec: 193 matrices/second
Float (4 bytes)  multiplications per second:	(793215)

Optimized unexperienced
(16x16) matrices: 133 matrices in 1005602 usec: 132 matrices/second
Float (4 bytes)  multiplications per second:	(541733)
Optimized unexperienced O2
(16x16) matrices: 191 matrices in 1002199 usec: 190 matrices/second
Float (4 bytes)  multiplications per second:	(780619)

You'll notice that:
a) there is almost no difference between O1, O2 and O3. I can't see any significant sign that O1 is faster.
b) if you give the optimizer a chance to do its job, there is indeed a big difference. O2 is really MUCH faster than O1 (and that's only one additional optimization, I think: auto-inlining).

It should further be mentioned that optimization works better across whole compilation units; the single functions in this example are not optimal.



Edit: These are the results for Teensy LC (N=8 due to less RAM):

Code:
Optimized
(8x8) matrices: 320 matrices in 1000715 usec: 319 matrices/second
Float (4 bytes)  multiplications per second:	(163722)
Optimized O2
(8x8) matrices: 320 matrices in 1000694 usec: 319 matrices/second
Float (4 bytes)  multiplications per second:	(163726)
Optimized O3
(8x8) matrices: 320 matrices in 1000691 usec: 319 matrices/second
Float (4 bytes)  multiplications per second:	(163726)

Optimized unexperienced
(8x8) matrices: 296 matrices in 1000530 usec: 295 matrices/second
Float (4 bytes)  multiplications per second:	(151471)
Optimized unexperienced O2
(8x8) matrices: 331 matrices in 1000038 usec: 330 matrices/second
Float (4 bytes)  multiplications per second:	(169465)
 
Given that the benchmark is all floating point, and given that the Teensys do not support hardware floating point, I would have to imagine that the majority of the time is spent in floating-point emulation functions. Please consider using a benchmark that more closely represents your actual application. If your actual application is floating-point dependent, perhaps you should consider using a processor with hardware floating point, such as a Raspberry Pi, BeagleBone Black, or Edison.
 
Perhaps we can try one or two from SPECint?
But we want to test the compiler's ability to optimize, not the absolute power of the Teensy.
 
Count me in on testing C_DEBUGEN...

This fact hasn't been documented anywhere (yet), but I did implement this on Teensy-LC.

There: now it's documented! ;)

C_DEBUGEN is cleared right before the KL02 (replacement for the Mini54) goes into shutdown mode. It does this when it sees the main chip enter a deep sleep mode. You need to remain in deep sleep for at least 1/4 second to have it reliably detected, and 1/2 second would be recommended. Once your code wakes up, you'll be running with C_DEBUGEN disabled.

I realize this isn't nearly as useful on Cortex-M0+. But if you want to at least see it in action, you can on Teensy-LC.

I'm planning to implement this on the next bootloader upgrade for Teensy 3.1. The mechanism will be the same: enter a deep sleep mode for at least 1/4 second. The former idea of a breakpoint-based command won't be used. You'll have to use the sleep mode.
 
SPECint tends to be too big for a Teensy, and it requires file I/O. Each of the 12 components in the SPEC 2006 INT benchmarks has different performance behaviors, and I find that often, making one benchmark faster slows down another (which we call whack-a-mole). There is a set of embedded benchmarks called EEMBC, but it has its own problems. It's been a few years since I last looked at EEMBC, but I believe the automotive sub-section uses a static variable for loop index variables, which kills performance on anything above 8-bit processors (i.e., processors like ARM where it is possible to keep things like loop indexes and pointers in registers). Finally, both SPEC and EEMBC are licensed benchmarks, and not available for free.
 