NXP K66 vs. STM32L4

Status
Not open for further replies.

onehorse

I finally received a Teensy 3.6 to beta test (thanks Paul and Robin!), was able to quickly configure the Arduino IDE to load programs and started playing around with it. It's beautifully engineered, small, peripheral rich and easy to use. I joined the Kickstarter and bought a few more.

The Teensy 3.1 and 3.2 I started with are great too; I have used them extensively for a lot of my own and my customers' projects. But I have always been interested in the STM32 line of MCUs and struggled with mbed, Keil, etc. to make use of them. Why bother when we have the Teensy, right? Well, the STM32F4 offered a single-precision FPU that I wanted to make use of. When the low-power STM32L4 came out, it proved irresistible to me. Thomas Roell and I designed an STM32L476 MCU development board with a Teensy 3.1 footprint, and Thomas wrote an Arduino core for it so we could have our cake and eat it too. That is, we have access to an 80 MHz MCU with an FPU, programmable via USB using the Arduino IDE just like the Teensy. Heaven!

I didn't know at the time we did this that Paul was working on what would become the Teensy 3.6. Now with two small MCUs both with an FPU it is appropriate to ask, how do they compare? I intend to start answering this question in this thread.

First up is power usage, where we expect the STM32L4 to do very well; it is specifically designed as a low-power MCU.

I loaded a simple blink program,

Code:
/* LED Blink, Teensyduino Tutorial #1
   http://www.pjrc.com/teensy/tutorial.html
 
   This example code is in the public domain.
*/
#define myLed1 13 // Teensy

void setup() {
//  Serial.begin(38400);
  pinMode(myLed1, OUTPUT);
  digitalWrite(myLed1, LOW);

  digitalWrite(myLed1, HIGH); // Test function
  delay(5000);
  digitalWrite(myLed1, LOW);
}

void loop() {
  digitalWrite(myLed1, !digitalRead(myLed1)); // toggle the LED
  delay(5000);
  digitalWrite(myLed1, !digitalRead(myLed1)); // toggle it back
  delay(5000);
}

onto both the K66 and L4 and measured the current drawn with the LED both on and off while changing the CPU clock speed in the IDE. This is a very simple test of the native power usage when there is very low demand on the MCU resources, although all MCU peripherals that are active by default contribute to the power usage. Here are the results:

K66vsL4PowerTest.png

The difference between on and off in each case is simply a matter of which current-limiting resistor and LED color was chosen; the useful comparison is with the LED off. Here you can see that the K66 is using ~350 µA per MHz of CPU speed while the L4 is using ~39 µA per MHz, one-ninth the slope (the actual power usage difference is less), to perform the same task(s). Of course, the tasks aren't exactly the same and that is partly the point: both MCUs are busy doing a lot of things not needed for this particular application. But the L4 has been designed by ST to minimize the number of these tasks and the power each draws, and it allows the user (as does the K66 to a degree) to configure the MCU to lower power even further.

So what? If you want to use an MCU to control a megawatt string of LEDs, you won't care about the power drawn by the MCU! But if you have portable, remote and/or wearable applications running from a small LiPo battery (like almost all of my applications), then MCU power usage is critical. A typical small application uses a 110 mAh LiPo battery for the smallest form factor. This will last ~20 hours, or 2.5 eight-hour days, when using the L4 MCU running at 72 MHz continuously (perhaps less when we add sensors, etc.). The K66 will last ~3 hours. For some applications, recharging often is no big deal; for others it makes all the difference. This is why low power can matter.
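
The runtime figures above come down to dividing battery capacity by average current. Here is a minimal sketch of that arithmetic, assuming an ideal battery (no derating, no regulator losses) and a constant draw. The function name `batteryHours` is made up for illustration, and the 5.5 mA used in the note below is simply what a 20 hour runtime on 110 mAh implies, not a separate measurement.

```cpp
// Idealized runtime estimate: hours = capacity (mAh) / average current (mA).
// Assumes a constant draw and ignores battery derating and regulator losses.
constexpr double batteryHours(double capacity_mAh, double current_mA) {
  return capacity_mAh / current_mA;
}
```

With these assumptions, `batteryHours(110.0, 5.5)` comes to 20 hours, and a draw of around 36 mA comes to roughly 3 hours, which is in line with the estimates above.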

Next up is Madgwick sensor fusion filter rates, an excellent test of FPU performance.

Here are the results:

MadgwickK66L4Test.png

I ran very similar sketches (they can't be identical) using an interrupt-based data-ready scheme with an MPU9250+MS5637 breakout board connected to pins 16/17 for I2C, 3V3 and GND, and pin 9 for the interrupt, as I usually do on the Teensy 3.2. For the STM32 I used an identical setup. The I2C bus was run at 400 kHz for both and, apart from a few extra serial outputs for the L4, the sketches were functionally identical (I will claim).

The results compare well and I would say that both MCUs perform similarly. It is nice to see that the fusion rate is linear with zero intercept for the K66, as I would expect (the rate being totally a function of raw processing speed), and that it gets up to the 130 kHz level, which is certainly overkill, although I note the orientation solution was very stable at that rate. It is odd that the L4 fusion rate is less linear and that the best fit doesn't go through the origin. I will keep looking at this, but it might have to do with the extra gravity/linear acceleration calculations and subsequent serial output that don't appear in the Teensy sketch.

Overall I would say both of these FPUs perform well, and it is for computations like these that the 180 MHz of the K66 shines.

Lastly, how does the power usage scale with a more or less real world computational task? I have plotted the current measured vs. the Madgwick sensor fusion rate for both the K66 and L4 at different CPU speeds:

K66L4PwrPerf.png

Now that the task is mostly FPU calculations rather than waiting in a delay, the power usages compare more closely. But note there is still about a factor of two difference in current usage for the same sensor fusion rate. I judge that about 7 mA is drawn to power just the MPU9250, so by reducing the accel/gyro rate from 200 Hz to 100 Hz and dropping the mag rate from 100 Hz to 8 Hz one can save some power. But power minimization isn't the point of this series of tests. Still, is it possible to lower the power usage on the K66 further? I don't know what kind of optimizations have been attempted already, but I can say we have worked pretty hard to get the power usage down on the L4 in software beyond what the ST designers have already done in silicon.
 
It does not make sense to compare power usage with the blink example. In a real-world application where power saving is essential, you'd never use a delay(5000) (a simple loop, internally) where the CPU jumps and jumps and jumps instead of going to a power-saving state with fewer MHz (or even ZERO MHz). I guess that at low speed, or stopped, the difference is much, much less.

Btw, the second digitalWrite and delay in the posted code are superfluous (as are most lines in setup, too).
 
Delay on the Teensy uses a simple busy loop, while the STM32L4 implementation puts the CPU to sleep, right?

Have you measured the power consumption for the FPU code?
 
Did you account for what peripherals were running in this test, i.e., the clocks to those peripherals? In other words, the startup code for the Teensy turns on some clocks not used in this test, and I have no idea what clocks your board has running from startup.

Edit - I see you mention that about the peripherals used, but the fact remains that you could change the startup code to have a similar set of peripherals running or, better yet, not running. I'm not saying the K66 would be as power efficient, but it would give a better real-world comparison of run mode power usage between the two.
 
The whole point of these simple tests is to measure the power usage and performance in the default condition. I am not trying to minimize power here, since in this case there are all kinds of methods available for both devices including using lower CPU speeds, using low power modes of the MCU, sleep/perform duty cycling, etc. The question I am asking is, if you measure the devices under similar conditions as received, how do they compare?

For those of you for whom this is not a relevant question, please run your own tests.

Edit: I don't know enough to know whether I used a "fair" comparison of peripherals. I did take pains to use as close to identical sketches as I could in all cases; this means I did not ask for an SPI port in one and not the other, for example. It is always possible that someone else who performs such a comparison might see different results, but I wouldn't expect them to be that different.
 
Interesting results within their limits. I guess the only 'fair' way to do this would be to gather sufficient data to determine the unit energy cost per peripheral operation, per mathematical operation and per second, and then build a model that took example use cases and produced a likely consumption. Fun to think of ways to do this, but the time needed would be substantial. Off topic, but do any OEMs offer tools to convert code into CPU power consumption?
 
There are standard benchmarks for performance (DMIPS, COREMARK) and power usage (microA per MHz) each chip manufacturer touts for their products. What matters to me as a dummy Arduino user is that the same sensor management and sensor data fusion tasks can be achieved at the same rates but with half the power. I'll let the smart people figure out why.
 
Looks like I oughta use idle mode in delay(). Tried that on AVR years ago, but ran into hardware bugs with the ADC. Of course that's only Atmel AVR.
 
The STM32L4 uses wait-for-event (WFE) in delay which means, essentially, that the CPU enters a very low power mode automatically unless something is happening that warrants its attention. I don't know if something like this can be done on the K66. There are a lot of hardware architecture features that allow very low power usage with the STM32L4. We haven't enabled the very low power modes in earnest yet. The above results are for the full-on, normal power setting, where the only variable is CPU speed in the one (normal) power mode.
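
To illustrate the idea of a delay that idles the core instead of spinning, here is a minimal sketch. It is not the actual STM32L4 core code; the name `sleepyDelay` and its two callback parameters are made up for illustration. On a Cortex-M board, `now` could be `millis` and `idle` could be a small function that executes WFI or WFE, assuming a periodic tick interrupt wakes the core so the loop can re-check the time.

```cpp
#include <stdint.h>

// Waits until `ms` milliseconds have elapsed according to now(), calling
// idle() on each pass instead of busy-spinning. Returns the number of times
// idle() was called. Both callbacks are supplied by the caller (assumption:
// on hardware, idle() puts the core to sleep until the next interrupt/event).
template <typename NowFn, typename IdleFn>
uint32_t sleepyDelay(uint32_t ms, NowFn now, IdleFn idle) {
  const uint32_t start = now();
  uint32_t wakeups = 0;
  // Unsigned subtraction keeps the comparison correct across counter rollover.
  while (static_cast<uint32_t>(now() - start) < ms) {
    idle();
    ++wakeups;
  }
  return wakeups;
}
```

Passing the clock and idle action in as parameters keeps the sketch independent of any particular core, so the same loop can be exercised on a desktop with a fake clock.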

Are there low-power features built into the K66 architecture that could be brought to bear to reduce power usage? Even though 180 MHz is a fantastic capability, 80 mA is a lot of power to expend!
 
If you use idle mode in delay(), make sure you configure it for wait mode sleep (this would probably be the equivalent of the idle mode you are talking about) so any interrupt will wake it when using either wfe or wfi. On the T3.2 it saves about 3 mA compared with run mode.
 
The simple test with delay(n) seems to be rather relevant. It simply gives a starting point that shows the power consumption when really nothing is going on; any more work you put onto the CPU would add more current draw from there, and the same goes for adding more peripherals. From the datasheets, the CPU contribution at 80 MHz should be around 10 mA/3 mA (RUN/SLEEP, all peripherals off) for the STM32L476 and 35 mA/24 mA for the K66 (numbers adjusted to 80 MHz from the 120 MHz listed, all peripherals off). If you compare Kris's graphs, that is reflected rather well: the STM32L4 matches SLEEP mode (around 5 mA), and the K66 matches RUN mode at around 38 mA. So I'd expect the K66 to come in at about 26 mA if one used WFE/WFI type techniques. Perhaps @manitou could quickly test that (as I have no access to a K66).
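
One plausible way to arrive at a figure like the ~26 mA above is to scale the measured RUN current by the datasheet SLEEP/RUN ratio. The sketch below shows that, together with the linear rescaling of a datasheet number from one clock speed to another. Both helper names are made up for illustration, and both the proportional-to-clock scaling and the carried-over ratio are assumptions, not datasheet guarantees.

```cpp
// Rescale a datasheet current quoted at one clock to another clock,
// assuming current is proportional to frequency (a rough approximation).
constexpr double scaleToClock(double current_mA, double from_MHz, double to_MHz) {
  return current_mA * to_MHz / from_MHz;
}

// Estimate a board's SLEEP current from its measured RUN current, assuming
// the datasheet SLEEP/RUN ratio carries over to the real board.
constexpr double estimateSleep(double measuredRun_mA, double sheetRun_mA,
                               double sheetSleep_mA) {
  return measuredRun_mA * sheetSleep_mA / sheetRun_mA;
}
```

With the numbers quoted above, `estimateSleep(38.0, 35.0, 24.0)` comes out at about 26 mA.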

Then there is the notion of "yes, but real applications don't just sit there and wait for 5 seconds". Well, many if not most of them do, actually. You wait for an I2C transmission to be done, or for sensor data to be read, or for a chunk of data to be written to the SD card, and so on. So there is a lot of waiting and a lot of transmitting data. All of this can be power optimized by putting the CPU to SLEEP rather than waiting around in RUN mode. So actually this "delay(5)" is a pretty good indicator for a typical application that does not require a lot of CPU horsepower for computation only.

The other graph, with Madgwick, is what I find utterly surprising. The STM32L4 is about low power. It has a painfully slow FLASH interface, and in general is designed to be power-efficient rather than fast (except USB-OTG, which is a power hog and is slow). The K66, on the other hand, is designed to be fast and not really that power efficient. I would have expected the perf to be rather identical, unless either I/D cache misses put the K66 at a disadvantage, or I2C communication is more in the way than for the STM32L4. In fact, given that the STM32L4 seems to be biased more towards power consumption, I would not have been surprised if the STM32L4 had been slower there.

Power consumption for the Madgwick test would be interesting, too. For the STM32L4 I'd predict about +7 mA for the CPU being busy, +2 mA for the internal bus interfaces, and +4 mA for USB (if enabled), plus 3 mA for the MPU9250. So probably just a tad below 21 mA. No idea where the K66 would end up.

In general the claims in datasheets are very often wishful thinking. They represent a minimum power consumption under ideal circumstances, where the FLASH is turned off, and SYSTICK simply does not exist and so on. So it was really cool from Kris to put real world measurements into graphs that even I can read ;-)
 
We might want to look at using the wfe instruction for delays. If the DMA controller outputs a signal pulse which will wake the processor, then the wfi overhead can be avoided. I know this is a corner case, but it could be an issue if someone is doing a DMA memory copy and using the delay to wait for it to be done: instead of waiting for the DMA to issue an interrupt to wake the processor, the event pulse will wake it. Though I don't know offhand whether the DMA on the Kinetis processors has this event signal pulse.
 
earlier discussion of WFI on teensy 3
https://forum.pjrc.com/threads/28053-power-savings-with-wfi-in-yield()-for-teensy-3-LC?highlight=wfi
(i'm in montana for a while and not near my K66, so i can't test delay+WFI)

Here are power usage figures (mA) for the K66 with delay(5000) in loop(), with and without asm("wfi") in yield.cpp
Code:
K66 clock   default (mA)   WFI (mA)
 48 MHz         26.6         18.5
 72 MHz         31.9         19.7
 96 MHz         39.9         23.7
120 MHz         50.9         30.6
180 MHz         76.1         40.2
Note: probably don't want WFI in yield.cpp since yield() is called in main.cpp while/loop(), put WFI in delay()
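
To put numbers on the table above, here is a small sketch that computes the percentage saved by WFI at each clock and a least-squares slope (mA per MHz) for each column. The helper names `savingPercent` and `fitSlope` are made up for illustration, and the arrays are copied by hand from the table.

```cpp
#include <stddef.h>

// Percentage of current saved by the WFI build relative to the default build.
constexpr double savingPercent(double default_mA, double wfi_mA) {
  return 100.0 * (default_mA - wfi_mA) / default_mA;
}

// Ordinary least-squares slope of y against x (here: mA per MHz).
double fitSlope(const double* x, const double* y, size_t n) {
  double meanX = 0.0, meanY = 0.0;
  for (size_t i = 0; i < n; ++i) {
    meanX += x[i];
    meanY += y[i];
  }
  meanX /= n;
  meanY /= n;
  double sxy = 0.0, sxx = 0.0;
  for (size_t i = 0; i < n; ++i) {
    sxy += (x[i] - meanX) * (y[i] - meanY);
    sxx += (x[i] - meanX) * (x[i] - meanX);
  }
  return sxy / sxx;
}

// Values copied from the table above.
const double kClockMHz[] = {48, 72, 96, 120, 180};
const double kDefault_mA[] = {26.6, 31.9, 39.9, 50.9, 76.1};
const double kWfi_mA[] = {18.5, 19.7, 23.7, 30.6, 40.2};
```

On these five points the saving grows from roughly 30% at 48 MHz to roughly 47% at 180 MHz, and the fitted slopes are roughly 0.39 mA/MHz for the default build and 0.18 mA/MHz with WFI.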

There is a snooze library for teensy 3* with various power reduction options
https://github.com/duff2013/Snooze
it has been extended to support K66
 
At the bottom of this file: https://github.com/manitou48/DUEZoo/blob/master/perf.txt
are some coremarkish numbers for various MCU's including dragonfly and Teensy
coremark.png
coremarka.png

For reference, another coremark-power plot for mbed LPC1768 or star otto-like mbed F469NI
 
Interestingly, the performance vs. power curves for Ladybug and Butterfly are even better since neither has a 16 MHz crystal and this consumes ~1 mA of power. In other words, I would expect the coremark numbers to be the same but the current usage to drop by 1 or 2 mA for these STM32L4 development boards.
 
Did the STM32x lines ever get a good Arduino core? I have some F4s and F7s I would like to get off ground zero on. I found them far harder to get started with than Arduino/Teensy.
 
I can use Teensies to get something useful done, not so much with other products. Ease of use and all the code Paul and everyone else have worked on are a huge plus. I agree this thread has probably gone too far off course for a Teensy forum.
 
Robin & I generally avoid censorship, except spam. Promoting other Teensy-like products is a gray area. So are Kickstarter campaigns. Our litmus test for when the line has been crossed is whether multiple people are starting to complain.

Kris, you've done a lot of great work with many add-on boards. I'm one of your 35 backers for this project, because I want to help support your efforts... which quite frankly I'd rather see directed towards peripheral boards, but I can certainly understand the appeal of trying to launch a dev board platform!

But the point where people start objecting and calling your messages spam can't be ignored. I know you're excited about this project, and I certainly understand the intense pressure of running a Kickstarter campaign. But you really must understand this forum is for the benefit of everyone who uses it. When people start calling your messages spam is the point you need to tone it down.
 
Tis a gray area. Many years ago I found out about teensy from Paul's posts on maple forum and arduino forums....:D
 