Best practice for programming a Teensy 3.0 and its 32-bit ARM


Experimentalist
Hi

My question really relates to efficient programming in the 32-bit environment with the Teensy 3.0 and its 32-bit ARM Cortex-M4.

Some background to put my question in context. I am a self-taught programmer and have written the majority of my code in various combinations of VB and C# for Windows PCs. I must admit to having given little thought to the inner workings of my programs, and have only recently been reading up about, for example, what's on the stack and what's on the heap in the managed-code world, where there are oodles of resources available.

I have been programming in the embedded world for just about a year, working with the Teensy 2.0 and Teensy++ 2.0, and have been on a steep learning curve. I have assumed that working with the 8-bit AVRs I should try to optimise my code around 8-bit variable types where possible, so using a byte instead of an int, for example.

So my question is: if I am now porting my Teensy 2.0 and Teensy++ 2.0 code to the Teensy 3.0, I presume I should now try to optimise for the 32-bit architecture? If I create an int on the Teensy 3.0, is it a 32-bit integer? Can anyone offer guidance regarding best practices for coding when targeting the Teensy 3.0?

I am guessing that my previous code deliberately using byte over int will now (if not always?!?) result in a performance hit rather than a gain???

Thanks for any guidance you may have to offer

Ex.
 
On Teensy 3.0, int is 32 bits.
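
If you want to double check on your own board, a quick throwaway sketch like this (just an illustration, the output formatting is arbitrary) prints the sizes to the serial monitor:

Code:
// Print the sizes of common integer types on the target board.
// On Teensy 3.0 (32-bit ARM Cortex-M4) int is 4 bytes;
// on Teensy 2.0 / Teensy++ 2.0 (8-bit AVR) int is only 2 bytes.
void setup() {
  Serial.begin(9600);
  while (!Serial) ;  // wait for the serial monitor
  Serial.print("sizeof(byte) = ");
  Serial.println((int) sizeof(byte));
  Serial.print("sizeof(int)  = ");
  Serial.println((int) sizeof(int));
  Serial.print("sizeof(long) = ");
  Serial.println((int) sizeof(long));
}

void loop() {
}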

As a general rule, the largest gains in performance typically come from selecting the best algorithms, or better scheduling of tasks when latency matters. In most projects, there's usually not infinite time available to optimize, so it's important to prioritize and focus your effort. For example, a binary search outperforms a linear search for any reasonably sized data set, but a linear search is trivial to implement (binary search is tricky if the data is linked lists or other unconventional storage rather than a linear array). So optimizing algorithms, like replacing a linear search that was done to simply get the project working, will give much better results than spending the same amount of time (or even much more time & effort) optimizing the variables within that linear search code.
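
As a concrete illustration of that trade-off (a sketch, not taken from any particular project), here are both searches over a plain sorted array; the binary search does roughly log2(N) comparisons instead of up to N:

Code:
#include <stdint.h>

// Linear search: trivial to write, checks up to N elements.
int linearSearch(const uint16_t *data, int n, uint16_t key) {
  for (int i = 0; i < n; i++) {
    if (data[i] == key) return i;
  }
  return -1;  // not found
}

// Binary search: needs sorted data, but only about log2(N) comparisons.
int binarySearch(const uint16_t *data, int n, uint16_t key) {
  int lo = 0, hi = n - 1;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    if (data[mid] == key) return mid;
    if (data[mid] < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;  // not found
}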

Using the peripherals to their fullest capability also tends to yield dramatically better results. They can be complex, especially when DMA is involved, and often there are special hardware limitations like only using certain pins. If you can work with those issues, the peripherals can lighten the load for a lot of the I/O you need, giving you more CPU time to do other stuff. I recall a phone conversation with someone who was determined to measure a motor's RPM using fast polling of a pin. He was willing to devote an incredible amount of time to optimizing that fast polling code, but didn't want to learn how to use the hardware timer input capture. Of course, the input capture can give you a 1 cycle accurate snapshot of the precise time the input pin changes (even when interrupts are disabled - far better than even the best assembly language code could do), and then your code can grab the number leisurely, as long as you do so before the next input change. The peripherals have a lot of amazing capability, so using them effectively often means you don't need to write fast code at all, or you have almost all the CPU time available to write fast code to do something else while you allow the hardware to do the high speed work for you.
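
To give a feel for the "let the hardware do the timing" idea, here is a rough sketch using only the standard attachInterrupt() and micros() calls; the pin number and the one-pulse-per-revolution assumption are just placeholders, and true FTM input capture (which latches the timer value in hardware) is more accurate still, but needs register-level setup beyond this example:

Code:
// Interrupt-driven period measurement: the CPU is free to do other work
// while edge timestamps are captured in the background.
const int inputPin = 3;              // assumed wiring, adjust for your setup
volatile uint32_t lastEdge = 0;
volatile uint32_t period = 0;        // microseconds between rising edges

void onRisingEdge() {
  uint32_t now = micros();
  period = now - lastEdge;
  lastEdge = now;
}

void setup() {
  Serial.begin(9600);
  pinMode(inputPin, INPUT);
  attachInterrupt(digitalPinToInterrupt(inputPin), onRisingEdge, RISING);
}

void loop() {
  // Read the shared variable with interrupts briefly disabled,
  // so we never see a half-updated value.
  noInterrupts();
  uint32_t p = period;
  interrupts();
  if (p > 0) {
    Serial.print("approx RPM: ");
    Serial.println(60000000UL / p);  // assumes one pulse per revolution
  }
  delay(250);
}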

But sometimes it is worthwhile to optimize lower-level details. It can also be a good learning experience, and maybe it's even fun sometimes?

Performance-wise, 8, 16 and 32 bit integers are all about the same on ARM. ARM has instructions to load and store all 3 types to and from memory, so they're all the same speed when the variable is global, static, or a local that's allocated in RAM. When allocated in registers, they're all handled as 32 bits, so again, all three are the same speed. When passed as function inputs or returned as the result, sometimes 1 extra instruction is needed to mask off the unused 16 or 24 bits, but that overhead is tiny compared to the function call & return itself.
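
A rough illustration of that point (the exact instructions depend on the compiler version and optimization level, so take the comments as typical gcc output rather than a guarantee):

Code:
#include <stdint.h>

// All three of these compile to essentially the same code on Cortex-M4:
// a load, an add, and a store, just with byte / halfword / word variants.
uint8_t  counter8;
uint16_t counter16;
uint32_t counter32;

void tick() {
  counter8++;    // ldrb / add / strb
  counter16++;   // ldrh / add / strh
  counter32++;   // ldr  / add / str
}

// Returning a narrow type from a function may cost one extra
// instruction (uxtb or uxth) to mask off the unused upper bits.
uint8_t sum8(uint8_t a, uint8_t b) {
  return a + b;  // add, then typically a uxtb to truncate to 8 bits
}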

So if you can accomplish a task with fewer lines of code using 3 variables that are 32 bits, versus more code using 5 variables of smaller size, definitely go with the simpler 3-variable approach. But if the code would be the same either way, it usually makes little or no difference whether you define the variables as 8, 16, or 32 bits.

ARM has 12 registers the compiler can use for almost any purpose, and 8 of those can be accessed by smaller instructions. Sometimes the compiler uses them for temporary results, so if you can write a function using 6 or fewer variables, usually the compiler will manage to keep all of them in registers.

ARM Cortex-M4 has hardware multiply and hardware divide instructions for both signed and unsigned integers. Also, certain optimizations, like converting a modulus by a power of 2 into a logical AND, only apply when the variables are unsigned. As a general rule, if you know a variable will never be negative, specifying unsigned is a good idea. Often it makes no difference, but sometimes it allows the compiler to do nice optimizations.
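
For example (again just a sketch, the exact generated code depends on compiler settings):

Code:
#include <stdint.h>

// With an unsigned variable, the compiler can turn "% 8" into a single
// AND instruction, because the result is always non-negative.
uint32_t wrapUnsigned(uint32_t index) {
  return index % 8;   // typically compiles to: and r0, r0, #7
}

// With a signed variable, C requires a negative remainder for negative
// inputs, so the compiler must emit extra instructions to handle that
// case, even if the value never actually goes negative in practice.
int32_t wrapSigned(int32_t index) {
  return index % 8;
}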

There are probably lots of other tiny optimizations, but they get into very specific situations, and to really have huge gains, you must look at the generated assembly as you try different approaches. It's a lot of work, so usually that's only done for very few functions where performance truly matters.
 
Paul

That response clearly took some time and I appreciate the effort. A lot of it went over my head, but it has inspired me to read more and sort that out. The main thing is I can get on with porting my code with a better understanding of where to spend my time.

Thanks
Ex.
 
On this page, https://www.pjrc.com/teensy/td_libs_AudioRoadmap.html, it says "The Teensy Audio Library is designed for 16 bit data, because the ARM Cortex-M4 processor on Teensy 3.1 has special "DSP instructions" and features that accelerate processing 16 bit signals. Adding a 17th bit incurs a heavy performance penalty." Does this refer strictly to audio block data? From your comments in this thread, working with private 32-bit integers (or small arrays of them) should not incur a heavy performance penalty, correct?
 