Assembly coding for Teensy3.1

Status
Not open for further replies.

maa@vims.edu

New member
Sometimes assembly coding is necessary for clean and precise control of the Teensy 3.1. Would you please introduce the basics that can be used for Teensy 3.1? Thanks and best.
 
The most common approach is to use inline assembler inside C and C++ files.

This has several advantages. Your assembly code can integrate with C or C++ code at the statement level, which makes mixing the 2 languages really powerful. Register substitution macros let the compiler pick which actual registers your code will use, so it can optimize register allocation across your assembly fragments and all the surrounding C code. Or you can implement entire functions (or interrupt routines) in assembly, while the complex parts of your project (eg, the USB stack), which could take months or years to fully develop in assembly, stay in C and C++. Or if you really want to do all assembly, you could implement every function in assembly, using C functions only as empty containers to hold your assembly code.
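
To make the statement-level mixing concrete, here is a minimal sketch (not code from Teensyduino). The helper names count_leading_zeros() and highest_set_bit() are made up for illustration, and the non-ARM branch is only an assumed fallback so the same file also builds on a desktop compiler.

```c
#include <stdint.h>

/* Hypothetical helper: number of leading zero bits in a 32-bit value. */
static inline uint32_t count_leading_zeros(uint32_t value)
{
#if defined(__arm__) && defined(__thumb2__)
    uint32_t result;
    /* "%0" and "%1" are placeholders; the compiler picks the registers. */
    __asm__("clz %0, %1" : "=r" (result) : "r" (value));
    return result;
#else
    /* Assumed portable fallback, so the sketch also builds off-target. */
    uint32_t count = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (value & mask) == 0; mask >>= 1) {
        count++;
    }
    return count;
#endif
}

/* Ordinary C using the assembly-backed helper like any other function. */
static inline uint32_t highest_set_bit(uint32_t value)
{
    return 31u - count_leading_zeros(value);   /* only meaningful for value != 0 */
}
```

Because the helper is a normal inline function, the compiler is free to keep "value" and "result" in whatever registers suit the surrounding C code.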

No special build setup is needed for inline assembler, which means you can use the same build process as a C or C++ project. You still name the files with .c or .cpp, even though they contain assembly code.

This page has the best info about how to use inline assembler.

http://www.ethernut.de/en/documents/arm-inline-asm.html

Just to give you a quick example, here's Teensyduino's delayMicroseconds() function, which I wrote in assembly. Well, partly in assembly...

Code:
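static inline void delayMicroseconds(uint32_t) __attribute__((always_inline, unused));
static inline void delayMicroseconds(uint32_t usec)
{
        // Convert microseconds to loop passes: 32, 16 or 8 passes per
        // microsecond, which is F_CPU / 3000000 for each supported clock.
#if F_CPU == 96000000
        uint32_t n = usec << 5;
#elif F_CPU == 48000000
        uint32_t n = usec << 4;
#elif F_CPU == 24000000
        uint32_t n = usec << 3;
#endif
        if (usec == 0) return;
        asm volatile(
                // "%=" expands to a number unique to this asm instance,
                // so the label stays unique every time the function is inlined.
                "L_%=_delayMicroseconds:"               "\n\t"
                "subs   %0, #1"                         "\n\t"  // n = n - 1, updating the flags
                "bne    L_%=_delayMicroseconds"         "\n"    // loop until n reaches zero
                : "+r" (n) :                                    // n is both read and written
        );
}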

Here you can see how nicely C and assembly mix. The compiler does the work of shifting the input and checking if it's zero. Then the assembly language does the timing-sensitive part that really must be those specific instructions. Notice the "%0" operand to the SUBS instruction? That's a register substitution macro, where "%0" will be replaced by whatever register the compiler's optimizer decided was best to use for the "n" local variable. Correctly specifying the inputs and outputs and using the macros instead of raw registers does take some extra work, but it's really worthwhile if you mix the 2 languages. In this case, the function will be placed inline with other code (which eliminates the slow branch+return) and the 1 register it uses will be integrated into the compiler's register allocation and optimization scheme for the surrounding code. While that assembly code runs, the other registers will likely hold values used by the C code (or other asm code), but the compiler will always allocate 1 register for this assembly chunk. That's the real beauty of using inline asm. You get true integration and register allocation optimization with the C language, and nice ANSI C API syntax with input & output type checking... all the advantages of C with the optimization and direct hardware control of assembly.
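
The arithmetic behind those shifts can be sketched as follows. This is only an illustration, not Teensyduino code: the names delay_loop_passes() and spin_passes() are made up, the figure of 3 CPU cycles per loop pass is an assumption inferred from the shifts by 5, 4 and 3 in the function above, and the "cc" clobber is an extra I added because GCC's documentation asks for it when an asm statement changes the flags.

```c
#include <stdint.h>

/* Assumption: one loop pass (SUBS plus a taken BNE) costs about 3 cycles. */
#define ASSUMED_CYCLES_PER_PASS 3u

/* Hypothetical helper: loop passes needed to wait `usec` microseconds. */
static inline uint32_t delay_loop_passes(uint32_t usec, uint32_t cpu_hz)
{
    uint32_t passes_per_usec = cpu_hz / (1000000u * ASSUMED_CYCLES_PER_PASS);
    return usec * passes_per_usec;
}

/* Hypothetical helper: count `passes` down to zero. */
static inline void spin_passes(uint32_t passes)
{
    if (passes == 0) return;
#if defined(__arm__) && defined(__thumb2__)
    __asm__ volatile(
        "L_%=_spin:"            "\n\t"
        "subs   %0, #1"         "\n\t"
        "bne    L_%=_spin"      "\n"
        : "+r" (passes)         /* "+r": read and written, any register */
        :                       /* no separate inputs */
        : "cc"                  /* SUBS changes the condition flags */
    );
#else
    /* Assumed fallback for non-ARM builds; it is not cycle accurate. */
    volatile uint32_t remaining = passes;
    while (remaining != 0) remaining--;
#endif
}
```

At 96 MHz this gives 32 passes per microsecond, the same as "usec << 5"; at 48 MHz and 24 MHz it gives 16 and 8.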

Of course, you can also just hard-code your register choices, especially if you write just a single "asm" statement for the entire function. The ARM architecture ABI defines which registers carry the input parameters (r0 to r3 as I recall) and which holds the output, and which registers a function can clobber vs which the caller expects to be saved and restored using the stack. I personally prefer to use the substitution macros and let the compiler worry about that stuff, but if you're looking to take an "all in assembly" approach, perhaps you'd prefer to just pick your registers, and give an empty operand list with the assumption the inputs will arrive in the registers according to the ARM ABI.
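
One way the hard-coded register approach could look is sketched below. It relies on GCC's "naked" function attribute, which is my assumption and not something the Teensy core is known to use here, so that the compiler adds no entry or exit code around the asm. The function name add_scaled() is made up, and the plain C branch exists only so the file builds on a non-ARM compiler.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__thumb2__)
/* Per the ARM procedure call standard, `a` arrives in r0 and `b` in r1,
   and the result is returned in r0. The asm has no operand list at all. */
__attribute__((naked)) uint32_t add_scaled(uint32_t a, uint32_t b)
{
    __asm__ volatile(
        "add    r0, r0, r1, lsl #2"     "\n\t"  /* r0 = a + (b << 2) */
        "bx     lr"                     "\n"    /* return to the caller */
    );
}
#else
/* Assumed fallback with the same behavior for non-ARM builds. */
uint32_t add_scaled(uint32_t a, uint32_t b)
{
    return a + (b << 2);
}
#endif
```

With this style every register choice is yours, so anything beyond r0 to r3 and r12 has to be saved and restored by hand.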

If you're going to program in assembly, you'll need this ARM reference manual. It documents all the instructions and pretty much every other processor detail needed for assembly language programming.

http://www.pjrc.com/teensy/beta/DDI0403D_arm_architecture_v7m_reference_manual.pdf

The other source for this info, that's more approachable with quite a bit of explanation, is this book. It's a bit spendy, but it's by far the very best reference on the ARM microcontroller.

http://www.amazon.com/Definitive-Cortex®-M3-Cortex®-M4-Processors-Edition/dp/0124080820
 
Another approach I've used in the audio library is to write inline functions for single instructions, and then C++ code which uses them to sort-of write assembly.

For example, in utility/dspinst.h, there are inline functions for instructions.

Code:
// computes (sum + ((a[31:0] * b[15:0]) >> 16))
static inline int32_t signed_multiply_accumulate_32x16b(int32_t sum, int32_t a, uint32_t b) __attribute__((always_inline, unused));
static inline int32_t signed_multiply_accumulate_32x16b(int32_t sum, int32_t a, uint32_t b)
{
        int32_t out;
        asm volatile("smlawb %0, %2, %3, %1" : "=r" (out) : "r" (sum), "r" (a), "r" (b));
        return out;
}

// computes (sum + ((a[31:0] * b[31:16]) >> 16))
static inline int32_t signed_multiply_accumulate_32x16t(int32_t sum, int32_t a, uint32_t b) __attribute__((always_inline, unused));
static inline int32_t signed_multiply_accumulate_32x16t(int32_t sum, int32_t a, uint32_t b)
{
        int32_t out;
        asm volatile("smlawt %0, %2, %3, %1" : "=r" (out) : "r" (sum), "r" (a), "r" (b));
        return out;
}
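
For checking results off-target, the arithmetic in those two comments can be modeled in plain C. This is only a sketch: the names model_smlawb(), model_smlawt() and sign_extend_16() are made up, and the model assumes that ">>" on a negative 64-bit value is an arithmetic shift, as it is with GCC and Clang.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static inline int32_t sign_extend_16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model of SMLAWB: sum + ((a[31:0] * b[15:0]) >> 16). */
static inline int32_t model_smlawb(int32_t sum, int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * sign_extend_16(b);
    /* Unsigned addition, so the sum wraps around instead of overflowing. */
    return (int32_t)((uint32_t)sum + (uint32_t)(product >> 16));
}

/* Hypothetical model of SMLAWT: sum + ((a[31:0] * b[31:16]) >> 16). */
static inline int32_t model_smlawt(int32_t sum, int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * sign_extend_16(b >> 16);
    return (int32_t)((uint32_t)sum + (uint32_t)(product >> 16));
}
```

The only difference between the two is which 16 bit half of "b" is multiplied.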

Then in the C++ code, these "instructions" are used, like in this filter object:

Code:
do {
        a0 = *state++;
        a1 = *state++;
        a2 = *state++;
        b1 = *state++;
        b2 = *state++;
        aprev = *state++;
        bprev = *state++;
        sum = *state & 0x3FFF;
        data = end - AUDIO_BLOCK_SAMPLES/2;
        do {
                in2 = *data;    // two 16 bit samples packed in one 32 bit word
                sum = signed_multiply_accumulate_32x16b(sum, a0, in2);
                sum = signed_multiply_accumulate_32x16t(sum, a1, aprev);
                sum = signed_multiply_accumulate_32x16b(sum, a2, aprev);
                sum = signed_multiply_accumulate_32x16t(sum, b1, bprev);
                sum = signed_multiply_accumulate_32x16b(sum, b2, bprev);
                out2 = (uint32_t)sum >> 14;
                sum &= 0x3FFF;
                sum = signed_multiply_accumulate_32x16t(sum, a0, in2);
                sum = signed_multiply_accumulate_32x16b(sum, a1, in2);
                sum = signed_multiply_accumulate_32x16t(sum, a2, aprev);
                sum = signed_multiply_accumulate_32x16b(sum, b1, out2);
                sum = signed_multiply_accumulate_32x16t(sum, b2, bprev);
                aprev = in2;
                bprev = pack_16x16(sum >> 14, out2);
                // (excerpt; the rest of both loops is not shown)

I suppose that's technically not assembly language. But it pretty much selects the exact instruction that will be compiled for every line of code. It also follows a non-C semantic those instructions use: packing two 16 bit integers into a 32 bit register.
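
The packing idea can be sketched in plain C like this. The names are made up, and the halfword order (first argument in the upper 16 bits) is an assumption inferred from how the loop above reads aprev and bprev with the "t" and "b" instructions; it is not copied from dspinst.h.

```c
#include <stdint.h>

/* Hypothetical model of packing: `high` fills bits 31:16, `low` bits 15:0. */
static inline uint32_t model_pack_16x16(int32_t high, int32_t low)
{
    return ((uint32_t)high << 16) | ((uint32_t)low & 0xFFFFu);
}

/* Sign-extend a 16 bit field to a 32-bit signed value. */
static inline int32_t halfword_to_int32(uint32_t field)
{
    int32_t v = (int32_t)(field & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* The half a "t" (top) instruction reads. */
static inline int32_t unpack_top(uint32_t packed)
{
    return halfword_to_int32(packed >> 16);
}

/* The half a "b" (bottom) instruction reads. */
static inline int32_t unpack_bottom(uint32_t packed)
{
    return halfword_to_int32(packed);
}
```

Keeping two samples in one register this way lets a single 32 bit load or store move both of them at once.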
 