Fixed-point known size matrix multiplication on teensy 3.2, 3.5

leafario · Feb 2, 2018

Hello,
I have a project where I want to drive 8 outputs. Each output should be driven by a weighted sum of some inputs and some feedback calculated in the system. The weights are stored in a matrix which therefore has 8 rows, and say 12 columns. The input vector is all the inputs, the output vector are 8 values to drive the outputs. I need this to be as fast as possible. It is likely that a large number of the weights are simply 0.

What is my best option for this? I am using fixed-point math and numeric accuracy is not that important. Should I use a library (https://github.com/apmorton/teensy-template/blob/master/teensy3/arm_math.h for example), or roll (hard-code) my own implementation? Which will be the fastest, or what is the usual approach for this common problem?

el_supremo · Feb 2, 2018

My approach would be to use the arm functions first. They were presumably written by folks who know the instruction set really well and should be able to produce fast code. If it isn't fast enough for you, either you would have to dig into their code to try to improve it or get a faster processor.
I don't think the zero weights will provide a way to speed up the operation because the processor has a single-cycle multiply. Testing for zero to avoid a multiply will just slow things down.

Pete

PaulStoffregen · Feb 2, 2018

If 16 bits is enough, and if you can arrange for the data to be packed with two 16 bit number in each 32 bit word, the Cortex-M4 DSP extension instructions are the fastest possible way. There is a SIMD instruction which does two signed 16x16->32 multiplies and adds both signed 32 bit products to a 64 bit accumulator, in just 1 clock cycle. The LD and ST instructions also have a speedup where the first takes 2 cycles, then each successive memory access is single cycle. This can let you bring in 4 or 8 words (8 or 16 packed 16 bit numbers) in only 5 or 9 cycles. Then you can get 2 multiply-accumulates per cycle. If you are able to arrange everything perfectly in memory (quite a trick...) and structure your code to make use of these optimizations within the limited space of the ARM register set, then you can get those 96 multiply-accumulates done in approx 100 cycles, or just under 1 microsecond on Teensy 3.5. Add a little more time to manipulate and store the 12 results, but you asked for "as fast as possible", and that is the way to do it.

leafario · Feb 2, 2018

PaulStoffregen said:
If 16 bits is enough, and if you can arrange for the data to be packed with two 16 bit number in each 32 bit word, the Cortex-M4 DSP extension instructions are the fastest possible way. There is a SIMD instruction which does two signed 16x16->32 multiplies and adds both signed 32 bit products to a 64 bit accumulator, in just 1 clock cycle. The LD and ST instructions also have a speedup where the first takes 2 cycles, then each successive memory access is single cycle. This can let you bring in 4 or 8 words (8 or 16 packed 16 bit numbers) in only 5 or 9 cycles. Then you can get 2 multiply-accumulates per cycle. If you are able to arrange everything perfectly in memory (quite a trick...) and structure your code to make use of these optimizations within the limited space of the ARM register set, then you can get those 96 multiply-accumulates done in approx 100 cycles, or just under 1 microsecond on Teensy 3.5. Add a little more time to manipulate and store the 12 results, but you asked for "as fast as possible", and that is the way to do it.

I have always waited for a good reason to use assembler or at least intrinsics! This happens once every loop in a flight controller, so it does have to be fast and I am willing to go through the learning experience.

So if I understand this correctly: I can load 8 16-bit signed numbers (int16_t has no padding in an array, right?) to registers in the first load, and then 16 in successive loads. In each cycle, I can multiply 2 of them and store their sum in a double word using the

Code:

SMLAL**

instruction class.

I still need to think about this: the weights in the matrix are logically between -1.0 and 1.0. I will have to map this interval to the 16-bit signed range, but that means that I might have to divide the result, which seems expensive. I think I might get away with shifting? The output range is only 1000 units in the end.

Thank you very much for your answer and insight!

PaulStoffregen · Feb 3, 2018

The audio library has lots of this type of code, so that might be a place to start looking. The DSP extension stuff is done as inline functions, defined in utility/dspinst.h. These are the 2 you want.

// computes sum += ((a[15:0] * b[15:0]) + (a[31:16] * b[31:16]))
int64_t multiply_accumulate_16tx16t_add_16bx16b(int64_t sum, uint32_t a, uint32_t b)

// computes sum += ((a[15:0] * b[31:16]) + (a[31:16] * b[15:0]))
int64_t multiply_accumulate_16tx16b_add_16bx16t(int64_t sum, uint32_t a, uint32_t b)

One of the simplest audio features using these and the optimized memory read is the RMS analysis. Each loop it reads 8 samples (as 4 uint32_t in 5 cycles) and uses these instructions to quickly square and sum all 8 of them in only 4 cycles.

https://github.com/PaulStoffregen/Audio/blob/master/analyze_rms.cpp

The good news is you don't need to go to the minutia of full assembly coding. You can get excellent results with C++ code using those inline / intrinsic functions and carefully choosing the memory read & write to leverage the burst access speedup. The compiler can do the tedious asm work.

Having written quite a bit of this code, I can tell you the process involves a lot of recompiling and then reading the ASM code the compiler generated. In the temp folder where Arduino compiles your code, you can find a .lst file which has a full disassembly of the generated code. The main thing you're watching is whether the compiler kept everything in the available registers. As soon as the compiler starts spilling to the stack, you take a huge performance hit that destroys all the optimization. It's all about coaxing the compiler's register allocator to use as many of the registers as possible for your actual data. Usually the process involves pushing right up against that limit, so you can minimize looping & other overhead.

For your output result, obviously the most efficient way is to arrange for your input data to be scaled such that a simple shift operation gives you the desired output. But I wouldn't waste much time trying to do that if it's not simple. If you can get the output to fit into only 32 bits, which usually can be done by scaling the coefficients, a 32x32->64 multiply is single cycle. If you choose your multiplier carefully, you might be able to arrange for the final shift to be 32 bits, which is free by just discarding the low half of the 64 bit result. The DSP extensions also have very handy 32x16->48 multiplies that also right shift by 16, giving you the top 32 bits, all in 1 cycle. Those can be really handy if you're waging an optimization battle against the compiler's register allocator, because they don't needlessly consume one of the precious registers for the low 16 bits you're just going to throw away.

leafario · Feb 3, 2018

PaulStoffregen said:
The audio library has lots of this type of code, so that might be a place to start looking. The DSP extension stuff is done as inline functions, defined in utility/dspinst.h. These are the 2 you want.

// computes sum += ((a[15:0] * b[15:0]) + (a[31:16] * b[31:16]))
int64_t multiply_accumulate_16tx16t_add_16bx16b(int64_t sum, uint32_t a, uint32_t b)

// computes sum += ((a[15:0] * b[31:16]) + (a[31:16] * b[15:0]))
int64_t multiply_accumulate_16tx16b_add_16bx16t(int64_t sum, uint32_t a, uint32_t b)

One of the simplest audio features using these and the optimized memory read is the RMS analysis. Each loop it reads 8 samples (as 4 uint32_t in 5 cycles) and uses these instructions to quickly square and sum all 8 of them in only 4 cycles.

https://github.com/PaulStoffregen/Audio/blob/master/analyze_rms.cpp

The good news is you don't need to go to the minutia of full assembly coding. You can get excellent results with C++ code using those inline / intrinsic functions and carefully choosing the memory read & write to leverage the burst access speedup. The compiler can do the tedious asm work.

Having written quite a bit of this code, I can tell you the process involves a lot of recompiling and then reading the ASM code the compiler generated. In the temp folder where Arduino compiles your code, you can find a .lst file which has a full disassembly of the generated code. The main thing you're watching is whether the compiler kept everything in the available registers. As soon as the compiler starts spilling to the stack, you take a huge performance hit that destroys all the optimization. It's all about coaxing the compiler's register allocator to use as many of the registers as possible for your actual data. Usually the process involves pushing right up against that limit, so you can minimize looping & other overhead.

For your output result, obviously the most efficient way is to arrange for your input data to be scaled such that a simple shift operation gives you the desired output. But I wouldn't waste much time trying to do that if it's not simple. If you can get the output to fit into only 32 bits, which usually can be done by scaling the coefficients, a 32x32->64 multiply is single cycle. If you choose your multiplier carefully, you might be able to arrange for the final shift to be 32 bits, which is free by just discarding the low half of the 64 bit result. The DSP extensions also have very handy 32x16->48 multiplies that also right shift by 16, giving you the top 32 bits, all in 1 cycle. Those can be really handy if you're waging an optimization battle against the compiler's register allocator, because they don't needlessly consume one of the precious registers for the low 16 bits you're just going to throw away.

Valuable insight. I will definitely take a look at the code in the audio lib before I attempt this. Thank you - such a high level of support!

leafario · Feb 5, 2018

A little update on the situation with some good news.

A naive version using nested for loops takes ~12us according to micros(). No jitter, so far.

Those loops unrolled - entire multiplication done in 1us, again according to micros. I do not know how accurate that function is and will check soon with an oscilloscope, but in the mean time, PRAISE THE COMPILER (excuse the shouting).

No assembly needed.

Now that I think about it, it seems possible that the compiler evaluated my math at compile time. I will have to test with some input or use volatile.

EDIT: when I mark the input vector as volatile, it takes 6us or 7us. That is a reasonable time, actually very good given low implementation complexity.

PaulStoffregen · Feb 5, 2018

Sounds good. I'm pretty sure aggressive use of the DSP extensions could get it into the 1-2 us range, but that's a tremendous amount of work!

Fixed-point known size matrix multiplication on teensy 3.2, 3.5

leafario

Member

el_supremo

Well-known member

PaulStoffregen

Well-known member

leafario

Member

PaulStoffregen

Well-known member

leafario

Member

leafario

Member

PaulStoffregen

Well-known member