teensy 4.1 DSP matrix multiplication

FabianMM

New member
I have spent some time trying to find the best MCU for performing matrix multiplications.
A few days ago I got my hands on a Teensy 4.1. I focused on the Arm Math library (which I understand should give the best results for math operations).
However, the results are not what I expected; in general they are even worse than those of a simple hand-written matrix multiplication.

My only guess is that something is missing in the configuration, perhaps a flag that is not being set. Any help is greatly appreciated.

Let me start with the function I am comparing against:

Code:
void MultMatrices(const float* A, const float* B, float* C, const int m, const int n, const int k)
{
    // C (m x k) = A (m x n) * B (n x k), all row-major
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j]; // s = 0 term: B[j] == B[0 * k + j]
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}

In the main function I have plugged in the following code:

Code:
#include <arm_math.h>

#include <stdio.h>
#include <stdlib.h>
#include <cstdint>
#include <chrono>

int64_t GetTime()
{
    using namespace std::literals;
    const std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();
}

class MyTimer
{
public:
    MyTimer(float* delta)
    {
        t1 = GetTime();
        _delta = delta;
    }

    ~MyTimer()
    {
        const int64_t t2 = GetTime(); //Returns number of microseconds
        *_delta = static_cast<float>((t2 - t1)) / 1000.f;
    }

private:
    int64_t t1 = 0;
    float* _delta = nullptr;
};


void MultMatrices(const float* A, const float* B, float* C, const int m, const int n, const int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j];
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}

int main(void)
{
#if defined(USBCON)
    USBDevice.attach();
#endif

    const int numTests = 1000;
    const int M = 4;
    const int N = 4;
    const int K = 256;

    float32_t A[M*N];
    float32_t B[N*K];
    float32_t C[M*K];

#ifdef ARM_MATH_CM7
    printf(">>>> ARM Defined!\n");
#else
    printf(">>>> ARM Undefined!\n");
#endif

    float delta = 0;
    {
        MyTimer t(&delta);
        for (int i = 0; i < numTests; ++i)
        {
            MultMatrices(A, B, C, M, N, K);
        }
    }
    printf("\tSimple Performance: [%.4fms]\n", delta);

    {
        arm_matrix_instance_f32 Aarm;
        arm_mat_init_f32(&Aarm, M, N, A);
        arm_matrix_instance_f32 Barm;
        arm_mat_init_f32(&Barm, N, K, B);
        arm_matrix_instance_f32 Carm;
        arm_mat_init_f32(&Carm, M, K, C);
        {
            MyTimer t(&delta);
            for (int i = 0; i < numTests; ++i)
            {
                arm_mat_mult_f32(&Aarm, &Barm, &Carm);
            }
        }
    }
    printf("\tARM Matrix Performance: [%.4fms]\n", delta);

    return 0;
}
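(One caveat on the benchmark itself: A, B and C are left uninitialized and C is never read afterwards, so in principle an aggressive optimizer could discard part of the work. A minimal guard, assuming nothing else consumes the results, would be a volatile sink after the timed loops:)

Code:
// Hypothetical guard against dead-code elimination: reading every result
// through a volatile forces the compiler to keep the stores to C alive.
volatile float sink = 0.0f;
for (int i = 0; i < M * K; ++i) {
    sink = sink + C[i];
}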

Running the code above gives me the following results:

Code:
>>>> ARM Defined!
        Simple Performance: [30.7620ms]
        ARM Matrix Performance: [50.4760ms]

From the results I can tell that the code is really using the library (since ARM_MATH_CM7 is defined). I even went as far as physically removing the Arm Math library files, just to make sure the compiler was picking up the right ones.
I would have expected the execution to give way more favorable results for the Arm Math multiplication function.

In regards to my setup, I am building my project using Visual Studio Code with the PlatformIO extension. The framework is Arduino.

Details of the platformio.ini file
Code:
platform = teensy
board = teensy41
framework = arduino
monitor_speed = 115200
upload_protocol = teensy-cli
build_flags = -D TEENSY_OPT_FASTEST_LTO
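(For comparison runs, my understanding is that the Teensy platform in PlatformIO selects the optimization level through these defines, with TEENSY_OPT_FASTEST_LTO corresponding to -O3 plus link-time optimization:)

Code:
; hypothetical variations - pick one of the TEENSY_OPT_* defines, e.g.:
build_flags = -D TEENSY_OPT_FASTER_LTO   ; roughly -O2 with LTO
;build_flags = -D TEENSY_OPT_FASTEST     ; roughly -O3 without LTO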

Details of the arm_math.h file
Code:
 * Project:      CMSIS DSP Library
 * Title:        arm_math.h
 * Description:  Public header file for CMSIS DSP Library
 *
 * $Date:        27. January 2017
 * $Revision:    V.1.5.1
 *

As a last resort, I tried updating the library with this version:
https://github.com/mjs513/Teensy-DSP-1.12-Updates
Still getting the same results.

Note:

MyTimer is a simple class that tracks the time elapsed from object construction to destruction, using the following function to query the time:

Code:
    int64_t GetElapsedTime()
    {
        using namespace std::literals;
        const std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();
    }
 
Full code please - it's not clear if your MyTimer class initializes the passed-in delta value to zero (I suspect it doesn't...)
 
Sorry about that. I have updated the post and included the whole code in main.cpp.
Delta is calculated when the object is destroyed, so it shouldn't make a difference whether it was initialized.
 
Your MultMatrices function may be inlined and unrolled, allowing good optimization and fast execution.

Whether or not arm_mat_mult_f32 is inlined, the fact that you need to initialize its parameters outside the call (arm_mat_init_f32) suggests its loops access the matrices through structure members, with variable dimensions and iteration counts, which may lead to less optimized code.

I can't state this for certain, but I would not be surprised if an external library function loses to a local inlined function.

EDIT: I tried with https://godbolt.org/ and MultMatrices is clearly inlined and unrolled with -O3, and it does a pretty good job because M, N and K are *constant* for it. I cannot test arm_mat_mult_f32, but I suspect it would be less optimized, because its dimensions are *NOT constant* and so lead to more generic code that handles variable matrix dimensions.
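As a rough sketch of what I suspect (my guess at the shape of the code, not the actual CMSIS source), the library's generic multiply has to pull the dimensions out of structure members, so none of the trip counts are compile-time constants:

Code:
#include <stdint.h>

// My guess at the *shape* of a generic library multiply (not the actual
// CMSIS source). The struct mirrors arm_matrix_instance_f32; because the
// dimensions are runtime values, the loops cannot be unrolled.
typedef struct {
    uint16_t numRows;
    uint16_t numCols;
    float*   pData;
} matrix_f32;

void mat_mult_generic(const matrix_f32* A, const matrix_f32* B, matrix_f32* C)
{
    const uint16_t m = A->numRows, n = A->numCols, k = B->numCols;
    for (uint16_t i = 0; i < m; i++) {
        for (uint16_t j = 0; j < k; j++) {
            float acc = 0.0f;
            for (uint16_t s = 0; s < n; s++) {
                acc += A->pData[i * n + s] * B->pData[s * k + j];
            }
            C->pData[i * k + j] = acc;
        }
    }
}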
 
Just to add to the confusion, I modified your sketch slightly to run within the Arduino IDE using a MicroMod board. The modified sketch:
Code:
#include <arm_math.h>
#include <chrono>

int64_t GetTime()
{
    using namespace std::literals;
    const std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();
}

class MyTimer
{
public:
    MyTimer(float* delta)
    {
        t1 = GetTime();
        _delta = delta;
    }

    ~MyTimer()
    {
        const int64_t t2 = GetTime(); //Returns number of microseconds
        *_delta = static_cast<float>((t2 - t1)) / 1000.f;
    }

private:
    int64_t t1 = 0;
    float* _delta = nullptr;
};


void MultMatrices(const float* A, const float* B, float* C, const int m, const int n, const int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j];
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}

void setup()
{
    Serial.begin(9600);
    delay(5000);
    const int numTests = 1000;
    const int M = 4;
    const int N = 4;
    const int K = 256;

    float32_t A[M*N];
    float32_t B[N*K];
    float32_t C[M*K];

#ifdef ARM_MATH_CM7
    printf(">>>> ARM Defined!\n");
#else
    printf(">>>> ARM Undefined!\n");
#endif

    float delta = 0;
    {
        MyTimer t(&delta);
        for (int i = 0; i < numTests; ++i)
        {
            MultMatrices(A, B, C, M, N, K);
        }
    }
    printf("\tSimple Performance: [%.4fms]\n", delta);

    {
        arm_matrix_instance_f32 Aarm;
        arm_mat_init_f32(&Aarm, M, N, A);
        arm_matrix_instance_f32 Barm;
        arm_mat_init_f32(&Barm, N, K, B);
        arm_matrix_instance_f32 Carm;
        arm_mat_init_f32(&Carm, M, K, C);
        {
            MyTimer t(&delta);
            for (int i = 0; i < numTests; ++i)
            {
                arm_mat_mult_f32(&Aarm, &Barm, &Carm);
            }
        }
    }
    printf("\tARM Matrix Performance: [%.4fms]\n", delta);

}

void loop() {}

Pretty much the only thing I removed was the libraries that aren't needed: stdlib, stdio, and stdint. My results show that the times for both are almost the same:
Code:
>>>> ARM Defined!
	Simple Performance: [46.2340ms]
	ARM Matrix Performance: [46.2950ms]

If I change your N and M to 8:
Code:
>>>> ARM Defined!
	Simple Performance: [174.2250ms]
	ARM Matrix Performance: [159.0880ms]

and N and M = 16:
Code:
>>>> ARM Defined!
	Simple Performance: [677.8870ms]
	ARM Matrix Performance: [533.1110ms]

I think you get the idea of what's going on. This is with -O2 (faster). The interesting thing is if I use -O3 (fastest):
Code:
>>>> ARM Defined!
	Simple Performance: [437.5300ms]
	ARM Matrix Performance: [533.1430ms]
the ARM matrix performance is slower than the simple one.
 
Or, put simply: the library functions are not heavily optimizable, as they do not know any of the loop counts at compile time, but out of the box they can be used for arrays of any dimensions.

Unrolling here is very powerful, as it removes all the conditional branches from the code, so there are no pipeline stalls and in theory you get full utilization of the dual-issue FPU.

So code written for predetermined array dimensions will be noticeably faster than code for arbitrary dimensions. This is an application where C++ templates might have been worthwhile for the library; I'm not sure whether the Arm library code is limited to C rather than C++, though.
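For example, a sketch of the templated idea (mine, not part of the Arm library): the dimensions become template parameters, so every instantiation gets compile-time trip counts and can be unrolled like the hand-written version:

Code:
// Sketch of a templated multiply: M, N, K are compile-time constants for
// each instantiation, so the optimizer can unroll exactly as it does for
// the local inlined function.
template <int M, int N, int K>
void MultMatricesT(const float* A, const float* B, float* C)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < K; j++) {
            float acc = 0.0f;
            for (int s = 0; s < N; s++) {
                acc += A[i * N + s] * B[s * K + j];
            }
            C[i * K + j] = acc;
        }
    }
}

// Usage: MultMatricesT<4, 4, 256>(A, B, C);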
 
Thank you all for your replies. It makes sense that the compiler would give some advantage to the inlined, unrolled code.
My confusion is that I was expecting much more performance from a function built purposely for the platform I am working on (as in, taking advantage of the specific ARM instruction set).
However, that doesn't seem to be the case. It looks like the M7 doesn't have much power in regards to floating-point operations, which is very disappointing :(.
I have even tried the quantized versions of the matrix multiplication functions, but the results are very similar. No advantage from vector operations whatsoever.
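For reference, this is roughly how I called the quantized version (a sketch from memory; the pState scratch sizing follows my reading of the CMSIS docs, where the q15 multiply stores an intermediate transpose of B):

Code:
// Sketch from memory (not verified): q15 variant of the same benchmark.
// arm_mat_mult_q15 takes an extra pState scratch buffer; per my reading of
// the CMSIS docs it holds an intermediate transpose of B, hence N*K entries.
q15_t Aq[M * N];
q15_t Bq[N * K];
q15_t Cq[M * K];
q15_t scratch[N * K];

arm_matrix_instance_q15 Aq15, Bq15, Cq15;
arm_mat_init_q15(&Aq15, M, N, Aq);
arm_mat_init_q15(&Bq15, N, K, Bq);
arm_mat_init_q15(&Cq15, M, K, Cq);

arm_mat_mult_q15(&Aq15, &Bq15, &Cq15, scratch);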
Considering that I will need more power (around 50% more), is there any board/SoC that you may suggest?

On a side note, I based my selection of the Teensy 4.1 on Paul's CoreMark benchmarks. However, I am using an ESP32-S3 in parallel, and the matrix operations are not that far off: the ESP32-S3 is around 5% slower in a 4x4x256 MatMult, and 40% slower in a 2x2x256 MatMult, as it takes advantage of vector operations for reading/writing memory blocks.
Considering that the chip costs just a fraction and includes Bluetooth, Wi-Fi and 32 MB of memory, it makes a great alternative to the M7.
 
Considering that the chip costs just a fraction and includes Bluetooth, Wi-Fi and 32 MB of memory, it makes a great alternative to the M7.

Yes, and it is dual-core, and it's possible to use the other core for simultaneous tasks without a performance impact, as long as it's not using the FPU, too.
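Something like this, as a minimal sketch with the Arduino-ESP32 FreeRTOS API (task name and stack size are placeholders):

Code:
// Minimal sketch (Arduino-ESP32): pin a background task to core 0 so the
// FPU-heavy math can keep core 1 (where loop() runs by default) to itself.
void commsTask(void* param)
{
    for (;;) {
        // handle Wi-Fi / Bluetooth traffic here
        vTaskDelay(1);
    }
}

void setup()
{
    // placeholder name/stack size; priority 1; pinned to core 0
    xTaskCreatePinnedToCore(commsTask, "comms", 4096, nullptr, 1, nullptr, 0);
}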

I will need more power (around 50% more), is there any board/SoC that you may suggest?
I don't think there is an MCU with sufficient floating-point performance.
I'm not sure about the Raspberry Pi.
What about those micro-ATX boards with laptop/desktop CPUs?
 
I will probably try a Cortex-A7. Not an MCU, but I hope I can get more of an advantage by using the NEON extension and the VFP.
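Something along these lines is what I have in mind (a sketch with NEON intrinsics; not tested on hardware, and it assumes k is a multiple of 4):

Code:
#include <arm_neon.h>

// Sketch: compute one row of C with NEON, four outputs per iteration.
// Arow is row i of A (length n); Crow is row i of C (length k, k % 4 == 0).
void RowMultNeon(const float* Arow, const float* B, float* Crow, int n, int k)
{
    for (int j = 0; j < k; j += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int s = 0; s < n; s++) {
            float32x4_t b = vld1q_f32(&B[s * k + j]); // 4 contiguous B values
            acc = vmlaq_n_f32(acc, b, Arow[s]);       // acc += b * Arow[s]
        }
        vst1q_f32(&Crow[j], acc);
    }
}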
 