I have spent some time trying to find the best MCU for performing matrix multiplications.
A few days ago I got my hands on a Teensy 4.1. I focused on the Arm Math library (which, as I understand it, should give the best results for math operations).
However, the results are not what I expected: in general they are even worse than those of a simple matrix multiplication.
My only guess is that something is missing in the configuration; perhaps some flag is not being set. Any help is greatly appreciated.
Let me start with the function I am comparing to:
Code:
void MultMatrices(const float* A, const float* B, float* C, const int m, const int n, const int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j];
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}
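As a quick sanity check of the reference function above, here is a minimal sketch that runs it on a small 2x3 by 3x2 product whose result is easy to compute by hand. It assumes row-major storage, as the indexing in MultMatrices implies; CheckMultMatrices is a hypothetical helper name that is not part of the original test.

```cpp
#include <cmath>

// Same reference routine as above: C (m x k) = A (m x n) * B (n x k), row-major.
void MultMatrices(const float* A, const float* B, float* C,
                  const int m, const int n, const int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j];
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}

// Hypothetical helper (not in the original post): compares the routine's
// output against a hand-computed 2x2 result.
bool CheckMultMatrices()
{
    const float A[2 * 3] = {1, 2, 3,
                            4, 5, 6};
    const float B[3 * 2] = {7,  8,
                            9,  10,
                            11, 12};
    const float expected[2 * 2] = {58, 64,
                                   139, 154};
    float C[2 * 2] = {0};

    MultMatrices(A, B, C, 2, 3, 2);

    for (int i = 0; i < 4; ++i) {
        if (std::fabs(C[i] - expected[i]) > 1e-5f) {
            return false;
        }
    }
    return true;
}
```

The same comparison could be applied to the output of the library routine, so that both versions are known to compute the same product before they are timed.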
In the main function I plugged in the following code:
Code:
#include <arm_math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cstdint>
#include <chrono>
int64_t GetTime()
{
    using namespace std::literals;
    const std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();
}
class MyTimer
{
public:
    MyTimer(float* delta)
    {
        t1 = GetTime();
        _delta = delta;
    }
    ~MyTimer()
    {
        const int64_t t2 = GetTime(); //Returns number of microseconds
        *_delta = static_cast<float>((t2 - t1)) / 1000.f;
    }
private:
    int64_t t1 = 0;
    float* _delta = nullptr;
};
void MultMatrices(const float* A, const float* B, float* C, const int m, const int n, const int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            C[i * k + j] = A[i * n] * B[j];
            for (int s = 1; s < n; s++) {
                C[i * k + j] += A[i * n + s] * B[s * k + j];
            }
        }
    }
}
int main(void)
{
#if defined(USBCON)
    USBDevice.attach();
#endif
    const int numTests = 1000;
    const int M = 4;
    const int N = 4;
    const int K = 256;
    // Note: A and B are not initialized before they are read below.
    float32_t A[M*N];
    float32_t B[N*K];
    float32_t C[M*K];
#ifdef ARM_MATH_CM7
    printf(">>>> ARM Defined!\n");
#else
    printf(">>>> ARM Undefined!\n");
#endif
    float delta = 0;
    // Time numTests runs of the simple multiplication.
    {
        MyTimer t(&delta);
        for (int i = 0; i < numTests; ++i)
        {
            MultMatrices(A, B, C, M, N, K);
        }
    }
    printf(" Simple Performance: [%.4fms]\n", delta);
    // Time numTests runs of the Arm Math multiplication on the same buffers.
    {
        arm_matrix_instance_f32 Aarm;
        arm_mat_init_f32(&Aarm, M, N, A);
        arm_matrix_instance_f32 Barm;
        arm_mat_init_f32(&Barm, N, K, B);
        arm_matrix_instance_f32 Carm;
        arm_mat_init_f32(&Carm, M, K, C);
        {
            MyTimer t(&delta);
            for (int i = 0; i < numTests; ++i)
            {
                arm_mat_mult_f32(&Aarm, &Barm, &Carm);
            }
        }
    }
    printf(" ARM Matrix Performance: [%.4fms]\n", delta);
    return 0;
}
Running the code above gives me the following results:
Code:
>>>> ARM Defined!
Simple Performance: [30.7620ms]
ARM Matrix Performance: [50.4760ms]
From the results I can tell that the code really is using the library (as ARM_MATH_CM7 is defined). I also went as far as physically removing the Arm Math library files, just to make sure that the compiler was accessing the right libraries.
I would expect the Arm Math multiplication function to give far more favorable results than the simple one.
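One detail of the test program above that may be worth ruling out: A and B are read without ever being written, so both timings run on whatever values happen to be on the stack, and the contents of C are never used afterwards. Whether this affects the numbers is not established here. A minimal sketch of how the inputs could be filled with well-defined data and the output consumed follows; FillTestData and Checksum are hypothetical helper names, not part of CMSIS-DSP or of the original test.

```cpp
#include <cstddef>

// Hypothetical helper: fills a buffer with small, non-zero, deterministic
// values (0.5, 0.75, ..., 2.0, repeating every 7 elements).
void FillTestData(float* buf, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        buf[i] = 0.5f + static_cast<float>(i % 7) * 0.25f;
    }
}

// Hypothetical helper: sums a buffer so that the computed result is
// observably used (for example by printing the sum after timing).
float Checksum(const float* buf, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += buf[i];
    }
    return sum;
}
```

In the test this would amount to calling FillTestData(A, M*N) and FillTestData(B, N*K) before the first timed block, and printing Checksum(C, M*K) after each timed block so the two results can also be compared.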
Regarding my setup, I am building my project using Visual Studio Code with the PlatformIO extension. The framework is Arduino.
Details of the platformio.ini file:
Code:
platform = teensy
board = teensy41
framework = arduino
monitor_speed = 115200
upload_protocol = teensy-cli
build_flags = -D TEENSY_OPT_FASTEST_LTO
Details of the arm_math.h file:
Code:
* Project: CMSIS DSP Library
* Title: arm_math.h
* Description: Public header file for CMSIS DSP Library
*
* $Date: 27. January 2017
* $Revision: V.1.5.1
*
As a last resort, I tried updating the library with this version:
https://github.com/mjs513/Teensy-DSP-1.12-Updates
Still, I am getting the same result.
Note:
MyTimer is a simple class that tracks the time elapsed from object construction to destruction, using the following code to query the time:
Code:
// Named GetTime() to match the call made by MyTimer in the listing above.
int64_t GetTime()
{
    using namespace std::literals;
    const std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();
}
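For comparison, here is a minimal sketch of the same construction-to-destruction timing idea built on std::chrono::steady_clock, which is monotonic, whereas system_clock represents wall-clock time and may be adjusted. ScopedTimerMs is a hypothetical name, and whether the std::chrono clocks are backed by a suitable time source on the Teensy core is an assumption that has not been verified here; the Arduino micros() function is another way to query elapsed microseconds.

```cpp
#include <chrono>

// Hypothetical variant of MyTimer: writes the elapsed time in milliseconds
// to *delta when the object goes out of scope.
class ScopedTimerMs
{
public:
    explicit ScopedTimerMs(float* delta)
        : _start(std::chrono::steady_clock::now()), _delta(delta)
    {
    }

    ~ScopedTimerMs()
    {
        const auto elapsed = std::chrono::steady_clock::now() - _start;
        // duration<float, std::milli> converts the tick count to float ms.
        *_delta = std::chrono::duration<float, std::milli>(elapsed).count();
    }

    // Non-copyable: a copy would write to *delta a second time.
    ScopedTimerMs(const ScopedTimerMs&) = delete;
    ScopedTimerMs& operator=(const ScopedTimerMs&) = delete;

private:
    std::chrono::steady_clock::time_point _start;
    float* _delta;
};
```

Usage is the same as with MyTimer: declare a float, open a scope, construct ScopedTimerMs t(&delta) at the top of it, and read delta after the scope closes.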