T4 Float/Double Multiply/Divide Speed

Neal · Aug 27, 2022

It is probably somewhere in the Forum, but my searching was fruitless. What is the difference in machine cycles between a single float and a single double multiply or divide on the T4.1? The data sheet says the T4 has double-precision hardware, but I can't find a reference that says how many instruction cycles each takes.

Any help or reference pointing would be appreciated.

PaulStoffregen · Aug 27, 2022

A general rule of thumb is that the FPU takes twice as many cycles for a 64-bit double operation as it does for a 32-bit float.

More specific info is hard to find. Maybe you can find it somewhere in ARM's documentation, but they're not as open about those low-level details of M7 as they are with M4. That sort of documentation also tends not to give a clear idea of the many other conditions that affect overall speed, like the "register pressure" the compiler faced when compiling your code. If you're not running code from ITCM and using data from DTCM, cache performance also plays a factor.

Usually your best way to get a true answer is benchmarking with the ARM cycle counter. But be careful not to let the compiler optimize some or all of the math away at compile time. Usually the input has to come from a volatile variable so the compiler doesn't "know" the value and can't just pre-compute the answer.
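
A minimal sketch of that advice, for illustration only. The names readTicks, timeMultiply, inF, inD, outF and outD are made up for this example and are not from any post in the thread. It assumes ARM_DWT_CYCCNT is defined by the Teensy core (as in the sketches posted below); when it is not defined, the helper falls back to a nanosecond clock so the code still compiles on a PC, in which case the numbers are not cycles.

```cpp
#include <stdint.h>
#if !defined(ARM_DWT_CYCCNT)
#include <chrono>  // only needed for the non-Teensy fallback
#endif

// Hypothetical helper: CPU cycles on a Teensy 4.x, nanoseconds elsewhere.
static inline uint32_t readTicks() {
#if defined(ARM_DWT_CYCCNT)
  return ARM_DWT_CYCCNT;
#else
  using namespace std::chrono;
  return static_cast<uint32_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}

// Volatile inputs: the compiler must load them at run time, so it cannot
// fold the multiply into a constant at compile time.
volatile float  inF = 3.0f;
volatile double inD = 3.0;

// Volatile outputs: the stores cannot be thrown away as dead code.
volatile float  outF;
volatile double outD;

struct Timing {
  uint32_t floatTicks;
  uint32_t doubleTicks;
};

// Each interval covers two volatile loads, one multiply and one volatile
// store, plus the cost of reading the counter itself.
Timing timeMultiply() {
  Timing t;

  uint32_t start = readTicks();
  outF = inF * inF;
  t.floatTicks = readTicks() - start;  // unsigned math survives counter wrap

  start = readTicks();
  outD = inD * inD;
  t.doubleTicks = readTicks() - start;

  return t;
}
```

Reading the inputs inside the timed region keeps the multiply between the two counter reads, since volatile accesses are not reordered against each other. A single operation is close to the measurement overhead, so timing a batch (as in the array loops below) gives steadier numbers.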

Neal · Aug 27, 2022

I have been measuring execution times using my logic analyzer and doing digitalWriteFast(pin, HIGH/LOW). Your 2x rule of thumb matches what I have been seeing.

The T4/M7 is such a great device. I wish all the lower-level details were easier to find. I understand a manufacturer's need to protect their competitive advantage, but the ARM documentation I have read always seems to leave the reader guessing at what may be going on in the underbelly!

Thanks for the reply.

defragster · Aug 27, 2022

Compared with int, double Mult seems to take double the cycles and DIV seems triple?

Code:

10X dbl Mult Cycles 132
10X dbl Div Cycles 397
10X int Mult Cycles 62
10X int Div Cycles 127

At least using this code, looping through an array:

Code:

double dA[10];
double dAM[10];
double dAD[10];
uint32_t iA[10];
uint32_t iAM[10];
uint32_t iAD[10];
int ii;
uint32_t tt;
void setup() {
  int ii;
  Serial.begin(115200);
  elapsedMillis foo;
  double dd = 0;
  for (ii = 0; ii < 10; ii++) {
    foo = 0;
    while ( foo < 5 ) {
      dd++;
    }
    dA[ii] = dd;
    iA[ii] = dd;
  }
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  for (ii = 0; ii < 10; ii++) {
    test();
  }
}

void loop() {
  Serial.println( "DONE!" );
  while (1); delay(10);
}

void test() {
  int ii;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dA[ii];
    iAD[ii] = iA[ii];
  }
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAM[ii] = dA[ii] * dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dAM[ii] / dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Div Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    iAM[ii] = iA[ii] * iA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X int Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    iAD[ii] = iAM[ii] / iA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X int Div Cycles %lu\n", tt );
  Serial.println();
  delay(100);
}
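
To compare other types with the same loops, one option is to write the timed loops once as templates. This is only a sketch under assumptions: readTicks, timeMultiply and timeDivide are hypothetical names, the arrays are expected to be declared volatile (unlike the sketch above), and off-Teensy the helper returns nanoseconds rather than cycles.

```cpp
#include <stddef.h>
#include <stdint.h>
#if !defined(ARM_DWT_CYCCNT)
#include <chrono>  // only needed for the non-Teensy fallback
#endif

// Hypothetical helper: CPU cycles on a Teensy 4.x, nanoseconds elsewhere.
static inline uint32_t readTicks() {
#if defined(ARM_DWT_CYCCNT)
  return ARM_DWT_CYCCNT;
#else
  using namespace std::chrono;
  return static_cast<uint32_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}

// Time n multiplies of type T: out[i] = in[i] * in[i].
// The volatile pointers force every load and store to happen inside
// the timed region.
template <typename T>
uint32_t timeMultiply(const volatile T* in, volatile T* out, size_t n) {
  uint32_t start = readTicks();
  for (size_t i = 0; i < n; i++) {
    out[i] = in[i] * in[i];
  }
  return readTicks() - start;
}

// Time n divides of type T: out[i] = num[i] / den[i].
// The caller must make sure den[] holds no zeros.
template <typename T>
uint32_t timeDivide(const volatile T* num, const volatile T* den,
                    volatile T* out, size_t n) {
  uint32_t start = readTicks();
  for (size_t i = 0; i < n; i++) {
    out[i] = num[i] / den[i];
  }
  return readTicks() - start;
}
```

With arrays declared as, say, volatile float fA[10], fAM[10], fAD[10], the calls would be timeMultiply(fA, fAM, 10) and timeDivide(fAM, fA, fAD, 10), and the same two functions work for double and uint32_t. The totals still include loop and load/store overhead, so they are not pure instruction latencies.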

manitou · Aug 28, 2022

Cortex-M7 instruction cycle counts, timings, and dual-issue combinations
read https://www.quinapalus.com/cm7cycles.html

some early comparative float/double benchmarks
https://forum.pjrc.com/threads/54711-Teensy-4-0-First-Beta-Test?p=194187&viewfull=1#post194187

dual-issue can speed things up too
https://forum.pjrc.com/threads/70022-FPU-performance

Neal · Aug 28, 2022

For the example code above from @defragster, double took about 2x as long as int for multiply and about 3x as long for divide. I modified the example to compare double to float. The result was that double multiply and divide both took about 2x as long as float. Comparing int to float across the two examples, multiply is about 1x and float divide is roughly 2x int divide.

Code:

10X dbl Mult Cycles 128
10X dbl Div Cycles 396
10X float Mult Cycles 61
10X float Div Cycles 215

Here is the modified sketch.

Code:

double dA[10];
double dAM[10];
double dAD[10];

float fA[10];
float fAM[10];
float fAD[10];

int ii;
uint32_t tt;
void setup() {
  int ii;
  Serial.begin(115200);
  elapsedMillis foo;
  double dd = 0;
  for (ii = 0; ii < 10; ii++) {
    foo = 0;
    while ( foo < 5 ) {
      dd++;
    }
    dA[ii] = dd;
//    iA[ii] = dd;
    fA[ii] = dd;
  }
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  for (ii = 0; ii < 10; ii++) {
    test();
  }
}

void loop() {
  Serial.println( "DONE!" );
  while (1); delay(10);
}

void test() {
  int ii;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dA[ii];
    fAD[ii] = fA[ii];
  }
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAM[ii] = dA[ii] * dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dAM[ii] / dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Div Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    fAM[ii] = fA[ii] * fA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X float Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    fAD[ii] = fAM[ii] / fA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X float Div Cycles %lu\n", tt );
  Serial.println();
  delay(100);
}
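
For anyone who wants to check the ratios quoted in this thread, here is a small sketch that works them out from the posted cycle counts. The double and float counts are from the modified sketch above and the int counts are from @defragster's earlier run, so the int comparison mixes two runs. The names Counts, ratio and printRatios are made up for this example.

```cpp
#include <cstdio>

// Cycle counts for 10 operations, copied from the results posted in the thread.
struct Counts {
  double mul;
  double div;
};

constexpr Counts kDouble = {128.0, 396.0};  // modified sketch above
constexpr Counts kFloat  = {61.0, 215.0};   // modified sketch above
constexpr Counts kInt    = {62.0, 127.0};   // @defragster's earlier run

// How many times longer 'a' took than 'b'.
constexpr double ratio(double a, double b) { return a / b; }

void printRatios() {
  std::printf("double/float mult: %.2f\n", ratio(kDouble.mul, kFloat.mul));
  std::printf("double/float div:  %.2f\n", ratio(kDouble.div, kFloat.div));
  std::printf("float/int mult:    %.2f\n", ratio(kFloat.mul, kInt.mul));
  std::printf("float/int div:     %.2f\n", ratio(kFloat.div, kInt.div));
}
```

The ratios come out at about 2.1 and 1.8 for double versus float, and about 1.0 and 1.7 for float versus int. Each count covers the whole loop, including loads, stores and loop overhead, so these are ratios of loop totals and not of the bare multiply or divide instructions.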
