T4 Float/Double Multiply/Divide Speed

Neal

It is probably somewhere in the Forum, but my searching was fruitless. What is the difference in machine cycles between a single float and a single double multiply or divide on the T4.1? The data sheet says the T4 has double-precision hardware, but I can't find a reference that says how many instruction cycles each operation takes.

Any help or a pointer to a reference would be appreciated.
 
A general rule of thumb is that the FPU takes twice as many cycles for a 64-bit double operation as it does for a 32-bit float operation.

More specific info is hard to find. Maybe you can find it somewhere in ARM's documentation, but they're not as open about those low-level details of the M7 as they are with the M4. That sort of documentation also tends not to give a clear idea of the many other conditions that affect overall speed, like the "register pressure" the compiler faced when compiling your code. If you are not running code from ITCM and using data from DTCM, cache performance also plays a factor.

Usually your best way to get a true answer is benchmarking with the ARM cycle counter. But be careful not to let the compiler optimize some or all of the math away at compile time. Usually the input has to come from a volatile variable so the compiler doesn't "know" the value and can't just pre-compute the answer.
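As a rough illustration of the volatile-input advice above, here is a minimal sketch, not a definitive benchmark. The names `readTicks`, `timeDoubleMul`, `Timing`, `gInD` and `gOutD` are made up for this example. On a Teensy 4.x it reads `ARM_DWT_CYCCNT`; the `std::chrono` fallback is only an assumption so the same code builds on a PC, where the ticks are nanoseconds rather than cycles. Note that it times a chain of dependent multiplies, so it leans toward latency rather than throughput.

```cpp
#include <cstdint>
#if !defined(ARM_DWT_CYCCNT)
#include <chrono>  // host-side fallback only (assumption, not part of the Teensy core)
#endif

struct Timing {
  uint32_t ticks;   // elapsed ticks (CPU cycles on Teensy 4.x)
  double result;    // computed value, returned so it can be checked
};

volatile double gInD = 1.2345;  // volatile input: the compiler cannot assume its value
volatile double gOutD;          // volatile sink: the result must actually be stored

// Tick source. On Teensy 4.x the core defines ARM_DWT_CYCCNT.
// Elsewhere this hypothetical fallback returns nanoseconds.
static inline uint32_t readTicks() {
#if defined(ARM_DWT_CYCCNT)
  return ARM_DWT_CYCCNT;
#else
  using namespace std::chrono;
  return static_cast<uint32_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}

// Times n chained double multiplies (each one depends on the previous result).
Timing timeDoubleMul(int n) {
  uint32_t t0 = readTicks();
  double x = gInD;              // read the input after starting the timer
  double acc = 1.0;
  for (int i = 0; i < n; i++) {
    acc *= x;
  }
  gOutD = acc;                  // store the result before stopping the timer
  uint32_t t1 = readTicks();
  return { t1 - t0, acc };
}
```

Because the input read and the result store are both volatile and sit between the two counter reads, the multiplies cannot be folded into a constant or dropped. The measured ticks still include the volatile load and store and the loop overhead, so comparing a float version against this one is more meaningful than reading either number on its own.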
 
I have been measuring execution times using my logic analyzer and doing digitalWriteFast(pin, HIGH/LOW). Your 2x rule of thumb matches what I have been seeing.

The T4/M7 is such a great device. I wish all the lower-level details were easier to find. I understand a manufacturer's need to protect their competitive advantage, but the ARM documentation I have read always seems to leave the reader guessing at what may be going on in the underbelly!

Thanks for the reply.
 
Mult seems double - DIV seems triple?
Code:
10X dbl Mult Cycles 132
10X dbl Div Cycles 397
10X int Mult Cycles 62
10X int Div Cycles 127

At least using this code, looping through an array:
Code:
double dA[10];
double dAM[10];
double dAD[10];
uint32_t iA[10];
uint32_t iAM[10];
uint32_t iAD[10];
int ii;
uint32_t tt;
void setup() {
  int ii;
  Serial.begin(115200);
  elapsedMillis foo;
  double dd = 0;
  for (ii = 0; ii < 10; ii++) {
    foo = 0;
    while ( foo < 5 ) {
      dd++;
    }
    dA[ii] = dd;
    iA[ii] = dd;
  }
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  for (ii = 0; ii < 10; ii++) {
    test();
  }
}

void loop() {
  Serial.println( "DONE!" );
  while (1); delay(10);
}

void test() {
  int ii;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dA[ii];
    iAD[ii] = iA[ii];
  }
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAM[ii] = dA[ii] * dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dAM[ii] / dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Div Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    iAM[ii] = iA[ii] * iA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X int Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    iAD[ii] = iAM[ii] / iA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X int Div Cycles %lu\n", tt );
  Serial.println();
  delay(100);
}
 
For the example code above from @defragster, double took about 2x as long as int for multiply and about 3x as long for divide. I modified the example to compare double to float instead. The result was that double multiply took about 2x as long as float multiply, and double divide a little under 2x (396 vs. 215 cycles). Comparing int to float across the two examples, multiply is about 1x, and float divide takes roughly 1.7x as long as int divide (215 vs. 127 cycles).

Code:
10X dbl Mult Cycles 128
10X dbl Div Cycles 396
10X float Mult Cycles 61
10X float Div Cycles 215
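For anyone who wants to check the ratios rather than eyeball them, here is a small sketch that does the arithmetic on the cycle counts posted in this thread. The helper name `cycleRatio` is made up for this example, and the int figures (62 and 127) come from the earlier run, so the cross-run comparison is only approximate.

```cpp
#include <cstdio>

// Ratio of two cycle counts measured over the same number of operations.
double cycleRatio(unsigned long slower, unsigned long faster) {
  return static_cast<double>(slower) / static_cast<double>(faster);
}

// Prints the ratios for the figures quoted in this thread (10 operations each).
void printRatios() {
  std::printf("dbl/float mult: %.2f\n", cycleRatio(128, 61));
  std::printf("dbl/float div:  %.2f\n", cycleRatio(396, 215));
  std::printf("float/int mult: %.2f\n", cycleRatio(61, 62));
  std::printf("float/int div:  %.2f\n", cycleRatio(215, 127));
}
```

Each count covers the whole 10-iteration loop, including array loads, stores and loop overhead, so these ratios describe the loops as written rather than the raw instruction timings.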

Here is the modified sketch.
Code:
double dA[10];
double dAM[10];
double dAD[10];

float fA[10];
float fAM[10];
float fAD[10];

int ii;
uint32_t tt;
void setup() {
  int ii;
  Serial.begin(115200);
  elapsedMillis foo;
  double dd = 0;
  for (ii = 0; ii < 10; ii++) {
    foo = 0;
    while ( foo < 5 ) {
      dd++;
    }
    dA[ii] = dd;
//    iA[ii] = dd;
    fA[ii] = dd;
  }
  while (!Serial && millis() < 4000 );
  Serial.println("\n" __FILE__ " " __DATE__ " " __TIME__);
  for (ii = 0; ii < 10; ii++) {
    test();
  }
}

void loop() {
  Serial.println( "DONE!" );
  while (1); delay(10);
}

void test() {
  int ii;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dA[ii];
    fAD[ii] = fA[ii];
  }
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAM[ii] = dA[ii] * dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    dAD[ii] = dAM[ii] / dA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X dbl Div Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    fAM[ii] = fA[ii] * fA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X float Mult Cycles %lu\n", tt );
  tt = ARM_DWT_CYCCNT;
  for (ii = 0; ii < 10; ii++) {
    fAD[ii] = fAM[ii] / fA[ii];
  }
  tt = ARM_DWT_CYCCNT - tt;
  Serial.printf( "10X float Div Cycles %lu\n", tt );
  Serial.println();
  delay(100);
}
 