FPU performance

Hi folks,

I've just got a teensy 4.1 - loving it so far!

I plan to build a portable DAW/tracker - much in the style of LCDJ, Dirtywave's M8, Reaper. It's not commercial, I'm not competing with those guys - I'm just inspired by their great work, and it's just a box for me to play with!

I wrote a quick "benchmark" to get an idea of the FPU performance, and it's well exceeded my expectations:

Code:
{
  elapsedMillis floptime;
  static float res = 0;
  float target = 1;
  float damp = 0.001f;
  for (int a = 0; a < 20000000; a++) // simple unrolled 6 dB one-pole IIR
  {
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
  }

  gridprintf(0, 5, "Math: \x4%d ms \x2mflops \4%5.2f", (int)floptime, (20 * 24) * (float)(int)floptime / 1000.f); // 24 flops per iteration: 8 muls, 8 subs, 8 adds
}

So from that I'm getting 1023 MFLOPS!!

Is that right - am I actually getting 1,024,000,000 flops/sec? This can't be real, can it? lol

thanks

Shabby
 
Unless you went to great lengths to disable optimisations, the compiler would have rearranged the code significantly.
 
ah - I've just noticed my bad code!

it should be:
Serial.printf("Math: %d ms mflops %5.2f\n", (int)floptime, (20 * 24) * 1000.f/((float)(int)floptime) );


which gives me:

Math: 2134ms mflops: 224.93

which is still impressive!
 
Unless you went to great lengths to disable optimisations, the compiler would have rearranged the code significantly.

In what way? Every operation depends on the previous one, so there's no scope for such optimization, or even to take advantage of dual-issue.

To calculate MFlop/s use: 20 * 24 / (floptime / 1000.0f)
MFlop count _divided by_ time taken - dimensional analysis is powerful!
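
As a quick sanity check on those numbers (my arithmetic, not from the posts above): 20,000,000 iterations of 24 flops is 480 Mflop of work, and 480 Mflop over 2.134 s is roughly 225 MFLOPS, matching the corrected printout:
Code:
// 20,000,000 iterations * 24 flops = 480 Mflop of work
// 480 Mflop / 2.134 s ≈ 225 MFLOPS
float mflops = (20.0f * 24.0f) / (2134.0f / 1000.0f);   // = 224.93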

BTW if you allow parallelism in the benchmark you'll get the benefit of dual-issue; I get 450.07 MFLOPS from this:
Code:
void bench (void)
{
  elapsedMillis floptime;
  float res = 0.0f, res2 = 1.0f;
  float target = 1.0f, target2 = 2.0f;
  float damp = 0.0016725f;
  for (int a = 0; a < 20000000; a++)
  {
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res += (target - res) * damp;
    res2 += (target2 - res2) * damp;  // compiler will intersperse these two sets to allow dual-issue.
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
    res2 += (target2 - res2) * damp;
  }
  Serial.printf("Math: %f %f, %i ms MFlops %5.2f\n\n", res, res2,  // print the results to ensure vars are live.
    (int)floptime,
    (20 * 48) / (floptime / 1000.f)); // 2 * 24 flops now
}
 
cool - thx for the parallelism example.

I'm more than happy with the FPU performance - it's ungodly for an MCU - hahaha

450 MFLOPS would give a best case of about 10k flops per sample at 44.1 kHz - yeah, I know real-world figures would be a lot less unless I spent eons optimizing every single loop.
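
A quick sketch of that back-of-the-envelope budget (my numbers, assuming the 450 MFLOPS figure from the parallel benchmark above):
Code:
const float mflops    = 450.0e6f;      // best-case rate from the parallel benchmark
const float fs        = 44100.0f;      // sample rate
const float perSample = mflops / fs;   // ≈ 10,204 flops available per sample, best case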

Glad I bought a teensy 4.1 now :)
 
450 MFLOPS would give a best case of about 10k flops per sample at 44.1 kHz - yeah, I know real-world figures would be a lot less unless I spent eons optimizing every single loop.
You can get close to optimal with a small investment in learning what the optimizing compiler needs to know from you to do a good job.

Surprisingly, this relatively ancient article I wrote may still be a helpful grounding:
http://www.adrianfreed.com/content/guidelines-signal-processing-applications-c

The reason it is relevant is that I wrote it after building a fast music signal processing library and tuning it for a RISC MIPS chip (on the SGI O2), which is not very different architecturally from the ARM core in the Teensy 4.1.
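
A minimal sketch (mine, not from the article) of the kind of inner loop that gives the compiler what it needs: __restrict-qualified pointers so it knows the buffers don't alias, a block size known at compile time so it can unroll, and the filter state held in a local so it stays in a register:
Code:
#define BLOCK 128

static void onePole(float *__restrict out, const float *__restrict in,
                    float *__restrict state, float damp)
{
  float s = *state;                  // keep the state in a register across the loop
  for (int i = 0; i < BLOCK; i++) {
    s += (in[i] - s) * damp;         // same 6 dB one-pole as the benchmark
    out[i] = s;
  }
  *state = s;                        // write it back once
}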
 
thanks Adrian - some great beginner's advice there!

Personally, I find that understanding your algorithm and understanding the platform specifics (memory/ALU/pipeline etc.) are entangled - and that entanglement is the key to optimization. It's a "chicken and egg" thing: what do you optimize - the algorithm or the resulting code? Or do you tweak both until you converge on an optimal solution!
 
You can get close to optimal with a small investment in learning what the optimizing compiler needs to know from you to do a good job.

Surprisingly, this relatively ancient article I wrote may still be a helpful grounding:
http://www.adrianfreed.com/content/guidelines-signal-processing-applications-c

The reason it is relevant is that I wrote it after building a fast music signal processing library and tuning it for a RISC MIPS chip (on the SGI O2), which is not very different architecturally from the ARM core in the Teensy 4.1.

@adrianfreed, very helpful. I read your essay on programming, too, and watched the flockumentary. :) I'm also in the East Bay.
 
Cortex-M7 superscalar speedup: Teensy 4, NXP 1170, and F767ZI

FWIW:
If I use "double" instead of "float" in the sketch in post #4, I don't see a superscalar speedup.

Code:
Teensy 4 @600 MHz  Faster  gcc 5.4.1
float
  bench1: 0.999970, 2133434 us  MFlops 224.99
  bench2: 0.999982 1.999964, 2233449 us  MFlops 429.83
double
  bench1: 1.000000, 4266895 us  MFlops 112.49
  bench2: 1.000000 2.000000, 7833701 us  MFlops 122.55

NXP 1170 @996 MHz  SDK -O3  gcc 9.3.1
float
  bench1: 0.999970, 1285141 us  MFlops 373.50
  bench2: 0.999982 1.999964, 1345386 us  MFlops 713.55
double
  bench1: 1.000000, 2570261 us  MFlops 186.75
  bench2: 1.000000 2.000000, 4718856 us  MFlops 203.44

NXP 1010 @500 MHz  SDK -O3  gcc 10.3.1
float
  bench1: 0.999970, 2560003 us  MFlops 187.50
  bench2: 0.999982 1.999964, 2560006 us  MFlops 375.00

mbed F767ZI @216 MHz  online mbed arm cc 5.6.075  -O3
float
  bench1: 0.999970, 6666668 us  MFlops 72.00
  bench2: 0.999982 1.999964, 6666669 us  MFlops 144.00
    (mbed CLI  gcc 7.3.1  -O3   81 and 154 MFlops)
double
  bench1: 1.000000, 12592498 us  MFlops 38.12
  bench2: 1.000000 2.000000, 19259148 us  MFlops 49.85

Note: gcc 11.3.1 doesn't show superscalar speedup on Teensy 4 (1/25/24)
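
For anyone wanting to repeat the float/double comparison, here's one way you might parameterize the post-#4 benchmark on precision. This is a hypothetical, shortened sketch (two updates per chain instead of eight), not the exact code behind the table above:
Code:
typedef float real_t;   // change to double and rebuild to repeat the comparison

void benchT(void)
{
  elapsedMillis floptime;
  real_t res = 0, res2 = 1;
  real_t target = 1, target2 = 2;
  real_t damp = (real_t)0.0016725;
  for (int a = 0; a < 20000000; a++)
  {
    res  += (target  - res)  * damp;   // chain 1
    res  += (target  - res)  * damp;
    res2 += (target2 - res2) * damp;   // chain 2, independent of chain 1
    res2 += (target2 - res2) * damp;
  }
  Serial.printf("Math: %f %f, %i ms MFlops %5.2f\n", (double)res, (double)res2,
                (int)floptime, (20 * 12) / (floptime / 1000.f)); // 4 updates * 3 flops each per iteration
}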

more floating point performance data from Teensy 4 beta testing ...

On the M7, if you issue 20 NOPs and count cycles with ARM_DWT_CYCCNT, you'll see the count is 10 cycles.
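
For reference, a minimal way to reproduce that measurement on a Teensy 4.x (this assumes the DWT cycle counter is already running, which the stock startup code takes care of):
Code:
void nopTest(void)
{
  uint32_t t0 = ARM_DWT_CYCCNT;              // free-running DWT cycle counter
  asm volatile("nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop\n"
               "nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop");
  uint32_t cycles = ARM_DWT_CYCCNT - t0;
  Serial.printf("20 NOPs: %lu cycles\n", (unsigned long)cycles); // ~10 with dual-issue, plus a cycle or two of read overhead
}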

"The Cortex®-M7 core has a 6-stage dual-issue pipeline for efficient operation. It brings the ability to process 2 instructions in parallel if certain criteria are fulfilled."

Cortex-M7 instruction cycle counts, timings, and dual-issue combinations.
https://www.quinapalus.com/cm7cycles.html
 