Teensy 4.X FPU

hoek67

Active member
Although I have yet to get a Teensy 4.1 to fiddle with, I have compiled some code for it to check the FPU disassembly, hopefully to see what is handled and how (and if).

Just wondering if there is a simple outline of what's handled on the FPU and what isn't.

Just did some quick tests and found:

float <--> int conversion is handled, and better still, it can be done in place within the same register.

Normal multiplication, addition, etc. all seem to be supported.

Trig functions like sinf, cosf, tanf, etc. are NOT. However, because the library functions they call use a lot of addition, multiplication, and conversion to and from integer, I would assume they should still be faster overall than a pure software approach.

It seems sqrtf is also not handled, but it may still get a general speed-up.

All up, it looks quite good, as sinf, sqrtf, etc. were all handled by SSE2 etc. on the x86 platform. ;)
 

Not really. Yes, the 387 and later floating point units had the "fsin", etc. instructions for doing sine calculations using the 387 floating point stack. But GCC no longer generates code to use the "fsin" instruction unless the operand is long double (i.e. 80-bit) and an appropriate fast-math option was used on the compilation line. It turns out that on later x86-compatible chips, "fsin" is very slow, and it also turns out to be less accurate than the documentation says. So for float arguments to sinf and double arguments to sin, GCC just calls the math library function, where the computation is optimized. By doing the appropriate Taylor series expansion, the library can generate a faster result for float/double than by using the 80-bit arithmetic, and the error bounds are also more under control.

If you use the option "-mfpmath=387" to disable the SSE floating point support, the compiler will generate code to use "fsin". But all modern x86 targets default fpmath to SSE or higher for float/double arguments, and they don't use the old 387 instructions except for 80-bit long double support.

In addition, the compiler will always call the math library function unless the fast-math options are used, since the math functions are specified to set errno in some cases.

Square root, on the other hand, is done with the 387 instructions, because using the hardware is faster than doing the expansion.

Note, this is mostly from memory. I haven't worked on the GCC 386/x86_64 support for at least 11 years now, but I did use GCC 11.2 to try to generate "fsin", and I looked at the current i386 machine description.

In terms of ARM, some of the ARM hardware has explicit square root instructions for 32-bit and 64-bit (on ARM, long double is 64-bit), so the compiler generates these if fast math is used. IIRC, adding sqrt to a machine does not add much to the hardware circuitry if hardware division is already supported, which is why machines like the ARM and PowerPC have square root instructions. Note, unlike the x86 and quite a few other processors, I have never grokked the ARM instruction set, so I was only going by what the arm-linux-gnu-gcc and aarch64-linux-gnu-gcc compilers generated on my Linux system.
 
Haha... hence why I decided to see what happens, as there are so many conditions on what is done.

For a 3D engine I have been writing, I always have SSE2 and know sinf/cosf and sqrt are hardware.

I have an all-integer FFT that runs like a cut snake, for a reason I guess. For what I use it for, it's more than adequate. I should put the code up, as it may be improved on.
 

Yup. Probably take a look at the audio lib, which has this.
 

Ummm, as I said above, yes, the x86 has instructions to do sine and cosine in hardware, but those instructions are microcoded, which means they are slow. They also do not return correct results in the lower bits in some cases. So for float and double, glibc does not use these instructions. Instead, it does the Taylor series to calculate the answer faster and more accurately than the hardware instruction. Sqrt is different from sin/cos and is often implemented in hardware.

And in terms of the Teensys, only the Teensy 4.0, 4.1, and MicroMod have both double and single precision support in hardware for the basic instructions. The Teensy 3.5 and 3.6 have support for single precision but not double precision, and the Teensy 3.2/LC have no hardware floating point support at all.
 
For a Teensy 4.1, what is the difference between a double and a long double? (How many bits for each?) Are they IEEE compliant? I guess the specs were written around Intel processors.

How does one declare them? Just double and long double? And how does one write a literal of each type? For float it would be 1.01f; how is this done for double and long double?
 
On the Teensy 4.1, double and long double are both implemented with the same 64 bits. So while there are some finer points of compiler semantics, ultimately they both compile to exactly the same implementation.
 
64 bits is good enough, so there seems to be no advantage to long double.

From my ancient copy of K&R, the book states that floating-point constants have type double unless suffixed. Is this true for the Teensy 4.1 / Arduino compiler today? I don't know why it would change, but I thought I'd ask.

For the Teensy 4.x, where do I find the execution time (in clock cycles) for double-precision math operations? ARM docs are tough to wade through; they manage to be both wordy and vacuous at the same time!
Would that be found in the equivalent of a Teensy SDK?
 
This webpage has a description of the FPU instructions and their timing. There's also a PDF version of the M4 DSP instructions.

Pete

I'll look, but is there an equivalent for the M7? Assuming M4 and M7 timings are the same seems a bit presumptuous!

Apparently, M7 cycle times are hidden behind ARM's registration wall. I wonder why they do that? Why is that information privileged? ARM is an odd company. Besides, the Teensy is made by NXP (using ARM IP), so is there an NXP document that shows the cycle times? Isn't this an important thing to know when sizing a processor for use?
 
I've not been able to find one for the M7, so I have presumed that the M4 info is a lower bound on the performance of the M7 :)

Pete
 
I can't explain why ARM is so secretive about some types of information. I'm pretty sure there's nothing any of us can say which will convince them to act differently.

But I can tell you we've had many conversations on this forum where attempts to benchmark tiny pieces of code using the cycle counter register have shown that a lot of difficult-to-anticipate factors come into play. The CPU is dual issue, but only normal integer instructions are capable of executing two per cycle. The M7 has a longer pipeline than the M4 and earlier cores. It also has branch prediction and a mix of tightly coupled memory plus memory & peripherals accessed over a slower 64-bit AXI bus, which is supported by two large caches: 32K for instructions and 32K for data.

In practice this all adds up to a lot of complex behavior with regard to the actual number of cycles any particular small piece of code requires. Many people who've tried the approach common with older microcontrollers, counting the number of cycles for each instruction, have found a lot of frustration with how difficult any particular instruction's cost is to predict. What is already in the dual-issue pipeline, the branch prediction buffer, and the caches plays a strong role in the actual cycles taken, even when all instruction and data access is to the tightly coupled memory.
 