Poll: Binary floating point txfr from t3.5, 4.x to host?

LenSamuelson · Jun 26, 2020

Hi everyone, with floating-point hardware increasingly common in recent NXP parts, including those on the T3.5, 3.6 and 4.x, I wonder whether anyone here has begun using binary transfer of single-precision floating-point values between hosts and Teensies, and if so, whether it has caused unexpected or subtle problems?

My project at work would benefit from transferring a variety of conversion constants between the embedded side and a host system. I would like to avoid the development, documentation and potential reliability costs of manually scaling these constants, and rely on single-precision values instead. Our project does not depend on corner cases like NaNs, denormals, etc.

My initial experiments show that it works, with the host-resident code using the python "struct" library module as well as C language code interpreting the same values in a C structure.

I am old enough to remember the challenges of interoperability with every manufacturer supporting a different floating-point format.

joepasquariello · Jun 26, 2020

Sounds like you've got it working. Floating point formats have long been standardized via IEEE 754. The only processor I've ever worked with that did not use IEEE 754 was the TI TMS320C30 DSP. I don't know of any new processors that don't use the IEEE standard. If the Python library allows you to specify the endian-ness of the source/server (Teensy) and the client, and performs byte reversal if necessary, you should be fine.

LenSamuelson · Jun 27, 2020

Thanks for the response. Given that nobody has reported obvious problems with binary FP interoperability, I'll just go with it and verify the results carefully. The project already uses FP arithmetic on the embedded side with great success, so it does not seem like much of a risk.

joepasquariello · Jun 27, 2020

IEEE 754 was established in 1985. It fixes the bit-level encoding of each format, so a given value has the same 32-bit pattern on any compliant processor. It does not specify the byte order in memory, so the two ends still have to agree on that, but otherwise I don't think you have much to worry about.

Nominal Animal · Jun 28, 2020

There is no issue, as long as you explicitly specify which byte order you use in the transfers.

On some host hardware architectures, accessing an unaligned float may not work, and byte order conversion may be needed. I recommend combining these. For example:

Code:

#include <stdint.h>

static inline float  get_float32le(const void *ptr)
{
    const unsigned char *const  data = ptr;
    union {
        float     f;
        uint32_t  u;
    } temp;
    temp.u = data[0] | ((uint32_t)(data[1]) << 8) | ((uint32_t)(data[2]) << 16) | ((uint32_t)(data[3]) << 24);
    return temp.f;
}

Instead of *(float *)(mybuffer + offset) you use get_float32le(mybuffer + offset). This is equivalent to using Python struct.unpack("<f", mybuffer[offset:offset+4]).

While it looks like the above function is "slow", using GCC on x86-64 with -O2 (as is common and recommended), it optimizes to just a single MOVSS SSE/AVX machine instruction. (Verified on GCC 7.5.0.)
The code itself is portable, and will work on any architecture where 'float' is IEEE-754 Binary32.

Teensy 3.x and 4.x use IEEE-754 Binary32 in little-endian byte order, so as long as you observe correct alignment, you don't need to do anything special.

The inverse, storing a float in little-endian byte order to an unaligned buffer, (equivalent to Python struct.pack("<f", value)) is e.g.

Code:

#include <stdint.h>

static inline void set_float32le(void *dst, const float value)
{
    unsigned char *const data = dst;
    const union {
        float     f;
        uint32_t  u;
    } temp = { .f = value };

    data[0] = temp.u;
    data[1] = temp.u >> 8;
    data[2] = temp.u >> 16;
    data[3] = temp.u >> 24;
}

but GCC tends to compile it to less optimal code, not just a single store.
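If both ends are known to be little-endian (as Teensy 3.x/4.x and x86-64 are), one alternative is to let memcpy() handle the unaligned access. This is only a minimal sketch under that assumption: it does no byte swapping at all, and the names load_float32_native and store_float32_native are made up for illustration.

```c
#include <string.h>

/* Assumption: the buffer uses the same byte order as the host, so no
   byte swapping is needed. memcpy() makes the unaligned access safe. */
static inline float load_float32_native(const void *src)
{
    float value;
    memcpy(&value, src, sizeof value);
    return value;
}

static inline void store_float32_native(void *dst, const float value)
{
    memcpy(dst, &value, sizeof value);
}
```

Compilers usually turn a fixed-size memcpy() like this into a plain load or store, but that is worth checking on your own toolchain rather than taking on faith.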

Over a decade ago, I wrote some routines in Fortran and C to store and access IEEE-754 Binary64 ("double") data in arbitrary byte order, using a "prototype" value for both byte order and format identification.

For example, 65432.125 in IEEE-754 Binary32 ("float") is 0_10001110_(1)11111111001100000100000 in binary, and corresponds to 32-bit unsigned integer 1199544352 = 0x477F9820 if the floating-point and integer byte orders are the same.
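To check that breakdown yourself, here is a small sketch that pulls the sign, exponent and fraction fields out of a float. It assumes 'float' is Binary32 and that floats and 32-bit integers share the same byte order on the machine running it; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Raw Binary32 bit pattern of a float, as a host-order integer. */
static uint32_t float32_bits(const float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Field extractors: 1 sign bit, 8 exponent bits, 23 fraction bits. */
static unsigned f32_sign(const uint32_t bits)
{
    return (unsigned)(bits >> 31);
}

static unsigned f32_exponent(const uint32_t bits)
{
    return (unsigned)((bits >> 23) & 0xFFu);    /* biased by 127 */
}

static uint32_t f32_fraction(const uint32_t bits)
{
    return bits & UINT32_C(0x007FFFFF);         /* implicit leading 1 not stored */
}
```

For 65432.125f, float32_bits() gives 0x477F9820, with sign 0, biased exponent 142 (that is, 2^15) and fraction 0x7F9820.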

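Counting only the orders you get by swapping bytes within 16-bit pairs, 16-bit halves within 32-bit words, and (for 64-bit values) the two 32-bit halves, there are four possible byte orders for 32-bit values, and eight for 64-bit values. Floating-point accessors that can use any of these byte orders are e.g.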

Code:

#include <stdint.h>

float float32(const void *const src, const unsigned char order)
{
    const unsigned char *const data = src;
    union {
        float           f;
        uint32_t        u;
        unsigned char   c[4];
    } temp = { .c = { data[0], data[1], data[2], data[3] } };

    if (order & 1)
        temp.u = ((temp.u & 0x00FF00FF) << 8)
               | ((temp.u >> 8) & 0x00FF00FF);

    if (order & 2)
        temp.u = ((temp.u & 0x0000FFFF) << 16)
               | ((temp.u >> 16) & 0x0000FFFF);

    return temp.f;
}

double float64(const void *const src, const unsigned char order)
{
    const unsigned char *const data = src;
    union {
        double          f;
        uint64_t        u;
        unsigned char   c[8];
    } temp = { .c = { data[0], data[1], data[2], data[3], data[4], data[5], data[6], data[7] } };

    if (order & 1)
        temp.u = ((temp.u & UINT64_C(0x00FF00FF00FF00FF)) << 8)
               | ((temp.u >> 8) & UINT64_C(0x00FF00FF00FF00FF));

    if (order & 2)
        temp.u = ((temp.u & UINT64_C(0x0000FFFF0000FFFF)) << 16)
               | ((temp.u >> 16) & UINT64_C(0x0000FFFF0000FFFF));

    if (order & 4)
        temp.u = ((temp.u & UINT64_C(0x00000000FFFFFFFF)) << 32)
               | ((temp.u >> 32) & UINT64_C(0x00000000FFFFFFFF));

    return temp.f;
}

which can be used for data access, but more importantly can be used to test if the prototype values are recognized:

Code:

int float32_endian(const void *src, const float prototype)
{
    int  order;

    for (order = 0; order < 4; order++)
        if (float32(src, order) == prototype)
            return order;

    return -1;
}

int float64_endian(const void *src, const double prototype)
{
    int order;

    for (order = 0; order < 8; order++)
        if (float64(src, order) == prototype)
            return order;

    return -1;
}

Both functions return the byte order that parses the value as the prototype, or -1 if the format does not match. Simples!
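As a usage illustration, the sketch below restates the 32-bit accessor in compact form so that it stands alone, and then detects the order of a buffer from a prototype value. The names read_float32 and detect_float32_order are hypothetical, and it assumes 'float' is Binary32.

```c
#include <stdint.h>
#include <string.h>

/* Same idea as float32() above: bit 0 of 'order' swaps the bytes in each
   16-bit pair, bit 1 swaps the two 16-bit halves. */
static float read_float32(const unsigned char *data, const unsigned order)
{
    uint32_t u;
    float    f;

    memcpy(&u, data, sizeof u);
    if (order & 1)
        u = ((u & UINT32_C(0x00FF00FF)) << 8)
          | ((u >> 8) & UINT32_C(0x00FF00FF));
    if (order & 2)
        u = ((u & UINT32_C(0x0000FFFF)) << 16)
          | ((u >> 16) & UINT32_C(0x0000FFFF));
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Returns the order (0..3) that makes the data read as 'prototype',
   or -1 if none does. */
static int detect_float32_order(const unsigned char *data, const float prototype)
{
    for (int order = 0; order < 4; order++)
        if (read_float32(data, (unsigned)order) == prototype)
            return order;
    return -1;
}
```

With a prototype like 65432.125f, whose four bytes are all different, a buffer written in the host's own order is detected as 0 and a fully byte-reversed one as 3. A prototype with repeated bytes could match more than one order, so pick the value with that in mind.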

For bulk data conversion, I wrote optimized versions for no-byte-order-change (0, basically memmove()), and reverse-byte-order-change (~0), with the above "slow" version handling any of the other byte orders since I've never encountered them in the wild (but who knows, might exist). It is not human-slow, however; we're talking about whether accessing a gigabyte of data takes no appreciable time, or a fraction of a second.
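For the full-reversal case, a bulk conversion can be as simple as the sketch below. It is not my original optimized routine, just a plain illustration with a made-up name (reverse_float32_array); it treats the buffer as raw bytes, so alignment does not matter.

```c
#include <stddef.h>

/* Reverse the byte order of 'count' consecutive 32-bit values in place. */
static void reverse_float32_array(void *buffer, const size_t count)
{
    unsigned char *p = buffer;

    for (size_t i = 0; i < count; i++, p += 4) {
        unsigned char t;
        t = p[0]; p[0] = p[3]; p[3] = t;    /* swap outer bytes */
        t = p[1]; p[1] = p[2]; p[2] = t;    /* swap inner bytes */
    }
}
```

Applying it twice restores the original data, which makes for an easy sanity check.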

This may be relevant to some Teensyduino developers. If you implement a Teensy gadget that stores binary data to files on an SD card, you might wish to make the format portable by adding a header that contains a suitable prototype value for each type (float, uint32_t, et cetera) you use, and order the elements in the structure so that they are aligned to their size and no padding is needed. Then, an application that processes/converts those files can trivially check and compensate for the byte order, if it ever happened to differ from what the Teensy uses – without having to do any sort of #if - #endif preprocessor macro shenanigans. Even the above "slow" arbitrary-byte-order accessor functions are so fast on current computers that accessing a few million entries takes an insignificant fraction of a second, much less than a human can perceive; but the code itself is perfectly portable and byte-order-agnostic, and lets you just not worry about it.
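A header along those lines might look like the sketch below. Everything in it is an assumption made for illustration: the struct log_header layout, the "TLOG" tag and the prototype values are invented, and the check only distinguishes "same order as this machine" from "fully reversed".

```c
#include <stdint.h>
#include <string.h>

#define PROTO_U32  UINT32_C(0x01020304)
#define PROTO_F32  65432.125f

/* Each field is aligned to its own size, so no padding is expected. */
struct log_header {
    char      magic[4];     /* hypothetical file tag */
    uint32_t  proto_u32;    /* always written as PROTO_U32 */
    float     proto_f32;    /* always written as PROTO_F32 */
};

/* Writer side (e.g. on the Teensy): fill in the header. */
static void log_header_init(struct log_header *h)
{
    memcpy(h->magic, "TLOG", 4);
    h->proto_u32 = PROTO_U32;
    h->proto_f32 = PROTO_F32;
}

static uint32_t swap32(const uint32_t u)
{
    return (u >> 24) | ((u >> 8) & UINT32_C(0x0000FF00))
         | ((u << 8) & UINT32_C(0x00FF0000)) | (u << 24);
}

/* Reader side: returns 0 if the file uses this machine's byte order,
   1 if it is byte-reversed, -1 if the header is not recognized. */
static int log_header_check(const unsigned char raw[12])
{
    const float proto = PROTO_F32;
    uint32_t    u, fbits, want;

    if (memcmp(raw, "TLOG", 4) != 0)
        return -1;
    memcpy(&u, raw + 4, 4);
    memcpy(&fbits, raw + 8, 4);
    memcpy(&want, &proto, 4);   /* assumes 'float' is Binary32 */

    if (u == PROTO_U32 && fbits == want)
        return 0;
    if (swap32(u) == PROTO_U32 && swap32(fbits) == want)
        return 1;
    return -1;
}
```

The reader never needs to know what hardware wrote the file; it only compares what it reads against the values it expects.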

Just like JPEG and PNG files, using binary data does not need to mean "unportable" or "hardware-specific". It takes just a bit of thinking beforehand, and verifying the code works as intended.

(My routines were used to let a distributed molecular dynamics simulation with a couple of hundred million atoms store snapshots of the system locally, with minimal delay to the simulation itself, and keeping individual files to a manageable size; with a helper library (in Fortran and in C) and a helper utility using that library that allowed the user to slice the system in time and/or space, outputting the slice in a standard format, while accessing the data spread over a large number of files. The entire dataset was just under two terabytes, if I recall correctly.)

LenSamuelson · Jun 28, 2020

Sounds positive all around. It supports my hypothesis and early test results that communicating basic configuration and diagnostic constants will work predictably. We are already ensuring proper alignment and byte ordering (I use the Python struct module in support code, and properly aligned structures in C). Our implementation includes the equivalent of "protocol numbers" in our message flows, so receivers can grok content structure from messages.

Many thanks for all the ideas, it's good to be confident in moving forward with low probability of wasting time.
