Is it faster to use 32-bit operations on 32-bit processors?


gfvalvo

Hi all. Just wondering, in cases where execution speed is the primary concern, is it generally better to use native-sized 32-bit integers on T3.x boards even though I don’t need values that large? I’m mainly talking about integer arithmetic, ‘for’ loop indices, array indices, bit manipulation, etc.

Just thinking that using 8 or 16-bit integers might involve extra packing / unpacking, shifting, masking, etc.

Thanks.

Greg
 
Yes, 32 bit variables are occasionally faster. Usually the speed is identical, but in some cases like inputs to function calls, the compiler does have to add extra instructions to mask to 8 or 16 bits when you use less than 32 bits.
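Here is a small sketch of where that masking shows up (function names made up for illustration). The 8-bit version typically gets an extra instruction (UXTB on Cortex-M) to truncate the sum back to 8 bits before returning it, while the 32-bit version is just a plain ADD:

Code:
#include <stdint.h>

// 8-bit version: the compiler usually has to mask the sum back to
// 8 bits (e.g. with UXTB on Cortex-M) before returning it.
uint8_t sum8(uint8_t a, uint8_t b) {
  return a + b;
}

// 32-bit version: a plain ADD, no extra masking needed.
uint32_t sum32(uint32_t a, uint32_t b) {
  return a + b;
}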
 
On Teensy 3.5 and 3.6, 32 bit float is approximately the same speed as integers, due to the FPU. The FPU adds many extra float registers, so in some cases float can be faster than integer.

Because the processor uses a 3 stage pipeline without branch prediction, one of the most expensive operations is conditional tests. You can sometimes get quite a good speed increase by replacing if-else code with even fairly complex expressions. Sometimes float variables can be used in ways that don't require if-else checks for numerical ranges as integer, which can give float-based code a significant speed advantage.
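As a rough sketch of the idea (names made up): both functions below compute the same clamp, but writing it as a single expression gives the compiler a better chance to emit a couple of conditionally executed instructions instead of a taken branch that restarts the pipeline:

Code:
#include <stdint.h>

// If-else version: often compiled as a compare plus a conditional branch.
uint32_t clamp_if(uint32_t x, uint32_t limit) {
  if (x > limit) {
    x = limit;
  }
  return x;
}

// Expression version: same result, but the compiler can usually turn
// this into straight-line conditionally executed code with no branch.
uint32_t clamp_expr(uint32_t x, uint32_t limit) {
  return (x > limit) ? limit : x;
}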

Cortex-M4 also has DSP extension instructions that can give a substantial speedup to specific 16 bit math, but they are quite difficult to use.
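For example, here is a hedged sketch of a 16-bit dot product using the CMSIS __SMLAD intrinsic (assuming the CMSIS core headers that Teensyduino ships are available via arm_math.h; the function name is made up). SMLAD does two signed 16x16 multiplies and adds both products to the accumulator in a single instruction:

Code:
#include <stdint.h>
#include <arm_math.h>   // assumption: CMSIS headers providing __SMLAD are available

// Dot product of two int16_t arrays, processing two samples per loop.
// Real DSP code would load 32 bits at a time; the explicit packing
// here just keeps the example simple.
int32_t dot16(const int16_t *a, const int16_t *b, uint32_t n) {
  uint32_t acc = 0;
  for (uint32_t i = 0; i + 1 < n; i += 2) {
    // Pack two 16-bit samples into one 32-bit word (low half = sample i).
    uint32_t va = (uint16_t)a[i] | ((uint32_t)(uint16_t)a[i + 1] << 16);
    uint32_t vb = (uint16_t)b[i] | ((uint32_t)(uint16_t)b[i + 1] << 16);
    // One instruction: acc += a[i]*b[i] + a[i+1]*b[i+1] (signed 16-bit lanes).
    acc = __SMLAD(va, vb, acc);
  }
  return (int32_t)acc;
}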
 
You can sometimes get quite a good speed increase by replacing if-else code with even fairly complex expressions.
So, assuming 'index' and 'limit' are both uint32_t, would you do this:

Code:
  if (++index >= limit) {
    index = 0;
  }

or this:

Code:
index = (index + 1) % limit;

Of course, if 'limit' could be confined to a power of 2, then just mask out all but the lower order bits.
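For example (just a sketch), with a power-of-2 limit the whole wrap becomes a single AND:

Code:
index = (index + 1) & (limit - 1);   // same result as % limit when limit is a power of 2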
 
Hi all. Just wondering, in cases where execution speed is the primary concern, is it generally better to use native-sized 32-bit integers on T3.x boards even though I don’t need values that large? I’m mainly talking about integer arithmetic, ‘for’ loop indices, array indices, bit manipulation, etc.

Just thinking that using 8 or 16-bit integers might involve extra packing / unpacking, shifting, masking, etc.

Thanks.

Greg

In general, it depends on the low-level details of the processor. Neither Arm nor AVR are architectures that I've done compiler support for, so I can't say what they support and what they don't.

Note, the ISO C/C++ standards say that char/short values are logically converted to int when used in an expression. Typically, most machines provide direct instructions to load and store 8-bit and 16-bit values into and out of 32-bit or 64-bit registers. The arithmetic is done with 32-bit or 64-bit instructions, and then the store writes only the bottom 8 or 16 bits.
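A small sketch of what that means in practice (variable names made up):

Code:
#include <stdint.h>

uint8_t a, b, c;

void demo() {
  // a and b are promoted to int, the addition is a full-width ADD, and
  // only the store back into the 8-bit variable c truncates the result.
  c = a + b;
}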

On some 64-bit machines there are no 32-bit instructions, so the compiler occasionally has to insert a conversion to 32 bits if the expression is done in int rather than long. But since the Teensy is 32-bit, that isn't an issue for you.

FWIW, PowerPC does not have an 8-bit load with sign extension, so loading a signed char takes two instructions: an 8-bit load with zero extension followed by a sign extend. It does, however, have 16-bit and 32-bit loads with either sign or zero extension.

Beware of premature optimization. The things that you think are going to be the bottlenecks may not be where the chip is spending its time.
 
gfvalvo - the first example will be faster, simply because division and modulo are multi-cycle operations on ARM. If the expression used only +, -, * and logical operations, then it would be faster to compute directly.
And don't forget that ARM has conditional execution, so checks with only 1-2 instructions inside will execute faster than a pipeline restart.
But of course if your "limit" is a power of two there is a better way :).
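For example (just a sketch), the first posted example rewritten as an expression, which avoids the divide entirely and leaves only a short conditional for the compiler to turn into conditionally executed instructions:

Code:
uint32_t next = index + 1;
index = (next < limit) ? next : 0;   // compare + conditional move, no UDIV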
 
And don't forget that ARM has conditional execution, so checks with only 1-2 instructions inside will execute faster than a pipeline restart.

Does this feature really exist in Thumb2-based Cortex-M4?

There is an IT (If-Then) instruction that applies conditions to up to the next 4 opcodes, but as far as condition codes embedded within each opcode, I'm pretty sure that's only available in standard ARM mode, not in Thumb mode.
 
The compiler has already complained about some of my inline asm, saying that instruction xyz was not allowed in an IT block, so I think there are definitely limits.
 
Not sure about Cortex-M4 specifically, but it is available in Thumb modes.
More importantly, in the posted example the division/modulo (unless the divisor is a power of 2) will definitely be slower than a pipeline restart.
 