Is it faster to use 32-bit operations on 32-bit processors?


gfvalvo

Hi all. Just wondering, in cases where execution speed is the primary concern, is it generally better to use native-sized 32-bit integers on T3.x boards even though I don’t need values that large? I’m mainly talking about integer arithmetic, ‘for’ loop indices, array indices, bit manipulation, etc.

Just thinking that using 8 or 16-bit integers might involve extra packing / unpacking, shifting, masking, etc.

Thanks.

Greg
 
Yes, 32 bit variables are occasionally faster. Usually the speed is identical, but in some cases like inputs to function calls, the compiler does have to add extra instructions to mask to 8 or 16 bits when you use less than 32 bits.
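Here is a small sketch of where that masking shows up (function names made up for illustration). The 8-bit version typically gets an extra instruction (UXTB on Cortex-M) to truncate the sum back to 8 bits before returning it, while the 32-bit version is just a plain ADD:

Code:
#include <stdint.h>

// 8-bit version: the compiler usually has to mask the sum back to
// 8 bits (e.g. with UXTB on Cortex-M) before returning it.
uint8_t sum8(uint8_t a, uint8_t b) {
  return a + b;
}

// 32-bit version: a plain ADD, no extra masking needed.
uint32_t sum32(uint32_t a, uint32_t b) {
  return a + b;
}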
 
On Teensy 3.5 and 3.6, 32 bit float is approximately the same speed as integers, due to the FPU. The FPU adds many extra float registers, so in some cases float can be faster than integer.

Because the processor uses a 3 stage pipeline without branch prediction, one of the most expensive operations is conditional tests. You can sometimes get quite a good speed increase by replacing if-else code with even fairly complex expressions. Sometimes float variables can be used in ways that don't require if-else checks for numerical ranges as integer, which can give float-based code a significant speed advantage.
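As a rough sketch of the idea (names made up): both functions below compute the same clamp, but writing it as a single expression gives the compiler a better chance to emit a couple of conditionally executed instructions instead of a taken branch that restarts the pipeline:

Code:
#include <stdint.h>

// If-else version: often compiled as a compare plus a conditional branch.
uint32_t clamp_if(uint32_t x, uint32_t limit) {
  if (x > limit) {
    x = limit;
  }
  return x;
}

// Expression version: same result, but the compiler can usually turn
// this into straight-line conditionally executed code with no branch.
uint32_t clamp_expr(uint32_t x, uint32_t limit) {
  return (x > limit) ? limit : x;
}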

Cortex-M4 also has DSP extension instructions that can give a substantial speedup to specific 16 bit math, but they are quite difficult to use.
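For example, here is a hedged sketch of a 16-bit dot product using the CMSIS __SMLAD intrinsic (assuming the CMSIS core headers that Teensyduino ships are available via arm_math.h; the function name is made up). SMLAD does two signed 16x16 multiplies and adds both products to the accumulator in a single instruction:

Code:
#include <stdint.h>
#include <arm_math.h>   // assumption: CMSIS headers providing __SMLAD are available

// Dot product of two int16_t arrays, processing two samples per loop.
// Real DSP code would load 32 bits at a time; the explicit packing
// here just keeps the example simple.
int32_t dot16(const int16_t *a, const int16_t *b, uint32_t n) {
  uint32_t acc = 0;
  for (uint32_t i = 0; i + 1 < n; i += 2) {
    // Pack two 16-bit samples into one 32-bit word (low half = sample i).
    uint32_t va = (uint16_t)a[i] | ((uint32_t)(uint16_t)a[i + 1] << 16);
    uint32_t vb = (uint16_t)b[i] | ((uint32_t)(uint16_t)b[i + 1] << 16);
    // One instruction: acc += a[i]*b[i] + a[i+1]*b[i+1] (signed 16-bit lanes).
    acc = __SMLAD(va, vb, acc);
  }
  return (int32_t)acc;
}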
 
You can sometimes get quite a good speed increase by replacing if-else code with even fairly complex expressions.
So, assuming 'index' and 'limit' are both uint32_t, would you do this:

Code:
  if (++index >= limit) {
    index = 0;
  }

or this:

Code:
index = (index + 1) % limit;

Of course, if 'limit' could be confined to a power of 2, then just mask out all but the lower order bits.
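For example (just a sketch), with a power-of-2 limit the whole wrap becomes a single AND:

Code:
index = (index + 1) & (limit - 1);   // same result as % limit when limit is a power of 2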
 
Hi all. Just wondering, in cases where execution speed is the primary concern, is it generally better to use native-sized 32-bit integers on T3.x boards even though I don’t need values that large? I’m mainly talking about integer arithmetic, ‘for’ loop indices, array indices, bit manipulation, etc.

Just thinking that using 8 or 16-bit integers might involve extra packing / unpacking, shifting, masking, etc.

Thanks.

Greg

In general, it depends on the low-level details of the processor. Neither Arm nor AVR are architectures that I've done compiler support for, so I can't say what they support and what they don't.

Note, the ISO C/C++ standards say that char/short values are logically converted to int when used in an expression. Typically, most machines provide direct instructions to load and store 8-bit and 16-bit values into and out of 32-bit or 64-bit registers. The arithmetic is done with 32-bit or 64-bit instructions, and then the store writes only the bottom 8 or 16 bits.
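A small sketch of what that means in practice (variable names made up):

Code:
#include <stdint.h>

uint8_t a, b, c;

void demo() {
  // a and b are promoted to int, the addition is a full-width ADD, and
  // only the store back into the 8-bit variable c truncates the result.
  c = a + b;
}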

On some 64-bit machines there are no 32-bit instructions, so the compiler occasionally has to insert a conversion to 32 bits if the expression is done in int rather than long. But since the Teensy is 32-bit, that isn't an issue for you.

FWIW, PowerPC does not have an 8-bit load with sign extension, so loading a signed char takes two instructions: an 8-bit load with zero extension followed by a sign extend. It does, however, have 16-bit and 32-bit loads with either sign or zero extension.

Beware of premature optimization. The things that you think are going to be the bottlenecks may not be where the chip is spending its time.
 
gfvalvo - the first example will be faster, simply because division and modulo are multi-cycle operations on ARM. If the expression used only +, -, * and logical operations, then it would be faster to compute directly.
And don't forget that ARM has conditional execution, so checks with only 1-2 instructions inside will execute faster than a pipeline restart.
But of course if your "limit" is a power of two there is a better way :).
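For example (just a sketch), the first posted example rewritten as an expression, which avoids the divide entirely and leaves only a short conditional for the compiler to turn into conditionally executed instructions:

Code:
uint32_t next = index + 1;
index = (next < limit) ? next : 0;   // compare + conditional move, no UDIV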
 
And don't forget that ARM has conditional execution, so checks with only 1-2 instructions inside will execute faster than a pipeline restart.

Does this feature really exist in Thumb2-based Cortex-M4?

There is an IT (If-Then) instruction that applies conditions to up to the next 4 opcodes, but as far as condition codes embedded within each opcode, I'm pretty sure that's only available in standard ARM mode, not in Thumb mode.
 
The compiler has already complained about some of my inline asm, saying that instruction xyz was not allowed in an IT block, so I think there are definitely limits.
 
Not sure about Cortex-M4 specifically, but it is available in Thumb modes.
More importantly, in the posted example the division/modulo (unless the divisor is a power of 2) will definitely be slower than a pipeline restart.
 