As a general rule, uint32_t is most efficient. If your numbers are never negative, use uint32_t.
Usually there is no speed penalty for int32_t. Any tricks to avoid int32_t will usually cost far more than simply using int32_t.
Likewise, 32 bit floating point is almost as fast as integers, thanks to the FPU. The FPU adds many extra registers that aren't usually used for integers, so float can run surprisingly fast. So if you're working with data that is naturally fractional, extra code to shoehorn it into integers usually ends up running slower than simply using 32 bit float.
The one exception for float speed is interrupts. The ARM core uses "lazy stacking" for the FPU registers, which results in extra pushes and pops to memory once an interrupt uses the FPU. It's best to stick to integers only inside interrupts.
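One way to follow that advice is to let the interrupt only accumulate raw integer readings and do the float conversion later in the main program. This is just a minimal sketch: the names `sample_isr` and `read_average_volts`, and the 12 bit ADC with a 3.3 V reference, are assumptions made up for illustration.

```c
#include <stdint.h>

/* Hypothetical shared state: written by the interrupt, read by main code. */
static volatile uint32_t sample_sum = 0;
static volatile uint32_t sample_count = 0;

/* Hypothetical interrupt handler. Integer math only, so the compiler
   has no reason to touch the FPU registers here. */
void sample_isr(uint16_t raw_adc)
{
    sample_sum += raw_adc;
    sample_count += 1;
}

/* Called from the main program, where float is cheap.
   Assumes a 12 bit ADC and a 3.3 V reference (illustrative values). */
float read_average_volts(void)
{
    /* On real hardware, briefly disable the interrupt around these
       two reads so the pair stays consistent. */
    uint32_t sum = sample_sum;
    uint32_t count = sample_count;

    if (count == 0) {
        return 0.0f;
    }
    return ((float)sum / (float)count) * (3.3f / 4095.0f);
}
```

The float work still happens, just outside the interrupt, where it doesn't add to interrupt entry and exit cost.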
64 bit double is also implemented by the FPU, but at half the speed and twice the register pressure of 32 bit float. Be careful of the compiler's promotion rules: a decimal constant without a trailing "f" is a 64 bit double, so any math it touches gets promoted to 64 bits. Add the "f" to keep the constant, and the math, at 32 bits.
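To make the constant rule concrete, here is a small sketch with two made-up functions that differ only in the trailing "f". The function names are hypothetical; the promotion behavior is standard C.

```c
/* 0.1 is a double constant, so x is promoted to double, the multiply
   is done in 64 bits, and the result is converted back to float. */
float scale_double_constant(float x)
{
    return x * 0.1;
}

/* 0.1f is a float constant, so the multiply stays in 32 bit float. */
float scale_float_constant(float x)
{
    return x * 0.1f;
}
```

You can check the type the compiler picks without running anything: `sizeof(x * 0.1)` is the size of a double, while `sizeof(x * 0.1f)` is the size of a float.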
Usually 8 and 16 bit integers are as fast as 32 bits, but in some cases the compiler must add logical AND instructions to keep values within range. Function inputs are one of the most common examples. But if your code really is calling a function so often that the call overhead is substantial, you should probably make it an inline function anyway if you care about performance.
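Here is a rough illustration of where the extra masking comes from. The two helper names are made up for this example, and whether an extra instruction really appears depends on the compiler and optimization settings.

```c
#include <stdint.h>

/* The result has to wrap at 8 bits, so after the 32 bit add the
   compiler may need an extra masking (AND or zero-extend) step. */
static inline uint8_t add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* With 32 bit values the type matches the register width,
   so the add is all that's needed. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
    return a + b;
}
```

For example, `add_u8(200, 100)` has to come out as 44 rather than 300, and that wraparound is the work the extra instruction does.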
Just to make this already-long post complete, the other special case is the DSP extension instructions, which support packing two 16 bit signed integers into a 32 bit register. The compiler doesn't use these instructions automatically. You have to use inline assembly (or call inline functions that wrap that inline asm) to make use of these very special features. When done very carefully, certain types of signal processing algorithms can run much faster with 16 bit signed integers. The audio library makes extensive use of this technique, if you're interested in seeing an example.
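To show what "packing two 16 bit signed integers into 32 bits" means, here is a portable plain C model of the idea. It is not the audio library's code and it does not use the real DSP instructions; all three function names are made up. The last function only mimics what a dual multiply-add instruction such as SMUAD computes in one step.

```c
#include <stdint.h>

/* Pack two signed 16 bit values into one 32 bit word
   (lo in bits 0-15, hi in bits 16-31). */
static inline uint32_t pack_16x2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Sign-extend the low 16 bits of a word to a 32 bit signed value. */
static inline int32_t half_to_s32(uint32_t word)
{
    int32_t v = (int32_t)(word & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Portable model of a dual multiply-add: lo(a)*lo(b) + hi(a)*hi(b).
   The sum overflows only if all four halves are -32768. */
static inline int32_t dual_mul_add_model(uint32_t a, uint32_t b)
{
    return half_to_s32(a) * half_to_s32(b)
         + half_to_s32(a >> 16) * half_to_s32(b >> 16);
}
```

With samples packed this way, one 32 bit load fetches two samples at once, which is where much of the speedup in this style of code comes from.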
Donald Knuth's classic "premature optimization" quote is good advice. But you do need to choose your data types somehow on the first pass. Go with uint32_t for all unsigned integers (unless you need 64 bits), int32_t when you need negative numbers, and don't be shy about using 32 bit float when appropriate, since you have an FPU which implements most basic operations in a single cycle.