I had an idea that an ongoing forum post containing little optimization tricks might be a benefit to the community, and just an interesting read overall. Certainly there are blogs you can search on Google and harvest a trick or two, but having everything in one place (plus ones not generally known) would be helpful.
What are your best tricks? I'll start...
-If you are creating a circular buffer(s), lets say for the purpose of an interpolation algorithm, consider sizing the array(s) to powers of 2. For example;
Code:
int16_t cBuffer[4]; //4 elements
Why...?
Well imagine you are using an INT counter to clock your circular buffer, in order to wrap around you are perhaps using the modulus operator to do just this;
Code:
heads1 = (Counter1 + 1) % 0x03;
tails1 = Counter1 % 0x03;
Counter1++;
There is a significant error in this "Optimization"!
An integer reduced modulus 3 can have the values 0,1,2. Thus your code will never use buffer element [3].
your example should read:
Code:
heads1 = (Counter1 + 1) % 0x04;
tails1 = Counter1 % 0x04;
Counter1++;
It's always good to check your code before you optimize too much!
As is pointed out in some of the following posts, GCC for the ARM doesn't use a divide instruction, but converts the modulus operation to a combination of multiplies by constants and right shifts.
In actual practice, the difference between the logical AND and the modulus method reduces to just over a tenth of a microsecond (or about 16 clock cycles on a T3.6 at 168MHz).
A significant advantage of the modulus method is that it doesn't restrict your buffers to sizes an even power of two. This can be usefule for larger buffers. If you have 60K of RAM for buffers, you would have to stop at 32768 bytes with the logical AND method, but could use all 60K with the modulus method.
I wrote a quick timing test sketch to look at the difference between the algorithms:
Code:
// run one million operations for timing
#define NUMTORUN 1000000
#define LONGBUFFSIZE 60000
uint32_t longbuffer[LONGBUFFSIZE];
void setup() {
// put your setup code here, to run once:
uint32_t startmicro, deltamicro1,deltamicro2,deltamicro3, retval1, retval2, retval3;
delay(1500);
Serial.begin(9600);
Serial.println("\nBuffer insertion Timing Test");
startmicro = micros();
retval1 = modtest(NUMTORUN);
deltamicro1 = micros()-startmicro;
Serial.printf("A Buffer insertion with modulus took %6.4f microseconds\n", (float)deltamicro1/NUMTORUN);
startmicro = micros();
retval2 = andtest(NUMTORUN);
deltamicro2 = micros()-startmicro;
Serial.printf("A Buffer insertion with logical and took %6.4f microseconds\n", (float)deltamicro2/NUMTORUN);
startmicro = micros();
retval3 = longtest(NUMTORUN);
deltamicro3 = micros()-startmicro;
Serial.printf("A long Buffer insertion with modulus took %6.4f microseconds\n", (float)deltamicro3/NUMTORUN);
Serial.printf("retval1: %lu retval2: %lu retval3: %lu \n", retval1,retval2, retval3);
}
#define BUFFSIZE 128
// test buffer insertion with modulus operation
// if we leave the buff1size at 128, the compiler will optimize the function to use logical and
uint32_t modtest(uint32_t numtorun){
uint32_t buff1[BUFFSIZE+10];
volatile uint32_t i, idx; // use volatile to prevent optimizing out of existence
idx = 0;
for(i=0; i<numtorun; i++){
buff1[idx++] = i;
// this is the part we are testing
idx = idx % (BUFFSIZE+10);
}
return buff1[1]; // needed to keep compiler from optimizing loop out of existence!
}
// Test buffer insertion with logical and
uint32_t andtest(uint32_t numtorun){
uint32_t buff1[BUFFSIZE];
volatile uint32_t i, idx;
idx = 0;
for(i=0; i<numtorun; i++){
buff1[idx++] = i;
// this is the part we are testing
idx = idx &(BUFFSIZE-1);
}
return buff1[1]; // needed to keep compiler from optimizing loop out of existence!
}
// Test long buffer insertion with modulus
// uses global buffer to keep stack usage reasonable
uint32_t longtest(uint32_t numtorun){
volatile uint32_t i, idx;
idx = 0;
for(i=0; i<numtorun; i++){
longbuffer[idx++] = i;
// this is the part we are testing
idx = idx % LONGBUFFSIZE;
}
return longbuffer[1]; // needed to keep compiler from optimizing loop out of existence!
}
void loop() {
// put your main code here, to run repeatedly:
}
I actually looked at the generated assembly in the .LST file and there was no division instruction in the modulus algorithm.
The modulus operation is hidden somewhere in :
Code:
// this is the part we are testing
idx = idx % LONGBUFFSIZE;
5b0: fbae 4302 umull r4, r3, lr, r2
5b4: 0b9b lsrs r3, r3, #14
5b6: fb07 2313 mls r3, r7, r3, r2
5ba: 9303 str r3, [sp, #12]
// Test long buffer insertion with modulus
However, I don't really know ARM assembly well enough to fully understand that code. Thankfully, the authors of GCC-ARM are on top of things!