Coding Tricks - Reducing SRAM & CPU cycles

martianredskies · May 3, 2020

I had an idea that an ongoing forum post containing little optimization tricks might be a benefit to the community, and just an interesting read overall. Certainly there are blogs you can search on Google and harvest a trick or two, but having everything in one place (plus ones not generally known) would be helpful.

What are your best tricks? I'll start...

-If you are creating a circular buffer(s), lets say for the purpose of an interpolation algorithm, consider sizing the array(s) to powers of 2. For example;

Code:

int16_t cBuffer[4]; //4 elements

Why...?

Well imagine you are using an INT counter to clock your circular buffer, in order to wrap around you are perhaps using the modulus operator to do just this;

Code:

heads1 = (Counter1 + 1) % 0x03;
tails1 = Counter1 % 0x03;

Counter1++;

This works well, but modulus performs a division, and thus is not super efficient. However if your cBuffer[] array is sized to a power of 2, you can instead simply use a BITWISE mask. The savings aren't huge, but if you use a few, it all adds ups.

Code:

heads1 = (Counter1 + 1) & 0x03; //mask 0011
tails1 = Counter1 & 0x03;

Counter1++;

martianredskies · May 3, 2020

Following on from this;

A boolean variable is a byte, despite only having two possible variables. If memory is a concern, and you have many boolean variables, consider wrapping them all inside a single integer variable;

uint8_t has 8 possible boolean variables.
uint16_t has 16 possible boolean variables.
uint32_t has 32 possible boolean variables.

where before you may be doing something like;

Code:

boolean logicSwitch1 = true; //1 byte
boolean logicSwitch2 = true;
boolean logicSwitch3 = true;
boolean logicSwitch4 = true;
boolean logicSwitch5 = true;
boolean logicSwitch6 = true;
boolean logicSwitch7 = true;
boolean logicSwitch8 = true; //8bytes total

if(logicSwitch1){
//do something
}

if(logicSwitch2){
//do something
}

You would instead do;

Code:

uint8_t logicSwitches; // 1 byte Vs 8 bytes (above)
logicSwitches |= 0x01; //set first bit to true
logicSwitches |= 0x02; //set 2nd bit to true

if(logicSwitches & 0x01){ //condition if 1st bit is true
//do something
}

if(logicSwitches & 0x02){ //condition if 2nd bit is true
//do something
}

there is also the benefit of being able to OR multiple boolean variables without any additional OR operator;

Code:

if(logicSwitch1 | logicSwitch2){ //0001 | 0010 (0x01 | 0x02);
//do something
}

can simply be done with

if(logicSwitches & 0x03){ //0001 | 0010 (0x01 | 0x02);
//do something
}

If you need to clear the bits, you would use;

Code:

logicSwitches &= ~0x01; //clear bit 1
logicSwitches &= ~0x02; //clear bit 2
logicSwitches &= ~0x03; //clear bit 1 & 2

MichaelMeissner · May 3, 2020

The problem with doing micro-optimizations like that is that it depends on the processor and the compiler.

First of all, if you use unsigned variables to hold the index value, and your array size is a power of 2, GCC will optimize modulus to use a single AND instruction and divide to do a single SHIFT right instruction. If is signed, then the compiler might need to add some fixup code. In general, the best type to use for indexes is size_t which is an unsigned type large enough to hold the the largest array element. This way if you ever move from the current Teensys to other platforms, size_t will continue to work, which in large hosted memory applications, uint32_t might not.

Second, on machines where the divide is slow, but the multiply is faster, and you have a multiply instruction that gives you double the bits back, GCC may do a divide by a constant by doing a multiply extended with a value that gives 1/value in the upper bits.

Third, I seem to recall that divide is actually pretty fast on the Arm M4/M7 processors. So on machines with a slow divide, that would work, but on a machine with a fast divide, the compiler might just do the normal divide.

Also, be sure you are optimizing the time critical loop that is hot. Even if you are sure it is the hot loop, measure it first, do your optimization, and see if it is faster. If it is faster, great. If it isn't, judge whether it makes the code harder to understand, and toss it out if it doesn't help and it clutters up the program. If your app isn't spending the majority of time in the function, it may not be worth it to optimize it.

martianredskies · May 3, 2020

Yes, Absolutely. My suggestion is to try things out and see what results you get. Since i'm working a lot with the audio library i'm constantly benchmarking using AudioProcessorUsage();
I guess my idea for this post was more to create a large collection of tips that readers could look through and hopefully something comes up that is applicable for their own projects.

mstiller · May 3, 2020

For setting single bits use something like Paul documented on this page:

https://www.pjrc.com/teensy/pins.html

So instead of using "precalculated" values like

switches |= 0xC4

use

switches |= (1 << 7) | (1 <<6) | (1<<2).

Especially useful is the method for clearing bits. Look it up on the mentioned page.

mborgerson · May 4, 2020

martianredskies said:
I had an idea that an ongoing forum post containing little optimization tricks might be a benefit to the community, and just an interesting read overall. Certainly there are blogs you can search on Google and harvest a trick or two, but having everything in one place (plus ones not generally known) would be helpful.

What are your best tricks? I'll start...

-If you are creating a circular buffer(s), lets say for the purpose of an interpolation algorithm, consider sizing the array(s) to powers of 2. For example;

Code:

int16_t cBuffer[4]; //4 elements

Why...?

Well imagine you are using an INT counter to clock your circular buffer, in order to wrap around you are perhaps using the modulus operator to do just this;

Code:

heads1 = (Counter1 + 1) % 0x03; tails1 = Counter1 % 0x03; Counter1++;

There is a significant error in this "Optimization"!

An integer reduced modulus 3 can have the values 0,1,2. Thus your code will never use buffer element [3].

your example should read:

Code:

heads1 = (Counter1 + 1) % 0x04; tails1 = Counter1 % 0x04; Counter1++;

It's always good to check your code before you optimize too much!

As is pointed out in some of the following posts, GCC for the ARM doesn't use a divide instruction, but converts the modulus operation to a combination of multiplies by constants and right shifts.

In actual practice, the difference between the logical AND and the modulus method reduces to just over a tenth of a microsecond (or about 16 clock cycles on a T3.6 at 168MHz).

A significant advantage of the modulus method is that it doesn't restrict your buffers to sizes an even power of two. This can be usefule for larger buffers. If you have 60K of RAM for buffers, you would have to stop at 32768 bytes with the logical AND method, but could use all 60K with the modulus method.

I wrote a quick timing test sketch to look at the difference between the algorithms:

Code:

// run one million operations for timing #define NUMTORUN 1000000 #define LONGBUFFSIZE 60000 uint32_t longbuffer[LONGBUFFSIZE]; void setup() { // put your setup code here, to run once: uint32_t startmicro, deltamicro1,deltamicro2,deltamicro3, retval1, retval2, retval3; delay(1500); Serial.begin(9600); Serial.println("\nBuffer insertion Timing Test"); startmicro = micros(); retval1 = modtest(NUMTORUN); deltamicro1 = micros()-startmicro; Serial.printf("A Buffer insertion with modulus took %6.4f microseconds\n", (float)deltamicro1/NUMTORUN); startmicro = micros(); retval2 = andtest(NUMTORUN); deltamicro2 = micros()-startmicro; Serial.printf("A Buffer insertion with logical and took %6.4f microseconds\n", (float)deltamicro2/NUMTORUN); startmicro = micros(); retval3 = longtest(NUMTORUN); deltamicro3 = micros()-startmicro; Serial.printf("A long Buffer insertion with modulus took %6.4f microseconds\n", (float)deltamicro3/NUMTORUN); Serial.printf("retval1: %lu retval2: %lu retval3: %lu \n", retval1,retval2, retval3); } #define BUFFSIZE 128 // test buffer insertion with modulus operation // if we leave the buff1size at 128, the compiler will optimize the function to use logical and uint32_t modtest(uint32_t numtorun){ uint32_t buff1[BUFFSIZE+10]; volatile uint32_t i, idx; // use volatile to prevent optimizing out of existence idx = 0; for(i=0; i<numtorun; i++){ buff1[idx++] = i; // this is the part we are testing idx = idx % (BUFFSIZE+10); } return buff1[1]; // needed to keep compiler from optimizing loop out of existence! } // Test buffer insertion with logical and uint32_t andtest(uint32_t numtorun){ uint32_t buff1[BUFFSIZE]; volatile uint32_t i, idx; idx = 0; for(i=0; i<numtorun; i++){ buff1[idx++] = i; // this is the part we are testing idx = idx &(BUFFSIZE-1); } return buff1[1]; // needed to keep compiler from optimizing loop out of existence! } // Test long buffer insertion with modulus // uses global buffer to keep stack usage reasonable uint32_t longtest(uint32_t numtorun){ volatile uint32_t i, idx; idx = 0; for(i=0; i<numtorun; i++){ longbuffer[idx++] = i; // this is the part we are testing idx = idx % LONGBUFFSIZE; } return longbuffer[1]; // needed to keep compiler from optimizing loop out of existence! } void loop() { // put your main code here, to run repeatedly: }

I actually looked at the generated assembly in the .LST file and there was no division instruction in the modulus algorithm.

The modulus operation is hidden somewhere in :

Code:

// this is the part we are testing idx = idx % LONGBUFFSIZE; 5b0: fbae 4302 umull r4, r3, lr, r2 5b4: 0b9b lsrs r3, r3, #14 5b6: fb07 2313 mls r3, r7, r3, r2 5ba: 9303 str r3, [sp, #12] // Test long buffer insertion with modulus

However, I don't really know ARM assembly well enough to fully understand that code. Thankfully, the authors of GCC-ARM are on top of things!

Coding Tricks - Reducing SRAM & CPU cycles

martianredskies

Well-known member

martianredskies

Well-known member

MichaelMeissner

Senior Member+

martianredskies

Well-known member

mstiller

Well-known member

mborgerson

Well-known member