Exponential Moving Average - How does it work?

xenington
Hello everyone

I am using an exponential moving average to smooth the readings from an LSM303DLHC compass module and it seems to work. However, I do not understand how it works. Here is the code (reading = raw reading from the module):

Code:
uint16_t headingTotal = 0;
uint8_t K = 2;

int16_t headingAvg(int16_t reading)
{
  headingTotal += reading;  
  int16_t average = headingTotal + (1 << (K - 1));
  if (reading < 0) {
    average -= 1;
  }
  headingTotal -= average;
  return average >> K;
}

To me, the filter: 1) adds the latest value to the headingTotal, 2) adds 2 to headingTotal to give average, 3) subtracts 1 from average if value < 0, 4) subtracts average from headingTotal and finally 5) divides average by 4 to produce the smoothed value. What is the point of adding 2 to headingTotal? How does subtracting 1 from average allow for signed integers?

I have tried going through the maths on the website where the filter came from, but it is too much for my poor head:

https://tttapa.github.io/Pages/Math...nential Moving Average/C++Implementation.html
https://tttapa.github.io/Pages/Math...oving Average/Exponential-Moving-Average.html

Can someone please explain it in simple terms? Thanks.
 
I use exponential smoothing quite often. It is very convenient for smoothing values on the fly without the need for a buffer. The version you are using seems to be an integer-optimized version of the usual float algorithm.

Code:
constexpr float k = 0.1f;

T_avg = T_avg + k * (T - T_avg);

In this version it is much easier to understand how it works. You'd call this whenever you get a new value T. The algorithm simply calculates the difference of the new value T to the current average T_avg. It then adds a fraction of this difference to the average.

If you e.g. start with T_avg = 0 and call the algorithm repetitively with a constant T = 1, you see that T_avg will slowly rise. The more it rises, the smaller the difference between T_avg and T, and the slower it rises. Actually it can never really reach T, but it gets arbitrarily close to it. (It would follow an exponential curve, hence the name.)

The smaller k is, the better the smoothing, but the longer it takes to reach a constant value for a constant input. The larger k is (k < 1), the less smoothing you get. For k = 1 the average simply follows the current value T. For k = 0 the average doesn't change at all.
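If it helps to see the numbers, here is a minimal throwaway sketch (values chosen arbitrarily) of the constant-input case described above:

Code:
#include <cstdio>

int main() {
    constexpr float k = 0.1f;
    float T_avg = 0.0f;          // start from zero
    const float T = 1.0f;        // constant input
    for (int i = 1; i <= 50; ++i) {
        T_avg = T_avg + k * (T - T_avg);   // the smoothing step from above
        if (i % 10 == 0)
            std::printf("after %2d samples: T_avg = %.4f\n", i, T_avg);
    }
    return 0;
}

T_avg climbs quickly at first and then ever more slowly towards 1, tracing the exponential curve mentioned above.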

Hope that helps.
 
I'd second that way of coding it as being clear, but it presupposes either fixed point fractions or floats.

[ The original code uses just integer arithmetic and has the problems this creates (needing a scaling down
with a shift, bias due to truncation rather than rounding, and the DC gain isn't immediately obvious). ]

If you have a long time constant (i.e. a small value of k) it's best to treat the first sample specially and
set the average directly from it, otherwise it will take a while to settle down.
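Something along these lines (a rough float sketch, the variable names are just illustrative):

Code:
struct Smoother {
    float avg = 0.0f;
    bool  seeded = false;

    float operator()(float sample, float k) {
        if (!seeded) {                    // first call: jump straight to the reading
            avg = sample;
            seeded = true;
        } else {
            avg += k * (sample - avg);    // normal smoothing step
        }
        return avg;
    }
};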

Technically this is a first-order digital low-pass filter, equivalent to an RC circuit(*) or such. The single real pole is at
1-k inside the unit-circle and the gain is scaled to unity at DC. So for instance if k = 0.5, the pole at 0.5 will give a
3-fold attenuation at the Nyquist limit (alternating 1,-1 samples for instance), as 0.5 is three times closer to +1 than -1.

For small k the roll-off frequency is about k Fs / (2 pi) where Fs is the sample rate, which means if you want
a roll-off at f (<< Fs), choose k = 2 pi f / Fs. In the presence of white noise the noise power will reduce by a
factor of about pi / k, and thus the noise amplitude by about √(pi/k).

(*) more accurately for small values of k.
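As a rough worked example of those approximations (the numbers are made up):

Code:
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.14159265358979;
    const double Fs = 1000.0;                 // sample rate in Hz (assumed)
    const double f  = 5.0;                    // desired roll-off in Hz, f << Fs
    const double k  = 2.0 * pi * f / Fs;      // ~0.0314
    const double noise_amp = std::sqrt(pi / k);   // amplitude reduction, ~10x
    std::printf("k = %.4f, noise amplitude reduced about %.1fx\n", k, noise_amp);
    return 0;
}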
 
No, OP's code is not right. A correct implementation would be
Code:
constexpr int_fast16_t fractional_bits = 2;
int_fast16_t heading = 0;

int_fast16_t average_heading(int_fast16_t sample)
{
    int_fast16_t average;
    heading += sample;
    average = (heading - (heading < 0) + (1 << (fractional_bits - 1))) >> fractional_bits;
    heading -= average;
    return average;
}
This implements the following maths:
        heading = heading + (sample / 2^fractional_bits)
        average = round(heading)
        heading = heading - average
        (returns average)

The way it does this is by having heading be a fixed-point number with fractional_bits bits for the fractional part. Essentially, although it is obviously an integer, it represents the mathematical value heading / 2^fractional_bits.

The tricky part is how average is rounded from heading, noting that average is a normal integer, and not a fixed-point value.

The (heading < 0) part evaluates to 1 if heading is less than zero, and to zero otherwise. This is standard in C and C++: the result of a comparison converts to integer 1 if true, and to 0 if false.

The (1 << (fractional_bits - 1)) represents the mathematical value 0.5 in this fixed point number.

The final >> fractional_bits divides the left side by 2^fractional_bits, rounding down. For positive values, it rounds towards zero; for negative values, towards the next smaller (more negative) integer. (As an example, -4>>2 == -3>>2 == -2>>2 == -1>>2 == -1, but -5>>2 == -2.)

To convert from rounding down to rounding towards nearest integer, we need to add half; and subtract the unit in the least significant place if negative.

So, the -(heading < 0) fixes the asymmetry when shifting negative integers right; and the +(1<<(fractional_bits-1)) turns the bit shift into a rounding operation (adding half before truncating rounds to the nearest integer).

Final note: int16_t is sub-optimal on 32-bit Teensies (LC, 3.x, 4.x). The int_fast16_t is a standard type (although many programmers don't know about it), that is at least 16-bit, but larger if the current hardware prefers larger words. On 32-bit architectures, it is usually 32 bits.
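To see the fixed-point bookkeeping in action, here is a worked trace of the implementation above with fractional_bits = 2 (so heading carries the value heading/4) and a constant sample of 10; these numbers are just an example stepped through by hand:

Code:
heading =  0 -> heading += 10 -> 10; average = (10 + 2) >> 2 =  3; heading = 10 -  3 =  7
heading =  7 -> heading += 10 -> 17; average = (17 + 2) >> 2 =  4; heading = 17 -  4 = 13
heading = 13 -> heading += 10 -> 23; average = (23 + 2) >> 2 =  6; heading = 23 -  6 = 17
heading = 17 -> heading += 10 -> 27; average = (27 + 2) >> 2 =  7; heading = 27 -  7 = 20
...
heading = 28 -> heading += 10 -> 38; average = (38 + 2) >> 2 = 10; heading = 38 - 10 = 28

After a few samples heading settles at 28, and average settles at 10, tracking the constant input.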



I do not expect anyone to take my word for this, though. I suggest you compile and run the following C program on your computer:
Code:
#include <stdlib.h>
#include <inttypes.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int  n, nmin, nmax, bits, v;
    double b, d;
    char dummy;

    if (argc != 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help") || !strcmp(argv[1], "/?")) {
        const char *argv0 = (argc > 0 && argv && argv[0] && argv[0][0]) ? argv[0] : "(this)";
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv0);
        fprintf(stderr, "       %s BITS MIN MAX\n", argv0);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program prints\n");
        fprintf(stderr, "    (N - (N < 0) + (N << (BITS - 1))) >> BITS\n");
        fprintf(stderr, "for N = MIN .. MAX, inclusive.\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %d %c", &bits, &dummy) != 1 || bits < 1) {
        fprintf(stderr, "%s: BITS must be a positive integer.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %d %c", &nmin, &dummy) != 1) {
        fprintf(stderr, "%s: MIN must be an integer.\n", argv[2]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[3], " %d %c", &nmax, &dummy) != 1 || nmax < nmin) {
        fprintf(stderr, "%s: MAX must be an integer, at least as large as MIN.\n", argv[3]);
        return EXIT_FAILURE;
    }

    b = pow(2.0, bits);

    for (n = nmin; n <= nmax; n++) {
        v = (n - (n < 0) + (1 << (bits - 1))) >> bits;
        d = round( (double)n / b );
        printf("%d %d %.0f\n", n, v, d);
    }

    return EXIT_SUCCESS;
}
Supplying the number of fractional bits to consider, and the minimum and maximum integer value n, it shows the values of n, (n - (n < 0) + (1 << (bits-1))) >> bits, and round(n / 2^bits). You'll see that the second and third columns match, i.e. that the odd integer expression really does round correctly.

(You can safely consider -0 == 0.)

For example, saving the above as ints.c and compiling it in Linux using gcc -Wall -Wextra -O2 ints.c -lm -o ints and running ./ints 2 -17 17 outputs
Code:
-17 -4 -4
-16 -4 -4
-15 -4 -4
-14 -4 -4
-13 -3 -3
-12 -3 -3
-11 -3 -3
-10 -3 -3
-9 -2 -2
-8 -2 -2
-7 -2 -2
-6 -2 -2
-5 -1 -1
-4 -1 -1
-3 -1 -1
-2 -1 -1
-1 0 -0
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 2 2
7 2 2
8 2 2
9 2 2
10 3 3
11 3 3
12 3 3
13 3 3
14 4 4
15 4 4
16 4 4
17 4 4
which proves this expression correctly rounds (at least for bits == 2). It does work for other fractional bit counts, from 1 to 30 (on architectures with 32-bit ints) or thereabouts, but proving it all would make this post overly long.
 
@Nominal Animal

Brilliant answer, thank you so much for your time. My understanding is a lot clearer now. Also, I swapped your implementation for the one I was using and it works much better (I don't think mine was actually filtering anyway, it just appeared to be). Using fractional_bits = 8 seems to be the best compromise between speed and stability.

You are right - I hadn't heard of the int_fast16_t before!

Thanks also to luni and MarkT. It all helps!

:)
 
No, OP's code is not right. A correct implementation would be
.....
Not sure if it is the same, but I implement exp. averaging as

float:
avg = avg + aa*(x - avg), where aa is a small number, 0 < aa < 1, and x is the input

integer:
avg = (avg*A + (x - avg))/A, where A is an integer > 1; in particular, A = 1/aa.

to speed up I choose A to be a power of 2, so I can use right and left shifts.
So, instead of, say, using A = 1000, I use A = 1024 and the formula becomes
avg = ((avg << 10) + (x - avg)) >> 10

Obviously, it may be necessary to use say int32_t to average int16_t values, etc.
 
Exponential decay filtering,
        average = C·sample + (1-C)·average
(where C is a real coefficient smaller than 1) is different to exponential moving average.

Their main difference is that the decay filter retards the signal (peaks in the filtered signal occur later than in the samples), whereas the windowed filter does not.

This particular exponential moving average filter is cheap to calculate, but if the source is noisy, a small amount of noise passes through (a few units in the least place). You can mitigate that by averaging a few samples, so the signal retardation is just a few samples, but the output is smooth.
 
If speed is an issue, do use a power of two. But keep your code more readable and let the compiler worry about replacing division/multiply with a shift.
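For example (a sketch only, the constants are arbitrary), this stays readable, and GCC/Clang will still compile division by a power-of-two constant down to shifts (plus a small sign fix-up for signed values):

Code:
#include <stdint.h>

constexpr int32_t A = 1024;        // smoothing divisor, a power of two
int32_t avg = 0;

int32_t smooth(int32_t x) {
    avg += (x - avg) / A;          // the compiler turns this into shifts
    return avg;
}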
 
Exponential decay filtering,
        average = C·sample + (1-C)·average
(where C is a real coefficient smaller than 1) is different to exponential moving average.
That's two multiplies, whereas the "average += k * (sample-average)" approach is only one multiply and doesn't risk problems
when you change C and forget to change 1-C to match. If k is a negative power of 2, it's one shift, no multiplies.
 
> that's 2 multiplies, whereas the the "average += k * (sample-average)" approach is only one multiply

In yet another case of it not being wise to outguess the compiler, the mathematically equivalent "average = C·sample + (1-C)·average" actually runs faster (T4.1, floating point). And there is no need to ever change 1-C when C changes. Even with integers (about the same speed, so why bother), one can just rearrange the formula slightly to avoid (1-C).
 
Last edited:
> that's 2 multiplies, whereas the the "average += k * (sample-average)" approach is only one multiply

In yet another case of it not being wise to outguess the compiler

That's one compiler, at one optimization setting, on a machine with hardware floating point and super-scalar dual issue; it's not
wise to assume this is generally true.
 
That's two multiplies, whereas the "average += k * (sample-average)" approach is only one multiply and doesn't risk problems
when you change C and forget to change 1-C to match. If k is a negative power of 2, it's one shift, no multiplies.
Eh, what? I was describing a mathematical formula, not suggesting code.

There is an important distinction here, by the way.

The often used mathematical form

        state = (1 - weight)·state + weight·sample
        Return state

is not exactly the same as

        state = state + weight·sample
        average = round(state)
        state = state - weight·average
        Return average

The first one is equivalent to
        state = state + weight·(sample - state)
which you can trivially verify by just reordering the terms.
I call this the exponential decay filter.

The second one (that returns average), called here the Exponential Moving Average, is similar, but not exactly the same: in this one, the state is a sum of two parts, the first part being equivalent to the state in the first form, and the second part being the fractional bits of the average scaled by weight. This causes the average to cycle between two integer values, even for a constant incoming sample value. For example, when all sample=2.5 and weight=0.5, then average = 2, 3, 2, 3, 2, 3, ... . Similarly, if sample varies between 9.5 and 10.5 (so average 10, with ±0.5 of noise), then average won't stay 10, but occasionally dips to 9 or 11. This has both upsides and downsides; for example, you could say it is both more precise and more noisy.

This effect remains, even when sample is an integer, but has any noise.



When you deal with a formula like (1-w)·a + w·b, the form I recommend one uses depends on the type of the variables used.

When using integer types, the form a+w·(b-a) is recommended, because it uses only a single multiplication, but also because you can avoid integer overflow by using a+w·(b-a) when b>a, and a-w·(a-b) otherwise. The two are mathematically equivalent.
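As a sketch of that integer case (the weight written as num/den in fixed point; the names are mine, and it assumes num·|b-a| fits in the integer type):

Code:
#include <stdint.h>

// Returns approximately a + (num/den)*(b - a), computed so the difference
// passed to the multiply is never negative, per the rearrangement above.
int32_t lerp_int(int32_t a, int32_t b, int32_t num, int32_t den) {
    if (b > a)
        return a + (num * (b - a)) / den;   // a + w*(b - a)
    else
        return a - (num * (a - b)) / den;   // a - w*(a - b), same value
}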

When using floating-point types, the form (1-w)·a + w·b is recommended, because it is numerically more robust. In particular, the form a+w·(b-a) risks domain cancellation error, when b and a differ a lot in magnitude. The first form is robust against domain cancellation error: if you increase w from 0 to 1, the value of the expression will start from a and end at b, even if they differ a lot in magnitude. The second form, if a is much greater than b in magnitude, just goes from a to zero.
 
No, OP's code is not right. A correct implementation would be ...
Indeed, @xenington where did you get that code from? The pages you linked to seem to list the correct code (https://tttapa.github.io/Pages/Math...ge/C++Implementation.html#implementation-in-c).

There is an important distinction here, by the way.

The often used mathematical form

        state = (1 - weight)·state + weight·sample
        Return state

is not exactly the same as

        state = state + weight·sample
        average = round(state)
        state = state - weight·average
        Return average

Where does the second formulation come from? The code in the original article implements the following:

        state = state + sample
        average = state * weight
        state = state - average
        Return average

The derivation here shows the equivalence with your first form, using a change of variables, introducing a new variable z. The variable z is essentially a fixed-point representation of the original state y: z = y/weight. The actual state that's saved in the code is state - state*weight = state - average = z - y = (1-weight)*state. In the next iteration, the input sample is added (no need to multiply by weight, since the state is already scaled by weight), and that results in the same formula as your first mathematical form.
This re-formulation is done to make the integer implementation more efficient.

You can also quickly verify the equivalence numerically:
Code:
import numpy as np

N = 100
output_1 = np.zeros((N,))
output_2 = np.zeros((N,))

state_1 = 0
state_2 = 0

weight = .25

rng = np.random.default_rng()

for i in range(N):
    sample = rng.uniform(0, 1024)

    # 1
    state_1 = (1 - weight) * state_1 + weight * sample
    output_1[i] = state_1

    # 2
    state_2 += sample
    average = state_2 * weight
    state_2 = state_2 - average
    output_2[i] = average

import matplotlib.pyplot as plt 

plt.step(np.arange(N), output_1, label='1')
plt.step(np.arange(N), output_2, label='2')
plt.legend()
plt.show()

Pieter
 
The "Exponential Moving Average filter" implementation using integers is slick - on a T4, about 2x faster than the simple float version.
 
The code in the original article implements the following:

        state = state + sample
        average = state * weight
        state = state - average
        Return average
Perhaps that was your intention, but the fact is that the C++ implementation includes the rounding operation.

Code to implement the above is different: you need to calculate average in fixed point as well, and round or truncate it only after the subtraction.

As it stands, the rounding introduces quantization noise, because it (rounding) affects the state. If it didn't, it'd be a different matter.
 
Here's a simple Python3 test bench anyone can experiment with:
Code:
from random import Random
from math import sin, pi

class Source:
    __slots__ = ('prng', 'phase')

    def __init__(self, *args, **kwargs):
        self.prng = Random()
        self.phase = 0.0

    def __call__(self):
        self.phase = self.phase + pi / 127.0
        return round(17.5 * sin(self.phase) + 7.3 * sin(self.phase * 3.1) + self.prng.uniform(-5.0, +5.0))


class Filter:
    __slots__ = ()

    def __init__(self, *args, **kwargs):
        pass

    def __call__(self, sample):
        return round(sample)


class ExponentialFilter(Filter):
    __slots__ = ('state', 'weight')

    def __init__(self, weight, state=0):
        self.weight = float(weight)
        self.state = float(state)

    def __call__(self, sample):
        self.state = self.state + sample
        average = self.state * self.weight
        self.state = self.state - average
        return round(average)


class RoundingFilter(Filter):
    __slots__ = ('state', 'weight')

    def __init__(self, weight, state=0):
        self.weight = float(weight)
        self.state = float(state)

    def __call__(self, sample):
        self.state = self.state + sample
        average = round(self.state * self.weight)
        self.state = self.state - average
        return average


class DecayFilter(Filter):
    __slots__ = ('state', 'weight')

    def __init__(self, weight, state=0):
        self.weight = float(weight)
        self.state = float(state)

    def __call__(self, sample):
        self.state = (1 - self.weight)*self.state + self.weight*sample
        return round(self.state)


if __name__ == '__main__':
    sample = Source()
    weight = 0.125
    filter1 = DecayFilter(weight)
    filter2 = ExponentialFilter(weight)
    filter3 = RoundingFilter(weight)

    print("#n x Decay(x) Exponential(x) Rounding(x)")

    N = 256
    n12 = 0
    n13 = 0
    n23 = 0
    for i in range(0, N):
        x = sample()
        y1 = filter1(x)
        y2 = filter2(x)
        y3 = filter3(x)

        n12 += (y1 == y2)
        n13 += (y1 == y3)
        n23 += (y2 == y3)

        print("%3d %3.0f   %3.0f %3.0f %3.0f" % (i, x, y1, y2, y3))

    print("# Decay(x) == Exponential(x) in %d of %d samples" % (n12, N))
    print("# Decay(x) == Rounding(x) in %d of %d samples" % (n13, N))
    print("# Exponential(x) == Rounding(x) in %d of %d samples" % (n23, N))
The Source is the source of the original signal: a sum of two sinusoids with amplitudes 17.5 and 7.3 and periods of 254 and roughly 82 samples, respectively, with uniform random noise between -5 and +5 on top.

Decay and Exponential are the two equivalent mathematical forms for the exponential filter; Rounding implements the "exponential moving average" filter that rounds average before subtracting it from the state.

You'll see that Decay and Exponential always match (feel free to change the Source form!), but the Rounding filter occasionally differs. For this particular signal and noise level, about 8% of the filtered samples differ by ±1.

The output format is suitable for plotting in e.g. gnuplot:
Code:
plot 'output' u 1:2 t 'Signal' w lines lc -1, '' u 1:3 t 'Decay' w lines lc 1, '' u 1:4 t 'Exponential' w lines lc 2, '' u 1:5 t 'Rounding' w lines lc 3
 
Ah, I see what you mean.
The difference is of course that both 'Decay' and 'Exponential' use floating point states and operations, whereas the C++ implementation uses only integer arithmetic, so it's not a surprise that the output differs some of the time.

The reason for the integer implementation is that it's significantly faster than the floating point implementation, especially in MCUs without hardware float support.

When using integers, rounding the average before subtracting it from the state yields much better results than simply using a flooring division:
Code:
import numpy as np

bits = 4
N = 64
output_1 = np.zeros((N,))
output_2 = np.zeros((N,))
output_3 = np.zeros((N,))
output_4 = np.zeros((N,))

state_1 = 0
state_3 = 0
state_4 = 0

weight = .25
recip_weight = int(1 / weight)

rng = np.random.default_rng()

for i in range(N):
    sample = rng.integers(2**bits)

    # 1: float reference
    state_1 = (1 - weight) * state_1 + weight * sample
    output_1[i] = state_1

    # 2: round(float reference)
    output_2[i] = round(output_1[i])

    # 3: integer state without rounding, output with rounding
    state_3 += sample
    average_round = (state_3 + recip_weight // 2) // recip_weight
    average = state_3 // recip_weight
    state_3 = state_3 - average
    output_3[i] = average_round

    # 4: integer state with rounding, output with rounding
    state_4 += sample
    average_round = (state_4 + recip_weight // 2) // recip_weight
    state_4 = state_4 - average_round
    output_4[i] = average_round

mae12 = np.abs(output_1 - output_2).mean()
mae13 = np.abs(output_1 - output_3).mean()
mae14 = np.abs(output_1 - output_4).mean()
print(f'{mae12=}')
print(f'{mae13=}')
print(f'{mae14=}')
mse23 = (np.square(output_2 - output_3)).mean()
mse24 = (np.square(output_2 - output_4)).mean()
print(f'{mse23=}')
print(f'{mse24=}')
err_cnt23 = 100. * np.count_nonzero(output_2 - output_3) / N
err_cnt24 = 100. * np.count_nonzero(output_2 - output_4) / N
print(f'{err_cnt23=}%')
print(f'{err_cnt24=}%')

import matplotlib.pyplot as plt 

plt.step(np.arange(N), output_1, where='post',
         label='1: float reference')
plt.step(np.arange(N), output_2, where='post',
         label='2: round(float reference)')
plt.step(np.arange(N), output_3, '--', where='post',
         label='3: integer state without rounding, output with rounding')
plt.step(np.arange(N), output_4, '--', where='post',
         label='4: integer state with rounding, output with rounding')
plt.legend()
plt.xlim((0, N-1))
plt.ylim((0, 2**bits))
plt.show()
[Attachment: Figure_1.png]
The blue line is the "exact" floating point output. The orange line is the "exact" output rounded to the nearest integer. This orange line is used as the reference to compare the integer implementations with.

The first integer implementation is (if I understood your reply correctly) what you propose: updating the state without rounding, only rounding the output.
It is shown as the green line in the graph, and is often too high (because the flooring division doesn't subtract enough from the state, about half of the time).

The second integer implementation first rounds the output, and then subtracts it from the state, it is the same as the C++ implementation proposed in the Exponential Moving Average pages. It's shown in red on the graph. As you can see, it follows the orange reference much more accurately.

The code I posted above also computes the mean squared error between the two integer implementations and the rounded floating point reference. If you simulate with a uniform random input for a long time, you'll see that it settles to an MSE of 0.500 for the first integer implementation, and a MSE of 0.063 for the second implementation (the one that rounds before updating the state). This corresponds to an incorrect output 50% of the time vs 6.3% of the time, since the error is always ±1.

I'm sure you could use more fractional bits in the fixed point representation to get a more accurate result, but that requires more shifting (which is slower, especially on AVR) and larger integer types. Besides, the second implementation is accurate enough for 99% of smoothing applications; occasionally being one bit off on a 10-bit ADC value is usually not an issue.

Especially if you compare the mean absolute error between the rounded floating point output and the "exact" floating point output with the mean absolute error between the second integer implementation and the "exact" floating point output:
Code:
mae12=0.24995945723741
mae13=0.5026141932125929
mae14=0.255913320189325
The additional error introduced by using an integer state variable instead of a floating point state variable is negligible (0.250 vs 0.256).
 
Do ignore my post #7: when I wrote that, I was obviously confusing a sliding window filter with the exponential moving average. (I'd edit it to correct it, but the grace period is over.)

When using integers, rounding the average before subtracting it from the state yields much better results than simply using a flooring division
I generated some signals and did an FFT from them, but couldn't really see any difference; the mean squared error to the "exact" (floating-point) filtered value is smaller, but since even the truncating version is within ±1, I am not certain if it is meaningful in the real world. (And I mean literally not sure: I do not know.)

Would I use rounding myself? Sure; but, if I was running out of CPU oomph, I'd use truncation instead.

There was one thing in your derivation that kept me scratching my head. The recurrence relation is
        s_i = s_{i-1} + weight·(x_i - s_{i-1})
for 0 < weight < 1. In fixed point format with Q fractional bits in the internal state S, that becomes
        S_i = S_{i-1} + x_i - round(S_{i-1} / 2^Q)
        s_i = round(S_i / 2^Q)
It took me a while (until now, that is) to verify that it is equivalent to your
        S_i = Z_{i-1} + x_i
        s_i = round(S_i / 2^Q)
        Z_i = S_i - s_i
for i ≥ 1 iff S_0 = 0, because your internal state Z_i is a fractional subtraction "ahead" of the fixed-point filtered value S_i. Clever!
If one initializes S_0 = x_0·2^Q and Z_0 = x_0·2^Q - x_0, then the filters are equivalent.
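In code, seeding the state from the first sample looks something like this (an untested sketch of mine, unsigned samples, Q fractional bits, and it assumes x0 fits in 16 - Q bits):

Code:
#include <stdint.h>

constexpr uint8_t Q = 4;         // fractional bits (the shift)
uint16_t z;                      // internal state Z

void ema_seed(uint16_t x0) {
    z = (uint16_t)((x0 << Q) - x0);            // Z0 = x0*2^Q - x0
}

uint16_t ema(uint16_t x) {
    z += x;                                    // S = Z + x
    uint16_t y = (z + (1u << (Q - 1))) >> Q;   // s = round(S / 2^Q)
    z -= y;                                    // Z = S - s
    return y;
}

With that seeding, the very first ema(x0) already returns x0 instead of ramping up from zero.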

I'm sure you could use more fractional bits in the fixed point representation to get a more accurate result, but that requires more shifting (which is slower, especially on AVR) and larger integer types. Besides, the second implementation is accurate enough for 99% of smoothing applications, occasionally being one bit off on a 10-bit ADC value is usually not an issue.
True.

You can do unsigned integer arithmetic, too, if you offset the sample by the magnitude of the largest negative sample:
Code:
#include <stdint.h>

const uint_fast8_t  filter16_shift = 5;
uint_fast16_t       filter16_state = 32768 - (32768 >> filter16_shift); /* == zero */
/* Initial state for x is (32768 - x + (x<<filter16_shift) - (32768>>filter16_shift)) */

int_fast16_t  filter16(int_fast16_t  sample)
{
    const uint_fast16_t  value = sample + (32768 >> filter16_shift);
    filter16_state += value;
    uint_fast16_t  result = (filter16_state + (1 << (filter16_shift - 1))) >> filter16_shift;
    filter16_state -= result;
    return result - (32768 >> filter16_shift);
}
which on GCC-AVR 5.4.0 with -O2 compiles to
Code:
filter16(int):
        lds  r18, filter16_state
        lds  r19, filter16_state+1
        subi r19, -4
        add  r18, r24
        adc  r19, r25
        mov  r24, r18
        mov  r25, r19
        adiw r24, 16
        lsr  r25
        ror  r24
        swap r25
        swap r24
        andi r24, 0x0f
        eor  r24, r25
        andi r25, 0x0f
        eor  r24, r25
        sub  r18, r24
        sbc  r19, r25
        sts  filter16_state+1, r19
        sts  filter16_state, r18
        subi r25, 4
        ret

filter16_state:
        .word   31744
and something ridiculously fast on ARMs.

This produces the exact same results (including correct rounding), and is suitable for filtering samples between -1024 and 1023, inclusive, without overflowing.
(In general, sequences of samples between -(32768>>filter16_shift) and ((32768>>filter16_shift)-1), inclusive, cannot overflow.)
The smaller shifts (4, 3, 2, or 1) yield even shorter/faster code (but not by much).

This could be turned into a C++ template that takes the minimum sample value as a parameter, and uses that as the offset, replacing (32768>>filter16_shift), if someone has asymmetric sample ranges, but wants to do filtering.
 
Indeed, @xenington where did you get that code from? The pages you linked to seem to list the correct code (https://tttapa.github.io/Pages/Math...ge/C++Implementation.html#implementation-in-c).

Hi Pieter

I tried to insert the signed division implementation into the C++ example (for unsigned division):

Code:
Implementation of Signed Division by a Power of Two

constexpr unsigned int K = 3;

signed int div_s1(signed int val) {
    int round = val + (1 << (K - 1));
    if (val < 0)
        round -= 1;
    return round >> K;
}

Code:
Implementation in C++

We now have everything in place to write an implementation of the EMA in C++:
EMA.cpp

#include <stdint.h>

template <uint8_t K, class uint_t = uint16_t>
class EMA {
  public:
    uint_t operator()(uint_t x) {
        z += x;
        uint_t y = (z + (1 << (K - 1))) >> K;
        z -= y;
        return y;
    }

    static_assert(
        uint_t(0) < uint_t(-1),  // Check that `uint_t` is an unsigned type
        "Error: the uint_t type should be an unsigned integer, otherwise, "
        "the division using bit shifts is invalid.");

  private:
    uint_t z = 0;
};

It didn't work because I didn't really understand what I was doing. The result was a simple division and not an average.
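For what it's worth, merging the two snippets would presumably give something like this (an untested sketch, not the article's code):

Code:
#include <stdint.h>

template <uint8_t K, class int_t = int32_t>
class SignedEMA {
  public:
    int_t operator()(int_t x) {
        z += x;
        // round(z / 2^K): add half and subtract 1 if negative before the
        // shift (assumes the compiler shifts signed values arithmetically,
        // as GCC does on AVR and ARM).
        int_t y = (z + (1 << (K - 1)) - (z < 0)) >> K;
        z -= y;
        return y;
    }

  private:
    int_t z = 0;
};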
 
I generated some signals and did an FFT from them, but couldn't really see any difference; the mean squared error to the "exact" (floating-point) filtered value is smaller, but since even the truncating version is within ±1, I am not certain if it is meaningful in the real world. (And I mean literally not sure: I do not know.)

Would I use rounding myself? Sure; but, if I was running out of CPU oomph, I'd use truncation instead.
For truly random inputs, it probably doesn't matter, but the problem with truncation is that it makes it impossible to reach zero again (assuming non-negative inputs).

Using a rectangular input from 0 to 16 and back to 0, for instance:
[Attachment: Figure_2.png]

You can do unsigned integer arithmetic, too, if you offset the sample by the magnitude of the largest negative sample:
Nice, that's indeed a better solution than the conditional correction for negative values.

I think you can even optimize it further by subtracting/adding the offset only once:
Code:
const uint_fast8_t  filter16_shift   = 5;
uint_fast16_t       filter16_state_2 = 32768; /* == zero */

int_fast16_t  filter16_2(int_fast16_t  sample)
{
    const uint_fast16_t  value = sample;
    filter16_state_2 += value;
    uint_fast16_t  result = (filter16_state_2 + (1 << (filter16_shift - 1))) >> filter16_shift;
    result -= 32768 >> filter16_shift;
    filter16_state_2 -= result;
    return result;
}
It saves a single subtraction: https://godbolt.org/z/14457Y

Or as a C++ class that handles both signed and unsigned integers: https://godbolt.org/z/nj54Gs

Or as an Arduino sketch for OP:
Code:
#include <type_traits> // std::make_unsigned_t, is_unsigned
#include <limits>      // std::numeric_limits
#include <cstdint>     // uint_fast16_t

template <uint8_t K,
          class input_t = uint_fast16_t,
          class state_t = std::make_unsigned_t<input_t>>
class EMA {
  public:
    EMA(input_t initial = input_t(0))
      : state(zero + (initial << K) - initial) {}

    input_t operator()(input_t input) {
      state         += state_t(input);
      state_t output = (state + half) >> K;
      output        -= zero >> K;
      state         -= output;
      return input_t(output);
    }

  constexpr static state_t 
    max_state = std::numeric_limits<state_t>::max(),
    zero      = std::is_unsigned<input_t>::value ? 0 : max_state / 2 + 1,
    half      = state_t(1) << (K - 1);
    
  static_assert(std::is_unsigned<state_t>::value, 
                "State type should be unsigned");

  private:
    state_t state;
};

// -----

EMA<5, int_fast16_t> filter;

int_fast16_t sample() {
  return 2 * analogRead(A1) - 1024;
}

void setup() {
  Serial.begin(115200);
  while (!Serial);
}

void loop() {
  const unsigned long period = 1000; // 1000 µs = 1 kHz
  static unsigned long timer = micros() - period;
  if (micros() - timer >= period) {
    Serial.println(filter(sample()));
    timer += period;
  }
}
 
Be careful when using 16 bits - your input sample range is limited. For example, -1024 to 1023 (??) for the code in #21. Some comments or checks would be a good idea.
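Something like this (a sketch, the names are mine) would at least catch it at compile time:

Code:
constexpr unsigned sample_bits = 11;   // e.g. a -1024..1023 input range
constexpr unsigned shift_bits  = 5;    // the K / filter shift in use
static_assert(sample_bits + shift_bits <= 16,
              "16-bit filter state would wrap; reduce the shift or widen the state");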
 
Last edited:
If the data are 16-bit, then IMO the state variable (effectively sample<<K) should be declared as 32-bit.
 
Note: It looks like the state has to be an exact-width type (so uintN_t and not uint_fastN_t), as any extra bits will mess up the modulo arithmetic.

The 16-bit state is intended for input samples within an N-bit range and Q-bit shift, where N+Q ≤ 16. On 8-bit microcontrollers, this yields much, much faster code than using 32-bit state.

On 32-bit architectures,
Code:
#include <stdint.h>

static const uint_fast8_t  filter32_shift = 4;
uint32_t                   filter32_state;

void  filter32_init(const int32_t  value)
{
    filter32_state = (UINT32_C(1) << 31) - value + (value << filter32_shift);
}

int32_t  filter32(const int32_t  sample)
{
    filter32_state += sample;
    uint_fast32_t  result = ((filter32_state + (1 << (filter32_shift - 1))) >> filter32_shift) - (1 << (31 - filter32_shift));
    filter32_state -= result;
    return (int32_t)result;
}
for example arm-gcc-7.3 -O2 produces
Code:
filter32_init:
        rsb     r0, r0, r0, lsl #4
        ldr     r3, .L3
        add     r0, r0, #-2147483648
        str     r0, [r3]
        bx      lr
.L3:
        .word   filter32_state

filter32:
        ldr     r2, .L6
        ldr     r3, [r2]
        add     r3, r0, r3
        add     r0, r3, #8
        lsr     r0, r0, #4
        add     r0, r0, #-134217728
        sub     r3, r3, r0
        str     r3, [r2]
        bx      lr
.L6:
        .word   filter32_state
and marking the functions static inline should mean the per-sample filtering overhead is minimal, just a handful of cycles.

On the other hand, on avr-gcc-5.4.0, we get
Code:
filter32_init(long):
        push r12
        push r13
        push r14
        push r15
        push r16
        push r17
        mov r16,r22
        mov r17,r23
        mov r18,r24
        mov r19,r25
        lsl r16
        rol r17
        rol r18
        rol r19
        lsl r16
        rol r17
        rol r18
        rol r19
        lsl r16
        rol r17
        rol r18
        rol r19
        lsl r16
        rol r17
        rol r18
        rol r19
        mov r12,r16
        mov r13,r17
        mov r14,r18
        mov r15,r19
        sub r12,r22
        sbc r13,r23
        sbc r14,r24
        sbc r15,r25
        sts filter32_state,r12
        sts filter32_state+1,r13
        sts filter32_state+2,r14
        sts filter32_state+3,r15
        pop r17
        pop r16
        pop r15
        pop r14
        pop r13
        pop r12
        ret

filter32(long):
        push r16
        push r17
        lds r16,filter32_state
        lds r17,filter32_state+1
        lds r18,filter32_state+2
        lds r19,filter32_state+3
        add r16,r22
        adc r17,r23
        adc r18,r24
        adc r19,r25
        mov r23,r19
        mov r22,r18
        mov r21,r17
        mov r20,r16
        subi r20,-8
        sbci r21,-1
        sbci r22,-1
        sbci r23,-1
        mov r25,r23
        mov r24,r22
        mov r23,r21
        mov r22,r20
        ldi r20,4
        1:
        lsr r25
        ror r24
        ror r23
        ror r22
        dec r20
        brne 1b
        sub r16,r22
        sbc r17,r23
        sbc r18,r24
        sbc r19,r25
        sts filter32_state,r16
        sts filter32_state+1,r17
        sts filter32_state+2,r18
        sts filter32_state+3,r19
        pop r17
        pop r16
        ret

filter32_state:
        .zero   4
which is not really something I'd like to use.

The equivalent, but limited to a (16 - filter16_shift)-bit sample range, with minimum sample value -filter16_offset (you can set that to whatever 16-bit value you need before initializing the filter), is
Code:
#include <stdint.h>

static const uint_fast8_t  filter16_shift = 4;
static const int16_t       filter16_offset = 1 << (15 - filter16_shift);
static const uint16_t      filter16_half = 1 << (filter16_shift - 1);
uint16_t                   filter16_state;

void  filter16_init(const int16_t  value)
{
    filter16_state = ((value + filter16_offset) << filter16_shift) - value;
}

int16_t  filter16(const int16_t  sample)
{
    filter16_state += sample;
    uint_fast16_t  result = ((filter16_state + filter16_half) >> filter16_shift) - filter16_offset;
    filter16_state -= result;
    return (int16_t)result;
}
which compiles on avr-gcc-5.4 -O2 to a much nicer
Code:
filter16_init(int):
        mov r18,r24
        mov r19,r25
        subi r19,-8
        swap r18
        swap r19
        andi r19,0xf0
        eor r19,r18
        andi r18,0xf0
        eor r19,r18
        mov r20,r18
        mov r21,r19
        sub r20,r24
        sbc r21,r25
        sts filter16_state+1,r21
        sts filter16_state,r20
        ret

filter16(int):
        lds r18,filter16_state
        lds r19,filter16_state+1
        add r18,r24
        adc r19,r25
        mov r24,r18
        mov r25,r19
        adiw r24,8
        swap r25
        swap r24
        andi r24,0x0f
        eor r24,r25
        andi r25,0x0f
        eor r24,r25
        subi r25,8
        sub r18,r24
        sbc r19,r25
        sts filter16_state+1,r19
        sts filter16_state,r18
        ret

filter16_state:
        .zero   2
which is much, much preferable.

I did verify the above produce the correct results, but since this is just idle exploration, brainfarts/typos may have slipped in.
 
In C (as opposed to C++), one can use the following pattern. First, filter16.h:
Code:
#if !defined(FILTER_NAME)
#error Define FILTER_NAME as the name of the filter.
#elif !defined(FILTER_MIN_SAMPLE) || !defined(FILTER_MAX_SAMPLE) || (FILTER_MIN_SAMPLE) >= (FILTER_MAX_SAMPLE)
#error Define FILTER_MIN_SAMPLE and FILTER_MAX_SAMPLE as the minimum and maximum samples to be filtered, inclusive.
#elif !defined(FILTER_SHIFT)
#error Define FILTER_SHIFT as the power of two the filter implements.
#elif FILTER_SHIFT < 1 || FILTER_SHIFT > 14
#error FILTER_SHIFT must be a small positive number!
#elif (((FILTER_MAX_SAMPLE - (FILTER_MIN_SAMPLE)) << FILTER_SHIFT) > 65535)
#error Sample range and filter shift exceed 16 bits.
#elif FILTER_MIN_SAMPLE < -32768
#error Minimum sample value is too small.
#elif (FILTER_MAX_SAMPLE - (FILTER_MIN_SAMPLE)) > 65535
#error Sample range does not fit in 16 bits.
#else

#define MERGE2_(a, b)  a ## b
#define MERGE2(a, b)   MERGE2_(a, b)

#define FILTER_STATE   MERGE2(FILTER_NAME, _state)
#define FILTER_INIT    MERGE2(FILTER_NAME, _init)

#if FILTER_MIN_SAMPLE < 0
#define FILTER_SAMPLE_TYPE  int16_t
#if FILTER_MAX_SAMPLE > 32767
#error Your sample values do not fit in 16 bits.
#endif
#else
#define FILTER_SAMPLE_TYPE  uint16_t
#if FILTER_MAX_SAMPLE > 65535
#error Your sample values do not fit in 16 bits.
#endif
#endif

static uint16_t  FILTER_STATE;

static inline void FILTER_INIT (FILTER_SAMPLE_TYPE value)
{
    FILTER_STATE = ((value - (FILTER_MIN_SAMPLE)) << FILTER_SHIFT) - value;
}

static inline FILTER_SAMPLE_TYPE FILTER_NAME (FILTER_SAMPLE_TYPE value)
{
    FILTER_STATE += value;
    uint16_t  result = ((FILTER_STATE + (1 << (FILTER_SHIFT - 1))) >> FILTER_SHIFT) + (FILTER_MIN_SAMPLE);
    FILTER_STATE -= result;
    return (FILTER_SAMPLE_TYPE)result;
}

#undef   FILTER_SAMPLE_TYPE
#undef   FILTER_STATE
#undef   FILTER_INIT
#undef   MERGE2_
#undef   MERGE2
#endif
#undef   FILTER_MIN_SAMPLE
#undef   FILTER_MAX_SAMPLE
#undef   FILTER_SHIFT
#undef   FILTER_NAME
Then, if one wants a 16-bit filter with shift 5 that can handle samples from 0 to 2047, inclusive, and another with shift 3 that can handle samples from -15 to 4402, inclusive:
Code:
#include <stdint.h>

#define  FILTER_NAME        filter1
#define  FILTER_MIN_SAMPLE  0
#define  FILTER_MAX_SAMPLE  2047
#define  FILTER_SHIFT       5
#include "filter16.h"

#define  FILTER_NAME        filter2
#define  FILTER_MIN_SAMPLE  -15
#define  FILTER_MAX_SAMPLE  4402
#define  FILTER_SHIFT       3
#include "filter16.h"
The first #include causes the state variable filter1_state and functions filter1_init() and filter1() to be defined, and the second filter2_state, filter2_init(), and filter2(). They are local to the current compilation unit (because the functions are marked static inline), but that way the compiler should have the best opportunities for optimization.
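Usage then looks something like this (hypothetical values; it assumes the two #include lines above are in the same file):

Code:
#include <stdio.h>

int main(void)
{
    filter1_init(512);                        /* seed the state from the first reading */
    printf("%u\n", (unsigned)filter1(520));   /* later readings get smoothed */

    filter2_init(-10);
    printf("%d\n", (int)filter2(40));         /* output moves from -10 towards 40 */
    return 0;
}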

This may look like gobbledygook, but is actually just standard C preprocessor shenanigans. I'm not sure why anyone would use this instead of C++ templates, though, except that I do like how I can do quite careful compile-time checks: this will error out if the specified sample range or shift is problematic. You could easily add bounds checking/limiting, for example, if say FILTER_BOUNDS is defined when the file is included.

(If it matters, I consider this so obvious it is not copyrightable, but if anyone disagrees, then it is licensed under CC0-1.0, i.e. in Public Domain.)
 