## Algorithms for the Mean and Variance

Two common summaries of data are the mean and the variance.
Mean = x-bar = sum x_i / n
Variance = s^2 = sum (x_i - x-bar)^2 / (n-1)

As written,
computation of the variance requires two passes through the data,
one to sum the data and compute the mean,
followed by a second pass to find the sum of the squared deviations
from the mean and the variance.

#### The Desk Calculator Algorithm

It is more efficient to find an algorithm which requires
just a single pass through the data.
The desk calculator algorithm is one such choice.
As data is entered,
this algorithm keeps track of the sum of the data and the sum of the squares
of the data.
Straightforward algebra shows that the above expression for the variance
is equivalent to this expression.
s^2 = ( sum (x_i)^2 - (sum x_i)^2 / n ) / (n-1)

However,
this algorithm gives incorrect answers
for ill-conditioned data.
For example, try using the statistics functions
on your calculator to find the standard deviation of the numbers
10,000,001, 10,000,003, and 10,000,005.
Exact computation gives the answer as 2
(the same as the standard deviation of 1, 3, and 5),
but your calculator probably returns the number 0.
The desk calculator algorithm will have problems whenever the coefficient
of variation (s / x-bar) is quite small.
Computers do not store the values of real numbers exactly.
The value called the machine EPSILON is the smallest number
for which 1 + EPSILON is greater than 1 on the computer.
If the square of the coefficient of variation is smaller than the machine
EPSILON,
the desk calculator algorithm will fail.

#### The Method of Provisional Means

Here is a one-pass algorithm which updates the values of the mean
and the sum of the squared deviations from that mean
every time a new data value is read.
Define
m_k = mean of first k numbers
s_k = sum (x_i - m_k)^2, where sum is over first k numbers

By simplifying the expression for m_k - m_{k-1} and s_k - s_{k-1},
we arrive at these difference equations.
m_k = m_{k-1} + (x_k - m_{k-1}) / k
s_k = s_{k-1} + (x_k - m_{k-1}) * (x_k - m_k)

(See Homework #5, Problem 1 for a related question.)
An algorithm built upon these difference equations
can handle many data sets for which the desk calculator algorithm fails
at a price of a small increase in storage and computation.

#### The Subtract the First Number Algorithm

An alternative algorithm is to simply subtract
the first number from each observation and use the desktop
algorithm on this modified data.
The mean will be too small, and needs to be corrected by adding
the first number back on.
The variance is unaffected by the subtraction of a constant
from every value.
s^2 = ( sum (x_i - x_1)^2 - ( sum (x_i - x_1) )^2 / n ) / (n-1)

This method also avoids many numerical difficulties
because the first number is often close in size to most of the numbers
in the data set.

Last modified: March 19, 1997

Bret Larget,
larget@mathcs.duq.edu