To: numpy-discussion
From: Robert Kern
Date: Wed, 20 Sep 2006 03:01:18 -0500
Let me offer a third path: the algorithms used for .mean() and .var() are
substandard. There are much better incremental algorithms that entirely avoid
the need to accumulate such large (and therefore precision-losing) intermediate
values. The algorithms look like the following for 1D arrays in Python:
def mean(a):
m = a[0]
for i in range(1, len(a)):
m += (a[i] - m) / (i + 1)
return m
def var(a):
m = a[0]
t = a.dtype.type(0)
for i in range(1, len(a)):
q = a[i] - m
r = q / (i+1)
m += r
t += i * q * r
t /= len(a)
return t
Alternatively, from Knuth:
def var_knuth(a):
m = a.dtype.type(0)
variance = a.dtype.type(0)
for i in range(len(a)):
delta = a[i] - m
m += delta / (i+1)
variance += delta * (a[i] - m)
variance /= len(a)
return variance
David Cooke replies:
Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation:
def mean(a):
s = e = a.dtype.type(0)
for i in range(0, len(a)):
temp = s
y = a[i] + e
s = temp + y
e = (temp - s) + y
return s / len(a)
Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation.
Upon which Tim Hochberg provides an implementation:
Here's version of mean based on the Kahan sum, including the compensation term that seems to be consistently much better than builtin mean It seems to be typically 4 orders or magnitude closer to the "correct" answer than the builtin mean and 2 orders of magnitude better than just naively using the Kahan sum. I'm using the sum performed with dtype=int64 as the "correct" value.
def rawKahanSum(values):
if not input:
return 0.0
total = values[0]
c = type(total)(0.0)
for x in values[1:]:
y = x - c
t = total + y
c = (t - total) - y
total = t
return total, c
def kahanSum(values):
total, c = rawKahanSum(values)
return total
def mean(values):
total, c = rawKahanSum(values)
n = float(len(values))
return total / n - c / n
for i in range(100):
data = random.random([10000]).astype(float32)
baseline = data.mean(dtype=float64)
delta_builtin_mean = baseline - data.mean()
delta_compensated_mean = baseline - mean(data)
delta_kahan_mean = baseline - (kahanSum(data) / len(data))
if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
print ">>>",
print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean
Tim Hochberg mentions:
Again, my tests show otherwise for float32. I'll condense my ipython log into a
module for everyone's perusal. It's possible that the Kahan summation of the
squared residuals will work better than the current two-pass algorithm and the
implementations I give above.
David Cooke concludes:
- If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only.
- for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-)