Many functions require the calculation of the sum of squared
deviations about a mean. To do this accurately, Microsoft Office Excel 2003 and later versions of Excel use a two-pass
procedure that finds the mean on the first pass, and then calculates squared
deviations about the mean on the second pass.
In precise arithmetic, the same
result occurs in earlier versions of Excel that use the "calculator formula."
This formula is so named because it was in widespread use when statisticians used calculators
instead of computers. With the calculator formula, earlier versions of Excel sum the
squares of the observations, and then subtract from this total the following
quantity:
((sum of observations)^2) / number of observations
This calculation occurs in a single pass through the data.
In
finite precision arithmetic, the calculator formula is subject to roundoff
errors in extreme cases. Excel 2002 and earlier versions of Excel use the calculator formula for
most functions that require a sum of squared deviations about a mean (such as
VAR, STDEV, SLOPE, and PEARSON). However, these versions of Excel also use the
more numerically robust two-pass procedure for the CORREL, COVAR, and DEVSQ
functions.
Experts in statistical computing recommend that you do not use the
calculator formula. The calculator formula is presented as "how not to do it"
in texts about statistical computing. Unfortunately, all three of the Analysis
ToolPak (ATP) ANOVA tools make widespread use of the calculator formula or an
equivalent single-pass approach in Excel 2002 and in earlier versions of Excel.
Excel 2003 and later
versions of Excel use the two-pass procedure for all three ATP ANOVA models. This article
discusses the following computational improvements in ATP's three ANOVA models:
- Single Factor
- Two-Factor with Replication
- Two-Factor without Replication
This article discusses these models later.
Because
Excel has always used the two-pass procedure with DEVSQ, this article makes
frequent use of it to describe the improved procedures. These revised
procedures either effectively call DEVSQ or use code whose functionality is
exactly the same as DEVSQ's functionality.
For each ANOVA tool, ATP
output contains a Summary table with values of Count, Sum, Average, and
Variance, and an ANOVA table that has various sums of squares and values of SS,
df, MS, F and P-value. Results in the summary table are calculated by calling
Excel functions COUNT, SUM, AVERAGE, and VAR. Of these four functions, only VAR
is subject to roundoff errors.
Excel 2002 and earlier versions of Excel implement VAR
by using the calculator formula. The following article about VAR describes the improvements
that occurred in Excel 2003 and in later versions of Excel. This article also lets you experiment with numeric data to see
when roundoff errors are likely to occur in earlier
versions of Excel.
For more information about VAR, click the following article number to view the article in the Microsoft Knowledge Base:
826112�
Excel statistical functions: VAR
As this article discusses the three ANOVA models,
it focuses on the ANOVA output tables. In each case, the Summary
tables are well-behaved in Excel 2003 and in later versions of Excel. In Excel 2002 and in earlier versions of Excel, problems
occur in the Variance column when data have extreme values.
However, this
article includes the Summary tables in the model sections because these tables are
useful for comparison when you review the modified examples in the
Appendix.
Model 1: Single Factor
A simple example with data is as follows.
Collapse this tableExpand this table
ANOVA 1 BASIC
MODEL: | | | | | | |
1 | 2 | 3 | | | | |
2 | 4 | 4 | | | | |
3 | 6 | 5 | | | | |
4 | 8 | 6 | | | | |
5 | | 7 | | | | |
6 | | 8 | | | | |
Anova: Single
Factor | | | | | | |
| | | | | | |
SUMMARY | | | | | | |
Groups | Count | Sum | Average | Variance | | |
Column
1 | 6 | 21 | 3.5 | 3.5 | | |
Column
2 | 4 | 20 | 5 | 6.666667 | | |
Column
3 | 6 | 33 | 5.5 | 3.5 | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Between
Groups | 12.75 | 2 | 6.375 | 1.506818 | 0.257897 | 3.805567 |
Within
Groups | 55 | 13 | 4.230769 | | | |
| | | | | | |
Total | 67.75 | 15 | | | | |
Excel 2002 and earlier versions of Excel use the following pseudocode to
calculate the sums of squares:
GrandSum = 0;
GrandSumOfSqs = 0;
GrandSampleMeanSqrd = 0;
GrandMeanSqrd = 0;
GrandSampleSize = 0;
For s = 1 to Number_of_Samples do
GrandSum = GrandSum + sum of observations in s-th sample;
GrandSumOfSqs = GrandSumOfSqs + sum of squared observations in s-th sample;
GrandSampleMeanSqrd = GrandSampleMeanSqrd +
(sum of observations in s-th sample^2)/size of s-th sample;
GrandSampleSize = GrandSampleSize + size of s-th sample
Endfor;
GrandMeanSqrd = (GrandSum^2) / GrandSampleSize;
TotalSS = GrandSumOfSqs � GrandMeanSqrd;
BetweenGroupsSS = GrandSampleMeanSqrd � GrandMeanSqrd;
WithinGroupsSS = GrandSumOfSqs � GrandSampleMeanSqrd;
This approach is essentially the calculator formula. This approach computes the sums of
squares of observations, and then subtracts a quantity from them, just as VAR
computes the sum of squares of the observations, and then subtracts
sum of observations^2/
sample
size. Similar pseudocode for the model 2 and model 3 has been
omitted.
Again, for model 2 and model 3, sums of squares are calculated and a
quantity is subtracted from the sum of squares as in the calculator formula.
Unfortunately, basic statistics texts frequently suggest approaches for ANOVA
such as the one that is shown earlier in this article.
Excel 2003 and later versions of Excel use a
different approach to calculate the various entries in the SS column of the
ANOVA table. For illustration, this article assumes that the numeric data in
the earlier example appear in cells A2:C7 with missing data in cells B6 and B7.
- Total SS is just DEVSQ applied to all the data, such as
DEVSQ(A2:C7). DEVSQ works correctly even though data is missing.
- Between Groups SS is Total SS minus the sum of DEVSQ
applied to each column, such as DEVSQ(A2:A7) + DEVSQ(B2:B7) +
DEVSQ(C2:C7).
- Within Groups SS is Total SS minus Between Groups
SS.
If entries in the SS column of the ANOVA table are calculated
correctly, the accuracy of the other entries in the table follow.
Model 2: Two-Factor with Replication
A simple example with data is as follows.
Collapse this tableExpand this table
ANOVA 2 BASIC MODEL | group 1 | group
2 | group 3 | | | |
trial
1 | 1 | 2 | 3 | | | |
| 2 | 4 | 4 | | | |
| 3 | 6 | 5 | | | |
trial
2 | 4 | 8 | 6 | | | |
| 5 | 10 | 7 | | | |
| 6 | 12 | 8 | | | |
Anova: Two-Factor With
Replication | | | | | | |
| | | | | | |
SUMMARY | group 1 | group 2 | group
3 | Total | | |
trial
1 | | | | | | |
Count | 3 | 3 | 3 | 9 | | |
Sum | 6 | 12 | 12 | 30 | | |
Average | 2 | 4 | 4 | 3.333333 | | |
Variance | 1 | 4 | 1 | 2.5 | | |
| | | | | | |
trial
2 | | | | | | |
Count | 3 | 3 | 3 | 9 | | |
Sum | 15 | 30 | 21 | 66 | | |
Average | 5 | 10 | 7 | 7.333333 | | |
Variance | 1 | 4 | 1 | 6.25 | | |
| | | | | | |
Total | | | | | | |
Count | 6 | 6 | 6 | | | |
Sum | 21 | 42 | 33 | | | |
Average | 3.5 | 7 | 5.5 | | | |
Variance | 3.5 | 14 | 3.5 | | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Sample | 72 | 1 | 72 | 36 | 6.22E-05 | 4.747221 |
Columns | 37 | 2 | 18.5 | 9.25 | 0.003709 | 3.88529 |
Interaction | 9 | 2 | 4.5 | 2.25 | 0.147973 | 3.88529 |
Within | 24 | 12 | 2 | | | |
| | | | | | |
Total | 142 | 17 | | | | |
Again, if entries in the SS column are calculated correctly,
the accuracy of all the other entries in the ANOVA part of the output
follows.
Here is the computational procedure for Excel 2003 and for later versions of Excel. This procedure uses DEVSQ
to calculate the various entries in the SS column of the ANOVA table. For
illustration, this example assumes that the numeric data appear in cells B2:D7.
- Total SS is just DEVSQ applied to all the data, such as
DEVSQ(B2:D7).
- Sample SS is Total SS minus the sum of DEVSQ applied to
each sample, such as DEVSQ(B2:D4) + DEVSQ(B5:D7).
- Columns SS is Total SS minus the sum of DEVSQ applied to
each column, such as DEVSQ(B2:B7) + DEVSQ(C2:C7) + DEVSQ(D2:D7).
- Within SS is the sum of DEVSQ applied to each trial or
group pair, such as DEVSQ(B2:B4) + DEVSQ(C2:C4) + DEVSQ(D2:D4) + DEVSQ(B5:B7) +
DEVSQ(C5:C7) + DEVSQ(D5:D7).
- Interaction SS equals Total SS minus Sample SS minus
Columns SS minus Within SS.
Model 3: Two-Factor without Replication
A simple example with data is as follows.
Collapse this tableExpand this table
ANOVA 3 BASIC MODEL: | LOW | MED
| HI | | | |
POOR | 1 | 2 | 3 | | | |
| 2 | 4 | 4 | | | |
| 3 | 6 | 5 | | | |
MID
CLASS | 4 | 8 | 6 | | | |
| 5 | 10 | 7 | | | |
| 6 | 12 | 8 | | | |
RICH | 7 | 14 | 10 | | | |
| 8 | 12 | 6 | | | |
| 9 | 10 | 2 | | | |
| | | | | | |
Anova: Two-Factor Without
Replication | | | | | | |
| | | | | | |
SUMMARY | Count | Sum | Average | Variance | | |
POOR | 3 | 6 | 2 | 1 | | |
| 3 | 10 | 3.333333 | 1.333333 | | |
| 3 | 14 | 4.666667 | 2.333333 | | |
MID
CLASS | 3 | 18 | 6 | 4 | | |
| 3 | 22 | 7.333333 | 6.333333 | | |
| 3 | 26 | 8.666667 | 9.333333 | | |
RICH | 3 | 31 | 10.33333 | 12.33333 | | |
| 3 | 26 | 8.666667 | 9.333333 | | |
| 3 | 21 | 7 | 19 | | |
| | | | | | |
LOW
| 9 | 45 | 5 | 7.5 | | |
MED
| 9 | 78 | 8.666667 | 16 | | |
HI | 9 | 51 | 5.666667 | 6.25 | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Rows | 176.6667 | 8 | 22.08333 | 5.76087 | 0.001476 | 2.591094 |
Columns | 68.66667 | 2 | 34.33333 | 8.956522 | 0.002455 | 3.633716 |
Error | 61.33333 | 16 | 3.833333 | | | |
| | | | | | |
Total | 306.6667 | 26 | | | | |
If the values in the SS column are calculated correctly, the
accuracy of all the other values in the ANOVA table follows.
Excel 2003 and later versions of Excel use the following computational procedure. The procedure uses DEVSQ to
calculate the values in the SS column of the ANOVA table. For illustration,
this example assumes that the range of cells shown in the earlier example is
cells A1:D10. Therefore, the numeric data appear in cells B2:D10.
- Total SS is just DEVSQ applied to all the data, such as
DEVSQ(B2:D10).
- Rows SS is Total SS minus the sum of DEVSQ applied to each
row, such as DEVSQ(B2:D2) + DEVSQ(B3:D3) + DEVSQ(B4:D4) + DEVSQ(B5:D5) +
DEVSQ(B6:D6) + DEVSQ(B7:D7) + DEVSQ(B8:D8) + DEVSQ(B9:D9) +
DEVSQ(B10:D10).
- Columns SS is Total SS minus the sum of DEVSQ applied to
each column, such as DEVSQ(B2:B10) + DEVSQ(C2:C10) + DEVSQ(D2:D10).
- Error SS is Total SS minus Rows SS minus Columns
SS.
Results in Excel 2002 and in earlier versions of Excel
In extreme cases where there are many significant digits in the
data but also a small variance, the calculator formula leads to inaccurate
results. The Appendix that appears later in this article gives examples of roundoff problems
in such extreme situations.
Results in Excel 2003 and in later versions of Excel
Excel 2003 and later versions of Excel use a procedure that makes two passes through the
data. On the first pass, Excel 2003 and later versions of Excel calculate the sum and count of the data
values. From these, Excel can calculate the sample mean (average).
On the second
pass, Excel calculates the squared difference between each data point and the
sample mean, and then sums these squared differences. As a result, the results
in Excel 2003 and in later versions of Excel are more stable numerically.
Conclusions
A two-pass approach improves the numeric performance in all three
ATP ANOVA tools in Excel 2003 and in later versions of Excel as compared to earlier versions of Excel. The results that you obtain by using Excel 2003 and later versions of Excel are never less accurate than the results that you obtain by using earlier
versions of Excel.
In most practical cases, however, there is no difference
between these results. This is
because data do not typically exhibit the kind of unusual behavior that the
following Appendix illustrates. Numeric instability is most likely to occur in
earlier versions of Excel when data contains a high number of significant
digits with relatively little variation between data values.
If you
use an earlier version of Excel, and if you want to see whether Excel 2003 or a later version of Excel gives you
different ANOVA results, compare the results that you obtain when you use the ANOVA tools in your
earlier version of Excel with the results that you obtain when you use the procedures that use DEVSQ.
Note The procedures that use DEVSQ were
described earlier in this article for the ANOVA table that is associated with each of
the tools.
To verify that Variances are correct in the Summary table for each
range, use
DEVSQ(
range)/(COUNT(
range)
� 1).
Appendix: Numeric examples of the performance of Excel 2002 and earlier versions of Excel
For each basic example from models 1, 2, and 3, this article
previously presented the ATP tool's output. This included the Summary and ANOVA tables. Data was modified in each example to create a "stressed" example. This
is done by adding 10^8 to each data value. Adding a constant such as 10^8 to
each data value does not affect Variance in the Summary table (but will affect
Average and Sum in obvious ways). It should also not affect any entry in the
ANOVA table.
If you compare Variances in the Summary tables and SS in
the ANOVA tables, you will notice that all of these are incorrectly calculated
in all three of the following stressed models except for one entry in model 3
that is pointed to with "
<---".
In all the stressed cases, the ANOVA results that you obtain by using Excel 2003 and later versions of Excel agree with
the earlier results in the basic cases (as they should).
ANOVA 1 stressed model with large data values
Collapse this tableExpand this table
100000001 | 100000002 | 100000003 | | | | |
100000002 | 100000004 | 100000004 | | | | |
100000003 | 100000006 | 100000005 | | | | |
100000004 | 100000008 | 100000006 | | | | |
100000005 | | 100000007 | | | | |
100000006 | | 100000008 | | | | |
| | | | | | |
Anova: Single
Factor | | | | | | |
| | | | | | |
SUMMARY | | | | | | |
Groups | Count | Sum | Average | Variance | | |
Column
1 | 6 | 600000021 | 1E+08 | 4.8 | | |
Column
2 | 4 | 400000020 | 1E+08 | 8 | | |
Column
3 | 6 | 600000033 | 1E+08 | 1.6 | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Between
Groups | 0 | 2 | 0 | 0 | 1 | 3.805567 |
Within
Groups | 64 | 13 | 4.923077 | | | |
| | | | | | |
Total | 64 | 15 | | | | |
ANOVA 2 stressed model with large data values
Collapse this tableExpand this table
| group 1 | group 2 | group
3 | | | |
trial
1 | 100000001 | 100000002 | 100000003 | | | |
| 100000002 | 100000004 | 100000004 | | | |
| 100000003 | 100000006 | 100000005 | | | |
trial
2 | 100000004 | 100000008 | 100000006 | | | |
| 100000005 | 100000010 | 100000007 | | | |
| 100000006 | 100000012 | 100000008 | | | |
Anova: Two-Factor With
Replication | | | | | | |
| | | | | | |
SUMMARY | group 1 | group 2 | group
3 | Total | | |
trial
1 | | | | | | |
Count | 3 | 3 | 3 | 9 | | |
Sum | 300000006 | 300000012 | 300000012 | 9E+08 | | |
Average | 100000002 | 100000004 | 100000004 | 1E+08 | | |
Variance | 0 | 4 | 0 | 4 | | |
| | | | | | |
trial
2 | | | | | | |
Count | 3 | 3 | 3 | 9 | | |
Sum | 300000015 | 300000030 | 300000021 | 9E+08 | | |
Average | 100000005 | 100000010 | 100000007 | 1E+08 | | |
Variance | 0 | 4 | 0 | 6 | | |
| | | | | | |
Total | | | | | | |
Count | 6 | 6 | 6 | | | |
Sum | 600000021 | 600000042 | 600000033 | | | |
Average | 100000004 | 100000007 | 100000005.5 | | | |
Variance | 4.8 | 14.4 | 1.6 | | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Sample | 64 | 1 | 64 | 24 | 0.000367 | 4.747221 |
Columns | 32 | 2 | 16 | 6 | 0.015625 | 3.88529 |
Interaction | 32 | 2 | 16 | 6 | 0.015625 | 3.88529 |
Within | 32 | 12 | 2.666666667 | | | |
| | | | | | |
Total | 128 | 17 | | | | |
ANOVA 3 stressed model with large data values
Collapse this tableExpand this table
| LOW | MED
| HI | | | |
POOR | 100000001 | 100000002 | 100000003 | | | |
| 100000002 | 100000004 | 100000004 | | | |
| 100000003 | 100000006 | 100000005 | | | |
MID
CLASS | 100000004 | 100000008 | 100000006 | | | |
| 100000005 | 100000010 | 100000007 | | | |
| 100000006 | 100000012 | 100000008 | | | |
RICH | 100000007 | 100000014 | 100000010 | | | |
| 100000008 | 100000012 | 100000006 | | | |
| 100000009 | 100000010 | 100000002 | | | |
| | | | | | |
Anova: Two-Factor Without
Replication | | | | | | |
| | | | | | |
SUMMARY | Count | Sum | Average | Variance | | |
Row
1 | 3 | 300000006 | 100000002 | 0 | | |
Row
2 | 3 | 300000010 | 100000003 | 2 | | |
Row
3 | 3 | 300000014 | 100000005 | 2 | | |
Row
4 | 3 | 300000018 | 100000006 | 4 | <--- | |
Row
5 | 3 | 300000022 | 100000007 | 6 | | |
Row
6 | 3 | 300000026 | 100000009 | 10 | | |
Row
7 | 3 | 300000031 | 100000010 | 12 | | |
Row
8 | 3 | 300000026 | 100000009 | 10 | | |
Row
9 | 3 | 300000021 | 100000007 | 18 | | |
| | | | | | |
Column
1 | 9 | 900000045 | 100000005 | 8 | | |
Column
2 | 9 | 900000078 | 100000009 | 14 | | |
Column
3 | 9 | 900000051 | 100000006 | 4 | | |
| | | | | | |
| | | | | | |
ANOVA | | | | | | |
Source of
Variation | SS | df | MS | F | P-value | F
crit |
Rows | 128 | 8 | 16 | 2 | 0.113281 | 2.591094 |
Columns | 32 | 2 | 16 | 2 | 0.167772 | 3.633716 |
Error | 128 | 16 | 8 | | | |
| | | | | | |
Total | 288 | 26 | | | | |