Code for the ATP has not been edited directly except to
introduce improvements in the three ATP ANOVA tools.
For several ATP
tools, numeric performance has been improved for Excel 2003 and for later versions of Excel because the tool
calls an Excel statistical function that has been improved for Excel 2003 and for later versions of Excel. In
cases where results are different for earlier versions of Excel and later versions of Excel,
the values for Excel 2003 and for later versions of Excel are more accurate.
Most users will not notice a
difference in results between different versions of Excel. This is because differences
are generally caused by round-off errors that are significant only in extreme
cases. However, this article must first point out one case where differences
occur because of a wrong formula in Microsoft Excel 2002 and in earlier versions of Excel. Avoid the
tool in those versions.
A second example involves an incorrect formula in Excel
2002 and in earlier versions of Excel that persists in Excel 2003 and in later versions of Excel. Avoid the ATP tool in this
scenario for all versions of Excel.
First, avoid the Regression tool
when you have to click to check the
Constant is Zero check
box. This has been corrected in Excel 2003 and in later versions of Excel. You do not have to avoid the
Regression tool when the
Constant is Zero check box is clear
(the more typical case in practice).
Second, users of all versions of
Excel should avoid the ATP t-Test: Paired Two Sample for Means tool unless you
can guarantee that there are no missing data observations. The tool gives
inappropriate answers (or no answers at all) if there are one or more missing
observations.
For more information about the ATP Matched Pairs Two Sample t-Test
tool, click the following article number to view the article in the Microsoft Knowledge Base:
829252
You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel
If you want to use this tool, and if there is
missing data (or if there is even the chance of missing data), the TTEST function
in Excel will handle it correctly.
Separate sections are provided
later in this article for individual ATP tools. Tools that are not listed have
not been affected by improvements in Excel 2003 and in later versions of Excel.
ANOVA: Single Factor, Two-Factor with Replication, and Two-Factor without Replication
Each of these three ANOVA tools has been rewritten to upgrade the
computational procedure to a two-pass algorithm that is more numerically
robust. These improvements are similar to improvements in statistical functions
that compute sums of squared deviations about a mean (for example: VAR, STDEV,
SLOPE, PEARSON).
For more information about ATP
ANOVA, click the following article number to view the article in the Microsoft Knowledge Base:
829215
Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel
Correlation
This tool has not been changed. However, there is a small
difference between the Correlation tool and the Covariance tool that persists
in all versions of Excel. The Correlation tool returns a lower triangular
correlation table with 1's on the diagonal and correlations off the diagonal.
The tool uses CORREL to compute off-diagonal entries and fills those entries
with the value that is returned by CORREL. (Therefore, if any data entry
changes, no entry in the table changes. Contrast this behavior with the
behavior of Covariance.)
Covariance
This tool returns a lower triangular covariance table with
variances on the diagonal and covariances off the diagonal. Cells on the
diagonal contain a formula "=VARP(...)" so that if a data entry changes, the
result in the table also changes. VARP has been improved for Office Excel 2003 and for later versions of Excel.
For more information about VARP, click the following article number to view the article in the Microsoft Knowledge Base:
826393
Excel statistical functions: VARP
The Covariance tool uses COVAR to compute
off-diagonal entries and fills those entries with the value that is returned by
COVAR. Therefore if a data entry changes, the off-diagonal entries do not
change.
Descriptive Statistics
This tool calls Excel statistical functions for everything that it
computes. Because VAR and STDEV are improved for Excel 2003 and for later versions of Excel, different values
are possible because of round-off errors in extreme cases.
For more information about
VAR, click the following article number to view the article in the Microsoft Knowledge Base:
826112
Excel statistical functions: VAR
F-Test Two-Sample for Variances
Like the Descriptive Statistics tool, this tool calls VAR. Again,
different values are possible because of round-off errors in extreme cases.
Random Number Generation
This tool fills a range with random observations. Values of these
observations are put directly in the cells so that these cell values are not
recomputed and replaced with fresh observations when the sheet is recalculated.
The built-in RAND function in Excel, on the other hand, replaces existing
random numbers with new ones every time the sheet is recalculated. You can use
RAND to preserve values. To do so, copy the results in a range, and then use the
Paste Special command to paste
values into the same range.
The Random Number Generation (RNG) tool
also produces random observations from various probability distributions
whereas RAND corresponds to the single option in the tool: Uniform with range
between 0 and 1. This article describes how to combine RAND with the
statistical functions in Excel to generate such
observations.
Therefore, in terms of functionality, you can emulate
the ATP random number tool by using RAND and some ingenuity. Sometimes this is
useful, particularly when many random numbers are wanted.
For Excel
2002 and earlier, both the ATP random number generator and RAND were known to
perform poorly on standard tests of randomness. Performance was poor because
the length of a cycle before the sequence of pseudo-random numbers starts
repeating was too short. This is an issue only when many random numbers are
required.
RAND has been improved for Excel 2003 and for later versions of Excel so that RAND now passes
all such standard tests. RAND's sequence of random numbers will start repeating
itself after more than 1 trillion numbers are generated.
For more information about
RAND, click the following article number to view the article in the Microsoft Knowledge Base:
828795
Description of the RAND function in Excel
However, the ATP's separate random number
generator was not upgraded. As in the version of RAND that is in Excel 2002 and in earlier versions of Excel, the
ATP's separate random number generator is known to perform poorly on standard
tests of randomness and to have a short repetition cycle. This has negative
implications only if you require a very lengthy sequence of random numbers (for
example, 1 million).
The RNG tool provides random observations from
various probability distributions and the Uniform[0,1], the distribution used
for outputting random numbers through RAND. The ATP tool first draws a
Uniform[0,1] random number (or more than one such number) and then converts the
answer into an observation from one of the following specific distributions.
For the benefit of those who prefer RAND because they are going to generate
many observations, this article suggests formulas that use RAND in the table
below. Following the table, there are some cautionary comments about the
Analysis ToolPak's normal distribution case.
Distribution | Excel formula using RAND() |
Bernoulli(p) | =IF(RAND() <= p, 1, 0) |
Binomial(n,p) | =CRITBINOM(n, p, RAND()) |
Discrete | See below |
Normal(mu, sigma) | =NORMINV(RAND(), mu, sigma) |
Patterned | Not really random |
Poisson(mean) | See below |
Uniform(low, high) | = low + (high � low) * RAND() |
There are two reasons that you might prefer RAND and
the formula in this table instead of the ATP random number tool in the
Normal(mu, sigma) case. First, RAND is a better Uniform[0,1] random number
generator than ATP's. Second, the ATP's tool does not call Excel's NORMINV
function, but instead has its own built-in version of the inverse normal
distribution. This is not as accurate as the version of NORMINV that is in Excel 2003 and in later versions of Excel. It
is inferior both in the accuracy of the normal distribution approximation that
it uses (Excel uses the much improved NORMSDIST function) and in the refinement
of the binary search (Excel carries it much further to guarantee a value that
is closer to NORMINV's probability argument). In a nutshell, using ATP in this
case does not take advantage of improvements for Excel 2003 and for later versions of Excel in the NORMINV,
NORMSDIST, and RAND functions.
For observations from a discrete
distribution, assume that the values are in column B and that their
probabilities are in column C. One would then want to fill each row of column A
with the probability of observing a value strictly less than the value in
column B in that row. Assuming that there were 10 values, assume that this data
is in cells A1:C10. Then, because A1 contains the probability of observing a
value strictly less than the first value, it must be set to 0. You can use
VLOOKUP(RAND(), A1:C10, 2); the fourth argument to VLOOKUP is optional and must
be omitted or set to TRUE. The "2" means that you want to return the value in
the second column (column B in this example).
The ATP uses an
adaptation of the method of generating Poisson observations in Press, W.H.,
S.A. Teukolsky, W. T. Vetterling, and B.P. Flannery,
Numerical Recipes in C, The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992, pp. 293-295. There
are two methods to take easy advantage of existing Excel
functions.
The first uses the observation that a POISSON random
variable with mean m has a distribution that is well approximated by a
BINOMIAL(n, m/n) for large n. You can then call CRITBINOM(n, m/n, RAND()).
Choice of n depends on m; n greater than 1,000 times m should be large
enough.
The second relates the POISSON distribution to the
Exponential. If events occur according to a POISSON process at rate m per unit
time, the time between events has an Exponential distribution with mean 1/m.
For a POISSON observation you can take a sequence of observations from this
Exponential distribution and count how many of them occur before their sum
exceeds 1. To obtain an observation from this Exponential distribution, use
GAMMAINV(RAND(), 1, 1/m). This method would be suitable when m is relatively
close to 0.
Regression
The Regression tool calls Excel's LINEST. The article on LINEST
describes extensive improvements for Excel 2003 and for later versions of Excel.
For more information about
LINEST, click the following article number to view the article in the Microsoft Knowledge Base:
828533
Description of the LINEST function in Excel
If you use Excel 2002 or an earlier version of Excel, note the
same two shortcomings of the ATP Regression tool as of LINEST:
- Regression Sum of Squares, r squared, and f statistic
values are always incorrect for the case where the regression is forced through
the origin.
For LINEST, this means "third argument set to FALSE
instead set to TRUE or omitted." For the ATP tool, it means "Constant is
Zero check box is checked." - LINEST and the ATP tool are insensitive to collinearity
issues. The article on LINEST discusses the computational approach to LINEST in
Excel 2003 and in later versions of Excel that is designed to find collinearity or near-collinearity when it
exists and to act appropriately.
Both of these LINEST shortcomings have been overcome in Excel 2003 and in later versions of Excel. Performance of the ATP Regression tool will be similarly improved. There
were no changes made to the tool's code; it has been improved by calling an
improved Excel function. This author considers the improvement in LINEST to be
the most important of the statistical function improvements.
The
following table shows the Regression tool's output for earlier versions of
Excel and for later versions of Excel with the
Constant is Zero check box
selected. It illustrates the first shortcoming that is mentioned earlier. In
earlier versions of Excel, the Regression sum of squares is negative as is the
R Square value.
X's | Y's | | | | |
1 | 11 | | | | |
2 | 12 | | | | |
3 | 13 | | | | |
| | | | | |
Excel 2002 and earlier
versions | | | | | |
SUMMARY
OUTPUT | | | | | |
| | | | | |
Regression
Statistics | | | | | |
Multiple
R | 65535 | | | | |
R
Square | -20.4285714 | | | | |
Adjusted R
Square | -20.9285714 | | | | |
Standard
Error | 4.629100499 | | | | |
Observations | 3 | | | | |
| | | | | |
ANOVA | | | | | |
| df | SS | MS | F | Significance
F |
Regression | 1 | -40.85714286 | -40.85714286 | -1.90666667 | #NUM! |
Residual | 2 | 42.85714286 | 21.42857143 | | |
Total | 3 | 2 | | | |
| | | | | |
Excel 2003 and later versions of Excel
version | | | | | |
SUMMARY
OUTPUT | | | | | |
| | | | | |
Regression
Statistics | | | | | |
Multiple
R | 0.949342311 | | | | |
R
Square | 0.901250823 | | | | |
Adjusted R
Square | 0.401250823 | | | | |
Standard
Error | 4.629100499 | | | | |
Observations | 3 | | | | |
| | | | | |
ANOVA | | | | | |
| df | SS | MS | F | Significance
F |
Regression | 1 | 391.1428571 | 391.1428571 | 18.25333333 | 0.14637279 |
Residual | 2 | 42.85714286 | 21.42857143 | | |
Total | 3 | 434 | | | |
t-Test: Paired Two Sample for Means
As mentioned earlier, avoid this tool if there is any chance of
one or more missing data values. The prototypical application of this test is
an experiment with measurements on subjects Before and After a treatment (such
as weights Before and After a 60-day diet plan). If there are no missing
observations, the tool will behave fine. If there are different numbers of
missing Before and After observations, you receive an error message and the
tool will not compute anything. If there are missing observations and the
numbers of missing Before and After observations are equal, the tool will
return answers that contain several errors.
Standard procedure is to
remove a subject from the data if either the Before or After measurement is
missing and to analyze the data that contains only those subjects that have
both Before and After measurements. Excel's TTEST function handles missing data
according to this standard procedure.
The other two t-Test tools,
Two-Sample Assuming Equal Variances and Two-Sample Assuming Unequal Variances,
do not share this defect.
z-Test: Two Sample for Means
This article noted that the normal distribution case of the random
number generation tool does not call the NORMSINV function (or more precisely,
NORMINV that calls NORMSINV) but has its own inferior procedure for finding
normal inverse values.
The z-Test tool does call the NORMSINV function
and takes advantage of improvements for Excel 2003 and for later versions of Excel.
Results in earlier versions of Excel
There are ATP tools whose performance has been improved for Excel 2003 and for later versions of Excel because they call Excel statistical functions that have been improved for
Excel 2003 and for later versions of Excel. One of these improvements to LINEST, when its third argument is set
to FALSE, implies that the ATP regression tool returns incorrect results in
Excel 2002 and in earlier versions of Excel when the
Constant is Zero
check box is selected. In other cases where Excel functions have been
improved, users of earlier versions are not likely to notice differences (most
of these differences involve round-off errors in extreme
situations).
The three ATP ANOVA tools have been improved by editing
ATP code to substitute a more numerically robust algorithm (in the same spirit
as the improvement in Excel's VAR). Users of these tools in earlier versions of
Excel are likely to notice differences only in extreme
situations.
Warning for users of all versions: avoid the t-Test:
Paired Two Sample for Means if there is the slightest chance of missing data.
Results in Excel 2003 and in later versions of Excel
Significant improvements have been made in Excel's statistical
functions. This translates into improvements in many ATP tools that call these functions.
There is one ATP tool, the random number generator, that does not take advantage
of an improved RAND function (because it is implemented in a self-contained
manner and does not call RAND). This is unfortunate, but even more unfortunate
is the special case of normally distributed random observations. The inverse
normal distribution is also implemented in a self-contained manner and does not
call the much improved NORMSINV function.
The following table lists
ATP tools and those Excel functions that they call that have been improved for
Excel 2003 and for later versions of Excel. Readers are referred to separate articles on each Excel function
that is called.
ATP tool | Excel functions that are called |
| |
ANOVA: Single Factor | VAR, FINV |
ANOVA: Two-Factor With Replication | VAR, FINV |
ANOVA: Two-Factor Without Replication | VAR, FINV |
Correlation | |
Covariance | |
Descriptive Statistics | STDEV, TINV, VAR |
Exponential Smoothing | |
F-Test Two-Sample for Variances | VAR, FINV |
Fourier Analysis | |
Histogram | |
Moving Average | |
Random Number Generation | |
Rank and Percentile | |
Regression | LINEST |
Sampling | RAND |
t-Test: Paired Two Sample for Means | VAR, PEARSON,
TINV |
t-Test: Two-Sample Assuming Equal Variances | VAR,
TINV |
t-Test: Two-Sample Assuming Unequal Variances | VAR,
TINV |
z-Test: Two Sample for Means | NORMSDIST, NORMSINV |
For all functions that appear in this table other
than LINEST and RAND, you are likely to see differences between earlier versions of Excel and later versions of Excel only because of round-off errors in extreme situations.
LINEST has been much improved, as discussed earlier. RAND has been improved
also. Curiously, the Sampling tool calls RAND, but the RNG tool relies on a
self-contained generator that will give inferior performance when a lengthy
sequence of random observations is required.
Conclusions
Except for changes to ATP code for each of the three ANOVA tools,
ATP code has not been rewritten. However, it benefits from calling improved
Excel functions as shown in the ATP Tool table. Flaws in the t-Test: Paired Two
Sample for Means test have not been fixed for Excel 2003 or for later versions of Excel. The most noticeable
improvements are likely to be in the Regression tool where LINEST no longer
returns incorrect results when the
Constant is Zero check box
is selected and where LINEST is designed to handle collinearity
appropriately.