For Excel 2003, improvements have been made in the following
six areas:
- LINEST and related functions.
- NORMSDIST and related functions.
- Functions that involve sums of squares (for example, VAR
and STDEV).
- Continuous distribution inverse functions (for example,
CHIINV, FINV, NORMSINV, and TINV).
- Functions for discrete distributions (for example,
BINOMDIST and POISSON).
- Random number generator (RAND).
Separate sections cover each of these six areas. In areas three,
four, and five, the improvements to the several functions involved share a
common theme. In areas one through four, improving one function also improved
other functions that essentially call it.
Because improvements were made for Excel 2003, you may naturally
wonder about the risk, and the magnitude, of inaccuracies in the statistical
functions of earlier versions of Excel. These issues are discussed
briefly in each of the six sections. The six areas are ranked by this author's
perception of their importance. While most users do not have to be the least
bit alarmed by performance in Microsoft Excel 2002 and earlier, problems with
LINEST (area one) are much more likely to surface than problems with BINOMDIST
(area five) or RAND (area six).
There is one shortcoming in Excel 2002
and earlier that was identified and not fixed for Excel 2003. The prototypical
application of the Analysis ToolPak's tool for t-Test: Paired Two Sample for
Means is one where data includes measurements on the same set of subjects
before and after a treatment (for example, weights before and after a diet
program). This tool does not compute a result if the numbers of missing
before and after observations are unequal; it computes incorrect answers if
there are missing observations and the numbers of missing before and after
observations are equal. Unless you can guarantee that there are no missing observations, the
TTEST function in Excel is preferred (because it handles missing observations
correctly).
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
829252
You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel 2003
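The correct handling of missing pairs can be illustrated with a short sketch (illustrative Python, not Excel's code): a pair is used only when both of its values are present, which is the behavior attributed to TTEST above.

```python
import math

def paired_t(before, after):
    """Paired t statistic with pairwise deletion: a pair is used only if
    both its before and after values are present (None marks a missing
    observation).  This mirrors the correct handling of missing data."""
    diffs = [b - a for b, a in zip(before, after)
             if b is not None and a is not None]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1   # (t statistic, degrees of freedom)

# Hypothetical weights before and after a diet program, with one
# missing value on each side (in different subjects).
before = [200.0, 190.0, None, 210.0, 205.0]
after  = [192.0, 185.0, 188.0, None, 199.0]
t, df = paired_t(before, after)   # only the 3 complete pairs are used
```

A tool that instead deletes the missing values independently from each column would silently misalign the remaining pairs, which is the failure mode described above.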
Two primary references that point out
numeric inaccuracies in Microsoft Excel 97 are papers by Knusel (see note 1)
for areas two, four, and five, and by McCullough and Wilson (see note 2) for
areas one, three, and six. Improvements to the statistical functions in
Microsoft Excel 2000 and Microsoft Excel 2002 were relatively minor.
LINEST and related functions
Two major improvements were made to LINEST code. The first
replaced an inappropriate formula for total sum of squares in the case where
the third argument of LINEST was set to FALSE indicating that you do not want
LINEST to model an intercept (or constant). In practice, this option is
selected in a small proportion of cases. When calling LINEST(known_y's,
known_x's, FALSE, TRUE) to return an array with five rows that contain detailed
information (f-statistic, r squared, regression and residual or error sums of
squares), the inappropriate formula in earlier versions of Excel led to
incorrect results in the last three rows of the output array. Regression
coefficients and their standard errors were not affected. This incorrect
formula can lead to negative r squared and negative regression sum of squares.
This has been corrected. The article on LINEST suggests a workaround for Excel
2002 and earlier that generates correct values for the last three
rows of the output array. This was a serious error, because results in those
three rows were always incorrect when LINEST was called with its third argument
set to FALSE.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
828533
Description of the LINEST function in Excel 2003 and in Excel 2004 for Mac
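To see why the formula choice matters, here is a minimal sketch (Python with made-up data; not Excel's code) of a no-intercept fit. With no intercept, the total sum of squares must be the uncorrected sum of squares of y; the corrected formula (deviations about the mean) that earlier Excel versions applied can make the total smaller than the residual sum of squares, driving r squared negative.

```python
import numpy as np

# Hypothetical known_x's and known_y's.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Fit y = b*x with no intercept (LINEST's third argument set to FALSE).
b = np.sum(x * y) / np.sum(x * x)
ss_resid = np.sum((y - b * x) ** 2)

# Correct total sum of squares for a no-intercept model: the
# UNCORRECTED sum of squares of y.
ss_total = np.sum(y ** 2)

# The formula earlier Excel versions applied (deviations about the
# mean) is appropriate only when an intercept is modeled; for some
# data sets it is smaller than ss_resid, which drives r squared and
# the regression sum of squares negative.
ss_total_wrong = np.sum((y - np.mean(y)) ** 2)

r_squared = 1 - ss_resid / ss_total
```

With the uncorrected total, r squared always lands between 0 and 1, as it should.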
Notes
- Knusel, L. "On the Accuracy of Statistical Distributions in Microsoft
Excel 97," Computational Statistics and Data Analysis, 26, 375-377,
1998.
- McCullough, B.D., & Wilson, B. "On the Accuracy of Statistical Procedures
in Microsoft Excel 97," Computational Statistics and Data Analysis, 31, 27-37,
1999.
McCullough and Wilson (McC and W) pointed out (correctly) that
Excel 97 was insensitive to issues of collinear (or almost collinear) predictor
columns. A set of predictor columns is collinear if one of them can be
discarded without impacting the goodness of fit of the regression model. In
other words, the information that is contained in a discarded column adds no
value because the discarded column can be reconstructed from the remaining
columns. Test examples in McC and W included collinearity and near
collinearity. Using a solution approach that ignored issues of collinearity led
to inaccurate results when collinearity was present. Such results included a
reduced number of significant digits of accuracy in regression coefficients and
an inappropriate number of degrees of freedom. For Excel 2003, the solution
approach was changed to use QR Decomposition. The result is much improved
accuracy on the notoriously difficult test cases used by McC and W.
Collinearity is discussed in the article on LINEST; the article also provides a
small example for walking through the QR Decomposition algorithm. This
deficiency in all versions of Excel that are earlier than Excel 2003 will have
a serious effect only when there are collinear predictors. One common case
where collinearity is guaranteed not to occur is if there is a single predictor
column and not all values in that column are the same.
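The article does not reproduce the algorithm, but the idea can be sketched with numpy (an illustration, not Excel's implementation): a QR factorization exposes a redundant predictor column through a near-zero diagonal entry of R, so a solver can reduce the degrees of freedom instead of producing unstable coefficients.

```python
import numpy as np

# Hypothetical design matrix: the third predictor column is the sum of
# the first two, so the columns are collinear.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])

# QR decomposition factors X into an orthogonal Q and an
# upper-triangular R.
Q, R = np.linalg.qr(X)

# A (near-)zero diagonal entry of R flags a column that can be
# reconstructed from the columns before it.
tol = 1e-10 * abs(R[0, 0])
rank = int(np.sum(np.abs(np.diag(R)) > tol))
```

Here the computed rank is 2, not 3: one column adds no information, exactly the situation described above.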
Improving
LINEST has a good effect on functions that call it: LOGEST, TREND, and GROWTH
in addition to the Linear Regression tool in the Analysis ToolPak. The charting
tool in Excel allows you to fit a straight line to a plot of data points by
using linear regression. Unfortunately, the linear regression capabilities
in the charting tool have not been upgraded in Excel 2003.
In summary,
if you use a version of Excel that is earlier than Excel 2003, you must reject
detailed LINEST results in the last three rows of the five row output table in
LINEST if LINEST was called with its third argument set to FALSE. Also be aware
of the chance of collinearity; for most data sets, collinearity should not be a
problem.
NORMSDIST and related functions
NORMSDIST(z) must be evaluated by using an approximation
procedure. Earlier versions of Excel used the same procedure for all values of
z. For Excel 2003, two different approximations are used: one for |z| less than
or equal to five, and a second for |z| greater than five. The two new
procedures are each more accurate than the previous procedure over the range
that they are applied. In earlier versions of Excel, accuracy deteriorates in
the tails of the distribution, yielding only three significant digits for z = 4,
as reported in Knusel's paper. Also, in the neighborhood of z = 1.2, NORMSDIST
yields only six significant digits. In practice, however, this is likely to be
sufficient for most users.
Excel 2003 code yields at least fifteen
decimal places of accuracy for any z. This means fifteen significant digits for z =
1.2 and (because of leading 0's) ten significant digits for z = 4. Improving
the accuracy of NORMSDIST has the useful side effect of also improving the
following functions that depend on it: CONFIDENCE, LOGINV, LOGNORMDIST,
NORMDIST, NORMINV, NORMSINV and ZTEST. Code for CONFIDENCE, LOGNORMDIST,
NORMDIST, and ZTEST has not been revised; accuracy of each of them is improved
because each essentially calls NORMSDIST one or more times and uses the results
of these calls in its computations. Improved accuracy of NORMSDIST also
benefited LOGINV, NORMINV, and NORMSINV. Because these are continuous
distribution inverse functions (see area four below), they also benefited from
search refinements that were associated with such functions. The search
refinements in area four discussed for other inverse functions were implemented
for these three inverse functions in Excel 2002.
In summary, if you
use Excel 2002 or earlier, you should be satisfied with NORMSDIST. However, if
you must have highly accurate NORMSDIST(z) values for z far from 0 (such as |z|
greater than or equal to four), Excel 2003 might be required. NORMSDIST(-4) =
0.0000316712; earlier versions are accurate only as far as 0.0000317. For many
users, "about 0.00003," or "about 3 chances in 100,000," is likely to be
accurate enough, and earlier versions carry the result to two more decimal
places than that.
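For comparison, a short sketch (Python; one common way to obtain full tail accuracy, not necessarily Excel's two approximations) shows how evaluating the normal CDF through the complementary error function preserves significant digits for z far from 0.

```python
import math

def normsdist(z):
    # Standard normal CDF via the complementary error function.
    # erfc keeps full relative accuracy in the lower tail, where a
    # naive 1 - (upper-tail probability) formula would lose
    # significant digits to cancellation.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))
```

For example, normsdist(-4) agrees with the 0.0000316712 value quoted above to many more digits.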
Functions that involve sums of squares
McCullough and Wilson point out that Excel appeared to use a
"calculator formula" to compute VAR. Their observation was correct and can be
extended to many (but curiously, not all) functions that calculate the sum of
squared deviations about a mean. The calculator formula can be executed in a
single pass through the data. The resulting computational speed was probably
the main reason for using it in earlier versions of Excel. The alternative
formula, implemented for Excel 2003, requires two passes. For VAR the
calculator formula counts the number of observations, sums the squares of all
the observations, and sums the observations. From this, it can calculate:
Sum of squares of observations - ((sum of observations)^2) / number of observations
The alternative procedure counts the number of observations and
the sum of the observations on the first pass. From this it can calculate the
mean as sum of observations divided by number of observations. On the second
pass, the procedure calculates:
Sum of squared deviations of individual observations from the mean
With either computational approach, VAR is computed by dividing
the result by number of observations - 1; VARP is computed by dividing the
result by number of observations.
With infinitely precise arithmetic,
both procedures yield the same results. However, because of the finite
precision of Excel, the calculator formula is more prone to round off errors.
In texts on statistical computing, the calculator formula is generally
presented as an example of how not to compute variance. With the worksheet that
accompanies the article about VAR, you can experiment to judge the extent that
round off errors in earlier versions of Excel are likely to pose problems. You
will see that problems occur when there are many significant digits in the data
but very little difference between values. You can expect that for most users,
such round off errors are not likely to be troubling in practice. However,
revising various functions to substitute the two-pass procedure for the
calculator formula was worthwhile because it replaced an outdated and
discredited procedure by an appropriate one. It also affected many
functions.
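The two procedures can be contrasted directly (a Python sketch with contrived data, in the spirit of the examples in the VAR article):

```python
def var_calculator(data):
    # One-pass "calculator formula": count, sum, and sum of squares.
    # Prone to catastrophic cancellation when the values share many
    # leading digits.
    n = len(data)
    s = sum(data)
    ss = sum(x * x for x in data)
    return (ss - s * s / n) / (n - 1)

def var_two_pass(data):
    # Two-pass formula: compute the mean first, then sum the squared
    # deviations from it.
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

# Many significant digits, tiny spread: the true variance is 0.01,
# which the two-pass formula recovers and the one-pass formula does not.
data = [100000000.1, 100000000.2, 100000000.3]
```

Running both on this data shows the calculator formula losing essentially all accuracy, while the two-pass result is correct to many digits.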
DEVSQ also computes the sum of squared deviations about a
mean. DEVSQ has always been implemented by using the two-pass procedure.
Therefore, another alternative for computing VAR(data) is to compute
DEVSQ(data) / (COUNT(data) - 1). This would give the same results in all Excel
versions. If you are using Excel 2002 or earlier, you can compute both VAR and
DEVSQ(data) / (COUNT(data) - 1) to see how much their results differ. The
difference is essentially the round off error in VAR. Other functions that
require a sum of squared deviations about a mean and that have always used the
two-pass procedure are CORREL and COVAR. PEARSON and CORREL both compute the
Pearson product-moment correlation coefficient. In Excel 2003, both yield the same results. In earlier versions of Excel, PEARSON is implemented with the one-pass algorithm.
Many functions involve sums of squared deviations
about a mean. The two-pass procedure has been implemented for each of the
following: VAR, VARA, VARP, VARPA, STDEV, STDEVA, STDEVP, STDEVPA, PEARSON,
SLOPE, and STEYX. Other functions that have been improved because they
essentially call one of the functions in the previous list include: FORECAST,
INTERCEPT, RSQ, TTEST, and ZTEST (when user omits standard deviation from
arguments). In the Analysis ToolPak, each of the three ANOVA tools has been
improved by replacing the calculator formula by the two-pass procedure. There
are also versions of VAR, VARP, STDEV, and STDEVP that are used in pivot tables
and data consolidation. There is a single article on the functions DVAR, DVARP,
DSTDEV, and DSTDEVP.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
828125
Excel statistical functions: DVAR, DVARP, DSTDEV, and DSTDEVP
All of these have also been
upgraded.
In summary, if you use an earlier version of Excel, you must
be concerned about round-off errors in cases where data contains many
significant digits but the variance is small. The article about VAR presents
contrived examples of this; you can expect that these round-off problems do not
naturally occur frequently in real data.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
826112
Excel statistical functions: VAR
Continuous distribution inverse functions
These are all the functions whose names end in "INV". They all use
a binary search to converge on the value to return to the user. Therefore, the
accuracy of CHIINV, NORMSINV, and the other inverse functions depends on two
factors: the accuracy of the underlying distribution functions (for example,
CHIDIST and NORMSDIST) and the refinement of the binary search process.
For Excel 2002, the binary
search process was improved in one case, NORMSINV. This improvement also
affects NORMINV. NORMSINV and NORMINV are by far the most frequently used
inverse functions. The nature of the improvement was to refine the search so
that it continued until the result was accurate to the limit of Excel's
precision, instead of stopping earlier if the probability associated with the
value found was within 3 * 10^(-7) of the correct probability. While this
search refinement improved NORMSINV, the Excel 2002 version was still limited
by the accuracy of NORMSDIST.
For Excel 2003, the improved
binary search used for NORMSINV in Excel 2002 was implemented for each of the
other "INV" functions. Because NORMSDIST was also improved (area two), the
version of NORMSINV in Excel 2003 is better than the version in Excel 2002,
which in turn is better than the version in Excel 2000 and earlier.
In
summary, these functions will behave well enough in earlier versions of Excel
assuming that you call them with a probability value (for example, p in
NORMSINV(p)) that is not too close to 0 or 1 (such as within about 10^(-6)).
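The search process this section describes can be sketched as follows (illustrative Python, not Excel's code): bisection on the CDF that continues until the bracketing interval cannot shrink further in double precision, rather than stopping once the probability is within a loose tolerance.

```python
import math

def norm_cdf(z):
    # Standard normal CDF (the role NORMSDIST plays for NORMSINV).
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_inv(p):
    """Invert the CDF by bisection.  The search keeps halving the
    bracket until the midpoint can no longer move in double
    precision, instead of stopping once the probability is within a
    loose tolerance such as 3 * 10^(-7)."""
    lo, hi = -40.0, 40.0
    while True:
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:   # bracket at the precision limit
            return mid
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
```

This also illustrates the two factors named above: the returned value can only be as accurate as the underlying CDF, no matter how refined the search is.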
Functions for discrete distributions
Knusel pointed out that BINOMDIST, HYPGEOMDIST, and POISSON do not
compute numeric results (and return #NUM!) in certain cases. CRITBINOM and
NEGBINOMDIST exhibit similar behavior. These functions produce accurate results
whenever they do not yield an error message like #NUM!.
These problems
have been fixed. You can view them as relatively unimportant because they occur
only for unlikely values of input parameters. For example, BINOMDIST will
return accurate answers in Excel 2002 and earlier as long as the number of
trials is fewer than 1,030.
All five of these functions were fixed in
the same way: continue to use the existing code in cases where the inputs
guarantee no computational problems (such as fewer than 1,030 trials
for BINOMDIST), and switch to an alternative approach in problematic cases. For
each of the five functions, the alternative approach evaluates the probability
that you want without evaluating a combinatorial coefficient like
COMBIN(1030, 515) or the factorial of a large number like FACT(200). In each
case, the alternative approach follows the same general outline.
For more information, click the following article numbers to view the articles in the Microsoft Knowledge Base:
827459
Excel statistical functions: BINOMDIST
828117
Excel statistical functions: CRITBINOM
828515
Excel statistical functions: HYPGEOMDIST
828361
Excel statistical functions: NEGBINOMDIST
828130
Excel statistical functions: POISSON
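One standard way to evaluate such probabilities without forming COMBIN(1030, 515) or FACT(200) directly is to work in log space with the log-gamma function. The sketch below (Python, illustrative only; the articles above describe Excel's actual approach) stays finite for a number of trials well beyond 1,030.

```python
import math

def binom_pmf(k, n, p):
    """BINOMDIST-style point probability for large n.  The binomial
    coefficient is formed in log space with lgamma, so no intermediate
    quantity overflows even when COMBIN(n, k) itself would."""
    log_coef = (math.lgamma(n + 1)
                - math.lgamma(k + 1)
                - math.lgamma(n - k + 1))
    return math.exp(log_coef + k * math.log(p) + (n - k) * math.log(1 - p))
```

For example, binom_pmf(515, 1030, 0.5) is an ordinary number near 0.025, even though the combinatorial coefficient on its own overflows a double.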
In summary, users of earlier versions of
Excel should not be troubled in practice by numeric deficiencies because input
values that cause #NUM! results are extreme. Additionally, you can be assured
that if you receive a result other than an error message, the result is
accurate. Round-off error is not a problem here.
Random number generator
There are standard tests of randomness for a stream of random
numbers. RAND's sequence of random numbers repeated itself too soon to pass
these tests. An improved algorithm has been implemented that
passes all standard tests of randomness.
The chance of a serious
practical effect on your random data from the random number generator in Excel
2002 and earlier is minimal. For example, you would need a lengthy sequence of
random numbers (such as 1 million) before the repetitive behavior would have a
serious effect on your results.
Still, because the replacement
algorithm was easily implemented, it was worthwhile to try to make
improvements.
In summary, if you use an earlier version of Excel, do
not be concerned with the quality of pseudo-random numbers generated by RAND
unless you use many random numbers.
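The article does not name the replacement algorithm. As an illustration of how combining several short-period generators yields a far longer period, here is the classic Wichmann-Hill scheme (a sketch, not necessarily Excel's code):

```python
def make_wichmann_hill(s1=1, s2=10000, s3=3000):
    """Wichmann-Hill combined generator.  Each of the three component
    linear congruential generators has a short period on its own;
    combined, the period is about 6.95 * 10^12, comfortably beyond a
    million draws."""
    def rand():
        nonlocal s1, s2, s3
        s1 = (171 * s1) % 30269
        s2 = (172 * s2) % 30307
        s3 = (170 * s3) % 30323
        return (s1 / 30269.0 + s2 / 30307.0 + s3 / 30323.0) % 1.0
    return rand

rng = make_wichmann_hill()
sample = [rng() for _ in range(1000)]   # uniform draws in [0, 1)
```

Because the three component periods are relatively prime, the combined sequence does not repeat until all three components return to their starting states simultaneously.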
Conclusions
This overview article discusses six categories of functions that
were established based on inadequacies in Excel statistical functions that were
reported in papers by Knusel and McCullough and Wilson. Improvements were made
to functions in each category. Readers are encouraged to see articles on
individual functions for more detail.
This article has warned users of
Excel 2002 and earlier to avoid using results in the last three rows of the
output table for LINEST(known_y's, known_x's, FALSE, TRUE). Users of all
versions should not use the Analysis ToolPak's tool for t-Test: Paired Two
Sample for Means if there are missing data. These two cases return incorrect
results regardless of data values.
In other cases, the extent that
numeric inaccuracies in Excel 2002 and earlier will affect users is difficult
to measure. For functions where you supply numeric arguments, degree of
inaccuracy typically depends on the value of these arguments (for example,
BINOMDIST with 1,500 trials or NORMSDIST(-7)). In these cases such arguments
typically have to be "extreme" before there is a risk of serious numeric
problems. For functions where you supply one or more data ranges, such as any
function involving sums of squares (VAR, STDEV, SLOPE), data values also have
to be "extreme," in the sense of many significant digits combined with small
variation, before round-off problems have an effect. For LINEST, you must be
aware of the chance of collinearity.
Perhaps you will never notice a difference
when a worksheet that is created in Excel 2002 or earlier is recalculated in
Excel 2003. However, it was important to improve functions whose numerical
accuracy fell short in tests designed by experts in the computational
statistics community to stretch those functions to their
limits. When the functions were originally added to Excel, nobody could
anticipate future uses. For example, Six Sigma techniques were not in
widespread use. Now, you would not want to return inaccurate values of
NORMSDIST(-6) or NORMSDIST(6). Hopefully, numeric improvements for Excel 2003
will make these statistical functions appropriate for unanticipated future
use.