Notice: This website is an unofficial Microsoft Knowledge Base (hereinafter KB) archive and is intended to provide a reliable access to deleted content from Microsoft KB. All KB articles are owned by Microsoft Corporation. Read full disclaimer for more details.

Description of improvements in the statistical functions in Excel 2003 and in Excel 2004 for Mac


View products that this article applies to.

Summary

This article describes six categories of improvements to the statistical functions in Microsoft Office Excel 2003, points you (the reader) to more specific articles, and gives at least some assessment of the effect of using earlier versions of Microsoft Excel when functions have been improved for Excel 2003.

Microsoft Excel 2004 for Mac information

The statistical functions in Excel 2004 for Mac were updated by using the same algorithms that were used to update the statistical functions in Microsoft Office Excel 2003. Any information in this article that describes how a function works or how a function was modified for Excel 2003 also applies to Excel 2004 for Mac.

↑ Back to the top


More information

For Excel 2003, improvements have been made in the following six areas:
  1. LINEST and related functions.
  2. NORMSDIST and related functions.
  3. Functions that involve sums of squares (for example, VAR and STDEV).
  4. Continuous distribution inverse functions (for example, CHIINV, FINV, NORMSINV, and TINV).
  5. Functions for discrete distributions (for example, BINOMDIST and POISSON).
  6. Random number generator (RAND).
Separate sections cover each of these six areas. In areas three, four, and five, there is a common theme to improvements to each of several functions. In areas one through four, improvements to one function had useful effects on the performance of other functions that essentially call it.

Because improvements were made for Excel 2003, you may inevitably wonder about the risk of (and magnitude of) inaccuracies in statistical functions in the earlier versions of Excel. These issues will be discussed briefly in each of the six sections. The six areas are ranked in this author's perception of their importance. While most users do not have to be the least bit alarmed by performance in Microsoft Excel 2002 and earlier, problems with LINEST (area 1) are much more likely to surface than problems with BINOMDIST (area five) or RAND (area six).

There is one shortcoming in Excel 2002 and earlier that was identified and not fixed for Excel 2003. The prototypical application of the Analysis ToolPak's tool for t-Test: Paired Two Sample for Means is one where data includes measurements on the same set of subjects before and after a treatment (for example, weights before and after a diet program). This tool will not compute if there are unequal numbers of missing before and after observations; it will compute incorrect answers if there are missing observations and the number of missing before and after observations are equal. Unless you can guarantee that there are no missing observations, the TTEST function in Excel is preferred (because it handles missing observations correctly).

For more information, click the following article number to view the article in the Microsoft Knowledge Base:
829252 You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel 2003
Two primary references that point out numeric inaccuracies in Microsoft Excel 97 are papers by Knusel (see note 1) for areas two, four, and five and McCullough and Wilson (see note two) for areas one, three, and six. Improvements to the statistical functions in Excel in Microsoft Excel 2000 and Microsoft Excel 2002 were relatively minor.

LINEST and related functions

Two major improvements were made to LINEST code. The first replaced an inappropriate formula for total sum of squares in the case where the third argument of LINEST was set to FALSE indicating that you do not want LINEST to model an intercept (or constant). In practice, this option is selected in a small proportion of cases. When calling LINEST(known_y's, known_x's, FALSE, TRUE) to return an array with five rows that contain detailed information (f-statistic, r squared, regression and residual or error sums of squares), the inappropriate formula in earlier versions of Excel led to incorrect results in the last three rows of the output array. Regression coefficients and their standard errors were not affected. This incorrect formula can lead to negative r squared and negative regression sum of squares. This has been corrected. The article on LINEST suggests a workaround for Excel 2002 and earlier to generate correct values of the outputs in the last three rows of the output array. This was a serious error as results in these last three rows were always incorrect when LINEST was called with its third argument set to FALSE.

For more information, click the following article number to view the article in the Microsoft Knowledge Base:
828533 Description of the LINEST function in Excel 2003 and in Excel 2004 for Mac

Notes

  • Knusel, L. On the Accuracy of Statistical Distributions in Microsoft Excel 97, Computational Statistics and Data Analysis, 26, 375-377, 1998.
  • McCullough, B.D. & B. Wilson, On the accuracy of statistical procedures in Microsoft Excel 97, Computational Statistics and Data Analysis, 31, 27-37, 1999.
McCullough and Wilson (McC and W) pointed out (correctly) that Excel 97 was insensitive to issues of collinear (or almost collinear) predictor columns. A set of predictor columns is collinear if one of them can be discarded without impacting the goodness of fit of the regression model. In other words, the information that is contained in a discarded column adds no value because the discarded column can be reconstructed from the remaining columns. Test examples in McC and W included collinearity and near collinearity. Using a solution approach that ignored issues of collinearity led to inaccurate results when collinearity was present. Such results included a reduced number of significant digits of accuracy in regression coefficients and an inappropriate number of degrees of freedom. For Excel 2003, the solution approach was changed to use QR Decomposition. Results are much improved levels of accuracy in notoriously difficult test cases that are used by McC and W. Collinearity is discussed in the article on LINEST; the article also provides a small example for walking through the QR Decomposition algorithm. This deficiency in all versions of Excel that are earlier than Excel 2003 will have a serious effect only when there are collinear predictors. One common case where collinearity is guaranteed not to occur is if there is a single predictor column and not all values in that column are the same.

Improving LINEST has a good effect on functions that call it: LOGEST, TREND, and GROWTH in addition to the Linear Regression tool in the Analysis ToolPak. The charting tool in Excel allows you to fit a straight line to a plot of data points by using a linear regression tool. Unfortunately, linear regression capabilities in the charting tool have not been upgraded in Excel 2003.

In summary, if you use a version of Excel that is earlier than Excel 2003, you must reject detailed LINEST results in the last three rows of the five row output table in LINEST if LINEST was called with its third argument set to FALSE. Also be aware of the chance of collinearity; for most data sets, collinearity should not be a problem.

NORMSDIST and related functions

NORMSDIST(z) must be evaluated by using an approximation procedure. Earlier versions of Excel used the same procedure for all values of z. For Excel 2003, two different approximations are used: one for |z| less than or equal to five, and a second for |z| greater than five. The two new procedures are each more accurate than the previous procedure over the range that they are applied. In earlier versions of Excel, accuracy deteriorates in the tails of the distribution yielding three significant digits for z = 4 as reported in Knusel's paper. Also, in the neighborhood of z = 1.2, NORMSDIST yields only six significant digits. However, in practice, this is likely to be sufficient for most users.

Excel 2003 code yields at least fifteen decimal point accuracy for any z. This means fifteen significant digits for z = 1.2 and (because of leading 0's) ten significant digits for z = 4. Improving the accuracy of NORMSDIST has the useful side effect of also improving the following functions that depend on it: CONFIDENCE, LOGINV, LOGNORMDIST, NORMDIST, NORMINV, NORMSINV and ZTEST. Code for CONFIDENCE, LOGNORMDIST, NORMDIST, and ZTEST has not been revised; accuracy of each of them is improved because each essentially calls NORMSDIST one or more times and uses the results of these calls in its computations. Improved accuracy of NORMSDIST also benefited LOGINV, NORMINV, and NORMSINV. Because these are continuous distribution inverse functions (see area four below), they also benefited from search refinements that were associated with such functions. The search refinements in area four discussed for other inverse functions were implemented for these three inverse functions in Excel 2002.

In summary, if you use Excel 2002 and earlier, you should be satisfied with NORMSDIST. However, if you must have highly accurate NORMSDIST(z) values for z far from 0 (such as |z| greater than or equal to four), Excel 2003 might be required. NORMSDIST(-4) = 0.0000316712; earlier versions would be accurate only as far as 0.0000317. You can expect that "about 0.00003" or "about 3 chances in 100,000" is likely to be an accurate enough answer for many users and earlier versions carry this to two more decimal places.

Functions that involve sums of squares

McCullough and Wilson point out that Excel appeared to use a "calculator formula" to compute VAR. Their observation was correct and can be extended to many (but curiously, not all) functions that calculate the sum of squared deviations about a mean. The calculator formula can be executed in a single pass through the data. The resulting computational speed was probably the main reason for using it in earlier versions of Excel. The alternative formula, implemented for Excel 2003, requires two passes. For VAR the calculator formula counts the number of observations, sums the squares of all the observations, and sums the observations. From this, it can calculate:
Sum of squares of observations � ((sum of observations)^2) / number of observations
The alternative procedure counts the number of observations and the sum of the observations on the first pass. From this it can calculate the mean as sum of observations divided by number of observations. On the second pass, the procedure calculates:
Sum of squared deviations of individual observations from the mean
With either computational approach, VAR is computed by dividing the result by number of observations � 1; VARP is computed by dividing the result by number of observations.

With infinitely precise arithmetic, both procedures yield the same results. However, because of the finite precision of Excel, the calculator formula is more prone to round off errors. In texts on statistical computing, the calculator formula is generally presented as an example of how not to compute variance. With the worksheet that accompanies the article about VAR, you can experiment to judge the extent that round off errors in earlier versions of Excel are likely to pose problems. You will see that problems occur when there are many significant digits in the data but very little difference between values. You can expect that for most users, such round off errors are not likely to be troubling in practice. However, revising various functions to substitute the two-pass procedure for the calculator formula was worthwhile because it replaced an outdated and discredited procedure by an appropriate one. It also affected many functions.

DEVSQ also computes the sum of squared deviations about a mean. DEVSQ has always been implemented by using the two-pass procedure. Therefore, another alternative for computing VAR(data) is to compute DEVSQ(data) / (COUNT(data) � 1). This would give the same results in all Excel versions. If you are using Excel 2002 or earlier, you can compute both VAR and DEVSQ(data) / (COUNT(data) � 1) to see how much their results differ. The difference is essentially the round off error in VAR. Other functions that require a sum of squared deviations about a mean and that have always used the two-pass procedure are CORREL and COVAR. PEARSON and CORREL both compute the Pearson Product-Moment Correlation Coefficient. Both yield the same results in Excel 2003. In earlier versions of Excel, PEARSON is implemented with the one pass algorithm.

Many functions involve sums of squared deviations about a mean. The two-pass procedure has been implemented for each of the following: VAR, VARA, VARP, VARPA, STDEV, STDEVA, STDEVP, STDEVPA, PEARSON, SLOPE, and STEYX. Other functions that have been improved because they essentially call one of the functions in the previous list include: FORECAST, INTERCEPT, RSQ, TTEST, and ZTEST (when user omits standard deviation from arguments). In the Analysis ToolPak, each of the three ANOVA tools has been improved by replacing the calculator formula by the two-pass procedure. There are also versions of VAR, VARP, STDEV, and STDEVP that are used in pivot tables and data consolidation. There is a single article on the functions DVAR, DVARP, DSTDEV, and DSTDEVP.

For more information, click the following article number to view the article in the Microsoft Knowledge Base:
828125 Excel statistical functions: DVAR, DVARP, DSTDEV, and DSTDEVP


All of these have also been upgraded.

In summary, if you use an earlier version of Excel, you must be concerned about round-off errors in cases where data contains many significant digits but the variance is small. The article about VAR presents contrived examples of this; you can expect that these round-off problems do not naturally occur frequently in real data.

For more information, click the following article number to view the article in the Microsoft Knowledge Base:
826112 Excel statistical functions: VAR

Continuous distribution inverse functions

These are all the functions whose name ends in "INV". They all use binary search to converge on a value to return to the user. Therefore accuracy of, CHIINV, NORMSINV, and other functions depends on two factors: accuracy of the underlying distributions (for example, CHIDIST and NORMSDIST) and refinement of the binary search process.

For Excel 2002, the binary search process was improved in one case, NORMSINV. This improvement also affects NORMINV. NORMSINV and NORMINV are by far the most frequently used inverse functions. The nature of the improvement was to refine the search so that it continued until the result was accurate to the limit of Excel's precision, instead of stopping earlier if the probability associated with the value found was within 3 * 10^(-7) of the correct probability. While this search process refinement improved NORMSINV, Excel 2002's version still required a more accurate NORMSDIST.

For Excel 2003, the improved binary search used for NORMSINV for Excel 2002, was implemented for each of the other "INV" functions. Because NORMSDIST was also improved (area two), the version of NORMSINV in Excel 2003 is better than the version in Excel 2002 and better yet than that in Excel 2000 and earlier versions of Excel.

In summary, these functions will behave well enough in earlier versions of Excel assuming that you call them with a probability value (for example, p in NORMSINV(p)) that is not too close to 0 or 1 (such as within about 10^(-6)).

Functions for discrete distributions

Knusel pointed out that BINOMDIST, HYPGEOMDIST, and POISSON do not compute numeric results (and return #NUM!) in certain cases. CRITBINOM and NEGBINOMDIST exhibit similar behavior. These functions produce accurate results whenever they do not yield an error message like #NUM!.

These problems have been fixed. You can view them as relatively unimportant because they occur only for unlikely values of input parameters. For example, BINOMDIST will return accurate answers in Excel 2002 and earlier as long as the number of trials is fewer than 1,030.

All five of these functions were fixed in the same way: continue to use existing code in cases where inputs to the functions guaranteed no computational problems (such as fewer than 1,030 trials for BINOMDIST); switch to an alternative plan in problematical cases. For each of the five functions, this method uses a process that permits evaluation of the probability that you want without having to evaluate a combinatorial coefficient like COMBIN(1030, 515) or a factorial of a large number like FACT(200). In each case, the alternative plan has the same general approach.

For more information, click the following article numbers to view the articles in the Microsoft Knowledge Base:
827459 Excel statistical functions: BINOMDIST
828117 Excel statistical functions: CRITBINOM
828515 Excel statistical functions: HYPGEOMDIST
828361 Excel statistical functions: NEGBINOMDIST
828130 Excel statistical functions: POISSON


In summary, users of earlier versions of Excel should not be troubled in practice by numeric deficiencies because input values that cause #NUM! results are extreme. Additionally, you can be assured that if you receive a result other than an error message, the result is accurate. Round-off error is not a problem here.

Random number generator

There are standard tests of randomness for a stream of random numbers. RAND�s sequence of random numbers repeated itself with too small a frequency to pass these tests. An improved algorithm has been implemented that passes all standard tests of randomness.

The chance of a serious practical affect on your random data by the random number generator in Excel 2002 and earlier is minimal. For example, you must have a lengthy sequence of random numbers (such as 1 million) before the repetitive behavior would have a serious affect on your results.

Still, because the replacement algorithm was easily implemented, it was worthwhile to try to make improvements.

In summary, if you use an earlier version of Excel, do not be concerned with the quality of pseudo-random numbers generated by RAND unless you use many random numbers.

Conclusions

This overview article discusses six categories of functions that were established based on inadequacies in Excel statistical functions that were reported in papers by Knusel and McCullough and Wilson. Improvements were made to functions in each category. Readers are encouraged to see articles on individual functions for more detail.

This article has warned users of Excel 2002 and earlier to avoid using results in the last three rows of the output table for LINEST(known_y's, known_x's, FALSE, TRUE). Users of all versions should not use the Analysis ToolPak's tool for t-Test: Paired Two Sample for Means if there are missing data. These two cases return incorrect results regardless of data values.

In other cases, the extent that numeric inaccuracies in Excel 2002 and earlier will affect users is difficult to measure. For functions where you supply numeric arguments, degree of inaccuracy typically depends on the value of these arguments (for example, BINOMDIST with 1,500 trials or NORMSDIST(-7)). In these cases such arguments typically have to be "extreme" before there is a risk of serious numeric problems. For functions where you supply one or more data ranges, such as any function involving sums of squares (VAR, STDEV, SLOPE), data values also have to be "extreme" in a sense with many significant digits and small variation before round-off problems have an affect. For LINEST, you must be aware of the chance of collinearity.

Perhaps you will never notice a difference when a worksheet that is created in Excel 2002 or earlier is recalculated in Excel 2003. However, it was important to improve functions whose numerical accuracy was found not to be sufficient in tests that were designed to investigate the ability of the functions to handle inputs known by experts in the computational statistics community to stretch such functions to their limits. When the functions were originally added to Excel, nobody could anticipate future uses. For example, Six Sigma techniques were not in widespread use. Now, you would not want to return inaccurate values of NORMSDIST(-6) or NORMSDIST(6). Hopefully, numeric improvements for Excel 2003 will make these statistical functions appropriate for unanticipated future use.

↑ Back to the top


Keywords: KB828888, kbinfo, kbfuncstat, kbfunctions

↑ Back to the top

Article Info
Article ID : 828888
Revision : 7
Created on : 11/15/2006
Published on : 11/15/2006
Exists online : False
Views : 448