Description of the effects of the improved statistical functions for the Analysis ToolPak in Excel

View products that this article applies to.

Summary

This article describes the effect of numeric improvements in the statistical functions of Microsoft Office Excel 2003 and of later versions of Excel on Analysis ToolPak (ATP) tools. Most ATP tools call Excel statistical functions in the process of calculating results. In many cases, this article serves as a pointer to articles on the individual Excel statistical functions. Additionally, discussion of future improvements that are helpful is included for some ATP tools.

Microsoft Excel 2004 for Mac information

The statistical functions in Excel 2004 for Mac were updated by using the same algorithms that were used to update the statistical functions in Microsoft Office Excel 2003 and in later versions of Excel. Any information in this article that describes how a function works or how a function was modified for Excel 2003 and for later versions of Excel also applies to Excel 2004 for Mac.

↑ Back to the top

More information

Code for the ATP has not been edited directly except to introduce improvements in the three ATP ANOVA tools.

For several ATP tools, numeric performance has been improved for Excel 2003 and for later versions of Excel because the tool calls an Excel statistical function that has been improved for Excel 2003 and for later versions of Excel. In cases where results are different for earlier versions of Excel and later versions of Excel, the values for Excel 2003 and for later versions of Excel are more accurate.

Most users will not notice a difference in results between different versions of Excel. This is because differences are generally caused by round-off errors that are significant only in extreme cases. However, this article must first point out one case where differences occur because of a wrong formula in Microsoft Excel 2002 and in earlier versions of Excel. Avoid the tool in those versions.

A second example involves an incorrect formula in Excel 2002 and in earlier versions of Excel that persists in Excel 2003 and in later versions of Excel. Avoid the ATP tool in this scenario for all versions of Excel.

First, avoid the Regression tool when you have to click to check the Constant is Zero check box. This has been corrected in Excel 2003 and in later versions of Excel. You do not have to avoid the Regression tool when the Constant is Zero check box is clear (the more typical case in practice).

Second, users of all versions of Excel should avoid the ATP t-Test: Paired Two Sample for Means tool unless you can guarantee that there are no missing data observations. The tool gives inappropriate answers (or no answers at all) if there are one or more missing observations.

For more information about the ATP Matched Pairs Two Sample t-Test tool, click the following article number to view the article in the Microsoft Knowledge Base:

829252 You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel

If you want to use this tool, and if there is missing data (or if there is even the chance of missing data), the TTEST function in Excel will handle it correctly.

Separate sections are provided later in this article for individual ATP tools. Tools that are not listed have not been affected by improvements in Excel 2003 and in later versions of Excel.

ANOVA: Single Factor, Two-Factor with Replication, and Two-Factor without Replication

Each of these three ANOVA tools has been rewritten to upgrade the computational procedure to a two-pass algorithm that is more numerically robust. These improvements are similar to improvements in statistical functions that compute sums of squared deviations about a mean (for example: VAR, STDEV, SLOPE, PEARSON).

For more information about ATP ANOVA, click the following article number to view the article in the Microsoft Knowledge Base:

829215 Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel

Correlation

This tool has not been changed. However, there is a small difference between the Correlation tool and the Covariance tool that persists in all versions of Excel. The Correlation tool returns a lower triangular correlation table with 1's on the diagonal and correlations off the diagonal. The tool uses CORREL to compute off-diagonal entries and fills those entries with the value that is returned by CORREL. (Therefore, if any data entry changes, no entry in the table changes. Contrast this behavior with the behavior of Covariance.)

Covariance

This tool returns a lower triangular covariance table with variances on the diagonal and covariances off the diagonal. Cells on the diagonal contain a formula "=VARP(...)" so that if a data entry changes, the result in the table also changes. VARP has been improved for Office Excel 2003 and for later versions of Excel.

For more information about VARP, click the following article number to view the article in the Microsoft Knowledge Base:

826393 Excel statistical functions: VARP

The Covariance tool uses COVAR to compute off-diagonal entries and fills those entries with the value that is returned by COVAR. Therefore if a data entry changes, the off-diagonal entries do not change.

Descriptive Statistics

This tool calls Excel statistical functions for everything that it computes. Because VAR and STDEV are improved for Excel 2003 and for later versions of Excel, different values are possible because of round-off errors in extreme cases.

For more information about VAR, click the following article number to view the article in the Microsoft Knowledge Base:

826112 Excel statistical functions: VAR

F-Test Two-Sample for Variances

Like the Descriptive Statistics tool, this tool calls VAR. Again, different values are possible because of round-off errors in extreme cases.

Random Number Generation

This tool fills a range with random observations. Values of these observations are put directly in the cells so that these cell values are not recomputed and replaced with fresh observations when the sheet is recalculated. The built-in RAND function in Excel, on the other hand, replaces existing random numbers with new ones every time the sheet is recalculated. You can use RAND to preserve values. To do so, copy the results in a range, and then use the Paste Special command to paste values into the same range.

The Random Number Generation (RNG) tool also produces random observations from various probability distributions whereas RAND corresponds to the single option in the tool: Uniform with range between 0 and 1. This article describes how to combine RAND with the statistical functions in Excel to generate such observations.

Therefore, in terms of functionality, you can emulate the ATP random number tool by using RAND and some ingenuity. Sometimes this is useful, particularly when many random numbers are wanted.

For Excel 2002 and earlier, both the ATP random number generator and RAND were known to perform poorly on standard tests of randomness. Performance was poor because the length of a cycle before the sequence of pseudo-random numbers starts repeating was too short. This is an issue only when many random numbers are required.

RAND has been improved for Excel 2003 and for later versions of Excel so that RAND now passes all such standard tests. RAND's sequence of random numbers will start repeating itself after more than 1 trillion numbers are generated.

For more information about RAND, click the following article number to view the article in the Microsoft Knowledge Base:

828795 Description of the RAND function in Excel

However, the ATP's separate random number generator was not upgraded. As in the version of RAND that is in Excel 2002 and in earlier versions of Excel, the ATP's separate random number generator is known to perform poorly on standard tests of randomness and to have a short repetition cycle. This has negative implications only if you require a very lengthy sequence of random numbers (for example, 1 million).

The RNG tool provides random observations from various probability distributions and the Uniform[0,1], the distribution used for outputting random numbers through RAND. The ATP tool first draws a Uniform[0,1] random number (or more than one such number) and then converts the answer into an observation from one of the following specific distributions. For the benefit of those who prefer RAND because they are going to generate many observations, this article suggests formulas that use RAND in the table below. Following the table, there are some cautionary comments about the Analysis ToolPak's normal distribution case.

Distribution	Excel formula using RAND()
Bernoulli(p)	=IF(RAND() <= p, 1, 0)
Binomial(n,p)	=CRITBINOM(n, p, RAND())
Discrete	See below
Normal(mu, sigma)	=NORMINV(RAND(), mu, sigma)
Patterned	Not really random
Poisson(mean)	See below
Uniform(low, high)	= low + (high � low) * RAND()

There are two reasons that you might prefer RAND and the formula in this table instead of the ATP random number tool in the Normal(mu, sigma) case. First, RAND is a better Uniform[0,1] random number generator than ATP's. Second, the ATP's tool does not call Excel's NORMINV function, but instead has its own built-in version of the inverse normal distribution. This is not as accurate as the version of NORMINV that is in Excel 2003 and in later versions of Excel. It is inferior both in the accuracy of the normal distribution approximation that it uses (Excel uses the much improved NORMSDIST function) and in the refinement of the binary search (Excel carries it much further to guarantee a value that is closer to NORMINV's probability argument). In a nutshell, using ATP in this case does not take advantage of improvements for Excel 2003 and for later versions of Excel in the NORMINV, NORMSDIST, and RAND functions.

For observations from a discrete distribution, assume that the values are in column B and that their probabilities are in column C. One would then want to fill each row of column A with the probability of observing a value strictly less than the value in column B in that row. Assuming that there were 10 values, assume that this data is in cells A1:C10. Then, because A1 contains the probability of observing a value strictly less than the first value, it must be set to 0. You can use VLOOKUP(RAND(), A1:C10, 2); the fourth argument to VLOOKUP is optional and must be omitted or set to TRUE. The "2" means that you want to return the value in the second column (column B in this example).

The ATP uses an adaptation of the method of generating Poisson observations in Press, W.H., S.A. Teukolsky, W. T. Vetterling, and B.P. Flannery, Numerical Recipes in C, The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992, pp. 293-295. There are two methods to take easy advantage of existing Excel functions.

The first uses the observation that a POISSON random variable with mean m has a distribution that is well approximated by a BINOMIAL(n, m/n) for large n. You can then call CRITBINOM(n, m/n, RAND()). Choice of n depends on m; n greater than 1,000 times m should be large enough.

The second relates the POISSON distribution to the Exponential. If events occur according to a POISSON process at rate m per unit time, the time between events has an Exponential distribution with mean 1/m. For a POISSON observation you can take a sequence of observations from this Exponential distribution and count how many of them occur before their sum exceeds 1. To obtain an observation from this Exponential distribution, use GAMMAINV(RAND(), 1, 1/m). This method would be suitable when m is relatively close to 0.

Regression

The Regression tool calls Excel's LINEST. The article on LINEST describes extensive improvements for Excel 2003 and for later versions of Excel.

For more information about LINEST, click the following article number to view the article in the Microsoft Knowledge Base:

828533 Description of the LINEST function in Excel

If you use Excel 2002 or an earlier version of Excel, note the same two shortcomings of the ATP Regression tool as of LINEST:

Regression Sum of Squares, r squared, and f statistic values are always incorrect for the case where the regression is forced through the origin.

For LINEST, this means "third argument set to FALSE instead set to TRUE or omitted." For the ATP tool, it means "Constant is Zero check box is checked."
LINEST and the ATP tool are insensitive to collinearity issues. The article on LINEST discusses the computational approach to LINEST in Excel 2003 and in later versions of Excel that is designed to find collinearity or near-collinearity when it exists and to act appropriately.

Both of these LINEST shortcomings have been overcome in Excel 2003 and in later versions of Excel. Performance of the ATP Regression tool will be similarly improved. There were no changes made to the tool's code; it has been improved by calling an improved Excel function. This author considers the improvement in LINEST to be the most important of the statistical function improvements.

The following table shows the Regression tool's output for earlier versions of Excel and for later versions of Excel with the Constant is Zero check box selected. It illustrates the first shortcoming that is mentioned earlier. In earlier versions of Excel, the Regression sum of squares is negative as is the R Square value.

X's	Y's
1	11
2	12
3	13

Excel 2002 and earlier versions
SUMMARY OUTPUT

Regression Statistics
Multiple R	65535
R Square	-20.4285714
Adjusted R Square	-20.9285714
Standard Error	4.629100499
Observations	3

ANOVA
	df	SS	MS	F	Significance F
Regression	1	-40.85714286	-40.85714286	-1.90666667	#NUM!
Residual	2	42.85714286	21.42857143
Total	3	2

Excel 2003 and later versions of Excel version
SUMMARY OUTPUT

Regression Statistics
Multiple R	0.949342311
R Square	0.901250823
Adjusted R Square	0.401250823
Standard Error	4.629100499
Observations	3

ANOVA
	df	SS	MS	F	Significance F
Regression	1	391.1428571	391.1428571	18.25333333	0.14637279
Residual	2	42.85714286	21.42857143
Total	3	434

t-Test: Paired Two Sample for Means

As mentioned earlier, avoid this tool if there is any chance of one or more missing data values. The prototypical application of this test is an experiment with measurements on subjects Before and After a treatment (such as weights Before and After a 60-day diet plan). If there are no missing observations, the tool will behave fine. If there are different numbers of missing Before and After observations, you receive an error message and the tool will not compute anything. If there are missing observations and the numbers of missing Before and After observations are equal, the tool will return answers that contain several errors.

Standard procedure is to remove a subject from the data if either the Before or After measurement is missing and to analyze the data that contains only those subjects that have both Before and After measurements. Excel's TTEST function handles missing data according to this standard procedure.

The other two t-Test tools, Two-Sample Assuming Equal Variances and Two-Sample Assuming Unequal Variances, do not share this defect.

z-Test: Two Sample for Means

This article noted that the normal distribution case of the random number generation tool does not call the NORMSINV function (or more precisely, NORMINV that calls NORMSINV) but has its own inferior procedure for finding normal inverse values.

The z-Test tool does call the NORMSINV function and takes advantage of improvements for Excel 2003 and for later versions of Excel.

Results in earlier versions of Excel

There are ATP tools whose performance has been improved for Excel 2003 and for later versions of Excel because they call Excel statistical functions that have been improved for Excel 2003 and for later versions of Excel. One of these improvements to LINEST, when its third argument is set to FALSE, implies that the ATP regression tool returns incorrect results in Excel 2002 and in earlier versions of Excel when the Constant is Zero check box is selected. In other cases where Excel functions have been improved, users of earlier versions are not likely to notice differences (most of these differences involve round-off errors in extreme situations).

The three ATP ANOVA tools have been improved by editing ATP code to substitute a more numerically robust algorithm (in the same spirit as the improvement in Excel's VAR). Users of these tools in earlier versions of Excel are likely to notice differences only in extreme situations.

Warning for users of all versions: avoid the t-Test: Paired Two Sample for Means if there is the slightest chance of missing data.

Results in Excel 2003 and in later versions of Excel

Significant improvements have been made in Excel's statistical functions. This translates into improvements in many ATP tools that call these functions. There is one ATP tool, the random number generator, that does not take advantage of an improved RAND function (because it is implemented in a self-contained manner and does not call RAND). This is unfortunate, but even more unfortunate is the special case of normally distributed random observations. The inverse normal distribution is also implemented in a self-contained manner and does not call the much improved NORMSINV function.

The following table lists ATP tools and those Excel functions that they call that have been improved for Excel 2003 and for later versions of Excel. Readers are referred to separate articles on each Excel function that is called.

ATP tool	Excel functions that are called

ANOVA: Single Factor	VAR, FINV
ANOVA: Two-Factor With Replication	VAR, FINV
ANOVA: Two-Factor Without Replication	VAR, FINV
Correlation
Covariance
Descriptive Statistics	STDEV, TINV, VAR
Exponential Smoothing
F-Test Two-Sample for Variances	VAR, FINV
Fourier Analysis
Histogram
Moving Average
Random Number Generation
Rank and Percentile
Regression	LINEST
Sampling	RAND
t-Test: Paired Two Sample for Means	VAR, PEARSON, TINV
t-Test: Two-Sample Assuming Equal Variances	VAR, TINV
t-Test: Two-Sample Assuming Unequal Variances	VAR, TINV
z-Test: Two Sample for Means	NORMSDIST, NORMSINV

For all functions that appear in this table other than LINEST and RAND, you are likely to see differences between earlier versions of Excel and later versions of Excel only because of round-off errors in extreme situations. LINEST has been much improved, as discussed earlier. RAND has been improved also. Curiously, the Sampling tool calls RAND, but the RNG tool relies on a self-contained generator that will give inferior performance when a lengthy sequence of random observations is required.

Conclusions

Except for changes to ATP code for each of the three ANOVA tools, ATP code has not been rewritten. However, it benefits from calling improved Excel functions as shown in the ATP Tool table. Flaws in the t-Test: Paired Two Sample for Means test have not been fixed for Excel 2003 or for later versions of Excel. The most noticeable improvements are likely to be in the Regression tool where LINEST no longer returns incorrect results when the Constant is Zero check box is selected and where LINEST is designed to handle collinearity appropriately.

↑ Back to the top

Applies to:

↑ Back to the top

Keywords: KB829208, kbinfo, kbfuncstat, kbfunctions, kbexpertisebeginner, kbformula

↑ Back to the top

Article Info

Article ID	:	829208
Revision	:	6
Created on	:	1/22/2007
Published on	:	1/22/2007
Exists online	:	False
Views	:	692

Microsoft KB Archive Search

Description of the effects of the improved statistical functions for the Analysis ToolPak in Excel

Summary

Microsoft Excel 2004 for Mac information

More information

ANOVA: Single Factor, Two-Factor with Replication, and Two-Factor without Replication

Correlation

Covariance

Descriptive Statistics

F-Test Two-Sample for Variances

Random Number Generation

Regression

t-Test: Paired Two Sample for Means

z-Test: Two Sample for Means

Results in earlier versions of Excel

Results in Excel 2003 and in later versions of Excel

Conclusions

Applies to: