Question:
In statistics, when computing the variance of a sample, why do we divide by n - 1 instead of by n?
anonymous
2007-09-13 15:21:05 UTC
The variance of a sample {x1, ..., xn} with average A(x) is computed by

s^2 = (1/(n - 1)) [(x1 - A(x))^2 + (x2 - A(x))^2 + ... + (xn - A(x))^2].

This is supposed to estimate the "true variance" of the population {y1, ..., ym} with average A(y) from which the sample is drawn. The true variance is

σ^2 = (1/m) [(y1 - A(y))^2 + (y2 - A(y))^2 + ... + (ym - A(y))^2].

I am told that the "n - 1" is used in s^2 because this makes s^2 an unbiased estimator for σ^2 -- that is, the expected value of s^2 is equal to σ^2. I understand what this means.

What I do not understand is: *why* do we need to change the denominator in s^2 to make it an unbiased estimator? It seems strange to use a different formula for the sample than the one we are trying to estimate in the population.

Can anyone explain why s^2 (as given above) is an unbiased estimator for σ^2, and using a denominator "n" for the sample variance is not?
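
A minimal simulation sketch of the point (the normal population, the sample size n = 5, and the trial count are arbitrary choices for illustration): draw many small samples from a population with known σ^2 = 4, compute the sum of squared deviations about each sample mean, and average the divide-by-n and divide-by-(n - 1) versions across the trials.

import random

random.seed(0)

TRUE_MEAN, TRUE_SD = 0.0, 2.0      # illustrative population N(0, 4), so sigma^2 = 4
n, trials = 5, 200_000             # small samples make the bias easy to see

avg_div_n = 0.0
avg_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    a = sum(sample) / n                           # sample mean A(x)
    ss = sum((x - a) ** 2 for x in sample)        # sum of squared deviations
    avg_div_n += (ss / n) / trials
    avg_div_n_minus_1 += (ss / (n - 1)) / trials

print("true sigma^2:       ", TRUE_SD ** 2)       # 4.0
print("average of SS/n:    ", avg_div_n)          # about 3.2, i.e. 4 * (n - 1)/n
print("average of SS/(n-1):", avg_div_n_minus_1)  # about 4.0
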
Five answers:
Sugar Shane
2007-09-13 15:37:04 UTC
In small sample sizes, the individual measurements have a large impact on the mean. The change in denominator prevents us from underestimating the variance due to the mean being misestimated. Obviously, as n gets very large, that change in denominator vanishes, as does the chance of misestimating the mean.
piscesgirl
2007-09-13 15:53:45 UTC
The reason that the first one is biased is that the sample mean is generally somewhat closer to the observations in the sample than the population mean is to these observations.

This is so because the sample mean is, by definition, in the middle of the sample, while the population mean may even lie outside the sample. So the deviations from the sample mean will often be smaller than the deviations from the population mean, and if the same formula is applied to both, the variance estimate will on average come out somewhat smaller in the sample than in the population.
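
A tiny numerical illustration of that point (the numbers are made up for the example): the sum of squared deviations about the sample mean is never larger than the sum of squared deviations about any other value, including the population mean.

# Hypothetical numbers, chosen only for illustration.
sample = [2.0, 3.0, 7.0]
population_mean = 5.0                        # pretend the true mean A(y) is 5

sample_mean = sum(sample) / len(sample)      # A(x) = 4.0

ss_about_sample_mean = sum((x - sample_mean) ** 2 for x in sample)          # 4 + 1 + 9 = 14
ss_about_population_mean = sum((x - population_mean) ** 2 for x in sample)  # 9 + 4 + 4 = 17

print(ss_about_sample_mean, "<=", ss_about_population_mean)                 # 14.0 <= 17.0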

As n gets very large, we expect our sample to give a better and better estimate of the population variance, and the difference between dividing by n and dividing by n - 1 (a ratio of n/(n - 1)) becomes very small.

There are several proofs for why this is the case. Wikipedia has a couple here: http://en.wikipedia.org/wiki/Sample_variance#Population_variance_and_sample_variance

They all rely on showing that E(s^2) = σ^2, the population variance.
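
Sketching the key step here, in the question's own notation and assuming the sample values are independent draws from the population (sampling a finite population without replacement changes the algebra slightly):

(x1 - A(x))^2 + ... + (xn - A(x))^2
    = (x1 - A(y))^2 + ... + (xn - A(y))^2 - n (A(x) - A(y))^2.

Taking expectations, each E[(xi - A(y))^2] equals σ^2, and E[(A(x) - A(y))^2] = σ^2 / n (the variance of the sample mean), so

E[(x1 - A(x))^2 + ... + (xn - A(x))^2] = n σ^2 - σ^2 = (n - 1) σ^2,

and dividing by n - 1 rather than n is exactly what makes the expected value come out to σ^2.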
sommerfeld
2016-11-15 09:01:01 UTC
That's because if you divide by (n - 1), the sample variance will be an unbiased estimator of the population variance. For this reason, 95% of the textbooks use n - 1 instead of n.
Merlyn
2007-09-13 21:12:56 UTC
By dividing by n - 1 you have an unbiased estimator for the variance.

If you divide by n you have a biased estimator for the variance; this is called the maximum likelihood estimator for the variance.

The best explanation I can give you for "why" is to read up on MLEs and the effects of bias in analysis. That's a little bit much to cover in this forum.
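
For what it's worth, Python's standard library has both versions built in, which makes the difference easy to see (the sample values are just an example): statistics.pvariance divides the sum of squared deviations by n, and statistics.variance divides it by n - 1.

import statistics

sample = [2.0, 3.0, 7.0]   # example data; sum of squared deviations about the mean is 14

print(statistics.pvariance(sample))  # divides by n:     14 / 3, about 4.67 (the biased / MLE form)
print(statistics.variance(sample))   # divides by n - 1: 14 / 2 = 7.0       (the unbiased form)
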
Megegie
2007-09-13 15:29:23 UTC
That accounts for the degrees of freedom: estimating the mean from the sample uses one of them up.

If we divided by n, we would be figuring out the variance of a population and not of a sample.


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.