Deviation Deep Dive: Part III

Posted on Feb 9, 2025
library(tidyverse)
library(kaimeleon)
ACCENT <- "#f7d060"

Introduction

In previous blog posts [1] [2], I’ve alluded to the fact that when we calculate the standard deviation, we sometimes divide by $n-1$ (that is, one less than the number of samples we have) rather than $n$. This is known as ‘Bessel’s correction’, and it’s typically sneaked in after learning how to calculate the standard deviation, hopefully quickly enough that you won’t notice. However, I think it’s fairly common to ask why we do this, and the standard response is usually a mumbled answer along the lines of it being a correction factor, and that you should only apply it if you’re estimating the standard deviation from a sample rather than using the whole population, turning this corrective factor not only into a mysterious oddity, but now also part of a shadowy ritual. My thoughts on this answer were typically ‘correction for what?’ and ‘wow, that seems pretty haphazard’, and it usually resulted in me always dividing by $n-1$ and hoping nothing bad would happen.

During my quest for enlightenment, I came across several different explanations, with varying levels of detail and coherence, which I’ll explain back to you in order of increasing detail. I very much regret this endeavor.

Conventions

Unfortunately, this blog post is going to contain some math. Instead of breaking up proofs by saying ‘and we can do this step because of XYZ’, which can make following along difficult, I’m going to include symbols above equals signs that will direct you to a section in the appendix that should show you why that step is possible. For instance:

$$E[A+B] \overset{(1)}{=} E[A] + E[B]$$

Where the (1) above the equals sign corresponds to the numbered section in the appendix.

Also, I tried my best to use colors in equations to draw your attention to the changing parts. Apologies if it’s distracting. I find my eyes glaze over pretty quickly if they don’t have ‘landmarks’.

Finally, I’ve chosen to collapse code that I didn’t find pertinent to the lesson at hand, but feel free to expand it if you’re interested!

Explanation I: It just works, OK?

The first and easiest way to look at this is to simply not look at it, a tactic similar to the ‘it’s working, don’t touch it’ strategy I take with particularly arcane sections of code. So maybe we should kick the tires and make sure this thing is even working. It certainly would be convenient if we didn’t actually need it in the first place.

Let’s simulate some samples:

set.seed(13)
replicates <- 100000
n <- 10
sd <- 10
# 100,000 replicates, each containing 10 draws from a normal distribution
# with mean 0 and standard deviation 10
samples <- replicate(replicates, rnorm(n, sd = sd), simplify = FALSE)

And let’s plot some samples to wrap our head around them:

Plotting code
some <- 20
plotting_data <- data.frame(
  value = unlist(samples[seq_len(some)]),
  replicate = factor(rep(seq_len(some), each = n))
)
ggplot(plotting_data, aes(value, replicate, color = 1)) +
  geom_vline(xintercept = 0, color = "white", linewidth = 0.2) +
  geom_point(shape = 1) +
  theme_kai("dark") +
  theme(
    legend.position = "none",
    axis.text.y = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  labs(x = NULL, y = "Replicate")

Each replicate has 10 values in it, sampled from a normal distribution with a mean of 0 and a standard deviation of 10.

However, just because the values were sampled from a distribution with those parameters doesn’t mean that the statistics we calculate from the samples will, on average, equal those parameters (that is to say, the sampling distribution of a statistic isn’t guaranteed to have a mean equal to the population value, a difference known as bias), so let’s check it out:

# We're creating our own functions because R's built-in var() and sd()
# automatically apply Bessel's correction
calc_var <- function(values, center = mean(values)) {
  n <- length(values)
  sq_diffs <- (values - center)^2
  sum(sq_diffs) / n
}
calc_sd <- function(values, center = mean(values)) {
  sqrt(calc_var(values, center))
}

sds <- sapply(samples, calc_sd)
Plotting code
plot <- ggplot(data.frame(x = sds), aes(x)) +
  theme_kai() +
  geom_histogram(bins = 1000) +
  geom_vline(xintercept = sd, color = ACCENT) +
  geom_vline(xintercept = mean(sds), color = ACCENT, linetype = 2) +
  labs(x = "Sample SD", y = NULL)
ggsave("sds.png", plot, dpi = 300, width = 4, height = 4, units = "in")

In this figure I’m showing the distribution of the calculated sample standard deviations (without correction!!) for all 100,000 samples. The dashed line represents the mean of the sampling distribution, while the solid line represents the true population standard deviation. As you can tell, the true value is being underestimated a bit.

What about if we use Bessel’s correction?

# Using R's built-in sd function, which applies Bessel's correction
sds <- sapply(samples, sd)
Plotting code
plot <- ggplot(data.frame(x = sds), aes(x)) +
  theme_kai() +
  geom_histogram(bins = 1000) +
  geom_vline(xintercept = sd, color = ACCENT) +
  geom_vline(xintercept = mean(sds), color = ACCENT, linetype = 2) +
  labs(x = "Sample SD", y = NULL)
ggsave("sds_w_bessel.png", plot, dpi = 300, width = 4, height = 4, units = "in")

This helps quite a bit (though it isn’t perfect).
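If you’d rather see numbers than vertical lines, here’s a quick check of both estimators against the true value of 10, reusing samples and calc_sd() from above (the exact values will wobble a bit with the seed):

mean(sapply(samples, calc_sd))  # uncorrected: noticeably below 10
mean(sapply(samples, sd))       # corrected: closer to 10, but still a touch low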

For leading and didactic reasons, let’s also look at the distributions of the sample variances without and with Bessel’s correction:

vars <- sapply(samples, calc_var)
Plotting code
plot <- ggplot(data.frame(x = vars), aes(x)) +
  theme_kai() +
  geom_histogram(bins = 1000) +
  geom_vline(xintercept = sd^2, color = ACCENT) +
  geom_vline(xintercept = mean(vars), color = ACCENT, linetype = 2) +
  labs(x = "Sample Variance", y = NULL)
ggsave("vars.png", plot, dpi = 300, width = 4, height = 4, units = "in")
vars <- sapply(samples, var)
Plotting code
plot <- ggplot(data.frame(x = vars), aes(x)) +
  theme_kai() +
  geom_histogram(bins = 1000) +
  geom_vline(xintercept = sd^2, color = ACCENT) +
  geom_vline(xintercept = mean(vars), color = ACCENT, linetype = 2) +
  labs(x = "Sample Variance", y = NULL)
ggsave("vars_w_bessel.png", plot, dpi = 300, width = 4, height = 4, units = "in")

You’ll notice that while the sample variance consistently underestimated the population variance before Bessel’s correction, it lined up nearly exactly after we used Bessel’s correction. We are forced to agree that Bessel’s correction is doing something good, and is worth investigating further.
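The same numeric check for the variances makes the contrast starker (again reusing the simulated samples; the true variance is $10^2 = 100$):

mean(sapply(samples, calc_var))  # uncorrected: sits below 100
mean(sapply(samples, var))       # corrected: lands almost exactly on 100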

Explanation II: It accounts for reusing data

Besides the explanation of ‘it just works, so do it’ (which is, admittedly, not an explanation but rather a motivation), we can explain Bessel’s correction in terms of combating ‘data reuse’.

When calculating the standard deviation, we find the differences between the mean and a given value. No worries if we know the mean of the population, but in my forays with statistics this is an incredibly rare scenario. Much more likely is that we are forced to use the mean of the sample, then estimate the population standard deviation using the differences of the values of the sample and the mean of the sample.

As we explored a bit in the first blog post of this series, variance is defined in large part by the distances of the values from the mean, usually either the sample mean or the population mean. As it turns out, the variance is minimized when that value is the mean of the sample.

Wait, I don’t believe you

That’s fine. I didn’t believe it either. But it turns out that some fairly harmless algebra can be used to show this is true.

(NB: This section is optional. If you choose to believe me (foolish) that the variance is minimized when distances are measured from the sample mean, you can move to the next section. Otherwise, read on.)

The definition of variance is:

$$\mathrm{Var}(X) := E[(X - \bar{X})^2]$$

That is, variance is the average of the squared differences between the random variable $X$ and its mean (aka its expectation $E[X]$, denoted by $\bar{X}$).

One way to determine if $\bar{X}$ will result in the smallest $\mathrm{Var}(X)$ is to assume that maybe some other constant (let’s call it $c$) would be better. That is, what is this:

$$E[(X - c)^2]$$

One clever trick we can do to get this in terms of the variance definition we know and love is to add and subtract $\bar{X}$ inside the parentheses (effectively adding 0, maintaining equality):

$$\begin{aligned}
E[(X - c)^2] &= E[(X - \bar{X} + \bar{X} - c)^2] \\
&= E[((X - \bar{X}) + (\bar{X} - c))^2] \\
&= E[(X - \bar{X})^2 + 2(X - \bar{X})(\bar{X} - c) + (\bar{X} - c)^2] \\
&\overset{(1)}{=} E[(X - \bar{X})^2] + E[2(X - \bar{X})(\bar{X} - c)] + E[(\bar{X} - c)^2]
\end{aligned}$$

Since $\bar{X}$ and $c$ are constants, $E[\bar{X}]$, $E[c]$, and $E[\bar{X} - c]$ will just be $\bar{X}$, $c$, and $\bar{X} - c$:

$$\begin{aligned}
E[(X - c)^2] &= E[(X - \bar{X})^2] + E[2(X - \bar{X})(\bar{X} - c)] + E[(\bar{X} - c)^2] \\
&= E[(X - \bar{X})^2] + 2(\bar{X} - c)E[X - \bar{X}] + (\bar{X} - c)^2 \\
&\overset{(1)}{=} E[(X - \bar{X})^2] + 2(\bar{X} - c)(E[X] - \bar{X}) + (\bar{X} - c)^2
\end{aligned}$$

Also, since $E[X] := \bar{X}$,

$$\begin{aligned}
E[(X - c)^2] &= E[(X - \bar{X})^2] + 2(\bar{X} - c)(\bar{X} - \bar{X}) + (\bar{X} - c)^2 \\
&= E[(X - \bar{X})^2] + 0 + (\bar{X} - c)^2 \\
&= E[(X - \bar{X})^2] + (\bar{X} - c)^2
\end{aligned}$$

We can see that the first term is our original definition of variance. The second term is squared and therefore can never be negative. At best, it can be 0, which only happens when $c = \bar{X}$. Therefore, the variance is minimized when the thing we’re measuring distances from is the mean of $X$.
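If you’d like a numerical sanity check to go with the algebra, we can lean on base R’s optimize() and the calc_var() helper from earlier: minimizing the variance over all possible centers for one of our samples should land (up to optimize()’s tolerance) right on that sample’s mean.

x <- samples[[1]]
# Find the center that minimizes the (uncorrected) variance of this sample...
optimize(function(center) calc_var(x, center = center), interval = range(x))$minimum
# ...and compare it to the sample mean
mean(x)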

Okay, I believe you

Since you either blindly believe me or have been convinced that measuring all the distances from the sample mean results in the smallest variance, we can now consider the implications.

For any given sample, the sample mean will almost never be exactly the population mean. And since the sample mean always gives the lowest possible variance, the population mean (the thing we would use to estimate the population variance if we knew it, because it gives us an unbiased result) will always give a higher variance (except in the unlikely case where the two are exactly equal).

Intuitively, this has been described as painting a target around a bunch of bullet holes, rather than shooting at a target. You’re peeking at the future and then wondering why you’re correct all the time.

This is usually where the concept of ‘degrees of freedom’ gets invoked. If someone tells you that you can pick any 5 numbers, so long as they have a mean of, say, 20, you’re free to do whatever you want to do with 4 of them - but that last one has to pick up the bill to get the mean to be 20. That is, there are only 4 values that are ‘free to vary’ - only 4 degrees of freedom.

This is true for an arbitrary mean and an arbitrary number of values - you will always have $n-1$ degrees of freedom if you are calculating a new statistic by reusing old information (in the case of the sample variance, the sample mean isn’t truly free to vary: it is bound by the sample). So it’s kind of like instead of having $n$ values, you have $n-1$ values. Dividing by $n-1$ instead of $n$ inflates the variance a bit to combat the artificial deflation, and all is well in the world. Right?
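This is also easy to poke at with the simulation, since calc_var() takes a center argument: if we measure distances from the true population mean (0, which we happen to know because we did the sampling), the plain divide-by-$n$ variance needs no correction at all. The data reuse really is the culprit.

# Uncorrected variance around each sample's own mean: biased low
mean(sapply(samples, calc_var))
# Uncorrected variance around the true population mean of 0: roughly 100, no correction needed
mean(sapply(samples, function(x) calc_var(x, center = 0)))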

Explanation III: For those suspicious of simplicity

$n-1$ feels pretty dang neat. Too neat. To rigorously prove that dividing by $n-1$ rather than $n$ provides the correction we need, we can pay our dues with math.

Gregory Gundersen (who has a very nice blog) has a post proving Bessel’s correction that I’m largely going to follow, expounding on the points I found confusing (not due to Gregory, but rather to my own knowledge gaps). I also tend to write the individual steps out more, to hopefully reduce your cognitive overhead.

The sample variance $s^2$ is defined by

$$s^2 := \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

Note that compared to $\sigma^2$ (the population variance), $s^2$ measures the difference between values and the sample mean ($\bar{X}$) rather than the population mean ($\mu$).

Our goal, ultimately, is to find $E[s^2]$ in terms of $\sigma^2$ and then see by how much it differs, and whether that difference corresponds to Bessel’s correction.

$$\begin{aligned}
E[s^2] &= E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right] \\
&\overset{(1)}{=} \frac{1}{n}E\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 + \sum_{i=1}^{n}(-2X_i\bar{X}) + \sum_{i=1}^{n}\bar{X}^2\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 - 2\bar{X}\sum_{i=1}^{n}X_i + n\bar{X}^2\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 - 2\bar{X}\,n\bar{X} + n\bar{X}^2\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 - 2n\bar{X}^2 + n\bar{X}^2\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right] \\
&\overset{(1)}{=} \frac{1}{n}\left(E\left[\sum_{i=1}^{n}X_i^2\right] - E[n\bar{X}^2]\right) \\
&\overset{(1)}{=} \frac{1}{n}\left(E\left[\sum_{i=1}^{n}X_i^2\right] - nE[\bar{X}^2]\right) \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2\right] - E[\bar{X}^2] \\
&\overset{(1)}{=} \frac{1}{n}\sum_{i=1}^{n}E[X_i^2] - E[\bar{X}^2] \\
&= E[X_i^2] - E[\bar{X}^2]
\end{aligned}$$

Since

$$\mathrm{Var}(X) \overset{(2.1)}{=} E[X^2] - E[X]^2 \quad\Longrightarrow\quad E[X^2] = \mathrm{Var}(X) + E[X]^2$$

Substituting $X_i$ for $X$ gets us

$$E[X_i^2] = \mathrm{Var}(X_i) + E[X_i]^2$$

$\mathrm{Var}(X_i) = \sigma^2$ and $E[X_i] = \mu$ (note that $X_i$ is not a value but rather the process of drawing a single value, so these equalities still hold), so

$$E[X_i^2] = \mathrm{Var}(X_i) + E[X_i]^2 = \sigma^2 + \mu^2$$

Using the same equation above but substituting $\bar{X}$ for $X$ instead of $X_i$ for $X$, we get:

$$E[\bar{X}^2] = \mathrm{Var}(\bar{X}) + E[\bar{X}]^2$$

$\mathrm{Var}(\bar{X})$ is a bit different from $\mathrm{Var}(X_i)$, but it can be simplified like so:

$$\begin{aligned}
\mathrm{Var}(\bar{X}) &= \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right) \\
&\overset{(2.2)}{=} \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) \\
&\overset{(2.3)}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 \\
&= \frac{n\sigma^2}{n^2} \\
&= \frac{\sigma^2}{n}
\end{aligned}$$

Plugging this back into the equation, we get:

$$E[\bar{X}^2] = \mathrm{Var}(\bar{X}) + E[\bar{X}]^2 = \frac{\sigma^2}{n} + E[\bar{X}]^2 = \frac{\sigma^2}{n} + \mu^2$$

So, substituting these for $E[X_i^2]$ and $E[\bar{X}^2]$, we get

$$\begin{aligned}
E[s^2] &= E[X_i^2] - E[\bar{X}^2] \\
&= (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{n} + \mu^2\right) \\
&= \sigma^2 - \frac{\sigma^2}{n} \\
&= \sigma^2\left(1 - \frac{1}{n}\right) \\
&= \sigma^2\,\frac{n-1}{n}
\end{aligned}$$

Rearranging this equation, we get:

$$E[s^2] = \sigma^2\,\frac{n-1}{n} \quad\Longrightarrow\quad \sigma^2 = \frac{n}{n-1}E[s^2]$$

That is to say: rather than dividing by $n$, undo the division by $n$ (by multiplying by $n$) and divide by $n-1$ instead. This is Bessel’s correction. Was it worth it??? All the suffering????
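We can also check the derivation’s prediction against the simulation from earlier: with $\sigma^2 = 100$ and $n = 10$, the uncorrected sample variance should average out to about $\sigma^2\,\frac{n-1}{n} = 90$, and multiplying by $\frac{n}{n-1}$ should bring it back to roughly 100.

uncorrected <- sapply(samples, calc_var)
mean(uncorrected)                # should be close to 90
mean(uncorrected) * n / (n - 1)  # should be close to 100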

Do we need Bessel’s correction at all?

‘It depends’. Of course it does. Bessel’s correction removes bias (that is, the difference between the true statistic and the mean of the distribution of estimated statistics) for the variance, so in cases where theoretical perfection is required, it might be preferred. Indeed, the theory behind it is pretty interesting! However, it doesn’t completely remove bias in the case of the standard deviation (as we saw in some of our first plots), and it’s frankly confusing as hell. Jeffrey Rosenthal wrote a beautiful article summarizing the challenges of teaching Bessel’s correction, and mentioned that the smallest mean squared error results not from dividing by $n-1$, or even $n$, but $n+1$. As such, it might be best to leave Bessel’s correction as a theoretical curiosity for more advanced studies, stick with dividing by $n$ for didactic purposes, and use $n+1$ for applications that require the smallest MSE. However, I hope this post at least helps explain why and how $n-1$ came to be in the first place, since it is still very much present.
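For the curious, here’s one rough way to compare the three divisors on our simulated samples (divisor_mse() is just a throwaway helper for this check, not anything official): for normally distributed data like ours, $n+1$ should come out with the smallest mean squared error.

divisor_mse <- function(divisor) {
  estimates <- sapply(samples, function(x) sum((x - mean(x))^2) / divisor)
  mean((estimates - sd^2)^2)  # MSE against the true variance of 100
}
c(divisor_mse(n - 1), divisor_mse(n), divisor_mse(n + 1))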

Appendix

Random Variable

A random variable isn’t really random, nor is it really a variable. There are a lot of people smarter than me who would quibble with this definition, but a random variable is probably better described as a function that takes some result of an event (a roll of a die, a flip of a coin, a heart rate measurement) and turns it into a number. In the case of a die roll or a heart rate measurement, the conversion is pretty obvious; for heads and tails, one of them is probably going to be 1, the other 0, and the random variable does the conversion.

Just like a function doesn’t have a value until called with an argument, a random variable doesn’t have a value until it is realized.

(1) Linearity of Expectation

Expectation (which for our purposes is identical to the mean) of a random variable has the following definition:

$$E[X] = \frac{1}{n}\sum_{i=1}^{n}X_i$$

It has some algebraically useful properties, namely:

$$E[aX] = aE[X]$$

where $a$ is some constant, and

$$E[X+Y] = E[X] + E[Y]$$

An extension of this is that

$$E\left[\textstyle\sum X\right] = \textstyle\sum E[X]$$

Proof

We can show this using the definition of the mean:

$$\begin{aligned}
E[aX] &= \frac{1}{n}\sum_{i=1}^{n}aX_i \\
&= \frac{1}{n}(aX_1 + aX_2 + \cdots + aX_n) \\
&= a\,\frac{1}{n}(X_1 + X_2 + \cdots + X_n) \\
&= aE[X]
\end{aligned}$$

$$\begin{aligned}
E[X+Y] &= \frac{1}{n}\sum_{i=1}^{n}(X_i + Y_i) \\
&= \frac{1}{n}(X_1 + Y_1 + \cdots + X_n + Y_n) \\
&= \frac{1}{n}\left[(X_1 + \cdots + X_n) + (Y_1 + \cdots + Y_n)\right] \\
&= \frac{1}{n}(X_1 + \cdots + X_n) + \frac{1}{n}(Y_1 + \cdots + Y_n) \\
&= E[X] + E[Y]
\end{aligned}$$
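These properties are also easy to poke at numerically, treating the sample mean as our stand-in for expectation (a sanity check, not a proof):

v <- rnorm(1000)
w <- rnorm(1000)
a <- 3
all.equal(mean(a * v), a * mean(v))        # E[aX] = aE[X]
all.equal(mean(v + w), mean(v) + mean(w))  # E[X + Y] = E[X] + E[Y]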

(2) Variance

(2.1) Rearrangement

Turns out, $\mathrm{Var}(X) = E[(X - \bar{X})^2] = E[X^2] - (E[X])^2$

Proof

$$\begin{aligned}
\mathrm{Var}(X) &= E[(X - \bar{X})^2] \\
&= E[X^2 - 2X\bar{X} + \bar{X}^2] \\
&= E[X^2] - E[2X\bar{X}] + E[\bar{X}^2] \\
&= E[X^2] - 2\bar{X}E[X] + \bar{X}^2 \\
&= E[X^2] - 2\bar{X}^2 + \bar{X}^2 \\
&= E[X^2] - \bar{X}^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}$$
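Numerically, with the sample-mean version of expectation and the calc_var() helper from the main post, the two forms agree:

v <- rnorm(1000)
all.equal(calc_var(v), mean(v^2) - mean(v)^2)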

(2.2) Var(aX) = a²Var(X)

Turns out, $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$

where $a$ is a constant.

Proof

Using the rearranged formula from the previous section and substituting $aX$ everywhere we see $X$:

$$\begin{aligned}
\mathrm{Var}(aX) &= E[(aX)^2] - (E[aX])^2 \\
&= E[a^2X^2] - (E[aX])^2 \\
&= a^2E[X^2] - (aE[X])^2 \\
&= a^2E[X^2] - a^2(E[X])^2 \\
&= a^2\left(E[X^2] - (E[X])^2\right) \\
&= a^2\,\mathrm{Var}(X)
\end{aligned}$$
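Again, a quick numerical check (the constant 3 is arbitrary):

v <- rnorm(1000)
all.equal(calc_var(3 * v), 3^2 * calc_var(v))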

(2.3) Bienaymé’s formula

Another algebraically useful property is that if a set of random variables are independent, then the variance of their sum is equal to the sum of their variances. Or rather:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{k=1}^{n}\mathrm{Var}(X_k)$$

Proof

The following proof is taken from here.

Using the result from section (2.1), namely that

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$

Apparently, the sum of random variables is also a random variable, so this holds for:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = E\left[\left(\sum_{i=1}^{n}X_i\right)^2\right] - \left(E\left[\sum_{i=1}^{n}X_i\right]\right)^2$$

In the linked answer, they helpfully note that

$$\left(\sum_{i=1}^{n}X_i\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}X_iX_j$$

A beautiful explanation they make is, quote:

…which is clear if you think about what you’re doing when you calculate $(X_1 + \dots + X_n)\cdot(X_1 + \dots + X_n)$ by hand.
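In R terms, a toy check of that identity might look like this (outer() builds exactly that i-by-j grid of products):

v <- rnorm(5)
all.equal(sum(v)^2, sum(outer(v, v)))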

So,

$$\begin{aligned}
\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) &= E\left[\left(\sum_{i=1}^{n}X_i\right)^2\right] - \left(E\left[\sum_{i=1}^{n}X_i\right]\right)^2 \\
&= E\left[\sum_{i=1}^{n}\sum_{j=1}^{n}X_iX_j\right] - \left(E\left[\sum_{i=1}^{n}X_i\right]\right)^2 \\
&\overset{(1)}{=} \sum_{i=1}^{n}\sum_{j=1}^{n}E[X_iX_j] - \left(E\left[\sum_{i=1}^{n}X_i\right]\right)^2 \\
&\overset{(1)}{=} \sum_{i=1}^{n}\sum_{j=1}^{n}E[X_iX_j] - \left(\sum_{i=1}^{n}E[X_i]\right)^2 \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n}E[X_iX_j] - \sum_{i=1}^{n}\sum_{j=1}^{n}E[X_i]E[X_j] \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n}\left(E[X_iX_j] - E[X_i]E[X_j]\right) \\
&:= \sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j)
\end{aligned}$$

In the case of independent sampling, we expect covariance to be 0 between any non-identical random variables (in other words, if we thought about it like a matrix we would expect 0s everywhere not on the diagonal), so:

$$\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{n}\mathrm{Cov}(X_i, X_i)$$

If we look at the rearranged definition of covariance

$$\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$$

and consider the case in which $X = Y$:

$$\mathrm{Cov}(X,X) = E[XX] - E[X]E[X]$$

We realize that this is just the definition of variance:

$$\mathrm{Cov}(X,X) = E[XX] - E[X]E[X] = E[X^2] - E[X]^2 = \mathrm{Var}(X)$$

So, substituting this back in, we see:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\mathrm{Cov}(X_i, X_i) = \sum_{i=1}^{n}\mathrm{Var}(X_i)$$
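As a numerical gut check with independent draws (the agreement is approximate, since the empirical covariances are close to, but not exactly, 0):

m <- matrix(rnorm(100000 * 5, sd = 2), ncol = 5)  # 5 independent columns, each with variance 4
calc_var(rowSums(m))        # variance of the sums
sum(apply(m, 2, calc_var))  # sum of the variances; both should be near 5 * 4 = 20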

(3) Covariance

Covariance can be defined as follows:

$$\mathrm{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])]$$

Which can be rewritten, using the linearity of expectation, as:

$$\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$$
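And one last numerical check that the two covariance forms match (again using the sample mean as expectation):

v <- rnorm(1000)
w <- rnorm(1000)
all.equal(
  mean((v - mean(v)) * (w - mean(w))),
  mean(v * w) - mean(v) * mean(w)
)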

Sources

As always, Wikipedia served as a great entry point to many downstream citations:

Explanation of why the sample mean results in the lowest variance adapted from:

Andy Field’s “Discovering Statistics” book has a great explanation of why $n-1$, using the degrees-of-freedom argument

Nice proof of Bessel’s Correction:

Why do constants get squared when in variance:

Should we care about Bessel’s correction?

Nice proof of Bienaymé’s formula:

Acknowledgments

Thanks to rigor for helping me think about, and then understand, the difference between calculating/estimating statistics/samples