Deviation Deep Dive: Part III
library(tidyverse)
library(kaimeleon)
ACCENT <- "#f7d060"
Introduction
In previous blog posts [1] [2], I’ve alluded to the fact that when we calculate the standard deviation, we sometimes divide by $n - 1$ rather than $n$. That tweak is known as Bessel’s correction, and this post digs into why it exists.
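For reference, here are the two estimators we’ll be comparing, written side by side (I’m using $s_n^2$ for the uncorrected version and $s^2$ for the Bessel-corrected one; the standard deviations are just their square roots):

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$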
During my quest for enlightenment, I came across several different explanations, with varying levels of detail and coherence, which I’ll explain back to you in order of increasing detail. I very much regret this endeavor.
Conventions
Unfortunately, this blog post is going to contain some math. Instead of breaking up proofs by saying ‘and we can do this step because of XYZ’, which can make following along difficult, I’m going to include symbols above equals signs that will direct you to a section in the appendix that should show you why the step is possible. For instance:

$$E[aX + bY] \overset{1}{=} aE[X] + bE[Y]$$

Where 1 corresponds to the number in the appendix.
Also, I tried my best to use colors in equations to draw your attention to the changing parts. Apologies if it’s distracting. I find my eyes glaze over pretty quickly if they don’t have ‘landmarks’.
Finally, I’ve chosen to collapse code that I didn’t find pertinent to the lesson at hand, but feel free to expand it if you’re interested!
Explanation I: It just works, OK?
The first and easiest way to look at this is to simply not look at it, a tactic similar to the ‘it’s working, don’t touch it’ strategy I take with particularly arcane sections of code. So maybe we should kick the tires and make sure this thing is even working in the first place. It certainly would be convenient if we didn’t actually need it at all.
Let’s simulate some samples:
set.seed(13)
# Number of simulated samples, values per sample, and the true population SD
replicates <- 100000
n <- 10
sd <- 10
# Each replicate is a sample of n values from Normal(mean = 0, sd = 10)
samples <- replicate(replicates, rnorm(n, sd = sd), simplify = FALSE)
And let’s plot some of the samples to wrap our heads around them:
Plotting code
some <- 20
plotting_data <- data.frame(
value = unlist(samples[seq_len(some)]),
replicate = factor(rep(seq_len(some), each = n))
)
ggplot(plotting_data, aes(value, replicate, color = 1)) +
geom_vline(xintercept = 0, color = "white", linewidth = 0.2) +
geom_point(shape = 1) +
theme_kai("dark") +
theme(
legend.position = "none",
axis.text.y = element_blank(),
panel.grid.minor = element_blank()
) +
labs(x = NULL, y = "Replicate")

Each replicate has 10 values in it, sampled from a normal distribution with a mean of 0 and a standard deviation of 10.
However, just because it was sampled from a distribution with those parameters doesn’t mean that the samples themselves will have those parameters on average (that is to say, the sampling distribution of a statistic isn’t guaranteed to have a mean equal to the corresponding population parameter; that gap is what’s called bias), so let’s check it out:
# We're creating our own functions because R's built-in var() and sd()
# automatically apply Bessel's correction
calc_var <- function(values, center = mean(values)) {
n <- length(values)
sq_diffs <- (values - center)^2
sum(sq_diffs) / n
}
calc_sd <- function(values, center = mean(values)) {
sqrt(calc_var(values, center))
}
sds <- sapply(samples, calc_sd)
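As a quick sanity check (my own aside, not in the lesson itself), the uncorrected variance should differ from R’s built-in var() by exactly a factor of n / (n − 1):

# Sanity check (my own aside): rescaling the uncorrected variance by
# n / (n - 1) should reproduce R's built-in, Bessel-corrected var()
x <- samples[[1]]
calc_var(x) * length(x) / (length(x) - 1)
var(x)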
Plotting code
plot <- ggplot(data.frame(x = sds), aes(x)) +
theme_kai() +
geom_histogram(bins = 1000) +
geom_vline(xintercept = sd, color = ACCENT) +
geom_vline(xintercept = mean(sds), color = ACCENT, linetype = 2) +
labs(x = "Sample SD", y = NULL)
ggsave("sds.png", plot, dpi = 300, width = 4, height = 4, units = "in")

In this figure I’m showing the distribution of the calculated sample standard deviations (without correction!!) for all 100,000 samples. The dashed line represents the mean of the sampling distribution, while the solid line represents the true population standard deviation. As you can tell, the true value is being underestimated a bit.
What if we use Bessel’s correction instead?
# Using R's built-in sd function, which applies Bessel's correction
sds <- sapply(samples, sd)
Plotting code
plot <- ggplot(data.frame(x = sds), aes(x)) +
theme_kai() +
geom_histogram(bins = 1000) +
geom_vline(xintercept = sd, color = ACCENT) +
geom_vline(xintercept = mean(sds), color = ACCENT, linetype = 2) +
labs(x = "Sample SD", y = NULL)
ggsave("sds_w_bessel.png", plot, dpi = 300, width = 4, height = 4, units = "in")

This helps quite a bit (though it isn’t perfect).
For leading and didactic reasons, let’s also look at the distributions of the sample variances without and with Bessel’s correction:
vars <- sapply(samples, calc_var)
Plotting code
plot <- ggplot(data.frame(x = vars), aes(x)) +
theme_kai() +
geom_histogram(bins = 1000) +
geom_vline(xintercept = sd^2, color = ACCENT) +
geom_vline(xintercept = mean(vars), color = ACCENT, linetype = 2) +
labs(x = "Sample Variance", y = NULL)
ggsave("vars.png", plot, dpi = 300, width = 4, height = 4, units = "in")

vars <- sapply(samples, var)
Plotting code
plot <- ggplot(data.frame(x = vars), aes(x)) +
theme_kai() +
geom_histogram(bins = 1000) +
geom_vline(xintercept = sd^2, color = ACCENT) +
geom_vline(xintercept = mean(vars), color = ACCENT, linetype = 2) +
labs(x = "Sample Variance", y = NULL)
ggsave("vars_w_bessel.png", plot, dpi = 300, width = 4, height = 4, units = "in")

You’ll notice that while the sample variance consistently underestimated the population variance before Bessel’s correction, it lined up nearly exactly after we used Bessel’s correction. We are forced to agree that Bessel’s correction is doing something good, and is worth investigating further.
Explanation II: It accounts for reusing data
Besides the explanation of ‘it just works, so do it’ (which is, admittedly, not an explanation but rather a motivation), we can explain Bessel’s correction in terms of combating ‘data reuse’.
When calculating the standard deviation, we find the differences between the mean and a given value. No worries if we know the mean of the population, but in my forays with statistics this is an incredibly rare scenario. Much more likely is that we are forced to use the mean of the sample, then estimate the population standard deviation using the differences of the values of the sample and the mean of the sample.
As we explored a bit in the first blog post of this series, variance is defined in large part by the distances of the values from some central point, usually either the sample mean or the population mean. As it turns out, the variance is minimized when the point we measure those distances from is the sample mean.
Wait, I don’t believe you
That’s fine. I didn’t believe it either. But it turns out that some fairly harmless algebra can be used to show this is true.
(NB: This section is optional. If you choose to believe me (foolish) that the point at which to take distances from for variance is minimized by using the sample mean, you can move to the next section. Otherwise, read on.)
The definition of variance is:

$$\mathrm{Var}(X) = E\big[(X - \mu)^2\big]$$

That is, variance is the average of the squared differences between the random variable and its mean. For a concrete sample $x_1, \dots, x_n$ and some center $c$ that we measure distances from, the analogous quantity is:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - c)^2$$

One way to determine if the sample mean $\bar{x}$ really is the center that minimizes this quantity is to expand it for an arbitrary $c$. One clever trick we can do to get this in terms of the variance definition we know and love is to add and subtract $\bar{x}$ inside the square:

$$\frac{1}{n}\sum_{i=1}^{n}\big((x_i - \bar{x}) + (\bar{x} - c)\big)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{2(\bar{x} - c)}{n}\sum_{i=1}^{n}(x_i - \bar{x}) + (\bar{x} - c)^2$$

Since $(\bar{x} - c)$ doesn’t depend on $i$, it can be pulled out of the sums (which is how the middle and last terms above got their shape). Also, since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ (the deviations from the sample mean always cancel out), the middle term vanishes entirely:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - c)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + (\bar{x} - c)^2$$

We can see that the first term is our original definition of variance. The second term is squared and therefore can never be negative. At best, it can be 0, which only happens when $c = \bar{x}$. Measuring distances from anything other than the sample mean can only make the result larger.
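If the algebra alone isn’t convincing, here’s a quick numerical check (my own aside, not part of the original proof): compute the uncorrected variance of one of our simulated samples around a grid of candidate centers and see where it bottoms out.

# Numerical check (my own aside): the center that minimizes the mean of
# squared deviations should be the sample mean
x <- samples[[1]]
centers <- seq(min(x), max(x), length.out = 1000)
spreads <- sapply(centers, function(center) calc_var(x, center = center))
centers[which.min(spreads)]  # essentially mean(x), up to the grid resolution
mean(x)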
Okay, I believe you
Since you either blindly believe me or have been convinced that measuring all the distances from the sample mean results in the smallest variance, we can now consider the implications.
For any given sample, the sample mean will almost never be the population mean. And if the sample mean always gives the lowest variance, then the population mean (the thing we would measure distances from if we knew it, because it gives an unbiased result) will always give a higher variance, except in the unlikely case where the two are exactly equal.
Intuitively, this has been described as painting a target around a bunch of bullet holes, rather than shooting at a target. You’re peeking at the future and then wondering why you’re correct all the time.
This is usually where the concept of ‘degrees of freedom’ gets invoked. If someone tells you that you can pick any 5 numbers, so long as they have a mean of, say, 20, you’re free to do whatever you want to do with 4 of them - but that last one has to pick up the bill to get the mean to be 20. That is, there are only 4 values that are ‘free to vary’ - only 4 degrees of freedom.
This is true for an arbitrary mean and an arbitrary number of values: once the mean is pinned down, you will always have $n - 1$ values that are free to vary, i.e. $n - 1$ degrees of freedom, which is exactly what Bessel’s correction divides by.
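To make the bookkeeping concrete, here’s a tiny sketch of that example (the specific numbers are mine, chosen arbitrarily):

# Degrees of freedom sketch (my own aside): with a target mean of 20 and five
# values, the first four can be anything; the fifth is then forced
target_mean <- 20
free_values <- c(3, 41, 7, 22)      # pick whatever you like
forced_value <- 5 * target_mean - sum(free_values)
forced_value
mean(c(free_values, forced_value))  # exactly 20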
Explanation III: For those suspicious of simplicity
Gregory Gundersen (who has a very nice blog) has a post proving Bessel’s correction that I’m largely going to follow, expounding on points I found confusing (not due to Gregory but rather my own knowledge gaps. I also tend to write the individual steps out more to hopefully reduce your cognitive overhead).
The sample variance without any correction is defined as:

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Note that compared to the population variance $\sigma^2 = E[(X - \mu)^2]$, we’re measuring distances from the sample mean $\bar{x}$ rather than the population mean $\mu$ (which we generally don’t know).

Our goal, ultimately, is to find $E[s_n^2]$, the value of this estimator averaged over many samples, and see how it compares to $\sigma^2$. Expanding the square and using the linearity of expectation:

$$E[s_n^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2\right] \overset{1}{=} E[X^2] - E[\bar{X}^2]$$

Since the rearranged definition of variance,

$$\mathrm{Var}(X) \overset{2.1}{=} E[X^2] - E[X]^2$$

can be solved for the squared terms, we can handle each piece separately. Substituting our random variable $X$ (which has variance $\sigma^2$ and mean $\mu$):

$$E[X^2] = \sigma^2 + \mu^2$$

Using the same equation above but substituting the sample mean $\bar{X}$ (which also has expectation $\mu$):

$$E[\bar{X}^2] = \mathrm{Var}(\bar{X}) + \mu^2$$

The variance of the sample mean can be worked out from the properties in the appendix:

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right) \overset{2.2}{=} \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) \overset{2.3}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n}$$

Plugging this back into the equation, we get:

$$E[\bar{X}^2] = \frac{\sigma^2}{n} + \mu^2$$

So, substituting these for $E[X^2]$ and $E[\bar{X}^2]$, we get:

$$E[s_n^2] = \sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{n-1}{n}\sigma^2$$

Rearranging this equation, we get:

$$\sigma^2 = \frac{n}{n-1}E[s_n^2] = E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$

That is to say, rather than divide by $n$, dividing the sum of squared deviations by $n - 1$ gives us an estimator whose expectation is exactly the population variance.
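We can tie this back to the simulation from the first section (my own aside): with $n = 10$, the uncorrected sample variances should average out to roughly $\frac{9}{10}$ of the true variance.

# Tying the algebra back to the simulation (my own aside): the uncorrected
# sample variances should average to roughly (n - 1) / n of the true variance
mean(sapply(samples, calc_var))  # close to 90
(n - 1) / n * sd^2               # exactly 90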
Do we need Bessel’s correction at all?
‘It depends’. Of course it does. Bessel’s correction removes bias (that is, the difference between the true parameter and the mean of the distribution of estimated statistics) for variance, so in the case where theoretical perfection is required, it might be preferred. Indeed, the theory behind it is pretty interesting! However, it doesn’t completely remove bias in the case of standard deviation (as we saw in some of our first plots), and it’s frankly confusing as hell. Jeffrey Rosenthal wrote a beautiful article summarizing the challenges of teaching Bessel’s correction, and mentioned that the smallest mean-squared error comes not from dividing by $n$ or $n - 1$, but from dividing by $n + 1$ (at least for normally distributed data).
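As a rough illustration of that last point (my own aside, reusing the simulated samples from earlier), we can compare the mean-squared error of the three denominators. Exact numbers will vary run to run, but for normal data the $n + 1$ version should come out smallest.

# MSE comparison of denominators (my own aside), using the simulated samples
sum_sq <- sapply(samples, function(s) sum((s - mean(s))^2))
mse_for <- function(denom) mean((sum_sq / denom - sd^2)^2)
c("n - 1" = mse_for(n - 1), "n" = mse_for(n), "n + 1" = mse_for(n + 1))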
Appendix
Random Variable
A random variable isn’t really random, nor is it really a variable. There are a lot of people smarter than me who would quibble with this definition, but a random variable is probably better described as a function that takes some result of an event (a roll of a die, a flip of a coin, a heart rate measurement) and turns it into a number (for a die roll or a heart rate measurement the conversion is pretty obvious; for heads and tails, one of them is probably going to be 1, the other 0, and the random variable does the conversion).
Just like a function doesn’t have a value until called with an argument, a random variable doesn’t have a value until it is realized.
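Here’s a tiny sketch of that idea (my own aside): a coin-flip ‘random variable’ really is just a function from outcomes to numbers, and it only takes a value once a flip is realized.

# A random variable as a function (my own aside): it maps an outcome
# ("heads"/"tails") to a number, and only has a value once a flip is realized
coin_rv <- function(outcome) ifelse(outcome == "heads", 1, 0)
flips <- sample(c("heads", "tails"), size = 10, replace = TRUE)
coin_rv(flips)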
(1) Linearity of Expectation
Expectation (which for our purposes is identical to the mean) of a random variable has the following definition:

$$E[X] = \sum_{i} x_i \, p(x_i)$$

That is, each possible value weighted by its probability; when all observed values are equally likely, this reduces to the familiar average $\frac{1}{n}\sum_{i} x_i$.

It has some algebraically useful properties, namely:

$$E[aX + bY] = aE[X] + bE[Y]$$

where $a$ and $b$ are constants and $X$ and $Y$ are random variables.

An extension of this is that the expectation of a sum of any number of random variables is the sum of their expectations: $E\left[\sum_i X_i\right] = \sum_i E[X_i]$.

Proof

We can show this using the definition of the mean (treating the observed values as equally likely):

$$E[aX + bY] = \frac{1}{n}\sum_{i=1}^{n}(a x_i + b y_i) = a\,\frac{1}{n}\sum_{i=1}^{n}x_i + b\,\frac{1}{n}\sum_{i=1}^{n}y_i = aE[X] + bE[Y]$$
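And a quick numerical nudge (my own aside) that the property holds:

# Linearity of expectation, checked numerically (my own aside)
x <- rnorm(1e5)
y <- runif(1e5)
a <- 3
b <- -2
mean(a * x + b * y)        # matches (up to floating point)...
a * mean(x) + b * mean(y)  # ...this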
(2) Variance
(2.1) Rearrangement
Turns out, variance can be rewritten as:

$$\mathrm{Var}(X) = E[X^2] - E[X]^2$$

Proof

Starting from the definition, expanding the square, and using the linearity of expectation (remembering that $E[X]$ is just a constant):

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E\big[X^2 - 2XE[X] + E[X]^2\big] \overset{1}{=} E[X^2] - 2E[X]E[X] + E[X]^2 = E[X^2] - E[X]^2$$
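Numerically (my own aside), the two forms agree as long as we use the uncorrected, divide-by-$n$ variance:

# Variance rearrangement, checked numerically (my own aside); this matches
# the uncorrected calc_var() from earlier, not R's Bessel-corrected var()
x <- rnorm(1000)
mean(x^2) - mean(x)^2
calc_var(x)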
(2.2) Var(aX) = a²Var(X)
Turns out,

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$

where $a$ is a constant.

Proof

Using the rearranged formula from the previous section and substituting $aX$ for $X$:

$$\mathrm{Var}(aX) = E\big[(aX)^2\big] - E[aX]^2 \overset{1}{=} a^2E[X^2] - a^2E[X]^2 = a^2\big(E[X^2] - E[X]^2\big) = a^2\,\mathrm{Var}(X)$$
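As a sanity check (my own aside):

# Var(aX) = a^2 * Var(X), checked numerically (my own aside)
x <- rnorm(1000)
a <- 7
calc_var(a * x)
a^2 * calc_var(x)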
(2.3) Bienaymé’s formula
Another algebraically useful property is that if the values of a random variable are independent, then the sum of the variances is equal to the variance of the sum. Or rather:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\mathrm{Var}(X_i)$$
Proof
The following proof is taken from here.
Using the rearranged formula for variance from section (2.1), $\mathrm{Var}(X) = E[X^2] - E[X]^2$, we can work out the variance of a sum directly.

Apparently, the sum of random variables is also a random variable, so this holds for:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = E\left[\left(\sum_{i=1}^{n}X_i\right)^2\right] - E\left[\sum_{i=1}^{n}X_i\right]^2$$

In the linked answer, they helpfully note that

$$\left(\sum_{i=1}^{n}X_i\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}X_iX_j$$

A beautiful explanation they make is, quote:

…which is clear if you think about what you’re doing when you calculate $(X_1 + \dots + X_n)\cdot(X_1 + \dots + X_n)$ by hand.

So, applying that expansion to both terms (and using the linearity of expectation):

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) \overset{1}{=} \sum_{i=1}^{n}\sum_{j=1}^{n}\Big(E[X_iX_j] - E[X_i]E[X_j]\Big) \overset{3}{=} \sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j)$$

In the case of independent sampling, we expect covariance to be 0 between any non-identical random variables (in other words, if we thought about it like a matrix we would expect 0s everywhere not on the diagonal), so:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\mathrm{Cov}(X_i, X_i)$$

If we look at the rearranged definition of covariance,

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$$

and consider the case in which $Y = X$:

$$\mathrm{Cov}(X, X) = E[X^2] - E[X]^2$$

we realize that this is just the definition of variance:

$$\mathrm{Cov}(X, X) = \mathrm{Var}(X)$$

So, substituting this back in, we see:

$$\mathrm{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\mathrm{Var}(X_i)$$
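A simulation-flavored check (my own aside): for independent draws, the variance of the sum should land close to the sum of the variances (only approximately, since we’re estimating from finite samples).

# Bienaymé's formula, checked by simulation (my own aside): independent draws
x <- rnorm(1e5, sd = 3)
y <- rnorm(1e5, sd = 4)
calc_var(x + y)            # roughly 3^2 + 4^2 = 25
calc_var(x) + calc_var(y)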
(3) Covariance
Covariance can be defined as follows:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$

Which can be rewritten, using the linearity of expectation, as:

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$$
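Numerically (my own aside), the two forms agree as long as we use plain divide-by-$n$ means throughout; R’s built-in cov() divides by $n - 1$, so it differs by a factor of $n / (n - 1)$:

# The two covariance forms agree when using plain means (my own aside);
# note that R's cov() divides by n - 1, so it differs by a small factor
x <- rnorm(1000)
y <- rnorm(1000)
mean((x - mean(x)) * (y - mean(y)))
mean(x * y) - mean(x) * mean(y)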
Sources
As always, Wikipedia served as a great entry citation for many downstream citations:
Explanation of why sample mean results in the lowest variance adapted from:
Andy Field’s “Discovering Statistics” book has a great explanation about why
Nice proof of Bessel’s Correction:
Why do constants get squared when in variance:
Should we care about Bessel’s correction?
Nice proof of Bienaymé’s formula:
Acknowledgments
Thanks to rigor for helping me think about, then understand, the difference between calculating/estimating statistics/samples.