Deviation Deep Dive: Part II
library(tidyverse)
Introduction
In the last post we talked about calculating the standard deviation, as well as some rationale for why it’s calculated the way it is. One point we glossed over, though, is why we would go through the trouble of squaring the results only to take the square root later - why not use a simple absolute value instead?
When I first went into this, I had assumed that it was because - due to the absolute value function having a sharp point right at 0 - it wasn’t differentiable everywhere, so it caused some issues mathematically. Turns out that while this is partially true, it’s not the whole truth.

I dug back through some crusty papers, all the way back to something by Fisher himself (Fisher, 1920). He and another guy were beefing (as always, it seems, for Fisher) about - essentially - whether the standard deviation or the mean absolute deviation was the better measure. I’m going to do my best to explain what I’ve found here.
Oh, and before I get going, I’m not going to talk about Bessel’s correction here (that is, dividing by $n - 1$ instead of $n$)2.
Let’s begin.
What is the mean absolute deviation?
First, let’s talk a little bit about how to calculate the mean absolute deviation. I’m going to abbreviate it as MAD, but to be clear in other contexts this could stand for median absolute deviation. Some people call it the average absolute deviation to get around this, but I already used MAD so much that I don’t want to change it. Plus it’s fun! He’s angry!
The mean absolute deviation is much easier to describe and conceptualize as compared to the standard deviation:
- Find the mean
- Find the difference between the mean and every value
- Take the absolute value of those differences
- Sum them
- Divide the sum by the number of values
This can be encapsulated by the following equation:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|$$

Where $x_i$ is each value, $\bar{x}$ is the mean, and $n$ is the number of values.
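To make those steps concrete, here’s a tiny worked example in R (the numbers are made up for illustration):

```r
# A toy dataset
x <- c(2, 4, 4, 4, 6, 10)

# Step 1: the mean
m <- mean(x)  # 5

# Steps 2 and 3: absolute differences from the mean
abs_diffs <- abs(x - m)  # 3 1 1 1 1 5

# Steps 4 and 5: sum, then divide by the number of values
sum(abs_diffs) / length(x)  # 2
```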
A case for the MAD
Some folks (including Nassim Taleb) have made a case for teaching MAD vs SD, for the reasons that it is easier to calculate as well as conceptualize. Personally, when I think about deviation, I find it easier to think about it as ’the average distance from the mean’ (which is what MAD is) versus the more convoluted ’the square root of the average squared distance from the mean’ (which is what SD is). This also makes the value itself more practically interpretable - if a population has a mean of 5 and a MAD of 2, you know that values pulled from this distribution will be, on average, around 2 units away from the mean1.
MAD isn’t just a different SD
This isn’t, however, why the standard deviation won out. In the paper written by Fisher I mentioned previously, Fisher argued that the standard deviation was actually better than the mean absolute deviation. Stephen Gorard (2004) explained this topic beautifully by first showing that MAD and SD are different from one another, with a figure I will adapt below.
# Sample 10 numbers, from 0 - 100. Do that 10,000 times.
# Gorard only did it 255 times but more is always better. Right?
samples <- replicate(
  10000,
  sample(0:100, 10, replace = TRUE),
  simplify = FALSE
)
# Definitions of the mean absolute deviation and standard deviation
mean_deviation <- function(x) {
  sum(abs(x - mean(x))) / length(x)
}

# No Bessel's correction yet - we'll get to that
standard_deviation <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}
# Calculate MAD and SD for every single sample
mads <- sapply(samples, mean_deviation)
sds <- sapply(samples, standard_deviation)
plotting_data <- data.frame(
  mad = mads,
  sd = sds
)

ggplot(plotting_data, aes(mad, sd)) +
  geom_abline(slope = 1) +
  geom_point(shape = 16, alpha = 0.1) +
  theme_minimal() +
  labs(x = "Mean Absolute Deviation", y = "Standard Deviation")

Gorard stated that either:
- SD and MAD must be considered similar enough that you could simply pick your favorite (in which case, why not pick MAD, since it’s easier to understand and calculate?), OR
- You must admit they are different enough that considering the differences is worthwhile.
Looking at the plot, we note that the relationship isn’t one-to-one - if it were, the points would fall on a single line or curve - but rather it forms a kind of distribution, in which a given MAD might map to multiple SDs and vice versa. We also note that the MAD is consistently smaller than the SD (all the points lie above the x = y line).
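As an aside, the points can only ever sit on or above that line: the SD is a root-mean-square of the deviations, which is always at least their plain mean. We can spot-check that claim on fresh draws (the seed and number of draws below are arbitrary):

```r
set.seed(1)

mean_deviation <- function(x) sum(abs(x - mean(x))) / length(x)
standard_deviation <- function(x) sqrt(sum((x - mean(x))^2) / length(x))

# Fresh draws, same setup as the plot above
samples <- replicate(1000, sample(0:100, 10, replace = TRUE), simplify = FALSE)
mads <- sapply(samples, mean_deviation)
sds <- sapply(samples, standard_deviation)

# TRUE: the MAD never exceeds the SD
all(mads <= sds + 1e-9)
```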
So, it seems wise to admit that MAD and SD really are different measures, and that the differences between them are worth examining.
Why we like SD
Fisher, in 1920, published a paper in response to a certain Arthur Eddington. Eddington had stated that MAD was both practically and theoretically a better measure of deviation, and Fisher was going to set the record straight. I’m going to be real with you: I don’t understand all the math that he did, and what some may call ’elegant’, I call ‘inscrutable’, but we’re going to work through it together.
I’m going to spoil the result of the paper so we can use that as the light at the end of the tunnel that we work towards ourselves (or at least one of the results of the paper - I didn’t care much to go beyond the first three sections). The result stated that in order to estimate the standard deviation of a population using the MAD, you would need a sample size 14% greater to get the same ’tightness’ of standard error. That’s the briefest way to say it, but we’ll work through it together at length. Starting now.
Just to warn you - like I mentioned previously, I didn’t understand all the math, so while I can’t prove it like he did, we can verify his claims using simulations. Yes, it’s crude, and as such, tremendously on brand for me.
Simulating samples
First, let’s make a bunch of samples from a normal distribution:
# 30 values sampled from a normal distribution with mean 0 and standard
# deviation 5.
# Do that 100,000 times.
sd <- 5
n <- 30
n_samples <- 100000
samples <- replicate(n_samples, rnorm(n = n, sd = sd), simplify = FALSE)
# Let's look at some values of a sample:
samples[[1]][1:10]
[1] -3.8199181 -8.9866895 2.1995732 4.3801874 -3.7793879 0.4391249
[7] 5.2472810 0.8450794 -1.2777123 -2.6700292
# And a quick plot for good measure:
single_sample <- data.frame(x = samples[[1]])

ggplot(single_sample, aes(x)) +
  geom_histogram() +
  theme_minimal()

SD
Mean
Ok, we have some samples (a lot of them - I just showed you one, we have 99,999 more).
Let’s start as Fisher did, by describing the mean, variance, and standard error of the standard deviation. He says a lot of stuff, but in the end he states that for a normal distribution with true standard deviation $\sigma$, the probability distribution of the sample standard deviation (which we denote as $\sigma_1$) is proportional to:

$$\sigma_1^{\,n-2}\, e^{-\frac{n \sigma_1^2}{2 \sigma^2}}$$

Note that $\sigma_1$ here is computed without Bessel’s correction - that is, dividing by $n$ rather than $n - 1$.

We can calculate the mean of $\sigma_1$ by taking its expected value over this distribution.

Which turns out to be:

$$\bar{\sigma}_1 = \sqrt{\frac{2}{n}} \cdot \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} \cdot \sigma$$

For reasons I couldn’t quite follow, when $n$ is reasonably large this is well approximated by:

$$\bar{\sigma}_1 \approx \left(1 - \frac{3}{4n}\right)\sigma$$
We can validate this via our simulated data:
# Again, NO Bessel's correction here
standard_deviation <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

# Calculate the standard deviation for each sample
sds <- sapply(samples, standard_deviation)
# Plot their distribution
ggplot(data.frame(x = sds), aes(x)) +
  geom_histogram() +
  theme_minimal() +
  geom_vline(
    xintercept = mean(sds),
    color = "red"
  ) +
  geom_vline(
    xintercept = (1 - 3 / (4 * n)) * sd,
    color = "blue",
    linetype = 2
  ) +
  geom_vline(xintercept = sd) +
  labs(x = "Standard Deviation")

In the above figure, I’m showing the distribution of $\sigma_1$ across all of our samples, along with three vertical lines: the observed mean of the sample standard deviations (red), Fisher’s approximation $\left(1 - \frac{3}{4n}\right)\sigma$ (dashed blue), and the true $\sigma$ (black). The red and blue lines land almost exactly on top of each other, so the approximation looks good - and note that both fall short of the true $\sigma$.
Variance
The definition of variance is:

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$

With some rearranging (described nicely in the Wikipedia article for variance), you can show this:

$$\mathrm{Var}(X) = E[X^2] - E[X]^2$$

We just found $E[\sigma_1]$ (that’s $\bar{\sigma}_1$), so all we’re missing is $E[\sigma_1^2]$.

This can be found using the same approach we used to find $\bar{\sigma}_1$, this time taking the expected value of $\sigma_1^2$ instead.

Seems hard! Fortunately, Fisher did the work for us, so the answer is:

$$E[\sigma_1^2] = \frac{n-1}{n}\,\sigma^2$$

We can now use this to find the variance of $\sigma_1$:

$$\mathrm{Var}(\sigma_1) = \frac{n-1}{n}\sigma^2 - \left(1 - \frac{3}{4n}\right)^2\sigma^2 = \frac{(8n - 9)\,\sigma^2}{16n^2}$$

However, it looks like Fisher dropped the $-9$ (which matters less and less as $n$ grows), leaving:

$$\mathrm{Var}(\sigma_1) \approx \frac{\sigma^2}{2n}$$
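To get a feel for how little that dropped term matters, we can plug our simulation’s values ($n = 30$, $\sigma = 5$) into both expressions (the variable names here are mine):

```r
n <- 30
sigma <- 5

exact  <- sigma^2 * (8 * n - 9) / (16 * n^2)  # keeping the -9
approx <- sigma^2 / (2 * n)                   # Fisher's simplification

c(exact = exact, approx = approx, ratio = approx / exact)
```

At $n = 30$ the two are within about 4% of each other, and the gap shrinks as $n$ grows.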
We can verify this using our simulated samples:
variance <- function(sigmas) {
  mean(sigmas^2) - mean(sigmas)^2
}

theoretical_variance_sd <- (sd^2 * (8 * n - 9)) / (16 * n^2)
theoretical_variance_sd_no_nine <- sd^2 / (n * 2)
actual_variance_sd <- variance(sds)

vars <- c(actual_variance_sd, theoretical_variance_sd, theoretical_variance_sd_no_nine)

data.frame(
  Name = c("Actual Variance", "Theoretical Variance", "Theoretical Variance (no -9)"),
  Value = round(vars, 4),
  Pct_Actual = round(vars * 100 / actual_variance_sd, 2)
)

                          Name  Value Pct_Actual
1              Actual Variance 0.4161     100.00
2         Theoretical Variance 0.4010      96.37
3 Theoretical Variance (no -9) 0.4167     100.13
The numbers all land within a few percent of each other - close enough, given the approximations involved and the simulation noise.
Standard Error
Not much else to say, but since we’ll need this equation later: taking the square root gives us Fisher’s final answer for the standard error of $\sigma_1$:

$$\mathrm{SE}(\sigma_1) = \frac{\sigma}{\sqrt{2n}}$$
MAD
Mean
Before we get started, we need to talk about translating MAD to SD. We were able to measure the accuracy of our estimate of the standard deviation ($\sigma_1$) by comparing it against the true value $\sigma$. The raw MAD doesn’t estimate $\sigma$, though - it estimates something systematically smaller - so to compare the two fairly, we need to scale the MAD so that it does.
Turns out, for normally distributed populations, the MAD is $\sqrt{2/\pi}$ times the standard deviation. So if we multiply the MAD by $\sqrt{\pi/2} \approx 1.253$, we get a second estimator of the standard deviation, which we’ll call $\sigma_2$:

$$\sigma_2 = \sqrt{\frac{\pi}{2}} \cdot \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|$$
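We can convince ourselves of that $\sqrt{2/\pi}$ relationship with one big draw from a normal distribution (the seed and sample size here are arbitrary):

```r
set.seed(42)

# One large sample from a normal distribution with sigma = 5
x <- rnorm(1e6, mean = 0, sd = 5)

# The raw mean absolute deviation...
raw_mad <- mean(abs(x - mean(x)))

# ...sits very close to sqrt(2/pi) * sigma
c(raw_mad = raw_mad, expected = sqrt(2 / pi) * 5)

# Scaling by sqrt(pi/2) recovers sigma (close to 5)
raw_mad * sqrt(pi / 2)
```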
Anyway, Fisher does a lot of confusing math that I really don’t understand, but he ultimately lands on the idea that the mean of $\sigma_2$ is:

$$\bar{\sigma}_2 = \sqrt{\frac{n-1}{n}}\,\sigma$$
Which we can verify using our simulated samples:
# Note: this masks stats::mad(), which computes the *median* absolute deviation
mad <- function(x) {
  (sum(abs(x - mean(x))) * sqrt(pi / 2)) / length(x)
}

# Calculate the scaled MAD for each sample
mads <- sapply(samples, mad)
# Plot their distribution
plot <- ggplot(data.frame(x = mads), aes(x)) +
  geom_histogram() +
  theme_minimal() +
  geom_vline(
    xintercept = mean(mads),
    color = "red"
  ) +
  geom_vline(
    xintercept = sqrt((n - 1) / n) * sd,
    color = "blue",
    linetype = 2
  ) +
  geom_vline(xintercept = sd)

plot

Looks like Fisher was telling the truth on this one, too. Just for fun, let’s also add the mean of our $\sigma_1$ values (the SD-based estimates) in green:
plot +
  geom_vline(xintercept = mean(sds), color = "green")

You’ll note that the green line (the mean of the SD-based estimates) sits just to the left of the red line (the mean of the MAD-based estimates) - both undershoot the true $\sigma$, but by slightly different amounts.
Variance
As before, to calculate variance we need to know what $E[\sigma_2^2]$ is.

Fisher says that:

$$E[\sigma_2^2] = \frac{n-1}{n}\left(1 + \frac{\pi - 2}{2n}\right)\sigma^2$$

When $n$ gets large, this approaches $\sigma^2$, just as $E[\sigma_1^2]$ did.

Using the same equation for variance we used previously, we find that:

$$\mathrm{Var}(\sigma_2) = E[\sigma_2^2] - \bar{\sigma}_2^2 = \frac{n-1}{n^2} \cdot \frac{\pi - 2}{2}\,\sigma^2$$

As in the previous variance, Fisher dropped a constant (the $-1$ in $n - 1$), giving:

$$\mathrm{Var}(\sigma_2) \approx \frac{\pi - 2}{2n}\,\sigma^2$$
Again, we can verify this with our simulated data:
theoretical_variance_mad <- ((n - 1) / n^2) * ((pi - 2) / 2) * sd^2
theoretical_variance_mad_no_one <- (1 / n) * ((pi - 2) / 2) * sd^2
actual_variance_mad <- variance(mads)

vars <- c(actual_variance_mad, theoretical_variance_mad, theoretical_variance_mad_no_one)

data.frame(
  Name = c("Actual Variance", "Theoretical Variance", "Theoretical Variance (no -1)"),
  Value = round(vars, 4),
  Pct_Actual = round(vars * 100 / actual_variance_mad, 2)
)

                          Name  Value Pct_Actual
1              Actual Variance 0.4781     100.00
2         Theoretical Variance 0.4598      96.17
3 Theoretical Variance (no -1) 0.4757      99.48
Standard Error
Finally, the standard error of $\sigma_2$ is just the square root of its variance:

$$\mathrm{SE}(\sigma_2) = \sigma\sqrt{\frac{\pi - 2}{2n}}$$
Comparing SD and MAD
Fisher essentially states that SD (or $\sigma_1$) and the scaled MAD (or $\sigma_2$) are best compared through their standard errors, which we now have for both:

$$\mathrm{SE}(\sigma_1) = \frac{\sigma}{\sqrt{2n}} \qquad \mathrm{SE}(\sigma_2) = \sigma\sqrt{\frac{\pi - 2}{2n}}$$

By taking their ratio, we see that:

$$\frac{\mathrm{SE}(\sigma_2)}{\mathrm{SE}(\sigma_1)} = \sqrt{\pi - 2} \approx 1.068$$

Smaller standard errors are generally better, and we can see that the classic SD method gives us more bang for our buck in this regard. If we wanted to determine how many more samples we would need to get the same standard error using the MAD method, we can do something like this:

$$\frac{\sigma}{\sqrt{2\,n_{SD}}} = \sigma\sqrt{\frac{\pi - 2}{2\,n_{MAD}}} \implies n_{MAD} = (\pi - 2)\,n_{SD} \approx 1.14\,n_{SD}$$
Therefore, to get the same standard error as with the standard deviation method, we would need roughly 14% more samples if we used the MAD method.
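We can put that 14% figure to the test with a simulation of my own devising: the spread of the scaled-MAD estimates at a sample size of 34 (roughly 1.14 × 30) should match the spread of the SD estimates at a sample size of 30:

```r
set.seed(123)

sigma <- 5
n_samples <- 50000

standard_deviation <- function(x) sqrt(sum((x - mean(x))^2) / length(x))
scaled_mad <- function(x) sqrt(pi / 2) * sum(abs(x - mean(x))) / length(x)

# SD-based estimates at n = 30
sd_estimates <- sapply(
  replicate(n_samples, rnorm(30, sd = sigma), simplify = FALSE),
  standard_deviation
)

# MAD-based estimates at n = 34 (~14% more data)
mad_estimates <- sapply(
  replicate(n_samples, rnorm(34, sd = sigma), simplify = FALSE),
  scaled_mad
)

# The two standard errors should come out nearly identical
c(se_sd = sd(sd_estimates), se_mad = sd(mad_estimates))
```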
When we don’t like SD
Despite being theoretically superior to MAD, SD may not always be the preferred metric. One reason I mentioned previously is didactic: it’s marginally easier to teach MAD compared to SD (Gary Kader demonstrates how the MAD might be ‘discovered’ in a classroom setting).
Another reason is the comparative robustness of the MAD. In “Robust Statistics”, Peter Huber (Huber, 1981) demonstrates that even a minute amount of ‘contamination’ in the data can cause MAD to win out over SD in terms of its estimation power. In the following example, I’ll create samples just as before, except this time one of the thirty values in each sample will come from a distribution with a standard deviation 3x larger:
contaminated_samples <- replicate(
  n_samples,
  c(rnorm(n = n - 1, sd = sd), rnorm(1, sd = sd * 3)),
  simplify = FALSE
)

# Calculate the SD and scaled MAD for each sample
sds <- sapply(contaminated_samples, standard_deviation)
mads <- sapply(contaminated_samples, mad)

# Plot their distribution
ggplot(data.frame(x = mads), aes(x)) +
  geom_histogram() +
  theme_minimal() +
  geom_vline(xintercept = mean(mads), color = "red") +
  geom_vline(xintercept = mean(sds), color = "blue") +
  geom_vline(xintercept = sd)

We can see that while both estimates of the standard deviation are dragged upwards by the contaminants, the MAD-based estimate (the red line) is dragged noticeably less than the SD-based one (the blue line), leaving it closer to the true value of 5.
Conclusion
While SD is differentiable everywhere and MAD is not, in my research this appears not to be a major talking point. Rather, the preference for the standard deviation seems to stem from its theoretical efficiency, combined with many years of tradition. When working with real-world data, however, MAD tends to win out thanks to its robustness to outliers - and its interpretability makes it a little more tractable both for teaching in the classroom and for practically wrapping your head around the spread of the data.
In the next post in this series, I’ll talk about Bessel’s correction and bias.
Citations
My primary entry point for all downstream citations was a Wikipedia article; the papers themselves are:
A Mathematical Examination of the Methods of determining the Accuracy of an Observation by the Mean Error, and by the Mean Square Error. RA Fisher. 1920.
Revisiting a 90-year-old debate: the advantages of the mean deviation. S Gorard. 2004.
The Ratio of the Mean Deviation to the Standard Deviation as a Test of Normality. RC Geary. 1935.
Robust Statistics. PJ Huber. 1981.
-
This does NOT mean that half the values will fall within 3 - 7 and half will fall outside that. That is a property of the median absolute deviation, which I will not cover here. ↩︎
-
Yes, yes. I see you back there waving your hand. You’d like to talk about Bessel’s correction now. Well, that’s too bad. I’m not going to talk about it in this blog post. I’ll talk about it in the next one. Hold your horses. ↩︎