The Slowest Line Fit: Part I
Introduction
Artificial intelligence (AI), machine learning (ML), and statistics are all phrases that get thrown around with increasing irreverence, making it difficult to nail down exactly what is what. In theory, AI is a category that encompasses ML, but nowadays, colloquially, when people refer to AI they are referring to large language models (LLMs) or some other kind of generative model, muddying the waters that much more. Further, data scientists and statisticians are doing plenty of work that they would probably only consider ML for résumé reasons; in reality, ML and statistics seem to me to be more of an ever-shifting, perspective-dependent Venn diagram than any kind of spectrum.
The point of that is to provide some insurance that if I say something funky, it’s only partially because I’m an idiot. The other 20% is because it’s confusing.
I want to learn more about machine learning. I know a little, but in a haphazard, ‘I guess it’s working’ kind of way. And I know a bit about statistics. I’m curious if I could learn a bit about machine learning by explaining it to myself (and hopefully you?) through the lens of one of the simpler things to do in statistics: linear regression.
What is linear regression?
Line fitting. It’s line fitting. I mean, yes, it’s so much more, too (you could argue it’s the backbone for most statistical tests), but I don’t want to over-complicate this right from the get-go. You have some data, you put a line through it that fits as well as possible. We’ll work out the details as we go along.
Some complications, though:
- What do you mean ‘as good as possible’?
- I have a lot of kinds of data and my paper is flat. How does this work?
- Some of these data are NOT numbers.
- These data don’t look like a line at all.
Even though I was just thinking about these complications right off the dome, they look like they would form a good blog post series, so let’s do that, starting from the top and moving our way down.
If you would prefer to get this from a vetted and coherent source rather than from the ravings of a madman, you cannot beat ‘An Introduction to Statistical Learning’, which you can get online, for free (legally, even!) here.
Motivating potato example
Setup
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from kai_theme import kai  # For theming
The stress of computers has become too much and we have decided to become potato farmers. The humble potato knows relatively little of ‘agile’ and ‘git’ (although the dull memory of technology still haunts it, which is why you can power clocks with potatoes). Alas, old habits die hard and you find your data-hungry mind logging the number of potatoes you harvested and the inches of rain you received that season¹.
Here are your logs for the rain and yield for the first 10 of 50 harvests:
rain | yield |
---|---|
4.07 | 124 |
2.23 | 37 |
2.68 | 49 |
4.36 | 77 |
3.27 | 90 |
8.01 | 137 |
2.51 | 90 |
6.79 | 139 |
2.71 | 66 |
2.98 | 61 |
Potato Data Generating Code - Contains Spoilers!
np.random.seed(1)
def make_potato_data(n):
    e = np.random.normal(0, scale=20, size=n)
    rain = (np.random.beta(a=2, b=4, size=n) * 10).round(2)
    potatoes = (23.2 * rain - 3 + e).clip(0).round()
    return pd.DataFrame({"rain": rain, "yield": potatoes})
tatos = make_potato_data(50)
We might look at these and note that the seasons with the highest amount of rain appeared to have the most potatoes, and the seasons with the lowest amount of rain tended to have the lowest number of potatoes.
To make this a little clearer, we can plot it:
Plotting Code
tatos_plt = alt.Chart(tatos).encode(x="rain", y="yield").mark_circle()
tatos_plt.save('tatos.svg')
There definitely seems to be an association between rain and the number of potatoes. These data aren’t just beautiful to behold (like all data are); we can also use them to answer questions like:
- For each additional inch of rain, how many potatoes will I get?
- The forecast says it’ll rain 7 inches this season. How many potatoes should I expect?
- I have a potato debt with the potato mafia and I must give them no fewer than 100 potatoes this season or I must change my name and flee. At least how much rain should I hope for?
- How much can we predict about yield via just rain? How much variability can’t be predicted via just rain?
All of this can be answered starting with the humble line. Where is that guy anyway?
The Humble Line
Oh my god. He’s here but he’s shown up terribly drunk. This is embarrassing for both of us.
Plotting Code
bad_fit = pd.DataFrame({'rain': [0,10], 'yield': [150, 100]})
bad_fit_plt = alt.Chart(bad_fit).encode(x="rain", y="yield").mark_line()
tatos_bad_fit_plt = tatos_plt + bad_fit_plt
tatos_bad_fit_plt.save('tatos_bad_fit.svg')
We both know that’s a really crummy fit, just as we know that this one is probably a little better:
Plotting Code
ok_fit = pd.DataFrame({'rain': [0,10], 'yield': [50, 150]})
ok_fit_plt = alt.Chart(ok_fit).encode(x="rain", y="yield").mark_line()
tatos_ok_fit_plt = tatos_plt + ok_fit_plt
tatos_ok_fit_plt.save('tatos_ok_fit.svg')
Mind you, it’s still not a great fit, but it’s worth noting that our peepers can tell when something is a good fit. But how do we explain to computers what a good fit is?
Before we dig into this, let’s start with some data that plots a little better - the difference in scale between rain and potatoes causes some issues in visualizing, but all the mathematics remains the same. We’ll return to our potatoes soon enough.
You might be able to tell that there’s a trend with these data, and that it seems to be roughly one-to-one².
x | y |
---|---|
1 | 1.24 |
2 | 2.20 |
3 | 3.66 |
4 | 5.79 |
5 | 4.88 |
6 | 4.77 |
7 | 5.82 |
8 | 7.33 |
9 | 7.33 |
10 | 10.83 |
Example Data Generating Code - Contains Spoilers!
def create_example_data(n):
    x = range(1, n+1)
    y = x + np.random.normal(0, 1, n)
    y = y.round(2)
    return pd.DataFrame({"x": x, "y": y})
ex = create_example_data(10)
Let’s plot it:
Plotting Code
ex_plt = alt.Chart(ex).encode(
x=alt.X("x", scale = alt.Scale(domain=(0,12)), title = "x"),
y=alt.Y("y", scale = alt.Scale(domain=(0,12)), title = "y")
).mark_circle()
ex_plt.save('ex.svg')
Now let’s put an arbitrary line atop:
Plotting Code
ex_line = pd.DataFrame({"x":[0,12], "y":[4,8]})
ex_line_plt = alt.Chart(ex_line).encode(x="x", y="y").mark_line()
ex_rand_fit_plt = ex_plt + ex_line_plt
ex_rand_fit_plt.save('ex_rand_fit.svg')
We can see that this is a decent fit, though it could be better. In order to create some kind of metric for determining fit, we might consider adding up the distance between all the points and the line.
That is, adding up all of these lines:
Plotting Code
def calc_orth_segment_to_line(points, line):
    # https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line#Another_formula
    # Given y = mx + k and x0, y0
    # k = y - mx
    # x = (x0 + my0 - mk)/(m^2 + 1)
    # put into eq to get y
    m = (line.y[1] - line.y[0]) / (line.x[1] - line.x[0])
    k = line.y[0] - m * line.x[0]
    x = (points.x + m * points.y - m * k) / (m**2 + 1)
    y = m * x + k
    return pd.DataFrame({"x": points.x, "x2": x, "y": points.y, "y2": y})
ortho_lines = calc_orth_segment_to_line(ex, ex_line)
ortho_lines_plt = alt.Chart(ortho_lines).encode(
x="x", x2 = "x2", y="y", y2="y2"
).mark_rule()
ex_ortho_plt = ex_plt + ex_line_plt + ortho_lines_plt
ex_ortho_plt.save('ex_ortho.svg')
In theory we could add up the lengths of all those segments and then try to find the line that minimizes that total. If only it were that simple! While this is a way to fit a line (it goes by the name ‘orthogonal regression’, among others), it isn’t common. I’ll teach you the most common way first, and then I’ll briefly discuss other methods for fitting.
The most common way is known as ‘ordinary least squares’ (OLS). ‘Ordinary’ because it stands in contrast to the myriad other ways you can fit by least squares, and ‘least squares’ being an incredibly terse way of saying ‘minimizing the sum of the squared residuals’, which I’ll explain later.
The first thing that differs between the figure above and OLS is that we don’t measure the distance perpendicular to the fit, we measure it by ‘dropping down’ to the line:
Plotting Code
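# For each point, compute the y-value of the line at that point's x - i.e. where
# the point lands if we 'drop' it straight down (or up) onto the line.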
def calc_drop_segment_to_line(points, line):
    m: float = (line.y[1] - line.y[0]) / (line.x[1] - line.x[0])
    k: float = line.y[0] - m * line.x[0]
    y = m * points.x + k
    return pd.DataFrame({"x": points.x, "x2": points.x, "y": points.y, "y2": y})
drop_lines = calc_drop_segment_to_line(ex, ex_line)
drop_lines_plt = alt.Chart(drop_lines).encode(
x="x", x2 = "x2", y="y", y2="y2"
).mark_rule()
ex_line_drop_plt = ex_plt + ex_line_plt + drop_lines_plt
ex_line_drop_plt.save('ex_line-drop.svg')
We ‘drop down’ to the line (or up, depending on where the point is in relation to the line) because we presume that the independent variable - here x - is accurately measured. That is, the difference between our prediction (the line) and our observed values (points) is due only to the fault of the y-axis. Note that most of the time, this is actually not true, but it’s a little fiction we allow ourselves and it actually works out fairly well. There’s nothing special about the x-axis, besides the fact that the independent variable goes there most of the time. If we had switched the axes, we would ‘drop to the left’ (or right) instead.
While we’ll get to higher dimensions in a different post in the future, just know that if you had multiple independent variables, you would draw the line such that the only thing changing as you moved from the line to the point would be your dependent variable. Since the y-axis is our dependent variable, that means ‘dropping down’.
If we were to sum up all the (absolute) distances from the points to the line, this kind of fit would be known as Least Absolute Deviations. It has its place, but we won’t cover it here. It has the benefit of being comparatively robust to outliers, much like the mean absolute deviation is more robust to outliers than the standard deviation is (I wrote a blog post about it IF YOU EVEN CARE).
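If you prefer formulas: writing $\hat{y}_i$ for the height of the candidate line at $x_i$ (notation we’ll introduce properly in a moment), Least Absolute Deviations minimizes the quantity on the left, while ordinary least squares minimizes the one on the right:

$$\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad \text{vs.} \qquad \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$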
Briefly, we should talk about some terminology. These distances from the points to the line are called residuals. This is not to be confused with errors. A residual is the difference between an observed value and an estimated value. In the case of our linear modeling, the observed value is the y-value of the point, and the estimated value is the y-value of the line at the same x-value (that is, the point that is ‘dropped down’ (or up) to). On the other hand, an error is the difference between an observed value and the ‘true’ value. This is almost never known in practical situations! To make matters more confusing, sometimes the word ‘error’ is used when ‘residual’ would be more apt, such as in the phrase ‘mean squared error’³.
With that out of the way, instead of talking about ‘drop lines’, I can talk about ‘residuals’. And instead of taking the absolute sum of these residuals, we’re going to sum their squares.
Mathematically speaking, we have some line. If you recall, the formula for a line looks something like:
$$y = mx + b$$

Where $y$ is the dependent variable (here just $y$, but in the potato example, yield), $x$ is the independent variable (here just $x$, but in our potato example, rain), $m$ is the slope, and $b$ is the intercept (the value of $y$ when $x = 0$).
As a worked example, suppose our line fit (which is not necessarily the correct fit) has a slope of 0.5 and an intercept of 2 - that is, $y = 0.5x + 2$:
Plotting Code
ex_line_2 = pd.DataFrame({"x":[0,12], "y":[2,8]})
ex_line_2_plt = alt.Chart(ex_line_2).encode(x="x", y="y").mark_line()
ex_fit_2_plt = ex_plt + ex_line_2_plt
ex_fit_2_plt.save('ex_fit-2.svg')
In this example, there’s one point at (4, 5.79):
Plotting Code
ft_pt = pd.DataFrame({"x": [4], "y": [5.79]})
ft_pt_plt = alt.Chart(ft_pt).encode(x="x", y="y").mark_circle(
fill = "dodgerblue", stroke = "dodgerblue", size=100
)
ex_ft_pt_plt = ex_fit_2_plt + ft_pt_plt
ex_ft_pt_plt.save('ex_ft-pt.svg')
We calculate its residual by first figuring out what the value of $y$ would be given our model (our line) - in this case, $0.5 \cdot 4 + 2 = 4$:
Plotting Code
ft_pred = pd.DataFrame({"x": [4], "y": [4]})
ft_pred_plt = alt.Chart(ft_pred).encode(x="x", y="y").mark_circle(
fill = "orangered", stroke = "orangered", size=100
)
ex_pred_plt = ex_ft_pt_plt + ft_pred_plt
ex_pred_plt.save('ex_pred.svg')
Conventionally, to distinguish between the actual, observed value of $y$ (here 5.79) and the value of $y$ cranked out by our model (which might be a terrible model, so best not to conflate the two), we usually call the latter “y hat” and give it a hat ($\hat{y}$), and say it is a prediction⁴ of $y$. So, while the observed value is $y = 5.79$, our model predicts $\hat{y} = 4$.
The residual is the difference between the two:
Plotting Code
ft_res = pd.DataFrame({"x": [4], "x2": [4], "y": [4], "y2": [5.79]})
ft_res_plt = alt.Chart(ft_res).encode(x="x", x2="x2", y="y", y2 = "y2").mark_rule()
ex_res_plt = ft_res_plt + ex_pred_plt
ex_res_plt.save('ex_res.svg')
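Just to put numbers on that picture, here’s a tiny sanity check (nothing fancy - just the line $y = 0.5x + 2$ evaluated at $x = 4$):

```python
# Sanity check: prediction and residual for the point (4, 5.79)
# under the candidate line y = 0.5x + 2.
y_observed = 5.79
y_predicted = 0.5 * 4 + 2               # the line's value at x = 4, i.e. 4.0
residual = y_observed - y_predicted
print(y_predicted, round(residual, 2))  # 4.0 1.79
```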
If we calculate the residuals for every point…
Plotting Code
reses = calc_drop_segment_to_line(ex, ex_line_2)
res_plt = alt.Chart(reses).encode(
x="x", x2 = "x2", y="y", y2="y2"
).mark_rule()
ex_ress_plt = ex_fit_2_plt + res_plt
ex_ress_plt.save('ex_ress.svg')
Then square them and take the sum, we get some value:
def calc_ss(points, line):
    m = (line.y[1] - line.y[0]) / (line.x[1] - line.x[0])
    k = line.y[0] - m * line.x[0]
    y = m * points.x + k
    res = points.y - y
    return sum(res**2)
o = calc_ss(ex, ex_line_2)
print(o)
22.8837
So now we have a value. By itself, this means relatively little. However, this value becomes more useful if we use it to compare some other fits.
For instance, let’s consider this pretty terrible fit:
Plotting Code
bad_fit = pd.DataFrame({"x": [0, 12], "y": [4, 1]})
bad_fit_plt = alt.Chart(bad_fit).encode(x="x", y="y").mark_line()
bad_ress = calc_drop_segment_to_line(ex, bad_fit)
drop_lines_plt = alt.Chart(bad_ress).encode(
x="x", x2 = "x2", y="y", y2="y2"
).mark_rule()
bad_ress_plt = ex_plt + drop_lines_plt + bad_fit_plt
bad_ress_plt.save('bad_ress.svg')
This fit is visually worse, and, as we might expect, its sum of squared residuals is much bigger too:
o2 = calc_ss(ex, bad_fit)
print(o2)
184.9712
Since we’re programming, we can calculate the sum of squares for a variety of slopes and intercepts (also, quick note - I’m using the log of the sum of squares (lss) for color display purposes only):
Plotting Code
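# Build a grid of candidate (intercept, slope) pairs and compute the sum of
# squares for each. Note: the grid's columns borrow the data's column names,
# but the first column holds candidate intercepts (b) and the second holds
# candidate slopes (m). Also note that calc_ss is redefined here to take a
# (b, m) tuple instead of a two-point line.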
def search_space(seq_x, seq_y, data):
    space = pd.DataFrame({
        data.columns[0]: [x for x in seq_x for y in seq_y],
        data.columns[1]: [y for x in seq_x for y in seq_y]
    })
    space["ss"] = [calc_ss(x, data) for x in zip(space.iloc[:, 0], space.iloc[:, 1])]
    space["lss"] = np.log(space["ss"])
    return space

def calc_ss(params, data):
    y = params[1] * data.iloc[:, 0] + params[0]
    res = data.iloc[:, 1] - y
    return sum(res**2)
search_range = np.linspace(-2, 4, 20).round(2)
space = search_space(search_range, search_range, ex)
space_plt = alt.Chart(space).mark_circle(stroke=None, size=70).encode(
alt.X("x").title("b"), alt.Y("y").title("m"),
color=alt.Color("lss").scale(scheme="darkgold", reverse=True)
)
space_plt.save("search-space.svg")
Some things to notice:
- The sum of squares seems to be smallest around b = 0, m = 1
- The plot forms a kind of ‘funnel’ surface
This second point is critical to our application of machine learning, and powers a strategy known as ‘gradient descent’. As a brief overview before we dig into it earnestly: gradient descent describes the process in which we find the minimum of the surface. This is important, because it allows us to find the best fit ‘automatically’. It also works in much more complex contexts than this. Anywhere we start, we can look at the slope in each dimension and move in the steepest direction downward. Repeated iteratively, we can move closer to the minimum. This, of course, has some caveats, which we’ll discuss as we learn about gradient descent more deeply.
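In symbols, one step of the strategy we’ll build looks roughly like this, where $\eta$ is a small ‘learning rate’ (more on that later) and $\mathrm{SS}$ is our sum of squares:

$$(b, m) \;\leftarrow\; (b, m) - \eta \left( \frac{\partial \mathrm{SS}}{\partial b},\; \frac{\partial \mathrm{SS}}{\partial m} \right)$$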
A Gradient Descent into Madness
Let’s orient ourselves. On the following plot, the red dot represents our bad fit:
Plotting Code
bad_point_plt = alt.Chart(space).mark_circle(
stroke = None, color="#f00", size=70
).encode(
x=alt.datum(4), y=alt.datum(-0.25)
)
ss_w_bad_plt = space_plt + bad_point_plt
ss_w_bad_plt.save("ss_w-bad.svg")
If we fix the intercept at $b = 4$ (where our bad fit sits) and nudge the slope $m$ a tiny bit to either side of $-0.25$, we can estimate the slope of the sum-of-squares surface in the $m$ direction:
dy = calc_ss((4, -0.24), ex) - calc_ss((4, -0.26), ex)
dx = -0.24 - -0.26
print(round(dy/dx))
-485
Ok, so the slope is negative when we go from our current slope ($m = -0.25$) toward something more positive. We like negative - negative means that our sum of squares is going down, and that means a better fit. We like negative so much, we’re going to put an arrow pointing in that downhill direction to remind ourselves which way it is:
Plotting Code
m_plt = alt.Chart(space).mark_text(
size=50, angle=180, stroke = "orangered", fill = "orangered"
).encode(
x=alt.datum(4), y=alt.datum(-0.25), text=alt.datum("🠃")
)
ss_w_m_plt = space_plt + m_plt
ss_w_m_plt.save("m.svg")
We can also calculate the slope in the other direction if we hold $m$ fixed at $-0.25$ and nudge the intercept $b$ a tiny bit to either side of 4:
dy = calc_ss((4.01, -0.25), ex) - calc_ss((3.99, -0.25), ex)
dx = 4.01 - 3.99
print(round(dy/dx))
-55
As before, moving in the positive direction also gives a negative slope, so let’s put another arrow pointing that way:
Plotting Code
b_plt = alt.Chart(space).mark_text(
size=50, angle=270, fontWeight=800, stroke = "dodgerblue", fill = "dodgerblue"
).encode(
x=alt.datum(4),
y=alt.datum(-0.25),
text=alt.datum("🠃")
)
ss_w_b_plt = space_plt + m_plt + b_plt
ss_w_b_plt.save("b.svg")
Actually, we can combine the two arrows into a single angled arrow if we do some light trigonometry. The slopes of each tell us the magnitude, and if (after quietly chanting “SOH CAH TOA”) we remember that the angle of the arrow is equal to the inverse tangent of the opposite side length divided by the adjacent side length…
import math

round(math.degrees(math.atan(-485/-55)))
84
So let’s just slap a single arrow on there:
Plotting Code
mb_plt = alt.Chart(space).mark_text(
size=50, angle=270 - 84, stroke = "fuchsia", fill = "fuchsia"
).encode(
x=alt.datum(4),
y=alt.datum(-0.25),
text=alt.datum("🠃")
)
ss_w_mb_plt = space_plt + mb_plt
ss_w_mb_plt.save("mb.svg")
This tiny arrow is a little man pointing you in the direction of better fits. He doesn’t know a whole lot - he’s never left town, and doesn’t know that he’s not pointing you towards the absolute minimum - but he’s looked around the area a bit and surmises that the way he points is a good start.
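As an aside, for a sum of squared residuals we don’t actually have to nudge things numerically to get these slopes - a little calculus hands them to us in closed form. A quick sketch (reusing the `ex` data and our bad fit of $b = 4$, $m = -0.25$) shows they agree with the -485 and -55 we estimated above:

```python
# Closed-form partial derivatives of the sum of squared residuals,
# evaluated at the bad fit b = 4, m = -0.25.
b0, m0 = 4, -0.25
residuals = ex.y - (m0 * ex.x + b0)
dss_db = -2 * residuals.sum()           # slope in the b direction, about -55
dss_dm = -2 * (ex.x * residuals).sum()  # slope in the m direction, about -485
print(round(dss_db), round(dss_dm))
```

So the little man’s sense of direction checks out.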
So let’s follow his directions, why not? The question becomes, how long should we follow his directions before we get our bearings/ask for directions again? Too soon, and the journey will be incredibly slow - we’ll be asking for directions all the time. Too long, and we’ll overshoot our goal. There are, of course, many clever tricks for setting this rate just right, as well as changing this rate as you ‘follow’ - but for the sake of learning, we’re going to stick with a single rate the entire time, and just make it kind of small. We’ll be ok with waiting a bit (although computers are quite fast, so for this simple problem we probably won’t have to wait long).
Here will be our general strategy:
- Start somewhere, figure out the sum of squares at that point.
- Ask the man where to go next (mathematicians and less whimsical folks call this ‘finding the gradient’)
- Follow that for a distance equal to our learning rate
- Repeat steps 1-3
- Stop when we get ‘close enough’
We should talk a bit about that last step. ‘Close enough’ is kind of up to us, but we do need to set a limit, otherwise we’ll spiral closer, and closer, and closer to perfection (but alas, never reach it). Generally, ‘close enough’ means the point at which our sum of squares stops changing very much between iterations - where exactly to draw that line comes down to our tastes and our time.
I wrote some code so we can follow our journey from any starting point, down the gradient, to our stopping point. As an example, this is what gradient descent looks like if we start at our ‘bad fit’. For each iteration, the program makes a dot. Towards the end, the dots become very close together, forming almost a line:
Plotting Code
def gradient_descent_trace(b, m, data, learning_rate, e, search_space_plt, xlims):
    trace = make_trace(b, m, data, learning_rate, e)
    trace_plt = alt.Chart(trace).mark_circle().encode(
        alt.X("b").scale(domain=xlims),
        y="m",
        stroke=alt.Stroke("iteration", legend=None).scale(scheme="redblue"),
        fill=alt.Fill("iteration", legend=None).scale(scheme="redblue")
    )
    return search_space_plt + trace_plt
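# Run gradient descent from the starting point (b, m), recording every step,
# until the sum of squares changes by less than e between iterations.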
def make_trace(b, m, data, learning_rate, e):
    coords = [(b, m)]
    while True:
        ss = calc_ss((b, m), data)
        grad = get_gradient(b, m, data)
        b, m = follow(b, m, grad, learning_rate)
        new_ss = calc_ss((b, m), data)
        if abs(ss - new_ss) < e:
            break
        coords.append((b, m))
    trace = pd.DataFrame(coords)
    trace = trace.rename(columns={0: 'b', 1: 'm'})
    trace["iteration"] = trace.index
    return trace
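# Estimate the gradient of the sum of squares at (b, m) using a central
# finite difference with a small step d in each direction.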
def get_gradient(b, m, data):
    d = 0.01
    dx = d * 2
    dyb = calc_ss((b+d, m), data) - calc_ss((b-d, m), data)
    dym = calc_ss((b, m+d), data) - calc_ss((b, m-d), data)
    return (dyb/dx, dym/dx)
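# Take a single step of size proportional to the learning rate in the
# downhill (negative gradient) direction.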
def follow(b, m, grad, learning_rate):
    new_b = learning_rate * -grad[0] + b
    new_m = learning_rate * -grad[1] + m
    return (new_b, new_m)
with_trace = gradient_descent_trace(4, -0.25, ex, 0.001, 0.000001, space_plt, (-2, 4))
with_trace.save("with_trace.svg")
You can see the points moving down into the ‘valley’ and then sliding along towards the center, towards the local minimum. So it looks like gradient descent is finding the local minimum, or at least getting close to it.
Exactly how close, though? We can use vetted linear modeling software to give us a much more precise fit:
fit = LinearRegression().fit(ex.x.values.reshape(-1, 1), ex.y)
b = fit.intercept_.round(2)
m = fit.coef_.round(2)[0]
print(b, m)
0.7 0.85
So it looks like the best fit has an intercept of about $b = 0.7$ and a slope of about $m = 0.85$. How close did our gradient descent journey get? Let’s look at its final stop:
trace = make_trace(4, -0.25, ex, 0.001, 0.000001)
print(trace.tail(1))
b m iteration
1454 0.707457 0.850747 1454
So it looks like by our 1454th (and final) iteration, we had wandered to roughly $b = 0.71$ and $m = 0.85$ - essentially the same answer the vetted software gave us.
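Incidentally, for plain old linear regression we didn’t strictly need 1,454 iterations at all: least squares has a closed-form solution you can get directly with a bit of matrix algebra. A quick sketch using numpy:

```python
# Closed-form least squares fit: add a column of 1s for the intercept,
# then solve the least squares problem directly.
X = np.column_stack([np.ones(len(ex)), ex.x])
b_exact, m_exact = np.linalg.lstsq(X, ex.y, rcond=None)[0]
print(round(b_exact, 2), round(m_exact, 2))  # 0.7 0.85
```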
So…potatoes
Do you remember when we were still talking about potatoes, and possibly happiness? We got a little sidetracked, but as a quick recap, here’s what we learned:
- How we measure what a ‘good fit’ is (the sum of the squared residuals)
- One way to find the best fit (gradient descent)⁵
We can apply the same strategy to our potato data - first, find out what the actual slope and intercept are:
tatos = tatos.rename(columns={'rain':'x', 'yield':'y'})
fit_tatos = LinearRegression().fit(tatos.x.values.reshape(-1, 1), tatos.y)
b = fit_tatos.intercept_.round(2)
m = fit_tatos.coef_.round(2)[0]
print(b, m)
5.02 20.8
As a reminder, this means that the intercept is about 5.02 and the slope is about 20.8 - or, in potato terms, roughly $\text{yield} = 20.8 \cdot \text{rain} + 5.02$.
And now I’ll make a sum of squares plot roughly centered around the solution. It’s not strictly necessary, but it sure is fun:
Plotting Code
search_range_b = np.linspace(-30, 30, 20).round(2)
search_range_m = np.linspace(-10, 50, 20).round(2)
space = search_space(search_range_b, search_range_m, tatos)
space_plt = alt.Chart(space).mark_circle(stroke=None, size=70).encode(
    alt.X("x").title("b"), alt.Y("y").title("m"),
    color=alt.Color("lss").scale(scheme="darkgold", reverse=True)
)
space_plt.save("search-space_tatos.svg")
Now, we use gradient descent to find the intercept and slope. We’ll start our journey at $b = 0$, $m = 0$:
Plotting Code
tatos_trace_plt = gradient_descent_trace(0, 0, tatos, 0.001, 0.000001, space_plt, (-30, 30))
tatos_trace_plt.save("tatos_trace.svg")
What’s funny about this one is it appears to kind of oscillate across the ‘canyon’ before eventually settling down into the local minimum. We can also check to see how many iterations it took to get there, and what it eventually settled on:
tato_trace = make_trace(0, 0, tatos, 0.001, 0.000001)
print(tato_trace.tail(1))
b m iteration
264 5.020995 20.798555 264
And it looks like we got pretty dang close!
Before we close out, let’s consider the practical implications of our line fitting. This tells us that - if this model is correct - with 0 inches of rain, we would still get a yield of about 5 potatoes (that is, the y-intercept is roughly 5), and that each additional inch of rain is worth roughly 21 more potatoes (the slope of about 20.8).
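And since we went to the trouble of fitting the thing, here’s a quick sketch of using the fit to answer the forecasting questions from earlier (reusing the `fit_tatos` model from above):

```python
# Forecast yields for a couple of hypothetical rainfall amounts (in inches).
rain_forecast = np.array([[2.0], [7.0]])
print(fit_tatos.predict(rain_forecast).round())  # roughly 20.8 * rain + 5.02
```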
Wrap up
It was a long road to get there, but we’ve set up some of the fundamentals for machine learning. We were able to determine (using gradient descent) what the equation for the best line fit was for our potato data (at least, best in terms of the metrics we decided - ordinary least squares). Using this fit, we revealed a bit about the nature of the relationship between rain and potato crop yield. Practically, this might allow us to forecast how many potatoes a year with, say, 2 inches of rain will yield.
In the next post, which will hopefully be shorter, we’ll extend this concept to more dimensions.
Sources
Ordinary least squares:
Predict vs estimate:
Programming machine learning by Paolo Perrotta
- Excellent, easy, Python-based introduction to machine learning
Introduction to Statistical Learning:
Footnotes

1. Now is a good time to tell you that I know nothing about growing potatoes, rain (I am from the desert), and the interaction of potato growth and rain. I am regretting choosing this example, but do you know how hard it is to hit a backspace key?
2. I’m using new example data because when I show orthogonal lines, I want to emphasize that they touch the line at a right angle, which only works if the units on the x-axis and y-axis are displayed at the same physical scale. Since the rain and yield data are on very different scales (i.e., rain is roughly 10x smaller than yield), I’d either have to display the plot verrrrrrrry tall, or the orthogonal lines wouldn’t look orthogonal. Hence the more well-behaved example data.
3. For additional information, Wikipedia explains the difference between residuals and errors quite beautifully in the article on errors and residuals.
4. A prediction is different from an estimation. An estimation, as I understand it, is the attempt to measure some intrinsic property of a variable - its mean, for instance - while a prediction is interested in spinning up a new hypothetical observation, given some parameters, with the goal of being as close to reality as possible. In my mind, estimation feels more like a summary process, while prediction feels more ‘creative’ (that is, it creates new values).
5. It feels important to note that linear modeling rarely uses gradient descent in practice - it’s much less efficient than the matrix algebra of linear least squares.