Chapter 3 - Assessing How Well Models Fit the Data
3.1 Better Models Make Better Predictions
Thus far you have made some models that seem promising. In the next sections of the textbook, we will conduct an all-out search for the best model of these data. But before embarking on this quest, we need to agree on what makes one model better or worse than another. We can probably all agree that better models make more accurate predictions.
One strategy for comparing models is to make predictions using different models and then check which predictions are closest to the actual data. We can take advantage of our penguins dataset: we can use each model to predict the body mass of every penguin from its flipper length alone. Then we can check: how far were our predictions from the penguins’ actual body masses?
Let’s limit our dataset to just the two variables we are thinking about. This will make it easier to check them against the predictions we generate later.
select_penguins <- select(penguins, flipper_length_m, body_mass_kg)
head(select_penguins)
flipper_length_m body_mass_kg
1 0.194 ███
2 0.217 ███
3 0.185 ███
4 0.218 ███
5 0.210 ███
6 0.192 ███
Instead of looking at their actual values on body_mass_kg (which we redacted for now with black boxes), we’re going to generate a predicted body mass for each penguin based on a model. Actually, let’s compare two different models, each of which would make its own predictions.
As you can see in the figure below, the model on the left (which we’ve called better_function and represented with a blue line) is clearly better than the model on the right (worse_function, represented with a red line).
# Model drawn as the blue line (left panel)
better_function <- function(X){-6 + 51*X}
# Model drawn as the red line (right panel)
worse_function <- function(X){-7 + 49*X}
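To get a feel for what each model does, here is a minimal sketch that feeds a single flipper length into both functions. The value 0.194 is the flipper length (in meters) of the first penguin in the table above; the variable names better_prediction and worse_prediction are ours, chosen only for illustration.

```r
# The two candidate models from the text
better_function <- function(X) { -6 + 51 * X }
worse_function <- function(X) { -7 + 49 * X }

# Flipper length (in meters) of the first penguin shown above
flipper_length_m <- 0.194

# Each model turns the same flipper length into its own predicted body mass (kg)
better_prediction <- better_function(flipper_length_m)
worse_prediction <- worse_function(flipper_length_m)

print(c(better = better_prediction, worse = worse_prediction))
```

The same penguin gets a different predicted body mass from each model, which is why we need a way to decide which prediction is closer to the truth.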
We can use mutate() to create a new column called prediction, which will hold the predicted body mass for each penguin in the select_penguins data frame based on the better_function model. (We’ve also used the pipe, %>%, to pass the result to the head() function, which prints only the first 6 rows of the mutated data frame.)
mutate(select_penguins, prediction = better_function(flipper_length_m)) %>%
head()
flipper_length_m body_mass_kg prediction
1 0.194 4.200 3.894
2 0.217 4.375 5.067
3 0.185 3.950 3.435
4 0.218 5.700 5.118
5 0.210 4.000 4.710
6 0.192 3.000 3.792
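The same idea extends to comparing both models side by side. The sketch below is an assumption-laden illustration, not the textbook's own code: it rebuilds the six rows printed above as a small data frame called first_six (a name we made up) and uses base R column assignment rather than mutate(), so that it runs without loading any packages.

```r
# The two candidate models from the text
better_function <- function(X) { -6 + 51 * X }
worse_function <- function(X) { -7 + 49 * X }

# The six penguins printed above, re-entered by hand (first_six is a hypothetical name)
first_six <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

# Add one prediction column per model (base R equivalent of two mutate() calls)
first_six$better_prediction <- better_function(first_six$flipper_length_m)
first_six$worse_prediction <- worse_function(first_six$flipper_length_m)

print(first_six)
```

With both prediction columns next to body_mass_kg, you can scan each row and see which model's prediction lands closer to the penguin's actual body mass.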
In the figure below we have plotted the actual body masses of the first 6 penguins in the data frame in black, and the predictions made by the better_function for the same 6 penguins in blue. Notice that the black and blue dots come in vertically aligned pairs. Each pair of dots represents a single penguin: its actual body mass in black, and its predicted body mass in blue.
If the actual body mass is higher than predicted, the black dot appears above the prediction; if the actual body mass is lower, the black dot appears below it. The actual body masses are sometimes higher and sometimes lower than the better_function predictions.
The scatter plot also shows us that sometimes the actual body mass is closer to the prediction (as in the case of penguin #1) and sometimes farther (see penguin #6 below).
These vertical differences between the actual body masses and the predicted body masses can be thought of as error from the model. Notice that each penguin’s prediction has an associated error, which is technically called a residual. In the next sections, we’ll learn how to calculate the residuals and use them to compare which models are better and which are worse.
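As a preview, here is a rough sketch of those vertical differences for the six penguins shown earlier. It assumes the common convention of residual = actual minus predicted, which this section has not formally defined yet; the data frame name first_six and the column name residual are ours, for illustration only.

```r
# The better model from the text
better_function <- function(X) { -6 + 51 * X }

# The six penguins printed earlier, re-entered by hand (first_six is a hypothetical name)
first_six <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

first_six$prediction <- better_function(first_six$flipper_length_m)

# Assumed convention: positive when the actual mass is above the prediction,
# negative when it is below
first_six$residual <- first_six$body_mass_kg - first_six$prediction

print(first_six)

# Which penguin sits closest to, and farthest from, its prediction?
closest <- which.min(abs(first_six$residual))
farthest <- which.max(abs(first_six$residual))
print(c(closest = closest, farthest = farthest))
```

Under this convention, a positive residual corresponds to a black dot above its blue dot, and a negative residual to a black dot below it.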