Chapter 3 - Assessing How Well Models Fit the Data
3.1 Better Models Make Better Predictions
Thus far you have made some models that seem promising. In the next sections of the textbook, we will conduct an all-out search for the best model of these data. But before embarking on this quest, we need to agree on what makes a model better or worse. We can likely all agree that better models make more accurate predictions.
One strategy for comparing models is to make predictions with each model and then check which predictions are closest to the actual data. Our penguins data set makes this possible: we can use each model to predict the body mass of every penguin we have from its flipper length alone. Then we can check: how far were our predictions from the penguins’ actual body masses?
Let’s limit our data set to just the two variables we are thinking about. This will make it easier to check them against the predictions we generate later.
select_penguins <- select(penguins, flipper_length_m, body_mass_kg)
head(select_penguins)
flipper_length_m body_mass_kg
1 0.194 ███
2 0.217 ███
3 0.185 ███
4 0.218 ███
5 0.210 ███
6 0.192 ███
Instead of looking at their actual values on body_mass_kg (which we redacted for now with black boxes), we’re going to generate a predicted body mass for each penguin based on a model. Actually, let’s compare two different models, each of which would make its own predictions.
As you can see in the figure below, the model on the left (which we’ve called better_function and represented with a blue line) is clearly better than the model on the right (worse_function, represented with a red line).
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}
[Figure: left panel, better_function shown as a blue line; right panel, worse_function shown as a red line.]
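To see what each function does with a single input, here is a minimal sketch that evaluates both models for one hypothetical penguin. The flipper length of 0.2 m and the names example_flipper, better_prediction, and worse_prediction are assumptions chosen for illustration; they do not come from the data set.

```r
# The two models from the text: predicted body mass (kg)
# as a function of flipper length (m)
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# Hypothetical flipper length, chosen only for illustration
example_flipper <- 0.2

# Each model turns the same flipper length into its own prediction
better_prediction <- better_function(example_flipper)
worse_prediction <- worse_function(example_flipper)

print(better_prediction)
print(worse_prediction)
```

For this one flipper length, the two models give predictions that differ by more than a kilogram (about 4.2 kg versus about 2.8 kg), which is why it matters which model we choose.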
We can use mutate() to create a new column called prediction, which will hold the predicted body mass for each penguin in the select_penguins data frame based on the better_function model. (We’ve also piped the result into head() using %>%, so that only the first 6 rows of the mutated data frame are printed.)
mutate(select_penguins, prediction = better_function(flipper_length_m)) %>%
head()
flipper_length_m body_mass_kg prediction
1 0.194 4.200 3.894
2 0.217 4.375 5.067
3 0.185 3.950 3.435
4 0.218 5.700 5.118
5 0.210 4.000 4.710
6 0.192 3.000 3.792
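Since the goal is to compare two models, it may help to see both sets of predictions side by side. The sketch below rebuilds the six rows printed above as a small data frame and adds one prediction column per model. It uses base R assignment rather than mutate() only to stay self-contained; the names six_penguins, better_prediction, and worse_prediction are assumptions made for this illustration.

```r
# First six penguins, copied from the output printed above
six_penguins <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

# The two models from the text
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# Add one prediction column per model
# (a base R stand-in for the mutate() call used in the text)
six_penguins$better_prediction <- better_function(six_penguins$flipper_length_m)
six_penguins$worse_prediction <- worse_function(six_penguins$flipper_length_m)

print(six_penguins)
```

Reading across a row shows how differently the two models treat the same penguin, which is the comparison the rest of the chapter makes more precise.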
In the figure below we have plotted the actual body masses of the first 6 penguins in the data frame in black, and the predictions made by the better_function for the same 6 penguins in blue. Notice that the black and blue dots come in vertically aligned pairs. Each pair of dots represents a single penguin: its actual body mass in black, and its predicted body mass in blue.
If the actual body mass is higher than predicted, the black dot appears above the prediction; if the actual body mass is lower, the black dot appears below the prediction. Across these penguins, the actual body masses are sometimes higher and sometimes lower than the better_function predictions.
The scatter plot also shows us that sometimes the actual body mass is closer to the prediction (as in the case of penguin #1) and sometimes farther (see penguin #6 below).
These vertical differences between the actual body masses and the predicted body masses can be thought of as error from the model. Notice that each penguin’s prediction has an associated error, which is technically called a residual. In the next sections, we’ll learn how to calculate the residuals and use them to compare which models are better and which are worse.
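As a rough preview, the vertical differences for the six penguins shown earlier can be sketched with simple subtraction. This assumes the common convention of actual minus predicted, so a positive value means the penguin was heavier than predicted; the column name difference and the variables closest and farthest are hypothetical names for this illustration, and the next sections give the formal treatment.

```r
# Actual and predicted body masses (kg) for the first six penguins,
# copied from the better_function output printed earlier
six_penguins <- data.frame(
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000),
  prediction = c(3.894, 5.067, 3.435, 5.118, 4.710, 3.792)
)

# Vertical difference: actual minus predicted (assumed convention)
six_penguins$difference <- six_penguins$body_mass_kg - six_penguins$prediction

print(six_penguins)

# Row numbers of the penguins closest to and farthest from their predictions,
# judged by the size of the difference regardless of its sign
closest <- which.min(abs(six_penguins$difference))
farthest <- which.max(abs(six_penguins$difference))

print(closest)
print(farthest)
```

In this sketch the smallest gap belongs to penguin 1 and the largest to penguin 6, which agrees with the scatter plot description above.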