Chapter 3 - Assessing How Well Models Fit the Data

3.1 Better Models Make Better Predictions

So far you have made some models that seem promising. In the next sections of the textbook, we will conduct an all-out search for the best model of these data. But before embarking on this quest, we need to agree on what makes a model better or worse. It’s likely we can all agree that better models make more accurate predictions.

One strategy for comparing models is to make predictions with each model and then check which predictions are closest to the actual data. Our penguins data set makes this possible: we can use a model to predict the body mass of each penguin from its flipper length alone, and then check how far that prediction was from the penguin’s actual body mass.

Let’s limit our data set to just the two variables we are thinking about. This will make it easier to check them against the predictions we generate later.

select_penguins <- select(penguins, flipper_length_m, body_mass_kg)
head(select_penguins)
  flipper_length_m body_mass_kg
1            0.194        ███
2            0.217        ███
3            0.185        ███
4            0.218        ███
5            0.210        ███
6            0.192        ███

Instead of looking at their actual values on body_mass_kg (which we redacted for now with black boxes), we’re going to generate a predicted body mass for each penguin based on a model. Actually, let’s compare two different models, each of which would make its own predictions.

As you can see in the figure below, the model on the left (which we’ve called better_function and represented with a blue line) is clearly better than the model on the right (worse_function, represented with a red line).

better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}
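To see what these functions do before applying them to the whole data frame, here is a minimal sketch that evaluates each one for a single flipper length. The value 0.194 is the first penguin’s flipper length from the table above; the variable names better_prediction and worse_prediction are ours, chosen only for illustration.

```r
# The two candidate models from the text
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# Predicted body mass (kg) for a penguin with a 0.194 m flipper
better_prediction <- better_function(0.194)
worse_prediction <- worse_function(0.194)

better_prediction  # 3.894
worse_prediction   # 2.506
```

The two models give noticeably different predictions for the same penguin, which is why we need a way to decide which one is closer to the actual data.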

On the left, a scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points.

On the right, a scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph, but its slope and intercept place it in the empty space below the data points rather than through them.

We can use mutate() to create a new column called prediction, which will hold the predicted body mass for each penguin in the select_penguins data frame based on the better_function model. (We’ve also piped the result, using %>%, into the head() function, which prints only the first 6 rows of the mutated data frame.)

mutate(select_penguins, prediction = better_function(flipper_length_m)) %>%
  head()
  flipper_length_m body_mass_kg prediction
1            0.194        4.200      3.894
2            0.217        4.375      5.067
3            0.185        3.950      3.435
4            0.218        5.700      5.118
5            0.210        4.000      4.710
6            0.192        3.000      3.792
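The same approach can generate predictions from both models side by side. The sketch below is an illustration under a few assumptions: it rebuilds only the six rows printed above by hand (in a data frame we have called six_penguins) rather than using the full penguins data, and the column names better_prediction and worse_prediction are made up for this example.

```r
library(dplyr)

# The six rows printed above, typed in by hand (not the full data set)
six_penguins <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# Add one prediction column per model
six_penguins <- six_penguins %>%
  mutate(
    better_prediction = better_function(flipper_length_m),
    worse_prediction = worse_function(flipper_length_m)
  )

six_penguins
```

For these six penguins, every worse_function prediction falls below the actual body mass, which matches the red line sitting under the data points in the figure.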

In the figure below we have plotted the actual body masses of the first 6 penguins in the data frame in black, and the predictions made by the better_function for the same 6 penguins in blue. Notice that the black and blue dots come in vertically aligned pairs. Each pair of dots represents a single penguin: its actual body mass in black, and its predicted body mass in blue.

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. One data point is highlighted and labeled as Penguin one’s body mass equals 4.2. Along the blue line the point for that Penguin one’s prediction is highlighted as predicted body mass of 3.894.

If the actual body mass is higher than predicted, the black dot appears above the prediction; if the actual body mass is lower, the black dot appears below it. Overall, the body masses are sometimes higher and sometimes lower than the better_function predictions.

The scatter plot also shows us that sometimes the actual body mass is closer to the prediction (as in the case of penguin #1) and sometimes farther (see penguin #6 below).

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. A couple of data points are highlighted and have vertical lines connecting the points to the blue line to indicate the distance between the actual value and the predicted value. Some of the data points are closer to the prediction line, and some are farther from the prediction line.

These vertical differences between the actual body masses and the predicted body masses can be thought of as error from the model. Notice that each penguin’s prediction has an associated error, which is technically called a residual. In the next sections, we’ll learn how to calculate the residuals and use them to compare which models are better and which are worse.
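As a preview, the vertical differences for the six penguins shown earlier can be sketched as follows. This is only an illustration, not the method the later sections will develop: the data frame six_penguins is typed in by hand from the printed rows, the column name difference is our own, and we assume the difference is taken as actual minus predicted, so that a positive value means the black dot sits above the prediction.

```r
library(dplyr)

# The six rows printed earlier, typed in by hand (not the full data set)
six_penguins <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

better_function <- function(X){-6 + 51*X}

# Assumed sign convention: actual body mass minus predicted body mass
six_penguins <- six_penguins %>%
  mutate(
    prediction = better_function(flipper_length_m),
    difference = body_mass_kg - prediction
  )

six_penguins

# Which penguin is closest to its prediction, and which is farthest?
closest <- which.min(abs(six_penguins$difference))
farthest <- which.max(abs(six_penguins$difference))
closest   # 1
farthest  # 6
```

Among these six, penguin #1 has the smallest difference (0.306 kg) and penguin #6 the largest (0.792 kg below its prediction), in line with what the scatter plot shows.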
