3.2 Better Models Have Less Error

DATA = MODEL + ERROR

We started out representing hypotheses as word equations:

Body Mass = Flipper Length + Other Stuff

We then learned to quantify our hypothesis as a statistical model: an algebraic function that uses flipper length to predict body mass:

\[\text{body_mass_kg}=b_0+b_1\text{flipper_length_m}\]

This function generates a straight line, and can predict where a penguin’s body mass would fall on that line for any value of flipper length. However, it doesn’t account for the fact that most penguins’ actual body masses are not exactly where our function predicts them to be, i.e., they don’t fall right on the line.

Now that we’ve learned about residuals (or error), we can introduce a new and powerful concept:

\[\underbrace{\text{DATA}}_\text{Actual Body Mass}=\underbrace{\text{MODEL}}_\text{Predicted Body Mass} + \underbrace{\text{ ERROR}}_\text{Residual}\]

If we continue on with the example of flipper length predicting body mass, DATA refers to the penguin’s actual body mass, MODEL to their predicted body mass based on our model, and ERROR to the residual, or gap between the two.

To represent actual data (as opposed to just the model predictions) we will need to complicate our General Linear Model (GLM) notation just a little to allow for variation across different penguins. Whereas before we represented our model predictions as the straight line \(Y=b_0+b_1X\), we now are going to add subscripts to represent each individual penguin, like this:

\[\underbrace{Y_i}_\text{DATA}=\underbrace{b_0+b_1X_i}_\text{MODEL} + \underbrace{e_i}_\text{ERROR}\]

The model part of the equation is just as it was before; it is still a function that produces a straight line of model predictions. But now we’ve added the subscript \(i\) to indicate each individual penguin. \(Y_i\) represents the actual body mass of a particular penguin: the one in row \(i\) of the data frame. \(Y_1\), for example, would be the value of body_mass_kg for the penguin in the first row of the data frame. (Think of \(i\) as cycling through each row, from row 1 all the way to the last row of penguins in the data frame, row 333.)

\(b_0+b_1X_i\) represents the predicted body mass for the penguin in row \(i\) based on their particular flipper length, represented as \(X_i\). And \(e_i\) represents the difference between the actual body mass (\(Y_i\)) and the predicted body mass.

Let’s look again at the first 6 rows of the select_penguins data frame. \(Y_3\) represents the body mass of the penguin in row 3, which is 3.950 kg.

  flipper_length_m body_mass_kg prediction
1            0.194        4.200      3.894
2            0.217        4.375      5.067
3            0.185        3.950      3.435
4            0.218        5.700      5.118
5            0.210        4.000      4.710
6            0.192        3.000      3.792
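
As a rough illustration of the subscript notation, the sketch below rebuilds just the six rows shown above in a small data frame and uses base R indexing to pull out \(X_3\) and \(Y_3\). The name first_six is hypothetical and the values are typed in by hand from the table; the real select_penguins data frame has 333 rows.

```r
# Hypothetical stand-in for the first six rows of select_penguins,
# typed in by hand from the table above
first_six <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg     = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

i <- 3                                 # the row we want to look at
X_i <- first_six$flipper_length_m[i]   # X_3: flipper length in row 3
Y_i <- first_six$body_mass_kg[i]       # Y_3: actual body mass in row 3

X_i
Y_i
```

Changing i to any other row number from 1 to 6 picks out that penguin’s flipper length and body mass instead.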

Calculating Residuals

To calculate the residual for each data point, we simply rearrange the equation a bit, like this:

\[\underbrace{\text{ERROR}}_\text{Residual}=\underbrace{\text{DATA}}_\text{Actual Body Mass} - \underbrace{\text{ MODEL}}_\text{Predicted Body Mass}\]

In other words, the residual is calculated by subtracting the model prediction (our_function(flipper_length_m) or \(b_0+b_1X_i\)) from the penguin’s actual body mass (body_mass_kg or \(Y_i\)).
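
As a worked example, take the penguin in row 3 of the table above. This assumes the model is the better_function used in this section, with \(b_0=-6\) and \(b_1=51\):

\[e_3 = Y_3 - (b_0 + b_1X_3) = 3.950 - (-6 + 51 \times 0.185) = 3.950 - 3.435 = 0.515\]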

Let’s use R to add a column of residuals for each penguin right next to the column for predictions in the select_penguins data frame. We can start with the code we used to create the column called prediction, then just add to the mutate() code like this:

mutate(select_penguins, 
       prediction = better_function(flipper_length_m),
       residual = body_mass_kg - prediction) %>%
head()

Note that we also piped (%>%) on the head() command so that we only print out the first 6 rows of the mutated data frame.

In the code block below, add the code that would add a variable called residual to the select_penguins data frame.

require(coursekata)

# selects a few variables
select_penguins <- select(penguins, flipper_length_m, body_mass_kg)

# creates the better function
better_function <- function(X){-6 + 51*X}

# add a column called residual
mutate(select_penguins,
       prediction = better_function(flipper_length_m),
       residual = body_mass_kg - prediction) %>%
head()

  flipper_length_m body_mass_kg prediction residual
1            0.194        4.200      3.894    0.306
2            0.217        4.375      5.067   -0.692
3            0.185        3.950      3.435    0.515
4            0.218        5.700      5.118    0.582
5            0.210        4.000      4.710   -0.710
6            0.192        3.000      3.792   -0.792

Note that you can see in this new table that DATA does indeed equal MODEL + ERROR. Take row 3 as an example. Penguin #3’s actual body mass (3.950) can be exactly obtained by adding together their value in the prediction column (3.435) and their value in the residual column (0.515).
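
The same check can be run for all six rows at once. The sketch below is one way to do it in base R, assuming a hand-typed copy of the six rows (the hypothetical first_six) rather than the full select_penguins data frame. It uses all.equal() instead of == because decimal arithmetic on a computer can be off by a tiny rounding amount.

```r
# Hypothetical stand-in for the first six rows of select_penguins
first_six <- data.frame(
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192),
  body_mass_kg     = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

# the better model from this section
better_function <- function(X){-6 + 51*X}

# MODEL and ERROR for each row
first_six$prediction <- better_function(first_six$flipper_length_m)
first_six$residual   <- first_six$body_mass_kg - first_six$prediction

# MODEL + ERROR should rebuild DATA in every row
rebuilt <- first_six$prediction + first_six$residual
isTRUE(all.equal(rebuilt, first_six$body_mass_kg))
```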

In the figure below, we have added these residuals to the vertical lines depicting how far off the data and the prediction are.

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. A couple of data points are highlighted and have vertical lines connecting the points to the blue line to indicate the distance between the actual value and the predicted value. The calculated difference between the data points and the predictions are labeled near each residual. The points above the line have a positive value, and the points below the line have a negative value.

Returning to the Worse Model

Now that we know about predictions and residuals, let’s see how these concepts apply to a worse model, represented by the red line in the figure below.

worse_function <- function(X){-7 + 49*X}

A scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph but the line has a slope and intercept that does not run through the data points, and just runs through the empty space below the data points on the graph.

Go ahead and calculate the predictions and residuals from the worse_function model in the code window below.

require(coursekata)

# selects a few variables
select_penguins <- select(penguins, flipper_length_m, body_mass_kg)

# creates the worse function
worse_function <- function(X){-7 + 49*X}

# calculate the prediction and residual variables
mutate(select_penguins,
       prediction = worse_function(flipper_length_m),
       residual = body_mass_kg - prediction) %>%
head()

  flipper_length_m body_mass_kg prediction residual
1            0.194        4.200      2.506    1.694
2            0.217        4.375      3.633    0.742
3            0.185        3.950      2.065    1.885
4            0.218        5.700      3.682    2.018
5            0.210        4.000      3.290    0.710
6            0.192        3.000      2.408    0.592

In the figures below we compare the residuals for the better_function model with those for the worse_function model for the first six rows of the data frame just so you can see them side by side.

better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

On the left, a scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. A couple of data points are highlighted and have vertical lines connecting the points to the blue line to indicate the distance between the actual value and the predicted value. The calculated difference between the data points and the predictions are labeled near each residual.

On the right, a scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph, but its slope and intercept keep it from running through the data points; it runs through the empty space below them. The same highlighted data points as in the graph on the left have their residuals drawn as vertical lines from the points to the red line. All of the residuals are positive values above the line, and the residuals are longer than in the graph on the left.

The residuals for the worse_function model (on the right) are, indeed, all positive. In addition, the lengths of the residuals (i.e., the distances between the actual and predicted body masses) are in general longer for the worse_function model than for the better_function model.
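
To put rough numbers on this comparison, the sketch below counts the positive residuals and averages the residual lengths (ignoring sign) for each model. It uses only the six rows shown in the tables, typed in by hand. The average absolute residual is just one simple way to summarize length, chosen here for illustration; it is not necessarily the measure used later in the course.

```r
# First six rows of select_penguins, typed in by hand from the tables above
flipper_length_m <- c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
body_mass_kg     <- c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)

# the two models from this section
better_function <- function(X){-6 + 51*X}
worse_function  <- function(X){-7 + 49*X}

# residual = actual body mass - predicted body mass
better_residual <- body_mass_kg - better_function(flipper_length_m)
worse_residual  <- body_mass_kg - worse_function(flipper_length_m)

# how many residuals are positive (out of 6)?
sum(better_residual > 0)
sum(worse_residual > 0)

# average length of the residuals, ignoring sign
mean(abs(better_residual))
mean(abs(worse_residual))
```

For these six rows, three of the better_function residuals are positive and three are negative, while all six worse_function residuals are positive. The average length is roughly 0.60 kg for better_function versus roughly 1.27 kg for worse_function.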

Better models will have residuals that are both positive and negative, and in general, closer to 0. We will see later that analyzing residuals is an important tool for assessing how well a model fits the data. But before we go further, we are going to take a quick detour to the world of vectors.
