3.4 Summing Residuals From a Model

Having learned a bit about vectors, let’s go back to our comparison of the better_function model with the worse_function model.

better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

On the left, a scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. A few data points are highlighted and have vertical lines connecting them to the blue line to show the distance between the actual value and the predicted value. The calculated difference between each data point and its prediction is labeled near each residual. The points above the line have positive residuals, and the points below the line have negative residuals.

On the right, a scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph, but its slope and intercept place it in the empty space below the data points rather than through them. The same highlighted data points as in the graph on the left have their residuals drawn as vertical lines from the points to the red line. All of these residuals are positive because the points lie above the line, and the residuals are longer than those in the graph on the left.

We observed before that the better model appears to have shorter residuals than the worse model, and that the residuals seem to hover more closely around (above and below) the line of predictions. Of course, we were only looking closely at the 6 residuals highlighted in the figure above.

Let’s test this idea by summing up all 333 residuals (one for each penguin) from each model.

Using Vectors to Calculate the Residuals

Let’s start by putting our outcome (Y) and predictor (X) variables into vectors:

Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

Using the vectors X and Y, we can now calculate the predictions and residuals from the better_function model like this:

prediction <- better_function(X)
residual <- Y - prediction
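To see what these two lines do element by element, here is a minimal sketch using three made-up penguins. The vectors toy_X and toy_Y are hypothetical values chosen for illustration, not rows from the penguins data frame. R applies the function to every element of the vector at once, then subtracts the two vectors position by position.

```r
# better_function as defined earlier in this section
better_function <- function(X){-6 + 51*X}

# hypothetical flipper lengths (m) and body masses (kg) for three penguins
toy_X <- c(0.18, 0.20, 0.22)
toy_Y <- c(3.5, 4.0, 5.5)

# one prediction per penguin: -6 + 51 times each flipper length
toy_prediction <- better_function(toy_X)

# one residual per penguin: actual body mass minus predicted body mass
toy_residual <- toy_Y - toy_prediction

# print both vectors
toy_prediction
toy_residual
```

With these made-up values, the predictions are 3.18, 4.20, and 5.22, so the residuals are 0.32, -0.20, and 0.28: two penguins are heavier than predicted and one is lighter.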

Note that we have now saved two new vectors: prediction and residual. But we didn’t have to save a vector called prediction in order to calculate the residuals. We could have skipped this step and just replaced the idea of prediction with the function that calculated the prediction: better_function(X). Try it in the code block below.

require(coursekata)

# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

# defines better_function
better_function <- function(X){-6 + 51*X}

# edit the code below to replace prediction
residual <- Y - prediction

# prints out residual vector
residual

Replacing prediction with better_function(X) gives:

residual <- Y - better_function(X)

Using Vectors to Sum the Residuals

Now that we’ve found an easy way to calculate all the residuals from the better_function model, let’s try summing them up to get an idea of what the total error might be around this model. We will then compare the total error from the better_function model with that from the worse_function model to get a sense of which model has less total error.

To do this we will use the sum() function to sum up the residuals from each of the two models. Try getting both of these sums in the code block below.

require(coursekata)

# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

# defines the functions (models)
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# this code calculates the better and worse residuals
better_residual <- Y - better_function(X)
worse_residual <- Y - worse_function(X)

# write code to sum up each set of residuals
sum(better_residual)
sum(worse_residual)
-14.072
452.772
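The cancelling effect behind these two sums can be seen on a smaller scale. The sketch below is only an illustration: toy_X and toy_Y are hypothetical values for three penguins, not the real data, so the sums differ from the ones above.

```r
# the two models from this section
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# hypothetical flipper lengths (m) and body masses (kg) for three penguins
toy_X <- c(0.18, 0.20, 0.22)
toy_Y <- c(3.5, 4.0, 5.5)

# residuals from each model for the same three penguins
toy_better_residual <- toy_Y - better_function(toy_X)
toy_worse_residual <- toy_Y - worse_function(toy_X)

# sum() adds up every element of a vector, keeping the signs
sum(toy_better_residual)
sum(toy_worse_residual)
```

Here the better_function residuals (0.32, -0.20, 0.28) partly cancel and sum to about 0.4, while the worse_function residuals (1.68, 1.20, 1.72) are all positive and sum to about 4.6.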

In the figure below we show the two models overlaid on the scatter plot of body mass by flipper length, along with each model’s sum of residuals. We’ve also added a third model (on the right) and its sum of residuals.

Model: worse_function, sum of residuals = 452.772
Model: better_function, sum of residuals = -14.072
Model: some other function, sum of residuals = -213.228

On the left, a scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph, but its slope and intercept place it in the empty space below the data points rather than through them.

In the middle, a scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points.

On the right, a scatter plot of body_mass_kg predicted by flipper_length_m. An orange line is plotted on the graph and runs through some of the data points near the top of the cluster of the data points, but does not run through the center of the data points.


We previously surmised that the residuals from our better models tend to be both positive and negative, and that the better the model, the closer the residuals are to 0. The sums are consistent with this idea: across all 333 penguins, the residuals from the better_function model add up to -14.072, much closer to 0 than the 452.772 from the worse_function model.

When the body masses are sometimes higher and sometimes lower than the predictions, the sum of the residuals should be closer to 0 than when the predictions are always too high or always too low. This suggests that in our quest for the best model, we should be looking for a model that perfectly balances the residuals, so that the positive and negative residuals cancel out.
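A small sketch can make the role of the signs concrete. The two residual vectors below are hypothetical and were chosen to have exactly the same sizes of misses; only the signs differ.

```r
# hypothetical residuals: same sizes, different signs
balanced_residual <- c(0.5, -0.3, 0.2, -0.4)
one_sided_residual <- c(0.5, 0.3, 0.2, 0.4)

# comparing a vector to 0 gives TRUE or FALSE for each element;
# sum() then counts how many are TRUE
sum(balanced_residual > 0)
sum(balanced_residual < 0)
sum(one_sided_residual > 0)
sum(one_sided_residual < 0)

# the sums of the residuals themselves
sum(balanced_residual)
sum(one_sided_residual)
```

The balanced vector has two positive and two negative residuals, and they cancel to a sum of essentially 0. The one-sided vector has four positive residuals and none negative, so nothing cancels and the sum is about 1.4.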
