3.4 Summing Residuals From a Model
Having learned a bit about vectors, let’s go back to our comparison of the better_function model with the worse_function model.
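better_function <- function(X){-6 + 51*X} | worse_function <- function(X){-7 + 49*X}
---|---
[Figure: better_function model with highlighted residuals] | [Figure: worse_function model with highlighted residuals]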
We observed before that the better model appears to have shorter residuals than the worse model, and that the residuals seem to hover more closely around (above and below) the line of predictions. Of course, we were only looking closely at the 6 residuals highlighted in the figure above.
Let’s see if we can test this idea by summing up all of the 333 residuals (one for each penguin) from each model.
Using Vectors to Calculate the Residuals
Let’s start by putting our outcome (Y) and predictor (X) variables into vectors:
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
Using the vectors X and Y, we can now calculate the predictions and residuals from the better_function model like this:
prediction <- better_function(X)
residual <- Y - prediction
Note that we have now saved two new vectors: prediction and residual. But we didn’t have to save a vector called prediction in order to calculate the residuals. We could have skipped this step and replaced prediction with the function call that calculates it: better_function(X). Try it in the code block below.
require(coursekata)
# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# defines better_function
better_function <- function(X){-6 + 51*X}
# edit the code below to replace prediction
residual <- Y - prediction
# prints out residual vector
residual
# one possible solution: replace prediction with better_function(X)
residual <- Y - better_function(X)
# prints out residual vector
residual
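To see that the two routes give the same residuals, here is a minimal sketch on four made-up data points. The values of X and Y below are hypothetical, chosen only for illustration, and are not taken from the penguins data; better_function is the same model as above.

```r
# hypothetical toy data (NOT the penguins data):
# flipper length in meters and body mass in kilograms
X <- c(0.18, 0.19, 0.20, 0.21)
Y <- c(3.2, 3.7, 4.3, 4.6)

# same model as in the text
better_function <- function(X){-6 + 51*X}

# route 1: save the predictions, then subtract them from Y
prediction <- better_function(X)
residual_two_step <- Y - prediction

# route 2: skip the prediction vector and subtract the function call directly
residual_one_step <- Y - better_function(X)

# R applies the function to every element of X, so we get one residual per data point
length(residual_one_step)

# TRUE if both routes produced the same residuals
all.equal(residual_two_step, residual_one_step)
```

Either route works. Saving prediction is handy if you want to look at the predictions themselves; skipping it keeps the code shorter.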
Using Vectors to Sum the Residuals
Now that we’ve found an easy way to calculate all the residuals from the better_function model, let’s try summing them up to get an idea of what the total error might be around this model. We will then compare the total error from the better_function model with that from the worse_function model to get a sense of which model has less total error.
To do this we will use the sum() function to sum up the residuals from each of the two models. Try getting both of these sums in the code block below.
require(coursekata)
# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# defines the functions (models)
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}
# assume Y, X, and the functions have been defined
# this code calculates the better and worse residuals
better_residual <- Y - better_function(X)
worse_residual <- Y - worse_function(X)
# write code to sum up each set of residuals
# one possible solution: pass each residual vector to sum()
sum(better_residual)
sum(worse_residual)
-14.072
452.772
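The same comparison can be tried by hand on a tiny data set, where every residual is easy to inspect. This is only a sketch: the four X and Y values are hypothetical (not the penguins data), so the sums will not match the ones above.

```r
# hypothetical toy data (NOT the penguins data)
X <- c(0.18, 0.19, 0.20, 0.21)
Y <- c(3.2, 3.7, 4.3, 4.6)

# same two models as in the text
better_function <- function(X){-6 + 51*X}
worse_function <- function(X){-7 + 49*X}

# one residual per data point for each model
better_residual <- Y - better_function(X)
worse_residual <- Y - worse_function(X)

# print the residuals so we can see their signs
better_residual
worse_residual

# add up each set of residuals
better_sum <- sum(better_residual)
worse_sum <- sum(worse_residual)
better_sum
worse_sum
```

With these made-up values the better_function residuals are a mix of small positive and negative numbers and sum to about 0.02, while the worse_function residuals are all positive and sum to about 5.58.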
In the figure below we show you the two models overlaid on the scatter plot of body mass by flipper length along with each model’s total residuals. We’ve also added a third model (on the right) and its total residuals.
Model: worse_function | Model: better_function | Model: some other function
---|---|---
Sum of residuals = 452.772 | Sum of residuals = -14.072 | Sum of residuals = -213.228
We previously surmised that the residuals from our better models tend to be both positive and negative, and the better the model, the closer the residuals are to 0. We have now confirmed this idea by summing up all the residuals in the data frame. Residuals from our better_function model add up to a total much closer to 0 than residuals from the worse_function model do.
When the body masses are sometimes higher and sometimes lower than the predictions, the sum of the residuals should be closer to 0 than when the predictions are always too high or too low. This suggests that in our quest for the best model, we should be looking for a model that perfectly balances the residuals.
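One way to see what "balancing" means is to add up the positive and negative residuals separately. The sketch below does this for four hypothetical data points (not the penguins data); the names positive_total and negative_total are introduced here only for illustration.

```r
# hypothetical toy data (NOT the penguins data)
X <- c(0.18, 0.19, 0.20, 0.21)
Y <- c(3.2, 3.7, 4.3, 4.6)

better_function <- function(X){-6 + 51*X}
residual <- Y - better_function(X)

# total of the residuals where the data point is above the prediction
positive_total <- sum(residual[residual > 0])

# total of the residuals where the data point is below the prediction
negative_total <- sum(residual[residual < 0])

positive_total
negative_total

# the two totals partly cancel, leaving the overall sum of residuals
positive_total + negative_total
sum(residual)
```

When the positive and negative totals are about the same size, they cancel and the sum of the residuals lands near 0. When the predictions are always too low or always too high, one of the totals is 0 and nothing cancels.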