2.9 Quantifying Total Error Around a Model
In this final part of the chapter, we will dig deeper into the ERROR part of our DATA = MODEL + ERROR framework.
The goal of the statistical enterprise is to explain variation. Once we have created a statistical model of our data, we can then define what it means to explain variation in a more specific way, as reducing error around the model. When we add an explanatory variable to the model it will reduce error. But to know how much error it has reduced, we need to know how much error we had to start with.
We have already learned how to calculate a residual, which is the error for an individual data point. Now we will consider how to aggregate all these individual errors together to find out how much total error there is around the empty model.
To make this concrete, let’s consider the empty model for home prices in Ames. Recall that we saved our model in the object empty_model:
empty_model
Call:
lm(formula = PriceK ~ NULL, data = Ames)
Coefficients:
(Intercept)
181.4
Saving Predictions and Residuals
We will start by calculating the individual errors (residuals) between our model predictions and the actual home prices for each house in the dataset. To get the predictions from our empty model, we can use the predict() function, putting empty_model as the input inside the parentheses. Give it a try in the code window below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)
# generate predictions from this model
predict(empty_model)
Whoa – that’s a lot of 181.4s! What you see here is the prediction the empty model made for each of the 185 homes in our dataset. Our simple model gave the same prediction for every home: the mean price of about $181,428.
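If you want to convince yourself that this one repeated prediction is just the mean of PriceK, a quick check (assuming PriceK has no missing values in your copy of Ames) is to compute the mean directly:
# the empty model predicts the mean of the outcome variable for every home
mean(Ames$PriceK)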
Usually, these predictions are “off” from the actual sale price of the home. How “off” are they? To calculate all the residuals (error) from these predictions, we can use the resid() function:
resid(empty_model)
Try it in the code block below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)
# generate residuals from this model's predictions
resid(empty_model)
It’s kind of hard to look at the residuals like this. To see the DATA = MODEL + ERROR (or PriceK = Mean + Residual) relationship more clearly, run the code below, which will save the predictions and the residuals back into the Ames data frame as new variables. Modify the select() function so you can see the actual home prices, the predicted prices from the empty model, and the residuals from the model for the first six rows of the dataset.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)
# save the predictions and residuals from the empty model as new variables in Ames
Ames$empty_predict <- predict(empty_model)
Ames$empty_resid <- resid(empty_model)
# show the first 6 rows of PriceK along with the prediction and residual variables
head(select(Ames, PriceK, empty_predict, empty_resid))
PriceK empty_predict empty_resid
1 260 181.4281 78.57191
2 210 181.4281 28.57191
3 155 181.4281 -26.42809
4 125 181.4281 -56.42809
5 110 181.4281 -71.42809
6 100 181.4281 -81.42809
Notice that on each row, DATA = MODEL + ERROR: the home price (PriceK) is the sum of the model prediction and the home’s residual (or error) from that prediction. For example, for the first house, the sale price (260) is equal to the prediction plus the residual (181.43 + 78.57).
Notice, also, that the residuals for the first two homes (78.57 and 28.57) are positive. This is because the actual prices of these two houses were higher than the model’s prediction (about 181).
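If you want to check that this identity holds for every home, not just the first row, here is a quick sketch (assuming you have saved empty_predict and empty_resid as above):
# DATA = MODEL + ERROR: prediction + residual should reproduce PriceK for every home
# (the tiny tolerance just allows for floating-point rounding)
all(abs(Ames$PriceK - (Ames$empty_predict + Ames$empty_resid)) < 1e-10)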
The residuals for the first six homes in the dataset can also be depicted as vertical lines from the empty model prediction.
Total Error: Sum of Squared Residuals (SS)
We have now saved a residual for each home in the Ames data frame. How might we put these residuals together to get a measure of total error around the empty model?
One approach might be to just add all the residuals together. The problem with this approach, which we explained earlier, is that if you add together the residuals around the mean, the total will be 0 because the residuals are perfectly balanced around the mean.
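You can confirm this in R with a quick check using the empty_resid variable we saved; because of floating-point rounding, the result will be a number extremely close to, though perhaps not exactly, 0:
# the residuals balance out around the mean, so their sum is (essentially) zero
sum(Ames$empty_resid)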
One of the most common measures of total error around a model in statistics, and the one we will use in this book, is the sum of the squared residuals, or simply, Sum of Squares.
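Written out as a formula (a notational preview; you do not need it to follow along), if Y_i is the actual price of home i and Y-hat_i is the model’s prediction for that home (here, the mean for every home), then the sum of squares is:
$$SS = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$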
To calculate the sum of squares in R, let’s start by adding a new column to Ames that has the squared residual for each home. In R, we represent exponents with the caret symbol (^, usually above the 6 on a standard keyboard). So we can use this code to create the new column:
Ames$empty_resid_sqrd <- Ames$empty_resid^2
Run the code in the window below, adding some code to get the sum of the squared residuals (Sum of Squares).
require(coursekata)
# don't delete this part
empty_model <- lm(PriceK ~ NULL, data = Ames)
Ames$empty_resid <- resid(empty_model)
# this creates the squared residuals
Ames$empty_resid_sqrd <- Ames$empty_resid^2
# sum the squared residuals
sum(Ames$empty_resid_sqrd)
You should have gotten a number like this:
633717.215434616
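By the way, saving a column of squared residuals is not strictly necessary; if you just want the number, an equivalent one-line sketch is to square and sum the residuals directly:
# same sum of squares, computed without adding a new column to Ames
sum(resid(empty_model)^2)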
This is the total sum of squares for the empty model of PriceK. We call it sum of squares because we literally turned all those residual lines in the figure above into squares. Here we show the same 6 data points as above, this time with their residuals squared.
The sum of squares gives us a quantitative indicator of how much total error there is in the outcome variable, PriceK. When we add explanatory variables to our model in the next chapter, we will reduce the total sum of squares. The amount by which we reduce it will tell us how good our new model is.
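As a preview of that comparison (not part of this chapter’s exercises), here is a sketch. The explanatory variable name HomeSizeK is a hypothetical placeholder; substitute any explanatory variable that actually exists in your copy of the Ames data frame.
# fit the empty model and a model with one (hypothetical) explanatory variable
empty_model <- lm(PriceK ~ NULL, data = Ames)
size_model <- lm(PriceK ~ HomeSizeK, data = Ames)  # HomeSizeK is a placeholder name
# compare total error (sum of squared residuals) around each model
sum(resid(empty_model)^2)
sum(resid(size_model)^2)  # should be smaller if the explanatory variable explains some variation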
Is the SS better than other measures of total error? We’ll explore why statisticians use sum of squares in the next section.