3.7 The Beauty of Sum of Squared Error (SSE)
We are pretty sure that our best model will run through the middle of the data, balancing the residuals perfectly: the residuals will sum to 0, which means the line will pass through the point of means. But how do we know which of all the models that pass through the point of means will be the best model?
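To see why this question is hard, here is a minimal sketch with a tiny made-up data set (not the penguins data): no matter which slope we pick, forcing the line through the point of means makes the residuals sum to 0.

# a tiny made-up data set, just for illustration
X <- c(1, 2, 3, 4, 5, 6)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)

# pick any slope; this intercept forces the line through the
# point of means, the point (mean(X), mean(Y))
slope <- 3
intercept <- mean(Y) - slope * mean(X)

# the residuals balance out: the sum is 0 (up to rounding error)
sum(Y - (intercept + slope * X))

Try changing the slope: the sum stays at 0 every time, which is exactly why the sum of residuals cannot tell us which of these models is best.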
Summing Up Error: Sum of Squares
Statisticians have explored various measures of total error, such as the sum of the absolute values of the residuals or the average absolute residual, but these approaches share the same problem as the sum of residuals: they are not uniquely minimized by a single function or model. Finally, there was a breakthrough: a measure of error that identifies a single best model, the one that reduces error to the lowest possible level. This breakthrough was the sum of squared error (SSE), or simply, sum of squares.
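To make these measures concrete, here is a minimal sketch in base R. The residuals are hypothetical numbers we made up; notice that they sum to 0, as they would for any balanced model, while the other measures give usable totals.

# hypothetical residuals from some balanced model
residual <- c(-1.2, 0.5, 0.9, -0.4, 0.2)

sum(residual)        # 0 for every balanced model, so it can't pick a winner
sum(abs(residual))   # 3.2, the sum of absolute residuals
mean(abs(residual))  # 0.64, the average absolute residual
sum(residual^2)      # 2.7, the sum of squared error (SSE)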
Sum of squares, like other measures of error, starts with the residuals. But instead of adding up the residuals, we square each residual first, then add them up. Instead of telling R to sum(residual), we simply tell it to sum(residual^2). (Note that in R, we create an exponent using the caret symbol (^), usually found above the 6 on a standard keyboard; ^2 means squared.)
Let’s take a look at what we are doing here. Below we show the same 6 data points from penguins that we have been looking at, this time with their residuals squared. With sum of squares, the total error around the model can be thought of as the sum of the areas of all the squares formed by the residuals from the model.
In the code window below we’ve put some code to create the residuals from the model we created earlier, called our_balanced_function. Add some code to calculate the sum of squares from this model.
require(coursekata)

# Y and X come from the penguins data
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

# this creates our_balanced_function
our_balanced_function <- function(X) {-6.0925 + 51.25*X}

# this creates the residuals
residual <- Y - our_balanced_function(X)

# write code to sum the squared residuals
sum(residual^2)
51.290409375
This is the SSE for the model saved as our_balanced_function. Models with a smaller SSE are better than models with a larger SSE.
Below we show two other models, both of which pass through the point of means. Even though both models have residuals that sum to 0, the one on the left looks like a better model. Sure enough, it also has a smaller SSE (51.63 versus 165.97).
[Figure: two scatterplots, each showing a model that passes through the point of means. Left: sum of residuals = 0, sum of squared error = 51.63. Right: sum of residuals = 0, sum of squared error = 165.97.]
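The payoff of SSE is that it can single out one best model. As a closing sketch, assuming the Y and X variables from the code window above (and that they contain no missing values), we can try many slopes, force each line through the point of means, and let SSE pick the winner; try_slopes and sse_for_slope are just names we made up for this illustration.

# candidate slopes around our_balanced_function's slope of 51.25
try_slopes <- seq(30, 70, by = 0.25)

# SSE for the line with a given slope that passes through the point of means
sse_for_slope <- function(slope) {
  intercept <- mean(Y) - slope * mean(X)
  sum((Y - (intercept + slope * X))^2)
}

# compute the SSE for every candidate slope, then find the smallest
sse <- sapply(try_slopes, sse_for_slope)
try_slopes[which.min(sse)]   # the slope with the lowest SSE
min(sse)                     # the lowest SSE itself

Unlike the sum of residuals, which is 0 for every one of these lines, the SSE has a single lowest point, and the slope that produces it is our best model.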