High School / Algebra + Data Science (G)

3.7 The Beauty of Sum of Squared Error (SSE)

We are pretty sure that our best model will run through the middle of the data, balancing the residuals so that they sum to 0. Any model that balances the residuals this way will pass through the point of means. But how do we know which of all the models that pass through the point of means is the best one?

Summing Up Error: Sum of Squares

Statisticians have explored other measures of total error, such as the sum of the absolute values of the residuals or the average absolute residual. But these measures share a problem with the simple sum of residuals: they are not uniquely minimized by a single function or model. Finally, they had a breakthrough: a measure of error that identifies a single best model, the one that reduces error to the lowest possible level. This breakthrough was the sum of squared error (SSE), or simply, Sum of Squares.

Sum of squares, like other measures of error, starts with the residuals. But instead of adding up the residuals, we square each residual first, then add them up. Instead of telling R to sum(residual), we simply tell it to sum(residual^2). (Note that in R, we create an exponent using the caret symbol (^), usually found above the 6 on a standard keyboard. ^2 means squared.)
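To make the squaring step concrete, here is a tiny sketch with made-up numbers (not the penguins data): five points and a hypothetical model that predicts y as 2 times x. Notice how the residuals cancel out when summed, but the squared residuals do not.

```r
# Made-up data for illustration (not the penguins data)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 5, 5, 9, 10)

# A hypothetical model: predict y as 2 times x
toy_model <- function(x) 2 * x

# Residuals: observed minus predicted
residual <- y - toy_model(x)

sum(residual)    # the residuals cancel out: 0
sum(residual^2)  # squaring makes every error count: 4
```

Even though this model misses four of the five points, the plain sum of residuals is 0; the sum of squares is what reveals the error.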

Let’s take a look at what we are doing here. Below we show the same 6 data points from penguins that we have been looking at, this time with their residuals squared. With sum of squares, the total error around the model can be thought of as the sum of the areas of the squares formed by the residuals from the model.

Residuals are drawn as blue vertical lines from the data point to the model, and those residuals have been scaled into squares.

In the code window below we’ve put some code to create the residuals from the model we created earlier called our_balanced_function. Add some code to calculate the sum of squares from this model.

require(coursekata)

# assume Y and X have been created for you
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

# this creates our_balanced_function
our_balanced_function <- function(X){-6.0925 + 51.25*X}

# this creates the residuals
residual <- Y - our_balanced_function(X)

# write code to sum the squared residuals
sum(residual^2)
51.290409375

This is the SSE for the model saved as our_balanced_function. Models with a smaller SSE are better than models with a larger SSE.

Below we show two other models, both of which pass through the point of means. Even though both models have residuals that sum to 0, the one on the left looks like a better model. Sure enough, it also has a smaller SSE (51.63 versus 165.97).

Left graph: Sum of residuals = 0; Sum of squared error = 51.63
Right graph: Sum of residuals = 0; Sum of squared error = 165.97

On the left, a scatter plot of body_mass_kg predicted by flipper_length_m. A teal line of best fit is plotted on the graph and runs through the center of the data points. A couple of data points are highlighted and their residuals as vertical lines from the teal line are made into squares.

On the right, a scatter plot of body_mass_kg predicted by flipper_length_m. A red line is plotted on the graph and runs through the point of means, but not through the center of the data points. A couple of data points are highlighted and their residuals as vertical lines from the red line are made into squares. The squares are larger, on average, than those in the graph on the left.
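The comparison in the figures can be sketched in code. Here is a hypothetical example (with made-up data, not the penguins) of two candidate lines that both pass through the point of means: the residuals from each line sum to 0, but their SSEs differ, and the smaller SSE marks the better-fitting line.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 6, 9)

x_bar <- mean(x)  # the mean of x
y_bar <- mean(y)  # the mean of y

# Two candidate lines, both passing through the point of means
# (x_bar, y_bar), but with different slopes
line_a <- function(x) y_bar + 1.7 * (x - x_bar)
line_b <- function(x) y_bar + 0.5 * (x - x_bar)

# SSE: sum of the squared residuals from a model
sse <- function(model) sum((y - model(x))^2)

sum(y - line_a(x))  # residuals sum to 0
sum(y - line_b(x))  # residuals sum to 0 here too
sse(line_a)         # smaller SSE -- the better model
sse(line_b)         # larger SSE
```

Both lines balance their residuals perfectly, so the sum of residuals cannot distinguish them; only the SSE can.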
