High School / Algebra + Data Science (G)

3.6 Models with Perfectly Balanced Residuals

Lines That Pass Through the Point of Means

To test the idea that lines passing through the point of means have residuals that sum to 0, we need to create a line (or function) that passes through this point. Here’s an example of a linear function (we will call it our_balanced_function) that passes through the point of means:

\[Y=\underbrace{b_0}_{-6.0925} + \underbrace{b_1}_{51.25}X + error\]

We can verify that it passes through the point of means with an arithmetic calculation: if we plug the mean of \(X\) (0.20096697) into this equation, we should get the mean of \(Y\) (which is around 4.207). We can also verify that this line passes through the point of means by adding the function onto our scatter plot.

Teacher Note: We recommend doing the calculation in R instead of on a calculator because R won’t round mean(X) as much as students will. Rounding the mean of \(X\) more than what we have shown below will produce rounding errors; that is, the calculated \(Y\) will be a little different from the rounded mean of \(Y\) (4.207).

R code: -6.0925 + 51.25*mean(X)

Calculator: \(-6.0925 + 51.25 * \underbrace{0.20096697}_\text{mean of X, rounded}\)
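The calculator version of this check can also be sketched in R. This small example uses only the rounded mean quoted above, so it runs without the penguins data; the names x_bar and y_at_mean are made up for this illustration.

```r
# rounded mean of X (flipper length in meters), as quoted in the text
x_bar <- 0.20096697

# plug the mean of X into the function
y_at_mean <- -6.0925 + 51.25 * x_bar

# rounded to three decimal places, this matches the mean of Y
round(y_at_mean, 3)  # 4.207
```

Using mean(X) directly, as in the R code above, avoids even this small amount of rounding.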

require(coursekata)

# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m

# b0 = -6.0925 and b1 = 51.25 (remember not to round the intercept and slope)
our_balanced_function <- function(X){-6.0925 + 51.25*X}

# this will put our_balanced_function onto our graph
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_point(mean(Y) ~ mean(X), color = "red") %>%
  gf_function(our_balanced_function, color = "blue3")

A scatter plot of body_mass_kg predicted by flipper_length_m. A single data point highlighted in red is plotted where the mean of body mass and the mean of flipper length intersect, and a blue line is plotted on the graph and runs roughly through the center of the data points, and runs through the highlighted red dot.

Now write some code to verify that the residuals from this function indeed add up to a number that is very close to 0.

require(coursekata)

# Y, X, and our_balanced_function have been defined for you
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
our_balanced_function <- function(X){-6.0925 + 51.25*X}

# calculate the residuals
residual <- Y - our_balanced_function(X)

# calculate the sum of the residuals
sum(residual)
7.68274333040608e-14

Well, that’s good news, right? We’ve found a function that perfectly balances the residuals! But before you get too excited, we have to tell you this: there are a lot of lines that pass through the point of means (an infinite number, in fact), and they all produce residuals that add up to 0.
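A short derivation shows why this has to be true. It is a sketch that uses notation not found in the text above: \(n\) is the number of penguins and \(i\) labels each individual penguin. A line passes through the point of means when \(mean(Y) = b_0 + b_1mean(X)\). For any such line, the residuals add up to:

\[\sum_{i=1}^{n}(Y_i - b_0 - b_1X_i) = n \cdot mean(Y) - n \cdot b_0 - b_1 \cdot n \cdot mean(X) = n\,(mean(Y) - b_0 - b_1mean(X)) = 0\]

The last step works for any slope \(b_1\), as long as the line passes through the point of means. R reports a tiny number such as 7.68e-14 instead of exactly 0 because computers store decimal numbers with a very small amount of rounding.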

We’ve plotted just 15 of these in the figure below. You can see that even though their residuals add up to 0, most of them are clearly terrible models for predicting body mass from flipper length. Balancing the residuals may be one criterion for how well a model fits, but it can’t be the only criterion. Which leads us to: Sum of Squares.

A scatter plot of body_mass_kg predicted by flipper_length_m. A single data point highlighted in red is plotted where the mean of body mass and the mean of flipper length intersect. Several lines are plotted on the graph running in multiple directions. All of the lines run through the red dot.
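To see this with numbers, here is a minimal sketch using a small made-up data set (not the penguins data). The helper residuals_for_slope and the list of slopes are invented for this illustration. Every line passes through the point of means, so every sum of residuals comes out (nearly) 0, but the sum of the squared residuals, previewing the Sum of Squares idea named above, is very different from line to line.

```r
# made-up toy data (not the penguins data)
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.0)

# candidate slopes, chosen arbitrarily for illustration
slopes <- c(-3, 0, 2, 10)

# hypothetical helper: residuals from the line with slope b1
# that passes through the point of means
residuals_for_slope <- function(b1) {
  b0 <- mean(Y) - b1 * mean(X)
  Y - (b0 + b1 * X)
}

# one sum of residuals and one sum of squared residuals per slope
sum_of_residuals <- sapply(slopes, function(b1) sum(residuals_for_slope(b1)))
sum_of_squares <- sapply(slopes, function(b1) sum(residuals_for_slope(b1)^2))

# show the results side by side
data.frame(slope = slopes,
           sum_of_residuals = round(sum_of_residuals, 10),
           sum_of_squares = round(sum_of_squares, 2))
```

In this toy data, the slope of 2 follows the points closely and has by far the smallest sum of squared residuals, even though all four lines balance their residuals equally well.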

Teacher Note: If students are curious about how we found all these lines, you can tell them we used algebra and coding!

The key is that in the formula \(Y = b_0 + b_1X\), if we know 3 of these values, we can figure out the last one. Since these lines have to pass through the point of means, we know a \(Y\) and \(X\) value we could use (\(mean(Y)\) and \(mean(X)\), respectively). If we put in some number (any number!) for \(b_1\) we can solve for \(b_0\):

\[mean(Y) = b_0 + b_1mean(X)\]

\[b_0 = mean(Y) - b_1mean(X)\]

In code, we just created two variables, b0 and b1. We could set b1 to whatever number we want (e.g., 5 in the example below) and then calculate b0:

b1 <- 5
b0 <- mean(Y) - b1*mean(X)

You could also use the same ideas to set the value of \(b_0\) and solve for \(b_1\) (i.e., \(b_1 = \frac{mean(Y) - b_0}{mean(X)}\)).
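Both approaches can be tried out in a short sketch. It assumes a small made-up data set rather than the penguins data, and the names b0_chosen and b1_solved are invented for this illustration. Note that solving for the slope divides by mean(X), so it only works when mean(X) is not 0.

```r
# made-up toy data (not the penguins data)
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.0)

# option 1: pick any slope, then solve for the intercept
b1 <- 5
b0 <- mean(Y) - b1 * mean(X)

# option 2: pick any intercept, then solve for the slope
# (requires mean(X) to be different from 0)
b0_chosen <- 1
b1_solved <- (mean(Y) - b0_chosen) / mean(X)

# both lines should predict mean(Y) when given mean(X),
# so both differences should be 0 or extremely close to 0
(b0 + b1 * mean(X)) - mean(Y)
(b0_chosen + b1_solved * mean(X)) - mean(Y)
```

Changing b1 or b0_chosen to other numbers produces more lines through the point of means, which is one way a figure like the one above could be built.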
