3.6 Models with Perfectly Balanced Residuals
Lines that Pass Through Point of Means
To test out this idea that lines passing through the point of means will have residuals that sum to 0, we need to create a line (or function) that passes through this point. Here’s an example of a linear function (we will call it our_balanced_function) that passes through the point of means:
\[Y=\underbrace{b_0}_{-6.0925} + \underbrace{b_1}_{51.25}X + error\]
We can verify that it passes through the point of means with an arithmetic calculation: if we plug the mean of \(X\) (0.20096697) into this equation, we should get the mean of \(Y\) (which is around 4.207). We can also verify that this line passes through the point of means by adding the function to our scatter plot.
Teacher Note: We recommend doing the calculation using R instead of a calculator because R won’t round mean(X) as much as human students will. Rounding the mean of \(X\) more than what we have shown below will produce rounding errors: the calculated \(Y\) will be a little different from the rounded mean of \(Y\) (4.207).
R code: -6.0925 + 51.25*mean(X)
Calculator: \(-6.0925 + 51.25 * \underbrace{0.20096697}_\text{mean of X, rounded}\)
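For reference, here is the calculator version worked through with the rounded mean of \(X\) (0.20096697). With a more heavily rounded mean, the last digits would differ.

\[-6.0925 + 51.25 \times 0.20096697 = -6.0925 + 10.2995572125 = 4.2070572125 \approx 4.207\]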
require(coursekata)
# defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# fill in the appropriate numbers for b0 and b1
our_balanced_function <- function(X){b0 + b1*X}
# this will put our_balanced_function onto our graph
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
gf_point(mean(Y) ~ mean(X), color = "red") %>%
gf_function(our_balanced_function, color = "blue3")
# Solution: defines our Y and X
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# b0 = -6.0925 and b1 = 51.25, taken from the equation above
our_balanced_function <- function(X){-6.0925 + 51.25*X}
# this will put our_balanced_function onto our graph
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_point(mean(Y) ~ mean(X), color = "red") %>%
  gf_function(our_balanced_function, color = "blue3")
Now write some code to verify that the residuals from this function indeed add up to a number that is very close to 0.
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
our_balanced_function <- function(X){-6.0925 + 51.25*X}
# Y, X, and our_balanced_function have been defined for you
# calculate the residuals
residual <-
# calculate the sum of the residuals
# Y, X, and our_balanced_function have been defined for you
# calculate the residuals
residual <- Y - our_balanced_function(X)
# calculate the sum of the residuals
sum(residual)
7.68274333040608e-14
Well, that’s good news, right? We’ve found a function that perfectly balances the residuals! But before you get too excited, we have to tell you this: there are a lot of lines that pass through the point of means (an infinite number, in fact), and they all produce residuals that add up to 0.
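One way to see why every such line balances its residuals is a short algebra sketch. The notation here is our own assumption and is not used elsewhere in this section: \(n\) is the number of data points, and \(Y_i\) and \(X_i\) are the individual values.

\[\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1X_i)\right) = n \cdot mean(Y) - n \cdot b_0 - n \cdot b_1mean(X) = n\left(mean(Y) - (b_0 + b_1mean(X))\right)\]

If the line passes through the point of means, then \(b_0 + b_1mean(X) = mean(Y)\), so the last parentheses equal 0. The residuals therefore sum to 0 whatever slope the line has.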
We’ve plotted just 15 of these in the figure below. You can see that even though their residuals add up to 0, most of them are clearly terrible models for predicting body mass from flipper length. Balancing the residuals may be one criterion for how well a model fits, but it can’t be the only criterion. Which leads us to: Sum of Squares.
Teacher Note: If students are curious about how we found all these lines, you can tell them we used algebra and coding!
The key is that in the formula \(Y = b_0 + b_1X\), if we know 3 of these values, we can figure out the last one. Since these lines have to pass through the point of means, we know a \(Y\) and \(X\) value we could use (\(mean(Y)\) and \(mean(X)\), respectively). If we put in some number (any number!) for \(b_1\) we can solve for \(b_0\):
\[mean(Y) = b_0 + b_1mean(X)\]
\[b_0 = mean(Y) - b_1mean(X)\]
In code, we just created two variables, b0 and b1. We could set b1 to whatever number we want (e.g., 5 in the example below) and then solve for b0:
b1 <- 5
b0 <- mean(Y) - b1*mean(X)
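As a rough sketch of how this recipe can produce many lines at once, the code below tries several slopes and checks the sum of residuals for each. The data values are made up for illustration (they are not the penguins data), and intercept_through_means is a hypothetical helper name, not a coursekata function.

```r
# Made-up data for illustration only (not the penguins data)
X <- c(0.18, 0.19, 0.20, 0.21, 0.22)
Y <- c(3.4, 3.9, 4.1, 4.6, 5.0)

# Hypothetical helper: for a given slope b1, return the intercept b0
# that makes the line pass through the point of means
intercept_through_means <- function(b1) {
  mean(Y) - b1 * mean(X)
}

# Try several slopes, including some that are clearly poor fits
slopes <- c(-100, 0, 5, 40, 200)
intercepts <- intercept_through_means(slopes)

# Sum of residuals for each (b0, b1) pair
residual_sums <- mapply(
  function(b0, b1) sum(Y - (b0 + b1 * X)),
  intercepts, slopes
)

# Each sum should be 0 up to floating-point rounding
print(residual_sums)
```

Even the lines with slopes of -100 or 200, which would be poor models for this data, should give residual sums that are essentially 0.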
You could also use the same ideas to set the value of \(b_0\) and solve for \(b_1\) (i.e., \(b_1 = \frac{mean(Y) - b_0}{mean(X)}\)).
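Here is a minimal sketch of that second approach, again using made-up data rather than the penguins data. The intercept of -6 is an arbitrary choice, and line_through_means is a hypothetical name. Note that this version only works when mean(X) is not 0.

```r
# Made-up data for illustration only (not the penguins data)
X <- c(0.18, 0.19, 0.20, 0.21, 0.22)
Y <- c(3.4, 3.9, 4.1, 4.6, 5.0)

# Pick any intercept, then solve for the slope
# (requires mean(X) to be non-zero)
b0 <- -6
b1 <- (mean(Y) - b0) / mean(X)

line_through_means <- function(X) { b0 + b1 * X }

# The prediction at mean(X) should match mean(Y)
line_through_means(mean(X))
mean(Y)

# The residuals should sum to 0 up to floating-point rounding
sum(Y - line_through_means(X))
```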