3.8 The Best-Fitting Model
Whereas many different lines have residuals that sum to 0, the sum of squares is uniquely minimized for one of these lines. This is the model that we will call the best-fitting model of our data.
We might reduce error further by adding another predictor to the model, or by using a different function (something you can learn more about later). But once we decide to use a straight line as a model, there will be only one line, defined by its coefficients \(b_0\) and \(b_1\), where the sum of squares is minimized.
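To see this concretely, here is a small sketch you could run (the slopes 30 and 70 are just values we picked for illustration). Both lines pass through the point of means, so their residuals sum to roughly 0, but their sums of squares are quite different:
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# two hypothetical slopes, each paired with the b0 that keeps the
# line through the point of means
b0_a <- mean(Y) - 30*mean(X)
b0_b <- mean(Y) - 70*mean(X)
line_a <- function(X){b0_a + 30*X}
line_b <- function(X){b0_b + 70*X}
# residuals sum to approximately 0 for both lines
sum(Y - line_a(X))
sum(Y - line_b(X))
# but the sums of squares differ
sum((Y - line_a(X))^2)
sum((Y - line_b(X))^2)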
R has a special function that helps us identify the coefficients where the SSE is minimized. The function is called lm(), which stands for linear model. Here's how to find the best-fitting linear model of body mass using flipper length as the predictor:
lm(body_mass_kg ~ flipper_length_m, data = penguins)
Try using lm() in the code block below to find the intercept and slope (\(b_0\) and \(b_1\)) of the best-fitting line to model the relationship between flipper length and body mass.
require(coursekata)
# use lm() to find the best coefficients
lm(body_mass_kg ~ flipper_length_m, data = penguins)
Call:
lm(formula = body_mass_kg ~ flipper_length_m, data = penguins)

Coefficients:
     (Intercept)  flipper_length_m
          -5.872            50.153
The output of lm() shows two numbers: the intercept, \(b_0 = -5.872\), and the slope for flipper_length_m, \(b_1 = 50.153\).
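If you want to use these numbers in later code, one option is to save the fitted model and pull the coefficients out with R's coef() function (best_model is just a name we chose):
best_model <- lm(body_mass_kg ~ flipper_length_m, data = penguins)
# extract each coefficient by name
coef(best_model)["(Intercept)"]        # b0
coef(best_model)["flipper_length_m"]   # b1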
Now that we know the coefficients that go in the best model, we can use those values to make best_function (the best-fitting model) in the code block below. We have also added code to graph the function and calculate the SSE from it.
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# create the best_function using the coefficients from lm()
best_function <- function(X){-5.872 + 50.153*X}
# this puts the best function on this scatter plot
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
gf_function(best_function, color = "forestgreen") %>%
gf_point(mean(Y) ~ mean(X), color = "red")
# this calculates sum of squares
residual <- Y - best_function(X)
sum(residual^2)
51.21196324697
Is it Really the Best-Fitting Model?
The best-fitting model is the model that minimizes the sum of squared error (or SSE). If we calculate residuals from any other function (with a different \(b_0\) and \(b_1\) than the ones we found from lm()), the sum of squares would be larger than 51.21. Let's check to see if this appears to be true.
Enter your own \(b_1\). Assuming that you want your line to go through the point of means, the R code will automatically calculate the \(b_0\) associated with your \(b_1\). The rest of the code will calculate the sum of squares. See if you can find any \(b_1\) that has a lower SSE than 51.21.
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# replace with your own value of b1 here
b1 <- 50.153
# this calculates the b0 that goes with your b1
b0 <- mean(Y) - b1*mean(X)
# this creates a function from the b1 and b0
try_function <- function(X){b0 + b1*X}
# this calculates sum of squares
residual <- Y - try_function(X)
sum(residual^2)
It’s impossible to make the SSE go lower! We’ve tried some values of \(b_1\) ourselves and plotted them on the graph below. We have also plotted the best-fitting model’s \(b_1\) as a green dot.
When we plot the sums of squares this way, they form a pattern: a kind of parabola (if you remember the shape from your algebra days). In fact, it can be proven mathematically, using calculus, that the sum of squares can never be lower for any parameters other than the particular combination generated by the lm() function. (A smoothed-out version of the function is shown below.)
[Two plots: the \(b_1\) that minimizes SS (left), and the \(b_0\) that minimizes SS (right)]
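Here is one way you might make a plot like this yourself (the grid of \(b_1\) values from 30 to 70 is our choice; any range around 50.153 would show the same shape):
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# for each b1 in the grid, pair it with the b0 that keeps the line
# through the point of means, then record that line's SSE
b1_values <- seq(30, 70, by = 1)
sse_values <- sapply(b1_values, function(b1) {
  b0 <- mean(Y) - b1*mean(X)
  sum((Y - (b0 + b1*X))^2)
})
# the SSEs trace out a parabola with its minimum at b1 = 50.153
gf_point(sse_values ~ b1_values)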
Teacher Note: How does R know how to find the best coefficients? Here is the formula that R uses to find the best-fitting \(b_1\) and \(b_0\):
\[b_1 = \frac{\sum{(X - mean(X))(Y - mean(Y))}}{\sum{(X - mean(X))^2}}\]
Once we know \(b_1\), we can use the point of means and figure out what the corresponding \(b_0\) is:
\[b_0 = mean(Y) - b_1*mean(X)\]
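If you want to check these formulas yourself, here is a sketch in R; the results should match the coefficients reported by lm() (-5.872 and 50.153):
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# compute b1 and b0 directly from the formulas above
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1*mean(X)
b0
b1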