list High School / Algebra + Data Science (G)

Book
  • High School / Advanced Statistics and Data Science I (ABC)
  • High School / Statistics and Data Science I (AB)
  • High School / Statistics and Data Science II (XCD)
  • High School / Algebra + Data Science (G)
  • College / Introductory Statistics with R (ABC)
  • College / Advanced Statistics with R (ABCD)
  • College / Accelerated Statistics with R (XCD)
  • CKHub: Jupyter made easy

3.8 The Best-Fitting Model

Whereas many different lines have residuals that sum to 0, the sum of squares is uniquely minimized for one of these lines. This is the model that we will call the best-fitting model of our data.

We might reduce error further by adding another predictor to the model, or by using a different function (something you can learn more about later). But once we decide to use a straight line as a model, there will be only one line, defined by its coefficients \(b_0\) and \(b_1\), where the sum of squares is minimized.

R has a special function that helps us identify the coefficients where the SSE is minimized. The function is called lm(), which stands for linear model. Here’s how to find the best-fitting linear model of body mass using flipper length as the predictor:

lm(body_mass_kg ~ flipper_length_m, data = penguins)

Try using lm() in the code block below to find the slope and intercept (\(b_0\) and \(b_1\)) of the best-fitting line to model the relationship between flipper length and body mass.

require(coursekata) # use lm() to find the best coefficients # use lm() to find the best coefficients lm(body_mass_kg ~ flipper_length_m, data = penguins) ex() %>% check_function("lm") %>% check_arg('formula') %>% check_equal()
Call:
lm(formula = body_mass_kg ~ flipper_length_m, data = penguins)

Coefficients:
     (Intercept)  flipper_length_m  
          -5.872            50.153 

The output of lm() shows two numbers.

Now that we know the coefficients that go in the best model, we can use those values to make best_function (the best-fitting model) in the code block below. We have also added code to graph the function and calculate the SSE from it.

require(coursekata) Y <- penguins$body_mass_kg X <- penguins$flipper_length_m # replace the coefficients to create the best_function best_function <- function(X){b0 + b1*X} # this puts the best function on this scatter plot gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>% gf_function(best_function, color = "forestgreen") %>% gf_point(mean(Y) ~ mean(X), color = "red") # this calculates sum of squares residual <- Y - best_function(X) sum(residual^2) # replace the coefficients to create the best_function best_function <- function(X){-5.872 + 50.153*X} # this puts the best function on this scatter plot gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>% gf_function(best_function, color = "forestgreen") %>% gf_point(mean(Y) ~ mean(X), color = "red") # this calculates sum of squares residual <- Y - best_function(X) sum(residual^2) msg <- "Remember not to round the values of intercept and slope." ex() %>% check_fun_def("best_function") %>% { check_arguments(.) check_call(., 1) %>% check_result() %>% check_equal(incorrect_msg = msg) check_body(.) %>% { check_operator(., "+") check_operator(., "*") } }
51.21196324697

A scatter plot of body_mass_kg predicted by flipper_length_m. A single data point highlighted in red is plotted where the mean of body mass and the mean of flipper length intersect. A green line of best fit is plotted on the graph and runs through the center of the data points, and runs through the red dot.

Is it Really the Best-Fitting Model?

The best-fitting model is the model that minimizes the sum of squared error (or SSE). If we calculate residuals from any other function (with different \(b_0\)s and \(b_1\)s than the ones we found from lm()), the sum of squares would be larger than 51.21. Let’s check to see if this appears to be true.

Enter your own \(b_1\). Assuming that you want your line to go through the point of means, the R code will automatically calculate the \(b_0\) associated with your \(b_1\). The rest of the code will calculate the sum of squares. See if you can find any \(b_1\)s that has a lower SSE than 51.21.

require(coursekata) Y <- penguins$body_mass_kg X <- penguins$flipper_length_m # replace with your own value of b1 here b1 <- 50.153 # this calculates the b0 that goes with your b1 b0 <- mean(Y) - b1*mean(X) # this creates a function from the b1 and b0 try_function <- function(X){b0 + b1*X} # this calculates sum of squares residual <- Y - try_function(X) sum(residual^2) # replace with your own value of b1 here # any value of b1 is acceptable b1 <- 50.153 # this calculates the b0 that goes with your b1 b0 <- mean(Y) - b1*mean(X) # this creates a function from the b1 and b0 try_function <- function(X){b0 + b1*X} # this calculates sum of squares residual <- Y - try_function(X) sum(residual^2) ex() %>% { check_function(., "try_function") check_function(., "mean") check_operator(., "-") check_operator(., "*") check_operator(., "^") }

It’s impossible to make the SSE go lower! We’ve tried some values of \(b_1\) ourselves and plotted them on the graph below. We have also plotted the best-fitting model’s \(b_1\) as a green dot.

Scatter plot of SSE on the y-axis and model on the x-axis. The points are distributed in a curved shape, b1 values on the x-axis and ss on the y-axis. The point where the b1 is 50.153 (from the output of lm()) has the lowest ss of 51.21. Another example point is labeled: a b1 of 80 is associated with a higher ss of 109.31.

When we plot the sums of squares this way they form a pattern – a kind of parabola (if you remember the shape from your algebra days). In fact, it can be proven mathematically, using calculus, that the sums of squares can never be lower for any parameters other than the particular combination generated by the lm() function. (A smoothed out version of the function is shown below).

The \(b_1\) that minimizes SS The \(b_0\) that minimizes SS

On the left, a plot of SSE on the y-axis and b1 on the x-axis. A curved line is plotted on the graph and runs through a single point labeled as b1 equals 50.153 and SSE equals 51.21.

On the right, a plot of SSE on the y-axis and b0 on the x-axis. A curved line is plotted on the graph and runs through a single point labeled as b0 equals -5.872 and SSE equals 51.21.


Teacher Note: How does R know how to find the best coefficients? Here is the formula that R uses to find the best-fitting \(b_1\) and \(b_0\):

\[b_1 = \frac{\sum{(X - mean(X))(Y - mean(Y))}}{\sum{X - mean(X)}}\]

Once we know \(b_1\), we can use the point of means and figure out what the corresponding \(b_0\) is:

\[b_0 = mean(Y) - b_1*mean(X)\]

Responses