3.8 The Best-Fitting Model
Whereas many different lines have residuals that sum to 0, the sum of squares is uniquely minimized for one of these lines. This is the model that we will call the best-fitting model of our data.
We might reduce error further by adding another predictor to the model, or by using a different function (something you can learn more about later). But once we decide to use a straight line as a model, there will be only one line, defined by its coefficients \(b_0\) and \(b_1\), where the sum of squares is minimized.
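To see this concretely, here is a small sketch you could run (the slopes 30 and 70 are just values we picked for illustration). Both lines pass through the point of means, so their residuals sum to roughly 0, but their sums of squares are quite different:
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# two hypothetical slopes, each paired with the b0 that keeps the
# line through the point of means
b0_a <- mean(Y) - 30*mean(X)
b0_b <- mean(Y) - 70*mean(X)
line_a <- function(X){b0_a + 30*X}
line_b <- function(X){b0_b + 70*X}
# residuals sum to approximately 0 for both lines
sum(Y - line_a(X))
sum(Y - line_b(X))
# but the sums of squares differ
sum((Y - line_a(X))^2)
sum((Y - line_b(X))^2)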
R has a special function that helps us identify the coefficients where the SSE is minimized. The function is called lm(), which stands for linear model. Here's how to find the best-fitting linear model of body mass using flipper length as the predictor:
lm(body_mass_kg ~ flipper_length_m, data = penguins)
Try using lm() in the code block below to find the intercept and slope (\(b_0\) and \(b_1\)) of the best-fitting line to model the relationship between flipper length and body mass.
require(coursekata)
# use lm() to find the best coefficients
lm(body_mass_kg ~ flipper_length_m, data = penguins)
Call:
lm(formula = body_mass_kg ~ flipper_length_m, data = penguins)

Coefficients:
     (Intercept)  flipper_length_m
          -5.872            50.153
The output of lm() shows two numbers: the intercept, \(b_0 = -5.872\), and the slope for flipper_length_m, \(b_1 = 50.153\).
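If you want to use these numbers in later code, one option is to save the fitted model and pull the coefficients out with R's coef() function (best_model is just a name we chose):
best_model <- lm(body_mass_kg ~ flipper_length_m, data = penguins)
# extract each coefficient by name
coef(best_model)["(Intercept)"]        # b0
coef(best_model)["flipper_length_m"]   # b1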
Now that we know the coefficients that go in the best model, we can use those values to make best_function (the best-fitting model) in the code block below. We have also added code to graph the function and calculate the SSE from it.
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# create the best_function using the coefficients from lm()
best_function <- function(X){-5.872 + 50.153*X}
# this puts the best function on this scatter plot
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
gf_function(best_function, color = "forestgreen") %>%
gf_point(mean(Y) ~ mean(X), color = "red")
# this calculates sum of squares
residual <- Y - best_function(X)
sum(residual^2)
51.21196324697
Is it Really the Best-Fitting Model?
The best-fitting model is the model that minimizes the sum of squared error (or SSE). If we calculate residuals from any other function (with a different \(b_0\) and \(b_1\) than the ones we found from lm()), the sum of squares would be larger than 51.21. Let's check to see if this appears to be true.
Enter your own \(b_1\). Assuming that you want your line to go through the point of means, the R code will automatically calculate the \(b_0\) associated with your \(b_1\). The rest of the code will calculate the sum of squares. See if you can find any \(b_1\) that has a lower SSE than 51.21.
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# replace with your own value of b1 here
b1 <- 50.153
# this calculates the b0 that goes with your b1
b0 <- mean(Y) - b1*mean(X)
# this creates a function from the b1 and b0
try_function <- function(X){b0 + b1*X}
# this calculates sum of squares
residual <- Y - try_function(X)
sum(residual^2)
It’s impossible to make the SSE go lower! We’ve tried some values of \(b_1\) ourselves and plotted them on the graph below. We have also plotted the best-fitting model’s \(b_1\) as a green dot.
When we plot the sums of squares this way, they form a pattern: a kind of parabola (if you remember the shape from your algebra days). In fact, it can be proven mathematically, using calculus, that the sum of squares can never be lower for any parameters other than the particular combination generated by the lm() function. (A smoothed-out version of the function is shown below.)
[Two plots: the \(b_1\) that minimizes SS (left), and the \(b_0\) that minimizes SS (right)]
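Here is one way you might make a plot like this yourself (the grid of \(b_1\) values from 30 to 70 is our choice; any range around 50.153 would show the same shape):
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# for each b1 in the grid, pair it with the b0 that keeps the line
# through the point of means, then record that line's SSE
b1_values <- seq(30, 70, by = 1)
sse_values <- sapply(b1_values, function(b1) {
  b0 <- mean(Y) - b1*mean(X)
  sum((Y - (b0 + b1*X))^2)
})
# the SSEs trace out a parabola with its minimum at b1 = 50.153
gf_point(sse_values ~ b1_values)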
Teacher Note: How does R know how to find the best coefficients? Here is the formula that R uses to find the best-fitting \(b_1\) and \(b_0\):
\[b_1 = \frac{\sum{(X - mean(X))(Y - mean(Y))}}{\sum{(X - mean(X))^2}}\]
Once we know \(b_1\), we can use the point of means and figure out what the corresponding \(b_0\) is:
\[b_0 = mean(Y) - b_1*mean(X)\]
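If you want to check these formulas yourself, here is a sketch in R; the results should match the coefficients reported by lm() (-5.872 and 50.153):
require(coursekata)
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
# compute b1 and b0 directly from the formulas above
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1*mean(X)
b0
b1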