2.2 Make Your Own Model

An advantage of the straight line as a model is that it can be described using a simple algebraic equation: \(y = mx+b\). The \(m\) represents the slope (or steepness) of the line, and \(b\) represents the y-intercept, which is the value of \(y\) when \(x\) is 0. If we know the slope and intercept of a line, and if we fill in any particular value of \(x\), we can calculate an exact prediction for \(y\). This makes the line a very useful statistical model!
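As a small illustration of that calculation, here is a sketch in base R. The slope (49), intercept (-5.5), and x value (0.2) are chosen only for illustration, and `predict_y` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: predict y from x using a line with slope m and intercept b
predict_y <- function(x, m, b) {
  m * x + b
}

# Illustrative values only: slope = 49, intercept = -5.5
# Prediction for x = 0.2
prediction <- predict_y(x = 0.2, m = 49, b = -5.5)
print(prediction)

# When x is 0, the prediction equals the y-intercept
print(predict_y(x = 0, m = 49, b = -5.5))
```

With these assumed values the prediction is about 4.3, and plugging in 0 for x returns the intercept itself.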

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points.

Representing a Line in GLM Notation

In algebra you probably learned to represent a line as \(y=mx+b\). In this book we will use a different notation to represent the same thing: the notation of the General Linear Model (GLM). We use this notation because it is more common among professionals in statistics and data science, and because it will make things easier later when you learn about models other than a straight line.

In going from school algebra notation to GLM notation, the first thing we do is change the order a little. Instead of saying \(y = mx + b\), we think of it as \(y = b+mx\). Changing the order of the \(b\) and \(mx\) doesn’t change what it represents (just like saying \(1+2\) is the same as saying \(2+1\)). The simpler term, \(b\), typically comes second in school algebra but comes first in GLM notation.

The next thing we do is change the symbols we use to represent the y-intercept and slope. We use \(b_0\) to represent the y-intercept (pronounced “b sub zero”) and \(b_1\) to represent the slope (pronounced “b sub one”). You can see what GLM notation looks like in comparison with the algebraic notation typically used in school in the table below.

School algebraic notation              \(y = b + mx\)
General Linear Model (GLM) notation    \(Y = b_0 + b_1X\)
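
To see that the two notations describe the same line, the sketch below computes predictions both ways in base R. The numbers are arbitrary illustrations, and the variable names simply mirror the symbols in the table.

```r
# Arbitrary illustrative values for the x variable
X <- c(0, 1, 2, 3)

# School algebra notation: y = mx + b
m <- 2
b <- 1
y_school <- m * X + b

# GLM notation: Y = b0 + b1 * X
b0 <- b   # the y-intercept
b1 <- m   # the slope
Y_glm <- b0 + b1 * X

# The two sets of predictions match
print(y_school)
print(Y_glm)
print(identical(y_school, Y_glm))
```

Only the symbols and the order of the terms change; the predictions do not.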


Make Your Own Model

In the code block below, we introduce a new R function, gf_abline(), which takes an intercept and a slope and overlays the corresponding line onto the scatter plot. Try changing the numbers to find an intercept and slope that create a line of predictions that you think best represents the data. It might take some trial and error, but you can try as many times as you like! Submit your code when the line looks good to you.

Note: The best y-intercept might be a negative number. We’ll discuss this more later. For now, just try to make a line that visually looks like a good line of predictions.

require(coursekata)

# modify the intercept and slope
# to put a line of predictions through the scatter plot
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -2, slope = 40)

We’ve put five student-generated models on the scatter plot below by chaining gf_abline() calls with the pipe, %>%. Just for fun, we also used different colors.

gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -3.8, slope = 41, color = "tomato") %>%
  gf_abline(intercept = -5.5, slope = 49, color = "orange") %>%
  gf_abline(intercept = -6.2, slope = 52, color = "gold2") %>%
  gf_abline(intercept = -7, slope = 56, color = "green4") %>%
  gf_abline(intercept = -8.1, slope = 62, color = "steelblue") 

A scatter plot of body_mass_kg predicted by flipper_length_m. Five slightly different lines of best fit are plotted on the graph. All lines have slightly different intercepts and slopes, but all tend to run through the approximate center of the data points and have a positive trend.
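
One way to compare these five lines numerically is to compute each model's prediction at the same flipper length. The sketch below uses base R with the intercepts and slopes from the code above. The flipper length of 0.2 m is an assumed value picked for illustration, and the data frame and its column names are hypothetical.

```r
# Intercepts and slopes of the five lines plotted above
models <- data.frame(
  color     = c("tomato", "orange", "gold2", "green4", "steelblue"),
  intercept = c(-3.8, -5.5, -6.2, -7, -8.1),
  slope     = c(41, 49, 52, 56, 62)
)

# Assumed flipper length (in meters), chosen only for illustration
flipper_length_m <- 0.2

# Prediction from each line: Y = b0 + b1 * X
models$predicted_mass_kg <- models$intercept + models$slope * flipper_length_m

print(models)
```

Although the intercepts and slopes differ quite a bit, the predictions at this flipper length all fall between roughly 4.2 and 4.4 kg, which helps explain why the lines look so similar where they pass through the data.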
