2.2 Make Your Own Model
An advantage of the straight line as a model is that it can be described using a simple algebraic equation: \(y = mx+b\). The \(m\) represents the slope (or steepness) of the line, and \(b\) represents the y-intercept, which is the value of \(y\) when \(x\) is 0. If we know the slope and intercept of a line, and if we fill in any particular value of \(x\), we can calculate an exact prediction for \(y\). This makes the line a very useful statistical model!
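To make this concrete, here is a minimal sketch in R of how a slope and intercept turn a value of \(x\) into a prediction for \(y\). The function name predict_line and the numbers used (slope 2, intercept 1) are made up for illustration; they do not come from any data set.

```r
# Hypothetical helper: returns the y predicted by the line y = mx + b
predict_line <- function(x, m, b) {
  m * x + b
}

# Illustrative values only: slope m = 2, intercept b = 1
predict_line(x = 3, m = 2, b = 1)  # 2 * 3 + 1 = 7

# When x is 0, the prediction is just the y-intercept
predict_line(x = 0, m = 2, b = 1)  # 1
```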
Representing a Line in GLM Notation
In algebra you probably learned to represent a line as \(y=mx+b\). In this book we will use a different notation to represent the same thing: the notation of the General Linear Model (GLM). We use this notation because professionals in statistics and data science use it more often, and because it will make things easier later when you learn about models other than a straight line.
In going from school algebra notation to GLM notation, the first thing we do is change the order a little. Instead of writing \(y = mx + b\), we write \(y = b+mx\). Changing the order of \(b\) and \(mx\) doesn’t change what the equation represents (just as \(1+2\) is the same as \(2+1\)). The simpler component, \(b\), typically comes second in school algebra, but it comes first in GLM notation.
The next thing we do is change the symbols we use to represent the y-intercept and the slope. We use \(b_0\) (pronounced “b sub zero”) to represent the y-intercept and \(b_1\) (pronounced “b sub one”) to represent the slope. The table below compares GLM notation with the algebraic notation typically used in school.
Notation | Equation |
---|---|
School algebraic notation | \(y = b+mx\) |
General Linear Model (GLM) notation | \(Y = b_0 + b_1X\) |
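As a worked example of reading GLM notation, suppose (purely for illustration) that \(b_0 = -5.5\) and \(b_1 = 49\), and that we want a prediction for \(X = 0.2\). These values are assumptions chosen to show the arithmetic.

```latex
\begin{aligned}
Y &= b_0 + b_1 X \\
  &= -5.5 + 49(0.2) \\
  &= -5.5 + 9.8 \\
  &= 4.3
\end{aligned}
```

The same numbers in school notation, \(y = mx + b\) with \(m = 49\) and \(b = -5.5\), give the same prediction.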
Make Your Own Model
In the code block below, we introduce a new R function, gf_abline(), where you can put in a slope and intercept and it will overlay a line onto the scatter plot. Try changing the numbers to find an intercept and slope that create a line of predictions that you think best represents the data. It might take some trial and error, but you can try as many times as you like! Submit your code when the line looks good to you.
Note: The best y-intercept might be a negative number. We’ll discuss this more later. For now, just try to make a line that visually looks like a good line of predictions.
require(coursekata)
# modify the intercept and slope
# to put a line of predictions through the scatter plot
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -2, slope = 40)
# one possible solution (solutions will vary, but the line
# should generally follow the pattern of the data)
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -5.5, slope = 49)
We’ve put five student-generated models on the scatter plot below by chaining gf_abline() calls together with the pipe, %>%. Just for fun, we also used different colors.
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -3.8, slope = 41, color = "tomato") %>%
  gf_abline(intercept = -5.5, slope = 49, color = "orange") %>%
  gf_abline(intercept = -6.2, slope = 52, color = "gold2") %>%
  gf_abline(intercept = -7, slope = 56, color = "green4") %>%
  gf_abline(intercept = -8.1, slope = 62, color = "steelblue")
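One way to compare these five lines is to see what each one predicts for the same penguin. The sketch below uses only base R and the intercepts and slopes from the code above; the flipper length of 0.2 m is an assumed example value, not a measurement taken from the data.

```r
# Intercepts (b0) and slopes (b1) of the five lines plotted above
b0 <- c(-3.8, -5.5, -6.2, -7, -8.1)
b1 <- c(41, 49, 52, 56, 62)

# Assumed example flipper length in meters (not taken from the data)
flipper_length_m <- 0.2

# Predicted body mass in kg from each line: Y = b0 + b1 * X
predicted_mass_kg <- b0 + b1 * flipper_length_m
round(predicted_mass_kg, 1)
```

Even though the intercepts and slopes differ quite a bit, the five predictions for this flipper length all land between about 4.2 and 4.4 kg.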