2.6 Exploring Slope
Now that we’ve dug into the \(b_0\) (y-intercept), let’s look more closely at the \(b_1\) (slope) of our function.
our_function <- function(X){-5.5 + 49*X}
If we write the function in algebraic notation, we would write:
\(Y = -5.5 + 49X\)
The y-intercept is -5.5, and the slope is 49. But what does it mean to say the slope is 49? One definition of slope you might have heard is “rise over run”. In other words, slope can be thought of as a fraction.
One way to read the fraction \(\frac{49}{1}\) would be “a rise of 49 for a run of 1”. Think of it like this: there will be a 49 kg increase in body mass (rise) for every 1 meter increase in flipper length (run). In data science, we often interpret this as the adjustment we make to our prediction of body mass for each additional unit of flipper length.
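To see “rise over run” with a run that is not exactly 1, here is a small sketch. The two flipper lengths (0.18 and 0.20 meters) are made-up values chosen only for illustration; they are not taken from the data.

```r
# this creates our custom function
our_function <- function(X){-5.5 + 49*X}

# two made-up flipper lengths in meters (for illustration only)
x1 <- 0.18
x2 <- 0.20

# rise: change in predicted body mass; run: change in flipper length
rise <- our_function(x2) - our_function(x1)
run <- x2 - x1

# rise over run should come out to the slope (about 49)
rise / run
```

Even though the run here is only 0.02 meters, dividing the rise by the run still gives a value of about 49 (computers may show tiny rounding differences).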
Slope as Rate of Change
Another way to think about slope is as a rate of change. For a linear function (i.e., a straight line) this rate of change is constant: the slope (49 kilograms per 1 meter) is the same regardless of where you are on the graph. Whether you go from 0 to 1 meter of flipper length, or from 100 to 101 meters, the increase in body mass, and thus the slope, is the same.
The difference between the model’s predictions of body mass for flipper lengths of 0 versus 1 is 49 (try running the code below to check!). But so is the difference in predictions for flipper lengths of 99 versus 100. Still 49! Try any two flipper lengths that are just 1 unit apart (e.g., 9 versus 10, 1.23 versus 2.23, 3000 versus 3001).
require(coursekata)
# this creates our custom function
our_function <- function(X){-5.5 + 49*X}
# compare predictions
our_function(1) - our_function(0)
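If you would rather check several pairs at once, one possible approach is to put the starting flipper lengths in a vector. The name `starts` is just something we made up for this sketch.

```r
# this creates our custom function
our_function <- function(X){-5.5 + 49*X}

# starting flipper lengths; each will be compared to a value 1 unit larger
starts <- c(0, 9, 1.23, 99, 3000)

# difference in predictions for each pair that is 1 unit apart
differences <- our_function(starts + 1) - our_function(starts)
differences
```

Each difference should be 49 (or extremely close to it, allowing for rounding), no matter where the pair starts.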
Later we might see patterns in data where the slopes (rates of change) are not constant across all values of \(X\). For example, for smaller flipper lengths there may be a different slope than for very large flipper lengths. If this were the case, the model would not be a straight line.
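To get a feel for what a non-constant slope looks like, here is a sketch using a curved function. Note that `curved_function` is hypothetical: we invented it for this comparison, and it is not a model of the penguin data.

```r
# our straight-line function from before
our_function <- function(X){-5.5 + 49*X}

# a hypothetical curved function (X is squared), made up for comparison
curved_function <- function(X){-5.5 + 49*X^2}

# starting values; each is compared to a value 1 unit larger
starts <- c(0, 1, 2)

# the straight line changes by the same amount each time
line_differences <- our_function(starts + 1) - our_function(starts)
line_differences

# the curve changes by a different amount each time
curve_differences <- curved_function(starts + 1) - curved_function(starts)
curve_differences
```

For the straight line, every 1-unit step changes the prediction by the same amount. For the curved function, the change gets larger as \(X\) gets larger, so there is no single slope that describes the whole function.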
Coefficients Versus Variables
Exploring the y-intercept and slope hopefully helps you appreciate that the \(b_0\) and \(b_1\) are not actually variables. They are just single numbers, which we chose based on how well the lines appeared to fit the data. In the tradition of the GLM, we call these coefficients (we don’t call them variables!).
In contrast, flipper_length_m (\(X\)) and body_mass_kg (\(Y\)) are variables because they vary across penguins. For a single linear function (or straight line), the values of \(X\) and \(Y\) depend on where you are on the line. But the slope (\(b_1\)) and y-intercept (\(b_0\)) that define the line don’t change. These are the coefficients.
GLM notation is helpful because it helps us keep in mind the distinction between variables (the \(X\) and \(Y\)) and coefficients (the \(b\)s). In statistics, data science, research methods, and science in general, we collect and work with data. Those data are our \(X\)s and \(Y\)s. The coefficients (e.g., \(b_0\) or \(b_1\)) are used to make predictions of \(Y\) based on the \(X\) variable.
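One way to picture the distinction in code is to store the coefficients as single numbers and the variable as a vector of values. In this sketch, the names `b0`, `b1`, `X`, and `Y_predicted` are our own choices, and the three flipper lengths are made-up values for illustration, not rows from the data.

```r
# coefficients: single numbers that define the line
b0 <- -5.5
b1 <- 49

# variable: made-up flipper lengths in meters (one value per penguin)
X <- c(0.18, 0.19, 0.21)

# predictions of body mass (in kg) vary because X varies
Y_predicted <- b0 + b1*X
Y_predicted
```

Notice that `X` and `Y_predicted` each hold several different values, while `b0` and `b1` stay the same for every penguin.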