High School / Algebra + Data Science (G)

2.6 Exploring Slope

Now that we’ve dug into the \(b_0\) (y-intercept), let’s look more closely at the \(b_1\) (slope) of our function.

our_function <- function(X){-5.5 + 49*X}

If we write the function in algebraic notation, we would write:

\(Y = -5.5 + 49X\)

The y-intercept is -5.5, and the slope is 49. But what does it mean to say the slope is 49? One definition of slope you might have heard is “rise over run”. In other words, slope can be thought of as a fraction.

One way to read the fraction \(\frac{49}{1}\) would be “a rise of 49 for a run of 1”. Think of it like this: there will be a 49 kg increase in body mass (rise) for every 1 meter increase in flipper length (run). In data science, we often interpret this as the adjustment we make to our prediction of body mass for each additional unit of flipper length.

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. The graph is zoomed out so that the x-axis is extended to a flipper length of 0 to 1, and the y-axis is extended to a body mass of 0 to 40. A red dot is plotted where x equals zero and y equals negative 5.5. A dashed arrow runs vertically from the red dot up to a body mass of 49, and another arrow runs from the top of that arrow horizontally to a flipper length of 1. The arrows demonstrate that we adjust our prediction of body mass by 49 kg for each additional meter of flipper length.
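As a rough check of the “rise over run” idea, the sketch below picks two points on the line and divides the change in predicted body mass by the change in flipper length. The names x1, x2, rise, and run are made up for this illustration, and the flipper lengths 0.2 and 0.7 are arbitrary choices, not values taken from the data.

```r
our_function <- function(X){-5.5 + 49*X}

# two arbitrary flipper lengths (in meters), chosen only for illustration
x1 <- 0.2
x2 <- 0.7

# rise: change in predicted body mass (kg)
rise <- our_function(x2) - our_function(x1)

# run: change in flipper length (m)
run <- x2 - x1

# slope as rise over run
rise / run
```

Any two different flipper lengths should give a ratio of 49, apart from tiny rounding error in decimal arithmetic.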

Slope as Rate of Change

Another way to think about slope is as a rate of change. For a linear function (i.e., a straight line), this rate of change is constant: the slope (49 kilograms per 1 meter) is the same no matter where you are on the graph. Whether you go from 0 to 1 meter of flipper length, or from 100 to 101 meters, the increase in predicted body mass, and thus the slope, is the same.

The difference between the model’s predictions of body mass for flipper lengths of 0 versus 1 is 49 (try running the code below to check!). But so is the difference in predictions for flipper lengths of 99 versus 100. Still 49! Try any two flipper lengths that are just 1 unit apart (e.g., 9 versus 10, 1.23 versus 2.23, 3000 versus 3001).

require(coursekata)

# this creates our custom function
our_function <- function(X){-5.5 + 49*X}

# compare predictions
our_function(1) - our_function(0)
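To check several pairs at once, one possible approach is to hand the function a whole vector of starting flipper lengths. This is only a sketch: the names starts and differences are made up here, and the starting values are the examples mentioned in the text above.

```r
our_function <- function(X){-5.5 + 49*X}

# starting flipper lengths taken from the examples in the text
starts <- c(0, 9, 1.23, 99, 3000)

# prediction one unit later minus prediction at the start
differences <- our_function(starts + 1) - our_function(starts)
differences

# all.equal() allows for tiny rounding error in decimal arithmetic
all.equal(differences, rep(49, length(starts)))
```

The algebra gives the same answer for any starting value: \((-5.5 + 49(X + 1)) - (-5.5 + 49X) = 49\). The -5.5 and the \(49X\) cancel, leaving only the slope.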

Later we might see patterns in data where the slopes (rates of change) are not constant across all values of \(X\). For example, for smaller flipper lengths there may be a different slope than for very large flipper lengths. If this were the case, the model would not be a straight line.
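To see what a non-constant rate of change looks like, here is a small sketch using a curved function. Note that curved_function and its formula are hypothetical, invented only for this comparison; they are not a model of the penguin data.

```r
# a hypothetical curved function, invented for this illustration only
curved_function <- function(X){10 * X^2}

# change in prediction for a 1-unit increase, starting from 0
curved_function(1) - curved_function(0)

# change in prediction for a 1-unit increase, starting from 2
curved_function(3) - curved_function(2)
```

The first difference is 10 and the second is 50. Because a 1-unit increase in \(X\) changes the prediction by different amounts depending on where you start, this function does not have a single slope, and its graph is not a straight line.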

Coefficients Versus Variables

Exploring the y-intercept and slope hopefully helps you appreciate that the \(b_0\) and \(b_1\) are not actually variables. They are just single numbers, which we chose based on how well the lines appeared to fit the data. In the tradition of the GLM, we call these coefficients (we don’t call them variables!).

In contrast, flipper_length_m (\(X\)) and body_mass_kg (\(Y\)) are variables because they vary across penguins. For a single linear function (or straight line), the values of \(X\) and \(Y\) change depending on where you are on the line. But the slope (\(b_1\)) and y-intercept (\(b_0\)) that define the line don’t change. These are the coefficients.

GLM notation is helpful because it helps us keep in mind the distinction between variables (the \(X\) and \(Y\)) and coefficients (the \(b\)s). In statistics, data science, research methods, and science in general, we collect and work with data. Those data are our \(X\)s and \(Y\)s. The coefficients (e.g., \(b_0\) or \(b_1\)) are used to make predictions of \(Y\) based on the \(X\) variable.
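The distinction can also be seen in code. In the sketch below, b0 and b1 are each stored as a single number, while the variable holds several values. The three flipper lengths are made up for illustration (they are not real penguins), and the name predicted_body_mass_kg is also just an assumption for this example.

```r
# coefficients: single numbers that define the line
b0 <- -5.5
b1 <- 49

# variable: flipper length varies across penguins
# (these three values are made up for illustration)
flipper_length_m <- c(0.18, 0.20, 0.22)

# the predictions vary because X varies; b0 and b1 stay the same
predicted_body_mass_kg <- b0 + b1 * flipper_length_m
predicted_body_mass_kg
```

Each flipper length gets its own prediction, but every prediction is made with the same two coefficients.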
