1.4 Relationships Between Two Variables
The power of algebra is that it can help us represent relationships between variables. For example, we might be trying to understand the relationship between body mass (measured in kilograms) and flipper length (measured in meters) in penguins. We might have a hypothesis that penguins with longer flippers would have greater body mass.
When we make a hypothesis about the relationship between two variables, it is useful to think of one variable as the outcome variable (let’s think of body mass as the outcome) and the other variable (here, flipper length) as a predictor variable. Our hypothesis is that if we know a penguin’s flipper length, we can make a more accurate prediction of its body mass than if we didn’t know its flipper length.
On this page, we will learn how to translate a hypothesis into a word equation and then ultimately into a scatter plot. (In later units, we’ll also learn how to use algebraic functions to represent relationships like the hypothesized relationship between body mass and flipper length.)
Word Equations
We can represent relationships between outcome variables and predictor variables with word equations. Here is a word equation that represents the relationship between body mass and flipper length:
body mass = flipper length + other stuff
The term other stuff at the end of the word equation represents an important idea: even if knowing a penguin’s flipper length can help us make a better prediction of their body mass, the prediction won’t be perfect. While some of the variation in body mass can be explained by variation in flipper length, there will still be some variation that is not explained. This remaining variation could, presumably, be explained by other stuff.
Here are six rows from the penguins data frame:
  species gentoo body_mass_kg flipper_length_m bill_length_cm female    island
1  Adelie      0        4.200            0.194            4.6      0 Torgersen
2  Gentoo      1        4.375            0.217            4.6      1    Biscoe
3  Adelie      0        3.950            0.185            3.8      0    Biscoe
4  Gentoo      1        5.700            0.218            5.0      0    Biscoe
5  Adelie      0        4.000            0.210            4.4      0 Torgersen
6  Adelie      0        3.000            0.192            3.7      1     Dream
Here’s how to read a word equation: “Variation in body_mass_kg can be explained by variation in flipper_length_m plus other stuff.” By convention, the outcome variable, body_mass_kg, is written to the left of the equal sign and the predictor variable, flipper_length_m, is written to the right.
Word equations are not the same as mathematical equations. It isn’t the case, for example, that body mass and flipper length are the same thing or are “equal.” A word equation is just an informal way of representing the idea that some of the variation in body mass is explained by variation in flipper length (the rest is other stuff).
More generally, we could say that some of the variation in the outcome variable is explained by variation in the predictor variable:
outcome = predictor + other stuff
We will start to refer to these word equations as informal models. A model airplane isn’t the real thing, but it can give you a good idea of what a real plane looks like. Models give us a simplified representation of what the relationship between variables might look like. We will quantify these relationships as formal models later, but it’s helpful to start thinking of word equations as models.
Visualizing the Relationship
Now that we have a word equation, let’s make a data visualization to explore the relationship between the outcome variable and the predictor variable.
Body Mass = Flipper Length + other stuff
Here’s some code to make a scatter plot:
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
Let’s break down this line of code into parts. First, take a look at the part inside the parentheses ( ):
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
Now look at the part that says data = :
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
Finally, take a look at the beginning part of the code:
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
gf_point() is the name of the R function that will make a scatter plot of the relationship between flipper length and body mass.
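To see how the pieces of this line of code fit together, here is a minimal sketch in base R. It assumes a small stand-in data frame, six_penguins (a hypothetical name), built from the six rows in the table earlier on this page rather than from the full penguins data frame. It shows that the part before the comma is an R formula naming two variables, and that data = has to point to a data frame that contains both of them.

```r
# Six rows copied from the table earlier on this page.
# six_penguins is a hypothetical stand-in for the full penguins data frame.
six_penguins <- data.frame(
  body_mass_kg     = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000),
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
)

# The first argument to gf_point() is an R formula: outcome ~ predictor
plot_formula <- body_mass_kg ~ flipper_length_m

# all.vars() lists the variable names that appear in the formula
formula_vars <- all.vars(plot_formula)
print(formula_vars)

# data = tells R where to look for those names,
# so both of them must be columns in the data frame
print(formula_vars %in% names(six_penguins))
```

If a variable name in the formula were misspelled, the last line would show FALSE for that name, which is a quick way to find the problem before making the plot.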
Alright, enough explanation. Let’s write some code to explore the hypothesis expressed in the word equation (body_mass_kg = flipper_length_m + other stuff) in a scatter plot.
require(coursekata)
# starter code: fill in the variables and data frame
gf_point( ~ , data = )
# solution: the outcome goes before the ~ and the predictor goes after it
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
Interpreting the Graph
Here’s the graph produced by the gf_point() function.
By convention, we put the outcome variable (in this case body_mass_kg) on the y-axis and the predictor variable (flipper_length_m) on the x-axis. Each point in the scatter plot represents a single penguin (a row) in the data frame.
As we look at this graph, let’s keep in mind the relationship we hypothesized between flipper length and body mass as expressed in this word equation:
body_mass_kg = flipper_length_m + other stuff
In general, penguins with longer flippers (i.e., farther to the right in the graph) also tend to have greater body mass (i.e., they tend to be closer to the top of the graph).
This pattern illustrates what we mean when we say that some of the variation in body mass is explained by variation in flipper length. We can informally define explain like this: If we know a penguin’s flipper length, we can make a better prediction of its body mass than we could if we didn’t know its flipper length.
Informal Definition of Explain Variation: If we know a case’s value on the predictor variable, we can make a better prediction of its value on the outcome variable.
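One way to make "a better prediction" concrete is to compare two ways of guessing body mass. The sketch below is only an illustration under stated assumptions: it uses the six rows from the table earlier on this page (not the full data set), the 0.20 m cutoff between "shorter" and "longer" flippers is an arbitrary choice, and scoring a guess by its squared distance from the actual value is just one possible way to measure how far off it is. The names six_penguins, guess_overall, and guess_by_group are hypothetical.

```r
# Six rows copied from the table earlier on this page (not the full data set)
six_penguins <- data.frame(
  body_mass_kg     = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000),
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
)

# Prediction 1: ignore flipper length and guess the overall mean for everyone
guess_overall <- mean(six_penguins$body_mass_kg)
error_overall <- sum((six_penguins$body_mass_kg - guess_overall)^2)

# Prediction 2: use flipper length. Penguins are split into two groups
# (the 0.20 m cutoff is an arbitrary choice), and each penguin's guess
# is the mean body mass of its own group.
six_penguins$long_flipper <- six_penguins$flipper_length_m > 0.20
guess_by_group <- ave(six_penguins$body_mass_kg, six_penguins$long_flipper)
error_by_group <- sum((six_penguins$body_mass_kg - guess_by_group)^2)

# Total squared error for each way of guessing (smaller is better)
print(c(overall = error_overall, by_group = error_by_group))
```

The guesses that use flipper length come out with less total squared error, but the error does not drop to zero. What is left over is the "other stuff."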
Even though we can make a better prediction of body mass if we know flipper length, we can’t make a perfect prediction. That’s where the “other stuff” comes in. We can see in the scatter plot that there are a few penguins with flipper lengths of approximately 0.23 m (in the upper right of the plot). We can predict that penguins with flippers this long would tend to have fairly high body mass. But even though they all have about the same flipper length, there is still variation in their body mass, presumably due to other stuff.
How R Knows Which Variable to Put on Which Axis
As pointed out earlier, it is customary to put the outcome variable on the y-axis and the predictor variable on the x-axis. But how does R know which variable should be on the y-axis?
Here, again, is the code used to create the scatter plot above:
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)
Try modifying the code below to put flipper_length_m on the y-axis and body_mass_kg on the x-axis. (We also added some code to show you how to change the colors of the data points, color = "purple".)
require(coursekata)
# starter code: modify this code
gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = "purple")
# solution: swap the variables on either side of the ~
gf_point(flipper_length_m ~ body_mass_kg, data = penguins, color = "purple")
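The exercise above points to the rule R follows: gf_point() puts the variable written to the left of the ~ on the y-axis and the variable written to the right on the x-axis. The sketch below uses base R only and does not draw a plot. axis_roles() is a hypothetical helper written for this illustration (it is not part of coursekata) that pulls the two sides out of a formula so the roles can be compared before and after swapping.

```r
# Hypothetical helper: report which variable a formula of the form y ~ x
# would send to each axis. Works for simple one-variable-per-side formulas.
axis_roles <- function(f) {
  c(
    y_axis = deparse(f[[2]]),  # left side of the ~
    x_axis = deparse(f[[3]])   # right side of the ~
  )
}

original <- body_mass_kg ~ flipper_length_m
swapped  <- flipper_length_m ~ body_mass_kg

print(axis_roles(original))
print(axis_roles(swapped))
```

Swapping the two names around the ~ is all it takes to swap the axes. The data = and color = arguments stay the same.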