1.6 Quantitative and Categorical Variables
Broadly speaking, variables can be divided into two types: quantitative and categorical.
Quantitative Variables. The variables we have been looking at so far (body_mass_kg and flipper_length_m) are examples of quantitative variables. Quantitative variables are assigned numerical values (e.g., 4.2 or 0.194), and these values represent quantities. This means that observations with higher numbers are assumed to have more of the quantity than those with lower numbers. For example, we can assume that a penguin with a flipper_length_m of 0.194 has longer flippers than one with a value of 0.12.

Categorical Variables. The values assigned to categorical variables do not represent quantities. Instead, they represent categories. One categorical variable in the penguins data frame is female, which we used earlier to indicate the penguin’s sex.
Another example in the penguins data frame is the variable gentoo. Gentoo is a penguin species (depicted in the photo above).
  species gentoo body_mass_kg flipper_length_m bill_length_cm female    island
1  Adelie      0        4.200            0.194            4.6      0 Torgersen
2  Gentoo      1        4.375            0.217            4.6      1    Biscoe
3  Adelie      0        3.950            0.185            3.8      0    Biscoe
4  Gentoo      1        5.700            0.218            5.0      0    Biscoe
5  Adelie      0        4.000            0.210            4.4      0 Torgersen
6  Adelie      0        3.000            0.192            3.7      1     Dream
Even though the variable gentoo is assigned values such as 1 or 0, these numbers don’t mean that a penguin coded as 1 has more of something than penguins coded as 0. Instead, these numbers just stand for different categories (either being a gentoo penguin or not).
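To make this concrete, here is a minimal sketch in base R. It assumes only the six rows shown in the table above (not the full penguins data frame), and the names penguins_sample and gentoo_label are made up for illustration. It relabels the 0/1 codes as category names and counts how many penguins fall in each category.

```r
# Six rows copied from the table above (only the columns needed here).
# penguins_sample is a hypothetical name, not part of the course data.
penguins_sample <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Gentoo", "Adelie", "Adelie"),
  gentoo = c(0, 1, 0, 1, 0, 0),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

# Relabel the 0/1 codes so it is clear they name categories, not amounts
penguins_sample$gentoo_label <- factor(
  penguins_sample$gentoo,
  levels = c(0, 1),
  labels = c("not gentoo", "gentoo")
)

# Counting rows per category is a summary that makes sense for a
# categorical variable
gentoo_counts <- table(penguins_sample$gentoo_label)
print(gentoo_counts)
```

The labels could be swapped for any other pair of names without losing information, which is one sign that the codes are categories rather than quantities.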
Hypotheses with Categorical Variables
We’ve been translating hypotheses into word equations and then into scatter plots. We explored the hypothesis that we can make a better prediction of body mass if we know a penguin’s flipper length. We translated this hypothesis into the word equation: body_mass_kg = flipper_length_m + other stuff, and then made a scatter plot to explore the relationship.
Now let’s try swapping out the quantitative predictor variable flipper_length_m for a categorical predictor variable, gentoo.
In the code block below, make a scatter plot to explore the hypothesis that knowing whether a penguin is a gentoo helps us predict its body mass.
require(coursekata)
# make a scatter plot
gf_point(body_mass_kg ~ gentoo, data = penguins)
Although this graph might look a little weird at first, it is also a scatter plot! It’s just that gentoo is a very different kind of variable than flipper_length_m. There are many values of flipper_length_m: 0.172, 0.174, 0.186, 0.194, 0.217, 0.231, just to name a few (it would be a long list to name them all!). The variable gentoo, in contrast, has only two possible values: 0 or 1. So the data points are all lined up in one of two places on the horizontal x-axis: above the 0 or above the 1.
Notice that the variable body_mass_kg still varies quantitatively, so the data points are spread out along the vertical y-axis even though they are clustered on the x-axis.
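One way to see why the points line up in two columns is to count the distinct values each variable takes. The sketch below is only an illustration: it assumes the six rows from the table earlier on this page rather than the full data frame, and the names n_flipper_values and n_gentoo_values are made up.

```r
# Values copied from the six-row table earlier on this page
flipper_length_m <- c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
gentoo <- c(0, 1, 0, 1, 0, 0)

# How many different values does each variable take in these rows?
n_flipper_values <- length(unique(flipper_length_m))
n_gentoo_values <- length(unique(gentoo))

print(n_flipper_values)
print(n_gentoo_values)
```

Even in this tiny sample, every penguin has its own flipper length, while gentoo takes only two values, so a plot with gentoo on the x-axis can have at most two columns of points.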
We can look at the graph again in light of the hypothesized relationship we were investigating: body_mass_kg = gentoo + other stuff. Does the variable gentoo explain some of the variation in body_mass_kg?
If gentoo explains some of the variation in body_mass_kg, then knowing whether a penguin is a gentoo or not would help us make a better prediction of its body mass than if we didn’t know what species it is. If we know that a penguin is a gentoo, we would predict it to have a larger body mass (maybe 5 kg, the middle of the gentoo penguin distribution) than if it is not a gentoo, in which case we might predict something closer to 4 kg. These wouldn’t be very accurate predictions, but they’d be a little bit better than if we didn’t know anything about the species of the penguin.
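The idea of predicting from the middle of each group can be sketched in base R. This is a rough illustration, not the course's method: it assumes only the six rows from the table earlier on this page, it uses the mean as the "middle" of each group, and it measures accuracy by the average distance between each prediction and the actual body mass, which is just one possible choice. The names group_means, error_overall, and error_group are made up.

```r
# Values copied from the six-row table earlier on this page
body_mass_kg <- c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
gentoo <- c(0, 1, 0, 1, 0, 0)

# Prediction if we know nothing about species: the overall mean
overall_mean <- mean(body_mass_kg)

# Prediction if we know the gentoo code: the mean of that group
group_means <- tapply(body_mass_kg, gentoo, mean)
group_prediction <- group_means[as.character(gentoo)]

# Average distance between prediction and actual value (assumed measure)
error_overall <- mean(abs(body_mass_kg - overall_mean))
error_group <- mean(abs(body_mass_kg - group_prediction))

print(group_means)
print(c(overall = error_overall, by_group = error_group))
```

In these six rows, the group-based predictions land closer to the actual values on average, which is what it would mean for gentoo to explain some of the variation in body_mass_kg. With so few rows this is only suggestive.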