1.6 Quantitative and Categorical Variables

Broadly speaking, variables can be divided into two types: quantitative and categorical.

Quantitative Variables. The variables we have been looking at so far (body_mass_kg and flipper_length_m) are examples of quantitative variables. Quantitative variables are assigned numerical values (e.g., 4.2 or 0.194), and these values represent quantities. This means that observations with higher numbers are assumed to have more of the quantity than those with lower numbers. For example, we can assume that a penguin with a flipper_length_m of 0.194 actually has a longer flipper than one with a value of 0.12.

Photo of gentoo penguins from critterfacts.com

Categorical Variables. The values assigned to categorical variables do not represent quantities. Instead, they represent categories. One categorical variable in the penguins data frame is female, which we used earlier to indicate the penguin’s sex.

Another example in the penguins data frame is the variable gentoo. Gentoo is a penguin species (depicted in the photo above).

  species gentoo body_mass_kg flipper_length_m bill_length_cm female    island
1  Adelie      0        4.200            0.194            4.6      0 Torgersen
2  Gentoo      1        4.375            0.217            4.6      1    Biscoe
3  Adelie      0        3.950            0.185            3.8      0    Biscoe
4  Gentoo      1        5.700            0.218            5.0      0    Biscoe
5  Adelie      0        4.000            0.210            4.4      0 Torgersen
6  Adelie      0        3.000            0.192            3.7      1     Dream

Even though the variable gentoo is assigned values such as 1 or 0, these numbers don’t mean that the penguin with 1 has more of something than penguins coded as 0. Instead, these numbers just stand for different categories (either being a gentoo penguin or not).
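To see this distinction in R, here is a minimal sketch that rebuilds only the six rows printed above as a small data frame. The name penguins_six and the labels "not gentoo" and "gentoo" are made up for this illustration; they are not part of the course data. Relabeling the codes with base R's factor() makes it explicit that 0 and 1 are names for categories, not amounts.

```r
# Hypothetical mini data frame: only the six rows shown in the output above
penguins_six <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Gentoo", "Adelie", "Adelie"),
  gentoo = c(0, 1, 0, 1, 0, 0),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000),
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
)

# Quantitative: comparing values is meaningful (a bigger number is a longer flipper)
second_is_longer <- penguins_six$flipper_length_m[2] > penguins_six$flipper_length_m[1]
second_is_longer

# Categorical: attach labels (assumed names) so the codes read as categories
gentoo_label <- factor(penguins_six$gentoo, levels = c(0, 1),
                       labels = c("not gentoo", "gentoo"))

# Count how many of the six penguins fall in each category
table(gentoo_label)
```

The comparison on flipper_length_m answers a "more or less" question, while the table only counts group membership, which is all a categorical variable can tell us.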

Hypotheses with Categorical Variables

We’ve been translating hypotheses into word equations and then into scatter plots. We explored the hypothesis that we can make a better prediction of body mass if we know a penguin’s flipper length. We translated this hypothesis into the word equation: body_mass_kg = flipper_length_m + other stuff, and then made a scatter plot to explore the relationship.

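Now let’s swap out the quantitative predictor variable flipper_length_m for a categorical predictor variable, gentoo. The hypothesis becomes that we can make a better prediction of body mass if we know whether a penguin is a gentoo. As a word equation: body_mass_kg = gentoo + other stuff.

In the code block below, make a scatter plot to explore this hypothesis.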

require(coursekata)

# make a scatter plot
gf_point(body_mass_kg ~ gentoo, data = penguins)

A scatter plot of body_mass_kg predicted by gentoo. The points are stacked up vertically along the y-axis at two points along the x-axis: at a value of zero and a value of one.

Although this graph might look a little weird at first, it is also a scatter plot! It’s just that gentoo is a very different kind of variable than flipper_length_m. There are many values of flipper_length_m: 0.172, 0.174, 0.186, 0.194, 0.217, and 0.231, just to name a few (it would be a long list to name them all!). The variable gentoo, in contrast, has only two possible values: 0 or 1. So the data points are all lined up in one of two places on the horizontal x-axis: above the 0 or above the 1.

Notice that the variable body_mass_kg still varies quantitatively so the data points are spread out along the vertical y-axis even though they are clustered on the x-axis.
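One way to check why the points line up in two columns is to count the distinct values of each predictor. The sketch below uses only the six rows printed earlier in this section, stored in a hypothetical data frame called penguins_six, so the counts describe those six rows rather than the full penguins data frame.

```r
# Hypothetical mini data frame: only the six rows printed earlier in this section
penguins_six <- data.frame(
  gentoo = c(0, 1, 0, 1, 0, 0),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000),
  flipper_length_m = c(0.194, 0.217, 0.185, 0.218, 0.210, 0.192)
)

# Number of distinct x-axis positions each predictor would produce
n_gentoo_values <- length(unique(penguins_six$gentoo))
n_flipper_values <- length(unique(penguins_six$flipper_length_m))

n_gentoo_values
n_flipper_values

# The outcome still takes many different values within the two columns
n_mass_values <- length(unique(penguins_six$body_mass_kg))
n_mass_values
```

Even in this tiny sample, gentoo takes just two distinct values while every flipper length and body mass is different, which is why the plot has two vertical stacks of points.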

A scatter plot of body_mass_kg predicted by gentoo. The points are stacked up vertically along the y-axis at two points along the x-axis: at a value of zero and a value of one. The points stacked over the value of one are generally higher on the y-axis than the points at a value of zero.

We can look at the graph again in light of the hypothesized relationship we were investigating: body_mass_kg = gentoo + other stuff. Does the variable gentoo explain some of the variation in body_mass_kg?

If gentoo explains some of the variation in body_mass_kg then knowing whether a penguin is a gentoo or not would help us make a better prediction of its body mass than if we didn’t know what species it is. If we know that a penguin is a gentoo, we would predict it to have a larger body mass (maybe 5 kg, the middle of the gentoo penguin distribution) than if it is not a gentoo, in which case we might predict something closer to 4 kg. These wouldn’t be very accurate predictions, but they’d be a little bit better than if we didn’t know anything about the species of the penguin.
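The idea of a "better prediction" can be sketched in base R. The example below is an illustration under stated assumptions, not the course's method: it uses only the six rows printed earlier in this section (in a hypothetical data frame named penguins_six), takes the mean as the prediction, and measures error as the sum of squared differences between each penguin's body mass and its prediction.

```r
# Hypothetical mini data frame: only the six rows printed earlier in this section
penguins_six <- data.frame(
  gentoo = c(0, 1, 0, 1, 0, 0),
  body_mass_kg = c(4.200, 4.375, 3.950, 5.700, 4.000, 3.000)
)

# Prediction if we know nothing about species: the overall mean body mass
overall_mean <- mean(penguins_six$body_mass_kg)

# Prediction if we know gentoo: the mean body mass within each group
group_means <- tapply(penguins_six$body_mass_kg, penguins_six$gentoo, mean)
group_means

# Look up each penguin's group mean using its gentoo code ("0" or "1")
group_prediction <- group_means[as.character(penguins_six$gentoo)]

# Total squared prediction error under each approach (smaller is better)
error_without_gentoo <- sum((penguins_six$body_mass_kg - overall_mean)^2)
error_with_gentoo <- sum((penguins_six$body_mass_kg - group_prediction)^2)

error_without_gentoo
error_with_gentoo
```

In these six rows the group means come out close to the 5 kg and 4 kg guesses described above, and the error is smaller when gentoo is used. Six rows are far too few to draw conclusions about penguins in general, so treat this only as a picture of the reasoning.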
