1.7 Exploring Multivariate Hypotheses with Visualizations

Now let’s get crazy. Maybe we could make a better prediction about body mass if we knew both the flipper length and whether the penguin was a gentoo penguin!

This is called a multivariate hypothesis because it has not just one predictor variable but two! (A multivariate model has more than one predictor variable.)

We can explore multivariate hypotheses with data visualizations in a few ways. One way is to start with a basic scatter plot (such as the one below) and add in color to represent the other predictor variable (by adding the argument color = ~gentoo). (We did this earlier when we added the variable female to the plot.)

Try adding a color argument in the code block below to color gentoo penguins differently from non-gentoo in the scatter plot of body mass by flipper length.

require(coursekata)

# add color according to the gentoo variable
gf_point(body_mass_kg ~ flipper_length_m, data = penguins)

# one possible solution, with color added according to the gentoo variable
gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = ~gentoo)

A scatter plot of body_mass_kg predicted by flipper_length_m. The data points for gentoo penguins are colored purple and the non-gentoo penguins are in teal. The purple dots are more clustered in the upper right portion of the plot.
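To see what the ~ in color = ~gentoo is doing, here is a minimal sketch that compares it with a fixed color. It assumes the ggformula package (which provides gf_point()) is installed, and it uses a small made-up data frame, toy_penguins, as a stand-in for penguins. The values and the "yes"/"no" coding of gentoo are our assumptions, not the real data.

```r
# Made-up stand-in for the penguins data frame (all values are hypothetical)
toy_penguins <- data.frame(
  flipper_length_m = c(0.181, 0.186, 0.195, 0.210, 0.217, 0.230),
  body_mass_kg     = c(3.75, 3.80, 3.25, 4.50, 5.70, 5.40),
  gentoo           = c("no", "no", "no", "yes", "yes", "yes")
)

# Only draw the plots if ggformula is available
if (requireNamespace("ggformula", quietly = TRUE)) {
  library(ggformula)

  # With a formula (~gentoo), color is mapped to the values of a variable
  mapped_plot <- gf_point(body_mass_kg ~ flipper_length_m,
    data = toy_penguins, color = ~gentoo)

  # With a quoted color name, every point gets the same color
  fixed_plot <- gf_point(body_mass_kg ~ flipper_length_m,
    data = toy_penguins, color = "purple")

  print(mapped_plot)
  print(fixed_plot)
}
```

The first plot should come with a legend for gentoo, because the color now carries information. The second plot has no legend, because every dot is the same color.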

Is It Possible to Have More Than Two Predictor Variables?

Exploring variation with graphs is like a detective game. Patterns you notice when graphing data will often lead to new hypotheses and new word equations. And yes, you can have many predictor variables. Let’s look at an example.

When we looked at the plot above where we put gentoo penguins in a different color than the others, it reminded us of a puzzle we encountered earlier when we used the color argument to represent female. Here are the two graphs side by side.

Colored by female (left): a scatter plot of body_mass_kg predicted by flipper_length_m. The data points for female penguins are colored purple and the male penguins are in teal.

Colored by gentoo (right): a scatter plot of body_mass_kg predicted by flipper_length_m. The data points for gentoo penguins are colored purple and the non-gentoo penguins are in teal.

Earlier we were puzzled by the fact that the female vs. male difference appeared to be repeated in two clumps of dots. Now we can see that the two clumps were defined by species, gentoo vs. others. It now looks as though, in addition to flipper length, both female and gentoo explain variation in body mass.
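One way to double-check a pattern like this is to compute the average body mass for each combination of female and gentoo. The sketch below uses base R's tapply() on a made-up data frame. The name toy_penguins, its values, and the "yes"/"no" coding are all hypothetical, so the numbers illustrate the idea and say nothing about the real penguins data.

```r
# Made-up data: 2 penguins in each female-by-gentoo combination
toy_penguins <- data.frame(
  female       = c("yes", "no", "yes", "no", "yes", "no", "yes", "no"),
  gentoo       = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  body_mass_kg = c(3.4, 4.0, 3.6, 4.2, 4.6, 5.4, 4.8, 5.6)
)

# Mean body mass for each combination (rows are female, columns are gentoo)
group_means <- tapply(
  toy_penguins$body_mass_kg,
  list(female = toy_penguins$female, gentoo = toy_penguins$gentoo),
  mean
)
print(group_means)
```

In this toy example the means differ across the rows (female) and also across the columns (gentoo), which is the numeric version of seeing the female vs. male difference repeated in two clumps.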

Size, Shape, and Facets

Note that in addition to arguments like color, you might also want to try exploring arguments like size and shape with gf_point(). You can do almost anything you want to do when graphing in R; the sky’s the limit.

In the following line of code we added size = 3 to make the dots larger. You can try experimenting with different sizes.

gf_point(body_mass_kg ~ flipper_length_m, data = penguins, 
  color = ~female, size = 3)

A scatter plot of body_mass_kg predicted by flipper_length_m. The data points for female penguins are colored purple and the male penguins are in teal. The dots are slightly larger than in the previous plots.

If you want to represent gentoo as well as female and flipper_length_m in the same plot, you could add the argument shape = ~gentoo to the line of code above.

gf_point(body_mass_kg ~ flipper_length_m, data = penguins, 
  color = ~female, size = 3, shape = ~gentoo)

A scatter plot of body_mass_kg predicted by flipper_length_m. The data points for female penguins are colored purple and the male penguins are in teal. Additionally, the data points for gentoo penguins are shaped as triangles, and the data points for non-gentoo penguins are circles.

In addition to the males being teal and females being purple, the gentoo penguins are represented by triangles and the non-gentoo by circles. The possibilities, really, are endless.

Just for fun, we will teach you one more way to look at a multivariate hypothesis. We can make separate facets (or panels) of scatter plots – one for each category of a categorical variable (such as gentoo) – by piping on (%>%) a new function, gf_facet_wrap().

gf_point(body_mass_kg ~ flipper_length_m, data = penguins,
  color = ~female, shape = ~gentoo) %>%
  gf_facet_wrap(~ gentoo)

A scatter plot of body_mass_kg predicted by flipper_length_m, faceted by gentoo, with non-gentoo penguins plotted in the plot on the left, and gentoo penguins plotted in the plot on the right. The data points for female penguins are colored purple and the male penguins are in teal. Additionally, the data points for gentoo penguins are shaped as triangles, and the data points for non-gentoo penguins are circles.

Try playing around with gf_facet_wrap() in the code block below. <Run> it with the categorical variable island. Then <Run> it with the quantitative variable bill_length_cm. Use the <Submit> button when you have the faceted visualization that you think is most helpful.

require(coursekata)

# try faceting by island
# then try faceting by bill_length_cm
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_facet_wrap(~ gentoo)

# one possible solution, faceted by island
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_facet_wrap(~ island)
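When you facet by a quantitative variable, gf_facet_wrap() makes a separate panel for every distinct value, which can leave you with many panels that hold only a dot or two each. One workaround, sketched here as a suggestion rather than as the course's own method, is to group the values into a few bins with base R's cut() and then facet by the bins. The data frame toy_penguins and the new variable bill_group are made up for illustration, and the plot assumes the ggformula package is installed.

```r
# Made-up stand-in for the penguins data frame (all values are hypothetical)
toy_penguins <- data.frame(
  flipper_length_m = c(0.181, 0.186, 0.195, 0.210, 0.217, 0.230),
  body_mass_kg     = c(3.75, 3.80, 3.25, 4.50, 5.70, 5.40),
  bill_length_cm   = c(3.6, 3.9, 4.1, 4.6, 4.9, 5.2)
)

# Faceting by bill_length_cm directly would give one panel per distinct value
n_distinct_values <- length(unique(toy_penguins$bill_length_cm))

# Group the bill lengths into 3 equal-width bins (bill_group is a new factor)
toy_penguins$bill_group <- cut(toy_penguins$bill_length_cm, breaks = 3)
n_bins <- nlevels(toy_penguins$bill_group)

# Only draw the plot if ggformula is available
if (requireNamespace("ggformula", quietly = TRUE)) {
  library(ggformula)

  # |> is R's built-in pipe (R 4.1 or later); it plays the role of %>% here
  binned_plot <- gf_point(body_mass_kg ~ flipper_length_m, data = toy_penguins) |>
    gf_facet_wrap(~ bill_group)
  print(binned_plot)
}
```

With this toy data, faceting by the raw variable would produce as many panels as there are distinct bill lengths, while faceting by bill_group produces only three. The choice of three bins is arbitrary; try other values of breaks to see what is most helpful.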
