2.3 More Two-Variable Visualizations
Scatter Plots and Jitter Plots
Another common way to show the relationship between two variables is a scatter plot. A scatter plot shows each data point as a dot in two-dimensional space, with one variable (usually the explanatory variable) on the x-axis, and the other (the outcome variable) on the y-axis.
To make a scatter plot in ggformula, we use the gf_point() function. Let’s try using gf_point() to examine home prices by neighborhood.
gf_point(PriceK ~ Neighborhood, data = Ames)
When one of your variables is categorical (e.g., Neighborhood), the points can get bunched up on top of each other, making it hard to see them, especially if you have a lot of data points. Thankfully, there’s a solution to this problem: gf_jitter().
gf_jitter() adds some random noise to the plot, either horizontally, vertically, or both. This spreads the points out so you can see them better. We’ll use gf_jitter() to create a jitter plot of home prices by neighborhood.
gf_jitter(PriceK ~ Neighborhood, data = Ames, width = 0.1)
Note that we set the width argument to control how much the points get jittered on the horizontal axis. width usually takes a number between 0 and 1. Try experimenting with different values to get the plot that’s most useful to you.
If a point is in the Old Town column, it’s a home in Old Town. But being more to the left or right within the Old Town column doesn’t mean anything. The jitter is there just so the points do not overlap too much and obscure how many Old Town homes there are at a certain price.
Sometimes we may want to jitter in one direction but not the other. For example, we could include the argument height = 0 to tell R that we don’t want the points jittered vertically.
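As a minimal sketch of the height = 0 idea, the earlier jitter plot can be redrawn so the points spread only left and right. This assumes the coursekata package is installed and supplies the Ames data frame used in the examples above; the name jitter_plot is just an illustrative variable.

```r
library(coursekata)  # assumed to load ggformula and the Ames data frame

# Jitter horizontally only: width spreads the points left and right,
# while height = 0 keeps each point at its actual PriceK value
jitter_plot <- gf_jitter(PriceK ~ Neighborhood, data = Ames,
                         width = 0.1, height = 0)
jitter_plot
```

Because the vertical position is untouched, each dot still sits at its home’s actual price.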
Box Plots
gf_point() and gf_jitter() are useful. They give us a way to see each individual data point, yet at the same time notice clusters and patterns across all the points. There are times, however, when we want to transcend the individual data points and just see the pattern. Box plots are helpful in this regard.
Here’s how we would create a box plot of home prices broken down by neighborhood.
gf_boxplot(PriceK ~ Neighborhood, data = Ames)
Recall from the previous chapter that box plots are a way of visually representing the five-number summary: min, Q1, median, Q3, and max. By adding in an explanatory variable (in this case Neighborhood), we tell R to create two box plots side by side, one for each neighborhood.
If we want to get the five-number summary broken down by neighborhood, we can simply add an explanatory variable into the favstats() function, like this:
favstats(PriceK ~ Neighborhood, data = Ames)
  Neighborhood   min      Q1 median    Q3    max     mean       sd   n missing
1 CollegeCreek 110.0 179.675  203.5 230.0 424.87 204.5960 50.38751 134       0
2      OldTown  64.5 106.450  115.0 141.5 178.00 120.5555 26.51013  51       0
Here’s the code that generated the box plots of price broken down by neighborhood. Use the pipe operator (%>%) to overlay a jitter plot on top of the box plot.
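require(coursekata)
# box plots of price broken down by neighborhood
gf_boxplot(PriceK ~ Neighborhood, data = Ames)
# the same box plots with a jitter plot overlaid using the pipe
gf_boxplot(PriceK ~ Neighborhood, data = Ames) %>%
  gf_jitter()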
You didn’t have to, but we adjusted the width argument of the jitter plot (to 0.3) so that the points would stay mostly within the columns defined by the boxes. Notice that in each neighborhood, about 50% of the dots are in the boxes, with about 25% of the dots above the boxes and about 25% below.
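The width adjustment described here can be sketched as follows. This is one way to write it, assuming the coursekata package is installed and provides Ames and the pipe operator as in the earlier examples; boxplot_with_points is an illustrative variable name.

```r
library(coursekata)  # assumed to load ggformula, the pipe, and Ames

# Draw the box plots first, then pipe the result into gf_jitter()
# so the individual homes are layered on top of the boxes;
# width = 0.3 keeps the points mostly inside each box's column
boxplot_with_points <- gf_boxplot(PriceK ~ Neighborhood, data = Ames) %>%
  gf_jitter(width = 0.3)
boxplot_with_points
```

The order matters: the layer added last is drawn on top, so the jittered points stay visible over the boxes.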
Visualizing the Relationship Between Two Quantitative Variables
Thus far we’ve focused on visualizing the relationship between PriceK and Neighborhood. But there are other variables that might explain variation in PriceK, for example, HomeSizeK, which is the size of the home in thousands of square feet.
Scatter plots are usually the best way to explore the relationship between two quantitative variables (e.g., PriceK and HomeSizeK). Try making one in the code window below.
require(coursekata)
# make a scatter plot to explore the relationship between PriceK and HomeSizeK
gf_point(PriceK ~ HomeSizeK, data = Ames)