Course Outline
-
segmentGetting Started (Don't Skip This Part)
-
segmentCollege / Advanced Statistics with R (ABCD)
-
segmentPART I: EXPLORING VARIATION
-
segmentChapter 1 - Welcome to Statistics: A Modeling Approach
-
segmentChapter 2 - Understanding Data
-
segmentChapter 3 - Examining Distributions
-
segmentChapter 4 - Explaining Variation
-
segmentPART II: MODELING VARIATION
-
segmentChapter 5 - A Simple Model
-
segmentChapter 6 - Quantifying Error
-
segmentChapter 7 - Adding an Explanatory Variable to the Model
-
segmentChapter 8 - Digging Deeper into Group Models
-
segmentChapter 9 - Models with a Quantitative Explanatory Variable
-
segmentPART III: EVALUATING MODELS
-
segmentChapter 10 - The Logic of Inference
-
segmentChapter 11 - Model Comparison with F
-
segmentChapter 12 - Parameter Estimation and Confidence Intervals
-
segmentPART IV: MULTIVARIATE MODELS
-
segmentChapter 13 - Introduction to Multivariate Models
-
13.2 Visualizing Price = Home Size + Neighborhood
-
segmentChapter 14 - Multivariate Model Comparisons
-
segmentChapter 15 - Models with Interactions
-
segmentChapter 16 - More Models with Interactions
-
segmentFinishing Up (Don't Skip This Part!)
-
segmentResources
list College / Advanced Statistics with R (ABCD)
13.2 Visualizing Price = Home Size + Neighborhood
Let’s explore this idea with some visualizations. We will start with
a graph of the home size model, plotting PriceK
by
HomeSizeK
, with this code:
gf_point(PriceK ~ HomeSizeK, data = Smallville)
. We will
then explore some ways we could visualize the effect of
Neighborhood
above and beyond that of
HomeSizeK
.
Using Facet Grids
Here’s a scatter plot of PriceK
by
HomeSizeK
for the 32 homes in Smallville.
One way to integrate Neighborhood
into the same
visualization is to make a grid of scatter plots, each one representing
a different neighborhood. We can do this by chaining on
gf_facet_grid(Neighborhood ~ .)
on top of the scatter
plot.
Because we put Neighborhood
before the tilde
(Neighborhood ~ .
) the two graphs will be stacked
vertically (i.e., along the y-axis). To put the graphs side-by-side
(i.e., in a grid along the x-axis), we would put the variable after the
tilde: . ~ Neighborhood
. Notice that in R, as in GLM
notation, we usually follow the form Y ~ X
.
In the code block below, try putting the two scatter plots, one for
each Neighborhood
, side by side in a horizontal grid.
require(coursekata)
# Make a horizontal grid of scatter plots using Neighborhood
gf_point(PriceK~ HomeSizeK, data = Smallville)
# Make a horizontal grid of scatter plots using Neighborhood
gf_point(PriceK~ HomeSizeK, data = Smallville) %>%
gf_facet_grid(. ~ Neighborhood)
ex() %>% check_function("gf_facet_grid") %>% {
check_arg(., "object") %>% check_equal()
check_arg(., 2) %>% check_equal()
}
Based on these plots, you can see that knowing both neighborhood
and home size would improve your predictions. One way to see
this is to look, within each neighborhood, at the prices of homes that
are between 1000 and 1500 square feet (i.e., HomeSizeK
between 1.0 and 1.5). We have colored them differently in the faceted
plot below. You can see that even for homes the same size, there still
are higher prices in Downtown than in Eastside.
Using Color
Another approach to adding neighborhood to the scatter plot of
PriceK
by HomeSizeK
is to assign different
colors to points representing homes from the different neighborhoods.
You can do this by adding color = ~Neighborhood
to the
scatter plot. (The ~
tilde tells R that
Neighborhood
is a variable.) Try it in the code block
below.
require(coursekata)
# Add in the color argument
gf_point(PriceK ~ HomeSizeK, data = Smallville)
# Add in the color argument
gf_point(PriceK ~ HomeSizeK, data = Smallville, color = ~ Neighborhood)
ex() %>%
check_function("gf_point") %>%
check_arg("color") %>%
check_equal()
We used this code (also overlaying the HomeSizeK
regression line on the scatter plot) to get the graph below.
HomeSizeK_model <- lm(PriceK ~ HomeSizeK, data = Smallville)
gf_point(PriceK ~ HomeSizeK, data = Smallville, color= ~ Neighborhood) %>%
gf_model(HomeSizeK_model, color = "black")
Adding the regression line makes it easier to see the error (or
residuals) leftover from the HomeSizeK
model. Notice that
the teal dots (homes from Downtown) are mostly above the regression line
(i.e., with positive residuals from the HomeSizeK
model)
while the purple dots (from Eastside) are mostly below the line
(negative residuals).
This indicates that Downtown homes are generally more expensive than
what the home size model would predict, while Eastside homes are less
expensive. This pattern is a clue that tells us that adding
Neighborhood
into the HomeSizeK
model will
explain additional variation in PriceK
above and beyond
that explained by just the home size model alone.