High School / Statistics and Data Science II (XCD)
3.6 Using a Quantitative Explanatory Variable in a Model
Neighborhood is a categorical variable. The Neighborhood model is what we might call a group model because it uses the group mean as the best predictor of home prices within each group (in this case, neighborhood).
Not all models are group models, however. If we want to use a quantitative variable as an explanatory variable, we will need to adjust our model a bit. Models that use quantitative predictors are often referred to as regression models.
The HomeSizeK Model of PriceK
One quantitative variable in the Ames data frame that might explain some of the variation in PriceK is HomeSizeK: the total indoor square footage of the home in thousands of square feet. (Note: a value of 1.5 on HomeSizeK would mean that the home is 1,500 square feet.)
In the previous chapter we created a scatter plot to visualize the relationship between PriceK and HomeSizeK. We’ve reprinted that scatter plot below.
gf_point(PriceK ~ HomeSizeK, data = Ames)
As we noted previously, it does appear that if we know the square footage of a home we can make a better guess as to its price than if we didn’t have this information. Larger homes tend to have higher prices, and smaller homes, lower prices. We call a pattern like this a positive relationship because as one variable goes up, so does the other.
If we want to make specific predictions, and quantitatively compare the HomeSizeK model to other models, we need to turn it into a statistical model, much like we did when we developed the Neighborhood model. This time, however, we can’t use group means as the model because there are no groups!
Just as group means are the simplest way to use categorical explanatory variables in models with quantitative outcome variables (such as PriceK), a line, called the regression line, is the simplest way to model the relationship between two quantitative variables. We overlaid the regression line (or model) on the scatter plot below.
We will learn how to fit a regression model (i.e., find the best-fitting line) using R in a moment, but first it’s worth pointing out that the regression line is not just any line, just like the mean is not just any number.
Just as the mean is the point at which the sum of squared residuals is minimized, the regression line is the exact line, defined by its slope and y-intercept, for which the sum of squared residuals is minimized. Let’s dig into what that really means.
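One way to see this is to compare the sum of squared residuals (SSR) from the regression line with the SSR from some other line. A minimal sketch, using simulated data in place of the Ames data:

```r
# Simulated stand-in for the Ames data (the real data come with coursekata)
set.seed(10)
toy <- data.frame(HomeSizeK = runif(20, 0.8, 2.5))
toy$PriceK <- 50 + 100 * toy$HomeSizeK + rnorm(20, sd = 20)

# The regression line fit by lm(), and its sum of squared residuals
fit <- lm(PriceK ~ HomeSizeK, data = toy)
ssr_best <- sum(resid(fit)^2)

# A competing line with a slightly different slope does worse
other_pred <- coef(fit)[[1]] + (coef(fit)[[2]] + 5) * toy$HomeSizeK
ssr_other <- sum((toy$PriceK - other_pred)^2)

ssr_best < ssr_other  # TRUE: no other line has a smaller SSR
```

Try changing the slope or intercept of the competing line by any amount, in either direction; the regression line's SSR will always come out smaller.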
Predictions from the HomeSizeK Model
We will use the lm() function to fit the HomeSizeK model in the same way we did with the group model. You don’t have to tell R that this is a regression model; R will guess, just based on the fact that your explanatory variable is quantitative, not categorical.
Use the code window below to fit the HomeSizeK model using lm(), and then save it into an object called HomeSizeK_model. Then write some code to generate the model predictions, and save them as a new column in the Ames data frame. (HINT: Go back a few pages and look at how you generated predictions from the Neighborhood model if you get stuck.)
library(coursekata)
# edit the Neighborhood_model code to create HomeSizeK_model
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Ames)
# save the predictions of the HomeSizeK_model as a new variable in Ames
Ames$HomeSizeK_predict <-
# this code prints out the first 6 observations
head(select(Ames, PriceK, HomeSizeK, HomeSizeK_predict))
# Solution
HomeSizeK_model <- lm(PriceK ~ HomeSizeK, data = Ames)
# save the predictions of the HomeSizeK_model as a new variable in Ames
Ames$HomeSizeK_predict <- predict(HomeSizeK_model)
# this code prints out the first 6 observations
head(select(Ames, PriceK, HomeSizeK, HomeSizeK_predict))
PriceK HomeSizeK HomeSizeK_predict
1 260 1.734 209.5258
2 210 1.436 177.7587
3 155 1.826 219.3331
4 125 0.825 112.6255
5 110 0.924 123.1790
6 100 0.968 127.8695
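Where do these predicted values come from? Each one is just the model's y-intercept plus its slope times that home's HomeSizeK. A sketch with simulated data (standing in for the Ames data) shows that computing predictions "by hand" from the coefficients matches predict() exactly:

```r
# Simulated stand-in for the Ames data
set.seed(10)
toy <- data.frame(HomeSizeK = runif(20, 0.8, 2.5))
toy$PriceK <- 50 + 100 * toy$HomeSizeK + rnorm(20, sd = 20)
fit <- lm(PriceK ~ HomeSizeK, data = toy)

# predict() just applies the fitted line: intercept + slope * HomeSizeK
b0 <- coef(fit)[[1]]  # y-intercept
b1 <- coef(fit)[[2]]  # slope
by_hand <- b0 + b1 * toy$HomeSizeK

all.equal(unname(predict(fit)), by_hand)  # TRUE
```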
We ran the code below to overlay the predicted prices of the HomeSizeK model onto the original scatter plot depicting the actual home prices. (The predictions are represented by red circles, accomplished by adding arguments for shape and color to the gf_point() function.)
Ames$prediction <- predict(HomeSizeK_model)
gf_point(PriceK ~ HomeSizeK, data = Ames) %>%
gf_point(prediction ~ HomeSizeK, shape = 1, size = 3, color = "firebrick")
See how all the predictions seem to fall in a straight line? This is no accident! It’s because the predictions were generated by the regression line that R fit to the data.
If we chain gf_model() onto our scatter plot, the best-fitting model lies right on top of the model predictions.
gf_point(PriceK ~ HomeSizeK, data = Ames) %>%
gf_point(prediction ~ HomeSizeK, shape = 1, size = 3, color = "firebrick") %>%
gf_model(HomeSizeK_model, color = "red")