3.6 Using a Quantitative Explanatory Variable in a Model

Neighborhood is a categorical variable. The Neighborhood model is what we might call a group model because it uses the group mean as the best predictor of home prices within each group (in this case, neighborhood).

Not all models are group models, however. If we want to use a quantitative variable as an explanatory variable we will need to adjust our model a bit. Models that use quantitative predictors are often referred to as regression models.

The HomeSizeK Model of PriceK

One quantitative variable in the Ames data frame that might explain some of the variation in PriceK is HomeSizeK: the total indoor square footage of the home in thousands of square feet. (Note: a value of 1.5 on HomeSizeK would mean that the home is 1500 square feet.)

In the previous chapter we created a scatter plot to visualize the relationship between PriceK and HomeSizeK. We’ve reprinted that scatter plot below.

gf_point(PriceK ~ HomeSizeK, data = Ames)

A scatter plot of PriceK by HomeSizeK in the Ames data frame.

As we noted previously, it does appear that if we know the square footage of a home we can make a better guess as to its price than if we didn’t have this information. Larger homes tend to have higher prices, and smaller homes, lower prices. We call a pattern like this a positive relationship because as one variable goes up, so does the other.

If we want to make specific predictions, and quantitatively compare the HomeSizeK model to other models, we need to turn it into a statistical model, much like we did when we developed the Neighborhood model. This time, however, we can’t use group means as the model because there are no groups!

Just as group means are the simplest way to use categorical explanatory variables in models with quantitative outcome variables (such as PriceK), a line, called the regression line, is the simplest way to model the relationship between two quantitative variables. We overlaid the regression line (or model) on the scatter plot below.

A scatter plot of PriceK by HomeSizeK in the Ames data frame overlaid with the regression line in red.
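One way a plot like this can be drawn is by chaining gf_lm(), a ggformula function that overlays the best-fitting line, onto the scatter plot. This is just a sketch of one option; we will see the approach used in this course (gf_model()) later on.

gf_point(PriceK ~ HomeSizeK, data = Ames) %>%
  gf_lm(color = "red")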

We will learn how to fit a regression model (i.e., find the best-fitting line) using R in a moment, but first it’s worth pointing out that the regression line is not just any line, just like the mean is not just any number.

Just as the mean is the single number that minimizes the sum of squared residuals, the regression line is the single line, defined by its slope and y-intercept, that minimizes the sum of squared residuals. Let's dig into what that really means.
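To see what "minimized" means concretely, here is a sketch (jumping ahead to lm(), which we introduce formally below) that compares the regression line's sum of squared residuals to that of a line with a slightly different slope. The +10 nudge to the slope is an arbitrary choice for illustration; any other line would also produce a larger sum.

# fit the regression line (explained in the next section)
HomeSizeK_model <- lm(PriceK ~ HomeSizeK, data = Ames)
b0 <- coef(HomeSizeK_model)[1]   # y-intercept
b1 <- coef(HomeSizeK_model)[2]   # slope

# sum of squared residuals for the regression line
sum((Ames$PriceK - (b0 + b1 * Ames$HomeSizeK))^2)

# the same sum for a line with a slightly different slope;
# any line other than the regression line produces a larger sum
sum((Ames$PriceK - (b0 + (b1 + 10) * Ames$HomeSizeK))^2)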

Predictions from the HomeSizeK Model

We will use the lm() function to fit the HomeSizeK model in the same way we did with the group model. You don’t have to tell R that this is a regression model; R figures it out based on the fact that your explanatory variable is quantitative, not categorical.

Use the code window below to fit the HomeSizeK model using lm(), and then save it into an object called HomeSizeK_model. Then write some code to generate the model predictions, and save them as a new column in the Ames data frame. (HINT: Go back a few pages and look at how you generated predictions from the Neighborhood model if you get stuck.)

library(coursekata)

# edit the Neighborhood_model code to create HomeSizeK_model
HomeSizeK_model <- lm(PriceK ~ HomeSizeK, data = Ames)

# save the predictions of the HomeSizeK_model as a new variable in Ames
Ames$HomeSizeK_predict <- predict(HomeSizeK_model)

# this code prints out the first 6 observations
head(select(Ames, PriceK, HomeSizeK, HomeSizeK_predict))
  PriceK HomeSizeK HomeSizeK_predict
1    260     1.734          209.5258
2    210     1.436          177.7587
3    155     1.826          219.3331
4    125     0.825          112.6255
5    110     0.924          123.1790
6    100     0.968          127.8695
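Where do these predictions come from? Each one is just the model's y-intercept plus its slope times that home's HomeSizeK. Here is a sketch of how to check this for the first home (1.734 thousand square feet); the coefficient values in the comments are approximations read off from the predictions printed above.

# the model's y-intercept (b0) and slope (b1)
b0 <- coef(HomeSizeK_model)[1]   # roughly 24.7
b1 <- coef(HomeSizeK_model)[2]   # roughly 106.6

# reproduce the first prediction by hand: b0 + b1 * 1.734
b0 + b1 * 1.734   # roughly 209.53, matching HomeSizeK_predict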

We ran the code below to overlay the predicted prices from the HomeSizeK model onto the original scatter plot of actual home prices. (The predictions are represented by red circles, created by adding shape and color arguments to the gf_point() function.)

Ames$prediction <- predict(HomeSizeK_model)

gf_point(PriceK ~ HomeSizeK, data = Ames) %>%
  gf_point(prediction ~ HomeSizeK, shape = 1, size = 3, color = "firebrick")

A scatter plot of PriceK by HomeSizeK in the Ames data frame. It is overlaid with all of the point predictions from the HomeSizeK_model in red. The predictions are aligned along the same path as the regression line.

See how all the predictions seem to fall in a straight line? This is no accident! It’s because the predictions were generated by the regression line that R fit to the data.
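In fact, because the predictions come from a line, the model can generate a predicted price for any home size, even one that doesn't appear in the data. For instance, here is one way to predict the price of a hypothetical 2,000-square-foot home (the value 2.0 is just for illustration).

# predicted PriceK for a home of 2.0 thousand square feet (a hypothetical value)
predict(HomeSizeK_model, newdata = data.frame(HomeSizeK = 2.0))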

If we chain on gf_model() to our scatter plot, the best-fitting model lies right on top of the model predictions.

gf_point(PriceK ~ HomeSizeK, data = Ames) %>%
  gf_point(prediction ~ HomeSizeK, shape = 1, size = 3, color = "firebrick") %>%
  gf_model(HomeSizeK_model, color = "red")

A scatter plot of PriceK by HomeSizeK in the Ames data frame. It is overlaid with all of the point predictions from the HomeSizeK_model in red, as well as the regression line. The point predictions are perfectly aligned and overlap with the regression line.
