8.10 Models with Multiple Quantitative Predictors
We have used the General Linear Model approach in models with one categorical predictor and one quantitative predictor, and with models that have two categorical predictors. Let’s add one more type of model to the mix, one with two quantitative predictor variables.
Example: Lung Capacity in Young People
Here is a snippet of a data frame called fevdata
collected by medical researchers interested in children’s lung function.
They used a measure of lung capacity called forced expiratory volume
(FEV), which is the amount of air an individual can exhale in the first
second of a forceful breath (in liters). They collected this data from a
sample of 654 young people ages 3-19. FEV is often used as a measure of
lung capacity and pulmonary health.
head(select(fevdata, FEV, HEIGHT, AGE))
    FEV HEIGHT AGE
1 1.708   57.0   9
2 1.724   67.5   8
3 1.720   54.5   7
4 1.558   53.0   9
5 1.895   57.0   9
6 2.336   61.0   8
In addition to FEV, we’ve shown two other variables here: HEIGHT (measured in inches) and AGE (measured in years). Most people know that as children get older (in this age range) they tend to get taller, and researchers have found that both being older and being taller are associated with increased FEV.
We can write this relationship between FEV, height and age with a word equation like this: FEV = HEIGHT + AGE + error.
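In the GLM notation used for the parameter estimates later in this section (\(b_0\), \(b_1\)), the word equation could be written as shown here. The subscript \(i\) (one value per young person), the symbol \(b_2\) for the age coefficient, and \(e_i\) for error are notational assumptions for illustration rather than symbols taken from this page.

```latex
\[
\text{FEV}_i = b_0 + b_1\,\text{HEIGHT}_i + b_2\,\text{AGE}_i + e_i
\]
```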
Visualizing the Relationships Between FEV, HEIGHT and AGE
We can visualize relationships with categorical predictors in a number of ways. When we only have quantitative predictors, however, we are mostly limited to scatter plots and jitter plots.
Let’s start with a basic scatter plot of FEV predicted by height, and then add age to the plot using the argument color = ~AGE, which colors the dots based on age.
require(coursekata)
# basic scatter plot of FEV predicted by HEIGHT
gf_point(FEV ~ HEIGHT, data = fevdata)
# the same plot, with the dots colored by AGE
gf_point(FEV ~ HEIGHT, data = fevdata, color = ~AGE)
In this plot, each dot represents a young person. Taller people are represented by the dots on the right; older people are represented by lighter-colored dots (more yellow).
There is redundancy between the two explanatory variables HEIGHT and AGE: older kids tend to be taller. Using the concepts we developed earlier, we can use a multivariate model to figure out how much of the variation in FEV is uniquely related to HEIGHT and how much is uniquely related to AGE.
A Multivariate Model of FEV
Use the code window below to fit a model that uses both HEIGHT and AGE to predict FEV and output the best-fitting parameter estimates.
require(coursekata)
# find the best-fitting parameter estimates
lm(FEV ~ HEIGHT + AGE, data = fevdata)
Call:
lm(formula = FEV ~ HEIGHT + AGE, data = fevdata)
Coefficients:
(Intercept)       HEIGHT          AGE
   -4.61047      0.10971      0.05428
The \(b_0\) estimate (-4.61) is the predicted lung function for someone who has a height of 0 and an age of 0. Of course, no one is 0 inches tall and no one is 0 years old, and a negative lung capacity is also not possible. A y-intercept is mathematically necessary for defining a straight line, but we should always be careful not to believe model predictions that go far beyond the reaches of our data!
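The \(b_1\) estimate (0.11), for HEIGHT, is the predicted increase in FEV for each one-unit (i.e., one inch) increase in height, controlling for age. Likewise, the \(b_2\) estimate (0.05), for AGE, is the predicted increase in FEV for each additional year of age, controlling for height.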
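To make these estimates concrete, here is a minimal sketch in base R that plugs the coefficients reported above into the model equation by hand. The example child (60 inches tall, 10 years old) and the variable names are hypothetical choices for illustration, not values discussed in the text.

```r
# coefficients copied from the lm() output above
b0 <- -4.61047  # intercept
b1 <- 0.10971   # HEIGHT: liters of FEV per inch
b2 <- 0.05428   # AGE: liters of FEV per year

# hypothetical child, chosen only for illustration
height <- 60
age <- 10

# predicted FEV = b0 + b1 * HEIGHT + b2 * AGE
predicted_fev <- b0 + b1 * height + b2 * age
predicted_fev

# one more inch of height at the same age changes the prediction by b1
taller_fev <- b0 + b1 * (height + 1) + b2 * age
taller_fev - predicted_fev
```

For this hypothetical child the prediction works out to about 2.51 liters, and the difference between the two predictions is just the HEIGHT coefficient, which is what "controlling for age" means in an additive model.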
Visualizing the Model Predictions
In an earlier model, PriceK ~ Neighborhood + HomeSizeK, we were able to represent the model predictions as two parallel lines, one for each neighborhood. This was possible because Neighborhood was a categorical variable with two levels.
In the current model, however, both HEIGHT and AGE are continuous variables. If we plot the model predictions for FEV ~ HEIGHT, we get a single straight line. But to add AGE into the model, we would need to plot many parallel lines, one for each possible age. That is a lot of lines, and it’s not easy to visualize.
One way we can visualize the FEV ~ HEIGHT + AGE model is to pick several ages, then plot the lines that use HEIGHT to predict FEV for each of these ages. The gf_model() function will, by default, plot three lines for age: one for the average age (around 10 years old in fevdata), another for the age 1 standard deviation below average, and the third for the age 1 standard deviation above average.
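The idea of one line per age can be sketched in base R. This is not how gf_model() is implemented; it is only an illustration that reuses the coefficients reported earlier. The three ages below (7, 10 and 13) are assumed stand-ins for the average age and one standard deviation on either side of it; the actual values would come from mean(fevdata$AGE) and sd(fevdata$AGE).

```r
# coefficients copied from the lm() output earlier in this section
b0 <- -4.61047
b1 <- 0.10971
b2 <- 0.05428

# assumed ages standing in for mean - 1 SD, mean, and mean + 1 SD
ages <- c(below = 7, average = 10, above = 13)

# for a fixed age, the model is a straight line in HEIGHT:
#   FEV = (b0 + b2 * age) + b1 * HEIGHT
lines_by_age <- data.frame(
  age = ages,
  intercept = b0 + b2 * ages,  # shifts up as age increases
  slope = b1                   # identical for every age
)
lines_by_age
```

Each row describes one of the plotted lines. Only the intercept changes with age, which is why the lines never cross.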
In the code window below, try chaining on gf_model() to the plot we made before.
require(coursekata)
# this saves the multivariate model
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)
# this adds the multi_model predictions to the scatter plot
gf_point(FEV ~ HEIGHT, data = fevdata, color = ~AGE, alpha = .2) %>%
  gf_model(multi_model)
The fact that the lines for the different ages we have plotted are parallel is a feature of the model we have fit. In defining the model as additive (HEIGHT + AGE), we constrained the model to assume that the effect of HEIGHT is the same for every value of AGE (and also that the effect of AGE is the same for every HEIGHT). In the next chapter we will build interaction models in which this constraint is removed.
In this model, the fact that the lines are parallel indicates that, regardless of age, we add the same amount per inch of height to our prediction of FEV, which thus makes the slopes of the lines we have plotted the same. Similarly, regardless of height, the amount we add per additional year of age is the same.
We can even use the same multi_model but just switch around the axes and colors on our graph.
# reuse the same model, but put AGE on the x-axis and color the dots by HEIGHT
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)
gf_point(FEV ~ AGE, data = fevdata, color = ~HEIGHT, alpha = .2) %>%
  gf_model(multi_model)
Visualizing the Model in Three Dimensions
Another way we can picture this data and model is by using 3 dimensions. In the gif below, we have plotted FEV, height, and age in a 3-dimensional scatter plot.
If we look at the model predictions (in the gif below), the many lines (each age’s line has the same color dots) actually make a flat surface – a plane!
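One way to see why the predictions form a plane is to compute them over a grid of heights and ages. The sketch below uses base R and the coefficients reported earlier; the grid values are assumptions chosen for illustration and are not the actual ranges in fevdata.

```r
# coefficients copied from the lm() output earlier in this section
b0 <- -4.61047
b1 <- 0.10971
b2 <- 0.05428

# assumed grid of predictor values (illustrative only)
heights <- seq(46, 74, by = 4)  # inches
ages <- seq(3, 19, by = 2)      # years

# matrix of predictions: one row per height, one column per age
pred_grid <- outer(heights, ages, function(h, a) b0 + b1 * h + b2 * a)
dimnames(pred_grid) <- list(height = heights, age = ages)

# change in prediction for each 4-inch step in height, within every age
height_steps <- apply(pred_grid, 2, diff)
# change in prediction for each 2-year step in age, within every height
age_steps <- apply(pred_grid, 1, diff)

# a flat surface rises by the same amount for every step in a given direction
range(height_steps)
range(age_steps)
```

Every step along the height direction changes the prediction by the same amount (4 times the HEIGHT coefficient), and every step along the age direction changes it by 2 times the AGE coefficient. A surface with a constant rise in each direction is a plane.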
We’ve used a special R package called plotly to create these 3D visualizations. In the code window below, we have put in code for creating a 3D graph of the model predictions. If you run it, you’ll be able to move around the 3D visualization for yourself.
require(coursekata)
# this loads plotly
library(plotly)
# this saves the predictions of the multivariate model
fevdata$prediction <- predict(lm(FEV ~ HEIGHT + AGE, data = fevdata))
# this makes the 3d plot of model
plot_ly(fevdata, x = ~AGE, y = ~HEIGHT, z = ~prediction, type = "scatter3d", mode = "markers", color = ~AGE, size = 1)