College / Accelerated Statistics with R (XCD)
8.6 Using shuffle() for Targeted Model Comparisons (Part 2)
Having saved the residuals from the Neighborhood model, let’s now see
how we can use them, along with the shuffle() function, to create a
sampling distribution for the unique effect of HomeSizeK.
Step Two: Create the Sampling Distribution of F for Home Size
A sampling distribution of Fs gives us a way to calculate how likely
it would be for the simple model of the DGP (i.e., the one with no
unique effect of HomeSizeK) to generate an F for HomeSizeK as large as
or larger than the one found in the data (11.626).
Before we create the sampling distribution of F for the HomeSizeK
effect, we will show you how to get the sample F for HomeSizeK. Our
previous method, using the f() function, won’t work; it only gives us
the overall F for the full model. To get the F for HomeSizeK you can
run this code:
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
The first part of this code creates a supernova table for the
multivariate model using PriceK_N_resids as the outcome. The
predictor = ~HomeSizeK argument then reads the sample F for HomeSizeK
out of the table (without ever printing it out). We’ve put this code
in the window below, so you have this F available.
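To make the idea of reading one predictor's F out of a table concrete, here is a minimal base-R sketch. It is not the course's own method: it uses anova() in place of f() and supernova(), and the data frame toy with its columns price, hood, and size is simulated, hypothetical data standing in for Smallville. Because size is entered last and the model has no interaction, its row in the table carries the F for its unique effect.

```r
# Hypothetical toy data standing in for Smallville (not the real data set)
set.seed(1)
n <- 32
toy <- data.frame(
  hood = factor(rep(c("East", "West"), each = n / 2)),
  size = runif(n, min = 1, max = 3)
)
toy$price <- 200 + 50 * (toy$hood == "West") + 40 * toy$size + rnorm(n, sd = 30)

# Save the residuals from the one-predictor (neighborhood-only) model
toy$resids <- resid(lm(price ~ hood, data = toy))

# Fit the two-predictor model to the residuals and build its ANOVA table
tab <- anova(lm(resids ~ hood + size, data = toy))

# Read the F for size out of the table without printing the whole table
size_f <- tab["size", "F value"]
size_f
```

The same pattern, indexing the table by row name and column name, works for any single-row F in an anova() table.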
In the window below, modify the code where indicated, using the
shuffle() function, to produce a single F for HomeSizeK that assumes a
DGP with 0 effect of home size. Run the code a few times just to see
what it does.
require(coursekata)
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# this code prints sample F for HomeSizeK
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# modify the code below to produce the F when residuals are shuffled
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# Solution: shuffle the residuals to produce one F from a DGP with 0 effect of home size
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
Now let’s add some code to create a sampling distribution of 1000 Fs
for HomeSizeK assuming no effect of home size in the DGP. Save these
Fs into a data frame called HomeSizeK_sdof.
require(coursekata)
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This code generates one shuffled HomeSizeK F
# Modify it to make a sampling distribution of 1000 shuffled Fs
# Save them in a data frame called HomeSizeK_sdof
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code will put these Fs into a histogram
gf_histogram(~ f, data = HomeSizeK_sdof) %>%
  gf_labs(title = "shuffled HomeSizeK Fs")
# Solution: do(1000) repeats the shuffled F 1000 times
# and the results are saved in a data frame called HomeSizeK_sdof
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
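For readers who want to see the mechanics behind shuffle() and do(), the following base-R sketch builds a comparable sampling distribution by hand. It is an illustration under stated assumptions, not the course's code: the data frame toy and the helper size_f_for() are hypothetical names, the data are simulated, and sample() and replicate() stand in for shuffle() and do().

```r
# Hypothetical toy data standing in for Smallville (not the real data set)
set.seed(2)
n <- 32
toy <- data.frame(
  hood = factor(rep(c("East", "West"), each = n / 2)),
  size = runif(n, min = 1, max = 3)
)
toy$price <- 200 + 50 * (toy$hood == "West") + 40 * toy$size + rnorm(n, sd = 30)
toy$resids <- resid(lm(price ~ hood, data = toy))

# Hypothetical helper: the F for size when the vector y is used as the outcome
size_f_for <- function(y) {
  anova(lm(y ~ hood + size, data = toy))["size", "F value"]
}

# sample() on a vector returns a random permutation of it, which breaks
# any link between the residuals and size
one_shuffled_f <- size_f_for(sample(toy$resids))

# Repeat the shuffle 1000 times and collect the Fs in a data frame
toy_sdof <- data.frame(f = replicate(1000, size_f_for(sample(toy$resids))))
summary(toy_sdof$f)
```

Each run of the shuffle gives a different F, which is why running the single-shuffle line several times produces different values.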
Below we have graphed the sampling distribution of 1000 shuffled Fs
for the HomeSizeK effect. We have also added to the plot, as a black
dot, the sample F from the HomeSizeK row of the ANOVA table (11.63).
We’ll save this value as HomeSizeK_f. As you can see, the sample F is
far out in the tail of the sampling distribution.
To calculate the exact p-value for the HomeSizeK F, we can use
tally().
Try copying and pasting the appropriate code into the code block
below. Also generate an ANOVA table to check whether the p-value
obtained from tally() is similar to the p-value for HomeSizeK in the
ANOVA table.
require(coursekata)
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This saves the sample HomeSizeK F
HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code generates a sampling distribution of shuffled HomeSizeK Fs
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# Paste in the code for tallying the p-value for HomeSizeK
# Modify the code below to generate an ANOVA table from the multivariate model
lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
# Solution: tally the proportion of shuffled Fs larger than the sample F
tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")
# Solution: generate an ANOVA table from the multivariate model
supernova(lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville))
The p-value we got from tally() is close to the p-value reported on
the HomeSizeK row of the multivariate ANOVA table: 0.0019.
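The proportion that tally() reports can be written out as a formula. The notation here is our own shorthand rather than the course's: F_i^* stands for the i-th shuffled F saved in HomeSizeK_sdof, and F_sample stands for the sample F saved as HomeSizeK_f.

```latex
\[
\hat{p} \;=\; \frac{1}{1000} \sum_{i=1}^{1000} \mathbf{1}\!\left[\, F_i^{*} > F_{\text{sample}} \,\right]
\]
```

The indicator equals 1 when a shuffled F exceeds the sample F and 0 otherwise, so the estimate is the fraction of the 1000 shuffled Fs that land beyond the black dot. Because the shuffles are random, this estimate will vary a little from run to run.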