1.6 Variable Types in R
Let’s look at a new dataset, called Ames. The data describe a sample
of 185 homes sold in Ames, Iowa during a particular time period. Ames
is located about 30 miles north of Des Moines (the state capital) and
is home to Iowa State University (the largest university in the
state).
Write some code below to look at the first six rows of the Ames data
frame:
require(coursekata)
# Use the head() function to look at the first six rows of Ames
head(Ames)
There are a lot of variables in this data frame – you can scroll right and left to see them all.
Each row in the Ames data frame represents a particular home. Each
variable describes a different feature of the homes in the data frame,
including the year each home was built (YearBuilt), how big the house
is (HomeSize), and what neighborhood it is in (Neighborhood).
Quantitative and Categorical Variables in R
Broadly speaking, variables can be divided into two types:
quantitative and categorical. Quantitative variables take numerical
values (e.g., 3 or 1.25). For quantitative variables, the values they
are assigned represent quantities such that observations with higher
numbers are assumed to have more of the quantity than those with lower
numbers. In Ames, for example, we can assume that a home with a
BuildQuality of 7 is actually of higher quality than one with a value
of 5.
The values assigned to categorical variables do not represent
quantities. Instead, they represent categories. For example, in Ames,
the variable Foundation is coded with values such as PouredConcrete or
CinderBlock. The difference is not quantitative; these are just two
different types of foundations.
Most quantitative variables are categorized by R as numeric (or num). (They may on occasion be categorized as int for integer or dbl for double – which basically means that the numbers can have decimals.) The nice thing about all these types of variables (num, int, and dbl) is that R knows it can add, subtract, multiply, and divide their values. That’s good!
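As a small illustration of these numeric types, here is a sketch using made-up values. The vectors home_size and bedrooms are hypothetical and are not taken from the Ames data frame; they only show how R reports each type and that arithmetic works on both.

```r
# Hypothetical values for illustration; these are not from the Ames data
home_size <- c(1250.5, 980, 2100.25)  # stored as double (numbers with decimals)
bedrooms  <- c(3L, 2L, 4L)            # the L suffix asks R to store integers

# typeof() reports how R stores the values
typeof(home_size)   # "double"
typeof(bedrooms)    # "integer"

# Both count as numeric, so R is willing to do arithmetic on either one
is.numeric(home_size)
is.numeric(bedrooms)
total_bedrooms <- sum(bedrooms)
mean_size <- mean(home_size)
```

Functions such as str() abbreviate these types as num and int when they summarize a data frame.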
Categorical variables are a slightly different story. Take, for
example, the variable HasCentralAir in the Ames dataset (the first six
values are printed below).
HasCentralAir
1 1
2 1
3 1
4 1
5 0
6 1
Even though the variable is coded with numbers (1 represents “has central air” and 0 represents “does not have central air”), it really is a categorical variable. We know that. But R does not know that unless we tell it. R will usually try to guess what kind of variable it is, but it may guess wrong!
For that reason, R has a way to let you specify whether a variable is
categorical, using the factor() function. If you tell R that a
variable is a factor, it will treat it as a categorical variable. To
tell R that HasCentralAir is categorical, we can write
factor(Ames$HasCentralAir).
But in order for this change to stick, we have to save this new
version of the variable back into the Ames data frame.
Ames$HasCentralAir <- factor(Ames$HasCentralAir)
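To see what factor() changes, here is a minimal sketch on a stand-alone vector. The name has_air is hypothetical; it simply holds six 0/1 codes like the ones printed above, so the example runs without the Ames data frame.

```r
# Hypothetical 0/1 codes standing in for a column like HasCentralAir
has_air <- c(1, 1, 1, 1, 0, 1)
class(has_air)           # "numeric": R assumes the codes are quantities

# factor() tells R to treat the codes as category labels instead
has_air_factor <- factor(has_air)
class(has_air_factor)    # "factor"
levels(has_air_factor)   # the distinct categories: "0" "1"
```

The values themselves look the same when printed, but R now keeps track of them as two categories rather than as numbers.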
If the 0s and 1s in the HasCentralAir column represented true
quantities, we could add them up using the code
sum(Ames$HasCentralAir). But if we tell R that HasCentralAir is a
factor, it will assume the 0s and 1s refer to categories, and so it
won’t be willing to add them up.
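The sketch below shows this behavior on a hypothetical vector, has_air, rather than on Ames itself. Because calling sum() on a factor stops with an error, the call is wrapped in tryCatch() here so that the message is captured instead of halting the script; that wrapper is just one way to demonstrate the point.

```r
# Hypothetical 0/1 codes standing in for a column like HasCentralAir
has_air <- c(1, 1, 1, 1, 0, 1)
sum(has_air)   # works while the codes are numeric: 5

has_air_factor <- factor(has_air)

# sum() refuses to add up a factor; tryCatch() captures the error
# message as text so the rest of the script can keep running
result <- tryCatch(
  sum(has_air_factor),
  error = function(e) conditionMessage(e)
)
result   # the error message R produced, as a character string
```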
In the code block below, add the sum() function to find the sum of
HasCentralAir when it is coded as a numeric variable (R thinks of the
0s and 1s as numbers).
require(coursekata)
# this turns HasCentralAir into a numeric variable
Ames$HasCentralAir <- as.numeric(Ames$HasCentralAir)
# add code to sum up the values of HasCentralAir
sum(Ames$HasCentralAir)
Even though R summed up these values, we shouldn’t be totaling these values up because the 0s and 1s represent categories. The total is uninterpretable.