1.7 Selecting & Filtering Data
Sometimes you want to focus on a subset of the variables in a data frame. For example, you might want to look at just the variables PriceK and PriceR in the Ames data frame. PriceK represents the sale price of the home in thousands of dollars. PriceR represents the sale price in dollars.
We can use the select() function to look at just a subset of variables. When using select(), we first tell R which data frame to use, then which variables to select from that data frame.
select(Ames, PriceK, PriceR)
Modify the select() code below to take a look at just the following variables in Ames: PriceK, PriceR, and Neighborhood.
require(coursekata)

# Exercise: replace ... with the variables to select
select(Ames, ...)

# Solution
select(Ames, PriceK, PriceR, Neighborhood)
Running the select() function will print out the values of the selected variables for every case. If you want to look at just the first six rows, you can combine the head() and select() functions like this:
head(select(Ames, PriceK, PriceR, Neighborhood))
  PriceK PriceR Neighborhood
1    260 260000 CollegeCreek
2    210 210000 CollegeCreek
3    155 155000      OldTown
4    125 125000      OldTown
5    110 110000 CollegeCreek
6    100 100000      OldTown
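The nesting of head() around select() can also be written with the pipe operator. The following is a minimal sketch, not the course's own code. It assumes the dplyr package is installed (the select() used here is assumed to be dplyr's), and it uses a small hypothetical data frame, toy_ames, built from the six rows printed above so that it runs without the Ames data.

```r
library(dplyr)  # assumed installed; provides select() and the %>% pipe

# Hypothetical stand-in for Ames, built from the six rows printed above
toy_ames <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Nested form, as in the text: select two columns, then keep the first 3 rows
nested <- head(select(toy_ames, PriceK, Neighborhood), n = 3)

# Piped form: the same two steps, read from left to right
piped <- toy_ames %>%
  select(PriceK, Neighborhood) %>%
  head(n = 3)

print(nested)
print(piped)
```

The n argument of head() controls how many rows are kept; leaving it out gives the default of six rows used in the text.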
Whereas select() gives you a subset of variables (or columns of the data frame), the filter() function will give you a subset of observations (or rows) of the data frame based on some criteria. For example, here is some code that will return only the observations where the sale price is greater than $300,000:
filter(Ames, PriceK > 300)
Edit the code below to filter for homes that sold for more than $300,000.
require(coursekata)

# Exercise: add the data frame and the filtering condition
filter()

# Solution
filter(Ames, PriceK > 300)
  YearBuilt YearSold Neighborhood HomeSizeR HomeSizeK LotSizeR LotSizeK Floors
1      2007     2007 CollegeCreek      2696     2.696     9965    9.965      2
2      2004     2007 CollegeCreek      2000     2.000    10386   10.386      1
3      2000     2009 CollegeCreek      2153     2.153    11050   11.050      2
4      2006     2007 CollegeCreek      2828     2.828     9965    9.965      2
  BuildQuality     Foundation HasCentralAir Bathrooms Bedrooms TotalRooms
1            7 PouredConcrete             1         2        4         10
2            8 PouredConcrete             1         2        3          8
3            9 PouredConcrete             1         2        3          8
4            8 PouredConcrete             1         3        4         11
  KitchenQuality HasFireplace GarageType GarageCars PriceR PriceK
1      Excellent            1   Attached          3 383970 383.97
2           Good            0   Attached          3 305900 305.90
3      Excellent            1   Attached          3 313000 313.00
4           Good            1   Attached          3 424870 424.87
The function filter(), like select(), returns a data frame. In this case, the data frame only has four rows because only four observations in Ames had sale prices greater than $300K.
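Because filter() returns a data frame, its result can be saved, counted, or passed on to select(). The sketch below illustrates this under some assumptions: dplyr is installed (the filter() and select() here are assumed to be dplyr's), and toy_ames is a hypothetical data frame built from the six rows shown earlier for head(), not the full Ames data. The cutoff of 200 is chosen only so that the toy data returns some rows.

```r
library(dplyr)  # assumed installed; provides filter() and select()

# Hypothetical stand-in for Ames, built from the six rows shown earlier
toy_ames <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Keep only the rows where the price is above 200 (thousand dollars)
expensive <- filter(toy_ames, PriceK > 200)

# The result is itself a data frame, so data frame tools work on it
print(is.data.frame(expensive))
print(nrow(expensive))  # number of rows that met the condition

# Rows first, then columns: filter() feeds directly into select()
expensive_prices <- select(filter(toy_ames, PriceK > 200), PriceK, Neighborhood)
print(expensive_prices)
```

Counting rows with nrow() is a quick way to check how many observations met a condition without reading through the printed output.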