CourseKata - 1.7 Selecting and Filtering Data

High School / Statistics and Data Science II (XCD)


1.7 Selecting & Filtering Data

Sometimes you want to focus on a subset of the variables in a data frame. For example, you might want to look at just the variables PriceK and PriceR in the Ames data frame. PriceK represents the sale price of the home in thousands of dollars. PriceR represents the sale price in dollars.

We can use the select() function to look at just a subset of variables. When using select(), we first tell R which data frame to use, then which variables to select from that data frame.

select(Ames, PriceK, PriceR)

Modify the select() code below to take a look at just the following variables in Ames: PriceK, PriceR, and Neighborhood.

require(coursekata)

# Modify this code
select(Ames, ...)

# Solution
select(Ames, PriceK, PriceR, Neighborhood)

Running the select() function will print out the values of the selected variables for every case. If you want to look at just the first six rows, you can combine the head() and select() functions like this: head(select(Ames, PriceK, PriceR, Neighborhood)).

  PriceK PriceR Neighborhood
1    260 260000 CollegeCreek
2    210 210000 CollegeCreek
3    155 155000      OldTown
4    125 125000      OldTown
5    110 110000 CollegeCreek
6    100 100000      OldTown
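To try select() without the full Ames data, here is a minimal sketch on a small stand-in data frame. The name homes is hypothetical; its values are copied from the six rows printed above. It also assumes that the select() used in this course is the one from the dplyr package.

```r
library(dplyr)  # assumption: select() here is dplyr's select()

# Toy stand-in for Ames, built from the six rows printed above
homes <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Keep two of the three columns; every row is retained
price_by_area <- select(homes, PriceK, Neighborhood)

names(price_by_area)    # the two selected column names
nrow(price_by_area)     # 6: select() does not drop rows
head(price_by_area, 3)  # first three rows only
```

The result has fewer columns but the same number of rows, which is the sense in which select() works on variables rather than cases.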

Whereas select() gives you a subset of variables (or columns of the data frame), the filter() function will give you a subset of observations (or rows) of the data frame based on some criteria. For example, here is some code that will return only the observations where the sale price is greater than $300,000:

filter(Ames, PriceK > 300)

Edit the code below to filter for homes that sold for more than $300K.

require(coursekata)

# Modify this code
filter()

# Solution
filter(Ames, PriceK > 300)

  YearBuilt YearSold Neighborhood HomeSizeR HomeSizeK LotSizeR LotSizeK Floors
1      2007     2007 CollegeCreek      2696     2.696     9965    9.965      2
2      2004     2007 CollegeCreek      2000     2.000    10386   10.386      1
3      2000     2009 CollegeCreek      2153     2.153    11050   11.050      2
4      2006     2007 CollegeCreek      2828     2.828     9965    9.965      2
  BuildQuality     Foundation HasCentralAir Bathrooms Bedrooms TotalRooms
1            7 PouredConcrete             1         2        4         10
2            8 PouredConcrete             1         2        3          8
3            9 PouredConcrete             1         2        3          8
4            8 PouredConcrete             1         3        4         11
  KitchenQuality HasFireplace GarageType GarageCars PriceR PriceK
1      Excellent            1   Attached          3 383970 383.97
2           Good            0   Attached          3 305900 305.90
3      Excellent            1   Attached          3 313000 313.00
4           Good            1   Attached          3 424870 424.87

The function filter(), like select(), returns a data frame. In this case, the data frame only has four rows because only four observations in Ames had sale prices greater than $300K.
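The same idea can be checked on a small scale. This sketch uses a hypothetical six-row data frame called homes, with values taken from the head() output earlier in this section, and a lower cutoff of 200 so that some rows pass. It assumes filter() is the dplyr version.

```r
library(dplyr)  # assumption: filter() here is dplyr's filter()

# Toy stand-in for Ames (hypothetical name, values from the head() output)
homes <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# Keep only the rows where the condition is TRUE
pricey <- filter(homes, PriceK > 200)

nrow(pricey)  # 2: only the 260 and 210 rows pass
ncol(pricey)  # 3: filter() keeps every column
pricey
```

Here the number of rows shrinks while the columns stay the same, the mirror image of what select() does.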

Remember: even though select() and filter() both return data frames, those new data frames are just temporary unless you save them. If you want to save a data frame that includes only the variables PriceK and PriceR, you would need to do something like this: new_data_frame <- select(Ames, PriceK, PriceR).
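As a rough illustration of saving a result, the sketch below filters and then selects on a hypothetical toy data frame (homes, with values from the head() output earlier in this section) and stores the result under a new name. It assumes the dplyr versions of both functions. Nesting filter() inside select() is one way to combine them, not the only one.

```r
library(dplyr)  # assumption: select() and filter() come from dplyr

# Toy stand-in for Ames (hypothetical name, values from the head() output)
homes <- data.frame(
  PriceK = c(260, 210, 155, 125, 110, 100),
  PriceR = c(260000, 210000, 155000, 125000, 110000, 100000),
  Neighborhood = c("CollegeCreek", "CollegeCreek", "OldTown",
                   "OldTown", "CollegeCreek", "OldTown")
)

# filter() runs first (inner call), then select() trims the columns;
# the arrow saves the result so it can be reused later
top_prices <- select(filter(homes, PriceK > 150), PriceK, Neighborhood)

nrow(top_prices)  # 3 rows: 260, 210, and 155
ncol(top_prices)  # 2 columns

# The original data frame is untouched
nrow(homes)       # still 6
ncol(homes)       # still 3
```

Without the assignment, the filtered and selected data frame would be printed once and then discarded.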
