1.8 Manipulating Data Frames: select() & filter()
We’ve been exploring the penguins data frame, but there are so many data sets in the world to discover. When you find them, you might want to manipulate them to make them easier to work with. Here we will explore a few functions that can help us work with data frames: select(), filter(), mutate(), and arrange().
Example Data Frame: World
To explore these R functions, let’s take a look at a data set called World. It has a number of variables about different countries in the world. Try writing the name of the data frame to just look at the whole thing in all its glory.
require(coursekata)
# take a look at World
World
That’s a lot to take in. Each row represents a country, but there are a lot of variables and a lot of rows! You can always use functions like head(), but it might be handy to learn a few more functions like select() or filter().
select() Variables
Sometimes we want to focus on a subset of the variables in a data frame. For example, we might want to look at just the variables Country, Region, and LifeExpectancy in the World data frame. (Note: LifeExpectancy represents the average number of years lived by a country’s citizens.)
The figure to the right summarizes what select() does. Each row represents a case, and each column represents a variable. The dark gray boxes at the top represent the variable names. The column (or variable) colored yellow is the one we are interested in.
We can use the select() function to take a data set with a lot of variables and winnow it down to just a subset of variables (or even just one, as in the figure). When using select(), we first need to tell R which data frame to use, then which variables to select from that data frame.
select(World, Country, Region, LifeExpectancy)
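To see the pattern on something small, here is a minimal sketch. It assumes that select() is the dplyr function (so it loads dplyr directly), and it uses a made-up three-row data frame called MiniWorld, a hypothetical name, filled with a few values copied from the World output shown in this section.

```r
library(dplyr)  # assumption: select() here is dplyr's select()

# MiniWorld is a hypothetical mini data frame for illustration only
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola"),
  LifeExpectancy = c(76.2, 71.7, 41.7),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683)
)

# First argument: the data frame. Remaining arguments: the variables to keep.
TwoVars <- select(MiniWorld, Country, LifeExpectancy)

names(TwoVars)  # the variables that remain
nrow(TwoVars)   # select() keeps every row (case)
```

Notice that select() changes which columns you see, not which rows.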
Modify the select() code below to take a look at just the following variables in World: Country, LifeExpectancy, GirlsH1900, and GirlsH1980 (the latter two variables represent the average height, in cm, of 18-year-old girls in the years 1900 and 1980, respectively).
require(coursekata)
# modify this code
select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
The select() function will output a data frame with only the selected variables for every case. If we want to look at just the first six rows, we can combine the head() and select() functions like this:
head(select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980))
    Country LifeExpectancy GirlsH1900 GirlsH1980
1   Albania           76.2   151.0666   162.0169
2   Algeria           71.7   152.4126   159.5387
3    Angola           41.7   153.0222   158.5683
4 Argentina           74.8   151.1988   159.5268
5   Armenia           71.7   150.2388   158.4285
6 Australia           80.9   156.0623   165.0783
This head(select()) code is an example of a function within a function. To read it, start from the inside (the select() part), then think of the output of select() as the input to the head() function.
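One way to convince yourself of the inside-out reading is to write the same thing in two separate steps and compare. This is only a sketch: it assumes select() comes from dplyr, and MiniWorld, nested, selected, and stepwise are hypothetical names used for illustration.

```r
library(dplyr)  # assumption: select() here is dplyr's select()

# MiniWorld is a hypothetical mini data frame for illustration only
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683, 159.5268)
)

# Nested: select() runs first, and its output is handed to head()
nested <- head(select(MiniWorld, Country, LifeExpectancy), 2)

# The same work in two steps
selected <- select(MiniWorld, Country, LifeExpectancy)
stepwise <- head(selected, 2)

# Check whether the two approaches give the same data frame
identical(nested, stepwise)
```

The second argument to head() (here, 2) sets how many rows to show; when it is left out, head() shows six.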
If you want to save a data frame that includes only the four variables, you need to use the assignment operator (the arrow, <- or ->). Here is an example where we save the new slimmed-down data frame as SelectWorld:
SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
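The short sketch below tries both arrow directions on a small made-up data frame. It assumes select() is the dplyr function; MiniWorld, LeftArrow, and RightArrow are hypothetical names. It also checks that the original data frame is left untouched when we save a slimmed-down copy.

```r
library(dplyr)  # assumption: select() here is dplyr's select()

# MiniWorld is a hypothetical mini data frame for illustration only
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola"),
  LifeExpectancy = c(76.2, 71.7, 41.7),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683)
)

# The arrow points at the name that receives the result
LeftArrow <- select(MiniWorld, Country, LifeExpectancy)
select(MiniWorld, Country, LifeExpectancy) -> RightArrow

identical(LeftArrow, RightArrow)  # do both arrows save the same thing?
names(MiniWorld)                  # the original still has all of its variables
```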
filter() for Cases
Whereas select() gives you a subset of variables (or columns) of the data frame, the filter() function will give you a subset of observations (or rows) of a data frame based on some criteria.
For example, we might notice that some countries have a much longer average life expectancy than others. We could use the filter() function to show us only the countries that have an average life expectancy greater than 80, like this:
filter(World, LifeExpectancy > 80)
Write code below to show only countries that, on average, have a life expectancy greater than 80. To make things a little easier to see, use the data frame we created with just a few variables, SelectWorld.
require(coursekata)
SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
# Write filter code but use the data frame SelectWorld
filter(SelectWorld, LifeExpectancy > 80)
       Country LifeExpectancy GirlsH1900 GirlsH1980
1    Australia           80.9   156.0623   165.0783
2       Canada           80.3   157.9067   163.3481
3       France           80.2   155.6877   164.4921
4      Iceland           81.5   159.2744   167.2474
5       Israel           80.3   152.3893   161.8456
6        Italy           80.3   153.7283   163.1413
7        Japan           82.3   143.3583   158.5073
8        Spain           80.5   151.4209   162.6662
9       Sweden           80.5   160.6127   166.4325
10 Switzerland           81.3   157.9165   164.1571
The function filter(), like select(), returns a data frame. The filtered data frame has fewer rows because only a few countries have an average life expectancy over 80 years.
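Here is a minimal sketch of that idea on a tiny made-up data frame, assuming filter() is the dplyr function. MiniWorld and Over80 are hypothetical names, and the four rows reuse values from the output shown in this section.

```r
library(dplyr)  # assumption: filter() here is dplyr's filter()

# MiniWorld is a hypothetical mini data frame for illustration only
MiniWorld <- data.frame(
  Country = c("Albania", "Angola", "Australia", "Japan"),
  LifeExpectancy = c(76.2, 41.7, 80.9, 82.3),
  GirlsH1980 = c(162.0169, 158.5683, 165.0783, 158.5073)
)

# Keep only the rows where the condition LifeExpectancy > 80 is TRUE
Over80 <- filter(MiniWorld, LifeExpectancy > 80)

Over80$Country   # which countries passed the test
ncol(Over80)     # filter() keeps every column
nrow(Over80)     # but only some of the rows
```

Compare this with select(): select() keeps all the rows and drops columns, while filter() keeps all the columns and drops rows.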
As before, you won’t be able to keep this smaller data frame for further use unless you save it (using the assignment arrow <-).
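Once a filtered data frame is saved, it can be handed to other functions later on. The sketch below assumes select() and filter() are the dplyr functions; MiniWorld, LongLife, and LongLifeNames are hypothetical names chosen for illustration.

```r
library(dplyr)  # assumption: select() and filter() are the dplyr functions

# MiniWorld is a hypothetical mini data frame for illustration only
MiniWorld <- data.frame(
  Country = c("Albania", "Angola", "Australia", "Japan"),
  LifeExpectancy = c(76.2, 41.7, 80.9, 82.3),
  GirlsH1980 = c(162.0169, 158.5683, 165.0783, 158.5073)
)

# Save the filtered rows under a new name
LongLife <- filter(MiniWorld, LifeExpectancy > 80)

# Because it was saved, we can keep working with it
LongLifeNames <- select(LongLife, Country)

nrow(LongLife)   # rows in the saved, filtered data frame
nrow(MiniWorld)  # the original data frame is unchanged
```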