1.8 Manipulating Data Frames: select() & filter()
We’ve been exploring the penguins data frame, but there are so many
datasets in the world to discover. When you find them, you might want to
manipulate them to make them easier to work with. Here we will explore a
few functions that can help us work with data frames: select(),
filter(), mutate(), and arrange().
Example Data Frame: World
To explore these R functions, let’s take a look at a dataset called
World (click here for more details about the variables in the World
dataset). It has a number of variables about different countries in the
world. Try writing the name of the data frame to just look at the whole
thing in all its glory.
require(coursekata)
# take a look at World
World
That’s a lot to take in. Each row represents a country, but there are a
lot of variables and a lot of rows! You can always use functions like
head(), but it might be handy to learn a few more functions like
select() or filter().
select() Variables
Sometimes we want to focus on a subset of the variables in a data
frame. For example, we might want to look at just the variables
Country, Region, and LifeExpectancy in the World data frame. (Note:
LifeExpectancy represents the average life expectancy of a country's
citizens, in years.)
The figure to the right summarizes what select() does. Each row
represents a case, each column represents a variable. The dark gray
boxes at the top represent the variable names. The column (or variable)
colored yellow is the one we are interested in.
We can use the select() function to take a dataset with a lot of
variables and winnow it down to just a subset of variables (or even just
one, as in the figure). When using select(), we first need to tell R
which data frame, then which variables to select from that data frame.
select(World, Country, Region, LifeExpectancy)
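To see the argument order on something small enough to check by eye, here is a minimal sketch. It assumes the dplyr package (which supplies select()) is installed. MiniWorld is a hypothetical name for a tiny data frame built from the six rows of output printed later in this section; it is not the full World data.

```r
library(dplyr)

# Six rows copied from the output shown later in this section (not all of World)
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina", "Armenia", "Australia"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8, 71.7, 80.9),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222, 151.1988, 150.2388, 156.0623),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683, 159.5268, 158.4285, 165.0783)
)

# First argument: the data frame. Remaining arguments: the variables to keep.
TwoVars <- select(MiniWorld, Country, LifeExpectancy)

names(TwoVars)  # the variables that were kept
nrow(TwoVars)   # every case (row) is still there
```

Only the columns change: the result has two variables instead of four, but still one row per country.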
Modify the select() code below to take a look at just the following
variables in World: Country, LifeExpectancy, GirlsH1900, and GirlsH1980
(the latter two variables represent the average height of 18-year-old
girls in the year 1900 and 1980, respectively, in cm).
require(coursekata)

# starter code to modify: select()
# one way to complete it:
select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
The select() function will output a data frame with only the selected
variables for every case. If we want to just look at the first six rows,
we can combine the head() and select() functions like this:
head(select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980))
    Country LifeExpectancy GirlsH1900 GirlsH1980
1   Albania           76.2   151.0666   162.0169
2   Algeria           71.7   152.4126   159.5387
3    Angola           41.7   153.0222   158.5683
4 Argentina           74.8   151.1988   159.5268
5   Armenia           71.7   150.2388   158.4285
6 Australia           80.9   156.0623   165.0783
This head(select()) code is an example of a function-within-a-function.
To read this, you want to start from the inside (so, the select() part),
then think about the output of the select() function as being inside the
head() function.
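The inside-out reading can be checked by doing the same work one function at a time. This is a minimal sketch, assuming dplyr is installed. MiniWorld is a hypothetical small data frame using values from the output above, and the names nested, selected, and stepwise are made up for illustration.

```r
library(dplyr)

# A few rows copied from the output above (not the full World data)
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222, 151.1988)
)

# Nested: select() runs first, then head() receives its output
nested <- head(select(MiniWorld, Country, LifeExpectancy), 3)

# Step by step: the same work, one function at a time
selected <- select(MiniWorld, Country, LifeExpectancy)
stepwise <- head(selected, 3)

# Compare the two results
identical(nested, stepwise)
```

The second argument to head() (here 3) sets how many rows to show; leaving it out gives the first six rows, as in the example above.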
If you want to save a data frame that includes only the four variables,
you would need to use the assignment operator (the arrow, <- or ->).
Here is an example where we save the new slimmed down data frame as
SelectWorld:
SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
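Both arrows save the result; they differ only in which side the name goes on. The sketch below assumes dplyr is installed. MiniWorld, LeftSaved, and RightSaved are hypothetical names used only for illustration.

```r
library(dplyr)

# A few rows copied from the output above (not the full World data)
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola"),
  LifeExpectancy = c(76.2, 71.7, 41.7),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222)
)

# Leftward arrow: the new name goes on the left
LeftSaved <- select(MiniWorld, Country, LifeExpectancy)

# Rightward arrow: the new name goes on the right
select(MiniWorld, Country, LifeExpectancy) -> RightSaved

# Compare the two saved data frames
identical(LeftSaved, RightSaved)
```

The leftward arrow is the one used throughout this section.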
filter() for Cases
Whereas select() gives you a subset of variables (or columns) of the
data frame, the filter() function will give you a subset of observations
(or rows) of a data frame based on some criteria.
For example, we might notice that some countries have a much longer
average life expectancy than others. We could use the filter() function
to show us only the countries that have an average life expectancy
greater than 80, like this:
filter(World, LifeExpectancy > 80)
Write code below to show only countries that, on average, have a life
expectancy greater than 80. To make things a little easier to see, use
the data frame we created with just a few variables, SelectWorld.
require(coursekata)
SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
# Write filter code but use the data frame SelectWorld
filter(SelectWorld, LifeExpectancy > 80)
       Country LifeExpectancy GirlsH1900 GirlsH1980
1    Australia           80.9   156.0623   165.0783
2       Canada           80.3   157.9067   163.3481
3       France           80.2   155.6877   164.4921
4      Iceland           81.5   159.2744   167.2474
5       Israel           80.3   152.3893   161.8456
6        Italy           80.3   153.7283   163.1413
7        Japan           82.3   143.3583   158.5073
8        Spain           80.5   151.4209   162.6662
9       Sweden           80.5   160.6127   166.4325
10 Switzerland           81.3   157.9165   164.1571
The function filter(), like select(), returns a data frame. The filtered
data frame has fewer rows because only a few countries have an average
life expectancy over 80 years.
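To confirm that filter() changes the rows but leaves the columns alone, compare the counts before and after. This is a minimal sketch, assuming dplyr is installed. MiniWorld is a hypothetical six-row data frame copied from the head() output earlier in this section, and Over80 is a made-up name.

```r
library(dplyr)

# Six rows copied from the head() output earlier in this section
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina", "Armenia", "Australia"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8, 71.7, 80.9)
)

# Keep only the cases (rows) that meet the condition
Over80 <- filter(MiniWorld, LifeExpectancy > 80)

nrow(MiniWorld)  # number of rows before filtering
nrow(Over80)     # number of rows after filtering
names(Over80)    # the variables are unchanged
```

Of these six countries, only Australia (80.9) passes the condition, so Over80 has a single row but the same two variables.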
As before, you won’t be able to keep this smaller data frame for further
use unless you save it (using the assignment arrow <-).