1.8 Manipulating Data Frames: select() & filter()

We’ve been exploring the penguins data frame, but there are so many data sets in the world to discover. When you find them, you might want to manipulate them to make them easier to work with. Here we will explore a few functions that can help us work with data frames: select(), filter(), mutate(), and arrange().

Example Data Frame: World

To explore these R functions, let’s take a look at a data set called World. It has a number of variables about different countries in the world. Try typing the name of the data frame to look at the whole thing in all its glory.

require(coursekata)

# take a look at World
World

That’s a lot to take in. Each row represents a country but there are a lot of variables and a lot of rows! You can always use functions like head() but it might be handy to learn a few more functions like select() or filter().

select() Variables

Sometimes we want to focus on a subset of the variables in a data frame. For example, we might want to look at just the variables Country, Region, and LifeExpectancy in the World data frame. (Note: LifeExpectancy represents the average number of years lived by a country’s citizens.)

[Figure: a 3 by 5 grid of gray squares symbolizing a data frame. The top row is dark gray to indicate the column headers, and the first column of values is shaded yellow to indicate that select() picks specific variables, which are whole columns in a data frame.]

The figure summarizes what select() does. Each row represents a case, and each column represents a variable. The dark gray boxes at the top represent the variable names. The column (or variable) colored yellow is the one we are interested in.

We can use the select() function to take a dataset with a lot of variables and winnow it down to just a subset of variables (or even just one, as in the figure). When using select(), we first need to tell R which data frame, then which variables to select from that data frame.

select(World, Country, Region, LifeExpectancy)
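To see the same pattern on something small enough to read at a glance, here is a minimal sketch. It assumes the dplyr package is installed and that the select() used in this lesson is dplyr's. MiniWorld is a hypothetical six-row data frame, typed in by hand from values that appear in the output later in this section; it is not part of coursekata.

```r
library(dplyr)

# MiniWorld is a hypothetical, hand-typed stand-in for World (six countries only)
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina", "Armenia", "Australia"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8, 71.7, 80.9),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222, 151.1988, 150.2388, 156.0623),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683, 159.5268, 158.4285, 165.0783)
)

# data frame first, then the variables to keep
select(MiniWorld, Country, LifeExpectancy)

# the result keeps every row but only the two named columns
names(select(MiniWorld, Country, LifeExpectancy))
nrow(select(MiniWorld, Country, LifeExpectancy))
```

Notice that the variables come back in the order you list them, and that no rows are dropped.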

Modify the select() code below to take a look at just the following variables in World: Country, LifeExpectancy, GirlsH1900, and GirlsH1980 (the latter two variables represent the average height of 18-year-old girls in the year 1900 and 1980, respectively, in cm).

require(coursekata)

# modify this code
select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)

The select() function will output a data frame with only the selected variables for every case. If we want to just look at the first six rows we can combine the head() and select() functions like this:

head(select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980))
    Country LifeExpectancy GirlsH1900 GirlsH1980
1   Albania           76.2   151.0666   162.0169
2   Algeria           71.7   152.4126   159.5387
3    Angola           41.7   153.0222   158.5683
4 Argentina           74.8   151.1988   159.5268
5   Armenia           71.7   150.2388   158.4285
6 Australia           80.9   156.0623   165.0783

This head(select()) code is an example of a function-within-a-function. To read this, you want to start from the inside (so, the select() part), then think about the output of the select() function as being inside the head() function.
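One way to convince yourself of the inside-out reading is to write the nested call as two separate steps and compare the results. The sketch below assumes dplyr is installed. MiniWorld is a hypothetical hand-typed data frame, and n = 3 is used only so that head() visibly trims this small example.

```r
library(dplyr)

# MiniWorld is a hypothetical, hand-typed stand-in for World
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola", "Argentina", "Armenia", "Australia"),
  LifeExpectancy = c(76.2, 71.7, 41.7, 74.8, 71.7, 80.9),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222, 151.1988, 150.2388, 156.0623)
)

# nested: select() runs first, then head() receives its output
nested <- head(select(MiniWorld, Country, LifeExpectancy), n = 3)

# the same work written as two separate steps
selected <- select(MiniWorld, Country, LifeExpectancy)
stepwise <- head(selected, n = 3)

# compare the two routes
identical(nested, stepwise)
```

The nested version saves you from naming the in-between data frame, while the stepwise version can be easier to read when the calls get long.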

If you want to save a data frame that includes only the four variables, you would need to use the assignment operator (the arrow, <- or ->). Here is an example where we save the new slimmed down data frame as SelectWorld:

SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
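Here is a small sketch of what saving does, again assuming dplyr is installed and using a hypothetical hand-typed MiniWorld data frame. The names SlimLeft and SlimRight are made up for illustration. It tries both arrow directions and checks that the original data frame is left untouched.

```r
library(dplyr)

# MiniWorld is a hypothetical, hand-typed stand-in for World
MiniWorld <- data.frame(
  Country = c("Albania", "Algeria", "Angola"),
  LifeExpectancy = c(76.2, 71.7, 41.7),
  GirlsH1900 = c(151.0666, 152.4126, 153.0222),
  GirlsH1980 = c(162.0169, 159.5387, 158.5683)
)

# leftward arrow: the name goes on the left, the value on the right
SlimLeft <- select(MiniWorld, Country, LifeExpectancy)

# rightward arrow: the value goes on the left, the name on the right
select(MiniWorld, Country, LifeExpectancy) -> SlimRight

# compare what the two arrows stored
identical(SlimLeft, SlimRight)

# count columns in the original and in the saved copy
ncol(MiniWorld)
ncol(SlimLeft)
```

Either arrow works, but the leftward form is the one used throughout this section.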

filter() for Cases

[Figure: a 3 by 5 grid of gray squares symbolizing a data frame. The top row is dark gray to indicate the column headers, and two of the rows are shaded yellow to indicate that filter() picks specific rows based on specified criteria.]

Whereas select() gives you a subset of variables (or columns) of the data frame, the filter() function will give you a subset of observations (or rows) of a data frame based on some criteria.

For example, we might notice that some countries have a much longer average life expectancy than others. We could use the filter() function to show us only the countries that have an average life expectancy greater than 80, like this:

filter(World, LifeExpectancy > 80)

Write code below to show only countries that, on average, have a life expectancy greater than 80. To make things a little easier to see, use the data frame we created with just a few variables, SelectWorld.

require(coursekata)

SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)

# Write filter code but use the data frame SelectWorld
filter(SelectWorld, LifeExpectancy > 80)
       Country LifeExpectancy GirlsH1900 GirlsH1980
1    Australia           80.9   156.0623   165.0783
2       Canada           80.3   157.9067   163.3481
3       France           80.2   155.6877   164.4921
4      Iceland           81.5   159.2744   167.2474
5       Israel           80.3   152.3893   161.8456
6        Italy           80.3   153.7283   163.1413
7        Japan           82.3   143.3583   158.5073
8        Spain           80.5   151.4209   162.6662
9       Sweden           80.5   160.6127   166.4325
10 Switzerland           81.3   157.9165   164.1571

The function filter(), like select(), returns a data frame. The filtered data frame has fewer rows because only a few countries have an average life expectancy over 80 years.

As before, you won’t be able to keep this smaller data frame for further use unless you save it (using the assignment arrow <-).
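The sketch below puts filtering and saving together on a tiny example. It assumes dplyr is installed. MiniWorld is a hypothetical data frame typed in by hand from six of the countries shown in this section, and LongLived is a made-up name for the saved result.

```r
library(dplyr)

# MiniWorld is a hypothetical, hand-typed stand-in for World
MiniWorld <- data.frame(
  Country = c("Albania", "Angola", "Armenia", "Australia", "Canada", "Japan"),
  LifeExpectancy = c(76.2, 41.7, 71.7, 80.9, 80.3, 82.3)
)

# keep only the rows where the condition is TRUE, and save the result
LongLived <- filter(MiniWorld, LifeExpectancy > 80)

nrow(MiniWorld)    # rows before filtering
nrow(LongLived)    # rows after filtering
LongLived$Country  # the countries that passed the test
```

Because the result was saved under a new name, MiniWorld itself still holds all six countries.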
