High School / Algebra + Data Science (G)

1.9 Manipulating Data Frames: arrange() & mutate()

arrange() Cases in Order

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. Each row below the header is shaded in a slightly different shade of yellow. To the right, is the same grid, but reorganized so that each row is now ordered in a gradient from top to bottom, with the lightest shade of yellow at the top and the darkest shade at the bottom in order to demonstrate how the arrange function will arrange each row in a data frame based on a specific variable.

In many data sets, we might want to order the cases in some way. For example, in SelectWorld, we might want to know which country had, on average, the shortest girls in 1900. To find out, we can simply arrange the data frame in order using this code:

arrange(SelectWorld, GirlsH1900)

Write some code below to arrange the countries in order by GirlsH1900, and then save the resulting data frame as a new object called ArrangeWorld.

require(coursekata)

SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)

# Save the arranged data frame
ArrangeWorld <- arrange(SelectWorld, GirlsH1900)

# Print out the first 6 rows of the new data frame
head(ArrangeWorld)

The countries are arranged from shortest to tallest (according to how tall 18-year-old girls were in 1900).

      Country LifeExpectancy GirlsH1900 GirlsH1980
1   Guatemala           69.7   140.9926   149.0530
2 El Salvador           71.3   142.0544   153.6128
3  Bangladesh           63.1   142.1550   151.0859
4        Peru           70.7   142.2386   152.2495
5 South Korea           77.9   143.2104   160.9055
6       Japan           82.3   143.3583   158.5073

The function arrange() can also be used to arrange values in the opposite order (descending, from tallest to shortest). Adding a negative sign (-) in front of the variable will arrange the data so that the countries with the tallest girls appear at the top.

arrange(SelectWorld, -GirlsH1900)
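As a minimal sketch of how the two orderings compare, the code below builds a tiny data frame by hand from four rows of the output shown above. The name MiniWorld is made up for this illustration and is not part of the course data. The sketch loads dplyr directly, on the assumption that it is the package supplying arrange() here, and it also tries desc(), a dplyr helper that this page does not cover, as an alternative to the negative sign.

```r
library(dplyr)

# Hypothetical mini data frame, hand-copied from four rows of the output above
MiniWorld <- data.frame(
  Country    = c("Peru", "Japan", "Guatemala", "South Korea"),
  GirlsH1900 = c(142.2386, 143.3583, 140.9926, 143.2104),
  GirlsH1980 = c(152.2495, 158.5073, 149.0530, 160.9055)
)

# Ascending order (the default): shortest first
Shortest <- arrange(MiniWorld, GirlsH1900)

# Descending order, written two ways
TallestMinus <- arrange(MiniWorld, -GirlsH1900)
TallestDesc  <- arrange(MiniWorld, desc(GirlsH1900))

# Compare the order of the countries in each result
Shortest$Country
TallestMinus$Country
TallestDesc$Country
```

For a numeric variable such as GirlsH1900, the negative sign and desc() should put the rows in the same order. The negative sign does not work on text variables such as Country, while desc() does.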

Out of curiosity, do you think the countries with the shortest girls in 1900 were also the countries with the shortest girls in 1980? Write some code below to find out.

require(coursekata)

SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)

# See the countries with the shortest girls in 1980
head(arrange(SelectWorld, GirlsH1980))

      Country LifeExpectancy GirlsH1900 GirlsH1980
1   Guatemala           69.7   140.9926   149.0530
2 Philippines           71.0   148.1826   149.3036
3  Bangladesh           63.1   142.1550   151.0859
4       Nepal           62.6   144.6591   151.1820
5        Laos           63.2   145.1294   151.5993
6   Indonesia           69.7   145.0053   151.7019

mutate() to Create New Variables

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. To the right, is the same grid, but an extra column is added to the end and shaded in yellow to indicate the new column that is created when using the mutate function.

If you want to create a new variable, you can use mutate(). For example, in SelectWorld, we might want to create a variable to indicate how much taller girls in each country were, on average, in 1980 compared with 1900. Girls in Peru, for instance, averaged 152 cm tall in 1980 and 142 cm in 1900. We’d like to make a variable, which we might call GirlsHeightChange, that would have a value of 10 for Peru, indicating that girls in Peru got taller by about 10 cm during those 80 years.

We can create a data frame with a new variable by using the mutate() function, like this:

mutate(SelectWorld, GirlsHeightChange = GirlsH1980 - GirlsH1900)
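To check the Peru arithmetic on something small, here is a sketch that uses a two-row data frame typed in by hand from the values shown earlier on this page. The names MiniWorld and MiniChange are made up for this illustration, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical mini data frame with two countries from the output above
MiniWorld <- data.frame(
  Country    = c("Peru", "South Korea"),
  GirlsH1900 = c(142.2386, 143.2104),
  GirlsH1980 = c(152.2495, 160.9055)
)

# mutate() returns a new data frame that includes the extra column
MiniChange <- mutate(MiniWorld, GirlsHeightChange = GirlsH1980 - GirlsH1900)

# The new column is at the end; Peru's value is about 10 cm
names(MiniChange)
round(MiniChange$GirlsHeightChange, 2)

# The original data frame is unchanged because we saved the result
# under a different name
names(MiniWorld)
```

Because mutate() returns a new data frame rather than changing the one you give it, the new variable is only kept if you save the result with the assignment arrow.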

Try running the code below. Then add code to arrange the new data frame in order, from the countries where girls’ heights changed the most to those where they changed the least.

require(coursekata)

SelectWorld <- select(World, Country, LifeExpectancy, GirlsH1900, GirlsH1980)
NewWorld <- mutate(SelectWorld, GirlsHeightChange = GirlsH1980 - GirlsH1900)

# Arrange the data frame from largest to smallest height change
arrange(NewWorld, -GirlsHeightChange)

Here is the head() of this newly arranged data frame:

                   Country LifeExpectancy GirlsH1900 GirlsH1980  GirlsHeightChange
1              South Korea           77.9   143.2104   160.9055           17.69510
2                    Japan           82.3   143.3583   158.5073           15.14893
3                  Croatia           75.3   151.1788   165.8835           14.70473
4 Czech Republic (Czechia)           75.9   153.6532   167.5305           13.87726
5              Netherlands           79.2   155.8199   168.9368           13.11695
6                   Greece           78.9   151.0368   163.8708           12.83408

Summary of Data Manipulation Functions

select() selects a few variables (i.e., a few columns)

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers, and the first column of values is shaded in yellow to indicate that the select function will select specific variables, which are whole columns in a data frame.

filter() filters for particular cases (i.e., particular rows)

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. Two of the rows are shaded in yellow to indicate that the filter function will select specific rows based on specified criteria.

arrange() arranges the cases according to a particular variable (i.e., arranges rows in order)

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. Each row below the header is shaded in a slightly different shade of yellow. To the right, is the same grid, but reorganized so that each row is now ordered in a gradient from top to bottom, with the lightest shade of yellow at the top and the darkest shade at the bottom in order to demonstrate how the arrange function will arrange each row in a data frame based on a specific variable.

mutate() creates new variables (i.e., creates new columns)

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. To the right, is the same grid, but an extra column is added to the end and shaded in yellow to indicate the new column that is created when using the mutate function.
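The four functions can also be used one after another. The sketch below is one possible sequence, run on a small data frame typed in by hand from values shown on this page. The object names (MiniWorld, Heights, Over150, WithChange, Result) and the 150 cm cutoff are made up for this illustration, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical mini data frame, hand-copied from values shown on this page
MiniWorld <- data.frame(
  Country        = c("Guatemala", "Peru", "South Korea", "Japan"),
  LifeExpectancy = c(69.7, 70.7, 77.9, 82.3),
  GirlsH1900     = c(140.9926, 142.2386, 143.2104, 143.3583),
  GirlsH1980     = c(149.0530, 152.2495, 160.9055, 158.5073)
)

# select(): keep only some of the columns
Heights <- select(MiniWorld, Country, GirlsH1900, GirlsH1980)

# filter(): keep only the rows that meet a condition
# (the 150 cm cutoff is arbitrary, chosen for illustration)
Over150 <- filter(Heights, GirlsH1980 > 150)

# mutate(): add a new column
WithChange <- mutate(Over150, GirlsHeightChange = GirlsH1980 - GirlsH1900)

# arrange(): order the rows, largest height change first
Result <- arrange(WithChange, -GirlsHeightChange)
Result
```

Each step saves its result under a new name, so you can print any of the intermediate data frames to see what that step did.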
