1.5 Working With Data Frames in R
Now for the moment you all have been waiting for: It’s time to work with real data in R! For handling datasets, R has a special object type called a data frame. Data frames look like this:
  TableID Tip   Condition
1       1  39     Control
2       2  36     Control
3       3  34     Control
4      42  21 Smiley Face
5      43  21 Smiley Face
6      44  17 Smiley Face
These are six rows of a data frame called TipExperiment. The full data frame is from an experiment that randomly assigned tables at a restaurant to receive checks that either included smiley faces (Smiley Face) or didn’t include smiley faces (Control). Each row represents a different table from the experiment. The researchers recorded how much each table tipped as a percentage of their total check (e.g., a table may have tipped 17% of their total).
The rows in a data frame represent the cases sampled, with each row being a single case. In the TipExperiment data frame, the cases (which are sometimes called observations) are tables. Depending on the study, the rows could be people, states, couples, mice – any cases you take a sample of in order to collect data.
The columns of the data frame (labeled TableID, Tip, and Condition) represent variables, or the attributes of each case that could vary from row to row.
This dataset is organized in a “tidy” format (a term coined by statistician Hadley Wickham). It’s generally good practice to format our datasets in a tidy way (“keep things tidy”). The key aspects of a tidy dataset are:
- Each row is an observation (or case)
- Each column is a variable
- Each cell contains a value for the particular observation and variable
Later in the course, we’ll analyze this data to determine if there’s convincing evidence that smiley faces result in higher tips. You can read more about the data here: TipExperiment R documentation.
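To make the tidy layout concrete, here is a minimal sketch that types a tiny data frame in by hand, using only the six rows displayed at the top of this section. The name tips is made up for this illustration; it is a stand-in and not the course's TipExperiment object, and it uses only base R.

```r
# Toy data frame typed in by hand from the six rows shown at the top of this
# section. The name `tips` is hypothetical; it is not the TipExperiment object.
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

nrow(tips)       # number of rows: one per table (case)
ncol(tips)       # number of columns: one per variable
names(tips)      # the variable names
tips[4, "Tip"]   # a single cell: the Tip value for the table in row 4
```

Each row holds one table, each column holds one variable, and each cell holds one value, which is all that "tidy" asks for.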
Peeking at a Data Frame
As with any object in R, you can just type the name of the data frame to see the whole thing.
In the code block below, type the name of the data frame TipExperiment and then Run.
require(coursekata)
# Try typing TipExperiment to see what is in the data frame.
TipExperiment
Be sure to scroll up to see the whole output. Once you do, you might think to yourself, “Wow, that’s a lot to take in!” This is usually the case when working with real data frames, which often include many rows and many columns. Here, we don’t just have data from one table; we have a bunch of tables, each with its own values for different variables.
head() and tail(). It’s often useful to take a quick peek at your data frame without printing out the whole thing. One way to do this is with the head() command.
Press Run to execute head(TipExperiment).
require(coursekata)
# Run this code to get the first 6 rows of TipExperiment
head(TipExperiment)
  TableID Tip Condition
1       1  39   Control
2       2  36   Control
3       3  34   Control
4       4  34   Control
5       5  33   Control
6       6  31   Control
The head() function (or command) prints out just the first six rows of the data frame. (You can also try the tail() function, which prints the last six rows.)
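Six rows is only the default. As a small sketch, both functions accept an n argument for the number of rows to show. The data frame tips below is a hypothetical toy typed in from the six rows at the top of this section, not the full TipExperiment data.

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

first_three <- head(tips, n = 3)  # first three rows
last_two    <- tail(tips, n = 2)  # last two rows

first_three
last_two
```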
str() and glimpse(). These functions show the overall structure of the data frame, including the number of observations, number of variables, names of variables and so on. We often use str() or glimpse() when first exploring a new data frame, just to see what’s in it.
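As a rough sketch of the str() half of this pair, the code below runs str() on a hypothetical toy data frame called tips (typed in from the six rows at the top of this section; it is not TipExperiment). The sapply() line is an extra, not something introduced in this course so far: it simply asks R for the type of each column.

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# Compact summary: number of observations, number of variables,
# and each variable's name, type, and first few values.
str(tips)

# The type R assigns to each column.
column_types <- sapply(tips, class)
column_types
```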
Run glimpse(TipExperiment) and look at the results.
require(coursekata)
# Use glimpse() to see what’s in TipExperiment
glimpse(TipExperiment)
Rows: 44
Columns: 3
$ TableID <int> 22, 44, 21, 20, 18, 19, 42, 43, 17, 41, 16, 40, 38, 39, 15, …
$ Tip <dbl> 8, 17, 18, 20, 21, 21, 21, 21, 22, 22, 23, 23, 24, 24, 25, 2…
$ Condition <fct> Control, Smiley Face, Control, Control, Control, Control, Sm…
dataframe$variable. Notice in the output above there is a $ in front of each variable name (in front of TableID, Tip, and Condition). In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Tip variable in the TipExperiment data frame, for example, you would write TipExperiment$Tip. (R has its own way of categorizing variables, such as int, num, and factor. You will learn more about these later.)
Try using the $ to tell R to look in the TipExperiment data frame to get the contents of the variable Condition.
require(coursekata)
# Use the $ sign to print out the contents of the Condition variable in the TipExperiment data frame
TipExperiment$Condition
Control Control Control Control Control Control
Control Control Control Control Control Control
Control Control Control Control Control Control
Control Control Control Control Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face Smiley Face
Smiley Face Smiley Face
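What $ hands back is just the one column, pulled out of the data frame as a vector. The sketch below illustrates this on a hypothetical toy data frame called tips (six rows typed in from the top of this section, not the full TipExperiment data). The double-bracket form and levels() are extras shown only for comparison.

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

tips$Tip                 # the Tip column on its own, as a numeric vector
length(tips$Tip)         # one value per row of the data frame
levels(tips$Condition)   # the distinct categories in Condition

# Another way to pull out the same column, by name in double brackets.
same_column <- identical(tips$Tip, tips[["Tip"]])
same_column
```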
Using brackets to refer to specific rows. To refer to a specific row of a data frame, you can use brackets after the name of the data frame, similar to what we did before with vectors. For example, TipExperiment[1, ] will print out the first row of the TipExperiment data frame. (Inside the brackets, the order is “row”, “column”. Leaving out the column value prints all the columns.)
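Here is a short sketch of the “row”, “column” order in action, using a hypothetical toy data frame called tips (six rows typed in from the top of this section, standing in for TipExperiment).

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

first_row  <- tips[1, ]        # row 1, all columns
tip_column <- tips[, "Tip"]    # all rows, only the Tip column
one_cell   <- tips[4, 2]       # row 4, column 2 (Tip)
two_rows   <- tips[c(1, 4), ]  # rows 1 and 4, all columns

first_row
tip_column
one_cell
two_rows
```

Leaving the spot before the comma empty means "all rows"; leaving the spot after it empty means "all columns".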
Using the brackets, you can also find the rows that meet certain conditions. What do you think this code will do?
TipExperiment[TipExperiment$Condition == "Control", ]
It will print out all the rows in which the variable Condition is equal to Control. Try it out in the window below.
require(coursekata)
# Print out all the rows in which the variable Condition is equal to "Control"
TipExperiment[TipExperiment$Condition == "Control", ]
You can also combine conditions inside the brackets using “and” (&) or “or” (|). For example, if we wanted to find all the tables that tipped more than 40 percent or less than 5 percent, we could write:
TipExperiment[TipExperiment$Tip > 40 | TipExperiment$Tip < 5, ]
Note: To find the | symbol on your keyboard, look above the return key or near the bracket ([ ], { }) keys.
See if you can figure out in the window below how to print out all the rows in which the tables are both in the “Smiley Face” condition and also tipped less than 20%.
require(coursekata)
# Print out all the rows for tables in the "Smiley Face" condition that also tipped less than 20 percent
TipExperiment[TipExperiment$Condition == "Smiley Face" & TipExperiment$Tip < 20, ]
We see there is only one table that fits this description.
  TableID Tip   Condition
2      44  17 Smiley Face
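It can help to see what R is doing inside the brackets. Each condition produces one TRUE or FALSE per row, and the brackets keep only the rows marked TRUE. The sketch below walks through this on a hypothetical toy data frame called tips (six rows typed in from the top of this section, not the full TipExperiment data). The names is_smiley and low_tip are made up for the illustration.

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# Each comparison gives one TRUE/FALSE value per row.
is_smiley <- tips$Condition == "Smiley Face"
low_tip   <- tips$Tip < 20
is_smiley
low_tip

# & keeps rows where BOTH conditions are TRUE.
both <- tips[is_smiley & low_tip, ]
both

# | keeps rows where AT LEAST ONE condition is TRUE.
either <- tips[tips$Tip > 38 | tips$Tip < 20, ]
either
```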
tally(). It might be useful to be able to count up how many tables were in each condition (e.g., how many tables were in the “Smiley Face” condition versus the “Control” condition). We can use the tally() function to create a frequency table for a particular variable.
This line of R code will produce a frequency table for the Condition variable (in the TipExperiment data frame).
tally(TipExperiment$Condition)
Alternatively, we could specify the variable and data frame separately like this:
tally(~ Condition, data = TipExperiment)
Both ways of writing tally() will result in a frequency table that tallies up how many tables were in each condition. (Notice this time we had to put a tilde, ~, in front of the variable name. This is required when we include data= as an argument.)
Condition
    Control Smiley Face
         22          22
We can see from the output that the two experimental groups (smiley face and control) are balanced in size: 22 restaurant tables were assigned to each condition.
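If you ever find yourself without tally(), base R can build the same kind of frequency table. The sketch below is an aside using base R's table() and xtabs(), not the tally() function this course uses, and it runs on a hypothetical toy data frame called tips (six rows typed in from the top of this section), so the counts are 3 and 3 rather than 22 and 22.

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# Base R analog of tally(tips$Condition): count the rows in each category.
counts <- table(tips$Condition)
counts

# Base R analog of the formula style, with a tilde and a data = argument.
counts_formula <- xtabs(~ Condition, data = tips)
counts_formula
```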
arrange(). Let’s turn our attention to the outcome the researchers were interested in: Tip. What was the lowest percentage tipped by any of the tables? One way we could answer this question would be to sort the dataset by Tip, from low to high, using the arrange() function.
arrange(TipExperiment, Tip)
Importantly, when you arrange a data frame based on the values of one variable (e.g., Tip), it sorts whole rows, not just the column for that one variable. This ensures that the data for each table (TableID, Tip, and Condition) stays together as the tables are re-arranged from lowest to highest tip percentages.
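To see that whole rows really do travel together, here is a small sketch. It uses base R's order() inside brackets rather than arrange(), purely so it runs without any packages, and it works on a hypothetical toy data frame called tips (six rows typed in from the top of this section, not the full TipExperiment data).

```r
# Hypothetical toy data frame (six rows copied from the top of this section).
tips <- data.frame(
  TableID   = c(1L, 2L, 3L, 42L, 43L, 44L),
  Tip       = c(39, 36, 34, 21, 21, 17),
  Condition = factor(c("Control", "Control", "Control",
                       "Smiley Face", "Smiley Face", "Smiley Face"))
)

# order() returns the row positions that would sort Tip from low to high;
# using those positions inside the brackets reorders entire rows.
by_tip <- tips[order(tips$Tip), ]
by_tip

# Table 44 tipped 17 before sorting, and still does after sorting.
tips$Tip[tips$TableID == 44]
by_tip$Tip[by_tip$TableID == 44]
```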
If you want to save the data frame after you sort the rows into a new order, you can use the assignment operator (<-). See if you can edit the code below to save the version of TipExperiment that is arranged by Tip back into TipExperiment. Then print out the first six lines of TipExperiment using head().
require(coursekata)
# save TipExperiment, arranged by Tip, back to TipExperiment
arrange(TipExperiment, Tip)
# write code to print out the first 6 rows of TipExperiment
One way to do this:
# save TipExperiment, arranged by Tip, back to TipExperiment
TipExperiment <- arrange(TipExperiment, Tip)
# print out the first 6 rows of TipExperiment
head(TipExperiment)
Notice that the tables are now arranged from the lowest-tipping to the highest-tipping.
  TableID Tip   Condition
1      22   8     Control
2      44  17 Smiley Face
3      21  18     Control
4      20  20     Control
5      18  21     Control
6      19  21     Control
The function arrange() can also be used to arrange values in descending order by adding desc() around the variable name. If we added the function desc() (as in the code below), the highest tipping tables would be at the top.
arrange(TipExperiment, desc(Tip))
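For comparison, here is a sketch of a descending sort that is saved back with the assignment operator. It uses base R's order() with decreasing = TRUE as a stand-in for arrange() and desc(), so that it runs without any packages. The data frame tips is a hypothetical toy: the same six rows shown at the top of this section, deliberately typed in a scrambled order so the sort has something to do.

```r
# Hypothetical toy data frame: six rows from the top of this section,
# entered in a scrambled order on purpose.
tips <- data.frame(
  TableID   = c(42L, 1L, 44L, 2L, 43L, 3L),
  Tip       = c(21, 39, 17, 36, 21, 34),
  Condition = factor(c("Smiley Face", "Control", "Smiley Face",
                       "Control", "Smiley Face", "Control"))
)

# Sort rows from highest to lowest Tip and save the result back into tips.
tips <- tips[order(tips$Tip, decreasing = TRUE), ]

# The highest-tipping tables are now at the top.
head(tips, n = 3)
```

Without the assignment back into tips, the sorted version would be printed but the data frame itself would stay in its original order.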