3.3 Variables as Vectors
We have so far been accessing data through data frames. Data frames are made up of rows (each row is a case or observation, such as a penguin) and variables (some attribute that we have measured for each row, such as the body mass of a penguin).
When R does calculations across the rows of a data frame (for example, when we had it subtract the predicted body mass from the body mass for each row and save the result in a column called residual), it is actually treating the variables as vectors and doing something called vector math.
Vectors are a useful concept in programming, and so it’s worth spending a little time working with them.
Variables Inside Data Frames Are Vectors
We’ve been looking at variables such as body_mass_kg and flipper_length_m. But if you wanted to print out all the values of body_mass_kg, you couldn’t just write body_mass_kg. R wouldn’t know what you are talking about.
For R to see a variable such as body_mass_kg, it needs to know which data frame it is part of. If we want R to print out just the variable, we need to use the $ (dollar sign).
penguins$body_mass_kg
Notice the structure of this piece of code: dataframe$variable. First we tell R what data frame the variable is inside of, and then we tell it which variable we are looking for.
require(coursekata)
# run this to see the error
# then fix the code so that the values of this variable are printed out
body_mass_kg

# the fixed code tells R which data frame the variable is in
penguins$body_mass_kg
What gets printed out when we run this code isn’t a nice table of data with rows and columns. Instead, it’s a vector, which is just a series of elements (in this case, numbers) that are all the same type.
When we run the code penguins$body_mass_kg, we are extracting a vector of values from a single variable in the data frame, leaving behind such niceties as the name of the variable at the top of the column.
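To make the difference between a data frame and a vector concrete, here is a minimal sketch in base R. It assumes a small made-up data frame called toy_penguins (a hypothetical stand-in with invented values, not the real penguins data) so that it runs without loading any packages.

```r
# toy_penguins is a hypothetical stand-in for penguins; the values are made up
toy_penguins <- data.frame(
  body_mass_kg = c(3.75, 3.80, 3.25, 3.45),
  flipper_length_m = c(0.181, 0.186, 0.195, 0.193)
)

# the whole data frame prints as a table with rows and columns
toy_penguins

# $ pulls out one variable as a plain vector of numbers
mass <- toy_penguins$body_mass_kg
mass

# a vector has one element per row, and all elements share a single type
length(mass)
class(mass)

# the extracted vector is no longer a data frame
is.data.frame(mass)
```

Printing mass shows only the numbers; the column name and the table layout stay behind in the data frame.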
If we want, we can save this vector as a new object that exists outside of the data frame using the assignment operator (<-). For example, we could create an object called Y and assign to it a vector of all the values from our outcome variable, body_mass_kg. In the same way, we could create a vector called X and assign to it all the values from the predictor variable, flipper_length_m.
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
Once we have created these two vectors, we can re-make our scatter plot of body mass and flipper length with slightly simplified code that doesn’t require a data frame:
gf_point(Y ~ X)
When we make a graph like this using vectors, we are assuming that the vectors contain the same number of elements, and in the same order, as each other. Data frames make it easy to keep track of this because each penguin is in its own row. But if we maintain the order of the two vectors, the scatter plot of Y vs. X will be the same one we get when graphing body_mass_kg versus flipper_length_m from the penguins data frame.
Try it in the code block below.
require(coursekata)
# save body mass values to Y
Y <- penguins$body_mass_kg
# save flipper length values to X
X <- penguins$flipper_length_m
# this will make a scatter plot of Y and X
gf_point(Y ~ X)
Now you have two ways of making scatter plots: using data frames (e.g., gf_point(body_mass_kg ~ flipper_length_m, data = penguins)) or using vectors (e.g., gf_point(Y ~ X)).
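Plotting from vectors depends on the two vectors lining up element by element. The sketch below illustrates that idea in base R, assuming a small made-up data frame called toy_penguins (hypothetical values, not the real penguins data) and a hypothetical vector called X_reversed.

```r
# hypothetical values for illustration only (not the real penguins data)
toy_penguins <- data.frame(
  body_mass_kg = c(3.25, 3.45, 3.75, 4.20),
  flipper_length_m = c(0.181, 0.186, 0.193, 0.205)
)

Y <- toy_penguins$body_mass_kg
X <- toy_penguins$flipper_length_m

# both vectors have one element per row of the data frame
length(Y) == length(X)

# element 3 of Y and element 3 of X describe the same (third) penguin
Y[3]
X[3]

# reversing only X breaks the pairing: the heaviest penguin now appears
# next to the shortest flipper length
X_reversed <- rev(X)
data.frame(Y, X, X_reversed)
```

A scatter plot of Y against X_reversed would show a different pattern from Y against X, even though both vectors contain exactly the same numbers.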
Math with Vectors
These vectors we made in R also correspond to mathematical objects (also called vectors!). The nice thing about vectors in both programming and math is that we can do calculations with them that will affect every single number in the vector. Previously, when we used mutate() to create new variables in our data frames, we were actually doing math with vectors. (If you are interested in vectors, there is a college-level math course called Linear Algebra that is all about vectors!)
For example, Y represents body mass in kilograms. If we wanted to convert these measurements into grams, we would have to multiply each of the 333 body mass measurements by 1000. With vector math, we could just multiply the whole vector Y by 1000 like this:
Y * 1000
require(coursekata)
# this saves body mass values in kg to Y
Y <- penguins$body_mass_kg
# run this first to see what the body masses look like in kg
Y
# then multiply the vector by 1000 to see the body masses in grams
Y * 1000
Later in this chapter, doing math with vectors is going to simplify the R code for us!
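Here is a minimal sketch of vector math in base R. The numbers are made up for illustration, and the names Y_grams and predicted are hypothetical; predicted simply stands in for whatever predictions a model might produce.

```r
# hypothetical body masses in kg (made-up values for illustration)
Y <- c(3.25, 3.45, 3.75, 4.20)

# multiplying a vector by one number multiplies every element
Y_grams <- Y * 1000
Y_grams

# hypothetical model predictions in kg, one for each element of Y
predicted <- c(3.30, 3.50, 3.70, 4.00)

# subtracting two vectors of the same length works element by element,
# which is how a residual can be computed for every row at once
residual <- Y - predicted
residual
```

The subtraction is the same kind of calculation we asked for inside mutate() when we created the residual column, just written directly with vectors.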
Vectors as Inputs to Functions
Another thing we can do with vectors is put a whole vector of values into a function and immediately generate a whole set of outputs. If we run the code better_function(0.21), it will return a single predicted body mass for a penguin with a flipper length of 0.21 meters based on the better_function model. But if we instead run:
better_function(X)
it will generate a model prediction for every value we have saved in the vector X.
If we want to overlay predictions of the better_function model on the scatter plot of Y ~ X, we can write something like this:
gf_point(Y ~ X) %>%
gf_point(better_function(X) ~ X, color = "blue3")
Notice how much less cumbersome this is than creating a new variable in the penguins data frame called prediction, then overlaying the new variable’s points on top of the scatter plot. One method isn’t better than the other, but sometimes vectors are easier to work with.
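The sketch below shows the same idea in base R with a function we can see inside. Since better_function is not defined on this page, it assumes a hypothetical stand-in called example_function, with a made-up intercept and slope, and made-up values for X and Y.

```r
# example_function is a hypothetical stand-in for better_function;
# the intercept (-5) and slope (45) are made up for illustration
example_function <- function(x) {
  -5 + 45 * x
}

# hypothetical flipper lengths (m) and body masses (kg)
X <- c(0.181, 0.186, 0.193, 0.205)
Y <- c(3.25, 3.45, 3.75, 4.20)

# one input gives one prediction
example_function(0.21)

# a vector of inputs gives a vector of predictions, one per element
predictions <- example_function(X)
predictions

# predictions can go straight into vector math, e.g., to get residuals
Y - predictions
```

Because the function is built from ordinary arithmetic, R applies it to every element of X without any extra code.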