3.3 Variables as Vectors
We have so far been accessing data through data frames. Data frames are made up of rows (each row is a case or observation, such as a penguin) and variables (some attribute that we have measured for each row, such as the body mass of a penguin).
When R does calculations across the rows of a data frame (for example, when we had it subtract the predicted body mass from the body mass for each row and save the result in a column called residual), it is actually treating the variables as vectors and doing something called vector math.
Vectors are a useful concept in programming, and so it’s worth spending a little time working with them.
Variables Inside Data Frames Are Vectors
We’ve been looking at variables such as body_mass_kg and flipper_length_m. But if you wanted to print out all the values of body_mass_kg, you couldn’t just write body_mass_kg. R wouldn’t know what you are talking about.
For R to see a variable such as body_mass_kg, it needs to know which data frame it is part of. If we want R to print out just the variable, we need to use the $ (dollar sign).
penguins$body_mass_kg
Notice the structure of this piece of code: dataframe$variable. First we tell R what data frame the variable is inside of, and then we tell it which variable we are looking for.
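Here is a small sketch of the dataframe$variable pattern that you can run without the course data. The data frame toy_penguins and its three rows of values are made up for illustration; they are not the real penguins data.

```r
# Hypothetical toy data frame (made-up values, not the course's penguins data)
toy_penguins <- data.frame(
  body_mass_kg = c(3.75, 3.80, 3.25),
  flipper_length_m = c(0.181, 0.186, 0.195)
)

# dataframe$variable: name the data frame first, then the variable
toy_penguins$body_mass_kg

# On its own, the variable name does not refer to anything R can find
exists("body_mass_kg")
```

The last line asks R whether an object called body_mass_kg exists outside the data frame, which is why typing the bare name produces an error.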
require(coursekata)
# run this to see the error
# then fix the code so that the values of this variable are printed out
body_mass_kg
What gets printed out when we run this code isn’t a nice table of data with rows and columns. Instead, it’s a vector, which is just a series of elements (in this case, numbers) that are all the same type.
When we run the code
penguins$body_mass_kg
we are extracting a vector of values from a single variable in the data frame, leaving behind such niceties as the name of the variable at the top of the column.
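To see what "just a vector" means, the sketch below inspects an extracted column using a few base R functions. The toy_penguins data frame and the object name mass are assumptions made for this example.

```r
# Hypothetical toy data frame (made-up values)
toy_penguins <- data.frame(body_mass_kg = c(3.75, 3.80, 3.25))

# Pull the column out of the data frame
mass <- toy_penguins$body_mass_kg

is.vector(mass)  # checks whether mass is a plain vector
class(mass)      # the single type shared by every element
length(mass)     # how many elements the vector holds
names(mass)      # the column name did not come along with the values
```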
If we want, we can save this vector as a new object that exists outside of the data frame using the <- assignment operator. For example, we could create an object called Y and assign (<-) to it a vector of all values from our outcome variable body_mass_kg. We could also create a vector called X and assign (<-) to it all values from the predictor variable flipper_length_m.
Y <- penguins$body_mass_kg
X <- penguins$flipper_length_m
Once we have created these two vectors, we can re-make our scatter plot of body mass and flipper length with slightly simplified code that doesn’t require a data frame:
gf_point(Y ~ X)
When we make a graph like this using vectors we are assuming that the vectors contain the same number of elements, and in the same order, as each other. Data frames make it easy to keep track of this because each penguin is in its own row. But if we maintain the order of the two vectors, the scatter plot of Y vs. X will be the same one we get when graphing body_mass_kg versus flipper_length_m from the penguins data frame. Try it in the code block below.
require(coursekata)
# save body mass values to Y
Y <-
# save flipper length values to X
X <-
# this will make a scatter plot of Y and X
gf_point(Y ~ X)
Now you have two ways of making scatter plots: using data frames (e.g., gf_point(body_mass_kg ~ flipper_length_m, data = penguins)) or using vectors (e.g., gf_point(Y ~ X)).
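Because the vector version depends on Y and X lining up element by element, it can help to check that assumption before plotting. The sketch below uses a made-up toy_penguins data frame; the object X_reversed is a hypothetical name used only to show what goes wrong when the order changes.

```r
# Hypothetical toy data frame (made-up values)
toy_penguins <- data.frame(
  body_mass_kg = c(3.75, 3.80, 3.25),
  flipper_length_m = c(0.181, 0.186, 0.195)
)

Y <- toy_penguins$body_mass_kg
X <- toy_penguins$flipper_length_m

# Both vectors should have one element per penguin
length(Y) == length(X)

# Reversing X keeps the same values but pairs them with the wrong penguins
X_reversed <- rev(X)
identical(X_reversed, X)
```

A scatter plot of Y against X_reversed would still draw three points, but each point would mix measurements from two different penguins.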
Math with Vectors
These vectors we made in R also correspond to mathematical objects (also called vectors!). The nice thing about vectors in both programming and math is that we can do calculations with them that will affect every single number in the vector. Previously when we used mutate() to create new variables in our data frames, we were actually doing math with vectors. (If you are interested in vectors, there is a college-level math course called Linear Algebra that is all about vectors!)
For example, Y represents body mass in kilograms. If we wanted to convert these measurements into grams, we would have to multiply each of the 333 body mass measurements by 1000. With vector math, we could just multiply the whole vector Y by 1000 like this:
Y * 1000
require(coursekata)
# this saves body mass values in kg to Y
Y <- penguins$body_mass_kg
# run this first to see what the body masses look like in kg
# then multiply the vector by 1000 to see the body masses in grams
Y
Later in this chapter, doing math with vectors is going to simplify the R code for us!
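As a rough sketch of the two kinds of vector math mentioned so far, the block below multiplies a vector by a single number and subtracts one vector from another. All of the values are made up, and the names Y_grams and predicted are assumptions for this example.

```r
# Hypothetical body masses in kg (made-up values)
Y <- c(3.75, 3.80, 3.25)

# A vector times a single number: every element is multiplied by 1000
Y_grams <- Y * 1000
Y_grams

# A vector minus another vector of the same length: element-by-element subtraction
predicted <- c(3.50, 3.90, 3.25)  # hypothetical model predictions
residual <- Y - predicted
residual
```

The subtraction is the same kind of calculation that produced the residual column earlier, just done on vectors outside a data frame.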
Vectors as Inputs to Functions
Another thing we can do with vectors is put a whole vector of values into a function and immediately generate a whole set of outputs. If we run the code better_function(0.21), it will return a single predicted body mass for a penguin with a flipper length of 0.21 meters based on the better_function model. But if we instead run:
better_function(X)
it will generate a model prediction for every value we have saved in the vector X.
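The sketch below shows the same idea with a stand-in function. The function toy_model and its numbers (-5.8 and 50) are invented for this example and are not the actual better_function from earlier in the course; the flipper lengths in X are also made up.

```r
# Hypothetical stand-in for a model function (made-up intercept and slope)
toy_model <- function(flipper_length_m) {
  -5.8 + 50 * flipper_length_m
}

# One input gives one prediction
toy_model(0.21)

# A vector of inputs gives a vector of predictions, one per element
X <- c(0.181, 0.186, 0.195)  # hypothetical flipper lengths in meters
predictions <- toy_model(X)
predictions
length(predictions) == length(X)
```

This works because the arithmetic inside the function is itself vector math, so R applies it to every element of X.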
If we want to overlay predictions of the better_function model on the scatter plot of Y ~ X we can write something like this:
gf_point(Y ~ X) %>%
gf_point(better_function(X) ~ X, color = "blue3")
Notice how much less cumbersome this is than creating a new variable in the penguins data frame called prediction, then overlaying the new variable’s points on top of the scatter plot. One method isn’t better than the other, but sometimes vectors are easier to work with.
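To compare the two routes side by side, here is a minimal sketch that computes predictions both ways and checks that they match. It uses base R's $ to add the column rather than mutate(), and toy_penguins, toy_model, and prediction_vector are all made-up names and values for this example.

```r
# Hypothetical toy data and stand-in model (made-up values)
toy_penguins <- data.frame(flipper_length_m = c(0.181, 0.186, 0.195))
toy_model <- function(flipper_length_m) {
  -5.8 + 50 * flipper_length_m
}

# Data frame route: save the predictions as a new variable called prediction
toy_penguins$prediction <- toy_model(toy_penguins$flipper_length_m)

# Vector route: work with the values outside the data frame
X <- toy_penguins$flipper_length_m
prediction_vector <- toy_model(X)

# Both routes produce the same numbers in the same order
identical(toy_penguins$prediction, prediction_vector)
```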