3.3 Variables as Vectors

We have so far been accessing data through data frames. Data frames are made up of rows (each row is a case or observation, such as a penguin) and variables (some attribute that we have measured for each row, such as the body mass of a penguin).

When R does calculations across the rows of a data frame (for example, when we had it subtract the predicted body mass from the body mass for each row and save the result in a column called residual), it is actually treating the variables as vectors and doing something called vector math.

Vectors are a useful concept in programming, and so it’s worth spending a little time working with them.

Variables Inside Data Frames Are Vectors

We’ve been looking at variables such as body_mass_kg and flipper_length_m. But if you wanted to print out all the values of body_mass_kg, you couldn’t just write body_mass_kg. R wouldn’t know what you are talking about.

For R to see a variable such as body_mass_kg, it needs to know which data frame it is part of. If we want R to print out just the variable, we need to use the $ (dollar sign).

penguins$body_mass_kg

Notice the structure of this piece of code: dataframe$variable. First we tell R what data frame the variable is inside of, and then we tell it which variable we are looking for.
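As a minimal sketch of the dataframe$variable pattern, the code below builds a small stand-in data frame. The name toy_penguins and its four rows of values are made up for illustration; they are not the course's penguins data.

```r
# Hypothetical stand-in for the penguins data frame (values are made up)
toy_penguins <- data.frame(
  body_mass_kg     = c(3.75, 3.80, 3.25, 4.40),
  flipper_length_m = c(0.181, 0.186, 0.195, 0.210)
)

# dataframe$variable: name the data frame first, then the variable
toy_penguins$body_mass_kg

# Ask whether R can find an object called body_mass_kg on its own,
# without being told which data frame it lives in
exists("body_mass_kg")
```

In a fresh R session the last line should report that no such object exists, which is why typing the bare variable name produces an error.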

require(coursekata)

# run this to see the error
# then fix the code so that the values of this variable are printed out
body_mass_kg

What gets printed out when we run this code isn’t a nice table of data with rows and columns. Instead, it’s a vector, which is just a series of elements (in this case, numbers) that are all the same type.

A 3 by 5 grid of gray squares to symbolize a data frame, where the top row is dark gray to indicate the column headers. The first column below the headers is shaded in yellow. To the right, the column of yellow squares is extracted without the header to demonstrate that the column is a vector of values without a header.

When we run the code penguins$body_mass_kg, we are extracting a vector of values from a single variable in the data frame, leaving behind such niceties as the name of the variable at the top of the column.
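One way to check these claims is to inspect an extracted column directly. The sketch below again uses a made-up data frame (toy_penguins is a hypothetical name, not part of the course data) and a few base R functions.

```r
# Hypothetical stand-in for the penguins data frame (values are made up)
toy_penguins <- data.frame(
  body_mass_kg     = c(3.75, 3.80, 3.25, 4.40),
  flipper_length_m = c(0.181, 0.186, 0.195, 0.210)
)

# Extract one column as a stand-alone vector
mass <- toy_penguins$body_mass_kg

is.vector(mass)                      # TRUE: the column is a plain vector
class(mass)                          # "numeric": every element is a number
length(mass) == nrow(toy_penguins)   # TRUE: one element per row
names(mass)                          # NULL: the column name was left behind

# A vector holds only one type, so mixing a number with text
# turns every element into text
mixed <- c(3.75, "heavy")
class(mixed)                         # "character"
```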

If we want, we can save this vector as a new object that exists outside of the data frame using the <- assignment operator. For example, we could create an object called Y and use <- to assign it a vector of all the values from our outcome variable, body_mass_kg. Likewise, we could create a vector called X and assign it all the values from the predictor variable, flipper_length_m.

Y <-  penguins$body_mass_kg
X <-  penguins$flipper_length_m

Once we have created these two vectors, we can re-make our scatter plot of body mass and flipper length with slightly simplified code that doesn’t require a data frame:

gf_point(Y ~ X)

When we make a graph like this using vectors we are assuming that the vectors contain the same number of elements, and in the same order, as each other. Data frames make it easy to keep track of this because each penguin is in its own row. But if we maintain the order of the two vectors, the scatter plot of Y vs. X will be the same one we get when graphing body_mass_kg versus flipper_length_m from the penguins data frame. Try it in the code block below.
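To see why order matters, here is a small sketch with a made-up data frame (toy_penguins and its values are hypothetical). It checks that the two vectors line up element by element, then shows how reordering just one of them would break the pairing.

```r
# Hypothetical stand-in for the penguins data frame (values are made up)
toy_penguins <- data.frame(
  body_mass_kg     = c(3.75, 3.80, 3.25, 4.40),
  flipper_length_m = c(0.181, 0.186, 0.195, 0.210)
)

Y <- toy_penguins$body_mass_kg
X <- toy_penguins$flipper_length_m

# Both vectors have one element per row of the data frame
length(Y) == length(X)

# Element 2 of each vector comes from the same row (the same penguin)
Y[2]
X[2]

# Sorting only Y changes its order, so Y_sorted[2] and X[2]
# would no longer describe the same penguin
Y_sorted <- sort(Y)
identical(Y_sorted, Y)
```

A scatter plot of Y_sorted against X would pair each flipper length with the wrong body mass, which is exactly what the data frame's rows protect against.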

require(coursekata)

# save body mass values to Y
Y <- penguins$body_mass_kg

# save flipper length values to X
X <- penguins$flipper_length_m

# this will make a scatter plot of Y and X
gf_point(Y ~ X)

A scatter plot of body_mass_kg predicted by flipper_length_m, but with the labels Y and X for each variable.

Now you have two ways of making scatter plots: using data frames (e.g., gf_point(body_mass_kg ~ flipper_length_m, data = penguins)) or using vectors (e.g., gf_point(Y ~ X)).

Math with Vectors

These vectors we made in R also correspond to mathematical objects (also called vectors!). The nice thing about vectors in both programming and math is that we can do calculations with them that will affect every single number in the vector. Previously when we used mutate() to create new variables in our data frames, we were actually doing math with vectors. (If you are interested in vectors, there is a college-level math course called Linear Algebra that is all about vectors!)

For example, Y represents body mass in kilograms. If we wanted to convert these measurements into grams, we would have to multiply each of the 333 body mass measurements by 1000. With vector math, we could just multiply the whole vector Y by 1000 like this:

Y * 1000

require(coursekata)

# this saves body mass values in kg to Y
Y <- penguins$body_mass_kg

# run this first to see what the body masses look like in kg
# then multiply the vector by 1000 to see the body masses in grams
Y
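The same idea can be tried on a short vector where the arithmetic is easy to follow. In this sketch the body masses and the predicted values are made up for illustration; predicted is a hypothetical vector, not output from any model in the course.

```r
# Made-up body masses in kilograms
Y <- c(3.75, 3.80, 3.25, 4.40)

# One multiplication converts every element to grams
Y_grams <- Y * 1000
Y_grams

# Math between two vectors of the same length works element by element.
# Subtracting made-up predictions from Y gives one residual per penguin.
predicted <- c(3.60, 3.90, 3.50, 4.20)
residual <- Y - predicted
residual
```

The residual vector has the same length as Y, with each element computed from the matching elements of Y and predicted.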

Later in this chapter, doing math with vectors is going to simplify the R code for us!

Vectors as Inputs to Functions

Another thing we can do with vectors is put a whole vector of values into a function and immediately generate a whole set of outputs. If we run the code better_function(0.21) it will return a single predicted body mass for a penguin with a flipper length of 0.21 meters based on the better_function model. But if we instead run:

better_function(X)

it will generate a model prediction for every value we have saved in the vector X.
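To illustrate, the sketch below defines a stand-in function. The name toy_function and its coefficients are invented for this example; they are not the definition of better_function from earlier in the course.

```r
# Hypothetical linear model: the intercept and slope are made up
toy_function <- function(flipper_length_m) {
  -5.8 + 50 * flipper_length_m
}

# Made-up flipper lengths in meters
X <- c(0.181, 0.186, 0.195, 0.210)

# One input gives one prediction
toy_function(0.21)

# A whole vector of inputs gives a whole vector of predictions
predictions <- toy_function(X)
predictions

# There is one prediction for each value in X
length(predictions) == length(X)
```

This works because the arithmetic inside the function is itself vector math, so the function never needs to know how many values it was given.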

If we want to overlay predictions of the better_function model on the scatter plot of Y ~ X we can write something like this:

gf_point(Y ~ X) %>%
  gf_point(better_function(X) ~ X, color = "blue3")

A scatter plot of body_mass_kg predicted by flipper_length_m, but with the labels Y and X for each variable. A series of blue dots are plotted on the graph as predictions and run through the center of the data points in a line.

Notice how much less cumbersome this is than creating a new variable in the penguins data frame called prediction, then overlaying the new variable’s points on top of the scatter plot. One method isn’t better than the other, but sometimes vectors are easier to work with.
