2.3 Using Algebraic Functions as Statistical Models

Up to now we have used the straight line as a statistical model. We can represent the model as a line on a graph or as a linear equation (e.g., \(\text{body_mass_kg} = b_0 + b_1\text{flipper_length_m}\)). The line can also be thought of as a function, which is a method of turning an input (such as flipper length) into an output (a prediction of body mass).

The linear equation is an algebraic function. An algebraic function uses mathematical operations (e.g., addition, multiplication) to transform an input value (an \(X\)) into an output value (a \(Y\)). Functions, by definition, produce only one output for any particular input, which makes them especially suitable for use as statistical models.

Not all functions are algebraic functions, however. A computer algorithm can also define a series of steps, possibly a long and complex one, that takes an input value (an \(X\)) and produces an output (a \(Y\)). These computer functions can also work as statistical models because they take a value on a predictor variable and produce a predicted value on an outcome variable.
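To make the idea of a non-algebraic function concrete, here is a minimal sketch in R of a function built from a series of steps rather than a single equation. The name rule_based_model and its cutoffs and outputs are hypothetical values chosen only for illustration; they are not estimated from the penguins data.

```r
# Hypothetical rule-based model (illustration only):
# the cutoffs and outputs are made up, not estimated from data
rule_based_model <- function(X){
  if (X < 0.19) {
    3.5   # output for shorter flippers
  } else if (X < 0.21) {
    4.2   # output for mid-length flippers
  } else {
    5.0   # output for longer flippers
  }
}

# One input (flipper length in meters) still gives exactly one output
step_prediction <- rule_based_model(0.22)
step_prediction
```

Like an algebraic function, this sketch returns a single output for each input, which is what lets it play the role of a model.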

In this book we will focus on using algebraic functions as statistical models, but we will use R to implement these functions and make predictions. As we will see, using R will make it much easier to work with algebraic functions as statistical models of data.

An Algebraic Function to Predict Body Mass

Let’s go back to the line we proposed as a model using flipper length to predict body mass. One line we tried had a y-intercept of -5.5 and a slope of 49. We plotted the line below using the gf_abline() function.

gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_abline(intercept = -5.5, slope = 49)

Figure: A scatter plot of body_mass_kg against flipper_length_m. The line with intercept -5.5 and slope 49 is drawn in blue and runs through the center of the data points.

The \(b_0+b_1X\) part of the GLM equation for a line is an algebraic function. Our model, which we graphed above, could be written in algebraic notation like this:

\(\text{our_function}(X) = -5.5 + 49X\)

We decided to name the function \(\text{our_function}(X)\). In algebra, functions are often abbreviated as \(f(X)\), but we are free to name them whatever we want. The important thing is that \(\text{our_function}(X)\) (or \(f(X)\)) tells us how to take any value of \(X\) (which in this case is flipper_length_m) and produce a value for the outcome variable (body_mass_kg).
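As a worked illustration (the input value 0.20 is our own choice, not one taken from the data), substituting a flipper length of 0.20 meters for \(X\) gives:

\(\text{our_function}(0.20) = -5.5 + 49(0.20) = -5.5 + 9.8 = 4.3\)

So this particular line would predict a body mass of 4.3 kg for a penguin with a 0.20 meter flipper.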

Programming a Custom Function in R

You can also turn your model into a custom R function. To define our_function in R, we can use this code:

our_function <- function(X){-5.5 + 49*X}

Note that we used the assignment operator (<-) to save the function (function(X){-5.5 + 49*X}) as an R object called our_function.

Once we have saved the function, we can use it to predict body mass as a function of flipper length for any value of flipper_length_m. For example, if we want the predicted body mass for a penguin with a flipper length of 0.22 meters, we can run this code:

our_function(0.22)
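Just as in algebraic notation, the name to the left of the assignment operator is a free choice. The sketch below saves the same recipe under a second name, f, which is a hypothetical name used only for this illustration, and checks that both names give the same prediction.

```r
# The function as defined in the text
our_function <- function(X){-5.5 + 49*X}

# The same recipe saved under a different (hypothetical) name
f <- function(X){-5.5 + 49*X}

# TRUE if both functions return the same prediction for 0.22
same_prediction <- identical(our_function(0.22), f(0.22))
same_prediction
```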

In the code block below, we have put in the code to define our_function as a function in R. Use it to generate predictions for flipper lengths of 0.22, 0.23, and 0.24.

require(coursekata)

# this creates our_function
our_function <- function(X){-5.5 + 49*X}

# use it to generate predictions of body mass
# for flipper lengths of 0.22, 0.23, and 0.24
our_function(0.22)
our_function(0.23)
our_function(0.24)
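Because R arithmetic works element by element on vectors, the three predictions can also be generated in a single call. This is a minimal sketch of that alternative; the object names flipper_lengths and predictions are our own and do not come from the text.

```r
# The function as defined in the text
our_function <- function(X){-5.5 + 49*X}

# Put the three flipper lengths (in meters) into one vector
flipper_lengths <- c(0.22, 0.23, 0.24)

# R applies -5.5 + 49*X to each element, returning three predictions
predictions <- our_function(flipper_lengths)
predictions
```

The result holds one predicted body mass per flipper length, in the same order as the inputs.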

Using R to Evaluate Algebraic Functions

When we run the code our_function(0.22) we are evaluating a function. Evaluating a function simply means substituting a value for \(X\) and then generating a prediction for \(Y\). When we use the function as a statistical model, we put in a value for the predictor variable (e.g., flipper_length_m = 0.22) and get out a predicted value for the outcome variable (e.g., body_mass_kg).

The function (such as our_function, which is -5.5 + 49*X) is a recipe for calculating the outcome. We can also use this recipe by using R as a calculator. If we want to find the predicted body mass for a penguin with a flipper length of 0.22 meters, we can run this line of R code:

-5.5 + 49*0.22

Try using R as a calculator like this in the code block below to see what our model predicts as the body mass for a penguin with a flipper length of 0.22 meters. We should get the same answer as when we run our_function(0.22).

require(coursekata)

# this creates our_function
our_function <- function(X){-5.5 + 49*X}

# this evaluates our custom function
our_function(0.22)

# write code to evaluate the function using R as a calculator
-5.5 + 49*0.22

You should have gotten a predicted body mass of 5.28 kg. Note that R always follows the order of operations (multiplication and division before addition and subtraction), so it first multiplies 49 by 0.22, then adds the result to -5.5. You could also make this clearer by adding parentheses to the code: -5.5 + (49*0.22).
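The sketch below illustrates why the placement of parentheses matters. The object names are our own, chosen only for this illustration.

```r
# Default order of operations: multiplication first, then addition
default_order <- -5.5 + 49*0.22

# Parentheses around the multiplication: the same calculation, made explicit
explicit_order <- -5.5 + (49*0.22)

# Parentheses around the addition: a different calculation entirely
changed_order <- (-5.5 + 49)*0.22

default_order
explicit_order
changed_order
```

The first two lines give the same prediction, while the third adds before multiplying and so no longer matches the model.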

Evaluating Versus Solving

When we evaluate a function, we provide it with a value for \(X\), then use the function to calculate the value of \(Y\). In algebra class, this is often described as plugging a value for \(X\) into an equation. When we use algebraic functions to help us model data, we want you to think of “evaluating a function” as using a value of \(X\) to make a prediction of \(Y\), or, more specifically, using a value on a predictor variable to predict a value on an outcome variable.

We could also plug in values of \(Y\) and solve for \(X\). We don’t call that evaluating because it doesn’t involve making a prediction for \(Y\)! You can just call that solving for \(X\).
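To see the difference in code, here is a minimal sketch of solving for \(X\). The helper solve_for_X is hypothetical (it is not part of the text or of any package); it simply rearranges \(Y = -5.5 + 49X\) into \(X = (Y + 5.5)/49\).

```r
# The model from the text: evaluating turns X into a prediction of Y
our_function <- function(X){-5.5 + 49*X}

# Hypothetical helper: rearranges Y = -5.5 + 49*X to give X = (Y + 5.5)/49
solve_for_X <- function(Y){(Y + 5.5)/49}

# Flipper length (in meters) at which the model predicts 5.28 kg
flipper_length <- solve_for_X(5.28)
flipper_length

# Evaluating the model at that flipper length recovers the body mass
recovered_mass <- our_function(flipper_length)
recovered_mass
```

Solving runs the recipe backwards, from an outcome value to the predictor value that would produce it, whereas evaluating runs it forwards to make a prediction.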
