High School / Algebra + Data Science (G)

2.7 Keeping Predictions Reasonable: Domain and Range of Functions

Lines, and hence linear functions, go on forever. If we want, we can use our linear function to calculate the predicted body mass of penguins who have flipper lengths of 1000 meters, or of -300 meters. But we wouldn’t want to do this because there are no penguins with such flipper lengths, which means the predicted body masses would make no sense.
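To see why such inputs are a problem, here is a minimal sketch that plugs both extreme flipper lengths into the linear function used in this section (\(-5.5 + 49X\), defined here so the snippet runs on its own in base R).

```r
# the linear model from this section:
# predicted body mass (kg) from flipper length (m)
our_function <- function(X){-5.5 + 49*X}

our_function(1000)   # a flipper 1000 m long
our_function(-300)   # a flipper -300 m long
```

The function happily returns a number for both inputs (48994.5 and -14705.5), but neither describes a real penguin.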

Even though lines go on forever, the models we make with a linear function will only make sense for certain values of \(X\) and of \(Y\). The values of \(X\) that we would reasonably put into our function are referred to as the domain of the function. The set of numerical predictions that are generated by the function (i.e., the possible values of \(Y\)) is called the range.

When we use R to make a scatter plot, it adjusts the x- and y-axes to roughly include only the domain and range of the function. It does this just by looking at the actual flipper lengths and body masses of the penguins in our data set. That’s why we weren’t able to see the y-intercept of our function (-5.5) on the scatter plot – because 0 isn’t part of the domain of flipper length.

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. A vertical line runs along the length of the y-axis and is labeled as range. A horizontal line runs along the length of the x-axis and is labeled as domain.

Based on our data, a reasonable domain for our function might be 0.17 to 0.23 m. Using this domain, figure out the range of body mass predictions that our_function (defined in the code block below) can make.

    require(coursekata)

    # this creates our custom function
    our_function <- function(X){-5.5 + 49*X}

    # what is the possible range of body mass predictions?
    our_function(0.17)
    our_function(0.23)
2.83
5.77

Assuming a domain of 0.17 to 0.23 m, the lowest possible prediction of body mass that our_function could produce is 2.83 kg and the highest is 5.77 kg. This constitutes the range of our function.
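The same calculation can be done in one step by passing both ends of the domain to the function at once. This is a sketch that assumes the domain of 0.17 to 0.23 m; the names `domain` and `predicted_range` are illustrative. Because the slope (49) is positive, predictions increase as flipper length increases, so the two ends of the domain produce the two ends of the range.

```r
our_function <- function(X){-5.5 + 49*X}

# assumed domain limits (flipper length in meters)
domain <- c(0.17, 0.23)

# R applies the function to each value in the vector
predicted_range <- our_function(domain)
predicted_range
```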

Mathematical versus Contextual Limits on Domain

Mathematically, there is no limit to the domain or range of our function: \(f(X)=-5.5 + 49X\). Any real number could be entered into it, and any real number could be the result.
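One way to see that any real number can be the result is to solve the function for \(X\). A short derivation, writing \(Y\) for the output \(f(X)\):

\[
Y = -5.5 + 49X \quad\Longrightarrow\quad X = \frac{Y + 5.5}{49}
\]

For any target value of \(Y\), this gives an input \(X\) that produces it. For example, \(Y = 0\) is produced by \(X = 5.5/49 \approx 0.112\).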

But that’s in the world of mathematics. In the world of data we take a different approach to defining the domain of a function (or model). We want the model to make sense in the real world, and we want it to help us understand patterns in the data.

For this reason, we must look at the context in order to decide the appropriate domain of a function. In the case of penguin flipper lengths, a good guide will be the data we have. But our data is just a sample; there are no doubt penguins in the world with smaller or larger flipper lengths than the ones in our data.

We would not set the limits of our domain exactly at the minimum and maximum flipper lengths in the data, because we want some wiggle room. On the other hand, we aren’t going to use our function to predict the body mass of a penguin with a flipper length of 100 meters! If we find a penguin like that, it’s no ordinary penguin!
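One way to build in wiggle room is to widen the observed minimum and maximum by a small margin. The sketch below is only an illustration: `padded_domain` is a hypothetical helper (not part of coursekata), the flipper lengths are made-up values rather than the penguins data, and the 0.01 m margin is an arbitrary choice.

```r
# hypothetical helper: widen the observed min and max by a margin
padded_domain <- function(x, margin){
  c(lower = min(x) - margin, upper = max(x) + margin)
}

# made-up flipper lengths in meters (for illustration only)
example_flippers <- c(0.18, 0.19, 0.21, 0.22)

padded_domain(example_flippers, margin = 0.01)
```

How much margin to add is a judgment call that depends on the context.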

Range is a Property of the Function, Not the Data

Once we have assumed a domain for our function, we can use the lower and upper limits of the domain (e.g., 0.17 to 0.23) to calculate the lower and upper limits of the range. But whereas we might look at the data to help us decide the most appropriate domain of our function, the range is purely a property of the function, not a property of the data.

Using our_function (above), we entered the lower and upper limits of our domain and calculated the lowest possible output (2.83) and the highest possible output (5.77) of our function to get its range. But that doesn’t mean there are no penguins in the data set with body masses outside this range.

You can check this in the penguins data set by using the head() and arrange() functions together. Are there penguins with body masses below 2.83 kg? What about the largest penguins? Are there penguins with body masses higher than 5.77 kg?

    require(coursekata)

    # run this code to find the 6 penguins with lowest body mass
    # then try sorting in reverse order with the negative sign
    # to find penguins with the highest body mass
    head(arrange(penguins, body_mass_kg))
    head(arrange(penguins, -body_mass_kg))

In the data, there are penguins with body masses as low as 2.70 kg and as high as 6.3 kg, both of which fall outside the range of the function (2.83 to 5.77 kg). It’s not the data that must fall within the range, but the predictions of the function. Whenever the data don’t match the predictions of the function, it’s a good reminder that no model is perfect.
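The comparison can also be written as a logical test. This sketch uses only the numbers quoted here (the lightest and heaviest body masses, and the range of 2.83 to 5.77); the variable names are illustrative.

```r
# lightest and heaviest body masses reported in the text (kg)
observed_mass <- c(lightest = 2.70, heaviest = 6.3)

# range of predictions for the domain 0.17 to 0.23 m
range_limits <- c(2.83, 5.77)

# TRUE means the observed value is outside the range of predictions
outside_range <- observed_mass < range_limits[1] | observed_mass > range_limits[2]
outside_range
```

Both values come back TRUE: real penguins can be lighter or heavier than anything the function predicts within its domain.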

Extrapolating Beyond the Domain of Our Data

Functions are useful because they let us make predictions based on values of \(X\) that don’t exist in our data. Sometimes these predictions fall within the domain of the function. For example, we might have a penguin with a flipper length of 0.183 m and another with 0.185 m, but none with a flipper length of 0.184 m. We can use our function to make a prediction for this new value that falls in between the two data points. We call this interpolation.
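As a minimal sketch of interpolation, the code below evaluates the function at the two flipper lengths from the example and at the value in between. It redefines our_function so that it runs on its own; `flipper_lengths` and `predictions` are illustrative names.

```r
our_function <- function(X){-5.5 + 49*X}

# two observed flipper lengths (m) and the unobserved value between them
flipper_lengths <- c(0.183, 0.184, 0.185)

predictions <- our_function(flipper_lengths)
predictions
```

The prediction for 0.184 m (3.516 kg) falls between the predictions for its two neighbors (3.467 kg and 3.565 kg).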

We can also make predictions that fall outside the domain of our data; this is called extrapolation.

An algebraic function can be a powerful tool for making predictions. It’s what companies and governments use to predict the future (e.g., jobs next month, customer purchases next year, wars in the next decade). However, predicting the future is always risky and requires extrapolation. The same level of vigilance and wariness we apply to predictions of the future should also be extended to extrapolating functions beyond their domain.
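One way to put this wariness into practice is to have the code flag extrapolation. The sketch below is an illustration under stated assumptions: `predict_with_check` is a hypothetical wrapper (not part of coursekata), and it treats the domain of 0.17 to 0.23 m as the boundary.

```r
our_function <- function(X){-5.5 + 49*X}

# hypothetical wrapper: warn when X is outside the assumed domain
predict_with_check <- function(X, domain = c(0.17, 0.23)){
  if (any(X < domain[1] | X > domain[2])) {
    warning("X is outside the domain; this prediction is an extrapolation")
  }
  our_function(X)
}

predict_with_check(0.20)  # inside the domain, no warning
```

Calling predict_with_check(0.30) would still return a prediction, but with a warning attached.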
