Chapter 2 - Modeling Data with Functions
2.1 From Exploring Data to Modeling Data
Making Predictions With a Scatter Plot
Visualizations are great for exploring data. But they also can help us find patterns in the data that we can use to predict future observations. Let’s try it here.
Going back to our penguins data, we saw that there was a pattern where penguins with longer flippers tended to have greater body mass.
In the scatter plot above we’ve colored penguins with a flipper length of 0.22 m purple. Now consider a new penguin, not in the dataset. If it also happens to have a flipper length of 0.22 m, what would you predict for its body mass?
Try modifying the code below to overlay your prediction as a red dot on top of the scatter plot of data.
require(coursekata)
gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = ~flipper_length_m == 0.22) %>%
# modify this code to add your prediction in red
gf_point(0 ~ 0.22)
One possible solution (yours may differ):
gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = ~flipper_length_m == 0.22) %>%
  # our prediction for a penguin with a 0.22 m flipper, drawn in red
  gf_point(5.2 ~ 0.22, color = "red")
Now try making 6 more predictions of body mass for penguins with flipper lengths ranging from 0.23 m down to 0.18 m. Overlay each of the 6 new predictions as red dots on the scatter plot below.
require(coursekata)
# add a few predicted penguins
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
gf_point(0 ~ 0.23, color = "red") %>%
gf_point(0 ~ 0.22, color = "red") %>%
gf_point(0 ~ 0.21, color = "red") %>%
gf_point(0 ~ 0.20, color = "red") %>%
gf_point(0 ~ 0.19, color = "red") %>%
gf_point(0 ~ 0.18, color = "red")
One possible solution (yours may differ):
# solutions will vary, but the red dots
# should generally follow the pattern of the data
gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_point(5.8 ~ 0.23, color = "red") %>%
  gf_point(5.3 ~ 0.22, color = "red") %>%
  gf_point(4.7 ~ 0.21, color = "red") %>%
  gf_point(4.2 ~ 0.20, color = "red") %>%
  gf_point(3.5 ~ 0.19, color = "red") %>%
  gf_point(3 ~ 0.18, color = "red")
Using a Line to Make Predictions: Our First Model
Everyone will put their predictions in a slightly different place (ours are pictured below). Our predictions follow the general pattern seen in the data, with lower body mass predictions for penguins with shorter flippers, and higher predictions of body mass for those with longer flippers.
Our predictions sort of form a straight line. A straight line seems like it could be a good tool for making predictions because it fits with the idea that increases in flipper length are associated with increases in body mass. We can position the line to cut through the “middle” of points at each value of flipper_length_m.
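One way to make the “middle” of the points at each value concrete is to average body mass within each flipper length. The sketch below does this in base R. It uses a small made-up data frame: `toy_penguins` and all of its values are invented for illustration and are not taken from the penguins data. Treating the mean as the “middle” is also an assumption of this sketch.

```r
# Made-up measurements for illustration only (not from the penguins data)
toy_penguins <- data.frame(
  flipper_length_m = c(0.18, 0.18, 0.18, 0.20, 0.20, 0.20, 0.22, 0.22, 0.22),
  body_mass_kg     = c(2.8,  3.0,  3.2,  3.9,  4.1,  4.3,  5.0,  5.3,  5.6)
)

# "Middle" of the points at each flipper length:
# the mean body mass within each group of equal flipper length
middle_by_flipper <- tapply(
  toy_penguins$body_mass_kg,
  toy_penguins$flipper_length_m,
  mean
)

print(middle_by_flipper)
```

A line that passes close to these group means is one that cuts through the middle of the points.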
The simple straight line pictured above can be used to generate predictions. If we find a new penguin with a flipper length of 0.20 m, we can look at the line right where it crosses the 0.20 value on the x-axis, and directly read off the predicted body mass on the y-axis (a little more than 4.0 kg, maybe 4.1, on the graph as shown below).
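Reading a prediction off a line is the same as evaluating the line's equation. As a rough sketch, suppose the line passes through two of the example predictions from earlier, (0.18 m, 3.0 kg) and (0.23 m, 5.8 kg). That choice of points and the helper name `predict_body_mass` are assumptions made for illustration. They are not necessarily the line drawn in the figure.

```r
# Two points assumed to lie on the line (taken from the example predictions)
x1 <- 0.18; y1 <- 3.0   # flipper length (m), body mass (kg)
x2 <- 0.23; y2 <- 5.8

# Slope and intercept of the straight line through the two points
slope     <- (y2 - y1) / (x2 - x1)
intercept <- y1 - slope * x1

# Hypothetical helper: the line written as a function of flipper length
predict_body_mass <- function(flipper_length_m) {
  intercept + slope * flipper_length_m
}

# Prediction for a new penguin with a 0.20 m flipper
print(predict_body_mass(0.20))
```

With these two points the prediction at 0.20 m comes out at about 4.1 kg, in line with the value read off the graph.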
This line is our first statistical model. Statistical models generate a single prediction for an outcome variable for each value of a predictor variable. In this case, the model generates a single prediction of body mass for each value of flipper length. Thus, the model produces one prediction of body mass for a penguin with a 0.20 m flipper length.
Of course, even though the model generates a single prediction, the actual body masses of penguins with a flipper length of 0.20 m vary a lot from penguin to penguin. A statistical model is a simple way of capturing the pattern we see, so its predictions are usually not very accurate for any individual penguin. But a prediction that uses what we know about flipper length is more accurate than one that doesn’t.
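This comparison can be checked numerically. The sketch below compares two sets of predictions: one that ignores flipper length and one that uses a straight line. Several things here are assumptions for illustration: the made-up `toy_penguins` data, the hand-picked slope and intercept, the use of the overall mean as the prediction that ignores flipper length, and the use of average absolute error as the measure of accuracy.

```r
# Made-up measurements for illustration only (not from the penguins data)
toy_penguins <- data.frame(
  flipper_length_m = c(0.18, 0.18, 0.18, 0.20, 0.20, 0.20, 0.22, 0.22, 0.22),
  body_mass_kg     = c(2.8,  3.0,  3.2,  3.9,  4.1,  4.3,  5.0,  5.3,  5.6)
)

# Prediction that ignores flipper length: the same value for every penguin
pred_without <- rep(mean(toy_penguins$body_mass_kg), nrow(toy_penguins))

# Prediction from a straight line (hypothetical, hand-picked coefficients)
pred_with <- -7.08 + 56 * toy_penguins$flipper_length_m

# Average size of the prediction errors, in kg, for each approach
mae_without <- mean(abs(toy_penguins$body_mass_kg - pred_without))
mae_with    <- mean(abs(toy_penguins$body_mass_kg - pred_with))

print(c(without_flipper = mae_without, with_flipper = mae_with))
```

In this toy example the errors from the line are smaller on average, but they are not zero: penguins with the same flipper length still differ in body mass.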