Chapter 2 - Modeling Data with Functions

2.1 From Exploring Data to Modeling Data

Making Predictions With a Scatter Plot

Visualizations are great for exploring data. They can also help us find patterns in the data that we can use to predict future observations. Let’s try it here.

Returning to our penguins data, we saw a pattern: penguins with longer flippers tended to have greater body mass.

A scatter plot of body_mass_kg predicted by flipper_length_m. The data points for penguins with a flipper length of 0.22 are shaded in purple.

In the scatter plot above we’ve colored penguins with a flipper length of 0.22 m purple. Now consider a new penguin, not in the dataset. If it also happens to have a flipper length of 0.22 m, what would you predict for its body mass?

Try modifying the code below to overlay your prediction as a red dot on top of the scatter plot of data.

Starter code:

    require(coursekata)

    gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = ~flipper_length_m == 0.22) %>%
      # modify this code to add your prediction in red
      gf_point(0 ~ 0.22)

One possible solution:

    gf_point(body_mass_kg ~ flipper_length_m, data = penguins, color = ~flipper_length_m == 0.22) %>%
      # solutions vary, here is one
      gf_point(5.2 ~ 0.22, color = "red")

Now try making 6 more predictions of body mass, one for each flipper length from 0.23 m down to 0.18 m (in steps of 0.01 m). Overlay each of the 6 new predictions as a red dot on the scatter plot below.

Starter code:

    require(coursekata)

    # add a few predicted penguins
    gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
      gf_point(0 ~ 0.23, color = "red") %>%
      gf_point(0 ~ 0.22, color = "red") %>%
      gf_point(0 ~ 0.21, color = "red") %>%
      gf_point(0 ~ 0.20, color = "red") %>%
      gf_point(0 ~ 0.19, color = "red") %>%
      gf_point(0 ~ 0.18, color = "red")

One possible solution:

    # add a few predicted penguins
    gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
      gf_point(5.8 ~ 0.23, color = "red") %>%
      gf_point(5.3 ~ 0.22, color = "red") %>%
      gf_point(4.7 ~ 0.21, color = "red") %>%
      gf_point(4.2 ~ 0.20, color = "red") %>%
      gf_point(3.5 ~ 0.19, color = "red") %>%
      gf_point(3 ~ 0.18, color = "red")

    # solutions will vary but the red dots
    # should generally follow the pattern of data

Using a Line to Make Predictions: Our First Model

Everyone will put their predictions in a slightly different place (ours are pictured below). Our predictions follow the general pattern seen in the data, with lower body mass predictions for penguins with shorter flippers, and higher predictions of body mass for those with longer flippers.

A scatter plot of body_mass_kg predicted by flipper_length_m. Six red dots are plotted as predictions through the approximate center of the data points.

Our predictions roughly form a straight line. A straight line seems like a good tool for making predictions because it fits the idea that increases in flipper length are associated with increases in body mass. We can position the line so that it cuts through the “middle” of the points at each value of flipper_length_m.

A scatter plot of body_mass_kg predicted by flipper_length_m. Six red dots are plotted as predictions through the approximate center of the data points. A blue line of best fit is also plotted on the graph and the line closely follows the trend of the red prediction dots.
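A line like this can be overlaid on the scatter plot in code. The following is a minimal sketch, not necessarily how the figure above was produced: it assumes that gf_lm() (from the ggformula package, which the earlier gf_point() calls suggest is loaded with coursekata) is an acceptable way to draw a least-squares line, and the name plot_with_line is ours.

```r
require(coursekata)

# Scatter plot of the data with a straight line drawn through the
# "middle" of the points. Assumption: a least-squares line from gf_lm()
# is a reasonable stand-in for the blue line in the figure.
plot_with_line <- gf_point(body_mass_kg ~ flipper_length_m, data = penguins) %>%
  gf_lm(color = "blue")

plot_with_line
```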

The simple straight line pictured above can be used to generate predictions. If we find a new penguin with a flipper length of 0.20 m, we can find the point where the line crosses 0.20 on the x-axis and read the predicted body mass directly off the y-axis (a little more than 4.0 kg, maybe 4.1 kg, as shown on the graph below).

A scatter plot of body_mass_kg predicted by flipper_length_m. A blue line of best fit is plotted on the graph and runs through the center of the data points. Blue dashed lines run through the x- and y-axes at a body mass of 4.1 and a flipper length of 0.20 and intersect with the line of best fit.
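Reading a value off the line can also be done numerically. As a rough sketch, the code below draws a line through the six red-dot predictions from our solution to the earlier exercise and asks for the value of that line at 0.20 m. This is an assumption made for illustration: the blue line in the figure is drawn through the full penguins data, so its prediction may differ slightly. The names red_dots, line_model, new_penguin, and predicted_mass are ours.

```r
# The six red-dot predictions from the exercise solution
# (flipper length in m, predicted body mass in kg).
red_dots <- data.frame(
  flipper_length_m = c(0.23, 0.22, 0.21, 0.20, 0.19, 0.18),
  body_mass_kg     = c(5.8, 5.3, 4.7, 4.2, 3.5, 3.0)
)

# Draw a least-squares line through the six dots.
# Assumption: this stands in for the blue line in the figure.
line_model <- lm(body_mass_kg ~ flipper_length_m, data = red_dots)

# Read the prediction off the line at a flipper length of 0.20 m.
new_penguin <- data.frame(flipper_length_m = 0.20)
predicted_mass <- predict(line_model, newdata = new_penguin)
predicted_mass
```

With these six dots, the line gives a prediction a little above 4.1 kg, which agrees with the value read off the graph.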

This line is our first statistical model. A statistical model generates a single prediction for an outcome variable at each value of a predictor variable. In this case, the model generates a single prediction of body mass for each value of flipper length. Thus, the model produces one prediction of body mass for a penguin with a 0.20 m flipper length.

Of course, even though the model generates a single prediction, the actual body masses of penguins with flipper lengths of 0.20 m vary a lot from penguin to penguin. Although a statistical model is a simple way of capturing the pattern we see, it is usually not very accurate. Still, a prediction that uses what we know about flipper length is more accurate than one that does not.
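One way to check that last claim is to compare the typical size of the prediction errors with and without flipper length. The sketch below is only illustrative: the text does not name an accuracy measure, so root mean squared error is our choice, and the names complete, no_flipper, with_flipper, and rmse are ours.

```r
require(coursekata)

# Keep only penguins with both measurements recorded.
complete <- subset(penguins, !is.na(body_mass_kg) & !is.na(flipper_length_m))

# Prediction that ignores flipper length:
# the same value (the mean body mass) for every penguin.
no_flipper <- lm(body_mass_kg ~ 1, data = complete)

# Prediction that uses flipper length: a straight line.
with_flipper <- lm(body_mass_kg ~ flipper_length_m, data = complete)

# Typical size of a prediction error, in kg (root mean squared error).
# Assumption: RMSE is used here as one reasonable measure of accuracy.
rmse <- function(model) sqrt(mean(resid(model)^2))

rmse(no_flipper)
rmse(with_flipper)
```

A smaller value means the predictions sit closer to the actual body masses, so the line should come out ahead of the single mean.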
