install.packages("palmerpenguins")
Demo 02: Scatterplots and regression with categorical variables
Scatterplots with Penguins
The graphs below don’t have proper titles, axis labels, legends, etc. Please take care to do this on your own graphs. Throughout this demo we will use the palmerpenguins
dataset. To access the data, you will need to install the palmerpenguins
package:
Geometric layer
We use geom_point()
to generate scatterplots (requires x
and y
aes)
library(tidyverse)
library(palmerpenguins)
data(penguins)
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point()
Always adjust transparency!
Scatterplots display the joint distribution of the variables along the x
and y
axes. You should always adjust the transparency of points via alpha
to visualize overlap - providing a better understanding of joint frequency. This is especially important with larger datasets.
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5)
Mapping Aesthetics
Colors
We can color by a third variable (e.g., different color for each category).
Note that you can put color =
inside ggplot
or geom_point
, both display the same visualization:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point(alpha = 0.5)
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = species))
You can also color by a quantitative variable using a color scale/gradient:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = body_mass_g))
The default color gradient is not the most appealing, while there are a number of possibilities - blue to orange is a good choice since these colors are opposites on the color spectrum:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = body_mass_g)) +
scale_color_gradient(low = "darkblue", high = "darkorange")
Point size (size
)
We can also map variables to other aesthetics, e.g. size
:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(size = body_mass_g))
Point type (shape
)
or the type (shape
) of points:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(shape = species))
Combining aesthetics
We can even do several of these at once:
|>
penguins ggplot(aes(x = bill_depth_mm, y = bill_length_mm,
color = species, shape = island,
size = body_mass_g)) +
geom_point(alpha = 0.5)
The above graph may be a bit difficult to read, but it contains a lot of information in the sense that it is a 5-dimensional graphic:
x
= bill depth (mm)y
= bill length (mm)color
= speciessize
= body mass (g)shape
= island
But be careful! The more complications you add, the more difficult your graph is to explain.
Regression with Penguins
We will focus on predicting penguin bill depth (bill_depth_mm
) based on the penguin’s bill length (bill_length_mm
) and species
. In particular, we will consider the following three linear regression models:
bill_depth_mm
~bill_length_mm
bill_depth_mm
~bill_length_mm
+species
bill_depth_mm
~bill_length_mm
+species
+bill_length_mm
\(\times\)species
Simple Linear Regression (based only on bill length)
First, we can run a simple linear regression (the first model) based only on bill length. We can display this line via geom_smooth()
:
|>
penguins ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE)
And display the regression model output using summary()
:
summary(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-4.1381 -1.4263 0.0164 1.3841 4.5255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.88547 0.84388 24.749 < 2e-16 ***
bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.922 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247
F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05
We can write this regression model as:
\[\text{depth} \sim N(\beta_0 + \beta_L \cdot \text{length}, \sigma^2)\]
Note that \(\beta_0\) is the intercept and \(\beta_L\) is the slope.
Thus, our estimates are:
- \(\hat{\beta}_0 = 20.88547\)
- \(\hat{\beta}_L = 12.43\)
- \(\hat{\sigma}^2 = 1.922^2\)
Multiple Linear Regression (Additive)
We can also run the second model, which is based on length and species, but with only additive effects. First, we’ll check the counts of the species variable to ensure that the species with the highest number of observations if the reference level (i.e., the first level for a factor variable):
table(penguins$species)
Adelie Chinstrap Gentoo
152 68 124
Looks like we’re lucky and that the Adelie
species is the most popular and is already first due to alphabetical order. What function would we need to do to re-order the variable?
Next, we’ll fit the regression that accounts for species
without an interaction - so it’s just an additive effect:
<- lm(bill_depth_mm ~ bill_length_mm + species,
depth_lm_species_add data = penguins)
summary(depth_lm_species_add)
Call:
lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.4529 -0.6864 -0.0508 0.5519 3.5915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9533 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
This tells us that Chinstrap
is different from Adelie
and Gentoo
is different from Adelie
, but it does NOT tell us Chinstrap
is different from Gentoo
. That would require another model with a reordered species
variable. Exercise: Reorder species
so that Gentoo
is the reference level and compare to the results above.
We can manually extract intercepts and coefficients to use for plotting (read the code comments!):
# Calculate species-specific intercepts in order:
<- # First for `Adelie` it's just the initial intercept
intercepts c(coef(depth_lm_species_add)["(Intercept)"],
# Next for `Chinstrap` it's the intercept plus the `Chinstrap` term:
coef(depth_lm_species_add)["(Intercept)"] +
coef(depth_lm_species_add)["speciesChinstrap"],
# And finally for `Gentoo` it's again the intercept plus the `Gentoo` term
coef(depth_lm_species_add)["(Intercept)"] +
coef(depth_lm_species_add)["speciesGentoo"])
# Create a small table to store the intercept, slopes, and species:
<- tibble("intercepts" = intercepts,
lines_tbl # Slopes are the same for each, thus use rep()
"slopes" = rep(coef(depth_lm_species_add)["bill_length_mm"],
3),
# And the levels of species:
"species" = levels(penguins$species))
We can now plot this model by specifying the regression lines with geom_abline()
using the newly constructed lines_tbl
as the data for this layer:
|>
penguins ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(alpha = 0.5) +
geom_abline(data = lines_tbl,
aes(intercept = intercepts, slope = slopes,
color = species)) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
This is a great example of Simpson’s Paradox! We originally observed a negative linear relationship between depth and length, but now observe a positive linear relationship within species!
Multiple Linear Regression (Interactive)
Next, we can run the third model, which is based on length and species, including interaction effects. This is the default type of model displayed when we map species
to the color aesthetic for the geom_smooth()
layer. In the plot below, we display across both layers, geom_point()
and geom_smooth()
by mapping species
to color in the initial ggplot
canvas construction:
|>
penguins ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
What about the summary of this model? Is the inclusion of interaction terms relevant? Note that by default, multiplying two variables in the lm()
formula below includes both the additive AND interaction terms.
<- lm(bill_depth_mm ~ bill_length_mm * species,
depth_lm_species_int data = penguins)
summary(depth_lm_species_int)
Call:
lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.6574 -0.6675 -0.0524 0.5383 3.5032
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
bill_length_mm 0.17883 0.02927 6.110 2.76e-09 ***
speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
bill_length_mm:speciesChinstrap 0.04338 0.04558 0.952 0.341895
bill_length_mm:speciesGentoo 0.02601 0.04054 0.642 0.521590
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9548 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16
The interaction terms do NOT appear to be necessary to include. This is justified by both the lack of significance and the slight drop in adjusted R-squared.
What about the intercept?
Remember the meaning of the intercept term… that is not reasonable in this setting because penguins will never have bills with length of 0mm! We should update the additive model (since we found the interaction terms to not be meaningful) to remove the intercept. This can be done by adding a 0
term to the lm()
formula:
<- lm(bill_depth_mm ~ 0 + bill_length_mm + species,
depth_lm_remove_b0 data = penguins)
summary(depth_lm_remove_b0)
Call:
lm(formula = bill_depth_mm ~ 0 + bill_length_mm + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.4529 -0.6864 -0.0508 0.5519 3.5915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
speciesAdelie 10.59218 0.68302 15.508 < 2e-16 ***
speciesChinstrap 8.65899 0.86207 10.044 < 2e-16 ***
speciesGentoo 5.48616 0.83547 6.567 1.94e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9533 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 2.795e+04 on 4 and 338 DF, p-value: < 2.2e-16
What changed in the summary output? Why did that occur?
We can copy-and-paste our code from above to add these appropriate regression lines:
# Calculate species-specific intercepts in order:
<- # First for `Adelie`
new_intercepts c(coef(depth_lm_remove_b0)["speciesAdelie"],
# Next for `Chinstrap`
coef(depth_lm_remove_b0)["speciesChinstrap"],
# And finally for `Gentoo`
coef(depth_lm_remove_b0)["speciesGentoo"])
# Create a small table to store the intercept, slopes, and species:
<-
new_lines_tbl tibble("intercepts" = new_intercepts,
# Slopes are the same for each, thus use rep()
"slopes" = rep(coef(depth_lm_remove_b0)["bill_length_mm"],
3),
# And the levels of species:
"species" = levels(penguins$species))
Again, create the display:
|>
penguins ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(alpha = 0.5) +
geom_abline(data = new_lines_tbl,
aes(intercept = intercepts, slope = slopes,
color = species)) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
Why is this the same display as before? Here’s a great description of why we observe a higher R-squared with the intercept-term excluded from the model.