Demo 03: Simple visuals for high-dimensional data
More fun with penguins
The graphs below don’t have proper titles, axis labels, legends, etc. Please take care to do this on your own graphs. Throughout this demo we will use the palmerpenguins dataset. To access the data, you will need to install the palmerpenguins package:
install.packages("palmerpenguins")
We load the penguins data in the same way as the previous demos:
library(tidyverse)
library(palmerpenguins)
data(penguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
Pairs plot with GGally
We will use the GGally package to make pairs plots in R with ggplot figures. You need to install the package:
install.packages("GGally")
Next, we’ll load the package and create a pairs plot of just the continuous variables using ggpairs. The main arguments you have to worry about for ggpairs are data, columns, and mapping:
- data: specifies the dataset
- columns: columns of data you want in the plot (can specify with a vector of column names or numbers referring to the column indices)
- mapping: aesthetics using aes(). The most important one is aes(color = <variable name>)
First, let’s create a pairs plot by specifying columns as the four columns of continuous variables (columns 3 through 6):
library(GGally)
penguins |> ggpairs(columns = 3:6)
Obviously this suffers from over-plotting, so we’ll want to adjust the alpha. An annoying thing is that we specify the alpha directly with aes when using ggpairs:
penguins |> ggpairs(columns = 3:6, mapping = aes(alpha = 0.5))
Plots along the diagonal show marginal distributions. Plots along the off-diagonal show joint (pairwise) distributions or statistical summaries (e.g., correlation) to avoid redundancy. The matrix of plots is symmetric; e.g., entry (1,2) shows the same distribution as entry (2,1). However, entry (1,2) and entry (2,1) display different bits of information (or alternative plots) about the same distribution.
We could also specify categorical variables in the plot. We also don’t need to specify column indices if we just select which columns to use beforehand:
penguins |>
  dplyr::select(bill_length_mm, body_mass_g, species, island) |>
  ggpairs(mapping = aes(alpha = 0.5))
Alternatively, we can use the mapping argument to display these categorical variables in a different manner - and arguably more efficiently:
penguins |>
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species))
The ggpairs function in GGally is very flexible and customizable with regard to which figures are displayed in the various panels. I encourage you to check out the vignettes and demos on the package website for more examples. For instance, in the pairs plot below I decide to display the regression lines and make other adjustments to the off-diagonal figures:
penguins |>
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species),
          lower = list(
            continuous = "smooth_lm",
            combo = "facetdensitystrip"
          ),
          upper = list(
            continuous = "cor",
            combo = "facethist"
          ))
You can also proceed to customize the pairs plot in the same manner as ggplot figures:
penguins |>
  dplyr::select(species, body_mass_g, ends_with("_mm")) |>
  ggpairs(mapping = aes(color = species, alpha = 0.5),
          columns = c("flipper_length_mm", "body_mass_g",
                      "bill_length_mm", "bill_depth_mm")) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_bw() +
  theme(strip.text = element_text(size = 7))
Correlograms with ggcorrplot
We can visualize the correlation matrix for the variables in a dataset using the ggcorrplot package. You need to install the package:
install.packages("ggcorrplot")
Next, we’ll load the package and create a correlogram using only the continuous variables. To do this, we first need to compute the correlation matrix for these variables:
penguins_cor_matrix <- penguins |>
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  cor(use = "complete.obs")
penguins_cor_matrix
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000
NOTE: Since there are missing values in the penguins data, we need to indicate in the cor() function how to handle missing values using the use argument. By default (use = "everything"), any correlation involving a variable with missing values is returned as NA, which is not what we want. Instead, we can change this to only use observations without NA values for the considered columns (see help(cor) for more options).
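To make the effect of the use argument concrete, here is a minimal sketch on a small toy data frame. The data frame toy and its values are made up for illustration; only cor() and its documented use options come from base R.

```r
# Toy data frame (hypothetical values) with one missing entry in column b
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): any correlation involving b is NA
cor_default <- cor(toy)

# "complete.obs": drop every row with an NA in any column, then correlate
cor_complete <- cor(toy, use = "complete.obs")

# "pairwise.complete.obs": for each pair of columns, drop only the rows
# that are missing in that particular pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default["a", "b"]   # NA
cor_complete["a", "c"]  # computed from rows 1, 2, 4, 5 only
cor_pairwise["a", "c"]  # computed from all five rows
```

The two non-default options can give different values for the same pair of variables, because "complete.obs" discards a row even when the missing value sits in a third column.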
Now, we can create the correlogram with ggcorrplot() using this correlation matrix:
library(ggcorrplot)
ggcorrplot(penguins_cor_matrix)
There are several ways we can improve this correlogram:
- we can avoid redundancy by only using one half of the matrix by changing the type input: the default is full, we can make it lower or upper instead:
ggcorrplot(penguins_cor_matrix, type = "lower")
- we can rearrange the variables using hierarchical clustering so that variables displaying stronger levels of correlation are closer together along the diagonal by setting hc.order = TRUE:
ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE)
- if we want to add the correlation values directly to the plot, we can include those labels by setting lab = TRUE, but we should round the correlation values first using the round() function:
ggcorrplot(round(penguins_cor_matrix, digits = 4),
type = "lower", hc.order = TRUE, lab = TRUE)
- if we want to place more stress on the correlation magnitude, we can change the method input to circle so that the size of the displayed circles is mapped to the absolute value of the correlation value:
ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE,
method = "circle")
You can ignore the Warning message that is displayed; it just comes from differences in the ggplot implementation.
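As a rough illustration of the hierarchical-clustering reordering mentioned in the list above, the sketch below orders the variables with base R's hclust() on a correlation-based distance. The conversion (1 - r) / 2 is one common choice and is an assumption here; it is not guaranteed to match what ggcorrplot does internally. The matrix values are copied from the output printed earlier.

```r
# Correlation matrix copied from the output shown earlier in this demo
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
cor_mat <- matrix(
  c( 1.0000000, -0.2350529,  0.6561813,  0.5951098,
    -0.2350529,  1.0000000, -0.5838512, -0.4719156,
     0.6561813, -0.5838512,  1.0000000,  0.8712018,
     0.5951098, -0.4719156,  0.8712018,  1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Turn correlations into distances: r = 1 maps to 0, r = -1 maps to 1
# (assumed conversion, chosen for illustration)
cor_dist <- as.dist((1 - cor_mat) / 2)

# Cluster the variables and read off the resulting ordering
var_order <- hclust(cor_dist)$order

# Reordered matrix: strongly correlated variables end up next to each other
cor_mat[var_order, var_order]
```

Because flipper_length_mm and body_mass_g have the strongest correlation, they are merged first and therefore sit side by side in the reordered matrix.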
Heatmaps to display dataset structure with color
Heatmaps provide a way to display structure of the dataset using the fill of tiles in a matrix. The fill of the tiles is mapped to a variable’s standardized value (i.e., (x - mean(x)) / sd(x)). There is a convenient function in R called heatmap to create this type of figure:
heatmap(as.matrix(dplyr::select(penguins,
                                bill_length_mm, bill_depth_mm,
                                flipper_length_mm, body_mass_g)),
        scale = "column",
        Rowv = NA, Colv = NA)
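The standardization behind the tile fill can be checked by hand. The sketch below uses the first six body_mass_g values from the head() output above (including the missing one) and compares the manual formula with base R's scale(); the variable names are just for illustration.

```r
# First six body masses (grams) from the head() output, including the NA
x <- c(3750, 3800, 3250, NA, 3450, 3650)

# Manual z-score, ignoring the missing value when computing mean and sd
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() performs the same centering and scaling; it returns a
# one-column matrix, so convert it back to a plain numeric vector
z_scale <- as.numeric(scale(x))

z_manual
```

After standardizing, the non-missing values have mean 0 and standard deviation 1, which is what puts variables measured in different units on a comparable color scale.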
In order to manually create this figure, we’ll need to pivot our dataset from wide to long using the pivot_longer() function. This results in a dataset with one row per observation and variable combination. Then we use geom_tile as the geometric object with the standardized value mapped to the fill:
penguins |>
  mutate(penguin_index = as.factor(paste0("Penguin-", 1:n()))) |>
  dplyr::select(penguin_index, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g) |>
  pivot_longer(bill_length_mm:body_mass_g,
               names_to = "variable",
               values_to = "raw_value") |>
  group_by(variable) |>
  mutate(std_value = (raw_value - mean(raw_value, na.rm = TRUE)) /
           sd(raw_value, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = variable, y = penguin_index, fill = std_value)) +
  geom_tile() +
  theme_light() +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 2))
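To see what the wide-to-long pivot in the pipeline above does, here is a small sketch on a two-penguin data frame. The values are the first two rows of the head() output; the object names wide and long are made up for illustration.

```r
library(tidyr)

# Two penguins measured on two variables (wide format: one row per penguin)
wide <- data.frame(
  penguin_index = c("Penguin-1", "Penguin-2"),
  bill_length_mm = c(39.1, 39.5),
  bill_depth_mm = c(18.7, 17.4)
)

# Long format: one row per penguin and variable combination
long <- pivot_longer(wide,
                     bill_length_mm:bill_depth_mm,
                     names_to = "variable",
                     values_to = "raw_value")
long
```

Two penguins times two variables gives four rows, with the former column names stored in variable and the measurements in raw_value.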
In order to provide some notion of the correlation structure between variables, it’s useful to reorder the observations in the heatmap display by some variable:
penguins |>
  mutate(penguin_index = as.factor(paste0("Penguin-", 1:n())),
         penguin_index = fct_reorder(penguin_index, body_mass_g,
                                     # Ignore the missings when reordering
                                     .na_rm = TRUE)) |>
  dplyr::select(penguin_index, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g) |>
  pivot_longer(bill_length_mm:body_mass_g,
               names_to = "variable",
               values_to = "raw_value") |>
  group_by(variable) |>
  mutate(std_value = (raw_value - mean(raw_value, na.rm = TRUE)) /
           sd(raw_value, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = variable, y = penguin_index, fill = std_value)) +
  geom_tile() +
  scale_fill_gradient(low = "darkblue", high = "darkorange") +
  theme_light() +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 2))
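The reordering step relies on fct_reorder() changing the order of the factor levels, which is what ggplot uses to place the rows. A minimal sketch, using the first three penguins and their body masses from the head() output (the object names are illustrative):

```r
library(forcats)

# Three penguins with their body masses in grams
penguin_index <- factor(c("Penguin-1", "Penguin-2", "Penguin-3"))
body_mass_g <- c(3750, 3800, 3250)

# Default level order is alphabetical: Penguin-1, Penguin-2, Penguin-3
levels(penguin_index)

# Reorder the levels by increasing body mass
reordered <- fct_reorder(penguin_index, body_mass_g)
levels(reordered)
```

Only the level order changes; the data values themselves stay attached to the same penguins.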
Parallel coordinates plot with GGally
In a parallel coordinates plot, we create an axis for each variable and align these axes side-by-side, drawing lines between observations from one axis to the next. This can be useful for visualizing structure among both the variables and observations in our dataset. These are useful when working with a moderate number of observations and variables - but can be overwhelming with too many.
We use the ggparcoord() function from the GGally package to make parallel coordinates plots:
penguins |>
  ggparcoord(columns = 3:6)
There are several ways we can modify this parallel coordinates plot:
- we should always adjust the transparency of the lines using the alphaLines input to help handle overlap:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2)
- we can color each observation’s lines by a categorical variable, which can be useful for revealing group structure:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species")
- we can change how the y-axis is constructed by modifying the scale input, which by default is std, i.e., subtracting the mean and dividing by the standard deviation. We could instead use uniminmax so that the minimum of the variable is zero and the maximum is one:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             scale = "uniminmax")
- we can also reorder the variables in a number of different ways with the order input (see help(ggparcoord) for details). There appear to be some weird errors with the different options, but you can still manually provide the order of indices as follows:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             order = c(6, 5, 3, 4))
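The two y-axis scalings described in the list above, std and uniminmax, can be written out by hand. This is a sketch of the arithmetic only, not of ggparcoord's internals; it uses the five non-missing flipper lengths from the head() output, and the variable names are illustrative.

```r
# Non-missing flipper lengths (mm) from the head() output
x <- c(181, 186, 195, 193, 190)

# "std": subtract the mean, then divide by the standard deviation
x_std <- (x - mean(x)) / sd(x)

# "uniminmax": rescale so the minimum maps to 0 and the maximum maps to 1
x_unit <- (x - min(x)) / (max(x) - min(x))

range(x_unit)  # 0 1
```

With std, every axis is centered at zero and extreme observations can extend far from it, while uniminmax forces every axis onto exactly the same 0 to 1 range.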