Demo 03: Simple visuals for high-dimensional data

More fun with penguins

The graphs below don’t have proper titles, axis labels, legends, etc. Please take care to add these to your own graphs. Throughout this demo we will use the palmerpenguins dataset. To access the data, you will need to install the palmerpenguins package:

install.packages("palmerpenguins")

We load the penguins data in the same way as the previous demos:

library(tidyverse)
library(palmerpenguins)
data(penguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Pairs plot with GGally

We will use the GGally package to make pairs plots in R with ggplot figures. You need to install the package:

install.packages("GGally")

Next, we’ll load the package and create a pairs plot of just the continuous variables using ggpairs. The main arguments you have to worry about for ggpairs are data, columns, and mapping:

  • data: specifies the dataset

  • columns: Columns of data you want in the plot (can specify with vector of column names or numbers referring to the column indices)

  • mapping: aesthetics using aes(). Most important one is aes(color = <variable name>)

First, let’s create a pairs plot by specifying columns as the four columns of continuous variables (columns 3 through 6):

library(GGally)
penguins |> ggpairs(columns = 3:6)

This clearly suffers from over-plotting, so we’ll want to adjust the alpha. One annoyance is that, when using ggpairs, we specify the alpha directly inside aes:

penguins |> ggpairs(columns = 3:6, mapping = aes(alpha = 0.5))

Plots along the diagonal show marginal distributions. Plots along the off-diagonal show joint (pairwise) distributions or statistical summaries (e.g., correlation) to avoid redundancy. The matrix of plots is symmetric; e.g., entry (1,2) shows the same distribution as entry (2,1). However, entry (1,2) and entry (2,1) display different bits of information (or alternative plots) about the same distribution.
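
As a rough illustration of what a correlation panel summarizes, the sketch below computes the Pearson correlation for one pair of variables directly with cor(). It assumes the palmerpenguins package is installed; the name r_bill_mass is only illustrative, and the exact text shown in a ggpairs panel may be formatted differently.

```r
library(palmerpenguins)

# Overall Pearson correlation between bill length and body mass,
# using only the rows where both values are observed.
r_bill_mass <- cor(penguins$bill_length_mm, penguins$body_mass_g,
                   use = "complete.obs")
round(r_bill_mass, 3)
```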

We could also specify categorical variables in the plot. We also don’t need to specify column indices if we just select which columns to use beforehand:

penguins |> 
  dplyr::select(bill_length_mm, body_mass_g, species, island) |>
  ggpairs(mapping = aes(alpha = 0.5))

Alternatively, we can use the mapping argument to display these categorical variables in a different manner - and arguably more efficiently:

penguins |> 
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species))
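
Once color is mapped to species, it is natural to ask how the relationships differ within each group. A minimal sketch of the corresponding group-wise computation with dplyr follows; the names species_cor, n_complete, and cor_bill_mass are illustrative and not part of GGally.

```r
library(dplyr)
library(palmerpenguins)

# Correlation between bill length and body mass computed separately per species.
species_cor <- penguins |>
  group_by(species) |>
  summarize(
    # Number of penguins with both measurements observed
    n_complete = sum(!is.na(bill_length_mm) & !is.na(body_mass_g)),
    cor_bill_mass = cor(bill_length_mm, body_mass_g, use = "complete.obs")
  )
species_cor
```

Comparing these values with the overall correlation is a quick way to check whether a pooled relationship also holds within groups.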

The ggpairs function in GGally is very flexible and customizable with regard to which figures are displayed in the various panels. I encourage you to check out the vignettes and demos on the package website for more examples. For instance, in the pairs plot below I display the regression lines and make other adjustments to the off-diagonal figures:

penguins |> 
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species), 
          lower = list(
            continuous = "smooth_lm", 
            combo = "facetdensitystrip"
          ),
          upper = list(
            continuous = "cor",
            combo = "facethist"
          )
  )

You can also proceed to customize the pairs plot in the same manner as ggplot figures:

penguins |>
  dplyr::select(species, body_mass_g, ends_with("_mm")) |>
  ggpairs(mapping = aes(color = species, alpha = 0.5),
          columns = c("flipper_length_mm", "body_mass_g",
                      "bill_length_mm", "bill_depth_mm")) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_bw() +
  theme(strip.text = element_text(size = 7))
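
Since the graphs in this demo lack proper titles and labels, here is one possible way to add them to a pairs plot, using the title and columnLabels arguments of ggpairs. This is a sketch rather than the only approach; the label text and the names measurement_cols, measurement_labels, and labelled_pairs are my own choices.

```r
library(ggplot2)
library(GGally)
library(palmerpenguins)

# Columns to display and the human-readable labels for the panel strips
measurement_cols <- c("flipper_length_mm", "body_mass_g",
                      "bill_length_mm", "bill_depth_mm")
measurement_labels <- c("Flipper length (mm)", "Body mass (g)",
                        "Bill length (mm)", "Bill depth (mm)")

labelled_pairs <- penguins |>
  ggpairs(columns = measurement_cols,
          columnLabels = measurement_labels,
          mapping = aes(color = species, alpha = 0.5),
          title = "Pairwise relationships among penguin measurements") +
  theme_bw() +
  theme(strip.text = element_text(size = 7))
labelled_pairs
```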

Correlograms with ggcorrplot

We can visualize the correlation matrix for the variables in a dataset using the ggcorrplot package. You need to install the package:

install.packages("ggcorrplot")

Next, we’ll load the package and create a correlogram using only the continuous variables. To do this, we first need to compute the correlation matrix for these variables:

penguins_cor_matrix <- penguins |>
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  cor(use = "complete.obs")
penguins_cor_matrix
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000

NOTE: Since there are missing values in the penguins data, we need to tell cor() how to handle them with the use argument. By default (use = "everything"), any correlation involving a variable with missing values is returned as NA, which is not what we want. Instead, we set use = "complete.obs" so that only observations without NA values in the considered columns are used (see help(cor) for more options).
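
To make the effect of the use argument concrete, the sketch below compares three settings on a small made-up data frame (the toy values are invented purely for illustration and are not penguin data).

```r
# Toy data: x has one missing value in the last row
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): pairs involving x come back as NA
cor_default <- cor(toy)

# Drop every row that has any NA, then compute all correlations
cor_complete <- cor(toy, use = "complete.obs")

# For each pair, drop only the rows where that pair has an NA
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default["x", "y"]
cor_complete["y", "z"]
cor_pairwise["y", "z"]
```

With "complete.obs" the y-z correlation is based on the first four rows only, whereas "pairwise.complete.obs" uses all five rows for that pair, so the two values can differ.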

Now, we can create the correlogram by passing this correlation matrix to ggcorrplot():

library(ggcorrplot)
ggcorrplot(penguins_cor_matrix)

There are several ways we can improve this correlogram:

  • we can avoid redundancy by displaying only one half of the matrix via the type input: the default is full, but we can set it to lower or upper instead:
ggcorrplot(penguins_cor_matrix, type = "lower")

  • we can rearrange the variables using hierarchical clustering so that variables displaying stronger levels of correlation are closer together along the diagonal by setting hc.order = TRUE:
ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE)

  • if we want to add the correlation values directly to the plot, we can include those labels by setting lab = TRUE - but we should round the correlation values first using the round() function:
ggcorrplot(round(penguins_cor_matrix, digits = 4), 
           type = "lower", hc.order = TRUE, lab = TRUE)

  • if we want to place more stress on the correlation magnitude, we can change the method input to circle so that the size of the displayed circles is mapped to the absolute value of the correlation value:
ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE,
           method = "circle")

You can ignore the warning message that is displayed; it just results from differences in the ggplot implementation.
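
The idea behind reordering variables with hierarchical clustering can be sketched in base R. The example below treats one minus the correlation as a dissimilarity and uses complete linkage; both choices are assumptions for illustration, and the exact computation inside ggcorrplot may differ. The matrix values are copied from the correlation matrix printed above.

```r
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
cor_mat <- matrix(
  c( 1.0000000, -0.2350529,  0.6561813,  0.5951098,
    -0.2350529,  1.0000000, -0.5838512, -0.4719156,
     0.6561813, -0.5838512,  1.0000000,  0.8712018,
     0.5951098, -0.4719156,  0.8712018,  1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Assumption: strongly positively correlated variables count as "close"
dissimilarity <- as.dist(1 - cor_mat)
clustering <- hclust(dissimilarity, method = "complete")

# Leaf order of the dendrogram gives the new variable order
hc_order <- clustering$order
vars[hc_order]
cor_mat[hc_order, hc_order]
```

Because flipper length and body mass have the highest correlation, they are merged first and end up next to each other in the reordered matrix.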

Heatmaps to display dataset structure with color

Heatmaps provide a way to display the structure of a dataset using the fill of tiles in a matrix. The fill of each tile is mapped to a variable’s standardized value (i.e., (x - mean(x)) / sd(x)). There is a convenient function in R called heatmap to create this type of figure:

heatmap(as.matrix(dplyr::select(penguins, 
                                bill_length_mm, bill_depth_mm, 
                                flipper_length_mm, body_mass_g)),
        scale = "column", 
        Rowv = NA, Colv = NA)
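
The scale = "column" argument asks heatmap to standardize each column. As a small sketch of that standardization, the code below applies the formula by hand to the first six body_mass_g values shown by head(penguins) and compares the result with base R's scale(); the variable names are illustrative.

```r
# First six body_mass_g values from head(penguins), including the missing one
body_mass <- c(3750, 3800, 3250, NA, 3450, 3650)

# Manual standardization: subtract the mean, divide by the standard deviation
std_manual <- (body_mass - mean(body_mass, na.rm = TRUE)) /
  sd(body_mass, na.rm = TRUE)

# scale() performs the same centering and scaling, ignoring NAs
std_scale <- as.numeric(scale(body_mass))

std_manual
all.equal(std_manual, std_scale)
```

Missing values stay missing after standardization, while the observed values end up with mean zero and standard deviation one.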

In order to manually create this figure, we’ll need to pivot our dataset from wide to long using the pivot_longer() function. This results in a dataset with one row per observation and variable combination. Then we use geom_tile as the geometric object with the standardized value mapped to the fill:

penguins |>
  mutate(penguin_index = as.factor(paste0("Penguin-", 1:n()))) |>
  dplyr::select(penguin_index, bill_length_mm, bill_depth_mm, 
                flipper_length_mm, body_mass_g) |>
  pivot_longer(bill_length_mm:body_mass_g,
               names_to = "variable",
               values_to = "raw_value") |>
  group_by(variable) |>
  mutate(std_value = (raw_value - mean(raw_value, na.rm = TRUE)) / 
           sd(raw_value, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = variable, y = penguin_index, fill = std_value)) +
  geom_tile() +
  theme_light() +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 2)) 
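
To see what the pivot step does on its own, here is a minimal sketch on a two-penguin toy table built from the first two rows of head(penguins). The names toy_wide and toy_long are illustrative.

```r
library(tidyr)

# Wide format: one row per penguin, one column per measurement
toy_wide <- data.frame(
  penguin_index = c("Penguin-1", "Penguin-2"),
  bill_length_mm = c(39.1, 39.5),
  bill_depth_mm = c(18.7, 17.4)
)

# Long format: one row per penguin-and-variable combination
toy_long <- toy_wide |>
  pivot_longer(bill_length_mm:bill_depth_mm,
               names_to = "variable",
               values_to = "raw_value")
toy_long
```

Two penguins with two measurements each yield four rows, which is the shape geom_tile needs: one row per tile.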

In order to provide some notion of the correlation structure between variables, it’s useful to reorder the observations in the heatmap display by some variable:

penguins |>
  mutate(penguin_index = as.factor(paste0("Penguin-", 1:n())),
         penguin_index = fct_reorder(penguin_index, body_mass_g,
                                     # Ignore the missings when reordering
                                     .na_rm = TRUE)) |>
  dplyr::select(penguin_index, bill_length_mm, bill_depth_mm, 
                flipper_length_mm, body_mass_g) |>
  pivot_longer(bill_length_mm:body_mass_g,
               names_to = "variable",
               values_to = "raw_value") |>
  group_by(variable) |>
  mutate(std_value = (raw_value - mean(raw_value, na.rm = TRUE)) / 
           sd(raw_value, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = variable, y = penguin_index, fill = std_value)) +
  geom_tile() +
  scale_fill_gradient(low = "darkblue", high = "darkorange") +
  theme_light() +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 2)) 
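
The reordering relies on fct_reorder() from forcats, which changes the order of the factor levels (and hence the order along the y-axis) without changing the data. A small sketch using the first three body_mass_g values from head(penguins); the toy data frame is only for illustration.

```r
library(forcats)

toy <- data.frame(
  penguin_index = factor(c("Penguin-1", "Penguin-2", "Penguin-3")),
  body_mass_g = c(3750, 3800, 3250)
)

# Default level order is alphabetical
levels(toy$penguin_index)

# Reorder the levels from lightest to heaviest penguin
reordered <- fct_reorder(toy$penguin_index, toy$body_mass_g)
levels(reordered)
```

The lightest penguin (Penguin-3) now comes first, so rows of the heatmap would be stacked by increasing body mass.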

Parallel coordinates plot with GGally

In a parallel coordinates plot, we create an axis for each variable and align these axes side-by-side, connecting each observation’s values with a line from one axis to the next. This can be useful for visualizing structure among both the variables and observations in our dataset. These plots are useful when working with a moderate number of observations and variables - but can be overwhelming with too many.

We use the ggparcoord() function from the GGally package to make parallel coordinates plots:

penguins |>
  ggparcoord(columns = 3:6)

There are several ways we can modify this parallel coordinates plot:

  • we should always adjust the transparency of the lines using the alphaLines input to help handle overlap:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2)

  • we can color each observation’s lines by a categorical variable, which can be useful for revealing group structure:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species")

  • we can change how the y-axis is constructed by modifying the scale input. The default is std, which subtracts the mean and divides by the standard deviation. We could instead use uniminmax so that the minimum of each variable is zero and the maximum is one:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             scale = "uniminmax")

  • we can also reorder the variables in a number of different ways with the order input (see help(ggparcoord) for details). Some of the built-in options appear to produce errors, but you can still manually provide the order of column indices as follows:
penguins |>
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             order = c(6, 5, 3, 4))
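
The difference between the std and uniminmax scalings described above can be sketched in base R. The helper functions scale_std and scale_uniminmax are hypothetical names written for this example, not GGally functions, and ggparcoord's own handling of missing values may differ. The input is the first six flipper_length_mm values from head(penguins).

```r
flipper <- c(181, 186, 195, NA, 193, 190)

# "std"-style scaling: center at the mean, divide by the standard deviation
scale_std <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# "uniminmax"-style scaling: map the minimum to 0 and the maximum to 1
scale_uniminmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

flipper_std <- scale_std(flipper)
flipper_minmax <- scale_uniminmax(flipper)

range(flipper_std, na.rm = TRUE)
range(flipper_minmax, na.rm = TRUE)
```

Both scalings preserve the ordering of observations within a variable; they differ in whether the axis is anchored by the mean and spread or by the observed extremes.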