Demo 05: Principal Component Analysis

Principal Components of Starbucks

Throughout this demo we will again use the dataset about Starbucks drinks available in the #TidyTuesday project. You can read in and manipulate various columns in the dataset with the following code:

library(tidyverse)
starbucks <-
  read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv") |>
  # Convert columns to numeric that were saved as character
  mutate(trans_fat_g = as.numeric(trans_fat_g),
         fiber_g = as.numeric(fiber_g))
We will apply principal component analysis (PCA) to the quantitative variables in this dataset:
# Select the variables of interest:
starbucks_quant_data <- starbucks |>
  dplyr::select(serv_size_m_l:caffeine_mg)

dim(starbucks_quant_data)
[1] 1147 11
As seen above, there are 11 quantitative variables in the dataset, and it’s difficult to visualize 11 quantitative variables simultaneously. Maybe we can “get away” with just plotting two dimensions that make up the majority of the variation among these 11 variables (i.e., the first two principal components).
To conduct PCA, you must center and standardize your variables. We can either do that manually with the scale() function:
scaled_starbucks_quant_data <- scale(starbucks_quant_data)
Or we can tell R to do that for us before performing PCA, using the prcomp() function:
# perform PCA
starbucks_pca <- prcomp(starbucks_quant_data,
                        # Center and scale variables:
                        center = TRUE, scale. = TRUE)
# This is equivalent to the following commented out code:
# starbucks_pca <- prcomp(scaled_starbucks_quant_data,
# center = FALSE, scale. = FALSE)
# View the summary
summary(starbucks_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion 0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
PC8 PC9 PC10 PC11
Standard deviation 0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion 0.99657 0.99916 0.99985 1.00000
There are 11 variables in this dataset, and thus there are 11 principal components. However, we can see that the first principal component accounts for over half of the variation in the dataset(!), while the second accounts for about 15% of the variation. We can also see that the variation accounted for by each component adds up to the total variation in the data (i.e., the “Cumulative Proportion” equals 100% in the PC11 column). Also, in the first row, we can see that \(\text{Var}(Z_1) > \text{Var}(Z_2) > \cdots > \text{Var}(Z_{11})\), as expected given what we talked about in lecture.
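If you want these numbers programmatically rather than just printed to the console, the summary object also stores them as a matrix. Here is a small sketch using the starbucks_pca object created above (the importance element of the summary; its row names follow the printed output):

# The summary object stores the table above as a matrix, with rows
# "Standard deviation", "Proportion of Variance", and "Cumulative Proportion":
pca_importance <- summary(starbucks_pca)$importance
pca_importance["Proportion of Variance", ]
pca_importance["Cumulative Proportion", ]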
We haven’t actually computed the principal components \(Z_1,\dots,Z_{11}\) yet. In brief, PCA provides a \(p \times p\) “rotation matrix,” and the score matrix \(\boldsymbol{Z} = (Z_1,\dots,Z_{11})\) is equal to the centered-and-scaled data matrix \(X\) times the rotation matrix. The prcomp() function returns the result of this matrix multiplication: the matrix of principal component scores \(\boldsymbol{Z} = (Z_1,\dots,Z_{11})\), which can be accessed in the following way:
starbucks_pc_matrix <- starbucks_pca$x
head(starbucks_pc_matrix)
PC1 PC2 PC3 PC4 PC5 PC6
[1,] -3.766852 -1.0023657 0.2482698 -0.1521871448 0.24739830 -0.11365847
[2,] -3.633234 -0.6946439 1.2059943 -0.3720566566 0.06052789 -0.06406410
[3,] -3.518063 -0.3981399 2.2165170 -0.5967175941 -0.13122572 -0.01937237
[4,] -3.412061 -0.1067045 3.3741594 -0.8490378243 -0.26095965 -0.00899485
[5,] -3.721426 -0.9868147 -1.0705094 0.0949330091 -0.27181508 0.17491809
[6,] -3.564899 -0.6712499 -0.7779083 -0.0003019903 -0.72054963 0.37005543
PC7 PC8 PC9 PC10 PC11
[1,] -0.02812472 0.006489978 0.05145094 -0.06678083 -0.019741873
[2,] 0.05460952 0.021148978 0.07094211 -0.08080545 -0.023480029
[3,] 0.09050806 0.031575955 0.08901403 -0.09389227 -0.028669251
[4,] 0.11585507 0.037521689 0.11287190 -0.11582260 -0.034691142
[5,] 0.07009414 0.037736197 0.02892317 -0.03631676 -0.005775410
[6,] 0.20236484 0.068154160 0.03705252 -0.03497690 -0.002469611
We could have computed this manually using the returned rotation matrix and the original data (but centered and scaled). You perform matrix multiplication in R using the %*% operator:
manual_starbucks_pc_matrix <-
  as.matrix(scaled_starbucks_quant_data) %*% starbucks_pca$rotation
head(manual_starbucks_pc_matrix)
PC1 PC2 PC3 PC4 PC5 PC6
[1,] -3.766852 -1.0023657 0.2482698 -0.1521871448 0.24739830 -0.11365847
[2,] -3.633234 -0.6946439 1.2059943 -0.3720566566 0.06052789 -0.06406410
[3,] -3.518063 -0.3981399 2.2165170 -0.5967175941 -0.13122572 -0.01937237
[4,] -3.412061 -0.1067045 3.3741594 -0.8490378243 -0.26095965 -0.00899485
[5,] -3.721426 -0.9868147 -1.0705094 0.0949330091 -0.27181508 0.17491809
[6,] -3.564899 -0.6712499 -0.7779083 -0.0003019903 -0.72054963 0.37005543
PC7 PC8 PC9 PC10 PC11
[1,] -0.02812472 0.006489978 0.05145094 -0.06678083 -0.019741873
[2,] 0.05460952 0.021148978 0.07094211 -0.08080545 -0.023480029
[3,] 0.09050806 0.031575955 0.08901403 -0.09389227 -0.028669251
[4,] 0.11585507 0.037521689 0.11287190 -0.11582260 -0.034691142
[5,] 0.07009414 0.037736197 0.02892317 -0.03631676 -0.005775410
[6,] 0.20236484 0.068154160 0.03705252 -0.03497690 -0.002469611
As you can see from just the first few rows, these matrices match. If we view the dimensionality of this matrix (just the one returned to us by prcomp()), we can see that it matches the dimensionality of the original dataset:
dim(starbucks_pc_matrix)
[1] 1147 11
Indeed, it is literally an 11-dimensional rotation of our dataset. However, the first column of this matrix accounts for over half of the variation in the data and the second column accounts for over 15% of the variation, so maybe we can “get away” with plotting just those first two dimensions.
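To see why “rotation” is an apt name, here is a quick sanity check (a small sketch reusing the objects above): the rotation matrix returned by prcomp() is orthogonal, so its columns are unit-length and mutually perpendicular.

# t(W) %*% W should be the 11 x 11 identity matrix (up to floating-point error):
round(crossprod(starbucks_pca$rotation), digits = 10)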
To recreate what the summary output of the prcomp() function gave us above, the following line of code computes the standard deviation of each \(Z_j\) (the numbers match what’s given in the first row of numbers above):
apply(starbucks_pc_matrix, MARGIN = 2, FUN = sd)
PC1 PC2 PC3 PC4 PC5 PC6 PC7
2.47478380 1.30742010 1.05712064 0.97918632 0.67836258 0.56399067 0.44130936
PC8 PC9 PC10 PC11
0.28122634 0.16874262 0.08701525 0.04048139
These correspond to \(\sqrt{\lambda_j}\), the square roots of the eigenvalues (i.e., \(\text{SD}(Z_j) = \sqrt{\lambda_j}\), which are proportional to the singular values of the scaled data matrix). We can then compute the proportion of variance explained by each component (also displayed in the summary output) by squaring these values and dividing by the number of columns; this works because each standardized variable has variance 1, so the total variance in the scaled data equals the number of columns:
# Note that I can just replace the sd function above with the var function
apply(starbucks_pc_matrix, MARGIN = 2, FUN = var) /
ncol(starbucks_pc_matrix)
PC1 PC2 PC3 PC4 PC5 PC6
0.5567777142 0.1553952108 0.1015912775 0.0871641674 0.0418341630 0.0289168610
PC7 PC8 PC9 PC10 PC11
0.0177049043 0.0071898413 0.0025885519 0.0006883321 0.0001489766
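As a quick sanity check (a small sketch reusing the objects above), these proportions should sum to 1, and their cumulative sums should match the “Cumulative Proportion” row of the summary() output:

prop_var <- starbucks_pca$sdev^2 / ncol(starbucks_quant_data)
sum(prop_var)
cumsum(prop_var)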
The plot below displays the first two PCs \(Z_1\) and \(Z_2\):
# First add these columns to the original dataset:
starbucks <- starbucks |>
  mutate(pc1 = starbucks_pc_matrix[,1],
         pc2 = starbucks_pc_matrix[,2])

starbucks |>
  ggplot(aes(x = pc1, y = pc2)) +
  geom_point(alpha = 0.25) +
  labs(x = "PC 1", y = "PC 2")
This matches what we saw returned by MDS!
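If you want to verify this for yourself, here is a minimal sketch, assuming (as in the earlier MDS demo) classical MDS with Euclidean distances computed on the scaled data. The MDS coordinates should match the first two principal component scores, up to a possible sign flip of each axis:

# Classical MDS on Euclidean distances of the scaled data:
starbucks_mds <- cmdscale(dist(scaled_starbucks_quant_data), k = 2)
# Compare to the first two PC scores (each column may differ by a sign flip):
head(starbucks_mds)
head(starbucks_pc_matrix[, 1:2])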
One Biplot to Rule Them All
However, the components by themselves aren’t very interpretable: how do they relate to the original variables? At this point, it’s important to remember that the principal components are linear combinations of the original variables. So, there is a (deterministic) linear relationship between the original variables and the principal components that we are plotting here.
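To make “linear combination” concrete, here is a small sketch reusing the objects defined earlier: the PC1 score for the first drink is just the weighted sum of its scaled variable values, with weights given by the first column of the rotation matrix.

# PC1 score for the first drink, computed explicitly as a linear combination:
sum(scaled_starbucks_quant_data[1, ] * starbucks_pca$rotation[, 1])
# This matches the stored score:
starbucks_pc_matrix[1, 1]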
Using the popular R package factoextra, we can plot these linear relationships on top of the scatterplot. We can do so using what’s called a biplot, which is essentially just a fancy expression for “scatterplots with arrows on top”. After installing the factoextra package, we can create the biplot using the fviz_pca_biplot() function on the prcomp output directly (but with the observation labels turned off!):
# install.packages("factoextra")
library(factoextra)
# Designate to only label the variables:
fviz_pca_biplot(starbucks_pca, label = "var",
# Change the alpha for the observations -
# which is represented by ind
alpha.ind = .25,
# Modify the alpha for the variables (var):
alpha.var = .75,
# Modify the color of the variables
col.var = "darkblue")
The above plot tells us a lot of information:
- The direction of a particular arrow is indicative of “as this variable increases…”. For example, the far-left arrow for caffeine_mg suggests that, as caffeine_mg increases, \(Z_1\) and \(Z_2\) tend to decrease (in other words, within the definition of \(Z_1\) and \(Z_2\), the coefficient for caffeine_mg is negative; this is verified below). You can contrast this with serv_size_m_l, which points to the upper right, indicating that as serv_size_m_l increases, both \(Z_1\) and \(Z_2\) tend to increase.

- The angle between the different vectors is also indicative of the correlation between the variables. If two vectors are at a right angle (90 degrees), that suggests they are uncorrelated, e.g., serv_size_m_l and saturated_fat_g. If two vectors point in similar directions (i.e., their angle is less than 90 degrees), that suggests they are positively correlated, e.g., sugar_g and total_carbs_g. If two vectors point in opposing directions (i.e., their angle is greater than 90 degrees), that suggests they are negatively correlated, e.g., caffeine_mg and calories (a quick correlation check appears after this list).

- The length of the arrows also indicates how strongly the principal components are related to the individual variables. For example, serv_size_m_l has a fairly long arrow because it has fairly large positive coefficients for \(Z_1\) and \(Z_2\) in the rotation matrix (see below). Meanwhile, caffeine_mg has a relatively short arrow because its coefficients are relatively small.
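As a quick check of the angle interpretation, we can compute the sample correlations between the variables mentioned above (a small sketch; these exact values are not shown in the biplot itself):

starbucks_quant_data |>
  dplyr::select(serv_size_m_l, saturated_fat_g, sugar_g,
                total_carbs_g, caffeine_mg, calories) |>
  cor() |>
  round(digits = 2)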
For reference, the code below shows the rotation matrix we used to create the \(Z\)s. You’ll see that the directions of the vectors in the above plot correspond to the first two columns of this matrix.
starbucks_pca$rotation
PC1 PC2 PC3 PC4 PC5
serv_size_m_l 0.20078297 0.44103545 0.35053466 -0.117331692 -0.71633828
calories 0.39488151 0.10314156 -0.01048587 0.055814030 0.11487225
total_fat_g 0.35254969 -0.31687231 0.06598414 0.046196797 0.07677253
saturated_fat_g 0.33929914 -0.30565133 0.05310592 -0.003731227 0.16145662
trans_fat_g 0.29974182 -0.39855899 0.01855869 -0.092804122 -0.35695525
cholesterol_mg 0.33049434 -0.37077805 0.01219867 -0.105617624 -0.18815364
sodium_mg 0.33573598 0.24647412 -0.09107538 -0.083512068 0.34969486
total_carbs_g 0.34858318 0.34483762 -0.09623296 0.002842153 0.12386718
fiber_g 0.11351848 0.04137855 0.17814948 0.956078124 -0.04719036
sugar_g 0.34234584 0.35100839 -0.13314389 -0.109371714 0.12108189
caffeine_mg -0.03085327 -0.01056235 0.89572768 -0.167846419 0.35265479
PC6 PC7 PC8 PC9 PC10
serv_size_m_l 0.30806678 0.13668394 0.04039275 0.01194522 -0.001076764
calories -0.01331210 -0.18521073 0.09836135 -0.45551398 0.744248239
total_fat_g 0.37698224 -0.03833030 0.03871096 -0.58859673 -0.518643989
saturated_fat_g 0.57285718 -0.06553378 -0.26369346 0.56257742 0.209355859
trans_fat_g -0.50043224 0.15197176 -0.58086994 -0.05398876 0.032105721
cholesterol_mg -0.26449384 -0.04594580 0.74615325 0.27703165 -0.032124871
sodium_mg -0.06228905 0.82317144 0.06292570 0.04230447 -0.037304757
total_carbs_g -0.17619489 -0.34490217 -0.08588651 0.12501079 -0.148886253
fiber_g -0.11365528 0.06192955 0.01207815 0.10654914 -0.061378250
sugar_g -0.16729497 -0.33345131 -0.10758116 0.14408095 -0.321644156
caffeine_mg -0.19600402 -0.06671121 -0.02122274 0.01530108 -0.020691492
PC11
serv_size_m_l 0.0053899973
calories -0.1070327163
total_fat_g 0.0489644534
saturated_fat_g -0.0152794817
trans_fat_g 0.0069417249
cholesterol_mg -0.0004710159
sodium_mg 0.0185545403
total_carbs_g 0.7347049650
fiber_g -0.0730283725
sugar_g -0.6635335478
caffeine_mg -0.0094861578
In the above example, we plotted the first two principal components; thus, implicitly, we have chosen \(k = 2\), the only reason being that it is easy to visualize. However, how many principal components should we actually be using?
Creating and Interpreting Scree Plots
There is a common visual used to answer this question, but first let’s build some intuition. We already know that \(Z_1\) accounts for the most variation in our data, \(Z_2\) accounts for the next most, and so on. Thus, each time we add a new principal component dimension, we capture a “higher proportion of the information in the data,” but that increase in proportion decreases for each new dimension we add. (You may have to read those last two sentences a few times to get what I mean.) Thus, in practice, it is recommended to keep adding principal components until the marginal gain “levels off,” i.e., decreases to the point that it isn’t too beneficial to add another dimension to the data.
This trade-off between dimensions and marginal gain in information is often inspected visually using a scree plot, or what is more commonly known as an elbow plot. In an elbow plot, the x-axis has the numbers \(1,2,\dots,p\) (i.e., the dimensions in the data), and the y-axis has the proportion of variation that the particular principal component \(Z_j\) accounts for. We can construct the scree plot using the fviz_eig() function (also available as fviz_screeplot()) from factoextra:
fviz_eig(starbucks_pca, addlabels = TRUE) # Add the labels
The graphical rule-of-thumb is to then look for the “elbow,” i.e., where the proportion of variation starts to become flat. Unfortunately, there is not a definitive “this is the elbow for sure” rule, and it is up to your judgment. Another useful rule-of-thumb is to consider drawing a horizontal line at 1 divided by the number of variables in your original dataset. Why do you think that is a useful rule? We can easily do this because factoextra generates ggplot objects, so we can add another geometric layer corresponding to our reference line:
fviz_eig(starbucks_pca, addlabels = TRUE) +
# Have to multiply by 100 to get on percent scale
geom_hline(yintercept = 100 * (1 / ncol(starbucks_quant_data)),
linetype = "dashed", color = "darkred")
Based on this plot, I think there’s a strong argument to stop at \(k = 3\) (but maybe go up to \(k = 5\) for another substantial drop in the elbow).
Let’s say we decide \(k = 3\). This means that we should use the first three principal components in our graphics and other analyses in order for a “satisfactory” amount of the variation in the data to be captured. Our above visual only plots the first two principal components, so we are “hiding” some of the information in the data that we would be better off plotting in some way if possible (specifically, we are hiding about 30% of the information, i.e., the total amount captured by principal components 3, 4, …, 11, and about a third of this remaining information is captured by the third component that we are not plotting). This means that, theoretically, we should plot three quantitative variables, and we’ve discussed a bit about how to do this (you could use the size of points, transparency, or even a 3D scatterplot if you wanted to), but we are not going to explore that further here. Alternatively, you could just make three scatterplots (one for each pair of principal components), as sketched below.
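For example, here is a minimal sketch of that last suggestion using the GGally package (an assumption on my part; it is not used elsewhere in this demo) to display all pairwise scatterplots of the first three principal components:

# install.packages("GGally")
library(GGally)
# Pairwise scatterplots (and correlations) of the first three PC scores:
as.data.frame(starbucks_pc_matrix[, 1:3]) |>
  ggpairs(lower = list(continuous = wrap("points", alpha = 0.25)))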
If you’re having issues with the factoextra package, then you can easily remake the scree plot manually. All we need to do is grab the proportion of variance explained by each component, turn it into a table, and then display it in some way. We already computed these values earlier in the demo, but we can also just grab the standard deviations directly provided to us by R in the prcomp output:
# Manual creation of elbow plot, start by computing the eigenvalues and dividing by
# the total variance in the data:
tibble(prop_var = (starbucks_pca$sdev)^2 / ncol(starbucks_quant_data)) |>
# Add a column for the PC index:
mutate(pc_index = 1:n()) |>
# Now make the plot!
ggplot(aes(x = pc_index, y = prop_var)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.75) +
geom_point(color = "black", size = 2) +
geom_line() +
# Add the horizontal reference line:
geom_hline(yintercept = (1 / ncol(starbucks_quant_data)),
linetype = "dashed", color = "darkred") +
# And label:
labs(x = "Dimensions", y = "Proportion of explained variance") +
theme_bw()
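A closely related display (a small sketch, not part of the factoextra output) plots the cumulative proportion of explained variance instead, which makes it easy to read off how much of the total variation the first \(k\) components capture:

tibble(prop_var = (starbucks_pca$sdev)^2 / ncol(starbucks_quant_data)) |>
  mutate(pc_index = 1:n(),
         cum_prop_var = cumsum(prop_var)) |>
  ggplot(aes(x = pc_index, y = cum_prop_var)) +
  geom_point(size = 2) +
  geom_line() +
  labs(x = "Dimensions", y = "Cumulative proportion of explained variance") +
  theme_bw()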
Making a biplot from scratch is much more difficult… Instead, you can try the ggfortify package (which is also useful for making model diagnostic plots). The following code demonstrates how to do this (after you have installed ggfortify):
# install.packages("ggfortify")
library(ggfortify)
autoplot(starbucks_pca,
data = starbucks_quant_data,
alpha = 0.25,
loadings = TRUE, loadings.colour = 'darkblue',
loadings.label.colour = 'darkblue',
loadings.label = TRUE, loadings.label.size = 3,
loadings.label.repel = TRUE) +
theme_bw()