Visualizing Quantitative Distributions

Prof Ron Yurko

2024-09-09

Reminders, previously, and today…

  • HW2 is due Wednesday Sept 11th!

  • HW3 is posted and due next Wednesday Sept 18th

  • Visualized 2D categorical data with more bar charts, include mosaic plots

  • Walked through different approaches for 1D quantitative data visualization

TODAY:

  • Thinking carefully about histograms

  • Introduction to density estimation

  • Visualization 1D quantitative by 1D categorical distributions

Simulate data from mixture of Normal distributions

Will sample 100 draws from \(N(-1.5, 1)\) and 100 draws from \(N(1.5, 1)\)

Revisit histograms

set.seed(2024)
fake_data <- 
  tibble(fake_x = c(rnorm(100, -1.5), rnorm(100, 1.5))) |>
  mutate(component = c(rep("left", 100), rep("right", 100)))

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-5, 5))

What happens as we change the number of bins?

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram(bins = 15) +
  scale_x_continuous(limits = c(-5, 5))

What happens as we change the number of bins?

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram(bins = 60) +
  scale_x_continuous(limits = c(-5, 5))

What happens as we change the number of bins?

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram(bins = 5) +
  scale_x_continuous(limits = c(-5, 5))

What happens as we change the number of bins?

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(limits = c(-5, 5))

Variability of graphs - 30 bins

set.seed(2024)
fake_data <- 
  tibble(fake_x = c(rnorm(100, -1.5), rnorm(100, 1.5))) |>
  mutate(component = c(rep("left", 100), rep("right", 100)))

fake_data |>
  ggplot(aes(x = fake_x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-5, 5))

What happens with a different sample?

set.seed(1979)
fake_data2 <- 
  tibble(fake_x = c(rnorm(100, -1.5), rnorm(100, 1.5))) |>
  mutate(component = c(rep("left", 100), rep("right", 100)))

fake_data2 |>
  ggplot(aes(x = fake_x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-5, 5))

Variability of graphs - 15 bins

Variability of graphs - 5 bins

Variability of graphs - 100 bins

What do visualizations of continuous distributions display?

Probability that continuous variable \(X\) takes a particular value is 0

e.g., \(P\) (flipper_length_mm \(= 200\)) \(= 0\), why?

Instead we use the probability density function (PDF) to provide a relative likelihood

For continuous variables we can use the cumulative distribution function (CDF),

\[ F(x) = P(X \leq x) \]

For \(n\) observations we can easily compute the Empirical CDF (ECDF):

\[\hat{F}_n(x) = \frac{\text{# obs. with variable} \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n}1(x_i \leq x)\]

  • where \(1()\) is the indicator function, i.e. ifelse(x_i <= x, 1, 0)

Display full distribution with ECDF plot

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  stat_ecdf() +
  theme_bw()

What about comparing to theoretical distributions?

One-Sample Kolmogorov-Smirnov Test

  • We compare the ECDF \(\hat{F}(x)\) to a theoretical distribution’s CDF \(F(x)\)

  • The one sample KS test statistic is: \(\text{max}_x |\hat{F}(x) - F(x)|\)

Flipper length example

What if we assume flipper_length_mm follows Normal distribution?

  • i.e., flipper_length_mm \(\sim N(\mu, \sigma^2)\)

Need estimates for mean \(\mu\) and standard deviation \(\sigma\):

flipper_length_mean <- mean(penguins$flipper_length_mm, na.rm = TRUE)
flipper_length_sd <- sd(penguins$flipper_length_mm, na.rm = TRUE)

Perform one-sample KS test using ks.test():

ks.test(x = penguins$flipper_length_mm, y = "pnorm",
        mean = flipper_length_mean, sd = flipper_length_sd)

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  penguins$flipper_length_mm
D = 0.12428, p-value = 5.163e-05
alternative hypothesis: two-sided

Flipper length example

Visualize distribution comparisons using quantile-quantile (q-q) plots

penguins |>
  ggplot(aes(sample = flipper_length_mm)) +
  stat_qq() +
  stat_qq_line()

What about displaying conditional distributions?

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(aes(fill = species))

What about displaying conditional distributions?

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(aes(fill = species),
                 position = "identity", alpha = 0.3)

Normalize histogram frequencies with density values

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(aes(y = after_stat(density), fill = species),
                 position = "identity", alpha = 0.3) 

Use density curves instead for comparison

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density(aes(color = species))

We should NOT fill the density curves

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density(aes(fill = species), alpha = .3)

What’s the relationship between these two figures?

Kernel density estimation

Goal: estimate the PDF \(f(x)\) for all possible values (assuming it is continuous / smooth)

\[ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]

  • \(n =\) sample size, \(x =\) new point to estimate \(f(x)\) (does NOT have to be in dataset!)
  • \(h =\) bandwidth, analogous to histogram bin width, ensures \(\hat{f}(x)\) integrates to 1

  • \(x_i =\) \(i\)th observation in dataset

  • \(K_h(x - x_i)\) is the Kernel function, creates weight given distance of \(i\)th observation from new point
    • as \(|x - x_i| \rightarrow \infty\) then \(K_h(x - x_i) \rightarrow 0\), i.e. further apart \(i\)th row is from \(x\), smaller the weight

    • as bandwidth \(h \uparrow\) weights are more evenly spread out (as \(h \downarrow\) more concentrated around \(x\))

    • typically use Gaussian / Normal kernel: \(\propto e^{-(x - x_i)^2 / 2h^2}\)

    • \(K_h(x - x_i)\) is large when \(x_i\) is close to \(x\)

Wikipedia example

We display kernel density estimates with geom_density()

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density() +
  theme_bw()

Choice of kernel?

What about the bandwidth?

Use Gaussian reference rule (rule-of-thumb) \(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\), where \(\sigma\) is the observed standard deviation

Modify the bandwidth using the adjust argument - value to multiply default bandwidth by

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density(adjust = 0.5) +
  theme_bw()

What about the bandwidth?

Use Gaussian reference rule (rule-of-thumb) \(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\), where \(\sigma\) is the observed standard deviation

Modify the bandwidth using the adjust argument - value to multiply default bandwidth by

penguins |>
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density(adjust = 2) +
  theme_bw()

CAUTION: dealing with bounded data…

set.seed(101)
bound_data <- tibble(fake_x = runif(100))

bound_data |>
  ggplot(aes(x = fake_x)) +
  geom_density() +
  geom_rug(alpha = 0.5) + #<<
  stat_function(data = 
                  tibble(fake_x = c(0, 1)),
                fun = dunif, color = "red") +
  scale_x_continuous(limits = c(-.5, 1.5))

Visualizing conditional distributions with violin plots

penguins |>
  ggplot(aes(x = species, y = flipper_length_mm)) +
  geom_violin() +
  coord_flip()

Visualizing conditional distributions with violin plots

penguins |>
  ggplot(aes(x = species, y = flipper_length_mm)) +
  geom_violin() + 
  geom_boxplot(width = .2) +
  coord_flip()

Visualizing conditional distributions with ggridges package

library(ggridges)
penguins |>
  ggplot(aes(x = flipper_length_mm, y = species)) +
  geom_density_ridges(rel_min_height = 0.01)

Visualizing conditional distributions with ggbeeswarm package

library(ggbeeswarm)
penguins |>
  ggplot(aes(x = flipper_length_mm, y = species)) +
  geom_beeswarm(cex = 1.5) +
  theme_bw()

Recap and next steps

  • Discussed impact of bins on histograms

  • Covered ECDFs and connection to KS-tests

  • Walked through density estimation

  • Discussed ways of visualizing conditional distributions

  • HW2 is due Wednesday Sept 11th!

  • HW3 is posted and due next Wednesday Sept 18th

BONUS: Visualizing the KS test statistic

# First create the ECDF function for the variable:
fl_ecdf <- ecdf(penguins$flipper_length_mm)
# Compute the absolute value of the differences between the ECDF for the values
# and the theoretical values with assumed Normal distribution:
abs_ecdf_diffs <- abs(fl_ecdf(penguins$flipper_length_mm) - pnorm(penguins$flipper_length_mm,
                                                                  mean = flipper_length_mean, sd = flipper_length_sd))
# Now find where the maximum difference is:
max_abs_ecdf_diff_i <- which.max(abs_ecdf_diffs)
# Get this flipper length value:
max_fl_diff_value <- penguins$flipper_length_mm[max_abs_ecdf_diff_i]
# Plot the ECDF with the theoretical Normal and KS test info:
penguins |>
  ggplot(aes(x = flipper_length_mm)) +
  stat_ecdf(color = "darkblue") +
  # Use stat_function to draw the Normal ECDF
  stat_function(fun = pnorm, args = list(mean = flipper_length_mean, sd = flipper_length_sd), color = "black", linetype = "dashed") +
  # Draw KS test line:
  geom_vline(xintercept = max_fl_diff_value, color = "red") +
  # Add text with the test results (x and y are manually entered locations)
  annotate(geom = "text", x = 215, y = .25, label = "KS test stat = 0.12428\np-value = 5.163e-05") + 
  labs(x = "Flipper length (mm)", y = "Fn(x)") + theme_bw()