2024-09-25
HW4 is due TONIGHT Sept 25th
You need to email me a draft of your EDA report! (1 per group)
Walked through non-linear dimension reduction with t-SNE
Discussed visualizing trends, highlighting points of emphasis
TODAY:
Walk through the basics of time series data techniques
Introduce visualizations and inference with spatial data
Time series can be characterized by three features:
Trends: Does the variable increase or decrease over time, on average?
Seasonality: Are there changes in the variable that regularly happen (e.g., every winter, every hour, etc.)? Sometimes called periodicity.
Noise: Variation in the variable beyond average trends and seasonality.
Moving averages are a starting point for visualizing how a trend changes over time
The Financial Times COVID-19 plots displayed a moving average (sometimes called a rolling average)
Intuition
Divide your data into small subsets (“windows”)
Compute the average within each window
Connect the averages together to make a trend line
Sometimes called a simple moving average
This is exactly what we did with LOESS… we called this a sliding window, but it’s the same thing
Mathematically, a moving average with window width \(h\) can be written as the following:
\[\mu_k = \frac{\sum_{t=k - h + 1}^k X_t}{h}\]
Large \(h\): Smooth line; captures global trends
Small \(h\): Jagged/volatile line; captures local trends
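The formula above can be sketched in a few lines of base R. This is a minimal illustration, not the course's implementation: `simple_moving_average` is a hypothetical helper name, and the toy vector `x` is made up.

```r
# Trailing simple moving average: mu_k = mean(X_{k-h+1}, ..., X_k).
# `simple_moving_average` is a hypothetical helper, not from a package.
simple_moving_average <- function(x, h) {
  n <- length(x)
  mu <- rep(NA_real_, n)            # the first h - 1 windows are incomplete
  for (k in seq_len(n)) {
    if (k >= h) mu[k] <- mean(x[(k - h + 1):k])
  }
  mu
}

x <- c(2, 4, 6, 8, 10, 12)          # toy series (made up)
sma_small <- simple_moving_average(x, h = 2)   # follows local changes
sma_large <- simple_moving_average(x, h = 4)   # smoother, more NAs at the start

# Base R computes the same trailing average with stats::filter()
sma_check <- as.numeric(stats::filter(x, rep(1 / 2, 2), sides = 1))
```

Note the trade-off: a larger \(h\) gives a smoother line but leaves more undefined values at the start of the series.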
co2: Mauna Loa Atmospheric CO2 Concentration dataset (monthly \(\text{CO}_2\) concentration 1959 to 1997)
Can use as.Date() to create time indexes.
Default format is Year-Month-Day (Year/Month/Day also works). For anything else, specify the format in as.Date() (e.g., format = "%m/%d/%Y")
scale_x_date() to create interpretable axis labels
ggseas package to plot moving averages
Two other common averages: Cumulative moving averages and weighted moving averages.
\[\mu_k^{(CMA)} = \frac{\sum_{t=1}^k X_t}{k}\]
\[\mu_k^{(WMA)} = \frac{\sum_{t=k - h + 1}^k X_t \cdot w_t}{ \sum_{t=k - h + 1}^k w_t}\]
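Both averages are easy to sketch in base R. This is an illustration under assumptions: `weighted_moving_average` is a hypothetical helper, the weights are attached to positions within the window (oldest first), and the linearly increasing weights are just one possible choice.

```r
x <- c(2, 4, 6, 8, 10, 12)          # toy series (made up)

# Cumulative moving average: mean of everything observed up to time k.
cma <- cumsum(x) / seq_along(x)

# Weighted moving average over the last h = length(w) points.
# `w` holds one weight per window position, oldest first (an assumption).
weighted_moving_average <- function(x, w) {
  h <- length(w)
  n <- length(x)
  mu <- rep(NA_real_, n)
  for (k in seq_len(n)) {
    if (k >= h) mu[k] <- sum(x[(k - h + 1):k] * w) / sum(w)
  }
  mu
}

# Recent points count more than older ones
wma <- weighted_moving_average(x, w = c(1, 2, 3))

# Equal weights recover the simple moving average
sma <- weighted_moving_average(x, w = c(1, 1, 1))
```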
Time series data is fundamentally different from other data problems we’ve worked with because measurements are not independent
Obvious example: The temperature today is correlated with temperature yesterday. (Maybe not in Pittsburgh?)
Important term: lags. Used to determine if one time point influences future time points.
Lag 1: Comparing time series at time \(t\) with time series at time \(t - 1\).
Lag 2: Comparing time series at time \(t\) with time series at time \(t - 2\).
And so on…
Let’s say we have time measurements \((X_1, X_2, X_3, X_4, X_5)\).
The \(\ell = 1\) lag is \((X_2, X_3, X_4, X_5)\) vs \((X_1, X_2, X_3, X_4)\).
The \(\ell = 2\) lag is \((X_3, X_4, X_5)\) vs \((X_1, X_2, X_3)\).
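Building the lagged pairs is just a matter of dropping values from each end. A small sketch, with a made-up series standing in for \((X_1, \dots, X_5)\) and a hypothetical helper `lag_pairs`:

```r
x <- c(10, 12, 15, 11, 14)   # stands in for (X_1, ..., X_5); values are made up

# Pair each value with the value `lag` steps earlier.
# `lag_pairs` is a hypothetical helper name.
lag_pairs <- function(x, lag) {
  n <- length(x)
  data.frame(
    current = x[(lag + 1):n],   # X_{lag+1}, ..., X_n
    lagged  = x[1:(n - lag)]    # X_1, ..., X_{n-lag}
  )
}

lag1 <- lag_pairs(x, 1)   # (X_2, ..., X_5) vs (X_1, ..., X_4)
lag2 <- lag_pairs(x, 2)   # (X_3, X_4, X_5) vs (X_1, X_2, X_3)
```

Each extra lag costs one pair of observations, so long lags are estimated from less data.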
Consider: Are previous outcomes (lags) predictive of future outcomes?
Autocorrelation: Correlation between a time series and a lagged version of itself.
Define \(r_{\ell}\) as the correlation between a time series and Lag \(\ell\) of that time series.
Lag 1: \(r_1\) is correlation between \((X_2, X_3, X_4, X_5)\) and \((X_1,X_2,X_3,X_4)\)
Lag 2: \(r_2\) is correlation between \((X_3, X_4, X_5)\) and \((X_1,X_2,X_3)\)
And so on…
Common diagnostic: Plot \(\ell\) on x-axis, \(r_{\ell}\) on y-axis.
Tells us if correlations are “significantly large” or “significantly small” for certain lags
To make an autocorrelation plot, we use the acf() function; the ggplot version uses autoplot()
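As a rough sketch, the autocorrelations can be computed by hand and compared with acf() on the built-in co2 series. `lag_cor` is a hypothetical helper. Note that acf() uses a closely related but not identical estimator (it centers with the overall mean and scales by the overall variance), so the two sets of values agree only approximately.

```r
x <- as.numeric(datasets::co2)   # built-in monthly CO2 series
n <- length(x)

# Pairwise version: correlate X_t with X_{t - lag} directly.
# `lag_cor` is a hypothetical helper name.
lag_cor <- function(x, lag) {
  n <- length(x)
  cor(x[(lag + 1):n], x[1:(n - lag)])
}
r_pairwise <- sapply(1:24, function(l) lag_cor(x, l))

# acf() returns lag 0 first (always 1), so drop it
r_acf <- acf(x, lag.max = 24, plot = FALSE)$acf[-1]

# Approximate 95% band drawn on the default acf plot:
# autocorrelations outside +/- band are flagged as "significant"
band <- qnorm(0.975) / sqrt(n)
```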
With strong global trends, autocorrelations will be very positive.
Helpful: Visualize autocorrelations after removing the global trend (compute moving average with rollapply())
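A minimal sketch of this idea on the co2 data. As an assumption for self-containment, it uses base R's stats::filter() as a stand-in for zoo's rollapply(), with a 12-month window as the global trend.

```r
x <- as.numeric(datasets::co2)

# 12-month moving average as the global trend.
# stats::filter() is a base-R stand-in for rollapply() here.
trend <- as.numeric(stats::filter(x, rep(1 / 12, 12), sides = 2))

# The window is incomplete at both ends, so drop those NAs
detrended <- (x - trend)[!is.na(trend)]

r_raw       <- acf(x,         lag.max = 12, plot = FALSE)$acf[-1]
r_detrended <- acf(detrended, lag.max = 12, plot = FALSE)$acf[-1]
```

On the raw series every autocorrelation is large and positive because of the trend. After removing it, the seasonal pattern shows up: negative autocorrelation around lag 6 (opposite seasons) and positive again at lag 12.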
Remember that there are three main components to a time series:
Average trends
Seasonality
Noise
Use ggsdc() (from ggseas) to decompose a time series into these three components
Plots the observed time series.
Plots a loess curve as the global trend.
Plots another loess curve on (observed - trend) as the seasonality.
Plots the noise (observed - trend - seasonality).
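The decomposition can be imitated by hand in base R. This is a simplified sketch, not what ggsdc() does internally: the trend is a loess fit with an assumed span of 0.75, and the seasonality is estimated with monthly averages of the detrended series, a simpler stand-in for the second loess fit.

```r
co2_ts   <- datasets::co2
y        <- as.numeric(co2_ts)
time_idx <- as.numeric(time(co2_ts))    # decimal years
month    <- as.numeric(cycle(co2_ts))   # 1 = January, ..., 12 = December

# 1. Global trend: wide-span loess of the series on time (span is an assumption)
trend <- fitted(loess(y ~ time_idx, span = 0.75))

# 2. Seasonality: average of (observed - trend) within each calendar month
detrended <- y - trend
seasonal  <- ave(detrended, month)

# 3. Noise: whatever is left over
noise <- y - trend - seasonal
```

By construction the three pieces add back up to the observed series.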
Typically location is measured with latitude / longitude (2D)
Latitude: Measures North / South (the “y-axis”)
Range is \((-90^{\circ}, 90^{\circ})\)
Measures degrees from the equator \((0^{\circ})\)
\((-90^{\circ}, 0^{\circ})\) = southern hemisphere
\((0^{\circ}, 90^{\circ})\) = northern hemisphere
Longitude: Measures East/West (the “x-axis”)
Range is \((-180^{\circ}, 180^{\circ})\)
Measures degrees from the prime meridian \((0^{\circ})\) in Greenwich, England
\((-180^{\circ}, 0^{\circ})\) = western hemisphere
\((0^{\circ}, 180^{\circ})\) = eastern hemisphere
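A quick way to check the sign conventions is to classify a couple of known locations. `hemispheres` is a hypothetical helper. Assigning exactly 0 degrees to the northern/eastern side is an arbitrary choice made here.

```r
# Classify coordinates by the sign of latitude and longitude.
# `hemispheres` is a hypothetical helper name.
hemispheres <- function(lat, lon) {
  stopifnot(all(abs(lat) <= 90), all(abs(lon) <= 180))
  data.frame(
    north_south = ifelse(lat >= 0, "northern", "southern"),
    east_west   = ifelse(lon >= 0, "eastern", "western")
  )
}

# Pittsburgh (about 40.44, -80.00) and a point in Papua New Guinea (-6.08, 145)
h <- hemispheres(lat = c(40.44, -6.08), lon = c(-80.00, 145))
```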
Map projections: Transformation of the lat / long coordinates on a sphere (the earth) to a 2D plane
There are many different projections - each will distort the map in different ways.
The most common projections are:
ggmap: Draw a map based on lat / lon coordinates
Put the bounding box into get_stadiamap() to access Stamen Maps (you need an API key!)
Draw the map using ggmap() to serve as base
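A sketch of the bounding-box step, assuming the box is a named vector with left/bottom/right/top entries. `make_bbox_from_points` is a hypothetical helper and the half-degree padding is arbitrary. The ggmap calls are left as comments because they need the package and a registered Stadia Maps API key; treat the map type and zoom level as assumptions too.

```r
# Build a padded bounding box from observed coordinates.
# `make_bbox_from_points` is a hypothetical helper name.
make_bbox_from_points <- function(lon, lat, pad = 0.5) {
  c(left   = min(lon) - pad,
    bottom = min(lat) - pad,
    right  = max(lon) + pad,
    top    = max(lat) + pad)
}

# Illustrative coordinates for three locations
bbox <- make_bbox_from_points(lon = c(145, 146, 144),
                              lat = c(-6.08, -5.21, -5.83))

# With ggmap installed and an API key registered (not run here):
# library(ggmap)
# register_stadiamaps(key = "YOUR-API-KEY")
# map <- get_stadiamap(bbox, maptype = "stamen_terrain", zoom = 7)
# ggmap(map)
```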
Point Pattern Data: lat-long coordinates where events have occurred
Point-Referenced data: Latitude-longitude (lat-long) coordinates as well as one or more variables specific to those coordinates.
Areal Data: Geographic regions with one or more variables associated with those regions.
Each type is structured differently within a dataset
Each type requires a different kind of graph(s)
We’re going to review each type of data. Then, we’re going to demonstrate how to plot these different data types
Today: Point-referenced and point pattern
Monday: Areal data
Point Pattern Data: lat-long coordinates where events have occurred
Point pattern data simply records the lat-long of events; thus, there are only two columns
Latitude and longitude are represented with dots; the result is sometimes called a dot map or bubble map.
The goal is to understand how the density of events varies across space
The density of the dots can also be visualized (e.g., with contours)
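To make "density of events" concrete, here is a small sketch using a 2D kernel density estimate on simulated event locations. The data are made up (one dense cluster plus scattered background events), and MASS::kde2d() is one possible estimator, assumed to be available since MASS ships with standard R installations.

```r
set.seed(1)

# Simulated point pattern data: only two columns, lon and lat (made up)
events <- data.frame(
  lon = c(rnorm(200, mean = -80.00, sd = 0.02), runif(100, -80.10, -79.90)),
  lat = c(rnorm(200, mean =  40.44, sd = 0.02), runif(100,  40.34,  40.54))
)

# Estimate the event density on a 50 x 50 grid
dens <- MASS::kde2d(events$lon, events$lat, n = 50)

# Rows of dens$z correspond to dens$x (lon), columns to dens$y (lat)
peak     <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_lon <- dens$x[peak[1, "row"]]
peak_lat <- dens$y[peak[1, "col"]]

# Base-R contours of the estimated density (not run here):
# contour(dens$x, dens$y, dens$z)
```

The highest estimated density lands near the center of the simulated cluster, which is what the contours on a map would highlight.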
Point-Referenced data: Latitude-longitude (lat-long) coordinates as well as one or more variables specific to those coordinates
Point-referenced data will have the following form:
# A tibble: 3 × 6
    lat   lon altitude n_depart n_arrive name
  <dbl> <dbl>    <dbl>    <int>    <int> <chr>
1 -6.08  145.     5282        5        5 Goroka Airport
2 -5.21  146.       20        8        8 Madang Airport
3 -5.83  144.     5388       10       12 Mount Hagen Kagamuga Airport
The goal is to understand how the variable(s) (e.g., altitude) vary across different spatial locations
Typically, the latitude and longitude are represented with dots, and the variable(s) are represented with size and/or colors
# Assumes `map` (the base map from get_stadiamap()) and `airports` already exist
ggmap(map) +
  geom_point(data = airports,
             aes(x = lon, y = lat,
                 size = sqrt(n_depart), color = sqrt(n_arrive)),
             alpha = .5) +
  # Size is on a square-root scale; breaks are relabeled in original departure counts
  scale_size_area(breaks = sqrt(c(1, 5, 10, 50, 100, 500)),
                  labels = c(1, 5, 10, 50, 100, 500),
                  name = "# departures") +
  scale_color_distiller(palette = "Spectral") +
  labs(color = "sqrt(# arrivals)") +
  theme(legend.position = "bottom")
There are whole courses, textbooks, and careers dedicated to this. We’re not going to cover everything!
However, there are some straightforward analyses that can be done for spatial data.
Point-Referenced Data:
Divide geography into groups (e.g., north/south/east/west) and use regression to test if there are significant differences.
Regression of \(\text{outcome} \sim \text{latitude} + \text{longitude}\). Smoothing regression (e.g., loess) is particularly useful here.
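The first approach (split the geography into groups, then regress) can be sketched with simulated data. Everything here is made up for illustration: the coordinates, the outcome `z` that rises with latitude, and the north/south cutoff at 40.5 degrees.

```r
set.seed(42)
n <- 200

# Simulated point-referenced data (made up)
pts <- data.frame(lat = runif(n, 39, 42), lon = runif(n, -81, -75))
pts$z <- 10 + 2 * (pts$lat - 40.5) + rnorm(n)   # outcome is higher in the north

# Divide the geography into two groups at an arbitrary cutoff
pts$region <- ifelse(pts$lat >= 40.5, "north", "south")

# Regress the outcome on region and test for a difference
fit_region    <- lm(z ~ region, data = pts)
coef_table    <- summary(fit_region)$coefficients
south_effect  <- coef_table["regionsouth", "Estimate"]
south_p_value <- coef_table["regionsouth", "Pr(>|t|)"]
```

The coefficient on `regionsouth` estimates the south-minus-north difference in the average outcome, and its p-value is the test of whether that difference is significant.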
For basic linear regression:
Plot \((x, y)\) as points
Fit the regression model \(y \sim x\), to give us \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x\)
Plot \((x, \hat{y})\) as a line
For point-referenced data, we have the following variables: latitude, longitude, and an outcome \(z\)
Consider the following linear regression model: \(z \sim \text{lat} + \text{long}\)
Goal: Make a visual involving \((\text{long}, \text{lat}, \hat{z})\), and possibly \(z\).
Want \(\hat{z}\) for many (long, lat) combos (not just the observed ones!)
To do this, follow this procedure:
Fit the model \(z \sim \text{lat} + \text{long}\)
Create a grid of \((\text{long}, \text{lat})_{ij}\)
Generate \(\hat{z}_{ij}\) for each \((\text{long}, \text{lat})_{ij}\)
Plot a heat map or contour plot of (long, lat, \(\hat{z}\))
This is a simple form of spatial interpolation; kriging is a closely related, more formal model-based approach
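The procedure above might look like the following in base R. The data are simulated under an assumed linear relationship, so this is only a sketch of the fit / grid / predict pattern.

```r
set.seed(7)
n <- 150

# Simulated point-referenced data (made up)
obs <- data.frame(lat = runif(n, 39, 42), long = runif(n, -81, -75))
obs$z <- 5 + 1.5 * obs$lat - 0.8 * obs$long + rnorm(n, sd = 0.5)

# 1. Fit the model z ~ lat + long
fit <- lm(z ~ lat + long, data = obs)

# 2. Create a grid of (long, lat) combinations covering the observed region
grid <- expand.grid(
  long = seq(min(obs$long), max(obs$long), length.out = 40),
  lat  = seq(min(obs$lat),  max(obs$lat),  length.out = 40)
)

# 3. Generate z_hat for every grid point
grid$z_hat <- predict(fit, newdata = grid)

# 4. Heat map with ggplot2 (not run here):
# ggplot(grid, aes(x = long, y = lat, fill = z_hat)) + geom_tile()
```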
The steps used to create this map are…
Fit a regression model with a latitude-longitude interaction using loess()
Make a grid of lat/long coordinates, using seq() and expand.grid()
Get estimated outcomes across the grid using predict()
Use geom_contour_filled() to color map by estimated outcomes
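A sketch of these steps with a smooth, nonlinear surface. The data are simulated (a bump centered near latitude 40.5, longitude -78), and loess() is run with its default span, which is an assumption rather than the setting used for the map in lecture.

```r
set.seed(7)
n <- 300

# Simulated point-referenced data with a bump in the middle (made up)
obs <- data.frame(lat = runif(n, 39, 42), long = runif(n, -81, -75))
obs$z <- exp(-((obs$lat - 40.5)^2 + (obs$long + 78)^2 / 4)) +
  rnorm(n, sd = 0.1)

# Smooth regression surface in both coordinates
fit <- loess(z ~ lat * long, data = obs)

# Grid of lat/long coordinates inside the observed region
grid <- expand.grid(
  lat  = seq(min(obs$lat),  max(obs$lat),  length.out = 50),
  long = seq(min(obs$long), max(obs$long), length.out = 50)
)

# Estimated outcomes across the grid
grid$z_hat <- predict(fit, newdata = grid)
peak <- grid[which.max(grid$z_hat), ]

# Filled contours with ggplot2 (not run here):
# ggplot(grid, aes(x = long, y = lat, z = z_hat)) + geom_contour_filled()
```

Unlike the linear model, the loess surface can bend, so the highest predicted values sit near the simulated bump rather than at an edge of the map.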
Walked through basics of time series data, such as moving averages, autocorrelation, seasonality
Visualized spatial data in a 2D plane (latitude/longitude), i.e., maps
Point pattern: Scatterplots with density contours
Point-referenced: Scatterplots with color/size, use regression/loess for inference
HW4 is due TONIGHT! Email me a draft of your EDA report! (1 per group)
Next time: Visualizing areal data and creating high-quality graphics
CW Chapter 13 Visualizing time series and other functions of an independent variable, CW Chapter 14 Visualizing trends, CW Chapter 15 Visualizing geospatial data, KH Chapter 7 Draw Maps