Introduction to Elo ratings (INSTRUCTOR SOLUTIONS)

An introduction to Elo ratings using NFL game outcomes.

Author

Ron Yurko

Intro and Data

The purpose of this module is to introduce the basics of Elo ratings in the context of measuring NFL team strengths. This file contains guided exercises demonstrating how to implement Elo ratings from scratch in R.

We’ll use a subset of the NFL Game Outcomes dataset available on the SCORE Sports Data Repository, only considering games during the 2023-24 season. The following code chunk reads in the larger dataset and filters down to only include games during the 2023-24 season:

# Need to have the tidyverse installed prior to starting!
library(tidyverse)

nfl_games <- read_csv("https://data.scorenetwork.org/data/nfl_mahomes_era_games.csv") |>
  filter(season == 2023)

As indicated in the overview page for the larger dataset, each row in the dataset corresponds to a single game played during the 2023-24 season:

nfl_games

# A tibble: 285 × 10
   season game_id      game_type  week home_team away_team home_score away_score
    <dbl> <chr>        <chr>     <dbl> <chr>     <chr>          <dbl>      <dbl>
 1   2023 2023_01_DET… REG           1 KC        DET               20         21
 2   2023 2023_01_CAR… REG           1 ATL       CAR               24         10
 3   2023 2023_01_HOU… REG           1 BAL       HOU               25          9
 4   2023 2023_01_CIN… REG           1 CLE       CIN               24          3
 5   2023 2023_01_JAX… REG           1 IND       JAX               21         31
 6   2023 2023_01_TB_… REG           1 MIN       TB                17         20
 7   2023 2023_01_TEN… REG           1 NO        TEN               16         15
 8   2023 2023_01_SF_… REG           1 PIT       SF                 7         30
 9   2023 2023_01_ARI… REG           1 WAS       ARI               20         16
10   2023 2023_01_GB_… REG           1 CHI       GB                20         38
# ℹ 275 more rows
# ℹ 2 more variables: game_outcome <dbl>, score_diff <dbl>

Note the game_type column indicates if the game was during the regular season (REG), or during the playoffs with the different values indicating the different playoff rounds:

table(nfl_games$game_type)


CON DIV REG  SB  WC 
  2   4 272   1   6

The week column just increases in the correct order, which will be convenient for implementing Elo ratings over the course of the NFL season.

Background Information

Elo ratings were created by physicist Arpad Elo in the 1960s for rating chess players. The main idea behind Elo ratings is to create an exchange in rating points between players (or teams) after a match. The simplest version of this system was constructed to be a zero-sum rating, such that the winner gains x points while the same number of x points are subtracted from the loser’s rating. If the win was expected, the winner receives fewer points than if the win was unexpected (i.e., an upset) - where the expectation is set prior to the match. This system results in a dynamic rating that is adjusted for opponent quality.

Method Details

We’re going to consider a simple version of Elo ratings in the context of measuring NFL team strength via a rating. Let the rating for the home team be \(R_{\text{home}}\) and the away team rating be \(R_{\text{away}}\). Then the expected score for the home team \(E_{\text{home}}\) is calculated as:

\[ E_{\text{home}} = \frac{1}{1+10^{\left(R_{\text{away}}-R_{\text{home}}\right) / 400}}, \]

and the expected score for the away team \(E_{\text{away}}\) is computed in a similar manner:

\[ E_{\text{away}} = \frac{1}{1+10^{\left(R_{\text{home}}-R_{\text{away}}\right) / 400}}. \] These expected scores represent the probability of winning¹, e.g., \(E_{\text{home}}\) represents the probability of winning for the home team.

The choice of 10 and 400 in the denominator may appear arbitrary at first, but they correspond to:

a logistic curve (thus bounded by 0 and 1) with base 10, and
a scaling factor of 400 which can be tuned to yield better predictions (discussed in more detail below).

A more general representation of the expected score would replace the choice of 10 with some constant (e.g., \(e\)) and replace 400 with a tune-able quantity \(d\). For now though we will just use 10 and 400 since they are the original choices.

While the above quantities represent the expectation of a game between teams with ratings \(R_{\text{home}}\) and \(R_{\text{away}}\), we need a step to update the ratings after observing the game outcome. We can update we update the ratings for the home team based on the observed score \(S_{\text{home}}\):

\[ R^{\text{new}}_{\text{home}} = R_{\text{home}} + K \cdot (S_{\text{home}} - E_{\text{home}}) \]

The observed score \(S_{\text{home}}\) is based on the game outcome such that,

\(S_{\text{home}} = 1\) if the home team wins,
\(S_{\text{home}} = 0.5\) if it is a draw, or
\(S_{\text{home}} = 0\) if the home team loses.

We compute the updated rating for the away team \(R^{\text{new}}_{\text{away}}\) in a similar manner, by replacing home team quantities with those with respect to the away team.

The quantity \(K\) is known as the update factor, indicating the maximum number of Elo rating points a team gains from winning a single game (and how many points are subtracted if they lose). This is a tuning parameter, which ideally should be selected to yield optimal predictive performance.

Thought Exercise

QUESTION: Given the above equation for \(R^{\text{new}}_{\text{home}}\), what do you think will happen as you increase \(K\)? Does a single game cause a larger or smaller change on a team’s rating? Likewise, what do you think will happen if you decrease \(K\)? Describe what you expect to observe in 1-3 sentences.

ANSWER: The update factor \(K\) controls how sensitive the ratings should be to a single game outcome. A larger choice of \(K\) will lead to a larger change in a team’s rating following a single game, and likewise a smaller choice of \(K\) will lead to a smaller change in a team’s rating.

Although the details are beyond the scope of this module, there is a relationship between this Elo ratings update formula with stochastic gradient descent for logistic regression.

Learn by Doing

We’ll now proceed to implement Elo ratings for NFL teams during the 2023-24 season in R.

Calculating and Updating Ratings

We will start by creating two helper functions to compute the expected scores and updated ratings after a game.

Active Exercise

First, based on the above formulas for expected score, complete the function calc_expected_score that takes in as input a team_rating and opp_team_rating then returns the expected score for the team relative to the opp_team. For ease, your function should use the base 10 logistic curve and scaling factor of 400 as fixed quantities.

calc_expected_score <- function(team_rating, opp_team_rating) {
  1 / (1 + 10^((opp_team_rating - team_rating) / 400))
}

QUESTION: Using your function calc_expected_score, what is the expected score for a team with a rating of 1400 playing against an opposing team with a rating of 1600?

ANSWER:

calc_expected_score(1400, 1600)

[1] 0.2402531

The expected score for the team is about 0.24, indicating that the team with a rating of 1400 has an estimated 24% chance of beating an opponent with a rating of 1600.

Active Exercise

Next, complete the calc_new_rating function that takes in an initial team_rating, the observed_score and expected_score with respect to that team, along with a choice of the update_factor \(K\) to return the new rating. For now, we will just consider \(K = 20\) as the default choice.

calc_new_rating <- function(team_rating, observed_score, expected_score,
                            update_factor = 20) {
  team_rating + update_factor * (observed_score - expected_score)
}

QUESTION: Using your functions calc_expected_score and calc_new_rating together, what is the new rating for team that had an initial rating of 1300 but beat an opponent with a rating of 1700? You should answer this question using \(K = 20\), and pass in the output of calc_expected_score to be the expected_score. How does the observed change in the team’s rating compare to the maximum number of points the rating can change by?

ANSWER:

calc_new_rating(1300, 1, calc_expected_score(1300, 1700))

[1] 1318.182

The team’s rating improves to roughly 1318 after winning. Since the maximum number of possible points is \(K = 20\), this is indicative of how beating a team with a rating of 1700 was relatively unexpected and nearly earned the team the maximum possible 20 points.

NFL Elo Ratings

Now with the basics, let’s move on to perform these calculations over the entire season, updating a table to include each team’s Elo rating following every game. We can implement this using a for loop to proceed through each game in the nfl_games table, looking up each team’s previous ratings and performing the above calculations.

Prior to beginning this loop, we will set-up a table initializing each team with a rating of 1500. This a naive approach since we likely have prior knowledge about each team’s strength before the start of the season, but we’ll discuss this in more detail at the end of the module. For now, we’ll use 1500 since it is a common choice for initializing Elo ratings. The code chunk below initializes this starting table of ratings beginning with an imaginary week 0:

nfl_elo_ratings <- tibble(team = unique(nfl_games$home_team),
                          elo_rating = 1500,
                          week = 0)
nfl_elo_ratings

# A tibble: 32 × 3
   team  elo_rating  week
   <chr>      <dbl> <dbl>
 1 KC          1500     0
 2 ATL         1500     0
 3 BAL         1500     0
 4 CLE         1500     0
 5 IND         1500     0
 6 MIN         1500     0
 7 NO          1500     0
 8 PIT         1500     0
 9 WAS         1500     0
10 CHI         1500     0
# ℹ 22 more rows

Active Exercise

QUESTION: The code chunk below outlines the for loop to be used in updating team ratings after every game during the 2023-24 season in the nfl_games dataset. Fill in the missing portions code marked by ??? using your above calc_expected_score and calc_new_rating functions. While this module does not cover all of the basics of tidyverse data wrangling, the code comments should help make the various steps clear.

ANSWER:

for (game_i in 1:nrow(nfl_games)) {
   
  # Grab the home and away teams in the current game:
  home_team <- nfl_games$home_team[game_i]
  away_team <- nfl_games$away_team[game_i]
  # What was the observed score by the home team?
  observed_home_score <- nfl_games$game_outcome[game_i]
  # Retain the week number for this game:
  game_week <- nfl_games$week[game_i]
  
  # What was each team's rating from their latest game in the
  # current elo ratings table, starting with the home team:
  home_rating <- nfl_elo_ratings |>
    filter(team == home_team) |>
    # Sort in descending order
    arrange(desc(week)) |>
    # Grab the latest game
    slice(1) |>
    # Just return the elo rating
    pull(elo_rating)
  
  # Same thing for away team
  away_rating <- nfl_elo_ratings |>
    filter(team == away_team) |>
    arrange(desc(week)) |>
    slice(1) |>
    pull(elo_rating)
  
  # Now get their new ratings, starting with the home team:
  new_home_rating <- calc_new_rating(home_rating, observed_home_score, 
                                     calc_expected_score(home_rating,
                                                         away_rating))
  # And repeating for the away team using the opposite input as home team:
  new_away_rating <- calc_new_rating(away_rating, 1 - observed_home_score, 
                                     calc_expected_score(away_rating,
                                                         home_rating))
  
  # Set up a table containing the updated ratings for each team after the game
  updated_ratings <- tibble(team = c(home_team, away_team),
                            elo_rating = c(new_home_rating, new_away_rating),
                            # Store the week index of the game
                            week = rep(game_week, 2))
  
  # Add each teams new ratings to the current elo ratings table by row binding:
  nfl_elo_ratings <- nfl_elo_ratings |>
    bind_rows(updated_ratings)
  
}

After you run the completed for loop, you can view and inspect the ratings in different ways. For example, the following code chunk will return the final rating for each after the completion of the entire season of games:

nfl_elo_ratings |>
  group_by(team) |>
  # Since some teams make the playoffs, need to find the rating 
  # for each team's final weeK:
  summarize(final_rating = elo_rating[which.max(week)]) |>
  # Sort in descending order of the rating so the best team is first:
  arrange(desc(final_rating))

# A tibble: 32 × 2
   team  final_rating
   <chr>        <dbl>
 1 KC           1576.
 2 BAL          1575.
 3 SF           1564.
 4 DET          1561.
 5 BUF          1547.
 6 DAL          1546.
 7 CLE          1532.
 8 HOU          1528.
 9 MIA          1525.
10 LA           1523.
# ℹ 22 more rows

BONUS: Expand To Visualize Ratings

It is often helpful to visualize how the team ratings are changing over time. While this module does not cover the details about the ggplot2 visualization library, the following code creates a line for each team:

nfl_elo_ratings |>
  # Input the dataset into ggplot, mapping the week to the x-axis and
  # the elo_rating to the y-axis, colored by team:
  ggplot(aes(x = week, y = elo_rating, color = team)) +
  geom_line() +
  theme_bw() +
  labs(x = "Week", y = "Elo rating",
       title = "NFL Elo ratings in 2023-24 season")

While we can observe ratings changing over the season for every team, this visualization is less than ideal. Instead one could take advantage of the team colors available using the load_teams function from the nflverse This is a little more involved, but here is example way to create a figure highlighting teams in each division separately (this requires installing the nflreadr, ggrepel, and cowplot packages:

library(nflreadr)
nfl_team_colors <- load_teams() |>
  dplyr::select(team_abbr, team_division, team_color)

# Create a dataset that has each team's final Elo rating
nfl_team_final <- nfl_elo_ratings |>
  group_by(team) |>
  summarize(week = max(week),
            elo_rating = elo_rating[which.max(week)],
            .groups = "drop") |>
  inner_join(nfl_team_colors, by = c("team" = "team_abbr")) |>
  arrange(desc(elo_rating))
 
# Need ggrepel:
library(ggrepel)
division_plots <- 
  lapply(sort(unique(nfl_team_final$team_division)),
         function(nfl_division) {                            
             # Pull out the teams in the division
             division_teams <- nfl_team_final |>
               filter(team_division == nfl_division) |>
               mutate(team = fct_reorder(team, desc(elo_rating))) 
             
             # Get the Elo ratings data just for these teams:
             division_data <- nfl_elo_ratings |>
               filter(team %in% division_teams$team) |>
               mutate(team = factor(team,
                                    levels = levels(division_teams$team))) |>
               # Make text labels for them:
               group_by(team) |>
               mutate(team_label = if_else(week == max(week),
                                           as.character(team), 
                                           NA_character_)) |>
               ungroup()
             
             # Now make the full plot
             nfl_elo_ratings |>
               # Plot all of the other teams as gray lines:
               filter(!(team %in% division_teams$team)) |>
               ggplot(aes(x = week, y = elo_rating, group = team)) +
               geom_line(color = "gray", alpha = 0.5) +
               # But display the division teams with their colors:
               geom_line(data = division_data,
                         aes(x = week, y = elo_rating, group = team,
                             color = team)) +
               geom_label_repel(data = division_data,
                                aes(label = team_label,
                                    color = team), nudge_x = 1, na.rm = TRUE,
                                direction = "y") +
               scale_color_manual(values = division_teams$team_color,
                                  guide = FALSE) +
               theme_bw() +
               labs(x = "Week", y = "Elo rating",
                    title = paste0("Division: ", nfl_division)) 
         })
# Display the grid of plots with cowplot!
library(cowplot)
plot_grid(plotlist = division_plots, ncol = 2, align = "hv")

Assessing the Ratings

The result of the for loop from above provides us with a dataset that contains the rating for each team after every week. But how do we know if we can trust this approach for estimating team ratings? We can assess the predictive performance of the Elo ratings based on the estimated probabilities for every game given the team’s ratings entering the game.

Challenge: Missing Team ratings

To demonstrate this, we will first need to fill in for missing weeks for teams due to bye weeks. You can see in the output from the table counts below that during certain weeks there are fewer than 32 teams with ratings (this is not a concern in the values post week 18 since that corresponds to playoffs):

table(nfl_elo_ratings$week)


 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
32 32 32 32 32 28 30 26 32 28 28 28 32 26 30 32 32 32 32 12  8  4  2

The follow code chunks fixes this issue by iterating over each possible week, and fills in missing week ratings with the last available rating. There are multiple ways to do this! The details of the code are not necessarily important for understanding how to assess the accuracy of Elo ratings but rather a practical concern with implementation. If you are not interested in how the code works, feel free to just run it for usage in the remaining steps.

# First get a vector of the unique teams from the table:
nfl_teams <- unique(nfl_elo_ratings$team)

# Now iterate over the weeks and decide how to grab the team ratings
complete_nfl_elo_ratings <- 
  map_dfr(unique(nfl_elo_ratings$week),
          function(week_i) {
            
            # How many teams have ratings in the week?
            covered_teams <- nfl_elo_ratings |>
              filter(week == week_i) |>
              pull(team) |>
              unique()
            
            if (length(nfl_teams) == length(covered_teams)) {
              # Just return the rows from the nfl_elo_ratings table:
              nfl_elo_ratings |>
                filter(week == week_i)
              
            } else {
              # Otherwise, we need to fill in the missing teams
              # First get the latest ratings for every team prior to this week:
              latest_ratings <- nfl_elo_ratings |>
                filter(week < week_i) |>
                group_by(team) |>
                summarize(elo_rating = elo_rating[which.max(week)],
                          .groups = "drop") |>
                mutate(week = week_i)
              
              # Now gather the ratings for teams that are not missing this week:
              week_ratings <- nfl_elo_ratings |>
                filter(week == week_i)
              
              # Join together the missing ratings and return:
              week_ratings |>
                bind_rows(filter(latest_ratings,
                                 !(latest_ratings$team %in% week_ratings$team)))
              
            }

          })

# And this now fixes the previous problem:
table(complete_nfl_elo_ratings$week)


 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32

We will now make two copies of the complete_nfl_elo_ratings table - one to use for home teams and another to use for away teams. The code chunk below initializes these copies, and also adds 1 to the week column to indicate which week to use team’s rating for when predicting:

home_elo_ratings <- complete_nfl_elo_ratings |>
  mutate(week = week + 1) |>
  # Rename the team and elo_rating columns
  rename(home_team = team,
         home_elo_rating = elo_rating)

# And repeat for away teams:
away_elo_ratings <- complete_nfl_elo_ratings |>
  mutate(week = week + 1) |>
  rename(away_team = team,
         away_elo_rating = elo_rating)

Next, we can join the ratings stored in these two tables to the nfl_games table to estimate the expected outcome with respect to the home team. The following code chunk demonstrates how to left_join the team ratings, and then compute the probability of winning for the home team:

upd_nfl_games <- nfl_games |>
  # First join home team by the team abbreviation and week
  left_join(home_elo_ratings, by = c("home_team", "week")) |>
  # Repeat for away team ratings:
  left_join(away_elo_ratings, by = c("away_team", "week")) |>
  # And now compute the expectation, home_win_prob:
  mutate(home_win_prob = calc_expected_score(home_elo_rating,
                                             away_elo_rating))
upd_nfl_games

# A tibble: 285 × 13
   season game_id      game_type  week home_team away_team home_score away_score
    <dbl> <chr>        <chr>     <dbl> <chr>     <chr>          <dbl>      <dbl>
 1   2023 2023_01_DET… REG           1 KC        DET               20         21
 2   2023 2023_01_CAR… REG           1 ATL       CAR               24         10
 3   2023 2023_01_HOU… REG           1 BAL       HOU               25          9
 4   2023 2023_01_CIN… REG           1 CLE       CIN               24          3
 5   2023 2023_01_JAX… REG           1 IND       JAX               21         31
 6   2023 2023_01_TB_… REG           1 MIN       TB                17         20
 7   2023 2023_01_TEN… REG           1 NO        TEN               16         15
 8   2023 2023_01_SF_… REG           1 PIT       SF                 7         30
 9   2023 2023_01_ARI… REG           1 WAS       ARI               20         16
10   2023 2023_01_GB_… REG           1 CHI       GB                20         38
# ℹ 275 more rows
# ℹ 5 more variables: game_outcome <dbl>, score_diff <dbl>,
#   home_elo_rating <dbl>, away_elo_rating <dbl>, home_win_prob <dbl>

We can now assess the use of the Elo rating system with the computed home_win_prob values relative to the observed game_outcome. While there are a number of ways to evaluate the performance of a probability estimate, in this module we will consider the use of the Brier score which is computed as the mean squared difference between the observed outcome and predicted probabilities. In the context of our Elo rating system notation, the Brier score is computed across \(N\) games as:

\[ \frac{1}{N} \sum_{i = 1}^{N} (S_{\text{home},i} - E_{\text{home},i})^2 \] where the use of subscript \(i\) refers to the outcome and expectation for the home team in game \(i\).

Thought Exercise

QUESTION: Given the above formula for computing the Brier score, what do you think is a better indicator predictive accuracy: a lower or higher Brier score?

ANSWER: It is more optimal to achieve a lower Brier score as this indicates the predicted probabilities are closer to the observed outcome.

We will compute the Brier score using our NFL Elo ratings, and compare the performance to always using a 50/50 probability for every game, i.e., as if we never learned any information over the course of the season.

Active Exercise

QUESTION: Compute and report the Brier score for both: (1) the Elo rating based home_win_prob and (2) as if you predicted the probability to be 0.5 for every game. Which approach performs better? Are you surprised by the outcome?

ANSWER:

The following code chunk computes the Brier score for both approaches:

upd_nfl_games |>
  summarize(elo_brier_score = mean((game_outcome - home_win_prob)^2),
            base_brier_score = mean((game_outcome - 0.5)^2))

# A tibble: 1 × 2
  elo_brier_score base_brier_score
            <dbl>            <dbl>
1           0.245             0.25

We can see that the Elo ratings based Brier score is slightly lower than the 50/50 baseline. This is not surprising and should be re-assuring! We learn information about teams over the course of the season and this leads to better performance in predicting game outcomes.

Although you have just implemented and evaluated the use of Elo ratings in the context of NFL games, so far we have just considered an update factor of \(K = 20\). But is there a more optimal choice?

Challenging Active Exercise

QUESTION: You will now proceed to different choices for the update factor. Rather than tediously code the entire process from above multiple times, we will wrap up the code to compute the Brier score for a given choice of \(K\) inside a function compute_elo_brier_score. The code chunk in the ANSWER portion below provides you with a template to fill out, this function takes in both the nfl_games data and input for the update_factor \(K\). Once you have completed the function, compute and report the Brier score for \(K =\) 5, 20, 50, and 100 (your value for \(K = 20\) should match the previous output). Which choice of \(K\) yields the best Brier score?

ANSWER:

The following is the complete version of the template code:

compute_elo_brier_score <- function(games_table, update_factor) {
  # First initialize the ratings:
  nfl_elo_ratings <- tibble(team = unique(games_table$home_team),
                            elo_rating = 1500,
                            week = 0)
  
  # Loop through to construct the Elo ratings:
  for (game_i in 1:nrow(games_table)) {
    
    # Grab the home and away teams in the current game:
    home_team <- games_table$home_team[game_i]
    away_team <- games_table$away_team[game_i]
    # What was the observed score by the home team?
    observed_home_score <- games_table$game_outcome[game_i]
    # Retain the week number for this game:
    game_week <- games_table$week[game_i]
    
    # What was each team's rating from their latest game in the
    # current elo ratings table, starting with the home team:
    home_rating <- nfl_elo_ratings |>
      filter(team == home_team) |>
      # Sort in descending order
      arrange(desc(week)) |>
      # Grab the latest game
      slice(1) |>
      # Just return the elo rating
      pull(elo_rating)
    
    # Same thing for away team
    away_rating <- nfl_elo_ratings |>
      filter(team == away_team) |>
      arrange(desc(week)) |>
      slice(1) |>
      pull(elo_rating)
    
    # Now get their new ratings, starting with the home team:
    new_home_rating <- calc_new_rating(home_rating, observed_home_score, 
                                       calc_expected_score(home_rating,
                                                           away_rating),
                                       update_factor)
    # And repeating for the away team using the opposite input as home team:
    new_away_rating <- calc_new_rating(away_rating, 1 - observed_home_score, 
                                       calc_expected_score(away_rating,
                                                           home_rating),
                                       update_factor)
    
    # Set up a table containing the updated ratings for each team after the game
    updated_ratings <- tibble(team = c(home_team, away_team),
                              elo_rating = c(new_home_rating, new_away_rating),
                              # Store the week index of the game
                              week = rep(game_week, 2))
    
    # Add each teams new ratings to the current elo ratings table by row binding:
    nfl_elo_ratings <- nfl_elo_ratings |>
      bind_rows(updated_ratings)
    
  }
  
  # Fill in the missing ratings for teams:
  
  # First get a vector of the unique teams from the table:
  nfl_teams <- unique(nfl_elo_ratings$team)
  
  # Now iterate over the weeks and decide how to grab the team ratings
  complete_nfl_elo_ratings <- 
    map_dfr(unique(nfl_elo_ratings$week),
            function(week_i) {
              
              # How many teams have ratings in the week?
              covered_teams <- nfl_elo_ratings |>
                filter(week == week_i) |>
                pull(team) |>
                unique()
              
              if (length(nfl_teams) == length(covered_teams)) {
                # Just return the rows from the nfl_elo_ratings table:
                nfl_elo_ratings |>
                  filter(week == week_i)
                
              } else {
                # Otherwise, we need to fill in the missing teams
                # First get the latest ratings for every team prior to this week:
                latest_ratings <- nfl_elo_ratings |>
                  filter(week < week_i) |>
                  group_by(team) |>
                  summarize(elo_rating = elo_rating[which.max(week)],
                            .groups = "drop") |>
                  mutate(week = week_i)
                
                # Now gather the ratings for teams that are not missing this week:
                week_ratings <- nfl_elo_ratings |>
                  filter(week == week_i)
                
                # Join together the missing ratings and return:
                week_ratings |>
                  bind_rows(filter(latest_ratings,
                                   !(latest_ratings$team %in% week_ratings$team)))
                
              }
              
            })
  
  
  # Join the home and away team ratings for every game to get the probabilities:
  home_elo_ratings <- complete_nfl_elo_ratings |>
    mutate(week = week + 1) |>
    # Rename the team and elo_rating columns
    rename(home_team = team,
           home_elo_rating = elo_rating)
  
  # And repeat for away teams:
  away_elo_ratings <- complete_nfl_elo_ratings |>
    mutate(week = week + 1) |>
    rename(away_team = team,
           away_elo_rating = elo_rating)
  
  # Compute and return the Brier score
  games_table |>
    # First join home team by the team abbreviation and week
    left_join(home_elo_ratings, by = c("home_team", "week")) |>
    # Repeat for away team ratings:
    left_join(away_elo_ratings, by = c("away_team", "week")) |>
    # And now compute the expectation, home_win_prob:
    mutate(home_win_prob = calc_expected_score(home_elo_rating,
                                               away_elo_rating)) |>
    summarize(brier_score = mean((game_outcome - home_win_prob)^2)) |>
    pull(brier_score)
}

# Return the Brier score across the values for K:
compute_elo_brier_score(nfl_games, 5)

[1] 0.2478319

compute_elo_brier_score(nfl_games, 20)

[1] 0.244792

compute_elo_brier_score(nfl_games, 50)

[1] 0.2470157

compute_elo_brier_score(nfl_games, 100)

[1] 0.258864

Although the Brier score is close between 5, 20, and 50, we can see that 20 yields the lowest Brier score. Notably, \(K = 100\) yields a Brier score that is worse than guessing 50/50 for every game! This is indicative of over-reacting and over-fitting to an individual game.

Discussion

You have now learned the basics behind the popular Elo rating system in sports, including the steps for implementing Elo ratings from scratch in R to measure NFL team strength. Furthermore, you have a basic understanding for how to assess the predictive performance of the ratings using Brier score and how this can be used to tune the choice the update factor \(K\). Although we only considered the ratings for the 2023-24 season, you could observe how ratings change across a larger dataset of games spanning the Patrick Mahomes’ era.

However, there are a number of additional questions and considerations that we did not cover in this module such as:

Initial Elo ratings: Rather than using 1500 as the initial values for every team, you could use a more informed starting point such as Neil Paine’s NFL Elo ratings which start at the beginning of the league history.
New season? New roster?: We just demonstrated Elo ratings within one season, but what do we do across multiple seasons? Do we simply just use the final rating from the previous season as the initial rating in the following season? But teams change rosters so we likely want to make some correction.
Scaling factor: We fixed the scaling factor to be 400 in this module, but we could also tune this quantity in the same manner as \(K\).
What games matter in assessment?: Although we walked through assessing the performance Elo ratings performance with Brier scores, we treated every game equally in this calculation. What if we wanted to tune our Elo rating system to yield the most optimal predictions in the playoffs, and ignore performance in the first few weeks of the season?
What about margin of victory?: We only considered win/loss in a binary manner, but the score differential in a game may be informative of team strength. There are extensions to handle margin of victory in Elo ratings, but details are beyond the scope of this module.

With these considerations in mind, you now have the capability in implement Elo ratings in practice across a variety of sports.

You may find these additional resources helpful:

For football fans, you can simulate NFL seasons using your Elo ratings with the nflseedR package.
The elo package in R provides convenient functions for computing Elo ratings, similar to the functions we defined above.
An overview of the popular Glicko rating system by statistician Mark Glickman that quantifies uncertainty about the ratings.

Footnotes

Technically, it represents the probability of winning and half the probability of drawing, but for our purposes we will just treat this as the probability of winning due to how rare draws are in the NFL.↩︎