If you are new to R or are having trouble understanding the code in the sections below, we highly recommend the nflfastR beginner’s guide in vignette("beginners_guide").
nflfastR comes with a set of functions to access NFL play-by-play data and team rosters. This section provides a brief introduction to the essential functions.
nflfastR processes and cleans up play-by-play data and adds variables through its models. Since some of these tasks are performed by separate functions, the easiest way to compute the complete nflfastR dataset is build_nflfastR_pbp(). The main input for that function is a set of game ids, which can be accessed with fast_scraper_schedules(). The following code demonstrates how to build the nflfastR dataset for the Super Bowls of the 2017 - 2019 seasons.
library(nflfastR)
library(dplyr, warn.conflicts = FALSE)
ids <- nflfastR::fast_scraper_schedules(2017:2019) %>%
dplyr::filter(game_type == "SB") %>%
dplyr::pull(game_id)
pbp <- nflfastR::build_nflfastR_pbp(ids)
#> ── Build nflfastR Play-by-Play Data ───────────── nflfastR version 3.2.0.9006 ──
#> ● Start download of 3 games...
#> ✔ Download finished. Adding variables...
#> ✔ added game variables
#> ✔ added nflscrapR variables
#> ✔ added ep variables
#> ✔ added air_yac_ep variables
#> New names:
#> * td_prob -> td_prob...1
#> * opp_td_prob -> opp_td_prob...2
#> * fg_prob -> fg_prob...3
#> * opp_fg_prob -> opp_fg_prob...4
#> * safety_prob -> safety_prob...5
#> * ...
#> ✔ added wp variables
#> ✔ added air_yac_wp variables
#> ✔ added cp and cpoe
#> ✔ added fixed drive variables
#> ✔ added series variables
#> ● Cleaning up play-by-play... (ℹ If you run this with a lot of seasons this could take a few minutes.)
#> ✔ Cleaning completed
#> ✔ added qb_epa
#> ● Computing xyac...
#> ✔ added xyac variables
#> ── DONE ────────────────────────────────────────────────────────────────────────
In most cases, however, it is not necessary to execute this function for individual games, because nflfastR provides both a data repository and the second main play-by-play function: update_db(). Please see Example 8: Using the built-in database function for how to work with that function.
Joining roster data to the play-by-play dataset is possible as well. The data can be accessed with the function fast_scraper_roster(), and its application is demonstrated in Example 10: Working with roster and position data.
All examples listed below assume that the nflfastR and dplyr libraries are installed and loaded, as in the code above.
The functionality of nflscrapR can be duplicated by using fast_scraper(). This obtains the same information contained in nflscrapR (plus some extra) but much more quickly. To compare to nflscrapR, we use their data repository, as the program no longer functions now that the NFL has taken down the old Gamecenter feed. Note that EP differs from nflscrapR as we use a newer era-adjusted model (more on this in this post on Open Source Football).
This example also uses the built-in function clean_pbp() to create a ‘name’ column for the primary player involved (the QB on a pass play or the ball-carrier on a run play).
readr::read_csv(url("https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true")) %>%
dplyr::filter(home_team == "SF" & away_team == "SEA") %>%
dplyr::select(desc, play_type, ep, epa, home_wp) %>%
utils::head(5) %>%
knitr::kable(digits = 3)
desc | play_type | ep | epa | home_wp |
---|---|---|---|---|
J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.815 | 0.000 | NA |
(15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | 0.815 | -0.606 | 0.500 |
(14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | 0.209 | -1.146 | 0.485 |
(13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | -0.937 | 3.223 | 0.453 |
(12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.286 | 0.774 | 0.551 |
nflfastR::fast_scraper("2019_10_SEA_SF") %>%
nflfastR::clean_pbp() %>%
dplyr::select(desc, play_type, ep, epa, home_wp, name) %>%
utils::head(6) %>%
knitr::kable(digits = 3)
desc | play_type | ep | epa | home_wp | name |
---|---|---|---|---|---|
GAME | NA | NA | NA | NA | NA |
5-J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 1.474 | 0.000 | 0.546 | NA |
(15:00) 26-T.Coleman left guard to SF 26 for 1 yard (90-J.Clowney). | run | 1.474 | -0.554 | 0.546 | T.Coleman |
(14:19) 26-T.Coleman right tackle to SF 25 for -1 yards (97-P.Ford). | run | 0.920 | -0.814 | 0.528 | T.Coleman |
(13:45) (Shotgun) 10-J.Garoppolo pass short middle to 84-K.Bourne to SF 41 for 16 yards (24-J.Taylor). Caught at SF39. 2-yac | pass | 0.107 | 2.427 | 0.498 | J.Garoppolo |
(12:58) PENALTY on SEA-91-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.534 | 0.600 | 0.573 | NA |
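The game ID passed to fast_scraper() above appears to encode the season, week, away team and home team, in that order. This reading is an assumption based on the single ID shown here and the home_team / away_team filter in the nflscrapR comparison, not a documented guarantee. A minimal base R sketch that splits such an ID (parse_game_id is a hypothetical helper, not part of nflfastR):

```r
# Split a game ID of the assumed form "season_week_away_home" into its parts.
# parse_game_id() is a hypothetical helper written for this illustration.
parse_game_id <- function(game_id) {
  parts <- strsplit(game_id, "_", fixed = TRUE)
  data.frame(
    game_id   = game_id,
    season    = as.integer(vapply(parts, `[`, character(1), 1)),
    week      = as.integer(vapply(parts, `[`, character(1), 2)),
    away_team = vapply(parts, `[`, character(1), 3),
    home_team = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_game_id("2019_10_SEA_SF")
parsed
```

In practice the schedule returned by fast_scraper_schedules() is the safer source for these fields; the sketch only helps when all you have is the ID.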
This is a demonstration of nflfastR’s capabilities. While nflfastR can scrape a batch of games very quickly, please be respectful of GitHub’s servers and use the data repository, which hosts all the scraped and cleaned data, whenever possible. The only reason to ever actually use the scraper is if it’s in the middle of the season and we haven’t updated the repository with recent games (but we will try to keep it updated).
# get list of some games from 2019
games_2019 <- nflfastR::fast_scraper_schedules(2019) %>%
utils::head(10) %>%
dplyr::pull(game_id)
tictoc::tic(glue::glue("{length(games_2019)} games with nflfastR:"))
f <- nflfastR::fast_scraper(games_2019, pp = TRUE)
tictoc::toc()
#> 10 games with nflfastR:: 11.628 sec elapsed
Let’s look at CPOE leaders from the 2009 regular season.
As discussed above, nflfastR has a data repository for old seasons, so there’s no need to actually scrape them. Let’s use that here (the below reads .rds files, but .csv and .parquet are also available).
tictoc::tic("loading all games from 2009")
games_2009 <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2009.rds")) %>% dplyr::filter(season_type == "REG")
tictoc::toc()
#> loading all games from 2009: 2.806 sec elapsed
games_2009 %>%
dplyr::filter(!is.na(cpoe)) %>%
dplyr::group_by(passer_player_name) %>%
dplyr::summarize(cpoe = mean(cpoe), Atts = n()) %>%
dplyr::filter(Atts > 200) %>%
dplyr::arrange(-cpoe) %>%
utils::head(5) %>%
knitr::kable(digits = 1)
passer_player_name | cpoe | Atts |
---|---|---|
D.Brees | 7.5 | 509 |
P.Rivers | 6.6 | 474 |
P.Manning | 6.5 | 569 |
B.Favre | 6.1 | 527 |
B.Roethlisberger | 5.4 | 503 |
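To make the cpoe column less of a black box, here is a small sketch of how a completion percentage over expected can be computed per passer. It assumes cpoe is the completion outcome minus a modeled completion probability, expressed in percentage points; the passer names and cp values are made up, and only base R is used so the block runs without the package or a network connection.

```r
# Toy pass attempts; cp is a made-up modeled completion probability.
passes <- data.frame(
  passer        = c("A.Smith", "A.Smith", "A.Smith", "B.Jones", "B.Jones"),
  complete_pass = c(1, 0, 1, 1, 0),
  cp            = c(0.70, 0.55, 0.60, 0.80, 0.65),
  stringsAsFactors = FALSE
)

# Assumed definition: actual completion minus expected completion,
# scaled to percentage points.
passes$cpoe <- 100 * (passes$complete_pass - passes$cp)

# Average CPOE and number of attempts per passer.
cpoe_by_passer <- tapply(passes$cpoe, passes$passer, mean)
atts_by_passer <- table(passes$passer)

cpoe_by_passer
atts_by_passer
```

A positive average means the passer completed more passes than the probabilities suggested, which is why an attempts threshold like the one above matters: with few attempts the average is mostly noise.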
When working with nflfastR, drive results are automatically included. We use fixed_drive and fixed_drive_result since the NFL-provided information is a bit wonky. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000.
games_2000 <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds"))
games_2015 <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds"))
pbp <- dplyr::bind_rows(games_2000, games_2015)
pbp %>%
dplyr::filter(season_type == "REG" & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
dplyr::mutate(drive_score = dplyr::if_else(fixed_drive_result %in% c("Touchdown", "Field Goal"), 1, 0)) %>%
dplyr::group_by(season) %>%
dplyr::summarize(drive_score = mean(drive_score)) %>%
knitr::kable(digits = 3)
season | drive_score |
---|---|
2000 | 0.156 |
2015 | 0.179 |
So teams starting with 1st & 10 at their own 20 were more likely to see the drive end in a score in 2015 than in 2000: the table above gives 0.179 for 2015 and 0.156 for 2000. This has implications for Expected Points models (see vignette("nflfastR-models")).
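When a rate like this is built by matching on text labels, it is worth looking at the labels that actually occur first, because %in% is case-sensitive and a label that never matches silently counts as "no score". The sketch below uses toy drives with hypothetical result labels (they are not taken from the package) to show the check and the grouped mean in base R.

```r
# Toy drives; the result labels are hypothetical, for illustration only.
drives <- data.frame(
  season = c(2000, 2000, 2000, 2000, 2015, 2015, 2015, 2015),
  result = c("Touchdown", "Punt", "Field goal", "Punt",
             "Touchdown", "Field goal", "Field goal", "Punt"),
  stringsAsFactors = FALSE
)

# Inspect the labels that actually occur before matching on them.
table(drives$result)

# %in% is case-sensitive: "Field Goal" would not match "Field goal".
scoring <- c("Touchdown", "Field goal")
drives$drive_score <- as.integer(drives$result %in% scoring)

# Share of drives ending in a score, per season.
score_rate <- tapply(drives$drive_score, drives$season, mean)
score_rate
```

On real data, running table() or unique() on fixed_drive_result before building the indicator is a cheap way to confirm that every intended category is being counted.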
Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The ESPN logo URLs are included in the teams_colors_logos data frame, which ships with the package (see ?teams_colors_logos).
Let’s also use the included helper function clean_pbp(), which creates “rush” and “pass” columns that (a) properly count sacks and scrambles as pass plays and (b) properly include plays with penalties. Using this, we can keep only rush or pass plays.
library(ggimage)
pbp <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds")) %>%
dplyr::filter(season_type == "REG") %>%
dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>%
dplyr::group_by(posteam) %>%
dplyr::summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>%
dplyr::group_by(defteam) %>%
dplyr::summarise(def_epa = mean(epa, na.rm = TRUE))
logos <- teams_colors_logos %>% dplyr::select(team_abbr, team_logo_espn)
offense %>%
dplyr::inner_join(defense, by = c("posteam" = "defteam")) %>%
dplyr::inner_join(logos, by = c("posteam" = "team_abbr")) %>%
ggplot2::ggplot(ggplot2::aes(x = off_epa, y = def_epa)) +
ggplot2::geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
# dashed red lines mark the averages: defense on the y axis, offense on the x axis
ggplot2::geom_hline(ggplot2::aes(yintercept = mean(def_epa)), color = "red", linetype = "dashed") +
ggplot2::geom_vline(ggplot2::aes(xintercept = mean(off_epa)), color = "red", linetype = "dashed") +
ggimage::geom_image(ggplot2::aes(image = team_logo_espn), size = 0.05, asp = 16 / 9) +
ggplot2::labs(
x = "Offense EPA/play",
y = "Defense EPA/play",
caption = "Data: @nflfastR",
title = "2005 NFL Offensive and Defensive EPA per Play"
) +
ggplot2::theme_bw() +
ggplot2::theme(
aspect.ratio = 9 / 16,
plot.title = ggplot2::element_text(size = 12, hjust = 0.5, face = "bold")
) +
ggplot2::scale_y_reverse()
We have provided a calculator for working with the Expected Points model. Here is an example of how to use it, looking at how the Expected Points on a drive beginning after a touchback has changed over time.
While I have put in 'SEA' for home_team and posteam, this only matters for figuring out whether the team with the ball is the home team (there’s no actual effect for a given team; it would be the same no matter what team is supplied).
data <- tibble::tibble(
"season" = 1999:2019,
"home_team" = "SEA",
"posteam" = "SEA",
"roof" = "outdoors",
"half_seconds_remaining" = 1800,
"yardline_100" = c(rep(80, 17), rep(75, 4)),
"down" = 1,
"ydstogo" = 10,
"posteam_timeouts_remaining" = 3,
"defteam_timeouts_remaining" = 3
)
nflfastR::calculate_expected_points(data) %>%
dplyr::select(season, yardline_100, td_prob, ep) %>%
knitr::kable(digits = 2)
season | yardline_100 | td_prob | ep |
---|---|---|---|
1999 | 80 | 0.33 | 0.64 |
2000 | 80 | 0.33 | 0.64 |
2001 | 80 | 0.33 | 0.64 |
2002 | 80 | 0.34 | 0.82 |
2003 | 80 | 0.34 | 0.82 |
2004 | 80 | 0.34 | 0.82 |
2005 | 80 | 0.34 | 0.82 |
2006 | 80 | 0.34 | 0.81 |
2007 | 80 | 0.34 | 0.81 |
2008 | 80 | 0.34 | 0.81 |
2009 | 80 | 0.34 | 0.81 |
2010 | 80 | 0.34 | 0.81 |
2011 | 80 | 0.34 | 0.81 |
2012 | 80 | 0.34 | 0.81 |
2013 | 80 | 0.34 | 0.81 |
2014 | 80 | 0.35 | 0.98 |
2015 | 80 | 0.35 | 0.98 |
2016 | 75 | 0.38 | 1.46 |
2017 | 75 | 0.38 | 1.46 |
2018 | 75 | 0.41 | 1.47 |
2019 | 75 | 0.41 | 1.47 |
Not surprisingly, offenses have become much more successful over time, with the kickoff touchback moving from the 20 to the 25 in 2016 providing an additional boost. Note that the td_prob in this example is the probability that the next score within the same half will be a touchdown scored by the team with the ball, not the probability that the current drive will end in a touchdown (this is why the numbers are different from Example 4 above).
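The link between the next-score probabilities and ep can be sketched as a probability-weighted sum of point values. The block below is an illustration under stated assumptions: the probabilities are invented, the point values (7, 3 and 2, negative when the opponent scores) are assumed rather than read from the model code, and the names opp_safety_prob and no_score_prob are assumed by analogy with the td_prob, fg_prob and safety_prob columns seen in the build output earlier.

```r
# Hypothetical next-score probabilities for one game state (they sum to 1).
probs <- c(
  td_prob = 0.35, fg_prob = 0.20, safety_prob = 0.01,
  opp_td_prob = 0.22, opp_fg_prob = 0.12, opp_safety_prob = 0.01,
  no_score_prob = 0.09
)

# Assumed point value of each outcome (negative when the opponent scores).
points <- c(
  td_prob = 7, fg_prob = 3, safety_prob = 2,
  opp_td_prob = -7, opp_fg_prob = -3, opp_safety_prob = -2,
  no_score_prob = 0
)

# The outcomes are mutually exclusive and exhaustive.
stopifnot(abs(sum(probs) - 1) < 1e-8)

# Expected points: each outcome's value weighted by its probability.
ep <- sum(probs * points[names(probs)])
ep
```

This also shows why td_prob alone understates what moves ep: a shift of probability from an opponent touchdown to an own field goal changes ep without touching td_prob.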
We could compare the most recent four years to the expectation for playing in a dome by inputting all the same things and changing the roof input:
data <- tibble::tibble(
"season" = 2016:2019,
"week" = 5,
"home_team" = "SEA",
"posteam" = "SEA",
"roof" = "dome",
"half_seconds_remaining" = 1800,
"yardline_100" = c(rep(75, 4)),
"down" = 1,
"ydstogo" = 10,
"posteam_timeouts_remaining" = 3,
"defteam_timeouts_remaining" = 3
)
nflfastR::calculate_expected_points(data) %>%
dplyr::select(season, yardline_100, td_prob, ep) %>%
knitr::kable(digits = 2)
season | yardline_100 | td_prob | ep |
---|---|---|---|
2016 | 75 | 0.41 | 1.81 |
2017 | 75 | 0.41 | 1.81 |
2018 | 75 | 0.44 | 1.84 |
2019 | 75 | 0.44 | 1.84 |
So for 2016 through 2019, 1st & 10 from a home team’s own 25 yard line had higher EP in domes than outdoors, which is to be expected.
We have also provided a calculator for working with the win probability models. Here is an example of how to use it, looking at how the win probability to begin the game depends on the pre-game spread.
While I have put in 'SEA' for home_team and posteam, this only matters for figuring out whether the team with the ball is the home team (there’s no actual effect for a given team; it would be the same no matter what team is supplied).
data <- tibble::tibble(
"receive_2h_ko" = 0,
"home_team" = "SEA",
"posteam" = "SEA",
"score_differential" = 0,
"half_seconds_remaining" = 1800,
"game_seconds_remaining" = 3600,
"spread_line" = c(1, 3, 4, 7, 14),
"down" = 1,
"ydstogo" = 10,
"yardline_100" = 75,
"posteam_timeouts_remaining" = 3,
"defteam_timeouts_remaining" = 3
)
nflfastR::calculate_win_probability(data) %>%
dplyr::select(spread_line, wp, vegas_wp) %>%
knitr::kable(digits = 2)
spread_line | wp | vegas_wp |
---|---|---|
1 | 0.55 | 0.51 |
3 | 0.55 | 0.60 |
4 | 0.55 | 0.64 |
7 | 0.55 | 0.74 |
14 | 0.55 | 0.87 |
Not surprisingly, vegas_wp increases with the number of points by which a team was favored coming into the game, while wp, which does not use the spread, stays at 0.55.
If you’re comfortable using dplyr functions to manipulate and tidy data, you’re ready to use a database. Why should you use a database? nflfastR makes it extremely easy to build a database and keep it updated.
To start, we need to install the two packages required for this that aren’t installed automatically when nflfastR installs: DBI and RSQLite (advanced users can use other types of databases, but this example will use SQLite):
install.packages("DBI")
install.packages("RSQLite")
As always, you only need to install these once. They don’t need to be loaded to build the database because nflfastR knows how to use them, but we do need them later on when working with the database.
There’s exactly one function in nflfastR that works with databases: update_db(). Some notes:
- If you call update_db() with no arguments, it will build a SQLite database called pbp_db in your current working directory, with play-by-play data in a table called nflfastR_pbp.
- You can specify a different directory with dbdir.
- You can specify a different file name with dbname.
- You can specify a different table name with tblname.
- To rebuild the database from scratch, supply force_rebuild = TRUE. This is primarily intended for the case when we update the play-by-play data in the data repo due to fixing a bug and you want to force the database to be wiped and updated.
- To rebuild only some seasons, supply them as a numeric vector to force_rebuild (e.g. force_rebuild = c(2019, 2020)).
- The argument db_connection is intended for advanced users who want to use other DBI drivers, such as MariaDB, Postgres or odbc. Please note that dbdir and dbname are dropped when a db_connection is provided but the argument tblname will still be used to write the data table into the database.

Let’s say I just want to dump a database into the current working directory. Here we go!
nflfastR::update_db()
#> ── Update nflfastR Play-by-Play Database ──────── nflfastR version 3.2.0.9006 ──
#> ℹ Can't find the data table 'nflfastR_pbp' in your database. Will load the play by play data from scratch.
#> ● Starting download of 22 seasons between 1999 and 2020...
#> ● Checking for missing completed games...
#> ℹ You have 5846 games and are missing 0.
#> ✔ Database update completed
#> ℹ Path to your db: './pbp_db'
#> ── DONE ────────────────────────────────────────────────────────────────────────
This created a database in the current directory called pbp_db.
Wait, that’s it? That’s it! What if it’s partway through the season and you want to make sure all the new games are added to the database? What do you run? update_db()! (Just make sure you’re in the directory the database is saved in or you supply the right file path.)
nflfastR::update_db()
#> ── Update nflfastR Play-by-Play Database ──────── nflfastR version 3.2.0.9006 ──
#> ● Checking for missing completed games...
#> ℹ You have 5846 games and are missing 0.
#> ✔ Database update completed
#> ℹ Path to your db: '/Users/runner/work/nflfastR/nflfastR/vignettes/pbp_db'
#> ── DONE ────────────────────────────────────────────────────────────────────────
If it’s partway through a season and you want to re-build a season to allow for data corrections from the NFL to propagate into your database, you can specify one season to be rebuilt:
nflfastR::update_db(force_rebuild = 2020)
#> ── Update nflfastR Play-by-Play Database ──────── nflfastR version 3.2.0.9006 ──
#> ● Purging 2020 season(s) from the data table 'nflfastR_pbp' in your connected database...
#> ● Starting download of the 2020 season(s)...
#> ● Checking for missing completed games...
#> ℹ You have 5846 games and are missing 0.
#> ✔ Database update completed
#> ℹ Path to your db: '/Users/runner/work/nflfastR/nflfastR/vignettes/pbp_db'
#> ── DONE ────────────────────────────────────────────────────────────────────────
Now we can make a connection to the database. This is the only part that will look a little bit foreign, but all you need to know is where your database is located. If it’s in your current working directory, this will work:
connection <- DBI::dbConnect(RSQLite::SQLite(), "./pbp_db")
connection
#> <SQLiteConnection>
#> Path: /Users/runner/work/nflfastR/nflfastR/vignettes/pbp_db
#> Extensions: TRUE
It looks like nothing happened, but we now have a connection to the database. Now we’re ready to do stuff. If you aren’t familiar with databases, they’re organized around tables. Here’s how to see which tables are present in our database:
DBI::dbListTables(connection)
#> [1] "nflfastR_pbp"
Since we went with the defaults, there’s a table called nflfastR_pbp. Another useful function lists the fields (i.e., columns) in a table:
DBI::dbListFields(connection, "nflfastR_pbp") %>%
utils::head(10)
#> [1] "play_id" "game_id" "old_game_id" "home_team" "away_team"
#> [6] "season_type" "week" "posteam" "posteam_type" "defteam"
This is the same list as the list of columns in nflfastR play-by-play. Notice we had to supply the name of the table above ("nflfastR_pbp").
With all that out of the way, there are only a couple more things to learn. The main driver here is tbl(), which creates a reference to a specific table in a database:
pbp_db <- dplyr::tbl(connection, "nflfastR_pbp")
And now, everything will magically just “work”: you can forget you’re even working with a database!
pbp_db %>%
dplyr::group_by(season) %>%
dplyr::summarize(n = dplyr::n())
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.34.0
#> # [/Users/runner/work/nflfastR/nflfastR/vignettes/pbp_db]
#> season n
#> <int> <int>
#> 1 1999 46136
#> 2 2000 45492
#> 3 2001 45435
#> 4 2002 47818
#> 5 2003 47335
#> 6 2004 47203
#> 7 2005 47344
#> 8 2006 46867
#> 9 2007 46789
#> 10 2008 46445
#> # … with more rows
pbp_db %>%
dplyr::filter(rush == 1 | pass == 1, down <= 2, !is.na(epa), !is.na(posteam)) %>%
dplyr::group_by(pass) %>%
dplyr::summarize(mean_epa = mean(epa))
#> Warning: Missing values are always removed in SQL.
#> Use `mean(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.34.0
#> # [/Users/runner/work/nflfastR/nflfastR/vignettes/pbp_db]
#> pass mean_epa
#> <dbl> <dbl>
#> 1 0 -0.0989
#> 2 1 0.0745
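Behind the scenes, a pipeline like the one above is translated into SQL and executed inside the database, which is also why missing values are dropped: SQL aggregates skip NULLs. The sketch below runs a hand-written query of the same shape against a small in-memory SQLite table. The table name toy_pbp and its values are made up, and the SQL is an approximation of what gets generated, not the exact query. The block is skipped if DBI or RSQLite is not installed.

```r
# Toy play-by-play rows; one epa value is missing on purpose.
toy <- data.frame(
  pass = c(0, 0, 1, 1, 1),
  epa  = c(-0.2, 0.1, 0.5, NA, -0.1)
)

res <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # An in-memory database disappears when the connection is closed.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "toy_pbp", toy)

  # Grouped mean computed by the database; AVG() ignores NULL values.
  res <- DBI::dbGetQuery(
    con,
    "SELECT pass, AVG(epa) AS mean_epa
     FROM toy_pbp
     WHERE epa IS NOT NULL
     GROUP BY pass
     ORDER BY pass"
  )
  DBI::dbDisconnect(con)
}
res
```

Only the two summary rows travel back to R, which is the whole point of keeping the full table in the database.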
So far, everything has stayed in the database. If you want to bring a query into memory, just use collect() at the end:
russ <- pbp_db %>%
dplyr::filter(name == "R.Wilson" & posteam == "SEA") %>%
dplyr::select(desc, epa) %>%
dplyr::collect()
russ
#> # A tibble: 6,434 x 2
#> desc epa
#> <chr> <dbl>
#> 1 (14:12) 3-R.Wilson pass short right to 18-S.Rice to SEA 34 for 9 yar… 1.13
#> 2 (12:53) 3-R.Wilson pass incomplete deep left to 18-S.Rice. PENALTY o… 2.68
#> 3 (11:25) (Shotgun) 3-R.Wilson pass incomplete short right to 18-S.Ric… -1.31
#> 4 (10:24) (Shotgun) 3-R.Wilson pass short left to 18-S.Rice to ARI 31 … 0.928
#> 5 (9:47) 3-R.Wilson scrambles right end ran ob at ARI 27 for 4 yards (… -0.0194
#> 6 (8:35) 3-R.Wilson pass incomplete short right to 18-S.Rice. -0.426
#> 7 (7:54) (Shotgun) 3-R.Wilson left end pushed ob at ARI 9 for 4 yards … -1.17
#> 8 (:27) 3-R.Wilson sacked at SEA 17 for -5 yards (51-P.Lenon). Penalty… -1.13
#> 9 (14:28) (Shotgun) 3-R.Wilson pass short right to 17-B.Edwards to SEA… 1.94
#> 10 (13:59) 3-R.Wilson pass incomplete deep left to 87-B.Obomanu. -0.453
#> # … with 6,424 more rows
So we’ve searched through about 1 million rows of data across 300+ columns and only brought about 6,400 rows and two columns into memory. Pretty neat! This is how I supply the data to the shiny apps on rbsdm.com without running out of memory on the server. Now there’s only one more thing to remember. When you’re finished doing what you need with the database:
DBI::dbDisconnect(connection)
For more details on using a database with nflfastR, see Thomas Mock’s life-changing post here.
The variables in xyac are as follows:
- xyac_epa: The expected value of EPA gained after the catch, starting from where the catch was made.
- xyac_success: The probability the play earns positive EPA (relative to where the play started) based on where the ball was caught.
- xyac_fd: The probability the play earns a first down based on where the ball was caught.
- xyac_mean_yardage and xyac_median_yardage: Average and median expected yards after the catch based on where the ball was caught.

Some other notes:
- epa = air_epa + yac_epa, where air_epa is the EPA associated with a catch at the target location. If a receiver loses a fumble, it is removed from his yac_epa.
- The expected EPA of a completion is therefore air_epa + xyac_epa.
- To measure EPA over expected, compare yac_epa to xyac_epa, as in the example below.
- To measure first downs over expected, compare first_down to xyac_fd.
Let’s create measures for EPA and first downs over expected in 2015:
games_2015 %>%
dplyr::group_by(receiver, receiver_id, posteam) %>%
dplyr::mutate(tgt = sum(complete_pass + incomplete_pass)) %>%
dplyr::filter(tgt >= 50) %>%
dplyr::filter(complete_pass == 1, air_yards < yardline_100, !is.na(xyac_epa)) %>%
dplyr::summarize(
epa_oe = mean(yac_epa - xyac_epa),
actual_fd = mean(first_down),
expected_fd = mean(xyac_fd),
fd_oe = mean(first_down - xyac_fd),
rec = dplyr::n()
) %>%
dplyr::ungroup() %>%
dplyr::select(receiver, posteam, actual_fd, expected_fd, fd_oe, epa_oe, rec) %>%
dplyr::arrange(-epa_oe) %>%
utils::head(10) %>%
knitr::kable(digits = 3)
receiver | posteam | actual_fd | expected_fd | fd_oe | epa_oe | rec |
---|---|---|---|---|---|---|
D.Johnson | ARI | 0.500 | 0.391 | 0.109 | 0.334 | 50 |
R.Gronkowski | NE | 0.688 | 0.615 | 0.073 | 0.265 | 80 |
J.White | NE | 0.489 | 0.434 | 0.055 | 0.264 | 47 |
T.Ginn | CAR | 0.800 | 0.734 | 0.066 | 0.249 | 45 |
D.Lewis | NE | 0.472 | 0.309 | 0.163 | 0.238 | 36 |
L.Green | LAC | 0.629 | 0.526 | 0.103 | 0.216 | 35 |
O.Beckham Jr. | NYG | 0.692 | 0.706 | -0.014 | 0.207 | 91 |
G.Bernard | CIN | 0.373 | 0.289 | 0.083 | 0.204 | 51 |
T.Riddick | DET | 0.400 | 0.304 | 0.096 | 0.203 | 80 |
D.Woodhead | LAC | 0.468 | 0.354 | 0.114 | 0.172 | 77 |
The presence of so many running backs on this list suggests that even though it takes into account target depth and pass direction, the model doesn’t do a great job capturing space. Alternatively, running backs might be better at generating yards after the catch since running with the football is their primary role.
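The "over expected" measures in the table are simply per-receiver averages of actual minus expected values. A minimal base R sketch on invented catches (receiver names and all numbers are made up) shows the arithmetic without needing the 2015 data:

```r
# Toy completed passes; every value here is made up for illustration.
catches <- data.frame(
  receiver   = c("X.One", "X.One", "Y.Two", "Y.Two"),
  yac_epa    = c(0.8, 0.2, 0.1, 0.3),
  xyac_epa   = c(0.5, 0.3, 0.4, 0.2),
  first_down = c(1, 0, 0, 1),
  xyac_fd    = c(0.6, 0.2, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# Per-play difference between what happened and what the model expected.
catches$epa_oe <- catches$yac_epa - catches$xyac_epa
catches$fd_oe  <- catches$first_down - catches$xyac_fd

# Average the differences per receiver.
over_expected <- aggregate(
  catches[c("epa_oe", "fd_oe")],
  by = list(receiver = catches$receiver),
  FUN = mean
)
over_expected
```

Note that the two toy receivers have the same first downs over expected but opposite EPA over expected, so the two measures can disagree about who is adding value after the catch.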
At long last, there’s a way to merge the new play-by-play data with roster information. The easy part is getting the rosters:
roster <- nflfastR::fast_scraper_roster(2019)
Now let’s load play-by-play data from 2019:
games_2019 <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds"))
Here is what the new player IDs look like:
games_2019 %>%
dplyr::filter(rush == 1 | pass == 1, posteam == "SEA") %>%
dplyr::select(desc, name, id)
#> # A tibble: 1,204 x 3
#> desc name id
#> <chr> <chr> <chr>
#> 1 (11:51) (Shotgun) 32-C.Carson left tackle to … C.Cars… 32013030-2d30-3033-33…
#> 2 (11:24) 3-R.Wilson pass incomplete deep left … R.Wils… 32013030-2d30-3032-39…
#> 3 (11:19) (Shotgun) 3-R.Wilson pass short left … R.Wils… 32013030-2d30-3032-39…
#> 4 (2:48) (Shotgun) 74-G.Fant reported in as eli… C.Cars… 32013030-2d30-3033-33…
#> 5 (2:16) 74-G.Fant reported in as eligible. 3-… R.Wils… 32013030-2d30-3032-39…
#> 6 (1:34) (Shotgun) 32-C.Carson left tackle to S… C.Cars… 32013030-2d30-3033-33…
#> 7 (:40) (Shotgun) 3-R.Wilson pass short left to… R.Wils… 32013030-2d30-3032-39…
#> 8 (:10) (Shotgun) 32-C.Carson left guard to CIN… C.Cars… 32013030-2d30-3033-33…
#> 9 (15:00) 3-R.Wilson sacked at CIN 41 for -9 ya… R.Wils… 32013030-2d30-3032-39…
#> 10 (14:15) (Shotgun) 3-R.Wilson pass short middl… R.Wils… 32013030-2d30-3032-39…
#> # … with 1,194 more rows
But these IDs aren’t very useful. So we need to decode them using the new function decode_player_ids():
games_2019 %>%
dplyr::filter(rush == 1 | pass == 1, posteam == "SEA") %>%
nflfastR::decode_player_ids() %>%
dplyr::select(desc, name, id)
#> ● Start decoding player ids...
#> ✔ Decoding completed.
#> # A tibble: 1,204 x 3
#> desc name id
#> <chr> <chr> <chr>
#> 1 (11:51) (Shotgun) 32-C.Carson left tackle to SEA 21 for 1 … C.Carson 00-0033…
#> 2 (11:24) 3-R.Wilson pass incomplete deep left [97-G.Atkins]… R.Wilson 00-0029…
#> 3 (11:19) (Shotgun) 3-R.Wilson pass short left to 14-DK.Metc… R.Wilson 00-0029…
#> 4 (2:48) (Shotgun) 74-G.Fant reported in as eligible. 32-C.… C.Carson 00-0033…
#> 5 (2:16) 74-G.Fant reported in as eligible. 3-R.Wilson sack… R.Wilson 00-0029…
#> 6 (1:34) (Shotgun) 32-C.Carson left tackle to SEA 23 for 5 y… C.Carson 00-0033…
#> 7 (:40) (Shotgun) 3-R.Wilson pass short left to 32-C.Carson … R.Wilson 00-0029…
#> 8 (:10) (Shotgun) 32-C.Carson left guard to CIN 32 for 3 yar… C.Carson 00-0033…
#> 9 (15:00) 3-R.Wilson sacked at CIN 41 for -9 yards (94-S.Hub… R.Wilson 00-0029…
#> 10 (14:15) (Shotgun) 3-R.Wilson pass short middle to 32-C.Car… R.Wilson 00-0029…
#> # … with 1,194 more rows
So now we have the familiar GSIS IDs. Let’s apply this to the whole dataframe:
decoded_pbp <- games_2019 %>%
nflfastR::decode_player_ids()
#> ● Start decoding player ids...
#> ✔ Decoding completed.
Now we’re ready to join to the roster data using these IDs:
joined <- decoded_pbp %>%
dplyr::filter(!is.na(receiver_id)) %>%
dplyr::select(posteam, season, desc, receiver, receiver_id, epa) %>%
dplyr::left_join(roster, by = c("receiver_id" = "gsis_id"))
# the real work is done, this just makes a table and has it look nice
joined %>%
dplyr::filter(position %in% c("WR", "TE", "RB")) %>%
dplyr::group_by(receiver_id, receiver, position) %>%
dplyr::summarize(tot_epa = sum(epa), n = n()) %>%
dplyr::arrange(-tot_epa) %>%
dplyr::ungroup() %>%
dplyr::group_by(position) %>%
dplyr::mutate(position_rank = 1:n()) %>%
dplyr::filter(position_rank <= 5) %>%
dplyr::rename(Pos_Rank = position_rank, Player = receiver, Pos = position, Tgt = n, EPA = tot_epa) %>%
dplyr::select(Player, Pos, Pos_Rank, Tgt, EPA) %>%
knitr::kable(digits = 0)
#> `summarise()` has grouped output by 'receiver_id', 'receiver'. You can override using the `.groups` argument.
Player | Pos | Pos_Rank | Tgt | EPA |
---|---|---|---|---|
T.Kelce | TE | 1 | 179 | 100 |
C.Godwin | WR | 1 | 123 | 87 |
D.Adams | WR | 2 | 161 | 77 |
T.Lockett | WR | 3 | 139 | 76 |
J.Jones | WR | 4 | 164 | 72 |
C.Kupp | WR | 5 | 145 | 71 |
G.Kittle | TE | 2 | 129 | 56 |
C.McCaffrey | RB | 1 | 147 | 52 |
D.Waller | TE | 3 | 123 | 45 |
A.Ekeler | RB | 2 | 113 | 43 |
J.Cook | TE | 4 | 75 | 43 |
Z.Ertz | TE | 5 | 147 | 42 |
J.White | RB | 3 | 105 | 27 |
D.Cook | RB | 4 | 77 | 26 |
M.Ingram | RB | 5 | 33 | 22 |
Not surprisingly, all 5 of the top 5 WRs in terms of EPA added come in ahead of the top RB. Note that the number of targets won’t match official stats because we’re including plays with penalties.
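One thing to watch with a left join like the one above is that targets whose ID has no roster match keep an NA position and then drop out of any position filter. The base R sketch below reproduces the join-then-rank pattern on toy data (the IDs and positions are invented) and counts the unmatched rows before ranking.

```r
# Toy targets and roster; IDs and positions are made up for illustration.
targets <- data.frame(
  receiver_id = c("id-1", "id-1", "id-2", "id-3", "id-3", "id-4"),
  epa         = c(1.2, -0.4, 0.9, 0.3, 0.5, 2.0),
  stringsAsFactors = FALSE
)
roster <- data.frame(
  gsis_id  = c("id-1", "id-2", "id-3"),
  position = c("WR", "WR", "TE"),
  stringsAsFactors = FALSE
)

# Left join: keep every target, even when the roster has no matching ID.
joined <- merge(targets, roster,
                by.x = "receiver_id", by.y = "gsis_id", all.x = TRUE)

# Rows without a roster match have an NA position.
unmatched <- sum(is.na(joined$position))

# Total EPA per player (rows with NA position are dropped here),
# then rank within position with the highest total first.
totals <- aggregate(epa ~ receiver_id + position, data = joined, FUN = sum)
totals$position_rank <- ave(-totals$epa, totals$position, FUN = rank)
totals <- totals[order(totals$position, totals$position_rank), ]
totals
```

Here the player with the largest EPA ("id-4") never appears in the ranking because the roster lookup failed, so checking the unmatched count is a useful sanity step before trusting a leaderboard.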