
Continuing with the streak of posts about college football, here I finally get into developing some models to predict the winning team and margin of victory of college football games. For this analysis, I was inspired by this post that mentioned a leaderboard for people who picked winners from the 2021-2022 season. It reminds me of Kaggle competitions, which I entered once for fun and wrote about.

I also found this published paper both extremely interesting and helpful because it lays out in detail the authors' analytic methodology for building their predictive model. The first part of this post is my attempt at replicating their technique using a different data source (CollegeFootballData.com's API) and data from the 2017-2020 seasons to predict games of the 2021 season. Spoiler: my initial tree-based models don't perform nearly as well as theirs, and I eventually try a neural network with much better success.

Get the Data

First, I'll load data I previously obtained via the API I mentioned above. If you want to see how to make those API calls, I explain it in this post. The /games/teams endpoint returns a bunch of game result data similar to the data used in the article above.
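
(As an aside, here's a minimal sketch of what one of those API calls might look like using httr and jsonlite. The base URL, query parameters, and Bearer-token authentication are my assumptions about the CollegeFootballData API, not code from that other post.)

library(httr)     # packages assumed for this sketch
library(jsonlite)

# assumed endpoint and parameters; the API key is read from an environment variable
resp <- GET("https://api.collegefootballdata.com/games/teams",
            query = list(year = 2017, week = 1),
            add_headers(Authorization = paste("Bearer", Sys.getenv("CFBD_API_KEY"))))
gameStatsRaw <- fromJSON(content(resp, as = "text"), flatten = TRUE)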

gs <- readRDS("gameStats.RData")
gs <- type.convert(gs, as.is = TRUE)
head(gs)
##          id           school homeAway points week year rushingTDs
## 1 400935247    Arizona State     home     37    1 2017          2
## 2 400935247 New Mexico State     away     31    1 2017          2
## 3 400944827 Sacramento State     away      6    1 2017          0
## 4 400944827            Idaho     home     28    1 2017          2
## 5 400934573           Temple     away     16    1 2017          0
## 6 400934573       Notre Dame     home     49    1 2017          5
##   puntReturnYards puntReturnTDs puntReturns passingTDs kickReturnYards
## 1               6             0           2          2              51
## 2               0             0           1          3              27
## 3              12             0           3          0              77
## 4              23             0           3          2              NA
## 5               2             0           2          2             142
## 6               0             0           1          2              48
##   kickReturnTDs kickReturns kickingPoints interceptionYards interceptionTDs
## 1             0           1             7                49               1
## 2             0           1             1                NA              NA
## 3             0           3             6                 0               0
## 4            NA          NA             4                NA              NA
## 5             0           6             4                43               0
## 6             0           2             7                NA              NA
##   passesIntercepted fumblesRecovered totalFumbles tacklesForLoss defensiveTDs
## 1                 2                0            1              8            1
## 2                NA                0            1             14            0
## 3                 2                2            1              8            0
## 4                NA                0            2              6            0
## 5                 1                0            1              4            0
## 6                NA                1            0             11            0
##   tackles sacks qbHurries passesDeflected firstDowns thirdDownEff fourthDownEff
## 1      56     6         0               3         19         5-15           1-1
## 2      64     7         0               0         27        11-18           0-1
## 3      45     4         0               1          8         1-15           0-1
## 4      37     4         0               2         15         2-11           2-3
## 5      37     2         6               3         18         5-17           0-2
## 6      44     3         4               3         26         6-13           1-2
##   totalYards netPassingYards completionAttempts yardsPerPass rushingYards
## 1        400             321              24-30         10.7           79
## 2        549             398              40-58          6.9          151
## 3        190              95              10-25          3.8           95
## 4        360             163              14-19          8.6          197
## 5        330             245              19-35          7.0           85
## 6        606             184              17-30          6.1          422
##   rushingAttempts yardsPerRushAttempt totalPenaltiesYards turnovers fumblesLost
## 1              40                 2.0                1-10         0           0
## 2              30                 5.0                6-59         2           0
## 3              35                 2.7                 1-2         0           0
## 4              44                 4.5                3-20         3           1
## 5              37                 2.3                6-42         1           1
## 6              44                 9.6                4-20         1           0
##   interceptions possessionTime
## 1             0          29:40
## 2             2          30:20
## 3             0          29:06
## 4             2          30:54
## 5             0          33:49
## 6             1          26:11

I also want to get the list of all of the FBS teams because I want to filter out the games played against non-FBS teams. There's a bunch of data in that table, but I'm really only interested in the school column.

fbs <- readRDS("fbs_teams.RData")
head(fbs)
##   document.id   id            school       mascot abbreviation alt_name1
## 1           1 2005         Air Force      Falcons          AFA      <NA>
## 2           2 2006             Akron         Zips          AKR      <NA>
## 3           3  333           Alabama Crimson Tide          ALA      <NA>
## 4           4 2026 Appalachian State Mountaineers          APP      <NA>
## 5           5   12           Arizona     Wildcats         ARIZ      <NA>
## 6           6    9     Arizona State   Sun Devils          ASU      <NA>
##   alt_name2      alt_name3    conference division   color alt_color
## 1       AFA      Air Force Mountain West Mountain #004a7b   #ffffff
## 2       AKR          Akron  Mid-American     East #00285e   #84754e
## 3       ALA        Alabama           SEC     West #690014   #f1f2f3
## 4       APP Appalachian St      Sun Belt     <NA> #000000   #ffcd00
## 5      ARIZ        Arizona        Pac-12    South #002449   #00205b
## 6       ASU  Arizona State        Pac-12    South #942139   #f1f2f3
##   location.venue_id                     location.name    location.city
## 1              3713                    Falcon Stadium Colorado Springs
## 2              3768 Summa Field at InfoCision Stadium            Akron
## 3              3657              Bryant Denny Stadium       Tuscaloosa
## 4              3792               Kidd Brewer Stadium            Boone
## 5              3619                   Arizona Stadium           Tucson
## 6              3947                 Sun Devil Stadium            Tempe
##   location.state location.zip location.country_code location.timezone
## 1             CO        80840                    US    America/Denver
## 2             OH        44399                    US  America/New_York
## 3             AL        35487                    US   America/Chicago
## 4             NC        28608                    US  America/New_York
## 5             AZ        85721                    US   America/Phoenix
## 6             AZ        85287                    US   America/Phoenix
##   location.latitude location.longitude location.elevation location.capacity
## 1          38.99697         -104.84362        2024.875732             46692
## 2          41.07255          -81.50834        321.2875061             30000
## 3          33.20828          -87.55038        70.05136108            101821
## 4          36.21143          -81.68543        991.3414307             30000
## 5          32.22881         -110.94887        742.1530151             50782
## 6          33.42645         -111.93250        360.1569519             65870
##   location.year_constructed location.grass location.dome
## 1                      1962          FALSE         FALSE
## 2                      2009          FALSE         FALSE
## 3                      1929           TRUE         FALSE
## 4                      1962          FALSE         FALSE
## 5                      1928          FALSE         FALSE
## 6                      1958          FALSE         FALSE

Now I’ll get a vector of all of the game IDs from the first table in which a non-FBS team played. Here are the first three.

library(dplyr) # I'm going to need to do some data wrangling

fcsIDs <- gs %>% filter(!school %in% (fbs %>% .$school)) %>% .$id
fcsIDs[1:3]
## [1] 400944827 400944827 400933835

Now I apply the filter and separate out some of the columns. The totalPenaltiesYards column, for example, is formatted as penalties-yards, such as 3-25. I'll then create a new column that divides the penalty yards by the number of penalties so that I get a single yardsPerPenalty value. I'll also convert the time of possession into seconds and get rid of columns I no longer need.

library(tidyr) # needed for the separate function below

gs <-
  gs %>%
  filter(!id %in% fcsIDs) %>%
  separate(totalPenaltiesYards, into = c("penalties", "penaltiesYards"), convert = TRUE) %>%
  separate(completionAttempts, into = c("passCompletions", "passAttempts"), convert = TRUE) %>%
  separate(possessionTime, into = c("posMM", "posSS"), convert = TRUE) %>%
  mutate(posTime = posMM * 60 + posSS,
         yardsPerPenalty = penaltiesYards / penalties,
         passEfficiency = passCompletions / passAttempts,
         yardsPerPlay = (yardsPerPass + yardsPerRushAttempt) / 2) %>%
  select(-totalYards, -posMM, -posSS, -fourthDownEff, -thirdDownEff)

I found that there are a number of NAs in the data. For this demonstration, I'll just convert them all to 0. If I were cleaning this data for an actual competition or to try to beat Vegas, I'd take a much harder look at those NAs and treat them carefully.
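
For what it's worth, a quick way to see where those NAs are concentrated (just a sketch of the kind of inspection I'd start with, not something I pursued here) is to count them by column:

# count the NAs in each column and show the worst offenders
colSums(is.na(gs)) %>% sort(decreasing = TRUE) %>% head(10)

Either way, for now I'll just zero them all out.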

gs[is.na(gs)] <- 0

I also found that there’s a game ID that’s repeated, so I’ll filter out the repeat.

# there's an id that's repeated - there should be two rows (one per team) for each game ID
gs %>% group_by(id) %>% count() %>% filter(n > 2)
## # A tibble: 1 x 2
## # Groups:   id [1]
##          id     n
##       <int> <int>
## 1 401309547     4
# take a look at that ID
gs %>% filter(id == 401309547)
## # A tibble: 4 x 43
##          id school        homeAway points  week  year rushingTDs puntReturnYards
##       <int> <chr>         <chr>     <int> <int> <int>      <int>           <int>
## 1 401309547 Wyoming       away         50     2  2021          4              17
## 2 401309547 Wyoming       away         50     2  2021          0               0
## 3 401309547 Northern Ill~ home         43     2  2021          5               0
## 4 401309547 Northern Ill~ home         43     2  2021          0               0
## # ... with 35 more variables: puntReturnTDs <int>, puntReturns <int>,
## #   passingTDs <int>, kickReturnYards <int>, kickReturnTDs <int>,
## #   kickReturns <int>, kickingPoints <int>, interceptionYards <int>,
## #   interceptionTDs <int>, passesIntercepted <int>, fumblesRecovered <int>,
## #   totalFumbles <int>, tacklesForLoss <dbl>, defensiveTDs <int>,
## #   tackles <int>, sacks <dbl>, qbHurries <int>, passesDeflected <int>,
## #   firstDowns <int>, netPassingYards <int>, passCompletions <int>, ...
# filter out the ID with zeros in the rushingTDs column
gs <- gs %>% filter(!(id == 401309547 & rushingTDs == 0))

In the article, they did some feature engineering that I’ll replicate here.

gs <-
  gs %>%
  mutate(offensePoints = (rushingTDs + passingTDs) * 7 + kickingPoints,
         offenseYards = netPassingYards + rushingYards) %>%
  arrange(id)

They also created some features for one team based on how the other team performed. Doing this in the original gs dataframe made my head hurt, so I split it into separate dataframes for the home and away teams. While I’m at it, I’ll also calculate MOV, the margin of victory.

home <- gs %>% filter(homeAway == "home")
away <- gs %>% filter(homeAway == "away")

home <-
  home %>%
  mutate(
    defensePointsAllowed = away$offensePoints,
    defenseYPPAAllowed = away$yardsPerPass,
    defenseYPRAAllowed = away$yardsPerRushAttempt,
    defensePassYardsAllowed = away$netPassingYards,
    defenseRushYardsAllowed = away$rushingYards,
    defenseYardsPerPlayAllowed = away$yardsPerPlay,
    forcedPenaltyYards = away$penaltiesYards,
    forcedTO = away$turnovers,
    MOV = points - away$points
  )

away <-
  away %>%
  mutate(
    defensePointsAllowed = home$offensePoints,
    defenseYPPAAllowed = home$yardsPerPass,
    defenseYPRAAllowed = home$yardsPerRushAttempt,
    defensePassYardsAllowed = home$netPassingYards,
    defenseRushYardsAllowed = home$rushingYards,
    defenseYardsPerPlayAllowed = home$yardsPerPlay,
    forcedPenaltyYards = home$penaltiesYards,
    forcedTO = home$turnovers,
    MOV = points - home$points
  )

I also wanted to get each team's Elo score, which I found in the /games data via the API. For some reason, I grabbed the pregame Elo score instead of the post-game Elo, but that's fine. I'll account for that later. I'll also filter this data down to the game IDs in the home dataframe. The /games data also contains a boolean neutral_site, which will be useful. There's always a home and away team in every game (as indicated in the homeAway column) even if the game is played on a neutral site. If I want to correctly factor in whether a team has a home field advantage, I'll need to account for those games played at a neutral site.

eloSite <- readRDS("eloSite.RData") %>% as_tibble()

eloSite <- eloSite %>% filter(id %in% (home %>% .$id))

head(eloSite)
## # A tibble: 6 x 7
##          id home_team   home_pregame_elo away_team away_pregame_elo neutral_site
##       <dbl> <chr>                  <dbl> <chr>                <dbl> <lgl>       
## 1 400935282 Colorado S~             1546 Oregon S~             1412 FALSE       
## 2 400938887 UMass                   1230 Hawai'i               1273 FALSE       
## 3 400941786 San José S~             1266 South Fl~             1626 FALSE       
## 4 400935257 Rice                    1239 Stanford              1666 TRUE        
## 5 400938591 UCF                     1446 Florida ~             1272 FALSE       
## 6 400935230 Minnesota               1600 Buffalo               1219 FALSE       
## # ... with 1 more variable: ..JSON <list>

Now I’ll combine that data with the home and away dataframes.

home <-
  home %>%
  left_join(eloSite %>% select(id, home_team, home_pregame_elo, neutral_site),
            by = c("id" = "id", "school" = "home_team")) %>%
  rename("elo" = "home_pregame_elo")

away <-
  away %>%
  left_join(eloSite %>% select(id, away_team, away_pregame_elo, neutral_site),
            by = c("id" = "id", "school" = "away_team")) %>%
  rename("elo" = "away_pregame_elo")

The authors explained that their methodology didn't use raw game results. Instead, they used seasonal cumulative means. Let's take rushingTDs as an example. Say a given team had 1 rushing TD in week 1; the week 1 value is unchanged. If they scored 2 rushing TDs in week 2, then the cumulative mean becomes 1.5, and so on through the weeks. At the beginning of the next season, the calculation resets and starts over.
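
To make that concrete, here's what dplyr's cummean() does to the rushing-TD counts from that hypothetical example:

# hypothetical rushing TDs for one team over three weeks
cummean(c(1, 2, 0))
## [1] 1.0 1.5 1.0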

# re-combine the home and away dataframes into one long dataframe
ha <-
  bind_rows(home, away) %>%
  arrange(school, year, week)

# I want to calculate the cumulative MOV mean, but I also need to preserve the
# actual MOV as the response variable.
mov_actual <- ha %>% .$MOV

# now calculate the cumulative means
cmeans <-
  ha %>%
  group_by(school, year) %>%
  summarize_at(vars(c(3:52)), cummean) %>%
  ungroup()

head(cmeans)
## # A tibble: 6 x 52
##   school  year points  week rushingTDs puntReturnYards puntReturnTDs puntReturns
##   <chr>  <int>  <dbl> <dbl>      <dbl>           <dbl>         <dbl>       <dbl>
## 1 Air F~  2017   13     3         0                0               0        0   
## 2 Air F~  2017   18.5   3.5       1.5             18               0        2   
## 3 Air F~  2017   25     4         1.67            12               0        1.33
## 4 Air F~  2017   30     4.5       2.25            12.5             0        1.5
## 5 Air F~  2017   30.8   5         2.8             14               0        1.4
## 6 Air F~  2017   33.2   5.5       3.33            13.7             0        1.33
## # ... with 44 more variables: passingTDs <dbl>, kickReturnYards <dbl>,
## #   kickReturnTDs <dbl>, kickReturns <dbl>, kickingPoints <dbl>,
## #   interceptionYards <dbl>, interceptionTDs <dbl>, passesIntercepted <dbl>,
## #   fumblesRecovered <dbl>, totalFumbles <dbl>, tacklesForLoss <dbl>,
## #   defensiveTDs <dbl>, tackles <dbl>, sacks <dbl>, qbHurries <dbl>,
## #   passesDeflected <dbl>, firstDowns <dbl>, netPassingYards <dbl>,
## #   passCompletions <dbl>, passAttempts <dbl>, yardsPerPass <dbl>, ...

There are a few columns from the ha dataframe, like the game ID, school names, the week, the year, etc., that I don't want as cumulative means - I want the original columns. So I'll select those columns and combine them with the cumulative means. Then I'll add the actual MOV results for use as the response variable.

ha <- bind_cols(ha[, c(1:3, 5, 6, 55, 56)], cmeans[, c(3, 5:52)])
ha$MOV_actual <- mov_actual

head(ha)
## # A tibble: 6 x 57
##          id school    homeAway  week  year   elo neutral_site points rushingTDs
##       <dbl> <chr>     <chr>    <int> <int> <dbl> <lgl>         <dbl>      <dbl>
## 1 400935352 Air Force away         3  2017  1562 FALSE          13         0   
## 2 400945259 Air Force home         4  2017  1559 FALSE          18.5       1.5
## 3 400945263 Air Force away         5  2017  1560 FALSE          25         1.67
## 4 400941815 Air Force away         6  2017  1508 FALSE          30         2.25
## 5 400945272 Air Force home         7  2017  1509 FALSE          30.8       2.8
## 6 400945278 Air Force away         8  2017  1506 FALSE          33.2       3.33
## # ... with 48 more variables: puntReturnYards <dbl>, puntReturnTDs <dbl>,
## #   puntReturns <dbl>, passingTDs <dbl>, kickReturnYards <dbl>,
## #   kickReturnTDs <dbl>, kickReturns <dbl>, kickingPoints <dbl>,
## #   interceptionYards <dbl>, interceptionTDs <dbl>, passesIntercepted <dbl>,
## #   fumblesRecovered <dbl>, totalFumbles <dbl>, tacklesForLoss <dbl>,
## #   defensiveTDs <dbl>, tackles <dbl>, sacks <dbl>, qbHurries <dbl>,
## #   passesDeflected <dbl>, firstDowns <dbl>, netPassingYards <dbl>, ...

I also want to include the spread as a predictor variable, which I can get via the API from /lines.

spread <- readRDS("spread.RData")
head(spread)
## # A tibble: 6 x 2
##          id home_spread
##       <dbl>       <dbl>
## 1 400944997        9.5
## 2 400944998        9.33
## 3 400934536       24.2
## 4 400945016      -20.3
## 5 400934493      -18.7
## 6 400938591      -17.3

Now I'll deal with the neutral site column since the dataframe is merged back together. I'm going to make an HFA column that's 1 for the home team at their home field, -1 for the away team, and 0 for both teams at a neutral site. Then I'll add the pre-game spread for the home and away teams.

ha <-
  ha %>%
  mutate(HFA = case_when(
    homeAway == "home" & !neutral_site ~ 1,
    homeAway == "away" & !neutral_site ~ -1,
    neutral_site ~ 0)) %>%
  left_join(spread, by = "id") %>%
  mutate(spread = ifelse(homeAway == "home", home_spread, -home_spread))

Something else I want to add is each team's seasonal win/loss record in the form of the percentage of games they've won. I keep adding more and more predictors. We'll find out later if any of it will even be useful.

ha <-
  ha %>%
  mutate(wins = ifelse(MOV_actual > 0, 1, 0)) %>%
  group_by(school, year) %>%
  mutate(games = row_number(),
         cumwins = cumsum(wins),
         winPct = cumwins / games) %>%
  select(-wins, -games, -cumwins)

Ok, this is where things get a little complicated. But first, a sidebar. I wrote this code over the course of a few weekends. Until this particular project, I'd never dealt with data where there are "sides" to think about. It took me a while to wrap my mind around it, and this is code from my first efforts, when I was just trying to find "a way" to get things done. Pure brute force. Later you'll see that I came up with a much better (and faster) way to do what I'm about to demonstrate. All part of the sausage-making, the way I look at it.

The idea here is that I wanted to account for home field advantage. It’s a definite thing with a measurable effect. I’ll demonstrate using the home dataframe from earlier. I’ll remove games played at a neutral site, and get the mean margin of victory.

home %>%
  filter(!neutral_site) %>%
  summarize(meanMOV = mean(MOV))
## # A tibble: 1 x 1
##   meanMOV
##     <dbl>
## 1    3.97

Home teams win by almost 4 points on average. We can also check whether that advantage is statistically significant. I'll use the Wilcoxon signed rank test because I'm certain the data are not normally distributed - they're football scores, after all. I'll do a one-sided test where the null hypothesis is that the location of the MOV distribution is 0.

wilcox.test(home %>% filter(!neutral_site) %>% .$MOV, alternative = "greater")
##
## 	Wilcoxon signed rank test with continuity correction
##
## data:  home %>% filter(!neutral_site) %>% .$MOV
## V = 3355196, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 0

Based on that p-value, we reject the null hypothesis at the 95% confidence level. Right, so home field advantage is a real thing, and I need to account for it.

To do that, I wanted to get away from the idea of home team and away team and instead think of them more generically by randomly picking one team as “the team” and the other as “the opponent”. My ha dataframe is still grouped by school and year, so I’ll ungroup it and call it my end-of-week results data - it’s a mental trick so I think about things differently. Then I’ll get all of the unique game IDs.

eowResults <- ha %>% ungroup()

ids <- sort(unique(eowResults %>% .$id))

Next, I'll loop over the IDs. For each ID, I'll start by picking out the common columns (id, week, and year), and then identify one team as team1 and the other as team2. For the first game, and for column-naming consistency as I build this new dataframe df, team1 will be "the team" and team2 will be "the opponent". For all other games, I pick a random number between 0 and 1. If it's greater than or equal to 0.5, team1 is "the team" and team2 is "the opponent". Otherwise, it's the reverse.

for (i in ids){
  common <- eowResults %>% filter(id == i) %>% slice(1) %>% select(id, week, year)

  team1 <- eowResults %>% filter(id == i) %>% slice(1) %>% select(-id, -week, -year)
  colnames(team1) <- paste(colnames(team1), "team", sep = "_")

  team2 <- eowResults %>% filter(id == i) %>% slice(2) %>% select(-id, -week, -year)
  colnames(team2) <- paste(colnames(team2), "opponent", sep = "_")

  if (i == min(ids)){df <- bind_cols(common, team1, team2)}
  else{
    ifelse(runif(1) >= 0.5,
           newRow <- bind_cols(common, team1, team2),
           newRow <- bind_cols(common, team2, team1))
    df <- df %>% bind_rows(newRow)}

}

Like I said, brute force. Now I have some redundant columns, so I'll drop them before going further.

df <-
  df %>%
  select(-homeAway_team, -homeAway_opponent, -neutral_site_team, -neutral_site_opponent,
         -MOV_actual_opponent, -HFA_opponent, -home_spread_opponent, -spread_opponent,
         -home_spread_team)

The final big step is to add a bunch of new columns to represent the difference in one team’s average performance versus the average performance of the other team. Think of it this way. If a team scores on average 30 points a game, you might think that’s pretty good. However, if the teams they’ve played have on average given up 40 points, then that 30 points doesn’t look as impressive. I want to account for that for a number of statistics, so that’s what I do here.

df <-
  df %>%
  mutate(
    diffPointsScored = offensePoints_team - defensePointsAllowed_opponent,
    diffPointsAllowed = defensePointsAllowed_team - offensePoints_opponent,
    diffYPPAOff = yardsPerPass_team - defenseYPPAAllowed_opponent,
    diffYPPADef = defenseYPPAAllowed_team - yardsPerPass_opponent,
    diffYPRAOff = yardsPerRushAttempt_team - defenseYPRAAllowed_opponent,
    diffYPRADef = defenseYPRAAllowed_team - yardsPerRushAttempt_opponent,
    diffPassYardsOff = netPassingYards_team - defensePassYardsAllowed_opponent,
    diffPassYardsDef = defensePassYardsAllowed_team - netPassingYards_opponent,
    diffRushYardsOff = rushingYards_team - defenseRushYardsAllowed_opponent,
    diffRushYardsDef = defenseRushYardsAllowed_team - rushingYards_opponent,
    diffYPPOff = yardsPerPlay_team - defenseYardsPerPlayAllowed_opponent,
    diffYPPDef = defenseYardsPerPlayAllowed_team - yardsPerPlay_opponent,
    diffELO = elo_team - elo_opponent,
    diffWinPct = winPct_team - winPct_opponent
  ) %>%
  select(-elo_team, -elo_opponent, -winPct_team, -winPct_opponent)

At this point, I noticed that there are a number of NAs in the spread_team column, and they're all from the 2020 season. I guess that might have to do with it being the first COVID season. I'll replace those NAs with 0s - again, maybe not the best idea if I were doing this for real.

df <- df %>% mutate(spread_team = replace_na(spread_team, 0))

One last thing. This is going to seem out of place here, but since I cranked away at this data set a bunch of times as I was working through this, I developed some insights and some practices. First, I'll explicitly set the teams as factors instead of strings. I'll also generate a new column, diffMOV, as the difference in the two teams' mean margins of victory - that turned out to be useful. I'll also get rid of any column with "TD" in its name because I found them to be unhelpful. Same thing with pass completions, pass attempts (I essentially have those already in passEfficiency), and rushing attempts. Finally, I reorder columns to make looking at the dataset a little easier.

lvls <- sort(unique(c(df %>% .$school_team, df %>% .$school_opponent)))

df <- df %>%
  mutate(school_team = factor(school_team, levels = lvls),
         school_opponent = factor(school_opponent, levels = lvls),
         diffMOV  = MOV_team - MOV_opponent) %>%
  select(
    -MOV_team, -MOV_opponent,
    -rushingTDs_team, -rushingTDs_opponent,
    -puntReturnTDs_team, -puntReturnTDs_opponent,
    -passingTDs_team, -passingTDs_opponent,
    -kickReturnTDs_team, -kickReturnTDs_opponent,
    -interceptionTDs_team, -interceptionTDs_opponent,
    -passCompletions_team, -passCompletions_opponent,
    -passAttempts_team, -passAttempts_opponent,
    -rushingAttempts_team, -rushingAttempts_opponent
  ) %>%
  select(id, week, year, school_team, MOV_actual_team, HFA_team, spread_team,
         school_opponent, diffELO, everything())

Data wrangling is finally complete, and the data are now ready to be split into training and test data sets. The training data will be the 2017-2020 seasons (df20), and I’ll train some models on that to predict the 2021 season (df21).

df20 <- df %>% filter(year < 2021) %>% select(-year)
df21 <- df %>% filter(year == 2021)

Pause one more time and think about the data at this point. Recall that the df20 and df21 data are game results. I can’t use game results in the test data set - I won’t have results prior to each game being played! I will however have week 1 results available at the start of week 2, and both of those will be available at the start of week 3, and so on. If I group the test data by team and week, I can just shift all of the results data down one row. That will produce NAs for the first game each team plays, so I’ll drop those. This means I won’t have a prediction for the first game each team plays, either. I’ll need to find another way of doing things if I want to do this for real. Forging ahead anyway for now…
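
Here's a toy illustration (with made-up numbers) of what that shift does within one team's season:

# toy example: lag() makes only prior weeks' results available in each row
tibble(week = 1:3, points = c(30, 14, 42)) %>%
  mutate(pointsAvailable = lag(points))
# week 1 gets an NA (and is dropped), week 2 sees week 1's value, and so on

And here's that same shift applied to the real test data: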

df21 <-
  df21 %>%  
  arrange(school_team, week) %>%
  group_by(school_team) %>%
  mutate(across(9:102, lag)) %>%
  ungroup() %>%
  drop_na() %>%
  select(-year)

head(df21)
## # A tibble: 6 x 102
##        id  week school_team MOV_actual_team HFA_team spread_team school_opponent
##     <dbl> <int> <fct>                 <int>    <dbl>       <dbl> <fct>          
## 1  4.01e8     3 Air Force                -4        1       -8.9  Utah State     
## 2  4.01e8     4 Air Force                24        1       -3.62 Florida Atlant~
## 3  4.01e8     5 Air Force                28       -1      -11.5  New Mexico     
## 4  4.01e8     6 Air Force                10        1       -5.5  Wyoming        
## 5  4.01e8     7 Air Force                 7       -1        3.12 Boise State    
## 6  4.01e8     8 Air Force                -6        1       -2.88 San Diego State
## # ... with 95 more variables: diffELO <dbl>, points_team <dbl>,
## #   puntReturnYards_team <dbl>, puntReturns_team <dbl>,
## #   kickReturnYards_team <dbl>, kickReturns_team <dbl>,
## #   kickingPoints_team <dbl>, interceptionYards_team <dbl>,
## #   passesIntercepted_team <dbl>, fumblesRecovered_team <dbl>,
## #   totalFumbles_team <dbl>, tacklesForLoss_team <dbl>,
## #   defensiveTDs_team <dbl>, tackles_team <dbl>, sacks_team <dbl>, ...

Feature Selection With Lasso Regression

That was a fair amount of work! We now have data sets with 102 columns, and I know that a decent number of those columns are just noise. My go-to method for feature selection is lasso regression, and it happens to be the same technique used by the authors. But wait! I've been a fan of the caret package for modeling, but now there's tidymodels - written with tidy concepts in mind. And written by Max Kuhn - the author of caret. I've been wanting to try this out, so here goes.

After importing the library, I need to get my training and test sets back into one dataset so tidymodels can do its thing in its tidy way. I join them and then re-split them with initial_time_split() because my observations are ordered by week and year. The proportion is 2682/3298 because the first 2682 observations are the games in the 2017-2020 seasons. I'll also create cross-validation folds for tuning.

library(tidymodels)

df_joined <- df20 %>% bind_rows(df21)

data_split <- initial_time_split(df_joined, prop = 2682/3298)

train_data <- training(data_split)
test_data  <- testing(data_split)

I'm not going to go into detail here about what each of these steps is doing because I can't do it any better than the docs.

# create the recipe
# note I omit the categorical school variables
lm_rec <-
  recipe(MOV_actual_team ~ ., data = train_data %>% select(-id, -week, -school_team, -school_opponent)) %>%
  step_dummy(all_nominal_predictors()) %>%        # create dummy variables for nominal predictors
  step_zv(all_predictors()) %>%                   # eliminate zero variance columns
  step_normalize(all_numeric(), -all_outcomes())  # normalize numeric variables

# define the model and hyperparameters to tune
lm_mod <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# values to select when tuning
lambda_grid <- tibble(penalty = 10^seq(-2, 1, length.out = 50))

# cross validation folds
set.seed(345)
folds <- vfold_cv(train_data, v = 10)

# create the workflow
lm_wflow <-
  workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(lm_rec)

# tune the model
lm_res <-
  lm_wflow %>%
  tune_grid(
    grid = lambda_grid,
    resamples = folds,
    control = control_grid(save_pred = TRUE)
  )

Let’s take a look at the mean error and R-squared for the different penalty values (thanks to this post for the nice plot idea).

lm_res %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme_bw() +
  theme(legend.position = "none")

I have some choices of what to do next. Normally, what I'd do is select the simplest model within one standard error of the model with the lowest RMSE. I can do that with the select_by_one_std_err() function as shown below. Note the penalty value for that model and hold that thought.

lm_res %>% select_by_one_std_err(metric = "rmse", desc(penalty))
## # A tibble: 1 x 9
##   penalty .metric .estimator  mean     n std_err .config            .best .bound
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>              <dbl>  <dbl>
## 1    1.60 rmse    standard    12.8    10   0.244 Preprocessor1_Mod~  12.6   12.8

I could also forget the "within one standard error" aspect and just select the model with the lowest RMSE (or I could select the least complex model that has no more than a certain percentage loss of RMSE). Selecting the model with the lowest error gives the following penalty value. Which one should I choose?

lowest_rmse <- lm_res %>% select_best("rmse")
lowest_rmse
## # A tibble: 1 x 2
##   penalty .config              
##     <dbl> <chr>                
## 1   0.256 Preprocessor1_Model24

Take a look at the model object below. If I go with the 1SE lambda of about 1.6, then based on the degrees of freedom (Df) I'll have only 2 or 3 predictors in the model (that penalty happens to fall between two of the lambda values below, so I'm not sure exactly how many predictors would be selected). So out of about 100 predictors, only two or three survive. To me, that means a few variables dominate and the rest are just noise. With the lowest-RMSE lambda value of 0.256, there will be 21-22 predictors selected. For the purpose of this post, I'm going to choose the lowest RMSE.

For the final fit, I finalize the workflow based on the lowest RMSE. Notice I also use data_split, which contains both the training and test data. From that fit, I get the RMSE and R-squared for the test data set. The RMSE is about 3 touchdowns, and the R-squared is alarmingly low.

lm_final_wf <-
  lm_wflow %>%
  finalize_workflow(lowest_rmse)

lm_final_fit <-
  lm_final_wf %>%
  last_fit(data_split)
## ! train/test split: preprocessor 1/1, model 1/1 (predictions): Unknown columns: `(Intercept)`
lm_final_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      21.2   Preprocessor1_Model1
## 2 rsq     standard       0.153 Preprocessor1_Model1
lm_final_fit %>%
  extract_fit_parsnip()
## parsnip model object
##
## Fit time:  10ms
##
## Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian", alpha = ~1)
##
##    Df  %Dev  Lambda
## 1   0  0.00 17.7300
## 2   1 11.16 16.1500
## 3   1 20.42 14.7200
## 4   1 28.11 13.4100
## 5   1 34.50 12.2200
## 6   1 39.80 11.1300
## 7   1 44.20 10.1500
## 8   1 47.85  9.2440
## 9   1 50.88  8.4230
## 10  1 53.40  7.6750
## 11  1 55.49  6.9930
## 12  1 57.23  6.3720
## 13  1 58.67  5.8060
## 14  2 59.87  5.2900
## 15  2 60.98  4.8200
## 16  2 61.90  4.3920
## 17  2 62.66  4.0020
## 18  2 63.29  3.6460
## 19  2 63.82  3.3220
## 20  2 64.25  3.0270
## 21  2 64.62  2.7580
## 22  2 64.92  2.5130
## 23  2 65.17  2.2900
## 24  2 65.37  2.0860
## 25  2 65.54  1.9010
## 26  2 65.69  1.7320
## 27  3 65.84  1.5780
## 28  3 66.00  1.4380
## 29  3 66.13  1.3100
## 30  4 66.24  1.1940
## 31  5 66.35  1.0880
## 32  6 66.45  0.9913
## 33  6 66.55  0.9032
## 34  6 66.63  0.8230
## 35  7 66.72  0.7498
## 36  7 66.80  0.6832
## 37 10 66.87  0.6225
## 38 13 66.95  0.5672
## 39 14 67.03  0.5168
## 40 15 67.10  0.4709
## 41 15 67.16  0.4291
## 42 17 67.21  0.3910
## 43 18 67.26  0.3562
## 44 20 67.30  0.3246
## 45 21 67.35  0.2958
## 46 21 67.38  0.2695
## 47 22 67.42  0.2455
## 48 22 67.45  0.2237
## 49 22 67.47  0.2039
## 50 24 67.50  0.1857
## 51 28 67.52  0.1692
## 52 32 67.55  0.1542
## 53 35 67.58  0.1405
## 54 39 67.64  0.1280
## 55 42 67.69  0.1167
## 56 43 67.74  0.1063
## 57 45 67.79  0.0969
## 58 46 67.83  0.0882
## 59 47 67.86  0.0804
## 60 51 67.90  0.0733
## 61 54 67.93  0.0668
## 62 54 67.96  0.0608
## 63 55 67.98  0.0554
## 64 56 68.00  0.0505
## 65 57 68.02  0.0460
## 66 59 68.05  0.0419
## 67 59 68.06  0.0382
## 68 59 68.07  0.0348
## 69 59 68.08  0.0317
## 70 61 68.09  0.0289
## 71 62 68.10  0.0263
## 72 64 68.10  0.0240
## 73 65 68.11  0.0219
## 74 68 68.11  0.0199
## 75 69 68.13  0.0181
## 76 70 68.13  0.0165
## 77 72 68.14  0.0151
## 78 72 68.15  0.0137
## 79 73 68.15  0.0125
## 80 74 68.16  0.0114
## 81 74 68.16  0.0104
## 82 74 68.17  0.0095
## 83 74 68.17  0.0086
## 84 75 68.17  0.0079
## 85 75 68.19  0.0072
## 86 76 68.19  0.0065
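
As an aside, since the glmnet object stores the whole regularization path, one way to answer the "how many predictors at the 1SE penalty" question directly would be to pull the coefficients at that lambda and count the nonzero ones. A quick sketch I didn't run as part of this analysis, using 1.6 as the rounded 1SE penalty from above:

# count nonzero coefficients at the 1SE penalty (~1.6)
lasso_coefs <-
  lm_final_fit %>%
  extract_fit_parsnip() %>%
  .$fit %>%                 # the underlying glmnet object
  coef(s = 1.6) %>%
  as.matrix()

sum(lasso_coefs != 0) - 1   # minus 1 for the intercept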

Here I plot the selected predictors color coded by whether the coefficient is positive or negative. Clearly, diffMOV dominates all other variables. Other important variables include the spread and home field advantage. Everything after that is probably just noise.

lm_final_fit %>%
  extract_fit_parsnip() %>%
  vip::vi(lambda = lowest_rmse$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = forcats::fct_reorder(Variable, Importance)
  ) %>%
  filter(Importance != 0) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL) +
  theme_bw()

Well, what do predictions from this model look like? First I'll make a scatter plot of predicted versus actual MOV.

lm_final_fit %>%
  collect_predictions() %>%
  ggplot() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  geom_point(aes(x=`.pred`, y=MOV_actual_team)) +
  theme_bw() +
  labs(title = "Linear Model Prediction Results",
       x = "Predicted MOV",
       y = "Actual MOV")

There's a little bit of a trend there, but it looks more like a shotgun blast. Not surprising, though, given the RMSE and R-squared we saw earlier. Let's see what percent of predictions were correct in a head-to-head sense - in other words, how often the model picked the correct winner.

lm_final_fit %>%
  collect_predictions() %>%
  mutate(both_positive = `.pred` > 0 & MOV_actual_team > 0,
         both_negative = `.pred` < 0 & MOV_actual_team < 0,
         correct = both_positive + both_negative) %>%
  summarize(sumCorrect = sum(correct)) / nrow(df21)
##   sumCorrect
## 1  0.6396104

Hmm… 64%. Better than half, anyway, but not very impressive. Before I move on to another model type, I'll grab the 22 selected predictors so I can remove everything else from the training and test data sets.

keep_vars <-
  lm_final_fit %>%
  extract_fit_parsnip() %>%
  vip::vi(lambda = lowest_rmse$penalty) %>%
  filter(Importance != 0) %>%
  .$Variable

Random Forest Model

I’ll try a few different models with the same data and see how they compare. I’ll start with a random forest model using the ranger package. I’ll use the same tidymodels procedure as above: create a recipe, model, and workflow, tune hyperparameters, and show the errors for the best model. This takes a while to execute on my laptop even when using all available cores.

cores <- parallel::detectCores()

# the recipe
rf_rec <-
  recipe(MOV_actual_team ~ ., data = train_data %>% select(keep_vars, MOV_actual_team)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keep_vars)` instead of `keep_vars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# the model to tune
rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", importance = "impurity", num.threads = cores) %>%
  set_mode("regression")

# the workflow
rf_wflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_rec)

# now set the seed for reproducibility and tune the hyperparameters
set.seed(1234)

rf_fit <-
  rf_wflow %>%
  tune_grid(grid = 25,
            control = control_grid(),
            resamples = folds)
## i Creating pre-processing data to finalize unknown parameter: mtry
# take a look at the best models
rf_fit %>% show_best(metric = "rmse")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    19    37 rmse    standard    11.8    10   0.268 Preprocessor1_Model17
## 2    21    31 rmse    standard    11.8    10   0.264 Preprocessor1_Model03
## 3    20    19 rmse    standard    11.8    10   0.265 Preprocessor1_Model07
## 4    19    30 rmse    standard    11.8    10   0.264 Preprocessor1_Model21
## 5    17    29 rmse    standard    11.9    10   0.260 Preprocessor1_Model06

There’s a nice autoplot() function that comes with tune (one of the tidymodels dependencies) that can be used with various tidymodels objects. Let’s check it out with rf_fit.

autoplot(rf_fit) + theme_bw()

Looks like large numbers of mtry are good, but the error flattens out after 50 or so.
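
Rather than hard-coding hyperparameters in the final model below, I could also pull the best combination out of the tuning results programmatically (the same pattern I use for the GBM later). A quick sketch:

# grab the hyperparameter combination with the lowest RMSE
rf_best <- rf_fit %>% select_best(metric = "rmse")
rf_best

For this run, though, I'll stick with hard-coded values.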

# final fit
last_rf_mod <-
  rand_forest(mtry = 17, min_n = 22, trees = 1000) %>% # original
  set_engine("ranger", importance = "impurity", num.threads = cores) %>%
  set_mode("regression")

# the last workflow
last_rf_workflow <-
  rf_wflow %>%
  update_model(last_rf_mod)

# the last fit
set.seed(345)
last_rf_fit <-
  last_rf_workflow %>%
  last_fit(data_split)

last_rf_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      22.2   Preprocessor1_Model1
## 2 rsq     standard       0.123 Preprocessor1_Model1

Still pretty bad. Let’s take a look at variable importance for this model.

last_rf_fit %>%
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>%
  vip::vip(num_features = 22) +
  theme_bw()

Really, only three variables contribute much to the model. I'll plot predicted versus actual again and hope for less of a shotgun blast.

last_rf_fit %>%
  collect_predictions() %>%
  ggplot() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  geom_point(aes(x=`.pred`, y=MOV_actual_team)) +
  theme_bw() +
  labs(title = "Ranger Prediction Results",
       x = "Predicted MOV",
       y = "Actual MOV")

Maybe it looks a little more elongated, but that could be wishful thinking. How about the percent correct head-to-head?

last_rf_fit %>%
  collect_predictions() %>%
  mutate(both_positive = `.pred` > 0 & MOV_actual_team > 0,
         both_negative = `.pred` < 0 & MOV_actual_team < 0,
         correct = both_positive + both_negative) %>%
  summarize(sumCorrect = sum(correct)) / nrow(df21)
##   sumCorrect
## 1  0.6087662

Slightly worse, but basically the same. Well, while I’m at it, I’ll look at predictions, actual MOV, and the spread.

last_rf_fit %>%
  collect_predictions() %>%
  mutate(spread_team = test_data$spread_team) %>%
  select(`.pred`, MOV_actual_team, spread_team) %>%
  head()
## # A tibble: 6 x 3
##    .pred MOV_actual_team spread_team
##    <dbl>           <int>       <dbl>
## 1 27.1                -4       -8.9
## 2 -0.927              24       -3.62
## 3 15.1                28      -11.5
## 4 25.0                10       -5.5
## 5 -2.40                7        3.12
## 6  9.35               -6       -2.88

The predictions don't compare well, that's for sure. This may partially be the result of replacing NAs with 0s earlier, and I suspect I just have too many garbage predictors in the model. The authors were getting head-to-head accuracy in the low 70s, so there's definite room for improvement. I'll drive on anyway and look at teams that did a lot better or worse than predicted. Here I show only the two extreme ends. I plot (predicted - actual), so positive values are for teams that I predicted would win by a much larger margin than they actually did.

last_rf_fit %>%
  collect_predictions() %>%
  mutate(team = df21$school_team,
         opp = df21$school_opponent,
         spread_team = df21$spread_team) %>%
  group_by(team) %>%
  summarize(meanDiff = mean(`.pred` - MOV_actual_team)) %>%
  arrange(meanDiff) %>%
  mutate(team = forcats::fct_reorder(team, meanDiff)) %>%
  filter(abs(meanDiff) > 7) %>%
  ggplot() +
  geom_col(aes(x=team, y=meanDiff)) +
  coord_flip() +
  theme_bw()

Finally, I’ll look at the proportion of games I correctly predicted over time. I’m curious if I do poorly early in the season and then improve, or if it’s relatively consistent. Quick note: the first time I made this plot, I was getting 100% accuracy in the early weeks, and then it got steadily worse over time. That was a giant red flag, and after thoroughly scrubbing my code, I found an error where some actual results had snuck into the test data. I’m glad I caught that!

last_rf_fit %>%
  collect_predictions() %>%
  bind_cols(test_data %>% select(id, week, school_team, school_opponent) %>% rename("ID" = "id")) %>%
  mutate(incorrect = case_when(
    `.pred` > 0 & MOV_actual_team < 0 ~ 1,
    `.pred` < 0 & MOV_actual_team > 0 ~ 1,
    TRUE ~ 0),
    correct= case_when(
      `.pred` > 0 & MOV_actual_team > 0 ~ 1,
      `.pred` < 0 & MOV_actual_team < 0 ~ 1,
      TRUE ~ 0)) %>%
  group_by(week) %>%
  summarize(sumIncorrect = sum(incorrect),
            sumCorrect = sum(correct),
            n = n(),
            proportionCorrect = sumCorrect / n) %>%
  ggplot() +
  geom_col(aes(x=week, y=proportionCorrect)) +
  theme_bw()

Looks reasonably consistent, so at least I know I fixed that error.

Gradient Boosting Machine Model

I don’t have much hope that this will be any better than the last two models, but it’ll give me another rep with the tidymodels workflow. Note that I’m using the random forest recipe (rf_rec). In hindsight, the recipe is model agnostic, so I should have just called it something generic to avoid confusion.

set.seed(345)
gbm_mod <-
  boost_tree(trees = 1000,
             tree_depth = tune(),
             min_n = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             mtry = tune(),
             learn_rate = tune()) %>%
  set_engine("xgboost", num.threads = cores) %>%
  set_mode("regression")

gbm_wflow <-
  workflow() %>%
  add_model(gbm_mod) %>%
  add_recipe(rf_rec)

gbm_fit <-
  gbm_wflow %>%
  tune_grid(grid = 25,
            control = control_grid(save_pred = TRUE),
            resamples = folds)
## i Creating pre-processing data to finalize unknown parameter: mtry
## ! Fold01: preprocessor 1/1, model 1/25 (predictions): Unknown columns: `(Intercept)`
## ! Fold01: preprocessor 1/1, model 2/25 (predictions): Unknown columns: `(Intercept)`

[snip]

gbm_fit %>% show_best(metric = "rmse")
## # A tibble: 5 x 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1    21    21          5    0.00729       6.66e- 3       0.291 rmse   
## 2    19    13          8    0.0112        6.75e- 7       0.888 rmse   
## 3     9    39          5    0.0276        7.80e-10       0.280 rmse   
## 4    17    32          7    0.00167       2.87e- 2       0.552 rmse   
## 5    15    28         14    0.0610        5.99e+ 0       0.129 rmse   
## # ... with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## #   std_err <dbl>, .config <chr>

Then the final fit.

gbm_best <-
  gbm_fit %>%
  select_best(metric = "rmse")

# final fit
last_gbm_mod <-
  boost_tree(trees = 1000,
              tree_depth = gbm_best %>% .$tree_depth,
              min_n = gbm_best %>% .$min_n,
              loss_reduction = gbm_best %>% .$loss_reduction,
              sample_size = gbm_best %>% .$sample_size,
              mtry = gbm_best %>% .$mtry,
              learn_rate = gbm_best %>% .$learn_rate) %>%
  set_engine("xgboost", num.threads = cores) %>%
  set_mode("regression")

# the last workflow
last_gbm_workflow <-
  gbm_wflow %>%
  update_model(last_gbm_mod)

# the last fit
set.seed(345)
last_gbm_fit <-
  last_gbm_workflow %>%
  last_fit(data_split)
## ! train/test split: preprocessor 1/1, model 1/1 (predictions): Unknown columns: `(Intercept)`
# since I used data_split above, this includes the test data set
last_gbm_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      22.7   Preprocessor1_Model1
## 2 rsq     standard       0.114 Preprocessor1_Model1

More of the same, of course, and now variable importance.

last_gbm_fit %>%
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>%
  vip::vip(num_features = 20) +
  theme_bw()

Yep, and now accuracy.

last_gbm_fit %>%
  collect_predictions() %>%
  mutate(both_positive = `.pred` > 0 & MOV_actual_team > 0,
         both_negative = `.pred` < 0 & MOV_actual_team < 0,
         correct = both_positive + both_negative) %>%
  summarize(sumCorrect = sum(correct)) / nrow(df21)
##   sumCorrect
## 1  0.6152597

About the same as the others, so no surprise.

Cubist Model

Just to beat a dead horse, one more time with a Cubist model.

cub_mod <-
  rules::cubist_rules(committees = tune(), neighbors = tune()) %>%
  set_engine("Cubist") %>%
  set_mode("regression")

cub_wflow <-
  workflow() %>%
  add_model(cub_mod) %>%
  add_recipe(rf_rec)

cub_fit <-
  cub_wflow %>%
  tune_grid(grid = 25,
            control = control_grid(save_pred = TRUE),
            resamples = folds)

cub_fit %>% show_best(metric = "rmse")
## # A tibble: 5 x 8
##   committees neighbors .metric .estimator  mean     n std_err .config           
##        <int>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>             
## 1         40         0 rmse    standard    11.7    10   0.251 Preprocessor1_Mod~
## 2         24         9 rmse    standard    12.1    10   0.228 Preprocessor1_Mod~
## 3         18         9 rmse    standard    12.2    10   0.227 Preprocessor1_Mod~
## 4         51         8 rmse    standard    12.2    10   0.227 Preprocessor1_Mod~
## 5          5         8 rmse    standard    12.2    10   0.227 Preprocessor1_Mod~

The final fit.

# final fit
last_cub_mod <-
  rules::cubist_rules(committees = 47, neighbors = 0) %>%
  set_engine("Cubist") %>%
  set_mode("regression")

# the last workflow
last_cub_workflow <-
  cub_wflow %>%
  update_model(last_cub_mod)

# the last fit
set.seed(345)
last_cub_fit <-
  last_cub_workflow %>%
  last_fit(data_split) # this includes the test data set

last_cub_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard     23.0    Preprocessor1_Model1
## 2 rsq     standard      0.0951 Preprocessor1_Model1

Cubist variable importance.

last_cub_fit %>%
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>%
  vip::vip(num_features = 20) +
  theme_bw()

At least with this model, a small number of variables don’t dominate the whole thing. Not sure if that’s a good or a bad thing in this case. Let’s check the Cubist accuracy.

last_cub_fit %>%
  collect_predictions() %>%
  mutate(both_positive = `.pred` > 0 & MOV_actual_team > 0,
         both_negative = `.pred` < 0 & MOV_actual_team < 0,
         correct = both_positive + both_negative) %>%
    summarize(sumCorrect = sum(correct)) / nrow(df21)
##   sumCorrect
## 1  0.6006494

Same as all of the others. Well, aside from demonstrating the garbage-in garbage-out principle, it was interesting to dip my toe in the waters of predictive modeling of sports events. As a bonus, I got some practice with tidymodels, which was great.

At this point in my weekend modeling, I paused and thought about what might have gone wrong and what might have gone right in the methodology I used. I also read this post about using a neural net to predict college football games. I tend to shy away from neural nets for regression problems. In my experience, they're a pain to set up, take a long time to train, and I really haven't seen them outperform simpler, faster, and more interpretable tree-based models.

Then I stumbled across this blog post from RStudio that demonstrated the use of tabnet, a neural net model developed by Google specifically for tabular data that incorporates some of the processes of tree-based models to improve interpretability. In the original paper, the authors also demonstrated that tabnet was on par with, and often better than, the current go-to models like xgboost. So, time to give it a whack.
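
(For anyone following along: tabnet sits on top of the torch package, so there's a one-time setup step. A sketch, assuming a CRAN install and that the torch backend hasn't been downloaded yet:)

install.packages(c("torch", "tabnet"))
torch::install_torch()  # downloads the libtorch backend on first use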

Neural Net Model

As I mentioned, I'm not too happy with the mess of predictors I've been using so far. Time for a fresh start using just a few predictors, similar to the CollegeFootballData.com blog post I just mentioned. I have lots of variables in my environment right now, so first I'm going to purge them.

rm(list = ls())

I’ll load some of the same data sets from disk that I used earlier and filter non-FBS games as before. However, this time, I’m only going to keep a few columns: id, school, conference, homeAway, points, week, and year.

gs <- readRDS("gameStats_full.RData")
gs <- type.convert(gs, as.is = TRUE)

# get the ID of non-FBS games
fbs <- readRDS("fbs_teams.RData")
fcsIDs <- gs %>% filter(!school %in% (fbs %>% .$school)) %>% .$id

# filter
gs <-
  gs %>%
  filter(!id %in% fcsIDs) %>%
  select(id, school, conference, homeAway, points, week, year)

I'll create home and away dataframes again as before, add the Elo and neutral site data, and then recombine it all into a dataframe called pregame. I'll calculate the margin of victory (mov) here, too. I noticed one of the away conference names was missing, but I was able to track down that it was an FBS independent team.

home <- gs %>% filter(homeAway == "home")
away <- gs %>% filter(homeAway == "away")

# get pre-game Elo and neutral site data
eloSite <- readRDS("eloSite.RData") %>%
  filter(!id %in% (fcsIDs)) %>%
  tidyjson::as_tibble(eloSite)

# get pre-game spread
spread <- readRDS("spread.RData")
spread <- spread %>% filter(!id %in% (fcsIDs))
spread <- spread %>% drop_na()

pregame <- spread %>% left_join(eloSite, by = "id")

pregame <-
  pregame %>%
  left_join(home %>% select(-school, -homeAway), by = "id") %>%
  rename("home_conference" = "conference", "home_points" = "points") %>%
  left_join(away %>% select(-school, -year, -homeAway, -week), by = "id") %>%
  rename("away_conference" = "conference", "away_points" = "points") %>%
  mutate(mov = home_points - away_points) %>%
  select(-home_points, -away_points) %>%
  arrange(year) %>%
  mutate(away_conference = replace_na(away_conference, "FBS Independents"))

head(pregame)
## # A tibble: 6 x 12
##          id home_spread home_team    home_pregame_elo away_team away_pregame_elo
##       <dbl>       <dbl> <chr>                   <dbl> <chr>                <dbl>
## 1 400944997        9.5  Bowling Gre~             1204 Ohio                  1376
## 2 400944998        9.33 Central Mic~             1233 Toledo                1568
## 3 400934536       24.2  Kansas                    972 Kansas S~             1681
## 4 400945016      -20.3  Western Mic~             1662 Kent Sta~              935
## 5 400934493      -18.7  Oklahoma St~             1727 Tulsa                 1665
## 6 400938591      -17.3  UCF                      1446 Florida ~             1272
## # ... with 6 more variables: neutral_site <lgl>, home_conference <chr>,
## #   week <int>, year <int>, away_conference <chr>, mov <int>

I’m only working with 12 columns this time - well, 9 if you don’t count id, week, and year. I won’t use those for training or testing. They’re just to keep track of things. So it’s down to team, conference, spread, Elo, and the home field advantage thing, which I’ll address next.

Recall that my first attempt was complicated and involved looping over game IDs. I came up with a much better way this time: randomly select half of the games, then set the HFA indicator in place. Way faster, and it's a lot clearer what I'm doing.

set.seed(42)
idx <- sample(1:nrow(pregame), trunc(nrow(pregame) / 2))
pregame$HFA <- 0
pregame[idx, "HFA"] <- 1
pregame[-idx, "HFA"] <- -1
pregame[-idx, "mov"] <- pregame[-idx, "mov"]
pregame <- pregame %>% mutate(HFA = ifelse(neutral_site, 0, HFA))

The last thing to do is turn the team names into factors.

pregame <-
  pregame %>%
  mutate(home_team = factor(home_team),
         away_team = factor(away_team))

That's it! I'm ready to split the data into training, test, and validation sets and get to model building.

data_split <-
  initial_time_split(
    pregame %>% select(-id, -year, -neutral_site, -week),
    prop = 2656/3391)

train_data <- training(data_split)
test_data  <- testing(data_split)
set.seed(234)
val_set <- validation_split(train_data, prop = 0.80)

There are many hyperparameters to tune, and I’ll exclude epochs, batch_size, and virtual_batch_size just to speed this part up.

library(tabnet)

nn_rec <- recipe(mov ~ ., train_data) %>%
  step_normalize(all_numeric(), -all_outcomes())

nn_mod <-
  tabnet(epochs = 5,
         batch_size = 256,
         decision_width = tune(),
         attention_width = tune(),
         num_steps = tune(),
         penalty = tune(),
         virtual_batch_size = 64,
         momentum = tune(),
         feature_reusage = tune(),
         learn_rate = tune()
  ) %>%
  set_engine("torch") %>%
  set_mode("regression")

nn_wf <-
  workflow() %>%
  add_model(nn_mod) %>%
  add_recipe(nn_rec)

set.seed(42)
nn_fit <-
  nn_wf %>%
  tune_grid(val_set,
            grid = 50,
            control = control_grid())

nn_fit %>% show_best(metric = "rmse")
## # A tibble: 5 x 13
##      penalty learn_rate decision_width attention_width num_steps feature_reusage
##        <dbl>      <dbl>          <int>           <int>     <int>           <dbl>
## 1    7.93e-4    0.0903              31              17         5            1.22
## 2    1.43e-1    0.0239              15              62         8            1.15
## 3    1.87e-8    0.0395              54              20         4            1.61
## 4    8.36e-8    0.0130              61              26        10            1.47
## 5    6.20e-2    0.00816             29              37         9            1.10
## # ... with 7 more variables: momentum <dbl>, .metric <chr>, .estimator <chr>,
## #   mean <dbl>, n <int>, std_err <dbl>, .config <chr>

The final fit.

# with tuned parameters
last_nn_mod <-
  tabnet(epochs = 15,
         batch_size = 256,
         decision_width = 15,
         attention_width = 62,
         num_steps = 8,
         penalty = 0.1430669,
         virtual_batch_size = 64,
         momentum = 0.194,
         feature_reusage = 1.148,
         learn_rate = 0.02395
  ) %>%
  set_engine("torch", verbose = TRUE) %>%
  set_mode("regression")

# the last workflow
last_nn_workflow <-
  nn_wf %>%
  update_model(last_nn_mod)

# the last fit
set.seed(42)
last_nn_fit <-
  last_nn_workflow %>%
  last_fit(data_split)
## [Epoch 001] Loss: 351.982961
## [Epoch 002] Loss: 278.342331
## [Epoch 003] Loss: 272.993945
## [Epoch 004] Loss: 258.913946
## [Epoch 005] Loss: 261.977900
## [Epoch 006] Loss: 269.318384
## [Epoch 007] Loss: 257.957574
## [Epoch 008] Loss: 259.053101
## [Epoch 009] Loss: 257.555994
## [Epoch 010] Loss: 259.599928
## [Epoch 011] Loss: 248.678506
## [Epoch 012] Loss: 246.407794
## [Epoch 013] Loss: 253.863184
## [Epoch 014] Loss: 253.021082
## [Epoch 015] Loss: 250.405311
# since I used data_split above, this includes the test data set
last_nn_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      16.1   Preprocessor1_Model1
## 2 rsq     standard       0.415 Preprocessor1_Model1

Both metrics are quite a bit better than what we've seen up to this point (the RMSE is lower and the R-squared is higher), so I'm optimistic this model will perform better.

last_nn_fit %>%
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>%
  vip::vip() +
  theme_bw()

The model relies heavily on the spread, which makes sense given that it's hard to beat the spread and I'm considering only a few other predictors.

# find percent I predicted the correct winner
last_nn_fit %>%
  collect_predictions() %>%
  mutate(both_positive = `.pred` > 0 & mov > 0,
         both_negative = `.pred` < 0 & mov < 0,
         correct = both_positive + both_negative) %>%
  summarize(Correct = sum(correct)) / nrow(test_data)
##     Correct
## 1 0.7319728

This model correctly predicts 73.2% of the game winners, compared to the ~ 60% for the earlier models. How does that compare to the spread?

# how does the spread do?
last_nn_fit %>%
  collect_predictions() %>%
  mutate(spread = -test_data$home_spread,
         both_positive = spread > 0 & mov > 0,
         both_negative = spread < 0 & mov < 0,
         correct = both_positive + both_negative) %>%
  summarize(Correct = sum(correct)) / nrow(test_data)
##     Correct
## 1 0.7292517

Not surprising that it's almost the same. Really, at this point, this model basically is the spread. Obviously, if we want to beat the spread, we need to improve on this. Let's check the predicted vs. actual raw values.

last_nn_fit %>%
  collect_predictions() %>%
  head(10)
## # A tibble: 10 x 5
##    id                 .pred  .row   mov .config             
##    <chr>              <dbl> <int> <int> <chr>               
##  1 train/test split   5.43   2657    33 Preprocessor1_Model1
##  2 train/test split  22.8    2658     8 Preprocessor1_Model1
##  3 train/test split   0.584  2659    14 Preprocessor1_Model1
##  4 train/test split  -6.72   2660     3 Preprocessor1_Model1
##  5 train/test split   3.36   2661    -7 Preprocessor1_Model1
##  6 train/test split  15.0    2662     4 Preprocessor1_Model1
##  7 train/test split   5.88   2663    10 Preprocessor1_Model1
##  8 train/test split  -9.86   2664    -4 Preprocessor1_Model1
##  9 train/test split  12.4    2665     4 Preprocessor1_Model1
## 10 train/test split -21.0    2666   -18 Preprocessor1_Model1

And now let’s plot them.

last_nn_fit %>%
  collect_predictions() %>%
  ggplot() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  geom_point(aes(x=`.pred`, y=mov)) +
  theme_bw() +
  labs(title = "NN Prediction Results",
       x = "Predicted MOV",
       y = "Actual MOV")
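
One more metric that would be worth checking, though I didn't compute it here: against-the-spread accuracy, i.e., how often the model picks the side that actually covers. Treating -home_spread as the expected home margin, a sketch (ignoring pushes, where the actual margin lands exactly on the line) would look something like this:

# sketch: how often do the prediction and the actual MOV land on the same side of the line?
last_nn_fit %>%
  collect_predictions() %>%
  mutate(expectedMargin = -test_data$home_spread,
         pickedCover = `.pred` > expectedMargin,
         actualCover = mov > expectedMargin) %>%
  summarize(atsAccuracy = mean(pickedCover == actualCover))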

Interpretability Plots

Earlier I mentioned that tabnet models were developed to be interpretable. The following plot shows the first 50 games in the test set with games on the x axis. For these games, it seems the Elo scores were the biggest contributors. Interesting. I wonder why we don’t see more of an impact from the spread.

exp_fit <-
  last_nn_fit %>%
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip()

ex_fit <- tabnet_explain(exp_fit$fit, test_data[1:50, ])

autoplot(ex_fit) +
  labs(x = "Game", y = "Predictor")

Next, for the same 50 games, we see that different predictors are important in different steps. Here we see that spread comes into play mostly in the 7th and 8th step.

# PER-STEP, OBSERVATION-LEVEL FEATURE IMPORTANCES
autoplot(ex_fit, type="steps") +
  labs(x = "Game", y = "Predictor")

Next Steps

While tabnet looks promising, I can't directly compare its performance with the other models in this post because I used a different data set for it. I saw in the tidymodels docs that there's a way to do model comparison using workflow sets, so I'll check that out in another post.
