In this document, I constructed an ordinal logistic regression model to predict the 2024 AFL Brownlow Medal votes. To do so, I used a dataset I compiled from various trusted sources, including the fitzRoy R package, afl.com.au, AFL Tables, and the AFL Coaches Association. I took the time to prepare and clean the data, transforming it to lay a solid foundation for building a robust predictive model.
The model I developed this year is similar to my 2023 effort, but
with the addition of a Position variable, which records
each player's primary position according to the AFL website. I added
this after finding that some non-midfielders - particularly
ruckmen such as Rowan Marshall, Max Gawn and Tim English - had heavily
inflated predicted votes compared to what they actually polled last
year. As the Brownlow Medal is considered a midfielder's award (for good
reason), I needed to incorporate playing position in some
capacity.
For my 2024 Brownlow prediction modelling, I trained the model on home-and-away season data from 2015-2023, with the goal to forecast how many Brownlow votes a player might receive based on their performance metrics.
Once the model was trained, I shifted to the testing phase, where I evaluated its performance using player data from 2024. I conducted a game-by-game analysis to compare my predictions with the actual Brownlow votes awarded by umpires, providing insights into how well the model performed. I also took a broader look at season-wide predictions, comparing my predicted votes with the actual counts to assess the model’s accuracy and predictive capabilities.
Let’s dive in!
To start, I loaded the key libraries needed for data analysis and modelling. These libraries were essential for handling tasks like data manipulation, statistical modelling, and formatting the output:
library(tidyverse) #Collection of R packages for data science
library(ordinal) #Package for fitting ordinal regression models
library(knitr) #Package for dynamic report generation in R
library(unglue) #Package for string manipulation
library(fastDummies) #Package for creating dummy variables
library(writexl) #Package for exporting data to Excel files
library(DT) #Package for creating interactive tables
library(caret) #Package to help create a confusion matrix
library(ggplot2) #Package for visual analysis and plotting
I began by reading in the AFL Brownlow Medal data. The dataset
includes detailed player statistics and was sourced from reliable
outlets like the fitzRoy R package, afl.com.au, AFL Tables,
and the AFL Coaches Association (AFLCA). I have already gathered, joined
and cleaned the data into the file brownlow_data_2024.csv so
that it is ready for analysis!
#Read in the data
brownlow_data <- read.csv("brownlow_data_2024.csv")
To get a better understanding of the dataset, I took a quick look at the first 100 rows:
#Print the table in a nice format
datatable(head(brownlow_data,100), options = list(scrollX = TRUE, pageLength = 5))
I also displayed the column names to get a sense of the variables included in the dataset:
#Display the column names
colnames(brownlow_data)
## [1] "Team.Name" "Player.Jumper.Number"
## [3] "Year" "Round.Number"
## [5] "Player.ID" "Game.ID"
## [7] "First.Name" "Surname"
## [9] "Photo" "Opponent.Name"
## [11] "Team.Status" "Venue.Name"
## [13] "Venue.State" "Time.On.Ground.Percentage"
## [15] "Goals" "Behinds"
## [17] "Kicks" "Handballs"
## [19] "Disposals" "Marks"
## [21] "Bounces" "Tackles"
## [23] "Contested.Possessions" "Uncontested.Possessions"
## [25] "Total.Possessions" "Inside.50s"
## [27] "Marks.Inside.50" "Contested.Marks"
## [29] "Hitouts" "One.Percenters"
## [31] "Disposal.Efficiency.Percentage" "Clangers"
## [33] "Frees.For" "Frees.Against"
## [35] "DreamTeam.Points" "Rebound.50s"
## [37] "Goal.Assists" "Goal.Accuracy"
## [39] "Rating.Points" "Turnovers"
## [41] "Intercepts" "Tackles.Inside.50"
## [43] "Shots.At.Goal" "Score.Involvements"
## [45] "Metres.Gained" "Centre.Clearances"
## [47] "Stoppage.Clearances" "Total.Clearances"
## [49] "Effective.Kicks" "Kicking.Efficiency.Percentage"
## [51] "Kick.To.Handball.Ratio" "Effective.Disposals"
## [53] "Marks.On.Lead" "Intercept.Marks"
## [55] "Contested.Possession.Rate" "Hitouts.To.Advantage"
## [57] "Hitout.Win.Percentage" "Hitout.To.Advantage.Rate.Percentage"
## [59] "Ground.Ball.Gets" "F50.Ground.Ball.Gets"
## [61] "Score.Launches" "Pressure.Acts"
## [63] "Defensive.Half.Pressure.Acts" "Spoils"
## [65] "Ruck.Contests" "Contest.Defense.One.On.Ones"
## [67] "Contest.Defense.Losses" "Contest.Defense.Loss.Percentage"
## [69] "Contest.Offence.One.On.Ones" "Contest.Offence.Wins"
## [71] "Contest.Offence.Wins.Percentage" "Centre.Bounce.Attendances"
## [73] "Kick.Ins" "Kick.Ins.Play.On"
## [75] "Team.Goals" "Team.Behinds"
## [77] "Team.Total" "Opponent.Goals"
## [79] "Opponent.Behinds" "Opponent.Total"
## [81] "Team.Result" "Position"
## [83] "Uncontested.Marks" "Effective.Handballs"
## [85] "Margin" "Coaches.Votes"
## [87] "Date" "Brownlow.Votes"
There are 88 variables that capture different aspects of each player's game performance. To simplify the analysis and avoid redundancy, I created new variables and removed overlapping ones. For instance:
- The Disposals variable isn't needed, since we already have Kicks and Handballs separately.
- I created an Ineffective.Kicks variable by subtracting Effective.Kicks from Kicks, and then removed the original Kicks variable.
Now, let's go ahead with these transformations and other necessary pre-processing steps:
#Add a score involvements variable that doesn't include direct goals, behinds or goal assists
brownlow_data$Involvements.No.Scores.Or.Assists <- brownlow_data$Score.Involvements -
brownlow_data$Goals -
brownlow_data$Behinds -
brownlow_data$Goal.Assists
#Add a hitouts variable that doesn't include those that go to advantage
brownlow_data$Hitouts.No.Advantage <- brownlow_data$Hitouts -
brownlow_data$Hitouts.To.Advantage
#Add an Outside F50 ground ball gets variable
brownlow_data$Outside.50.Ground.Ball.Gets <- brownlow_data$Ground.Ball.Gets -
brownlow_data$F50.Ground.Ball.Gets
#Add a contested offence 1-on-1 variable that doesn't include those won
brownlow_data$Contest.Offence.One.On.Ones.Not.Won <- brownlow_data$Contest.Offence.One.On.Ones -
brownlow_data$Contest.Offence.Wins
#Add a contested defence 1-on-1 variable that doesn't include those lost
brownlow_data$Contest.Defense.One.On.Ones.Not.Lost <- brownlow_data$Contest.Defense.One.On.Ones -
brownlow_data$Contest.Defense.Losses
#Add an ineffective kicks variable
brownlow_data$Ineffective.Kicks <- brownlow_data$Kicks -
brownlow_data$Effective.Kicks
#Add an ineffective handballs variable
brownlow_data$Ineffective.Handballs <- brownlow_data$Handballs -
brownlow_data$Effective.Handballs
#Add a shots at goal variable that doesn't include those that go in for a goal or behind
brownlow_data$Shots.No.Score <- brownlow_data$Shots.At.Goal -
brownlow_data$Goals -
brownlow_data$Behinds
#Add a forward half pressure acts variable
brownlow_data$Forward.Half.Pressure.Acts <- brownlow_data$Pressure.Acts -
brownlow_data$Defensive.Half.Pressure.Acts
#Add a kick ins variable that doesn't include those where the kicker plays on out of the goal square
brownlow_data$Kick.Ins.Not.Play.On <- brownlow_data$Kick.Ins -
brownlow_data$Kick.Ins.Play.On
#Add a contested possessions variable that doesn't include contested marks
brownlow_data$Contested.Possessions.No.Mark <- brownlow_data$Contested.Possessions -
brownlow_data$Contested.Marks
#Add an uncontested possessions variable that doesn't include uncontested marks
brownlow_data$Uncontested.Possessions.No.Mark <- brownlow_data$Uncontested.Possessions -
brownlow_data$Uncontested.Marks
#Add an intercepts variable that doesn't include intercept marks
brownlow_data$Intercepts.No.Mark <- brownlow_data$Intercepts -
brownlow_data$Intercept.Marks
#Add a tackles variable that doesn't include tackles inside 50
brownlow_data$Tackles.Outside.50 <- brownlow_data$Tackles -
brownlow_data$Tackles.Inside.50
#Add a Player variable which pastes a player's first name and surname
brownlow_data$Player <- paste(brownlow_data$First.Name, brownlow_data$Surname)
To boost the predictive power of the model, I also added variables that capture votes from previous seasons. Players who have polled well in the past often continue to perform strongly.
#Previous season Brownlow Votes and Coaches Votes variables
prev_season_votes <- brownlow_data %>%
group_by(Player.ID, Year) %>%
summarise(
Brownlow.Votes.Previous.Season = sum(Brownlow.Votes, na.rm = TRUE),
Coaches.Votes.Previous.Season = sum(Coaches.Votes, na.rm = TRUE)) %>%
mutate(Year = Year + 1)
#Join these new variables to the brownlow_data
brownlow_data <- full_join(brownlow_data,
prev_season_votes) %>%
filter(!is.na(First.Name))
#Replace NA Brownlow Votes with a 0
brownlow_data$Brownlow.Votes.Previous.Season <- replace(brownlow_data$Brownlow.Votes.Previous.Season,
is.na(brownlow_data$Brownlow.Votes.Previous.Season),
0)
#Replace NA Coaches Votes with a 0
brownlow_data$Coaches.Votes.Previous.Season <- replace(brownlow_data$Coaches.Votes.Previous.Season,
is.na(brownlow_data$Coaches.Votes.Previous.Season),
0)
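The `Year + 1` shift above is what turns a season's totals into "previous season" features. Here's a minimal sketch of the mechanic on made-up data (the player and vote values are hypothetical):

```r
library(dplyr)

#One hypothetical player's votes across two seasons
toy <- data.frame(Player.ID = 1, Year = c(2022, 2023), Brownlow.Votes = c(10, 20))

#Summarise per season, then shift Year forward by one
prev <- toy %>%
  group_by(Player.ID, Year) %>%
  summarise(Brownlow.Votes.Previous.Season = sum(Brownlow.Votes), .groups = "drop") %>%
  mutate(Year = Year + 1)

#Joining attaches the 2022 total (10) to the player's 2023 row.
#The 2022 row gets NA (later replaced with 0), and the full join creates
#a spurious 2024 row with no game stats - which is why the real code
#filters out rows with a missing First.Name
joined <- full_join(toy, prev, by = c("Player.ID", "Year"))
joined
```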
I’ve also calculated the maximum votes a player received in any season prior to the one in which the game was played (dating back to 2015):
#Max Brownlow Votes and Coaches Votes in a season variables
#Create an empty data frame to collect the results
Max.Brownlow.Votes.Season_prior <- data.frame()
#Loop through each season of available data to get the most Brownlow
#Votes and Coaches Votes a player has received prior to that year
for(i in min(brownlow_data$Year):max(brownlow_data$Year)){
Max.Brownlow.Votes.Seasons_prior_i <- brownlow_data %>%
filter(Year <= i) %>%
group_by(Player.ID, Year) %>%
summarise(Brownlow.Votes.Previous.Season = sum(Brownlow.Votes, na.rm = TRUE),
Coaches.Votes.Previous.Season = sum(Coaches.Votes, na.rm = TRUE)) %>%
group_by(Player.ID) %>%
summarise(Max.Brownlow.Votes.Season = max(Brownlow.Votes.Previous.Season),
Max.Coaches.Votes.Season = max(Coaches.Votes.Previous.Season),
Year = i + 1)
Max.Brownlow.Votes.Season_prior <- bind_rows(Max.Brownlow.Votes.Season_prior, Max.Brownlow.Votes.Seasons_prior_i)
}
#Join these new variables to the brownlow_data
brownlow_data <- full_join(brownlow_data, Max.Brownlow.Votes.Season_prior) %>%
filter(!is.na(First.Name))
#Replace NA Brownlow Votes with a 0
brownlow_data$Max.Brownlow.Votes.Season <- replace(brownlow_data$Max.Brownlow.Votes.Season,
is.na(brownlow_data$Max.Brownlow.Votes.Season),
0)
#Replace NA Coaches Votes with a 0
brownlow_data$Max.Coaches.Votes.Season <- replace(brownlow_data$Max.Coaches.Votes.Season,
is.na(brownlow_data$Max.Coaches.Votes.Season),
0)
It was important to also convert the Brownlow.Votes and
Coaches.Votes variables into factors. This ensures that R
treats them as categorical rather than continuous variables, which is
crucial for accurate interpretation and analysis when building the
predictive model.
#Convert Brownlow Votes variable to factor
brownlow_data$Brownlow.Votes <- as.factor(brownlow_data$Brownlow.Votes)
#Convert Coaches Votes variable to factor
brownlow_data$Coaches.Votes <- as.factor(brownlow_data$Coaches.Votes)
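One thing worth knowing here: clm() takes the ordering of the outcome categories from the factor's level order. Because the vote values are single digits, as.factor() happens to sort them into the correct ascending order, as this small self-contained check illustrates:

```r
#as.factor() sorts character levels alphabetically, which for the
#single-digit vote values coincides with the correct numeric order
votes <- as.factor(c(0, 3, 1, 0, 2))
levels(votes)  #"0" "1" "2" "3"

#Declaring the levels explicitly (and ordered) makes the intent unambiguous
votes_ord <- factor(votes, levels = c("0", "1", "2", "3"), ordered = TRUE)
```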
Next, I refined the dataset by keeping only the columns that are essential for the analysis.
#Select variables that will be used in the model
brownlow_data <- brownlow_data %>%
select(Year, Game.ID, Round.Number, Team.Name,
Opponent.Name, Player, Goals, Behinds,
Effective.Kicks, Effective.Handballs, Ineffective.Kicks, Ineffective.Handballs,
Bounces, Tackles.Inside.50, Tackles.Outside.50, Contested.Possessions.No.Mark,
Uncontested.Possessions.No.Mark, Inside.50s, Marks.Inside.50, Contested.Marks,
Uncontested.Marks, One.Percenters, Clangers, Frees.For,
Frees.Against, Rebound.50s, Goal.Assists, Rating.Points,
Turnovers, Intercepts.No.Mark, Involvements.No.Scores.Or.Assists, Metres.Gained,
Centre.Clearances, Stoppage.Clearances, Marks.On.Lead, Intercept.Marks,
Hitouts.To.Advantage, Hitouts.No.Advantage, Outside.50.Ground.Ball.Gets, F50.Ground.Ball.Gets,
Score.Launches, Forward.Half.Pressure.Acts, Defensive.Half.Pressure.Acts, Spoils,
Contest.Offence.One.On.Ones.Not.Won, Contest.Offence.Wins, Contest.Defense.One.On.Ones.Not.Lost,
Contest.Defense.Losses, Shots.No.Score, Time.On.Ground.Percentage, Team.Goals, Team.Behinds,
Opponent.Goals, Opponent.Behinds, Margin, Kick.Ins.Not.Play.On,
Kick.Ins.Play.On, Max.Brownlow.Votes.Season, Max.Coaches.Votes.Season, Brownlow.Votes.Previous.Season,
Coaches.Votes.Previous.Season, Coaches.Votes, Team.Result, Position, Brownlow.Votes)
This updated dataset, with the newly created and refined variables, will serve as the foundation for developing the predictive model.
#Present the enhanced dataset in a nice format
datatable(head(brownlow_data,100), options = list(scrollX = TRUE, pageLength = 5))
To begin, I specified the time periods for training and testing the model. The training period covered data from 2015 to 2023, while the testing period focused on the 2024 season:
#First season of training period
train_start_season <- 2015
#Last season of training period
train_end_season <- 2023
#Season to predict Brownlow Votes
test_season <- 2024
Next, I created a training dataset by filtering the data for the specified seasons from 2015 to 2023.
#Create dataset that encompasses the training period
brownlow_train <- brownlow_data %>%
filter(Year >= train_start_season, Year <= train_end_season)
Next, I standardised the numeric statistics in the training dataset to ensure that each variable has a mean of 0 and a standard deviation of 1. Standardising in this way helps improve the model’s performance by putting all variables on a similar scale.
#Standardise the numeric values
train_numeric_standardised <- scale(brownlow_train[,8:(ncol(brownlow_train)-4)])
brownlow_train[,8:(ncol(brownlow_train)-4)] <- train_numeric_standardised
#Store the center and scale as this will be used to standardise the testing set
train_numeric_standardised.center<-attr(train_numeric_standardised,"scaled:center")
train_numeric_standardised.scale<-attr(train_numeric_standardised,"scaled:scale")
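Storing the training centre and scale matters because the test set must be standardised using the training set's statistics, not its own. A tiny illustration with made-up numbers:

```r
#A training column with mean 2.5
train_x <- matrix(c(1, 2, 3, 4), ncol = 1)
s <- scale(train_x)
ctr <- attr(s, "scaled:center")  #2.5
scl <- attr(s, "scaled:scale")   #sd of the training column

#New test values are expressed in *training* units, so a test value of 2.5
#maps exactly to 0 even if the test set's own mean is different
test_x <- matrix(c(2.5, 5), ncol = 1)
scale(test_x, center = ctr, scale = scl)
```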
Next, I converted categorical variables like
Coaches.Votes, Team.Result and
Position into dummy variables. These dummy variables are
binary (0 or 1) and allow the model to process categorical data more
effectively:
#Create dummy variables for Coaches Votes and Team Result
brownlow_train <- dummy_cols(brownlow_train,
select_columns = c('Coaches.Votes', 'Team.Result', 'Position'),
remove_selected_columns = TRUE)
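For reference, dummy_cols() names the new columns by pasting the original column name, an underscore, and each category value - which is where names like Position_RUCK in the model formula come from. A small sketch on made-up rows:

```r
library(fastDummies)

#Hypothetical positions for three players
toy <- data.frame(Position = c("MIDFIELDER", "RUCK", "MIDFIELDER"))

#Each category becomes a 0/1 column; the original column is dropped
dummy_cols(toy, select_columns = "Position", remove_selected_columns = TRUE)
```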
To predict the number of Brownlow votes a player might receive based on their performance statistics, I used an ordinal logistic regression model. This model takes into account various metrics from the cleaned and enhanced dataset.
The code below defines the model formula and fits the ordinal
logistic regression model using the clm function from the
ordinal package. This package is tailored for fitting
cumulative link models, including ordinal logistic regression. The
clm function estimates the model parameters through maximum
likelihood estimation, which is well-suited for the ordered nature of
the response variable. This is particularly important for ordinal
outcomes, like the number of Brownlow votes, where the categories are
ranked.
my_model <- clm(Brownlow.Votes ~
Goals + Behinds + Effective.Kicks + Effective.Handballs +
Ineffective.Kicks + Ineffective.Handballs + Bounces +
Tackles.Inside.50 + Tackles.Outside.50 + Contested.Possessions.No.Mark +
Uncontested.Possessions.No.Mark + Inside.50s + Marks.Inside.50 +
Contested.Marks + Uncontested.Marks + One.Percenters +
Clangers + Frees.For + Frees.Against +
Rebound.50s + Goal.Assists + Rating.Points +
Turnovers + Intercepts.No.Mark + Involvements.No.Scores.Or.Assists +
Metres.Gained + Centre.Clearances + Stoppage.Clearances +
Marks.On.Lead + Intercept.Marks + Outside.50.Ground.Ball.Gets +
F50.Ground.Ball.Gets + Score.Launches + Forward.Half.Pressure.Acts +
Defensive.Half.Pressure.Acts + Spoils + Contest.Offence.One.On.Ones.Not.Won +
Contest.Offence.Wins + Contest.Defense.One.On.Ones.Not.Lost + Contest.Defense.Losses +
Shots.No.Score + Time.On.Ground.Percentage + Team.Result_W +
Team.Result_L + Team.Goals + Team.Behinds +
Opponent.Goals + Opponent.Behinds + Kick.Ins.Not.Play.On +
Kick.Ins.Play.On + Hitouts.To.Advantage + Hitouts.No.Advantage +
Position_KEY_DEFENDER + Position_KEY_FORWARD + Position_MEDIUM_DEFENDER + Position_RUCK +
Position_MEDIUM_FORWARD + Position_MIDFIELDER + Position_MIDFIELDER_FORWARD +
Coaches.Votes_1 + Coaches.Votes_2 + Coaches.Votes_3 +
Coaches.Votes_4 + Coaches.Votes_5 + Coaches.Votes_6 +
Coaches.Votes_7 + Coaches.Votes_8 + Coaches.Votes_9 +
Coaches.Votes_10 + Max.Brownlow.Votes.Season + Brownlow.Votes.Previous.Season +
Max.Coaches.Votes.Season + Coaches.Votes.Previous.Season,
data = brownlow_train, link = "logit", #Use logistic link function
threshold = "flexible", #Allow flexible thresholds
Hess = TRUE, #Compute Hessian matrix (needed for the summary's standard errors)
maxIter = 100, #Allow up to 100 iterations
tol = 1e-6, #Set a small tolerance for convergence
contrasts = list(Team.Name = "contr.treatment"), #Specify contrasts for the 'Team' variable
na.action = na.omit)
Finally, I summarised the model to evaluate its performance and understand the significance of each predictor variable. The summary includes the coefficients, standard errors, z-values, and p-values for each predictor in the model.
#Print a summary of the model
summary(my_model)
## formula:
## Brownlow.Votes ~ Goals + Behinds + Effective.Kicks + Effective.Handballs + Ineffective.Kicks + Ineffective.Handballs + Bounces + Tackles.Inside.50 + Tackles.Outside.50 + Contested.Possessions.No.Mark + Uncontested.Possessions.No.Mark + Inside.50s + Marks.Inside.50 + Contested.Marks + Uncontested.Marks + One.Percenters + Clangers + Frees.For + Frees.Against + Rebound.50s + Goal.Assists + Rating.Points + Turnovers + Intercepts.No.Mark + Involvements.No.Scores.Or.Assists + Metres.Gained + Centre.Clearances + Stoppage.Clearances + Marks.On.Lead + Intercept.Marks + Outside.50.Ground.Ball.Gets + F50.Ground.Ball.Gets + Score.Launches + Forward.Half.Pressure.Acts + Defensive.Half.Pressure.Acts + Spoils + Contest.Offence.One.On.Ones.Not.Won + Contest.Offence.Wins + Contest.Defense.One.On.Ones.Not.Lost + Contest.Defense.Losses + Shots.No.Score + Time.On.Ground.Percentage + Team.Result_W + Team.Result_L + Team.Goals + Team.Behinds + Opponent.Goals + Opponent.Behinds + Kick.Ins.Not.Play.On + Kick.Ins.Play.On + Hitouts.To.Advantage + Hitouts.No.Advantage + Position_KEY_DEFENDER + Position_KEY_FORWARD + Position_MEDIUM_DEFENDER + Position_RUCK + Position_MEDIUM_FORWARD + Position_MIDFIELDER + Position_MIDFIELDER_FORWARD + Coaches.Votes_1 + Coaches.Votes_2 + Coaches.Votes_3 + Coaches.Votes_4 + Coaches.Votes_5 + Coaches.Votes_6 + Coaches.Votes_7 + Coaches.Votes_8 + Coaches.Votes_9 + Coaches.Votes_10 + Max.Brownlow.Votes.Season + Brownlow.Votes.Previous.Season + Max.Coaches.Votes.Season + Coaches.Votes.Previous.Season
## data: brownlow_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 77962 -12665.38 25482.76 8(0) 1.05e-11 7.1e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Goals 0.748480 0.032220 23.230 < 2e-16 ***
## Behinds -0.005403 0.021689 -0.249 0.803281
## Effective.Kicks 0.517579 0.062590 8.269 < 2e-16 ***
## Effective.Handballs 0.512193 0.064590 7.930 2.19e-15 ***
## Ineffective.Kicks 0.346748 0.040618 8.537 < 2e-16 ***
## Ineffective.Handballs 0.171015 0.027087 6.314 2.73e-10 ***
## Bounces 0.053101 0.015824 3.356 0.000791 ***
## Tackles.Inside.50 0.057025 0.019541 2.918 0.003520 **
## Tackles.Outside.50 0.185739 0.022008 8.440 < 2e-16 ***
## Contested.Possessions.No.Mark 0.276808 0.072790 3.803 0.000143 ***
## Uncontested.Possessions.No.Mark 0.064118 0.068858 0.931 0.351774
## Inside.50s -0.011091 0.024866 -0.446 0.655572
## Marks.Inside.50 0.068018 0.026507 2.566 0.010287 *
## Contested.Marks 0.142081 0.025711 5.526 3.28e-08 ***
## Uncontested.Marks 0.149506 0.042156 3.547 0.000390 ***
## One.Percenters -0.046792 0.053226 -0.879 0.379343
## Clangers 0.051468 0.027968 1.840 0.065731 .
## Frees.For 0.069251 0.020470 3.383 0.000717 ***
## Frees.Against -0.052661 0.024228 -2.174 0.029738 *
## Rebound.50s -0.002818 0.028648 -0.098 0.921654
## Goal.Assists 0.021439 0.017696 1.212 0.225697
## Rating.Points -0.043357 0.039141 -1.108 0.267985
## Turnovers 0.018373 0.025995 0.707 0.479691
## Intercepts.No.Mark -0.071584 0.029967 -2.389 0.016906 *
## Involvements.No.Scores.Or.Assists 0.021459 0.025260 0.850 0.395573
## Metres.Gained 0.091336 0.039719 2.300 0.021475 *
## Centre.Clearances 0.074735 0.020955 3.566 0.000362 ***
## Stoppage.Clearances 0.026953 0.025309 1.065 0.286880
## Marks.On.Lead 0.018376 0.019450 0.945 0.344779
## Intercept.Marks 0.127835 0.025842 4.947 7.54e-07 ***
## Outside.50.Ground.Ball.Gets -0.101629 0.041913 -2.425 0.015320 *
## F50.Ground.Ball.Gets -0.070432 0.024448 -2.881 0.003965 **
## Score.Launches 0.002124 0.022268 0.095 0.924008
## Forward.Half.Pressure.Acts -0.041505 0.027195 -1.526 0.126962
## Defensive.Half.Pressure.Acts 0.034669 0.024956 1.389 0.164765
## Spoils 0.084240 0.060368 1.395 0.162882
## Contest.Offence.One.On.Ones.Not.Won -0.023461 0.024914 -0.942 0.346362
## Contest.Offence.Wins -0.051035 0.019276 -2.648 0.008105 **
## Contest.Defense.One.On.Ones.Not.Lost 0.004919 0.033735 0.146 0.884080
## Contest.Defense.Losses -0.017529 0.031525 -0.556 0.578181
## Shots.No.Score 0.009738 0.017739 0.549 0.583040
## Time.On.Ground.Percentage 0.122243 0.044717 2.734 0.006263 **
## Team.Result_W 0.058525 0.210503 0.278 0.780996
## Team.Result_L -0.764167 0.212572 -3.595 0.000325 ***
## Team.Goals -0.190602 0.026781 -7.117 1.10e-12 ***
## Team.Behinds -0.064705 0.022068 -2.932 0.003367 **
## Opponent.Goals -0.317388 0.029035 -10.931 < 2e-16 ***
## Opponent.Behinds -0.011389 0.021630 -0.527 0.598505
## Kick.Ins.Not.Play.On 0.018278 0.024660 0.741 0.458573
## Kick.Ins.Play.On 0.002166 0.024892 0.087 0.930653
## Hitouts.To.Advantage 0.140068 0.037401 3.745 0.000180 ***
## Hitouts.No.Advantage 0.159450 0.038077 4.188 2.82e-05 ***
## Position_KEY_DEFENDER 0.074896 0.153224 0.489 0.624981
## Position_KEY_FORWARD 0.166673 0.104120 1.601 0.109426
## Position_MEDIUM_DEFENDER -0.322276 0.106155 -3.036 0.002398 **
## Position_RUCK -0.329731 0.139566 -2.363 0.018150 *
## Position_MEDIUM_FORWARD 0.023392 0.098944 0.236 0.813109
## Position_MIDFIELDER 0.374093 0.057253 6.534 6.40e-11 ***
## Position_MIDFIELDER_FORWARD -0.006422 0.113393 -0.057 0.954833
## Coaches.Votes_1 1.019329 0.087746 11.617 < 2e-16 ***
## Coaches.Votes_2 1.295514 0.081375 15.920 < 2e-16 ***
## Coaches.Votes_3 1.453939 0.082087 17.712 < 2e-16 ***
## Coaches.Votes_4 1.806927 0.081769 22.098 < 2e-16 ***
## Coaches.Votes_5 2.127983 0.083878 25.370 < 2e-16 ***
## Coaches.Votes_6 2.395894 0.087927 27.249 < 2e-16 ***
## Coaches.Votes_7 2.485323 0.090491 27.465 < 2e-16 ***
## Coaches.Votes_8 2.783398 0.090439 30.776 < 2e-16 ***
## Coaches.Votes_9 3.362680 0.100198 33.560 < 2e-16 ***
## Coaches.Votes_10 4.010018 0.100740 39.806 < 2e-16 ***
## Max.Brownlow.Votes.Season 0.083329 0.053263 1.564 0.117705
## Brownlow.Votes.Previous.Season 0.007218 0.043422 0.166 0.867976
## Max.Coaches.Votes.Season -0.006793 0.056540 -0.120 0.904367
## Coaches.Votes.Previous.Season 0.003556 0.002644 1.345 0.178680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0|1 5.3840 0.2164 24.88
## 1|2 6.3042 0.2177 28.95
## 2|3 7.7023 0.2204 34.95
The summary of the fitted ordinal logistic regression model for predicting Brownlow Votes provides a detailed look at the various performance metrics and their importance in determining vote counts.
Key offensive metrics, such as Goals,
Effective.Kicks, and Effective.Handballs, show
strong positive coefficients and highly significant p-values. This
indicates that these actions greatly boost the chances of receiving
votes and serve as reliable predictors. While ineffective actions like
Ineffective.Kicks and Ineffective.Handballs
also have positive coefficients, their impact is somewhat less
pronounced, though they still hold statistical significance.
Defensive metrics are equally important. Tackles (inside and outside
the 50-metre arc) and intercepts (possessions and marks) both positively
contribute to vote-getting. This highlights the value of defensive
efforts in gaining recognition. On the flip side, negative coefficients
for actions like Frees.Against suggest that penalties and
failures can hinder the likelihood of receiving votes.
Team performance metrics play a crucial role as well. The model indicates that winning a match and the nature of losses significantly influence vote counts, with losses having a notably negative impact. Team scoring and opponent scoring further illustrate how overall game outcomes affect individual recognition.
As I assumed, the model also takes into account a player’s primary position, with midfielders more likely to receive votes than players in other positions, such as rucks and medium defenders.
But most eye-opening is that coaches votes have a significant relationship with vote-getting on Brownlow Medal night, with every level of coaches votes (from 1 to 10) a significant variable in the model.
I’ve decided not to regularize or reduce the model because my primary goal was to capture as much detailed information as possible from the available data. Since the Brownlow Medal voting process is subjective and influenced by a multitude of factors, I wanted to ensure that no potentially valuable information was excluded from the model. Regularizing or simplifying the model could have risked losing some of this nuance, potentially missing out on subtle patterns that contribute to how votes are awarded. By allowing the model to be more complex, I aimed to increase its ability to pick up on these smaller details, which may be critical in accurately predicting votes in tight races or less obvious matchups. In this case, maximizing the model’s flexibility was more important than trying to prevent overfitting, as I believe the richness of the data outweighs the potential risks of over-complication.
Overall, this model emphasizes the complex nature of player performance evaluation, weaving together offensive skills, defensive capabilities, team success, and external recognition to predict Brownlow Votes. The combination of significant variables and their highly significant p-values suggests that excelling in multiple areas of the game is essential for gaining votes.
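A note on reading the coefficient table above: the estimates are on a log-odds scale, so exponentiating a coefficient from a cumulative link model gives a (cumulative) odds ratio. Below is a self-contained sketch with simulated data - the names toy, x and the fitted value are hypothetical, not taken from my model:

```r
library(ordinal)
set.seed(1)

#Simulate an ordinal outcome that shifts to higher categories as x increases
n <- 2000
x <- rnorm(n)
latent <- 1.2 * x + rlogis(n)
votes <- cut(latent, breaks = c(-Inf, 1, 2, 3, Inf), labels = c("0", "1", "2", "3"))
toy <- data.frame(votes = votes, x = x)

fit <- clm(votes ~ x, data = toy, link = "logit")
#exp(coef) is the multiplicative change in the odds of landing in a
#higher category per one-unit increase in x (here, one standard deviation,
#since the real model's predictors are standardised)
exp(coef(fit)["x"])  #roughly exp(1.2), i.e. around 3.3
```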
The model testing process involves applying the trained model to a new season’s data to predict the Brownlow Medal votes for each player in each game. This testing phase assesses the model’s performance and ensures its predictions are reliable and accurate.
#Create dataset that encompasses the testing period
brownlow_test <- brownlow_data %>%
filter(Year == test_season)
#Print the table in a nice format
datatable(head(brownlow_test,100), options = list(scrollX = TRUE, pageLength = 5))
After a quick peek at the 2024 data, I standardised the numeric
values on the same scale as the training set, and created dummy
variables for the categorical variables Coaches.Votes,
Team.Result and Position.
#Standardise the numeric values, based on the training set
brownlow_test[,8:(ncol(brownlow_test)-4)] <- scale(brownlow_test[,8:(ncol(brownlow_test)-4)],
center=train_numeric_standardised.center,
scale=train_numeric_standardised.scale)
#Create dummy variables for Coaches Votes, Team Result and Position
brownlow_test <- dummy_cols(brownlow_test,
select_columns = c('Coaches.Votes', 'Team.Result', 'Position'),
remove_selected_columns = TRUE)
Next, the trained model is used to make predictions on the test data. The probabilities of receiving different vote counts (0, 1, 2, or 3 votes) are calculated for each player in each game.
#Use the predict function to make predictions on the test data
predictions <- predict(my_model,
newdata = brownlow_test %>% select(-Brownlow.Votes),
type = 'prob')
#Transform the predictions into a dataframe that is readable
predictions_probability_matrix <- data.frame(matrix(unlist(predictions),
nrow = nrow(brownlow_test)))
#Change the column names of the predictions_probability_matrix dataframe
colnames(predictions_probability_matrix) <- c("Votes.0", "Votes.1", "Votes.2", "Votes.3")
#Bind the columns of brownlow_test dataframe with predictions_probability_matrix
brownlow_test_predictions <- cbind.data.frame(brownlow_test, predictions_probability_matrix)
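The matrix(unlist(...)) step works because unlist() flattens the prediction probabilities column by column, so rebuilding with the right nrow restores the n x 4 table. A sketch using a hand-made probability list (the values are invented):

```r
#Two player-games, four vote categories; columns are P(0), P(1), P(2), P(3)
fake_predictions <- list(fit = matrix(c(0.70, 0.60,   #P(0 votes)
                                        0.20, 0.20,   #P(1 vote)
                                        0.06, 0.10,   #P(2 votes)
                                        0.04, 0.10),  #P(3 votes)
                                      nrow = 2))

#unlist() flattens column-major; matrix(..., nrow = 2) rebuilds the table
m <- data.frame(matrix(unlist(fake_predictions), nrow = 2))
colnames(m) <- c("Votes.0", "Votes.1", "Votes.2", "Votes.3")
rowSums(m)  #each row sums to 1, as probabilities should
```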
Expected votes for each player for each game are then calculated:
#Calculate the "expected" votes a player should receive each game according to the model
brownlow_test_predictions$Expected.Votes <- 1*brownlow_test_predictions$Votes.1 +
2*brownlow_test_predictions$Votes.2 +
3*brownlow_test_predictions$Votes.3
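The expected votes are just the mean of the predicted vote distribution. For a single made-up probability row:

```r
#Hypothetical predicted probabilities for one player-game
p <- c(Votes.0 = 0.55, Votes.1 = 0.15, Votes.2 = 0.15, Votes.3 = 0.15)

#Expected votes = sum of each vote value weighted by its probability
expected <- 1 * p["Votes.1"] + 2 * p["Votes.2"] + 3 * p["Votes.3"]
unname(expected)  #0.9 expected votes
```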
Let’s see which players had the highest “Expected Votes” games according to the model:
#Ordering player's games according to their "expected" votes
expected_votes <- brownlow_test_predictions %>%
select(Game.ID, Round.Number, Player, Team.Name, Opponent.Name, Expected.Votes) %>%
arrange(desc(Expected.Votes))
#Presenting this table in a nice format
datatable(expected_votes, options = list(scrollX = TRUE, pageLength = 10))
So, according to the model, Nick Daicos’ round 24 performance against
Melbourne was considered the best game according to
Expected.Votes, followed by a number of notable individual
performances from other superstars of the competition.
However, the way the Brownlow Medal works, only a single 3 vote, a
single 2 vote, and a single 1 vote can be given out each game. Using
Expected.Votes, we can assign the 3, 2 and 1 to the top
three players by Expected.Votes in each game.
We will also compare this to the official votes given out by the
umpires at the end of each game (the column
Brownlow.Votes).
#Assign the 3, 2 and 1 votes to the top 3 "expected" votes players of each game
round.by.round.votes <- brownlow_test_predictions %>%
group_by(Game.ID) %>%
top_n(3, Expected.Votes) %>%
mutate(Predicted.Votes = order(order(Expected.Votes, Player, decreasing=FALSE))) %>%
select(Game.ID, Round.Number, Player, Team.Name,
Opponent.Name, Predicted.Votes, Expected.Votes) %>%
arrange(Game.ID, desc(Predicted.Votes))
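The order(order(...)) idiom used above converts the three Expected.Votes values in each game into ranks 1 to 3, so the player with the highest expected votes is assigned the 3. A quick illustration with invented values:

```r
#Three players' expected votes in one hypothetical game
ev <- c(2.1, 2.8, 1.4)

#order(order(ev)) returns each value's rank in ascending order,
#so the highest Expected.Votes (2.8) receives rank 3
order(order(ev))  #2 3 1
```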
#Get the actual assigned Brownlow votes for each game
brownlow_votes <- brownlow_data %>%
filter(Year == test_season) %>%
select(Game.ID, Round.Number, Player, Team.Name, Opponent.Name, Brownlow.Votes)
#Join the Predicted votes with the Actual votes
round.by.round.votes <- full_join(round.by.round.votes,
brownlow_votes) %>%
arrange(Game.ID, desc(Predicted.Votes))
#Replace NA predicted votes with a 0
round.by.round.votes$Predicted.Votes <- replace(round.by.round.votes$Predicted.Votes,
is.na(round.by.round.votes$Predicted.Votes),
0)
Let’s display our predicted votes for each game in a nice format:
#Filter round.by.round.votes to only see players predicted to poll a vote in each game
game_vote_predictions <- round.by.round.votes %>%
filter(Predicted.Votes %in% c(1,2,3))
#Present the table in a nice format
datatable(game_vote_predictions, options = list(scrollX = TRUE, pageLength = 3))
Using this, we can identify how well the model performs compared to the actual votes given by the umpires.
#Convert Predicted.Votes to a factor variable
round.by.round.votes$Predicted.Votes <- as.factor(round.by.round.votes$Predicted.Votes)
#Present the confusion matrix
confusion_matrix <- confusionMatrix(round.by.round.votes$Predicted.Votes, round.by.round.votes$Brownlow.Votes)
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 8660  121   86   35
##          1  127   36   29   15
##          2   79   28   58   42
##          3   36   22   34  115
## 
## Overall Statistics
## 
##                Accuracy : 0.9313
##                  95% CI : (0.9261, 0.9363)
##     No Information Rate : 0.9348
##     P-Value [Acc > NIR] : 0.9170
## 
##                   Kappa : 0.4495
## 
##  Mcnemar's Test P-Value : 0.8525
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.9728  0.17391 0.280193  0.55556
## Specificity            0.6103  0.98164 0.984006  0.99012
## Pos Pred Value         0.9728  0.17391 0.280193  0.55556
## Neg Pred Value         0.6103  0.98164 0.984006  0.99012
## Prevalence             0.9348  0.02174 0.021737  0.02174
## Detection Rate         0.9094  0.00378 0.006091  0.01208
## Detection Prevalence   0.9348  0.02174 0.021737  0.02174
## Balanced Accuracy      0.7916  0.57778 0.632100  0.77284
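As a sanity check on the headline numbers, the overall Accuracy and Kappa can be recomputed directly from the printed cell counts. This sketch rebuilds the 4x4 table in base R (no caret required) and applies the standard definitions of observed and chance agreement:

```r
#Rebuild the confusion matrix from the printed output (rows = prediction, cols = reference)
cm <- matrix(c(8660, 121, 86,  35,
               127,  36,  29,  15,
               79,   28,  58,  42,
               36,   22,  34, 115),
             nrow = 4, byrow = TRUE)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                     #observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  #agreement expected by chance
kappa <- (accuracy - expected) / (1 - expected)   #Cohen's Kappa

round(accuracy, 4)  #0.9313
round(kappa, 4)     #0.4495
```

Both values match the confusionMatrix() output above, confirming that the high raw accuracy is driven largely by the dominant 0-vote class, while Kappa gives a more chance-corrected view of the agreement.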
I can then put the round-by-round votes together into a predicted leaderboard based on the model's votes.
summary_table <- game_vote_predictions %>%
  ungroup() %>%
  select(Player, Team.Name, Round.Number, Predicted.Votes) %>%
  pivot_wider(names_from = Round.Number, values_from = Predicted.Votes, names_prefix = "R.") %>%
  mutate(across(starts_with("R."), as.numeric)) %>%
  mutate(
    Predicted_Votes = rowSums(select(., starts_with("R.")), na.rm = TRUE),
    Count_3_Votes = rowSums(select(., starts_with("R.")) == 3, na.rm = TRUE),
    Count_2_Votes = rowSums(select(., starts_with("R.")) == 2, na.rm = TRUE),
    Count_1_Vote = rowSums(select(., starts_with("R.")) == 1, na.rm = TRUE),
    Games_Polled = rowSums(select(., starts_with("R.")) > 0, na.rm = TRUE)
  ) %>%
  select(1:2, 28, 3:27, 29:32) %>%
  arrange(desc(Predicted_Votes), desc(Count_3_Votes), desc(Count_2_Votes), desc(Count_1_Vote))

datatable(summary_table, options = list(scrollX = TRUE, pageLength = 10))
We will compare these predictions to the actual vote count and see how well our model performs.
We will remove all players who were both predicted to receive 0 votes and didn’t receive an official vote, as this will give us a truer sense of how accurate our model is at predicting the final vote count.
#Create a vote tally based on the predicted votes of the model
vote_tally <- game_vote_predictions %>%
  group_by(Player, Team.Name) %>%
  summarise(Predicted.Votes = 3 * sum(Predicted.Votes == 3) +
              2 * sum(Predicted.Votes == 2) +
              1 * sum(Predicted.Votes == 1), .groups = 'drop') %>%
  arrange(desc(Predicted.Votes))

#Create a vote tally based on the actual votes awarded by the umpires
actual_votes <- brownlow_data %>%
  filter(Year == test_season) %>%
  group_by(Player, Team.Name) %>%
  summarise(Actual.Votes = 3 * sum(Brownlow.Votes == 3) +
              2 * sum(Brownlow.Votes == 2) +
              1 * sum(Brownlow.Votes == 1), .groups = 'drop') %>%
  arrange(desc(Actual.Votes))

#Join the predicted vote tally and the actual vote tally for comparison
vote_tally <- full_join(vote_tally, actual_votes) %>%
  arrange(desc(Predicted.Votes), desc(Actual.Votes))

#Replace NA predicted votes with a 0
vote_tally$Predicted.Votes <- replace(vote_tally$Predicted.Votes,
                                      is.na(vote_tally$Predicted.Votes),
                                      0)

#Replace NA actual votes with a 0
vote_tally$Actual.Votes <- replace(vote_tally$Actual.Votes,
                                   is.na(vote_tally$Actual.Votes),
                                   0)

#Rank both tallies and drop players with no predicted or actual votes
vote_tally <- vote_tally %>%
  mutate(Predicted.Rank = rank(-Predicted.Votes, ties.method = 'min'),
         Actual.Rank = rank(-Actual.Votes, ties.method = 'min')) %>%
  filter(Predicted.Votes + Actual.Votes != 0)

#Present the table in a nice format for comparison
datatable(vote_tally, options = list(scrollX = TRUE))
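The rank(-x, ties.method = 'min') idiom used for the leaderboard ranks from highest tally to lowest and gives tied players the same (best available) rank, as in the official Brownlow standings. A quick illustration with made-up tallies:

```r
#Four hypothetical season tallies; two players tied on 38 votes
tallies <- c(45, 38, 38, 20)

#Negating the tallies makes rank() order from highest to lowest;
#ties.method = 'min' gives both tied players rank 2, and the next player rank 4
rank(-tallies, ties.method = 'min')  #1 2 2 4
```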
The analysis of the predictions compared to the actual votes for the Brownlow Medal reveals several important insights into the model’s performance.
Our model has done reasonably well. The highlight is that it correctly predicted Patrick Cripps to break the all-time vote tally record (although not by as large a margin as he actually did!), and it also had Nick Daicos finishing second with a high vote count.
In terms of error metrics for the vote counts, the Mean Absolute Error (MAE) is 2.008, and the Root Mean Square Error (RMSE) is 2.631.
The MAE indicates that, on average, the predicted season tallies differ from the actual tallies by about 2.008 votes. This relatively low average error demonstrates that the model's predictions are quite close to the actual votes. The RMSE, which penalises larger errors more heavily, sits only slightly higher at 2.631, indicating that significant prediction errors are infrequent.
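For reference, the MAE and RMSE quoted above can be computed straight from the predicted and actual columns of the vote tally. This minimal sketch uses a small made-up tally for five players rather than the real vote_tally data:

```r
#Hypothetical predicted and actual season tallies for five players
tally <- data.frame(Predicted.Votes = c(45, 38, 30, 24, 20),
                    Actual.Votes    = c(45, 41, 28, 26, 19))

errors <- tally$Predicted.Votes - tally$Actual.Votes
mae  <- mean(abs(errors))     #mean absolute error
rmse <- sqrt(mean(errors^2))  #root mean square error

mae             #1.6
round(rmse, 3)  #1.897
```

RMSE is always at least as large as MAE, and the gap between the two grows with the presence of a few large misses, which is why the small gap reported above suggests errors are spread fairly evenly.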
#Plot predicted votes against actual votes
ggplot(vote_tally, aes(x = Predicted.Votes, y = Actual.Votes)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red') +
  ggtitle("Predicted vs Actual Votes") +
  xlab("Predicted Votes") +
  ylab("Actual Votes")
Furthermore, the correlation between the predicted and actual votes is approximately 0.928. This high correlation indicates a strong linear relationship: as the predicted votes increase, the actual votes tend to increase as well, showing that the model is capturing the umpires' voting patterns effectively.
In conclusion, the development and evaluation of the ordinal logistic regression model for predicting Brownlow Medal votes showed its effectiveness in providing reliable predictions based on a solid dataset of player performances. By incorporating a range of offensive and defensive statistics, the model captures key aspects of player performance, allowing for a nuanced understanding of what contributes to Brownlow recognition. Key metrics like goals, effective kicks, and tackles are strong predictors of votes, highlighting the importance of both offensive and defensive contributions to a player’s overall impact on the game.
Furthermore, the model’s integration of Coaches’ votes further enhances its predictive power, demonstrating that players who are recognised by the coaches are more likely to receive Brownlow votes. Additionally, variables capturing past recognition, such as previous Brownlow and Coaches’ votes, demonstrate that players who have consistently performed well and been acknowledged for their efforts in the past are more likely to continue receiving votes.
The model’s predictive abilities, assessed through game-by-game analysis and overall season performance, indicate a strong foundation for future enhancements. By refining the model and incorporating real-time data, I can continue improving its accuracy and reliability in predicting Brownlow Medal votes.
Overall, this project highlights the complexity of assessing player performance in the AFL and demonstrates how statistical modelling can enhance our understanding of the game and its players.