In this document, I constructed an ordinal logistic regression model to predict the 2024 AFL Brownlow Medal votes. To do so, I used a dataset I compiled from various trusted sources, including the fitzRoy R package, afl.com.au, AFL Tables, and the AFL Coaches Association. I took the time to prepare and clean the data, transforming it to lay a solid foundation for building a robust predictive model.
The model I developed this year is similar to my 2023 effort, but
with the addition of a Position variable, which records
each player's primary position according to the AFL website. I added
this after finding that some non-midfielders - particularly
ruckmen such as Rowan Marshall, Max Gawn and Tim English - had heavily
inflated predicted votes compared to what they actually polled last
year. As the Brownlow Medal is considered a midfielder's award (for good
reason), I needed to incorporate playing position in some
capacity.
For my 2024 Brownlow prediction modelling, I trained the model on home-and-away season data from 2015-2023, with the goal to forecast how many Brownlow votes a player might receive based on their performance metrics.
Once the model was trained, I shifted to the testing phase, where I evaluated its performance using player data from 2024. I conducted a game-by-game analysis to compare my predictions with the actual Brownlow votes awarded by umpires, providing insights into how well the model performed. I also took a broader look at season-wide predictions, comparing my predicted votes with the actual counts to assess the model’s accuracy and predictive capabilities.
Let’s dive in!
To start, I loaded the key libraries needed for data analysis and modelling. These libraries were essential for handling tasks like data manipulation, statistical modelling, and formatting the output:
library(tidyverse) #Collection of R packages for data science
library(ordinal) #Package for fitting ordinal regression models
library(knitr) #Package for dynamic report generation in R
library(unglue) #Package for string manipulation
library(fastDummies) #Package for creating dummy variables
library(writexl) #Package for exporting data to Excel files
library(DT) #Package for creating interactive tables
library(caret) #Package to help create a confusion matrix
library(ggplot2) #Package for visual analysis and plotting
I began by reading in the AFL Brownlow Medal data. The dataset
includes detailed player statistics and was sourced from reliable
outlets like the fitzRoy R package, afl.com.au, AFL Tables,
and the AFL Coaches Association (AFLCA). I have already gathered, joined
and cleaned the data into the file brownlow_data_2024.csv so
that it is ready for analysis!
#Read in the data
brownlow_data <- read.csv("brownlow_data_2024.csv")
To get a better understanding of the dataset, I took a quick look at the first 100 rows:
#Print the table in a nice format
datatable(head(brownlow_data,100), options = list(scrollX = TRUE, pageLength = 5))
I also displayed the column names to get a sense of the variables included in the dataset:
#Display the column names
colnames(brownlow_data)
## [1] "Team.Name" "Player.Jumper.Number"
## [3] "Year" "Round.Number"
## [5] "Player.ID" "Game.ID"
## [7] "First.Name" "Surname"
## [9] "Photo" "Opponent.Name"
## [11] "Team.Status" "Venue.Name"
## [13] "Venue.State" "Time.On.Ground.Percentage"
## [15] "Goals" "Behinds"
## [17] "Kicks" "Handballs"
## [19] "Disposals" "Marks"
## [21] "Bounces" "Tackles"
## [23] "Contested.Possessions" "Uncontested.Possessions"
## [25] "Total.Possessions" "Inside.50s"
## [27] "Marks.Inside.50" "Contested.Marks"
## [29] "Hitouts" "One.Percenters"
## [31] "Disposal.Efficiency.Percentage" "Clangers"
## [33] "Frees.For" "Frees.Against"
## [35] "DreamTeam.Points" "Rebound.50s"
## [37] "Goal.Assists" "Goal.Accuracy"
## [39] "Rating.Points" "Turnovers"
## [41] "Intercepts" "Tackles.Inside.50"
## [43] "Shots.At.Goal" "Score.Involvements"
## [45] "Metres.Gained" "Centre.Clearances"
## [47] "Stoppage.Clearances" "Total.Clearances"
## [49] "Effective.Kicks" "Kicking.Efficiency.Percentage"
## [51] "Kick.To.Handball.Ratio" "Effective.Disposals"
## [53] "Marks.On.Lead" "Intercept.Marks"
## [55] "Contested.Possession.Rate" "Hitouts.To.Advantage"
## [57] "Hitout.Win.Percentage" "Hitout.To.Advantage.Rate.Percentage"
## [59] "Ground.Ball.Gets" "F50.Ground.Ball.Gets"
## [61] "Score.Launches" "Pressure.Acts"
## [63] "Defensive.Half.Pressure.Acts" "Spoils"
## [65] "Ruck.Contests" "Contest.Defense.One.On.Ones"
## [67] "Contest.Defense.Losses" "Contest.Defense.Loss.Percentage"
## [69] "Contest.Offence.One.On.Ones" "Contest.Offence.Wins"
## [71] "Contest.Offence.Wins.Percentage" "Centre.Bounce.Attendances"
## [73] "Kick.Ins" "Kick.Ins.Play.On"
## [75] "Team.Goals" "Team.Behinds"
## [77] "Team.Total" "Opponent.Goals"
## [79] "Opponent.Behinds" "Opponent.Total"
## [81] "Team.Result" "Position"
## [83] "Uncontested.Marks" "Effective.Handballs"
## [85] "Margin" "Coaches.Votes"
## [87] "Date" "Brownlow.Votes"
There are 88 variables that capture different aspects of each player's game performance. To simplify the analysis and avoid redundancy, I created new variables and removed overlapping ones. For instance:
- The Disposals variable isn't needed, since we already have Kicks and Handballs separately.
- I created an Ineffective.Kicks variable by subtracting Effective.Kicks from Kicks, and then removed the original Kicks variable.
Now, let's go ahead with these transformations and other necessary pre-processing steps:
#Add a score involvements variable that doesn't include direct goals, behinds or goal assists
brownlow_data$Involvements.No.Scores.Or.Assists <- brownlow_data$Score.Involvements -
brownlow_data$Goals -
brownlow_data$Behinds -
brownlow_data$Goal.Assists
#Add a hitouts variable that doesn't include those that go to advantage
brownlow_data$Hitouts.No.Advantage <- brownlow_data$Hitouts -
brownlow_data$Hitouts.To.Advantage
#Add an Outside F50 ground ball gets variable
brownlow_data$Outside.50.Ground.Ball.Gets <- brownlow_data$Ground.Ball.Gets -
brownlow_data$F50.Ground.Ball.Gets
#Add a contested offence 1-on-1 variable that doesn't include those won
brownlow_data$Contest.Offence.One.On.Ones.Not.Won <- brownlow_data$Contest.Offence.One.On.Ones -
brownlow_data$Contest.Offence.Wins
#Add a contested defence 1-on-1 variable that doesn't include those lost
brownlow_data$Contest.Defense.One.On.Ones.Not.Lost <- brownlow_data$Contest.Defense.One.On.Ones -
brownlow_data$Contest.Defense.Losses
#Add an ineffective kicks variable
brownlow_data$Ineffective.Kicks <- brownlow_data$Kicks -
brownlow_data$Effective.Kicks
#Add an ineffective handballs variable
brownlow_data$Ineffective.Handballs <- brownlow_data$Handballs -
brownlow_data$Effective.Handballs
#Add a shots at goal variable that doesn't include those that go in for a goal or behind
brownlow_data$Shots.No.Score <- brownlow_data$Shots.At.Goal -
brownlow_data$Goals -
brownlow_data$Behinds
#Add a forward half pressure acts variable
brownlow_data$Forward.Half.Pressure.Acts <- brownlow_data$Pressure.Acts -
brownlow_data$Defensive.Half.Pressure.Acts
#Add a kick ins variable that doesn't include those where the kicker plays on out of the goal square
brownlow_data$Kick.Ins.Not.Play.On <- brownlow_data$Kick.Ins -
brownlow_data$Kick.Ins.Play.On
#Add a contested possessions variable that doesn't include contested marks
brownlow_data$Contested.Possessions.No.Mark <- brownlow_data$Contested.Possessions -
brownlow_data$Contested.Marks
#Add an uncontested possessions variable that doesn't include uncontested marks
brownlow_data$Uncontested.Possessions.No.Mark <- brownlow_data$Uncontested.Possessions -
brownlow_data$Uncontested.Marks
#Add an intercepts variable that doesn't include intercept marks
brownlow_data$Intercepts.No.Mark <- brownlow_data$Intercepts -
brownlow_data$Intercept.Marks
#Add a tackles variable that doesn't include tackles inside 50
brownlow_data$Tackles.Outside.50 <- brownlow_data$Tackles -
brownlow_data$Tackles.Inside.50
#Add a Player variable which pastes a player's first name and surname
brownlow_data$Player <- paste(brownlow_data$First.Name, brownlow_data$Surname)
To boost the predictive power of the model, I also added variables that capture votes from previous seasons. Players who have polled well in the past often continue to perform strongly.
#Previous season Brownlow Votes and Coaches Votes variables
prev_season_votes <- brownlow_data %>%
group_by(Player.ID, Year) %>%
summarise(
Brownlow.Votes.Previous.Season = sum(Brownlow.Votes, na.rm = TRUE),
Coaches.Votes.Previous.Season = sum(Coaches.Votes, na.rm = TRUE)) %>%
mutate(Year = Year + 1)
#Join these new variables to the brownlow_data
brownlow_data <- full_join(brownlow_data,
prev_season_votes) %>%
filter(!is.na(First.Name))
#Replace NA Brownlow Votes with a 0
brownlow_data$Brownlow.Votes.Previous.Season <- replace(brownlow_data$Brownlow.Votes.Previous.Season,
is.na(brownlow_data$Brownlow.Votes.Previous.Season),
0)
#Replace NA Coaches Votes with a 0
brownlow_data$Coaches.Votes.Previous.Season <- replace(brownlow_data$Coaches.Votes.Previous.Season,
is.na(brownlow_data$Coaches.Votes.Previous.Season),
0)
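The `Year + 1` shift above is what turns a season's totals into "previous season" features. Here's a minimal sketch of the mechanic on made-up data (the player and vote values are hypothetical):

```r
library(dplyr)

#One hypothetical player's votes across two seasons
toy <- data.frame(Player.ID = 1, Year = c(2022, 2023), Brownlow.Votes = c(10, 20))

#Summarise per season, then shift Year forward by one
prev <- toy %>%
  group_by(Player.ID, Year) %>%
  summarise(Brownlow.Votes.Previous.Season = sum(Brownlow.Votes), .groups = "drop") %>%
  mutate(Year = Year + 1)

#Joining attaches the 2022 total (10) to the player's 2023 row.
#The 2022 row gets NA (later replaced with 0), and the full join creates
#a spurious 2024 row with no game stats - which is why the real code
#filters out rows with a missing First.Name
joined <- full_join(toy, prev, by = c("Player.ID", "Year"))
joined
```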
I’ve also calculated the maximum votes a player received in any season prior to the one in which the game was played (dating back to 2015):
#Max Brownlow Votes and Coaches Votes in a season variables
#Create an empty data frame to collect the results
Max.Brownlow.Votes.Season_prior <- data.frame()
#Loop through each season of available data to get the most Brownlow
#Votes and Coaches Votes a player has received prior to that year
for(i in min(brownlow_data$Year):max(brownlow_data$Year)){
Max.Brownlow.Votes.Seasons_prior_i <- brownlow_data %>%
filter(Year <= i) %>%
group_by(Player.ID, Year) %>%
summarise(Brownlow.Votes.Previous.Season = sum(Brownlow.Votes, na.rm = TRUE),
Coaches.Votes.Previous.Season = sum(Coaches.Votes, na.rm = TRUE)) %>%
group_by(Player.ID) %>%
summarise(Max.Brownlow.Votes.Season = max(Brownlow.Votes.Previous.Season),
Max.Coaches.Votes.Season = max(Coaches.Votes.Previous.Season),
Year = i + 1)
Max.Brownlow.Votes.Season_prior <- bind_rows(Max.Brownlow.Votes.Season_prior, Max.Brownlow.Votes.Seasons_prior_i)
}
#Join these new variables to the brownlow_data
brownlow_data <- full_join(brownlow_data, Max.Brownlow.Votes.Season_prior) %>%
filter(!is.na(First.Name))
#Replace NA Brownlow Votes with a 0
brownlow_data$Max.Brownlow.Votes.Season <- replace(brownlow_data$Max.Brownlow.Votes.Season,
is.na(brownlow_data$Max.Brownlow.Votes.Season),
0)
#Replace NA Coaches Votes with a 0
brownlow_data$Max.Coaches.Votes.Season <- replace(brownlow_data$Max.Coaches.Votes.Season,
is.na(brownlow_data$Max.Coaches.Votes.Season),
0)
It was important to also convert the Brownlow.Votes and
Coaches.Votes variables into factors. This ensures that R
treats them as categorical rather than continuous variables, which is
crucial for accurate interpretation and analysis when building the
predictive model.
#Convert Brownlow Votes variable to factor
brownlow_data$Brownlow.Votes <- as.factor(brownlow_data$Brownlow.Votes)
#Convert Coaches Votes variable to factor
brownlow_data$Coaches.Votes <- as.factor(brownlow_data$Coaches.Votes)
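One thing worth knowing here: clm() takes the ordering of the outcome categories from the factor's level order. Because the vote values are single digits, as.factor() happens to sort them into the correct ascending order, as this small self-contained check illustrates:

```r
#as.factor() sorts character levels alphabetically, which for the
#single-digit vote values coincides with the correct numeric order
votes <- as.factor(c(0, 3, 1, 0, 2))
levels(votes)  #"0" "1" "2" "3"

#Declaring the levels explicitly (and ordered) makes the intent unambiguous
votes_ord <- factor(votes, levels = c("0", "1", "2", "3"), ordered = TRUE)
```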
Next, I refined the dataset by keeping only the columns that are essential for the analysis.
#Select variables that will be used in the model
brownlow_data <- brownlow_data %>%
select(Year, Game.ID, Round.Number, Team.Name,
Opponent.Name, Player, Goals, Behinds,
Effective.Kicks, Effective.Handballs, Ineffective.Kicks, Ineffective.Handballs,
Bounces, Tackles.Inside.50, Tackles.Outside.50, Contested.Possessions.No.Mark,
Uncontested.Possessions.No.Mark, Inside.50s, Marks.Inside.50, Contested.Marks,
Uncontested.Marks, One.Percenters, Clangers, Frees.For,
Frees.Against, Rebound.50s, Goal.Assists, Rating.Points,
Turnovers, Intercepts.No.Mark, Involvements.No.Scores.Or.Assists, Metres.Gained,
Centre.Clearances, Stoppage.Clearances, Marks.On.Lead, Intercept.Marks,
Hitouts.To.Advantage, Hitouts.No.Advantage, Outside.50.Ground.Ball.Gets, F50.Ground.Ball.Gets,
Score.Launches, Forward.Half.Pressure.Acts, Defensive.Half.Pressure.Acts, Spoils,
Contest.Offence.One.On.Ones.Not.Won, Contest.Offence.Wins, Contest.Defense.One.On.Ones.Not.Lost,
Contest.Defense.Losses, Shots.No.Score, Time.On.Ground.Percentage, Team.Goals, Team.Behinds,
Opponent.Goals, Opponent.Behinds, Margin, Kick.Ins.Not.Play.On,
Kick.Ins.Play.On, Max.Brownlow.Votes.Season, Max.Coaches.Votes.Season, Brownlow.Votes.Previous.Season,
Coaches.Votes.Previous.Season, Coaches.Votes, Team.Result, Position, Brownlow.Votes)
This updated dataset, with the newly created and refined variables, will serve as the foundation for developing the predictive model.
#Present the enhanced dataset in a nice format
datatable(head(brownlow_data,100), options = list(scrollX = TRUE, pageLength = 5))
To begin, I specified the time periods for training and testing the model. The training period covered data from 2015 to 2023, while the testing period focused on the 2024 season:
#First season of training period
train_start_season <- 2015
#Last season of training period
train_end_season <- 2023
#Season to predict Brownlow Votes
test_season <- 2024
Next, I created a training dataset by filtering the data for the specified seasons from 2015 to 2023.
#Create dataset that encompasses the training period
brownlow_train <- brownlow_data %>%
filter(Year >= train_start_season, Year <= train_end_season)
Next, I standardised the numeric statistics in the training dataset to ensure that each variable has a mean of 0 and a standard deviation of 1. Standardising in this way helps improve the model’s performance by putting all variables on a similar scale.
#Standardise the numeric values
train_numeric_standardised <- scale(brownlow_train[,8:(ncol(brownlow_train)-4)])
brownlow_train[,8:(ncol(brownlow_train)-4)] <- train_numeric_standardised
#Store the center and scale as this will be used to standardise the testing set
train_numeric_standardised.center<-attr(train_numeric_standardised,"scaled:center")
train_numeric_standardised.scale<-attr(train_numeric_standardised,"scaled:scale")
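Storing the training centre and scale matters because the test set must be standardised using the training set's statistics, not its own. A tiny illustration with made-up numbers:

```r
#A training column with mean 2.5
train_x <- matrix(c(1, 2, 3, 4), ncol = 1)
s <- scale(train_x)
ctr <- attr(s, "scaled:center")  #2.5
scl <- attr(s, "scaled:scale")   #sd of the training column

#New test values are expressed in *training* units, so a test value of 2.5
#maps exactly to 0 even if the test set's own mean is different
test_x <- matrix(c(2.5, 5), ncol = 1)
scale(test_x, center = ctr, scale = scl)
```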
Next, I converted categorical variables like
Coaches.Votes, Team.Result and
Position into dummy variables. These dummy variables are
binary (0 or 1) and allow the model to process categorical data more
effectively:
#Create dummy variables for Coaches Votes and Team Result
brownlow_train <- dummy_cols(brownlow_train,
select_columns = c('Coaches.Votes', 'Team.Result', 'Position'),
remove_selected_columns = TRUE)
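For reference, dummy_cols() names the new columns by pasting the original column name, an underscore, and each category value - which is where names like Position_RUCK in the model formula come from. A small sketch on made-up rows:

```r
library(fastDummies)

#Hypothetical positions for three players
toy <- data.frame(Position = c("MIDFIELDER", "RUCK", "MIDFIELDER"))

#Each category becomes a 0/1 column; the original column is dropped
dummy_cols(toy, select_columns = "Position", remove_selected_columns = TRUE)
```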
To predict the number of Brownlow votes a player might receive based on their performance statistics, I used an ordinal logistic regression model. This model takes into account various metrics from the cleaned and enhanced dataset.
The code below defines the model formula and fits the ordinal
logistic regression model using the clm function from the
ordinal package. This package is tailored for fitting
cumulative link models, including ordinal logistic regression. The
clm function estimates the model parameters through maximum
likelihood estimation, which is well-suited for the ordered nature of
the response variable. This is particularly important for ordinal
outcomes, like the number of Brownlow votes, where the categories are
ranked.
my_model <- clm(Brownlow.Votes ~
Goals + Behinds + Effective.Kicks + Effective.Handballs +
Ineffective.Kicks + Ineffective.Handballs + Bounces +
Tackles.Inside.50 + Tackles.Outside.50 + Contested.Possessions.No.Mark +
Uncontested.Possessions.No.Mark + Inside.50s + Marks.Inside.50 +
Contested.Marks + Uncontested.Marks + One.Percenters +
Clangers + Frees.For + Frees.Against +
Rebound.50s + Goal.Assists + Rating.Points +
Turnovers + Intercepts.No.Mark + Involvements.No.Scores.Or.Assists +
Metres.Gained + Centre.Clearances + Stoppage.Clearances +
Marks.On.Lead + Intercept.Marks + Outside.50.Ground.Ball.Gets +
F50.Ground.Ball.Gets + Score.Launches + Forward.Half.Pressure.Acts +
Defensive.Half.Pressure.Acts + Spoils + Contest.Offence.One.On.Ones.Not.Won +
Contest.Offence.Wins + Contest.Defense.One.On.Ones.Not.Lost + Contest.Defense.Losses +
Shots.No.Score + Time.On.Ground.Percentage + Team.Result_W +
Team.Result_L + Team.Goals + Team.Behinds +
Opponent.Goals + Opponent.Behinds + Kick.Ins.Not.Play.On +
Kick.Ins.Play.On + Hitouts.To.Advantage + Hitouts.No.Advantage +
Position_KEY_DEFENDER + Position_KEY_FORWARD + Position_MEDIUM_DEFENDER + Position_RUCK +
Position_MEDIUM_FORWARD + Position_MIDFIELDER + Position_MIDFIELDER_FORWARD +
Coaches.Votes_1 + Coaches.Votes_2 + Coaches.Votes_3 +
Coaches.Votes_4 + Coaches.Votes_5 + Coaches.Votes_6 +
Coaches.Votes_7 + Coaches.Votes_8 + Coaches.Votes_9 +
Coaches.Votes_10 + Max.Brownlow.Votes.Season + Brownlow.Votes.Previous.Season +
Max.Coaches.Votes.Season + Coaches.Votes.Previous.Season,
data = brownlow_train, link = "logit", #Use logistic link function
threshold = "flexible", #Allow flexible thresholds
Hess = TRUE, #Compute Hessian matrix (needed for the summary's standard errors)
maxIter = 100, #Allow up to 100 iterations
tol = 1e-6, #Set a small tolerance for convergence
contrasts = list(Team.Name = "contr.treatment"), #Specify contrasts for the 'Team' variable
na.action = na.omit)
Finally, I summarised the model to evaluate its performance and understand the significance of each predictor variable. The summary includes the coefficients, standard errors, z-values, and p-values for each predictor in the model.
#Print a summary of the model
summary(my_model)
## formula:
## Brownlow.Votes ~ Goals + Behinds + Effective.Kicks + Effective.Handballs + Ineffective.Kicks + Ineffective.Handballs + Bounces + Tackles.Inside.50 + Tackles.Outside.50 + Contested.Possessions.No.Mark + Uncontested.Possessions.No.Mark + Inside.50s + Marks.Inside.50 + Contested.Marks + Uncontested.Marks + One.Percenters + Clangers + Frees.For + Frees.Against + Rebound.50s + Goal.Assists + Rating.Points + Turnovers + Intercepts.No.Mark + Involvements.No.Scores.Or.Assists + Metres.Gained + Centre.Clearances + Stoppage.Clearances + Marks.On.Lead + Intercept.Marks + Outside.50.Ground.Ball.Gets + F50.Ground.Ball.Gets + Score.Launches + Forward.Half.Pressure.Acts + Defensive.Half.Pressure.Acts + Spoils + Contest.Offence.One.On.Ones.Not.Won + Contest.Offence.Wins + Contest.Defense.One.On.Ones.Not.Lost + Contest.Defense.Losses + Shots.No.Score + Time.On.Ground.Percentage + Team.Result_W + Team.Result_L + Team.Goals + Team.Behinds + Opponent.Goals + Opponent.Behinds + Kick.Ins.Not.Play.On + Kick.Ins.Play.On + Hitouts.To.Advantage + Hitouts.No.Advantage + Position_KEY_DEFENDER + Position_KEY_FORWARD + Position_MEDIUM_DEFENDER + Position_RUCK + Position_MEDIUM_FORWARD + Position_MIDFIELDER + Position_MIDFIELDER_FORWARD + Coaches.Votes_1 + Coaches.Votes_2 + Coaches.Votes_3 + Coaches.Votes_4 + Coaches.Votes_5 + Coaches.Votes_6 + Coaches.Votes_7 + Coaches.Votes_8 + Coaches.Votes_9 + Coaches.Votes_10 + Max.Brownlow.Votes.Season + Brownlow.Votes.Previous.Season + Max.Coaches.Votes.Season + Coaches.Votes.Previous.Season
## data: brownlow_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 77962 -12665.38 25482.76 8(0) 1.05e-11 7.1e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Goals 0.748480 0.032220 23.230 < 2e-16 ***
## Behinds -0.005403 0.021689 -0.249 0.803281
## Effective.Kicks 0.517579 0.062590 8.269 < 2e-16 ***
## Effective.Handballs 0.512193 0.064590 7.930 2.19e-15 ***
## Ineffective.Kicks 0.346748 0.040618 8.537 < 2e-16 ***
## Ineffective.Handballs 0.171015 0.027087 6.314 2.73e-10 ***
## Bounces 0.053101 0.015824 3.356 0.000791 ***
## Tackles.Inside.50 0.057025 0.019541 2.918 0.003520 **
## Tackles.Outside.50 0.185739 0.022008 8.440 < 2e-16 ***
## Contested.Possessions.No.Mark 0.276808 0.072790 3.803 0.000143 ***
## Uncontested.Possessions.No.Mark 0.064118 0.068858 0.931 0.351774
## Inside.50s -0.011091 0.024866 -0.446 0.655572
## Marks.Inside.50 0.068018 0.026507 2.566 0.010287 *
## Contested.Marks 0.142081 0.025711 5.526 3.28e-08 ***
## Uncontested.Marks 0.149506 0.042156 3.547 0.000390 ***
## One.Percenters -0.046792 0.053226 -0.879 0.379343
## Clangers 0.051468 0.027968 1.840 0.065731 .
## Frees.For 0.069251 0.020470 3.383 0.000717 ***
## Frees.Against -0.052661 0.024228 -2.174 0.029738 *
## Rebound.50s -0.002818 0.028648 -0.098 0.921654
## Goal.Assists 0.021439 0.017696 1.212 0.225697
## Rating.Points -0.043357 0.039141 -1.108 0.267985
## Turnovers 0.018373 0.025995 0.707 0.479691
## Intercepts.No.Mark -0.071584 0.029967 -2.389 0.016906 *
## Involvements.No.Scores.Or.Assists 0.021459 0.025260 0.850 0.395573
## Metres.Gained 0.091336 0.039719 2.300 0.021475 *
## Centre.Clearances 0.074735 0.020955 3.566 0.000362 ***
## Stoppage.Clearances 0.026953 0.025309 1.065 0.286880
## Marks.On.Lead 0.018376 0.019450 0.945 0.344779
## Intercept.Marks 0.127835 0.025842 4.947 7.54e-07 ***
## Outside.50.Ground.Ball.Gets -0.101629 0.041913 -2.425 0.015320 *
## F50.Ground.Ball.Gets -0.070432 0.024448 -2.881 0.003965 **
## Score.Launches 0.002124 0.022268 0.095 0.924008
## Forward.Half.Pressure.Acts -0.041505 0.027195 -1.526 0.126962
## Defensive.Half.Pressure.Acts 0.034669 0.024956 1.389 0.164765
## Spoils 0.084240 0.060368 1.395 0.162882
## Contest.Offence.One.On.Ones.Not.Won -0.023461 0.024914 -0.942 0.346362
## Contest.Offence.Wins -0.051035 0.019276 -2.648 0.008105 **
## Contest.Defense.One.On.Ones.Not.Lost 0.004919 0.033735 0.146 0.884080
## Contest.Defense.Losses -0.017529 0.031525 -0.556 0.578181
## Shots.No.Score 0.009738 0.017739 0.549 0.583040
## Time.On.Ground.Percentage 0.122243 0.044717 2.734 0.006263 **
## Team.Result_W 0.058525 0.210503 0.278 0.780996
## Team.Result_L -0.764167 0.212572 -3.595 0.000325 ***
## Team.Goals -0.190602 0.026781 -7.117 1.10e-12 ***
## Team.Behinds -0.064705 0.022068 -2.932 0.003367 **
## Opponent.Goals -0.317388 0.029035 -10.931 < 2e-16 ***
## Opponent.Behinds -0.011389 0.021630 -0.527 0.598505
## Kick.Ins.Not.Play.On 0.018278 0.024660 0.741 0.458573
## Kick.Ins.Play.On 0.002166 0.024892 0.087 0.930653
## Hitouts.To.Advantage 0.140068 0.037401 3.745 0.000180 ***
## Hitouts.No.Advantage 0.159450 0.038077 4.188 2.82e-05 ***
## Position_KEY_DEFENDER 0.074896 0.153224 0.489 0.624981
## Position_KEY_FORWARD 0.166673 0.104120 1.601 0.109426
## Position_MEDIUM_DEFENDER -0.322276 0.106155 -3.036 0.002398 **
## Position_RUCK -0.329731 0.139566 -2.363 0.018150 *
## Position_MEDIUM_FORWARD 0.023392 0.098944 0.236 0.813109
## Position_MIDFIELDER 0.374093 0.057253 6.534 6.40e-11 ***
## Position_MIDFIELDER_FORWARD -0.006422 0.113393 -0.057 0.954833
## Coaches.Votes_1 1.019329 0.087746 11.617 < 2e-16 ***
## Coaches.Votes_2 1.295514 0.081375 15.920 < 2e-16 ***
## Coaches.Votes_3 1.453939 0.082087 17.712 < 2e-16 ***
## Coaches.Votes_4 1.806927 0.081769 22.098 < 2e-16 ***
## Coaches.Votes_5 2.127983 0.083878 25.370 < 2e-16 ***
## Coaches.Votes_6 2.395894 0.087927 27.249 < 2e-16 ***
## Coaches.Votes_7 2.485323 0.090491 27.465 < 2e-16 ***
## Coaches.Votes_8 2.783398 0.090439 30.776 < 2e-16 ***
## Coaches.Votes_9 3.362680 0.100198 33.560 < 2e-16 ***
## Coaches.Votes_10 4.010018 0.100740 39.806 < 2e-16 ***
## Max.Brownlow.Votes.Season 0.083329 0.053263 1.564 0.117705
## Brownlow.Votes.Previous.Season 0.007218 0.043422 0.166 0.867976
## Max.Coaches.Votes.Season -0.006793 0.056540 -0.120 0.904367
## Coaches.Votes.Previous.Season 0.003556 0.002644 1.345 0.178680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0|1 5.3840 0.2164 24.88
## 1|2 6.3042 0.2177 28.95
## 2|3 7.7023 0.2204 34.95
The summary of the fitted ordinal logistic regression model for predicting Brownlow Votes provides a detailed look at the various performance metrics and their importance in determining vote counts.
Key offensive metrics, such as Goals,
Effective.Kicks, and Effective.Handballs, show
strong positive coefficients and highly significant p-values. This
indicates that these actions greatly boost the chances of receiving
votes and serve as reliable predictors. While ineffective actions like
Ineffective.Kicks and Ineffective.Handballs
also have positive coefficients, their impact is somewhat less
pronounced, though they still hold statistical significance.
Defensive metrics are equally important. Tackles (inside and outside
the 50-metre arc) and intercepts (possessions and marks) both positively
contribute to vote-getting. This highlights the value of defensive
efforts in gaining recognition. On the flip side, negative coefficients
for actions like Frees.Against suggest that penalties and
failures can hinder the likelihood of receiving votes.
Team performance metrics play a crucial role as well. The model indicates that winning a match and the nature of losses significantly influence vote counts, with losses having a notably negative impact. Team scoring and opponent scoring further illustrate how overall game outcomes affect individual recognition.
As I assumed, the model also takes into account a player’s primary position, with midfielders more likely to receive votes than players in other positions, such as rucks and medium defenders.
But most eye-opening is that coaches votes have a significant relationship with vote-getting on Brownlow Medal night, with every level of coaches votes (from 1 to 10) a significant variable in the model.
I’ve decided not to regularize or reduce the model because my primary goal was to capture as much detailed information as possible from the available data. Since the Brownlow Medal voting process is subjective and influenced by a multitude of factors, I wanted to ensure that no potentially valuable information was excluded from the model. Regularizing or simplifying the model could have risked losing some of this nuance, potentially missing out on subtle patterns that contribute to how votes are awarded. By allowing the model to be more complex, I aimed to increase its ability to pick up on these smaller details, which may be critical in accurately predicting votes in tight races or less obvious matchups. In this case, maximizing the model’s flexibility was more important than trying to prevent overfitting, as I believe the richness of the data outweighs the potential risks of over-complication.
Overall, this model emphasizes the complex nature of player performance evaluation, weaving together offensive skills, defensive capabilities, team success, and external recognition to predict Brownlow Votes. The combination of significant variables and their highly significant p-values suggests that excelling in multiple areas of the game is essential for gaining votes.
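A note on reading the coefficient table above: the estimates are on a log-odds scale, so exponentiating a coefficient from a cumulative link model gives a (cumulative) odds ratio. Below is a self-contained sketch with simulated data - the names toy, x and the fitted value are hypothetical, not taken from my model:

```r
library(ordinal)
set.seed(1)

#Simulate an ordinal outcome that shifts to higher categories as x increases
n <- 2000
x <- rnorm(n)
latent <- 1.2 * x + rlogis(n)
votes <- cut(latent, breaks = c(-Inf, 1, 2, 3, Inf), labels = c("0", "1", "2", "3"))
toy <- data.frame(votes = votes, x = x)

fit <- clm(votes ~ x, data = toy, link = "logit")
#exp(coef) is the multiplicative change in the odds of landing in a
#higher category per one-unit increase in x (here, one standard deviation,
#since the real model's predictors are standardised)
exp(coef(fit)["x"])  #roughly exp(1.2), i.e. around 3.3
```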
The model testing process involves applying the trained model to a new season’s data to predict the Brownlow Medal votes for each player in each game. This testing phase assesses the model’s performance and ensures its predictions are reliable and accurate.
#Create dataset that encompasses the testing period
brownlow_test <- brownlow_data %>%
filter(Year == test_season)
#Print the table in a nice format
datatable(head(brownlow_test,100), options = list(scrollX = TRUE, pageLength = 5))
After a quick peek at the 2024 data, I standardised the numeric
values on the same scale as the training set, and created dummy
variables for the categorical variables Coaches.Votes,
Team.Result and Position.
#Standardise the numeric values, based on the training set
brownlow_test[,8:(ncol(brownlow_test)-4)] <- scale(brownlow_test[,8:(ncol(brownlow_test)-4)],
center=train_numeric_standardised.center,
scale=train_numeric_standardised.scale)
#Create dummy variables for Coaches Votes, Team Result and Position
brownlow_test <- dummy_cols(brownlow_test,
select_columns = c('Coaches.Votes', 'Team.Result', 'Position'),
remove_selected_columns = TRUE)
Next, the trained model is used to make predictions on the test data. The probabilities of receiving different vote counts (0, 1, 2, or 3 votes) are calculated for each player in each game.
#Use the predict function to make predictions on the test data
predictions <- predict(my_model,
newdata = brownlow_test %>% select(-Brownlow.Votes),
type = 'prob')
#Transform the predictions into a dataframe that is readable
predictions_probability_matrix <- data.frame(matrix(unlist(predictions),
nrow = nrow(brownlow_test)))
#Change the column names of the predictions_probability_matrix dataframe
colnames(predictions_probability_matrix) <- c("Votes.0", "Votes.1", "Votes.2", "Votes.3")
#Bind the columns of brownlow_test dataframe with predictions_probability_matrix
brownlow_test_predictions <- cbind.data.frame(brownlow_test, predictions_probability_matrix)
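The matrix(unlist(...)) step works because unlist() flattens the prediction probabilities column by column, so rebuilding with the right nrow restores the n x 4 table. A sketch using a hand-made probability list (the values are invented):

```r
#Two player-games, four vote categories; columns are P(0), P(1), P(2), P(3)
fake_predictions <- list(fit = matrix(c(0.70, 0.60,   #P(0 votes)
                                        0.20, 0.20,   #P(1 vote)
                                        0.06, 0.10,   #P(2 votes)
                                        0.04, 0.10),  #P(3 votes)
                                      nrow = 2))

#unlist() flattens column-major; matrix(..., nrow = 2) rebuilds the table
m <- data.frame(matrix(unlist(fake_predictions), nrow = 2))
colnames(m) <- c("Votes.0", "Votes.1", "Votes.2", "Votes.3")
rowSums(m)  #each row sums to 1, as probabilities should
```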
Expected votes for each player for each game are then calculated:
#Calculate the "expected" votes a player should receive each game according to the model
brownlow_test_predictions$Expected.Votes <- 1*brownlow_test_predictions$Votes.1 +
2*brownlow_test_predictions$Votes.2 +
3*brownlow_test_predictions$Votes.3
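The expected votes are just the mean of the predicted vote distribution. For a single made-up probability row:

```r
#Hypothetical predicted probabilities for one player-game
p <- c(Votes.0 = 0.55, Votes.1 = 0.15, Votes.2 = 0.15, Votes.3 = 0.15)

#Expected votes = sum of each vote value weighted by its probability
expected <- 1 * p["Votes.1"] + 2 * p["Votes.2"] + 3 * p["Votes.3"]
unname(expected)  #0.9 expected votes
```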
Let’s see which players had the highest “Expected Votes” games according to the model:
#Ordering player's games according to their "expected" votes
expected_votes <- brownlow_test_predictions %>%
select(Game.ID, Round.Number, Player, Team.Name, Opponent.Name, Expected.Votes) %>%
arrange(desc(Expected.Votes))
#Presenting this table in a nice format
datatable(expected_votes, options = list(scrollX = TRUE, pageLength = 10))
So, according to the model, Nick Daicos’ round 24 performance against
Melbourne was considered the best game according to
Expected.Votes, followed by a number of notable individual
performances from other superstars of the competition.
However, the way the Brownlow Medal works, only a single 3 vote, a
single 2 vote, and a single 1 vote can be given out each game. Using
Expected.Votes, we can assign the 3, 2 and 1 to the top
three players by Expected.Votes in each game.
We will also compare this to the official votes given out by the
umpires at the end of each game (the column
Brownlow.Votes).
#Assign the 3, 2 and 1 votes to the top 3 "expected" votes players of each game
round.by.round.votes <- brownlow_test_predictions %>%
group_by(Game.ID) %>%
top_n(3, Expected.Votes) %>%
mutate(Predicted.Votes = order(order(Expected.Votes, Player, decreasing=FALSE))) %>%
select(Game.ID, Round.Number, Player, Team.Name,
Opponent.Name, Predicted.Votes, Expected.Votes) %>%
arrange(Game.ID, desc(Predicted.Votes))
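The order(order(...)) idiom used above converts the three Expected.Votes values in each game into ranks 1 to 3, so the player with the highest expected votes is assigned the 3. A quick illustration with invented values:

```r
#Three players' expected votes in one hypothetical game
ev <- c(2.1, 2.8, 1.4)

#order(order(ev)) returns each value's rank in ascending order,
#so the highest Expected.Votes (2.8) receives rank 3
order(order(ev))  #2 3 1
```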
#Get the actual assigned Brownlow votes for each game
brownlow_votes <- brownlow_data %>%
filter(Year == test_season) %>%
select(Game.ID, Round.Number, Player, Team.Name, Opponent.Name, Brownlow.Votes)
#Join the Predicted votes with the Actual votes
round.by.round.votes <- full_join(round.by.round.votes,
brownlow_votes) %>%
arrange(Game.ID, desc(Predicted.Votes))
#Replace NA predicted votes with a 0
round.by.round.votes$Predicted.Votes <- replace(round.by.round.votes$Predicted.Votes,
is.na(round.by.round.votes$Predicted.Votes),
0)
Let’s display our predicted votes for each game in a nice format:
#Filter round.by.round.votes to only see players predicted to poll a vote in each game
game_vote_predictions <- round.by.round.votes %>%
filter(Predicted.Votes %in% c(1,2,3))
#Present the table in a nice format
datatable(game_vote_predictions, options = list(scrollX = TRUE, pageLength = 3))
Using this, we can identify how well the model performs compared to the actual votes given by the umpires.
#Convert Predicted.Votes to a factor variable
round.by.round.votes$Predicted.Votes <- as.factor(round.by.round.votes$Predicted.Votes)
#Present the confusion matrix
confusion_matrix <- confusionMatrix(round.by.round.votes$Predicted.Votes, round.by.round.votes$Brownlow.Votes)
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 8660  121   86   35
##          1  127   36   29   15
##          2   79   28   58   42
##          3   36   22   34  115
## 
## Overall Statistics
## 
##                Accuracy : 0.9313
##                  95% CI : (0.9261, 0.9363)
##     No Information Rate : 0.9348
##     P-Value [Acc > NIR] : 0.9170
## 
##                   Kappa : 0.4495
## 
##  Mcnemar's Test P-Value : 0.8525
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.9728  0.17391 0.280193  0.55556
## Specificity            0.6103  0.98164 0.984006  0.99012
## Pos Pred Value         0.9728  0.17391 0.280193  0.55556
## Neg Pred Value         0.6103  0.98164 0.984006  0.99012
## Prevalence             0.9348  0.02174 0.021737  0.02174
## Detection Rate         0.9094  0.00378 0.006091  0.01208
## Detection Prevalence   0.9348  0.02174 0.021737  0.02174
## Balanced Accuracy      0.7916  0.57778 0.632100  0.77284
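As a sanity check on the headline numbers, the overall Accuracy and Kappa can be recomputed directly from the printed cell counts. This sketch rebuilds the 4x4 table in base R (no caret required) and applies the standard definitions of observed and chance agreement:

```r
#Rebuild the confusion matrix from the printed output (rows = prediction, cols = reference)
cm <- matrix(c(8660, 121, 86,  35,
               127,  36,  29,  15,
               79,   28,  58,  42,
               36,   22,  34, 115),
             nrow = 4, byrow = TRUE)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                     #observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  #agreement expected by chance
kappa <- (accuracy - expected) / (1 - expected)   #Cohen's Kappa

round(accuracy, 4)  #0.9313
round(kappa, 4)     #0.4495
```

Both values match the confusionMatrix() output above, confirming that the high raw accuracy is driven largely by the dominant 0-vote class, while Kappa gives a more chance-corrected view of the agreement.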
I can then put the round-by-round votes together into a predicted leaderboard based on the model's votes.
summary_table <- game_vote_predictions %>%
  ungroup() %>%
  select(Player, Team.Name, Round.Number, Predicted.Votes) %>%
  pivot_wider(names_from = Round.Number, values_from = Predicted.Votes, names_prefix = "R.") %>%
  mutate(across(starts_with("R."), as.numeric)) %>%
  mutate(
    Predicted_Votes = rowSums(select(., starts_with("R.")), na.rm = TRUE),
    Count_3_Votes = rowSums(select(., starts_with("R.")) == 3, na.rm = TRUE),
    Count_2_Votes = rowSums(select(., starts_with("R.")) == 2, na.rm = TRUE),
    Count_1_Vote = rowSums(select(., starts_with("R.")) == 1, na.rm = TRUE),
    Games_Polled = rowSums(select(., starts_with("R.")) > 0, na.rm = TRUE)
  ) %>%
  select(1:2, 28, 3:27, 29:32) %>%
  arrange(desc(Predicted_Votes), desc(Count_3_Votes), desc(Count_2_Votes), desc(Count_1_Vote))

datatable(summary_table, options = list(scrollX = TRUE, pageLength = 10))
We will compare these predictions to the actual vote count and see how well our model performs.
We will remove all players who were both predicted to receive 0 votes and didn’t receive an official vote, as this will give us a truer sense of how accurate our model is at predicting the final vote count.
#Create a vote tally based on the predicted votes of the model
vote_tally <- game_vote_predictions %>%
  group_by(Player, Team.Name) %>%
  summarise(Predicted.Votes = 3 * sum(Predicted.Votes == 3) +
              2 * sum(Predicted.Votes == 2) +
              1 * sum(Predicted.Votes == 1), .groups = 'drop') %>%
  arrange(desc(Predicted.Votes))

#Create a vote tally based on the actual votes awarded by the umpires
actual_votes <- brownlow_data %>%
  filter(Year == test_season) %>%
  group_by(Player, Team.Name) %>%
  summarise(Actual.Votes = 3 * sum(Brownlow.Votes == 3) +
              2 * sum(Brownlow.Votes == 2) +
              1 * sum(Brownlow.Votes == 1), .groups = 'drop') %>%
  arrange(desc(Actual.Votes))

#Join the predicted vote tally and the actual vote tally for comparison
vote_tally <- full_join(vote_tally, actual_votes) %>%
  arrange(desc(Predicted.Votes), desc(Actual.Votes))

#Replace NA predicted votes with a 0
vote_tally$Predicted.Votes <- replace(vote_tally$Predicted.Votes,
                                      is.na(vote_tally$Predicted.Votes),
                                      0)

#Replace NA actual votes with a 0
vote_tally$Actual.Votes <- replace(vote_tally$Actual.Votes,
                                   is.na(vote_tally$Actual.Votes),
                                   0)

#Rank both tallies and drop players with no predicted or actual votes
vote_tally <- vote_tally %>%
  mutate(Predicted.Rank = rank(-Predicted.Votes, ties.method = 'min'),
         Actual.Rank = rank(-Actual.Votes, ties.method = 'min')) %>%
  filter(Predicted.Votes + Actual.Votes != 0)

#Present the table in a nice format for comparison
datatable(vote_tally, options = list(scrollX = TRUE))
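The rank(-x, ties.method = 'min') idiom used for the leaderboard ranks from highest tally to lowest and gives tied players the same (best available) rank, as in the official Brownlow standings. A quick illustration with made-up tallies:

```r
#Four hypothetical season tallies; two players tied on 38 votes
tallies <- c(45, 38, 38, 20)

#Negating the tallies makes rank() order from highest to lowest;
#ties.method = 'min' gives both tied players rank 2, and the next player rank 4
rank(-tallies, ties.method = 'min')  #1 2 2 4
```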
The analysis of the predictions compared to the actual votes for the Brownlow Medal reveals several important insights into the model’s performance.
Our model has done reasonably well. The highlight is that it correctly predicted Patrick Cripps to break the all-time vote tally record (although not by as large a margin as he actually did!), and it also had Nick Daicos finishing second with a high vote count.
In terms of error metrics for the vote counts, the Mean Absolute Error (MAE) is 2.008, and the Root Mean Square Error (RMSE) is 2.631.
The MAE indicates that, on average, the predicted season tallies differ from the actual tallies by about 2.008 votes. This relatively low average error demonstrates that the model's predictions are quite close to the actual votes. The RMSE, which penalises larger errors more heavily, sits only slightly higher at 2.631, indicating that significant prediction errors are infrequent.
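For reference, the MAE and RMSE quoted above can be computed straight from the predicted and actual columns of the vote tally. This minimal sketch uses a small made-up tally for five players rather than the real vote_tally data:

```r
#Hypothetical predicted and actual season tallies for five players
tally <- data.frame(Predicted.Votes = c(45, 38, 30, 24, 20),
                    Actual.Votes    = c(45, 41, 28, 26, 19))

errors <- tally$Predicted.Votes - tally$Actual.Votes
mae  <- mean(abs(errors))     #mean absolute error
rmse <- sqrt(mean(errors^2))  #root mean square error

mae             #1.6
round(rmse, 3)  #1.897
```

RMSE is always at least as large as MAE, and the gap between the two grows with the presence of a few large misses, which is why the small gap reported above suggests errors are spread fairly evenly.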
#Plot predicted votes against actual votes
ggplot(vote_tally, aes(x = Predicted.Votes, y = Actual.Votes)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red') +
  ggtitle("Predicted vs Actual Votes") +
  xlab("Predicted Votes") +
  ylab("Actual Votes")
Furthermore, the correlation between the predicted and actual votes is approximately 0.928. This high correlation indicates a strong linear relationship: as the predicted votes increase, the actual votes tend to increase as well, showing that the model is capturing the umpires' voting patterns effectively.
In conclusion, the development and evaluation of the ordinal logistic regression model for predicting Brownlow Medal votes showed its effectiveness in providing reliable predictions based on a solid dataset of player performances. By incorporating a range of offensive and defensive statistics, the model captures key aspects of player performance, allowing for a nuanced understanding of what contributes to Brownlow recognition. Key metrics like goals, effective kicks, and tackles are strong predictors of votes, highlighting the importance of both offensive and defensive contributions to a player’s overall impact on the game.
Furthermore, the model’s integration of Coaches’ votes further enhances its predictive power, demonstrating that players who are recognised by the coaches are more likely to receive Brownlow votes. Additionally, variables capturing past recognition, such as previous Brownlow and Coaches’ votes, demonstrate that players who have consistently performed well and been acknowledged for their efforts in the past are more likely to continue receiving votes.
The model’s predictive abilities, assessed through game-by-game analysis and overall season performance, indicate a strong foundation for future enhancements. By refining the model and incorporating real-time data, I can continue improving its accuracy and reliability in predicting Brownlow Medal votes.
Overall, this project highlights the complexity of assessing player performance in the AFL and demonstrates how statistical modelling can enhance our understanding of the game and its players.