by André Vizzoni
In this post, I will introduce a prediction model that was the product of a research project that spanned four years (winning a few awards) and was the final project for my degree in Statistics. For those interested in the final project, here it is – though I warn everyone in advance that it is in Portuguese. The objective of this post, then, is to translate the most central parts of that project to English while, at the same time, talking about applications of the model to basketball data, since the original project used soccer data.
First, I will give an intuitive explanation of the model, with no equations or mathematical concepts introduced. Next will come the methodology section, where there will be a lot more maths and formal definitions. As such, people who are interested only in the intuitive definitions might wish to skip the methodology section. The idea behind this structure is for the post to be understandable both by laypeople and by people well versed in statistics. Finally, in the discussion section there will be a few summary comments, as well as a preview of things to come on this site.
Intuitive Explanation of the Model
In a basketball game, the winning team is the one that scores the most points, meaning that anyone interested in predicting the winner of a game should be concerned with the number of points scored by both teams. At the same time, predicting basketball game results runs into a number of problems, the main one being the huge number of factors that affect what happens on the floor. The number of points scored by a team in a game may depend on who was playing at home, whether any major player was injured, the offensive quality of both teams, etc.
Thus, when trying to explain game results with a model, one chooses which factors to consider, and in what manner. Unsurprisingly, there are many different models with the same goal: to accurately predict single-game results. In this model, I chose to define the number of points scored by a team in a game as depending on:
- their attack,
- the defense of their opponent,
- home field/home court advantage (if they are the home team), and
- the particularities of their league.
Thus, we work with the idea that each squad has its inherent strengths, and that they are different from each other. That is, playing at home affects Utah’s performance in a way that is different from the way it affects Brooklyn’s performance, for example.
Therefore, any prediction we make depends on how we estimate the different skills within a league, as well as the effects the league itself has on the style of basketball played within it. Estimates for points scored are generated from the following information: number of points scored by a team, who was their opponent, and who was the home team.
With these estimates, we make predictions for each of the games not yet played by way of simulations. From multigame predictions, one can then make predictions for the team’s record at the end of the competition. As such, one can calculate the probabilities of a team qualifying for the playoffs, getting the first overall pick, etc.
For those interested in a slightly more technical explanation, it can be said that the data used are the number of points scored by each team, indexed by the week in which the game took place, by the team that scored and by the team that was scored against. These points are then treated as random quantities, explained by the offensive quality of the attacking team, the defensive quality of the defending team, the home team’s home field/court advantage (which applies only when the home team is the attacking team) and an intercept term.
From the data, and through a regression model (a Poisson regression model), it is possible to estimate the specific strengths of each team. However, our primary interest is not to rank teams according to their strengths, but to use those strengths to predict future outcomes. This is done through Markov Chain Monte Carlo (MCMC) simulations.
Through these simulations, we get samples of the strengths of each team, and these samples are then used to generate predictions for the number of points scored by each team in future games. We can, therefore, approximate the probability of winning or losing for both teams playing in a match. Those probabilities are then accumulated to generate predictions for the rest of the season, which gives us rough odds of a team qualifying for the playoffs, having the best or worst record in the league, or winning a certain number of games.
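The two steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function names, the made-up posterior draws and the rule of discarding tied simulations (a real game would go to overtime) are all hypothetical choices made for this sketch.

```python
import numpy as np

def win_probability(rate_home, rate_away, rng):
    """Estimate P(home team wins) from posterior samples of scoring rates.

    rate_home, rate_away: 1-D arrays of equal length; each entry is one
    posterior draw of a team's expected points in this game (hypothetical inputs).
    """
    pts_home = rng.poisson(rate_home)   # one simulated score per posterior draw
    pts_away = rng.poisson(rate_away)
    # ASSUMPTION: tied simulations are simply discarded, since a real game
    # would be settled in overtime.
    decided = pts_home != pts_away
    return np.mean(pts_home[decided] > pts_away[decided])

def simulate_win_totals(current_wins, game_probs, n_sims, rng):
    """Simulate final win totals from win probabilities of the remaining games."""
    game_probs = np.asarray(game_probs)
    # Each row is one simulated rest-of-season; each column is one game.
    outcomes = rng.random((n_sims, game_probs.size)) < game_probs
    return current_wins + outcomes.sum(axis=1)

rng = np.random.default_rng(0)

# Made-up posterior draws: the home team averages about 108 points, the away team about 104.
rate_home = np.exp(rng.normal(np.log(108.0), 0.03, size=20_000))
rate_away = np.exp(rng.normal(np.log(104.0), 0.03, size=20_000))
p_home = win_probability(rate_home, rate_away, rng)

# A team with 30 wins and three games left (the last two probabilities are made up).
totals = simulate_win_totals(30, [p_home, 0.5, 0.4], n_sims=10_000, rng=rng)
p_32 = np.mean(totals >= 32)   # rough odds of finishing with at least 32 wins
```

The same accumulation, run over every team and every remaining game, is what would turn single-game probabilities into odds for playoff qualification or final standings.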
Methodology
When it comes to predicting game results, there are a number of models in the literature, though few have made an impact on the construction of the model being introduced here. Readers interested in learning more should consult these articles.
After careful evaluation of the methodologies of these articles, as well as other materials, I decided to use a Poisson model for predicting matchup results. The model defines the number of points the ith team scores against the jth team in week t of the season as a random variable, $X_{tij}$. The variable is Poisson distributed with mean $\lambda_{tij}$, so that

$$X_{tij} \mid \theta \sim \text{Poisson}(\lambda_{tij}),$$
where θ represents the set of model parameters (the inherent strengths of each team and the intercept, which represents the league effect). The parameters are linked to the distribution of the number of points scored in the following manner:
where $Of_i$ represents the offensive ability of the ith team, while $De_i$ represents the defensive ability of the same team, $Ho_i$ the home court advantage of that team and $Int$ the intercept parameter.
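For readers who want to see the shape of such a link, here is a sketch of how it is commonly written for a Poisson regression. It rests on assumptions that are not confirmed by the text: a log link, a defensive term that enters with a plus sign (so that a larger value for team j means team j concedes more points), and two pieces of notation introduced only for this illustration, namely $\lambda_{tij}$ for the expected number of points team i scores against team j in week t, and the indicator $h_{tij}$.

```latex
\log \lambda_{tij} = Int + Of_i + De_j + h_{tij}\, Ho_i,
\qquad
h_{tij} =
\begin{cases}
1 & \text{if team } i \text{ plays at home in week } t,\\
0 & \text{otherwise.}
\end{cases}
```

Under this form, the home court term contributes only when the scoring team is the home team, which matches the description given in the intuitive explanation.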
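Since I consider myself a Bayesian statistician, we need to talk about priors for the model. The priors I used in the project were wholly uninformative, as can be seen in the following equations, which hold for every team i:

$$Of_i \sim \mathcal{N}(0, 10^6), \qquad De_i \sim \mathcal{N}(0, 10^6), \qquad Ho_i \sim \mathcal{N}(0, 10^6), \qquad Int \sim \mathcal{N}(0, 10^6).$$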
The equations indicate the use of normal priors for the parameters, whose means were set to 0 (zero) and whose variances were set to 1,000,000. The very large variance is what makes the priors uninformative: it spreads the prior so widely that it is nearly flat over any plausible parameter value. Using an uninformative prior means that any posterior estimates will depend mostly on the likelihood function (which is available here because $X_{tij} \mid \theta$ has a known distribution). Since the object of this post isn’t to be an introduction to Bayesian inference, I will only say that we use Bayes’ Theorem to get posterior estimates for the model parameters.

There is a problem with the model as it is currently defined, though. The model is unidentifiable, which means its likelihood function can assume the same value for different sets of parameter values. Thus, there is a need for a constraint on the parameters: the sum of all the offensive factors has to be 0 (zero), and the same goes for the defensive factors and the home field factors. Then, if a league has N teams:

$$\sum_{i=1}^{N} Of_i = 0, \qquad \sum_{i=1}^{N} De_i = 0, \qquad \sum_{i=1}^{N} Ho_i = 0.$$
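To see why a constraint of this kind is needed, consider the small numerical sketch below. It assumes a log link in which the intercept, the offensive factor, the defensive factor and the home factor are simply added together; that additive form, the function names and all the numbers are assumptions made for illustration, not values from the project.

```python
import numpy as np

def expected_points(intercept, offense, defense, home_adv, i, j, i_is_home):
    """Expected points of team i against team j under an ASSUMED additive log link."""
    log_rate = intercept + offense[i] + defense[j]
    if i_is_home:
        log_rate += home_adv[i]   # home term applies only to the home team
    return np.exp(log_rate)

# Hypothetical parameter values for a three-team league.
intercept = 4.6
offense = np.array([0.05, -0.02, -0.03])
defense = np.array([0.01, 0.02, -0.03])
home_adv = np.array([0.03, 0.00, -0.03])

base = expected_points(intercept, offense, defense, home_adv, 0, 1, True)

# Add c to every offensive factor and subtract c from the intercept: the expected
# points (and therefore the likelihood) do not change, so the data alone cannot
# tell these two parameter sets apart.
c = 0.7
shifted = expected_points(intercept - c, offense + c, defense, home_adv, 0, 1, True)

def center(intercept, effects):
    """Re-express (intercept, effects) so that the effects sum to zero."""
    m = effects.mean()
    return intercept + m, effects - m

# Imposing the sum-to-zero constraint picks one representative of the family.
new_intercept, new_offense = center(intercept - c, offense + c)
```

Here `base` and `shifted` are equal, and centering the shifted offensive factors recovers the original intercept and factors, which is the sense in which the constraint makes the parameters uniquely defined.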
With a now identifiable model, we can use MCMC methods to generate samples of the parameter posteriors, as well as posterior estimates for the set of parameters for the model. The set of posterior samples can then be used to generate samples of the predictive distributions of the number of points scored in future games. As it stands, the model predictions come from the predictive samples.
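The idea of sampling a posterior and then sampling predictions from it can be shown on a deliberately tiny example. The sketch below is not the full model: it estimates a single log scoring rate for one team from made-up scores, with a Normal(0, 1,000,000) prior, using a random-walk Metropolis sampler (one of the simplest MCMC methods, chosen here for brevity and not necessarily the one used in the project). All names and numbers are hypothetical.

```python
import numpy as np

def log_posterior(theta, counts, prior_var=1_000_000.0):
    """Unnormalised log posterior of a single log-rate theta.

    Likelihood: counts ~ Poisson(exp(theta)); prior: theta ~ Normal(0, prior_var).
    """
    log_lik = np.sum(counts * theta - np.exp(theta))  # terms without theta dropped
    log_prior = -0.5 * theta**2 / prior_var
    return log_lik + log_prior

def metropolis(counts, n_iter=6000, step=0.05, seed=0):
    """Random-walk Metropolis sampler for theta."""
    rng = np.random.default_rng(seed)
    theta = np.log(np.mean(counts))            # start near the data
    current = log_posterior(theta, counts)
    draws = np.empty(n_iter)
    for k in range(n_iter):
        proposal = theta + step * rng.normal()
        candidate = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.random()) < candidate - current:
            theta, current = proposal, candidate
        draws[k] = theta
    return draws

counts = np.array([102, 98, 110, 95, 105])     # made-up points in five games
draws = metropolis(counts)[1000:]              # discard the first 1000 draws as burn-in
rate_samples = np.exp(draws)                   # posterior samples of the scoring rate

# One simulated score per posterior draw: a sample from the predictive distribution.
rng = np.random.default_rng(1)
predicted_points = rng.poisson(rate_samples)
```

Because the prior is so flat, the posterior mean of `rate_samples` ends up very close to the average of the observed scores, which is the practical meaning of letting the likelihood dominate.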
Discussion
As I believe this post to already be quite demanding and dense reading, this will be the final section of the current post. As such, I hope to write a few other posts here on the site in the near future:
- one to explain how the model fits the data (and how to evaluate its predictions as well as compare them to predictions from other models),
- one to relate the parameters of the model to basketball stats commonly used by the analytics community,
- one to introduce a dynamic adaptation of the model (in which the model ceases to be static and the parameters are permitted to vary with time), outlining the costs and benefits of such an adaptation, and
- a final post to consider ways to improve the model.
Before we end the post, however, let’s take a quick look at a few issues one might take with this model. First of all, it is a static model, meaning it assumes team strengths hold constant throughout the season. We know that this is a wholly unreasonable assumption, and one that will be addressed by the dynamic adaptation. The model also doesn’t take into account any statistics about the performance of specific players, nor the effect of specific coaches. Therefore, though it has quite a robust and complicated mathematical definition, the model takes very simple inputs.
That doesn’t mean, though, that its predictions are bad. In the next post, we will talk about predictions in greater depth, but the final note for today is a sneak peek into a way to measure the quality of those predictions.
|Model||Quality of predictions|
The table shows, in a way, the probability of a model correctly predicting a specific game outcome. The model introduced here beat a collection of models (in the original project, I compared it with other Brazilian models, though they aren’t shown here) in the quality of its predictions when it came to a few hundred soccer games, even beating FiveThirtyEight’s old model (which had many more inputs and much more complex calculations). FiveThirtyEight has since improved its model, and it is quite interesting to point out that the way they improved their model was by making it look a lot more like this one. There are no comparisons run with the newer data, but as of August 2018, FiveThirtyEight’s predictions were losing to my model’s predictions.