Introduction
I am quite happy to have arrived at the final post of the series – not because I disliked writing the previous posts, but because I have worked for four years with the models presented in this series. I have become accustomed to their weaknesses and strengths, as well as those of the methods used for comparison. This post, however, is new and fresh. Finally, I have the chance to put on paper (sort of) all the ideas I have had over the years on how to improve the model.
I don’t think it will surprise anyone reading this to know that I love modeling data. One of the reasons I love it is that modeling is not an exact science, as one might expect it to be. Modeling is just as much an art as a science; there are no certainties when we work with Probability Theory. Dealing with probabilities means making your peace with the fact that you will always be uncertain about your conclusions.
There are no absolute guarantees when it comes to the probability camp, because to work within it means to refuse to subscribe to a deterministic view of the world. For the probabilist, then, the world does not necessarily follow a set of predetermined rules. It may be random – chaotic, even – with that chaos being a major source of uncertainty.
In parametric statistics, we assume that there is an underlying model that thoroughly defines the behavior of our data (the models presented in this series are examples of parametric statistics). That assumption is in itself a new source of uncertainty, as the statistician may not be completely right with their “guess” about the underlying model. There are other possible sources of uncertainty, like choosing the wrong endpoint for your data (which might mean you do not make use of important information contained in the unused data), choosing the wrong method, or choosing the wrong priors (or using any kind of priors, for a hardcore frequentist). The list is seemingly endless.
All of these considerations lead us to the conclusion that modeling data involves handling uncertainty coming from multiple directions, including from the one doing the modeling. This is why I consider modeling an art that can be learned, specifically by looking at the work of others, which I will do a few times throughout this post. The post itself will be divided into three sections: the Improvements section, the New Ideas section, and the Conclusion section.
The first section will deal with ideas that aim to improve either the dynamic or the static models (or both) by making slight changes to their structure. The second section will be for ideas that imply changes which are more than slight, perhaps even ideas that involve building a completely different model. The last section will contain my final remarks on this series of posts.
Improvements
The first change that could be made has already been discussed in the Analysis section of my third post on the models: taking into account the pace of both teams playing in a game. Pace affects how many points are scored in a match, which in turn affects the offensive and defensive factor estimates for every team (a team with a high pace tends to have a higher offensive factor estimate and a lower defensive factor estimate than it probably should). There is a problem, though, in how to define this adjustment. Discussing how to define it will lead to the second improvement that could be made, so I will talk about that improvement before returning to how to define the pace adjustment.
One smarter way to model basketball results might be not to model points scored directly, but to model how many field goals and free throws each team attempts. Instead of having one Poisson-distributed random variable for each team, we would have three (free throw attempts, two-point field goal attempts, and three-point field goal attempts). Or we might have even more than three, if we used Baltej Parmar’s work on this site as a basis and modeled each team’s field goal attempts from each area. Or we could directly model only the number of possessions each team gets in a game. Every approach has advantages and disadvantages, and I don’t know which would be best.
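To make that idea concrete, here is a minimal sketch (in Python, with NumPy) of the three-variable version, simulating a single team’s points. The attempt rates, shooting percentages, and point values below are made-up numbers for illustration, not estimates from any model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (made-up) per-game attempt rates and shooting percentages for one team.
rates = {"fta": 22.0, "fg2a": 55.0, "fg3a": 35.0}   # expected attempts per game
pcts = {"fta": 0.78, "fg2a": 0.52, "fg3a": 0.36}    # probability each attempt goes in
values = {"fta": 1, "fg2a": 2, "fg3a": 3}           # points per made attempt

def simulate_points(n_games=10_000):
    """Simulate team points: Poisson attempts, then Binomial makes, per shot type."""
    total = np.zeros(n_games)
    for shot in rates:
        attempts = rng.poisson(rates[shot], size=n_games)
        makes = rng.binomial(attempts, pcts[shot])
        total += values[shot] * makes
    return total

points = simulate_points()
print(f"Expected points: {points.mean():.1f} (sd {points.std():.1f})")
```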
Another idea for improvement came from Ryan Davis’ tweet and meshed well with an idea that I’d had for improving my model when dealing with soccer data: divide the model into two parts. The first part would model the number of scoring chances (field goal and free throw attempts, in basketball) a team gets, while the second part would model a team’s efficiency in scoring with those chances. The first part links quite directly to the previous paragraph, and could be done with any of the methods mentioned. The second part is a bit more interesting to execute.
One of the simpler models used in Bayesian statistics is one in which the data are Bernoulli-distributed and the prior is a Beta distribution, which means that the posterior is also a Beta distribution, only with different parameters. If we used a vague Beta prior and took every shot a team takes to be a Bernoulli variable that has success defined as the shot going in, we could have an estimate of a team’s shooting ability that changes constantly.
In words, we would take every team to be of similar shooting ability at the beginning of a season, while being unsure enough about how good each team is that, given enough data, we could easily come to believe a particular team to be either quite good or quite bad at shooting (or anywhere in between). Moreover, with every shot a team took, we would have a better idea of how good its shooting is based on whether the shot went in or not, while also becoming more and more confident about that team’s quality.
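Here is a minimal sketch of that conjugate updating, assuming a vague Beta(1, 1) prior and a toy sequence of makes and misses:

```python
# Beta-Bernoulli updating: the posterior after seeing makes/misses is still a Beta.
# Beta(1, 1) is a vague (uniform) prior over shooting ability; the shots are toy data.
alpha, beta = 1.0, 1.0

shots = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = make, 0 = miss

for made in shots:
    alpha += made        # each make adds one "success"
    beta += 1 - made     # each miss adds one "failure"

posterior_mean = alpha / (alpha + beta)
print(f"Posterior: Beta({alpha:.0f}, {beta:.0f}); estimated shooting ability = {posterior_mean:.3f}")
```

Note that early shots move the estimate a lot, while later shots move it less and less, which is exactly the growing confidence described above.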
I am not married to this idea, though. I could see myself using other priors (like the normal distribution), because of their interesting properties. I would also like to be able to model team shooting quality dynamically (which means that I would let a team’s scoring prowess fluctuate as time goes on). And the nice part is that all of this can also be applied to a team’s shooting defense (that is, how well the team does at stopping its opponent from scoring).
Another improvement could be made on the basis of the Glickman paper which I already cited in the previous post, which would allow a team’s quality to vary between seasons. Making this alteration would be a sea change, since I have thus far only talked about models that deal with one year of data. Of course, it would be possible to use multiple seasons of data and put them through both the dynamic and the static models I’ve introduced. Doing so would be inadvisable, however, as it would produce an estimate of a team’s quality for all of the seasons combined.
The dynamic model might even be capable of handling the data and coming out with reasonable estimates. The static model, on the other hand, would end up estimating an average team quality for all the seasons, which would work terribly for any team whose performance varies significantly from season to season – basically all NBA teams.
A smarter way to handle the change in seasons would be to define a new omega, like the ones used to allow team quality to change from week to week, but one that only appears when we move from one season to the next. The variance of this new term would need to be much higher than that of the in-season omegas, since a higher variance allows team quality to make bigger swings between seasons.
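As a rough sketch of what that would look like, the random walk below uses one standard deviation for the weekly omegas and a much larger one for the omega that fires at each season boundary. Both values are made up for illustration; neither comes from the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma_week = 0.05     # assumed sd of the usual week-to-week omega
sigma_season = 0.50   # assumed (much larger) sd of the between-season omega

def evolve_strength(start, weeks_per_season=26, n_seasons=3):
    """Random-walk evolution of a team-quality parameter, with a bigger jump between seasons."""
    strength, path = start, []
    for season in range(n_seasons):
        if season > 0:
            strength += rng.normal(0.0, sigma_season)  # season-boundary omega
        for _ in range(weeks_per_season):
            strength += rng.normal(0.0, sigma_week)    # in-season omega
            path.append(strength)
    return np.array(path)

print(evolve_strength(0.0)[:5])
```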
Returning to the discussion about modeling the number of possessions for each team: what if we included rebounding and turnovers for each team? Doing so would potentially make the possession modeling more accurate (as it would help explain the difference in the number of possessions for either team in a single match) while also recognizing strengths that are not related to team shooting. We know that these factors are not fully taken into account by the model, as demonstrated in the third post of this series.
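For reference, a common box-score possession estimate already combines attempts, offensive rebounds, turnovers, and free throws, so a possession-based model could start from something like the sketch below. The 0.44 free-throw weight is the usual convention, not something I have estimated.

```python
def estimated_possessions(fga, orb, tov, fta, ft_weight=0.44):
    """Common box-score possession estimate: attempts that can end a possession,
    minus offensive rebounds (which extend one), plus turnovers,
    plus a fraction of free-throw attempts (0.44 is the usual convention)."""
    return fga - orb + tov + ft_weight * fta

# Toy box-score line for one team.
print(estimated_possessions(fga=88, orb=10, tov=13, fta=24))  # about 101.6
```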
Finally, we need to consider score effects. Research in every sport has found that the score at a given point in a game affects how teams play from that point forward. More to the point, in sports like football (where teams pass more when trailing and run the ball more when leading), soccer, and hockey (in both of these sports, the leading team falls into a defensive shell while the trailing team pushes forward), the difference is significant enough that modelers usually take it into account when defining their models.
One such modeler is the mathematician Micah Blake McCurdy (whose work is of great interest to anyone who likes hockey and/or statistical modeling of sports results). McCurdy gave a masterful presentation on score effects in hockey at CBJHAC, a conference on hockey analytics put on by the NHL’s Columbus Blue Jackets (link here). In it, Micah brought out a number of new insights into modeling score effects, insights which could be adapted to basketball. In this respect, it would be interesting to look at the order in which teams score (whether they go on scoring runs or droughts) to try and measure how momentum affects a team’s probability of winning a game, which could make us better at estimating that team’s quality in their forthcoming games.
As I keep repeating, I am a Bayesian statistician. Unfortunately, frequentists may not be able to use the majority of what I have discussed so far in this post. With them in mind, I would like to take a quick look at how to do everything I have proposed using frequentist techniques. Someone more knowledgeable than I am about frequentist inference might have good reason to criticize my comparisons. Since I’d like this post to be a bit more inclusive, however, I’m willing to discuss anything that is sufficiently similar to warrant an attempt at translation.
When it comes to the discussion about how to model possessions and what to model (should you include rebounds, for instance), there are no significant differences between a frequentist view of the subject and a Bayesian one. The only difference is the fundamental philosophical one between the two schools of thought (to put it simply, a frequentist believes that a parameter is a fixed quantity, while the Bayesian believes that a probability distribution can be used to represent the degree of belief we have about that parameter). As such, the frequentist wouldn’t use priors for the expected number of points scored by each team, nor would they use MCMC techniques to generate samples of the posterior distribution. The main decision point – deciding which variables to define as being of interest – doesn’t change according to your approach to statistics.
The more significant changes begin to appear when we turn to the task of dividing the model into two parts: one for modeling how many field goals a certain team will attempt from a certain region of the court, and one for modeling how likely the team is to score from that region (conversely, when talking about defense, we would be talking about how many field goals a team lets its opponent attempt and how good it is at preventing the opponent from scoring with those attempts). For the number of attempts, the frequentist would run a Poisson regression; for the portion dealing with scoring efficiency, a logistic regression.
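A hedged sketch of that frequentist translation, using statsmodels on toy data; the column names (fg3a, made, offense, defense, home) are hypothetical stand-ins for whatever data you actually have.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
teams = [f"T{i}" for i in range(8)]

# Toy team-game data: how many threes the offense attempted against a given defense.
games = pd.DataFrame({
    "offense": rng.choice(teams, 400),
    "defense": rng.choice(teams, 400),
    "home": rng.integers(0, 2, 400),
})
games["fg3a"] = rng.poisson(30 + 3 * games["home"])

# Toy shot-level data: did each attempt go in?
shots = pd.DataFrame({
    "offense": rng.choice(teams, 3000),
    "defense": rng.choice(teams, 3000),
    "home": rng.integers(0, 2, 3000),
})
shots["made"] = rng.binomial(1, 0.36, 3000)

# Part 1 (attempts): Poisson regression on how many threes a team gets up.
attempts_fit = smf.poisson("fg3a ~ C(offense) + C(defense) + home", data=games).fit(disp=0)

# Part 2 (efficiency): logistic regression on whether each attempt goes in.
efficiency_fit = smf.logit("made ~ C(offense) + C(defense) + home", data=shots).fit(disp=0)

print(attempts_fit.params.head())
print(efficiency_fit.params.head())
```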
There would be no difference between a Bayesian and a frequentist in handling score effects, mostly because I haven’t defined how I would look at score effects, only that I should. It is also important to point out that, from what I know, McCurdy is a frequentist, so the example I used is inherently usable by frequentists.
The main practical point of contention between frequentists and Bayesians arises with respect to modeling parameters dynamically. The way for a frequentist to let time affect their estimates would be to use regularization techniques, like ridge regression, or to do what FiveThirtyEight does with its Elo model’s K-factor. You could also run your regression with different weights for different moments in time, which is what Football Outsiders does with its weighted DVOA. Basically, by changing the weights you can give more importance to more recent observations, and slowly care less and less about what older games have to tell you.
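One simple way to build such weights is exponential decay, as in the sketch below; the half-life of 20 games is an arbitrary illustration, not a recommendation.

```python
import numpy as np

def recency_weights(games_ago, half_life=20.0):
    """Exponential-decay weights: a game `half_life` games in the past counts half as much."""
    return 0.5 ** (np.asarray(games_ago) / half_life)

games_ago = np.arange(0, 82)       # 0 = most recent game
w = recency_weights(games_ago)
print(w[:5], w[-1])                # recent games near 1.0, old games heavily discounted

# These weights could then be passed to any weighted fit, for example
#   smf.wls("points ~ ...", data=games, weights=w).fit()
# in statsmodels, or .fit(X, y, sample_weight=w) in scikit-learn.
```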
This is where I have to come in and say that the idea a lot of people take to heart, that of only caring about a team’s most recent games and giving little or no weight to previous games, did not work at all when I tried it in the earlier stages of my research project. I came to the conclusion that you should never throw data out if you can help it, as the static model that gave the same weight to every observation dramatically outperformed any model where any amount of data was ignored. Also, as you ignore more and more data, the model starts performing so badly that it isn’t much better than the null model (which means that flipping a coin and choosing a winner based on the side that came up was almost as accurate) and actually performs worse than the simple model (which means that always choosing the home team to win would bring you better results).
New Ideas
There is no better way to start than by revisiting what I said about modeling being an art and looking at a few examples of artwork. We can look to Micah Blake McCurdy, again, and at his beautiful model for estimating shot rates in hockey. If you have the time and inclination to read his explanation of the Magnus model I advise you to do so, not only because of how good he is at explaining everything that he has done, but also because of his beautiful graphs. He also has models to estimate the quality of a shooter and the quality of a defender, as well as the impact a coach has on a team.
Of course, most of the people who come to this site are looking for insights about basketball, not hockey. But, everything Micah has done could be adapted to basketball data. The shot rates modeling has a pretty straightforward translation to basketball, as does the coaching impact model, while the shooting models could be adapted to look at the quality of a shooter and the quality of the defender guarding them.
McCurdy takes the distance of a shot into account by treating every decrease in distance as raising the probability of a shot going in by the same amount. While I could argue that things shouldn’t be that “smooth” in hockey (I’d guess there are spots on the ice that have specific effects on the probability of a shot going in, like, say, the Ovechkin spot), I have even more reason to believe that they shouldn’t be smoothed in basketball.
Because of the difference in the value of a made shot based on the shot’s location (a three-pointer is worth more than any other shot), as well as the increasing prevalence of Moreyball over the years, I’d guess that the effect of distance on the probability of a shot going in isn’t linear. That is, a shot might be more likely to go in from three-point range than you would expect given the distance from the basket, simply because players practice a lot of three-pointers, given the shot’s strategic importance. Either way, a model like McCurdy’s gives us the opportunity to test whether my hypothesis is reasonable, since we could look at whether the predictive likelihood of the model is higher when distance isn’t modeled linearly.
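One hedged way to run that test: fit a logistic model where distance enters linearly and another where it enters through bins (so the three-point range can get its own bump), and compare the fits. Everything below is toy data, and the bin edges are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 5000
distance = rng.uniform(0, 30, n)  # shot distance in feet (toy data)
# Toy "truth" with a bump near the three-point line, just to have something to fit.
p_make = 0.62 - 0.015 * distance + 0.06 * ((distance > 22) & (distance < 26))
shots = pd.DataFrame({"distance": distance, "made": rng.binomial(1, p_make)})

# Model A: distance enters linearly (each extra foot shifts the log-odds equally).
linear_fit = smf.logit("made ~ distance", data=shots).fit(disp=0)

# Model B: distance enters through bins, so the three-point range can get its own effect.
shots["dist_bin"] = pd.cut(shots["distance"], bins=[0, 4, 10, 16, 22, 26, 30],
                           include_lowest=True).astype(str)
binned_fit = smf.logit("made ~ C(dist_bin)", data=shots).fit(disp=0)

# If the non-linear (binned) model fits clearly better, the linearity assumption looks suspect.
print("linear AIC:", round(linear_fit.aic, 1), "binned AIC:", round(binned_fit.aic, 1))
```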
Micah isn’t the only one with works of art that I admire. This very website’s owner, Greg, has done amazing work when modeling basketball (and I am definitely not one for insincere praise or sucking up). His latest post is a perfect example of that, not only for the work contained in there, but for the older posts he cites, like this one, on steals, or this one, on the disappearance of the big man (which is one of my biggest pet peeves, since I don’t understand how so many people can so confidently assume that, in a sport for extremely large people, the largest people are no longer important). Honestly, people should just buy his book.
But the reason I cite Greg’s work is that it comes at basketball in a different way than Micah comes at hockey: Greg spends a lot of time understanding and modeling some of the smaller-picture things in the NBA, and ends up unearthing a lot of interesting conclusions, ones that should be taken into account when looking at the bigger picture. He also gives those interested in his work a deeper understanding of how good a player is, as well as how they generate value for their team.
Player evaluation is an important component of predicting match results, since a team is a combination of players and coaches. As such, changing players should change the expected results, which is something that a model like mine doesn’t directly account for. By letting the parameters vary with time, I try to account for that, but it still isn’t perfect.
Though player evaluation is highly important, it is also fraught with problems. The one I would like to focus on is that of multicollinearity. The problem arises when the same information (or at least a large portion of it) is contained in multiple variables. For instance, if we are looking at children and adolescents, height and age correlate quite strongly (as children age, their height tends to increase), which might hurt your ability to model a desired variable using both age and height as explanatory variables.
Most methods can’t handle perfect multicollinearity, and imperfect but high multicollinearity can create problems that the modeler may not know how to deal with. Therefore, it is important to find ways to make sure multicollinearity doesn’t prevent the model from describing what it is supposed to describe. A simple way to do this is to take the variables that are too highly correlated and exclude all but one of them, as is common in machine learning. There are other methods as well, with reducing the effect of multicollinearity being one of the uses of lasso regression.
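A minimal sketch of that drop-all-but-one approach, with a made-up correlation threshold and toy data mimicking the age/height example:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop all but one of any group of columns whose pairwise |correlation| exceeds the threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: for young players, height and age are almost redundant.
rng = np.random.default_rng(1)
age = rng.uniform(8, 18, 500)
data = pd.DataFrame({"age": age,
                     "height": 100 + 5 * age + rng.normal(0, 3, 500),
                     "skill": rng.normal(0, 1, 500)})
print(drop_highly_correlated(data).columns.tolist())  # one of age/height is dropped
```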
I will focus, though, on methods for dimensionality reduction. The most common and well-known method is principal component analysis (PCA), which is used in both statistics and machine learning. The problem with PCA, however, is that, under a probabilistic view, one of its main advantages – that the principal components are independent – only holds if the data are jointly normally distributed; in general, PCA only guarantees that the components are uncorrelated.
True, the Central Limit Theorem (CLT) states that, as the sample size increases, standardized sums of independent variables converge to a normal distribution. That is still an approximation, and we can only estimate an upper bound on its error (with the Berry-Esseen theorem). It is even harder, if not impossible, to know how close the approximation comes to the real values when we are dealing with predictions. By using the CLT, you’re assuming that your variables are very close to an ideal form that helps with your calculations. Sometimes this is true, but it is not necessarily the case.
One way to handle the difficulty would be to use Independent Component Analysis (ICA), which explicitly assumes that the underlying components are non-Gaussian and statistically independent. Independence is very hard to verify, which makes it a very strong assumption for any method to rely on. There is also another limitation shared by both PCA and ICA: the components might not be easily interpretable.
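For a sense of how the two methods are used in practice, here is a small sketch with scikit-learn on a toy (players x stats) matrix; the data are simulated, and three components is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(2)
# Toy (players x stats) matrix built from three non-Gaussian latent "abilities".
sources = rng.uniform(-1.0, 1.0, size=(300, 3))
mixing = rng.normal(size=(3, 6))
X = sources @ mixing + 0.1 * rng.normal(size=(300, 6))

pca = PCA(n_components=3).fit(X)
ica = FastICA(n_components=3, random_state=0).fit(X)

# PCA components are uncorrelated by construction; ICA additionally tries to make
# them statistically independent, which only makes sense for non-Gaussian sources.
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
print("ICA mixing matrix shape:", ica.mixing_.shape)
```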
That’s where psychologists (a group that’s seldom talked about when it comes to sports or statistical analysis) come in. Factor analysis is very common in psychometrics, and can be divided into two segments: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). EFA hits many of the same problems that PCA and ICA face, but CFA doesn’t. CFA is used when the analyst, armed with their domain knowledge, has a number of hypotheses about the latent structure of their data and wants to test those hypotheses.
A simple example would be to build a database with statistics related to passing (shot assists, turnovers per pass, secondary assists, etc.) and try to estimate how good a player is at passing by assuming that all of the variables have a common factor, passing ability. This has the advantage of being quite interpretable, since it is based on a construct (passing ability) and the estimation process is done with that specific construct in mind. It is also possible to allow different constructs to be somewhat related, within this framework, which means that the person doing the modeling has a lot of freedom, another advantage.
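Psychometricians would typically fit such a model by maximum likelihood in a dedicated SEM package, but since I am a Bayesian, here is a rough sketch of the same one-factor structure written as a Bayesian model in PyMC. The data are random placeholders and the priors are arbitrary; this is the structure of the idea, not a finished analysis.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)
n_players, n_stats = 200, 3
# Placeholder standardized passing stats (columns could be shot assists,
# secondary assists, passes per turnover); real data would replace this.
X = rng.normal(size=(n_players, n_stats))

with pm.Model():
    # One latent "passing ability" per player, plus a loading and noise level per stat.
    passing = pm.Normal("passing_ability", 0.0, 1.0, shape=n_players)
    loadings = pm.HalfNormal("loadings", 1.0, shape=n_stats)   # kept positive for identifiability
    noise = pm.HalfNormal("noise", 1.0, shape=n_stats)
    mu = passing[:, None] * loadings[None, :]
    pm.Normal("obs", mu=mu, sigma=noise, observed=X)
    trace = pm.sample(1000, tune=1000, chains=2)
```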
The Four Factors could be a usable framework for basketball in this regard. With multiple statistics related to each factor, one could estimate how well a player or team performs in each factor, and put the results together in a model for predicting game outcomes. I even have an idea of how to do that with my current model, which would allow me to estimate both a player’s quality and their team’s chance of winning a certain game at the same time. I am quite confident in my idea, but not confident enough to put it out in the open without more work. The main point for me is that sports and CFA go really well together, in a number of ways.
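For completeness, here is how the Four Factors themselves are usually computed from a box score, following Dean Oliver’s definitions (conventions vary slightly; the free-throw factor is sometimes defined as FTA/FGA instead of FT/FGA). The box-score line in the example is made up.

```python
def four_factors(fgm, fga, fg3m, tov, fta, ftm, orb, opp_drb):
    """Dean Oliver's Four Factors from one team's box-score totals.
    Note: the free-throw factor is sometimes defined as FTA/FGA instead of FT/FGA."""
    efg = (fgm + 0.5 * fg3m) / fga
    tov_pct = tov / (fga + 0.44 * fta + tov)
    orb_pct = orb / (orb + opp_drb)
    ft_rate = ftm / fga
    return {"eFG%": efg, "TOV%": tov_pct, "ORB%": orb_pct, "FT rate": ft_rate}

print(four_factors(fgm=41, fga=88, fg3m=12, tov=13, fta=24, ftm=19, orb=10, opp_drb=34))
```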
Conclusion
With this last lengthy post, my series on modeling sports results is done. As such, I’d like to use this section to thank Greg and everyone who has read even a single word of what I wrote. Greg gave me an opportunity to write for his amazing audience, and volunteered to edit my ramblings, all of which I’m very grateful for.
The readers have helped me tremendously. When I started writing this series, at the end of last year, I’d spent almost a year and a half running away from turning my final project into a paper that could actually be published, since the task sounded so daunting. The last few months have not only led me to translate it all into a second language, but have also given me an idea of how the paper should look (I’ve now finally begun work on that front), as well as reignited my interest in thinking about sports (which can be clearly seen in this last post), with many new ideas for articles having popped into my head.
I hope this isn’t the last we’ll see of each other. But, if it is, thanks for reading.