by André Vizzoni
Parts one and two of this series of posts were quite heavy with probabilistic jargon (even with me trying to lighten it), and I hope this third one won’t be like that. The main objective of this article is to show the readers how the outputs of this model fit within the current basketball analytics framework. With that in mind, I will be looking at the relationship between the parameters of every team and the following stats:
- Team 2-Point Field Goal Percentage (T2FG%)
- Opponent 2-Point Field Goal Percentage (O2FG%)
- Team 3-Point Field Goal Percentage (T3FG%)
- Opponent 3-Point Field Goal Percentage (O3FG%)
- Team Effective Field Goal Percentage (TEFG%)
- Opponent Effective Field Goal Percentage (OEFG%)
- Team True Shooting Percentage (TTS%)
- Opponent True Shooting Percentage (OTS%)
- Team Offensive Rebounding Percentage (TOREB%)
- Opponent Offensive Rebounding Percentage (OOREB%)
- Team Defensive Rebounding Percentage (TDREB%)
- Opponent Defensive Rebounding Percentage (ODREB%)
- Team Free Throw Rate (TFT%)
- Opponent Free Throw Rate (OFT%)
- Team Turnover Percentage (TTO%)
- Opponent Turnover Percentage (OTO%)
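Two of the stats above are derived from box-score counts rather than read off directly. As a minimal sketch, the conventional definitions of effective field goal percentage and true shooting percentage can be written as follows; the function names and the box-score numbers are hypothetical, chosen only for illustration.

```python
def effective_fg_pct(fgm: float, fg3m: float, fga: float) -> float:
    """eFG% = (FGM + 0.5 * 3PM) / FGA: a made three counts as 1.5 made shots."""
    return (fgm + 0.5 * fg3m) / fga


def true_shooting_pct(pts: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)); 0.44 is the conventional free throw weight."""
    return pts / (2.0 * (fga + 0.44 * fta))


# Hypothetical single-game box score, not real data.
efg = effective_fg_pct(fgm=40, fg3m=10, fga=88)
ts = true_shooting_pct(pts=110, fga=88, fta=25)
print(f"eFG% = {efg:.3f}, TS% = {ts:.3f}")
```

The opponent versions use the same formulas applied to the counts a team allows.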
Since I’m using this article by Justin Jacobs as a guide, it also makes sense to look at the Four Factors Adjusted Ratings, hosted on this website. First, I will lay out the Methodology for my analysis. Next, I will analyze the scatter plots of the model parameters versus the assorted stats in the Analysis section. Finally, I will use the Discussion section to make sense of everything the analysis unearths.
First off, a quick refresher from the first two posts, specifically the definition of the model (Part 1, Methodology) and the point estimates for the model parameters (Part 2, Estimates). The mean number of points a team is expected to score is a function of the quality of its offense and its home court advantage, while the number of points it is expected to allow is a function of its defense.
We can then take a look at the overall quality of each team by looking at its win probability. When trying to calculate that probability, however, two problems arise. The first is related to home court advantage: when playing at home, the term related to home court appears in the formula. As a result, we have to calculate a home win probability and an away win probability for every team.
The second problem has to do with how to calculate those probabilities. The issue derives from the fact that the result of a game depends on both teams involved. As such, we need an opponent that can be used as a measuring stick. The obvious choice, to me, was a perfectly average team: one with all of its parameters equal to zero. A team’s general ability is then measured as its probability of winning a game against that exactly average team, either at home or away from home. Figure 1 shows the point and interval estimates for those probabilities.
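To make the idea of a measuring-stick opponent concrete, here is a rough sketch of how home and away win probabilities against an all-zero team could be computed from point estimates. Part 1's exact likelihood is not restated in this post, so everything specific below is an assumption made for illustration: the Poisson scoring model with a log link, the sign convention for defense (higher means fewer points allowed), the even split of ties, the baseline, and the parameter values.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability mass, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def win_probability(lam_team: float, lam_opp: float, max_points: int = 400) -> float:
    """P(team outscores opponent); ties are split evenly as a stand-in for overtime."""
    p_win = 0.0
    opp_below = 0.0  # running P(opponent scores fewer than k points)
    for k in range(max_points + 1):
        p_opp_k = poisson_pmf(k, lam_opp)
        p_win += poisson_pmf(k, lam_team) * (opp_below + 0.5 * p_opp_k)
        opp_below += p_opp_k
    return p_win


def win_probs_vs_average(offense, defense, home_court, baseline):
    """Home and away win probabilities against a team whose parameters are all zero."""
    # The average opponent has offense = 0 and home court = 0, so only our defense matters.
    allowed = math.exp(baseline - defense)
    home = win_probability(math.exp(baseline + offense + home_court), allowed)
    away = win_probability(math.exp(baseline + offense), allowed)
    return home, away


# Hypothetical values: a league-average scoring rate and one team's point estimates.
BASELINE = math.log(105.0)
home_p, away_p = win_probs_vs_average(
    offense=0.03, defense=0.02, home_court=0.02, baseline=BASELINE
)
print(f"home: {home_p:.3f}, away: {away_p:.3f}")
```

Under these assumptions the home court term only enters the team's own scoring rate when it plays at home, which is why the two probabilities differ. The interval estimates in Figure 1 would come from repeating this over posterior draws rather than a single point estimate.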
Now that the win probabilities have been introduced, we have three parameters for each team, as well as two functions of the parameters. So, five “numbers” for each team. The next step is to examine the relationship between those numbers and the team stats. The stats come from the NBA’s own website, while I’ll be using the point estimates for my parameters and their functions.
First, an image to show how the process works. Figure 2 has four graphs in it. They share the same x axis label because the same model parameter is being looked at for each graph (in this case, the parameter is the offensive quality of each team). The y axis labels change slightly, though. That’s because the team stat being looked at doesn’t change (in this case, Team Winning Percentage), but the snapshot taken changes in each case. The upper left graph shows the team’s win percentage at home, while the upper right one shows away winning percentage, the lower left takes into account all games, and the lower right shows the differential (home win percentage minus away win percentage, which means negative numbers show that the team wins more away from home than at home).
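The four snapshots can be sketched in a few lines. The game log, the team codes, and the function name below are made up for illustration; the real stats come from the NBA's website, not from this code.

```python
from collections import defaultdict

# Hypothetical game log: (home_team, away_team, home_points, away_points).
games = [
    ("AAA", "BBB", 110, 102),
    ("BBB", "AAA", 99, 94),
    ("AAA", "CCC", 95, 101),
    ("CCC", "AAA", 120, 111),
    ("BBB", "CCC", 108, 100),
    ("CCC", "BBB", 97, 105),
]


def win_pct_snapshots(games):
    """Return home, away, overall, and differential win percentage per team."""
    # For each team and venue, keep [wins, games played].
    record = defaultdict(lambda: {"home": [0, 0], "away": [0, 0]})
    for home, away, home_pts, away_pts in games:
        home_won = home_pts > away_pts
        record[home]["home"][0] += home_won
        record[home]["home"][1] += 1
        record[away]["away"][0] += not home_won
        record[away]["away"][1] += 1

    snapshots = {}
    for team, rec in record.items():
        (hw, hg), (aw, ag) = rec["home"], rec["away"]
        home_pct, away_pct = hw / hg, aw / ag  # assumes at least one game at each venue
        snapshots[team] = {
            "home": home_pct,
            "away": away_pct,
            "overall": (hw + aw) / (hg + ag),
            # Negative values mean the team wins more often away than at home.
            "differential": home_pct - away_pct,
        }
    return snapshots


for team, snap in sorted(win_pct_snapshots(games).items()):
    print(team, snap)
```

The same home, away, overall, and differential split applies to every other team stat in the post, not only to win percentage.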
We can now move on to other parts of the graphs. Each graph includes a scatter plot of the two variables (the model parameter and the team stat), with a regression line plotted as well. Also shown are the R squared (or coefficient of determination) and the regression p-value. To put it simply, the p-value tries to answer the question, “Is there a linear relationship between the two variables?” Smaller p-values are stronger evidence that there is one. The coefficient of determination measures how strong that relationship is (higher values mean stronger relationships). The regression line shows the direction of the relationship: if the line runs from lower left to upper right, the relationship is positive. Negative relationships, on the other hand, have regression lines that go from the upper left to the lower right.
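For readers who want to see where those numbers come from, here is a small sketch of a simple linear regression written out by hand. The data points are hypothetical, and the function name is mine; the figures in this post were not necessarily produced this way.

```python
import math


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with R squared and a t statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)

    # t statistic for the null hypothesis "slope is zero", with n - 2 degrees of freedom.
    if r_squared < 1.0:
        t_stat = math.copysign(math.sqrt(r_squared * (n - 2) / (1.0 - r_squared)), slope)
    else:
        t_stat = math.copysign(math.inf, slope)  # perfect fit
    return slope, intercept, r_squared, t_stat


# Hypothetical data: a model parameter (x) against a team stat (y).
param = [-0.04, -0.02, 0.00, 0.01, 0.03, 0.05]
stat = [0.40, 0.47, 0.50, 0.49, 0.58, 0.62]
slope, intercept, r_squared, t_stat = simple_linear_regression(param, stat)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f} t={t_stat:.2f}")
```

The p-value is the two-sided tail probability of that t statistic under a Student t distribution with n - 2 degrees of freedom. A library routine such as `scipy.stats.linregress` returns the slope, intercept, correlation, and p-value in one call.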
Now, let’s talk about linear regression in a more rigorous way. I’m not trying to write a book on linear regression, but I will cite some problems with what I’m doing. First off, I haven’t done any residual analysis to check whether the assumptions hold. I haven’t checked whether the errors are normally distributed, whether the relationship between the two variables is actually linear, whether there is homoscedasticity (constant variance), or whether the observations are actually independent. So, I’m just going ahead and assuming all of the premises are true, which is probably unreasonable. I plan on doing an actual residual analysis for a regression model on this site in the future, just not in this post. Additionally, p-values are not necessarily the best measures of anything (PSA: I’m a very big Gelman fan), especially in cases like this one with so few observations.
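As a hint of what such a check might look like, the sketch below computes the residuals of a fitted line and compares their spread for low and high values of the predictor. This is an informal look at constant variance, not a formal test, and the data and function names are hypothetical.

```python
import statistics


def ols_residuals(x, y):
    """Residuals of the least-squares line y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def spread_by_half(x, residuals):
    """Standard deviation of residuals for the lower and upper half of x values."""
    ordered = [r for _, r in sorted(zip(x, residuals))]
    half = len(ordered) // 2
    return statistics.pstdev(ordered[:half]), statistics.pstdev(ordered[half:])


# Hypothetical data: a model parameter (x) against a team stat (y).
param = [-0.05, -0.03, -0.02, 0.00, 0.01, 0.02, 0.04, 0.06]
stat = [0.41, 0.44, 0.47, 0.50, 0.49, 0.55, 0.57, 0.64]
res = ols_residuals(param, stat)
low_spread, high_spread = spread_by_half(param, res)
print(f"residual spread: low x = {low_spread:.4f}, high x = {high_spread:.4f}")
```

Very different spreads in the two halves would be a warning sign for heteroscedasticity, and plotting the residuals against the predictor would also reveal curvature that a straight line misses. With only thirty teams, though, any such check is noisy.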
There are three parameters and two parameter functions, as well as twenty team stats: the sixteen from the Introduction section and four others, Team Win Percentage (TWin%), Team Minutes (TMin, which isn’t particularly interesting, I admit; I had an idea that I ended up giving up on, but I kept the stat), Team Points per Game (TPTS/GP) and Opponent Points per Game (OPTS/GP). Since each stat is also shown in four different situations (at home, away, overall and differential between home and away), there are four hundred graphs, shown on a hundred different images, with four graphs per image (just like Figure 2).
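The count is easy to verify by enumerating the combinations. The labels below are shorthand of my own for the five per-team numbers and the four situations; only the stat abbreviations come from the post.

```python
from itertools import product

# Three parameters plus two functions of the parameters (labels are shorthand).
numbers = ["offense", "defense", "home court", "home win prob", "away win prob"]

# Sixteen stats from the introduction plus the four extra ones.
stats = [
    "T2FG%", "O2FG%", "T3FG%", "O3FG%", "TEFG%", "OEFG%", "TTS%", "OTS%",
    "TOREB%", "OOREB%", "TDREB%", "ODREB%", "TFT%", "OFT%", "TTO%", "OTO%",
    "TWin%", "TMin", "TPTS/GP", "OPTS/GP",
]

situations = ["home", "away", "overall", "differential"]

images = list(product(numbers, stats))              # one image per (number, stat) pair
graphs = list(product(numbers, stats, situations))  # four graphs per image
print(len(images), len(graphs))
```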
Showing that many pictures would turn this post into a gruesome read, so I will only show here the images that are important to my analysis. But the rest of the images are available at this Dropbox folder, for those interested.
We’ll start the analysis by showing a picture that won’t be available in the folder. Figure 3 shows that the model parameters can largely be considered independent of each other, which was to be expected, since they were defined as independent right at the beginning of the model’s definition. That may not be true for the offense-defense pair, since the graph shows a weak relationship between the two. I will assume that they are independent, though I don’t have more data to see if this persists with larger datasets. This dependence may actually be related to the pace of a team, since basketball is a sport where teams don’t tend to have huge advantages in time of possession, as they can in American football; consequently, a team that takes a lot of shots will probably face more shots, and both score and allow a lot of points.
Figures 4 and 5 show that this model is partially identifiable, since the offense and the home court parameters both have meaningful relationships with the number of points a team scores at home, though interestingly home court has its strongest relationship with point differential. That may actually be the most interesting result of this whole analysis, and something that really helps with the interpretation of my model’s results.
I also found support for Dean Oliver’s theory of four factors: there seems to be independence between offensive capabilities and the team’s free throw rate, as Figure 6 shows. Thus, it may be that offensive capability would be better measured by things like effective field goal percentage, while the ability to force the opponent into fouling – though important for the number of points a team scores – deserves separate attention in a model.
Figure 7 shows that the same holds true for offensive rebounding, while Figure 8 shows that, though the same is not exactly true for defensive rebounding, rebounding is another skill that isn’t completely accounted for by the model.
Figure 9 brings what I think is the last interesting point, as there seems to be no relationship between turnover rate (both team and opponent rates) and Offense or Defense, but there is a significant relationship between turnover rate and home court. That could mean that the big advantage home court brings has to do with turnovers (which could make sense, in a way, since it could mean that teams who are playing away from home are more prone to making mistakes). More probably, the relationship could indicate that turnovers should be looked at separately, as there seems to be an identifiability problem going on with the home court term. Such a problem would normally indicate that we are using a single parameter to explain what should be explained by multiple parameters.
To me, this analysis was incredibly interesting, bringing to light a lot of ways to improve the model. It showed that taking pace into account is important and could explain the dependence between offense and defense that shouldn’t be there. Also, we found that the abilities of rebounding and fouling (or evading fouls) should probably be added as well. In addition, the analysis suggested that turnovers could help explain home court advantage or be one of the causes of the identifiability problems this model experiences.