By Alan Moghaddam
I’ve been obsessed with Machine Learning lately. Like really obsessed. The problem is that I do not care about predicting how many passengers would have survived the Titanic sinking, or whether a computer can tell if an image is a dog or a cat (all cats or you need not apply), or performing handwriting analysis to see if someone drew a 0 or a 6. Fret no more, I’ve done it – I have made a machine learning algorithm that scrapes data from basketball-reference.com and can predict who will and will not make the playoffs!
Introduction
What is Machine Learning? To put it simply, Machine Learning is a subset of Artificial Intelligence where a computer uses data to develop a mathematical model to make decisions or predictions. It is really easy for a human to tell the difference between a cat and a dog simply by asking the following questions:
- Does it bark incessantly?
- Will it invade your personal space?
- Has it slobbered all over you?
These are all things that help us humans distinguish between cats and dogs, but a machine isn’t always privy to that information. If I feed a machine a series of images, it does not necessarily extrapolate the same key factors for identifying a dog. Though this state of affairs can frustrate our initial aims, it has the positive value of forming new connections that are not immediately obvious (to humans).
Types of Machine Learning
There are three* types of machine learning out there (I’m sure someone with a data science background will come and “well, actually” me on this): Supervised Learning, Unsupervised Learning, and Reinforcement Learning. I’ve stolen the Wikipedia entry for this to help us understand the differences between them.
- Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs.
- Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize.
Machine Learning Examples: Unsupervised Learning
To make this easier to understand, let’s put the concept in basketball terms. Basketball has become somewhat positionless in the modern era. To say LeBron James is a forward or Kawhi Leonard is a guard is meaningless, as the label tells you little about how the player actually plays. The Rockets’ small-ball lineup is an even more glaring example. I can distinguish LeBron James from Anthony Davis or Dwight Howard positionally – because I’m a human. Let’s say we wanted to separate and cluster players based on box score stat lines. Well, I’ve done this before:
I did not tell the machine what position each player plays; I only fed it raw stats like steals, points, assists, etc. Some of the conclusions are astounding, particularly the machine placing Luka, LeBron, and Giannis all in the same branch of the dendrogram (insinuating they are all a similar type of player). I did not give the machine any indication of a correct answer; it just clustered.
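To make the clustering idea concrete, here is a minimal sketch using hierarchical clustering from SciPy. The player names and stat lines are invented for illustration, and the choice of Ward linkage is an assumption; it is not necessarily the method behind the result described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-game stat lines: points, assists, rebounds, blocks.
# These numbers are made up for illustration, not scraped data.
players = ["Playmaker A", "Playmaker B", "Big C", "Big D"]
stats = np.array([
    [28.0, 9.0, 8.0, 0.5],
    [26.0, 8.5, 7.5, 0.6],
    [14.0, 1.5, 12.0, 2.2],
    [12.0, 1.0, 11.0, 2.5],
])

# Standardize each column so no single stat dominates the distance.
scaled = (stats - stats.mean(axis=0)) / stats.std(axis=0)

# Ward linkage builds the tree that a dendrogram would draw.
tree = linkage(scaled, method="ward")

# Cut the tree into two clusters. No positions were ever supplied.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
for name, cid in zip(players, cluster_ids):
    print(name, "-> cluster", cid)
```

The two playmakers land in one cluster and the two bigs in the other, purely from the numbers.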
Machine Learning Examples: Reinforcement Learning
Reinforcement learning is a little more difficult to understand. Let’s say you wanted to build an algorithm to play a video game, like the algorithm named MarI/O (seriously, if you’re into video games, the linked video is well worth the watch). In reinforcement learning, we reward the computer (either positively or negatively) for each decision it makes. The computer repeats operations over and over again to learn how to maximize its reward. That is much like how you as a human would approach Mario: jumping on Goombas, using powerups, and doing anything else that maximizes your chances of success.
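The reward loop can be sketched with a toy example. This is a minimal tabular Q-learning sketch on a made-up five-square "level" where the flag sits on the last square; it is not how MarI/O works, and the reward values and learning settings are assumptions chosen for illustration.

```python
import random

random.seed(0)

N_STATES = 5        # squares 0..4; square 4 holds the flag
ACTIONS = [-1, +1]  # index 0 = step left, index 1 = step right
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

# q[state][action] is the learned estimate of future reward.
q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally, otherwise take the best-known action.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if q[state][0] > q[state][1] else 1
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        # Positive reward for reaching the flag, small penalty per step.
        reward = 1.0 if next_state == N_STATES - 1 else -0.01
        target = reward + gamma * max(q[next_state])
        q[state][a] += alpha * (target - q[state][a])
        state = next_state

policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(N_STATES - 1)]
print(policy)
```

After enough repetitions the learned policy is to walk right on every square, which is the reward-maximizing behavior in this toy world.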
Machine Learning Examples: Supervised Learning
Lastly, we have Supervised Learning. This is what we will be using to make our playoff predictor. (Warning: heavy jargon ahead) In supervised learning, the human feeds the machine a series of data points called training data. This training data is composed of features and labels. You can think of features as x-values and labels as y-values which the machine uses to generate a line of best fit. For our NBA data, the features will be relative performance stats like FG%, steals, fouls, etc. Our labels are simply an indicator of whether a team made the playoffs or not (yes or no, or more specifically 1 or 0).
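As a minimal sketch of features and labels in code, assume two invented feature columns (a FG% rank and a steals rank) and a handful of made-up teams. A single decision tree stands in for the model here just to keep the example small.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row is one team: [FG% rank, steals rank]. Values are invented.
features = [[28, 25], [30, 20], [27, 29], [5, 8], [3, 10], [7, 2]]
# Matching labels: 1 = made the playoffs, 0 = did not.
labels = [1, 1, 1, 0, 0, 0]

# Fit on the training data, then predict for two unseen teams.
model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)
prediction = model.predict([[26, 27], [4, 6]])
print(prediction)
```

The model never sees a rule like "high ranks make the playoffs"; it infers one from the labeled examples.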
Data Collection and Feature Selection
The two elements that commonly plague ML algorithms are: lack of data and incorrect feature selection. Luckily, with the NBA there is a plethora of easy-to-scrape data. Cleaning and collecting is basically what a data scientist will spend most of their time on. Luckily (for the readers), this article will not do the same because the folks of basketball-reference have done an excellent job of hosting all the necessary data in a nice and organized fashion.

Prerequisites
Here are the things you will need to make this on your own:
- Some form of a Python environment. My suggestion would be Anaconda and Jupyter Notebook.
- The Python library Beautiful Soup (pip install beautifulsoup4)
- Scikit-learn (pip install scikit-learn). With most instances of Anaconda you already have scikit-learn installed
- Optional: Download my Jupyter notebooks from GitHub. All the code listed in this will be in my Jupyter notebooks, so if you want to, you can easily look through those as well.
Gathering the data
To scrape the data from Basketball Reference, we will use the Python libraries Beautiful Soup and requests. The loop runs from 2004 through last season (2019). The code says 2020, but the range stops before reaching it. We do not – I repeat, we do not – want to scrape the data for this season. Why is that? Simple: we don’t want to overtrain our algorithm to this data. We want this data to be “fresh” to the algorithm, so that we can test it to see if the algorithm works or not.
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

start = 2004
stop = 2020
site = 'https://www.basketball-reference.com/leagues/NBA_{}.html'
tableID = 'team-stats-per_game'
yearly = []
# range() excludes stop, so the last season scraped is 2019
for year in range(start, stop):
    url = site.format(year)
    html = requests.get(url).text
    # Strip HTML comment markers so a table wrapped in a comment can be parsed
    cleaned_soup = BeautifulSoup(re.sub("<!--|-->", "", html), 'lxml')
    tableStats = cleaned_soup.find('table', {'id': tableID})
    # Column names come from the first row; drop the leading rank column
    headers = [th.getText() for th in tableStats.findAll('tr')[0].findAll('th')]
    headers = headers[1:]
    rows = tableStats.findAll('tr')[1:]
    stats = [[td.getText() for td in row.findAll('td')] for row in rows]
    stats = pd.DataFrame(stats, columns=headers)
    for col in stats:
        if col != 'Team':
            # Scraped cells are strings; convert to numbers before ranking
            stats[col] = pd.to_numeric(stats[col], errors='coerce').rank()
    # An asterisk in the team name marks a playoff team
    stats["Playoffs"] = ["1" if "*" in ele else "0" for ele in stats["Team"]]
    yearly.append(stats)
# DataFrame.append was removed in pandas 2.0; concat does the same job
overallstats = pd.concat(yearly, ignore_index=True)
If you type “overallstats” on the next line and run it, you should see a truncated preview of the collected table. You cannot see the entire table, but this should give you a decent sense of what to expect. In our for loop, you can see a bit that looks like this:
stats["Playoffs"] = ["1" if "*" in ele else "0" for ele in stats["Team"]]
Basketball-reference has a lovely feature where if the team name contains a “*”, it denotes that the team made the playoffs. I cannot begin to tell you how much easier this makes it to figure out playoffs vs. not. Normally we would have to cross-reference the year with another table and it is just painful. This is…
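The asterisk check is easy to try on a handful of names. This is a small sketch with made-up rows; stripping the asterisk afterwards is an optional extra step that the scraping code above does not do.

```python
import pandas as pd

teams = pd.DataFrame(
    {"Team": ["Milwaukee Bucks*", "Chicago Bulls", "League Average"]}
)

# regex=False treats "*" as a literal character, not a regex quantifier.
teams["Playoffs"] = teams["Team"].str.contains("*", regex=False).astype(int)

# Optional cleanup: drop the marker once the flag has been recorded.
teams["Team"] = teams["Team"].str.rstrip("*")
print(teams)
```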

Feature Selection
As stated earlier, our features for this algorithm will be team per-game stats, and our labels are playoffs or not. I tried a number of ways of doing this, but to account for changes in pace and the growing emphasis on 3-point shooting across seasons, the best way I found to use a broader span of data is to rank the team values within each season. So instead of Teams X, Y, and Z having points per game of 105, 104, and 103, the algorithm sees only where each team ranks in that season’s league. I did also try scaling all data for each year to the league average for that season, but I got mixed results. So, if you decide to do this on your own, you can try other methods of preparing the data.
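Here is a minimal sketch of both ideas side by side, ranking within a season and scaling to the season average. The seasons, teams, and point totals are invented, and the column names `PTS_rank` and `PTS_rel` are hypothetical.

```python
import pandas as pd

# Invented points-per-game values from a slow-paced and a fast-paced season.
df = pd.DataFrame({
    "Season": [2005, 2005, 2005, 2019, 2019, 2019],
    "Team": ["X", "Y", "Z", "X", "Y", "Z"],
    "PTS": [99.0, 97.0, 95.0, 115.0, 112.0, 109.0],
})

# Option 1: rank each team within its own season.
df["PTS_rank"] = df.groupby("Season")["PTS"].rank()

# Option 2: express each value relative to that season's average.
df["PTS_rel"] = df["PTS"] / df.groupby("Season")["PTS"].transform("mean")
print(df)
```

Either way, a 99-point team in 2005 and a 115-point team in 2019 end up looking alike to the model, which is the point of the exercise.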
To start selecting features let’s separate teams in the playoffs from teams not in the playoffs. We do this by simply looking in the “Playoffs” column and grouping all the 0’s and 1’s. Note: lines 3 and 5 are there to remove “League Average” from consideration.
import matplotlib.pyplot as plt
noplayoffs = overallstats[overallstats['Playoffs'].astype(int) == 0]
noplayoffs = noplayoffs[~noplayoffs['Team'].str.contains("League Average")]
playoffs = overallstats[overallstats['Playoffs'].astype(int) == 1]
playoffs = playoffs[~playoffs['Team'].str.contains("League Average")]
Now let’s look at our data. A histogram plot will tell us how teams fare in all the categories we scraped.
fig, axs = plt.subplots(5,5, figsize=(16,16))
y=0
z=0
for stat in noplayoffs:
    if stat not in ['Team','G','MP','Playoffs']:
        axs[y,z].hist(noplayoffs[stat], bins = 8)
        axs[y,z].set_title(stat)
        if z < 5:
            z+=1
        if z == 5:
            z = 0
            y+=1
fig2, axs2 = plt.subplots(5,5, figsize=(16,16))
y=0
z=0
for stat in noplayoffs:
    if stat not in ['Team','G','MP','Playoffs']:
        axs2[y,z].hist(playoffs[stat], bins = 8)
        axs2[y,z].set_title(stat)
        if z < 5:
            z+=1
        if z == 5:
            z = 0
            y+=1
A couple of things to note: by default, the rank function in pandas gives rank 1 to the smallest value, so a larger number receives a larger rank number. The team that scores the most points ends up with the highest rank number, but so does the team that commits the most turnovers. Our algorithm can handle some data having an inverse relationship, so this is only confusing for us (not for the computer).
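The rank direction is easy to check on three values. This small sketch uses invented numbers; `ascending=False` is the pandas option to flip the direction if you would rather have 1 mean "most".

```python
import pandas as pd

pts = pd.Series([105.0, 104.0, 103.0], index=["X", "Y", "Z"])

# Default: the smallest value gets rank 1, the largest gets the top number.
default_rank = pts.rank()

# Flipped: the largest value gets rank 1.
descending_rank = pts.rank(ascending=False)

print(pd.DataFrame({"PTS": pts, "default": default_rank, "flipped": descending_rank}))
```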

Looking at the data presented, teams that make the playoffs tend to do better in FG, FG%, 3P made, 3P attempted, 2P%, DRB, TRB, STL, BLK, TOV, and PF.

The same trends present in playoff teams appear for non-playoff teams, except in the opposite direction. In order to prevent overtraining the data, I think it’s best if we remove any stats that are a function of another stat. So removing total rebounds and field goals/field goal percentages makes this data a little slimmer.
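One way to spot stats that are a function of other stats is to rebuild them and compare. This is a sketch on invented numbers, assuming TRB = ORB + DRB and FG% = FG / FGA (rounded to three decimals in the table); it illustrates the idea and does not reproduce the exact selection made here.

```python
import numpy as np
import pandas as pd

# Invented per-game values for three teams.
df = pd.DataFrame({
    "FG": [40.0, 38.5, 42.0],
    "FGA": [88.0, 86.0, 90.0],
    "FG%": [0.455, 0.448, 0.467],
    "ORB": [10.0, 11.5, 9.0],
    "DRB": [33.0, 31.0, 35.0],
    "TRB": [43.0, 42.5, 44.0],
    "STL": [7.5, 8.0, 6.9],
})

# Candidate formulas for derived columns.
derived = {
    "TRB": df["ORB"] + df["DRB"],
    "FG%": df["FG"] / df["FGA"],
}

# A column is redundant if its formula reproduces it up to rounding.
redundant = [
    name for name, rebuilt in derived.items()
    if np.allclose(df[name], rebuilt, atol=0.001)
]
keep = [col for col in df.columns if col not in redundant]
print("redundant:", redundant)
print("keep:", keep)
```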
That leaves us with the following stats: 3P%, 2P%, DRB, AST, STL, BLK, TOV, and PF. We also now can remove all the league averages as well. Let’s do that and save. (Note, we are also dropping the team names off the data input as well. The computer won’t know how to handle string objects like text, so there’s no point in keeping it).
overallstats = overallstats[~overallstats['Team'].str.contains("League Average")]
overallstats = overallstats[['3P%','2P%','DRB','AST','STL','BLK','TOV','PF','Playoffs']]
#save it to a file
overallstats.to_csv(f"statsfor{start}to{stop}.csv",index=False)
I will not detail it here, but we also need to scrape the 2019-2020 season stats to get a prediction for this season. The code is the same, just make the start 2020 and the stop 2021.
Splitting the Data
To train our algorithm we have to split the data into 3 parts: training data, validation data, and testing data. For our purposes I propose a 60:20:20 split for training : validation : testing (though this is not the only valid partitioning method). Without venturing too deeply into the subject, machine learning has a subtle dance between over- and undertraining an algorithm.

We need to be able to balance overtraining and undertraining. With a high enough complexity, the algorithm can simply memorize the data and regurgitate what it has seen before rather than actually making a decision. If there is not enough data to train the algorithm, that can create problems as well and give poorly-founded predictions.
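The memorization problem shows up clearly when you compare training accuracy with validation accuracy. This is a sketch on synthetic data from scikit-learn's `make_classification` with some label noise, using a single decision tree; none of it is NBA data, and the depths tried are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random (noise).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in [1, 3, None]:  # None lets the tree grow until it fits everything
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))
    print("max_depth:", depth, "train/val accuracy:", results[depth])
```

The unrestricted tree scores perfectly on the data it memorized and noticeably worse on data it has not seen, while the one-level tree cannot even fit the training set.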
To split the data we will need to run the following code:
import pandas as pd
from sklearn.model_selection import train_test_split
stats = pd.read_csv('statsfor2004to2020.csv')
features = stats.drop('Playoffs', axis=1)
labels = stats['Playoffs']
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
# This will test to see if we have a 60/20/20 split
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))
X_train.to_csv('train_features.csv', index=False)
X_val.to_csv('val_features.csv', index=False)
X_test.to_csv('test_features.csv', index=False)
y_train.to_csv('train_labels.csv', index=False)
y_val.to_csv('val_labels.csv', index=False)
y_test.to_csv('test_labels.csv', index=False)
This not only splits our data set, it randomizes it. Randomization is critical because it ensures that we do not bias the training set with older or newer data (and the same goes for the validation and testing sets too).
Training and Testing the Algorithm
To make life easy on everyone I am going to use an algorithm based upon RandomForestClassifier. If you want to understand how it works, I cannot do it justice compared to this article on Towards Data Science. By using a series of decision trees, RFC is able to process the data and make a decision accordingly. “Hyper-parameters” are a series of settings that change the complexity of your algorithm. Choosing the values of these parameters can itself be daunting, and it involves a trade-off between computing time and model accuracy. We’ve made some assumptions on hyper-parameters to start, but again, you can adjust them in your own time if you want to play with them! Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
tr_features = pd.read_csv('train_features.csv')
tr_labels = pd.read_csv('train_labels.csv')
val_features = pd.read_csv('val_features.csv')
val_labels = pd.read_csv('val_labels.csv')
te_features = pd.read_csv('test_features.csv')
te_labels = pd.read_csv('test_labels.csv')
rf1 = RandomForestClassifier(n_estimators=5, max_depth=10)
rf1.fit(tr_features, tr_labels.values.ravel())
rf2 = RandomForestClassifier(n_estimators=100, max_depth=10)
rf2.fit(tr_features, tr_labels.values.ravel())
rf3 = RandomForestClassifier(n_estimators=100, max_depth=None)
rf3.fit(tr_features, tr_labels.values.ravel())
We are now done training the algorithm. More specifically, we have trained 3 variants with different hyper-parameters. In doing so, we can run all three through validation and take the best one on to testing.
for mdl in [rf1, rf2, rf3]:
    y_pred = mdl.predict(val_features)
    accuracy = round(accuracy_score(val_labels, y_pred), 3)
    precision = round(precision_score(val_labels, y_pred), 3)
    recall = round(recall_score(val_labels, y_pred), 3)
    print('MAX DEPTH: {} / # OF EST: {} -- A: {} / P: {} / R: {}'.format(mdl.max_depth,
                                                                         mdl.n_estimators,
                                                                         accuracy,
                                                                         precision,
                                                                         recall))
Output: MAX DEPTH: 10 / # OF EST: 5 -- A: 0.719 / P: 0.667 / R: 0.773
MAX DEPTH: 10 / # OF EST: 100 -- A: 0.792 / P: 0.731 / R: 0.864
MAX DEPTH: None / # OF EST: 100 -- A: 0.823 / P: 0.776 / R: 0.864
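The A, P, and R columns are accuracy, precision, and recall. As a quick sketch of what those numbers mean, here they are computed by hand on ten invented predictions (the lists below are made up and are not the model's output).

```python
# 1 = playoffs, 0 = no playoffs. Invented truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # playoff teams we caught
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # playoff teams we missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correct rejections

accuracy = (tp + tn) / len(pairs)  # share of all calls that were right
precision = tp / (tp + fp)         # of the teams we picked, how many made it
recall = tp / (tp + fn)            # of the teams that made it, how many we picked
print(accuracy, precision, recall)
```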
In my evaluation, the most accurate algorithm is the third one. It also has the best precision and ties the second one for the best recall. You will want to check which one performs best for you, and choose accordingly (remember, this is randomly seeded so your results may vary!).
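Rather than comparing a few hand-picked variants, scikit-learn's `GridSearchCV` can try every combination for you. This is a sketch on synthetic data with an arbitrary small grid; it uses cross-validation instead of the separate validation file used above, so treat it as an alternative workflow rather than a drop-in replacement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of these values gets trained and scored.
param_grid = {"n_estimators": [5, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The winning combination is available as `search.best_estimator_`, already refit on all the data passed to `fit`.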
Testing the Algorithm
Now is when we cascade everything down to testing.
y_pred = rf3.predict(te_features)
accuracy = round(accuracy_score(te_labels, y_pred), 3)
precision = round(precision_score(te_labels, y_pred), 3)
recall = round(recall_score(te_labels, y_pred), 3)
print('MAX DEPTH: {} / # OF EST: {} -- A: {} / P: {} / R: {}'.format(rf3.max_depth,
                                                                     rf3.n_estimators,
                                                                     accuracy,
                                                                     precision,
                                                                     recall))
This will tell us the model’s accuracy on the test set, which in this case did not take a tremendous hit.
Output: MAX DEPTH: None / # OF EST: 100 -- A: 0.781 / P: 0.778 / R: 0.824
Cool. Now, let’s import the current season and run the prediction:
stats = pd.read_csv('statsfor2020to2021.csv')
testerset = stats.drop('Playoffs', axis=1)
y_pred = rf3.predict(testerset)
print(y_pred)
Your output will be an array with one entry per team (30 here). It’s a bunch of 1’s and 0’s:
Output: array([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
On its own this is meaningless, but follow this code to attach the team names and it’ll make sense:
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

predictions = pd.DataFrame(y_pred.T, columns = ['Playoffs'])
site = 'https://www.basketball-reference.com/leagues/NBA_{}.html'
tableID = 'team-stats-per_game'
url = site.format(2020)
html = requests.get(url).text
cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html),'lxml')
tableStats = cleaned_soup.find('table', {'id':tableID})
headers = [th.getText() for th in tableStats.findAll('tr')[0].findAll('th')]
headers = headers[1:]
rows = tableStats.findAll('tr')[1:]
teams = [[td.getText() for td in row.findAll('td')] for row in rows]
teams = pd.DataFrame(teams, columns = headers)
# Keep only the team names; copy() avoids a chained-assignment warning
teams = teams[['Team']].copy()
# Rows align by index, so the trailing "League Average" row gets NaN
teams['Playoffs'] = predictions
teams.to_csv("2020playoffteams.csv",index=False)
Alrighty then, now our teams dataframe makes sense. This is the output we should get from our csv:
Team | Playoffs |
Dallas Mavericks | 1 |
Milwaukee Bucks* | 1 |
Houston Rockets* | 1 |
Portland Trail Blazers | 1 |
Los Angeles Clippers* | 1 |
New Orleans Pelicans | 1 |
Washington Wizards | 0 |
Phoenix Suns | 0 |
Memphis Grizzlies | 1 |
Miami Heat* | 1 |
Boston Celtics* | 1 |
Los Angeles Lakers* | 1 |
San Antonio Spurs | 1 |
Atlanta Hawks | 0 |
Indiana Pacers* | 1 |
Toronto Raptors* | 1 |
Brooklyn Nets | 0 |
Utah Jazz* | 1 |
Denver Nuggets* | 1 |
Philadelphia 76ers* | 1 |
Sacramento Kings | 0 |
Oklahoma City Thunder* | 1 |
Orlando Magic | 1 |
Minnesota Timberwolves | 0 |
Detroit Pistons | 0 |
New York Knicks | 0 |
Cleveland Cavaliers | 0 |
Chicago Bulls | 0 |
Golden State Warriors | 0 |
Charlotte Hornets | 0 |
League Average | NaN |
Discussion
Our algorithm worked! And wouldn’t you know, it picked 18 teams to make the playoffs. Wait, how did it know about the bubble and playoff games? No, we did not build a sentient being here; this just shows some of the flaws with our approach (fine, blame me – my approach). First and foremost, we did not seed any conference data nor put any restrictions on how many teams can qualify. So, the algorithm spotted a trend between the 18 teams it chose and what it has seen of playoff teams before. The algorithm’s picks make sense, even though we still know a little better.
Case in point, the Brooklyn Nets: the Nets are a dreadful team in a weak conference. They’ve been plagued by injuries the entire year and one of their marquee players has yet to even put the uniform on and play a game. Yet, they are in the playoffs. Why? Well, they’re in the Eastern Conference. Contrast that with the San Antonio Spurs. They are a team that has been impossible to count out of the playoffs, and our algorithm picked them, hooray! Well, the Spurs back in March had one of the hardest schedules in the league, and that might be why their Win/Loss record does not reflect their performance.
Regardless of a specific pick, the victory here is that we seeded an algorithm with team data but no Win/Loss totals and it correctly picked ~14 playoff teams. It over-picked by 2 teams, sure, and left one off the board that it should not have, but this is an excellent start.
Above all, what I hope you learned from this is the following: data scraping from basketball-reference, basic feature selection, and some really simple machine learning. Some possible improvements for future versions of this study could be to seed ranking instead of playoffs/no playoffs and use the same process to create a power ranker. We could also try splitting the teams into their conferences, and maybe even adding restrictions such as only the top 8 per conference making it. But hey, for something that’ll take you an afternoon to code, this is kinda awesome!
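For anyone who wants to try the top-8-per-conference idea, here is a minimal sketch. It assumes you have a playoff probability per team (for example from the classifier's `predict_proba`) and a conference column, which the scraped table does not include. The teams, probabilities, and the helper name `pick_top_n` are all hypothetical.

```python
import pandas as pd

# Hypothetical playoff probabilities and conference labels.
teams = pd.DataFrame({
    "Team": ["A", "B", "C", "D", "E", "F"],
    "Conf": ["East", "East", "East", "West", "West", "West"],
    "Prob": [0.9, 0.4, 0.7, 0.8, 0.95, 0.6],
})

def pick_top_n(df, n):
    """Flag the n most likely playoff teams within each conference."""
    # Rank 1 = highest probability; method="first" breaks ties by row order.
    ranked = df.groupby("Conf")["Prob"].rank(ascending=False, method="first")
    return df.assign(Playoffs=(ranked <= n).astype(int))

# n=2 keeps the toy example small; the real league would use n=8.
picked = pick_top_n(teams, n=2)
print(picked)
```

With this constraint the model can no longer pick 18 teams, and a strong team in a deep conference can still miss out, just like in real life.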