# Francis Paesano
# Final Tutorial
# 12/21/2020
#
# In this tutorial I am going to teach you how to work with datasets in Python, specifically using the Pandas library.
# The tutorial analyzes NBA basketball statistics from 1999-2020 to see how a player's height affects their play. The
# dataset has stats from leagues all around the world, but I will only be using the NBA.
# In this tutorial I will compare stats like points scored, steals, blocks, rebounds, 3 pointers, and more of players of
# all heights to get an understanding of how height affects these stats and from there get an understanding of how it
# affects playing style.
#
# Dataset:
# https://www.kaggle.com/jacobbaruch/basketball-players-stats-per-season-49-leagues?select=players_stats_by_season_full_details.csv
import warnings
import scipy.stats as stats
import pandas as pd
import seaborn as sns
from scipy.stats import ttest_rel
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smols
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn import linear_model
import sklearn.datasets
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings('ignore')
# If the csv file of the dataset is in the notebooks folder, use pd.read_csv to load the data into a dataframe.
data = pd.read_csv("players_stats_by_season_full_details.csv")
# Here I am tidying the data by keeping only the NBA regular season rows and grabbing all the columns I will need
# throughout my analysis into a new dataframe titled "NBA". The only missing data I saw were some players' birth dates,
# which I don't need, so I didn't account for those missing values.
NBA = data[data['League'] == "NBA"]
# Filter the NBA rows (not the full dataset) so the league filter above is not thrown away.
NBA = NBA[NBA['Stage'] == "Regular_Season"]
NBA = NBA[['Season', 'Player', 'GP', 'MIN', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'TOV', 'PF', 'ORB', 'DRB',
'REB', 'AST', 'STL', 'BLK', 'PTS', 'birth_year', 'height', 'weight']]
# Since this file was saved as a csv in Excel, a height of 7-6 (6 feet 7 inches), for example, was changed to 7-Jun, so I
# had to fix this issue. To fix it, I used a for loop to go through every row in the dataframe NBA and replace any
# occurrence of May, Jun, or Jul (no other months because nobody is shorter than 5 feet or taller than 7'-11") with the
# number 5, 6, or 7 respectively. And if a player was 6'-0" or 7'-0", their height was changed to Jun-00 or Jul-00, so I
# replaced any occurrence of those with the right height.
for i, row in NBA.iterrows():
    NBA.loc[i, 'height'] = NBA.loc[i, 'height'].replace("Jun-00", "0-6")
    NBA.loc[i, 'height'] = NBA.loc[i, 'height'].replace("Jul-00", "0-7")
    NBA.loc[i, 'height'] = NBA.loc[i, 'height'].replace("May", "5")
    NBA.loc[i, 'height'] = NBA.loc[i, 'height'].replace("Jun", "6")
    NBA.loc[i, 'height'] = NBA.loc[i, 'height'].replace("Jul", "7")
# Then after that I converted the height column of this dataset, which is formatted as inches-feet, to just inches and
# created a column titled "height_in".
NBA[['inches', 'feet']] = NBA['height'].str.split("-", expand=True)
NBA['height_in'] = (12 * NBA['feet'].astype(int)) + NBA['inches'].astype(int)
NBA = NBA.drop(columns = ['inches', 'feet'])
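# The same repair can also be written as one small helper. This is only a sketch: height_to_inches and MONTH_TO_FEET are
# hypothetical names (not part of the dataset or of Pandas), and it assumes every height looks like "7-Jun" (inches, then
# the month that stands for feet) or like "Jun-00" (a whole number of feet).

```python
# Hypothetical lookup: the month Excel substituted for the feet value.
MONTH_TO_FEET = {"May": 5, "Jun": 6, "Jul": 7}


def height_to_inches(raw):
    """Convert an Excel-mangled height such as '7-Jun' or 'Jun-00' to total inches."""
    left, right = raw.split("-")
    if left in MONTH_TO_FEET:
        # 'Jun-00' style: the month comes first and the tutorial treats it as 0 inches.
        feet, inches = MONTH_TO_FEET[left], int(right)
    else:
        # '7-Jun' style: inches first, then the month that stands for feet.
        feet, inches = MONTH_TO_FEET[right], int(left)
    return 12 * feet + inches


demo_inches = [height_to_inches(h) for h in ["7-Jun", "Jun-00", "3-May", "6-Jul"]]
```

# Assuming no heights are missing, a helper like this could be applied to the whole column with
# NBA['height'].apply(height_to_inches).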
# Finally, I created a new dataframe titled "NBA_player_averages" and used an aggregate function to group the data by player
# so that I had only one set of statistics for each player and it would act as their career averages during the years that
# this dataset spans. My aggregate function took the average of only the statistics that should be averaged, like points, games
# played, and blocks, but not other ones, like birth year, name, and height.
NBA_player_averages = NBA[['Player', 'GP', 'MIN', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'TOV', 'PF', 'ORB', 'DRB',
'REB', 'AST', 'STL', 'BLK', 'PTS', 'birth_year', 'height', 'height_in', 'weight']]
aggregation_functions = {'Player': 'first', 'GP': 'mean', 'MIN': 'mean', 'FGM': 'mean', 'FGA': 'mean', '3PM': 'mean',
'3PA': 'mean', 'FTM': 'mean', 'FTA': 'mean', 'TOV': 'mean', 'PF': 'mean', 'ORB': 'mean',
'DRB': 'mean', 'REB': 'mean', 'AST': 'mean', 'STL': 'mean', 'BLK': 'mean', 'PTS': 'mean',
'birth_year': 'first', 'height': 'first', 'height_in': 'first', 'weight': 'first'}
NBA_player_averages = NBA_player_averages.groupby(NBA_player_averages['Player']).aggregate(aggregation_functions)
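# To see what the aggregation does, here is a minimal sketch on made-up numbers. The players "A" and "B" and the names
# demo_seasons and demo_career are assumptions for illustration only. Stats that change every season get 'mean', while
# values that stay fixed for a player get 'first'.

```python
import pandas as pd

# Two seasons for player A and one season for player B (made-up values).
demo_seasons = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "PTS": [100, 200, 50],
    "height_in": [79, 79, 72],
})

# One row per player: average the points, keep the first height seen.
demo_career = demo_seasons.groupby("Player").agg({"PTS": "mean", "height_in": "first"})
print(demo_career)
```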
# Players' heights can vary by a lot. The shortest player in this dataset is 5'-3" and the tallest is 7'-6". Height is a
# very big part of the game of basketball, and a lot of people wonder if it gives a player an unfair advantage. The first
# data visualization I will provide is a graph of how many players there are in the league at every height. The heights
# will be displayed in inches with the shortest player being 63 inches and the tallest 90 inches.
# First I found the tallest and shortest players by applying the max() and min() functions to the height_in column of
# NBA_player_averages. Then I created a dataframe titled "heights" with two columns: one has every possible height from
# the shortest to the tallest in increments of 1 inch, and the other holds the number of players at each height. To
# populate this dataframe I use a for loop to go through all the players in NBA_player_averages and add 1 to the row that
# corresponds to each player's height. Finally I plot it as a bar graph.
shortest = NBA_player_averages['height_in'].min(axis = 0)
tallest = NBA_player_averages['height_in'].max(axis = 0)
heights = {'height': range(shortest, tallest + 1, 1), 'num': 0}
heights = pd.DataFrame(data = heights)
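for i, row in NBA_player_averages.iterrows():
    temp = NBA_player_averages.loc[i, 'height_in']
    # Offset by the shortest height (instead of a hard-coded 63) so the shortest player lands in row 0.
    heights.iat[temp - shortest, 1] = heights.iat[temp - shortest, 1] + 1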
heights.plot.bar(x = 'height', y = 'num', rot = 0)
plt.title("Number of Players at Every Height")
plt.xlabel("Height")
plt.ylabel("Number of Players")
plt.show()
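# The counting loop above can also be done without a loop. This is a sketch of an alternative, not the method used in
# this tutorial: value_counts() counts each height, and reindex() fills in heights that nobody has with 0. The heights in
# demo_heights are made up.

```python
import pandas as pd

# Made-up heights in inches for five players.
demo_heights = pd.Series([72, 75, 75, 78, 75])

# Every whole-inch height from the shortest to the tallest player.
all_heights = range(int(demo_heights.min()), int(demo_heights.max()) + 1)

# Count players at each height, inserting 0 for heights with no players.
demo_counts = demo_heights.value_counts().reindex(all_heights, fill_value=0)
print(demo_counts)
```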
# Height varies a lot amongst players, and what you can see from the graph of players' heights is that height follows
# roughly a bell curve, with most of the players in the league sitting somewhere in the middle of the range. Still,
# there are some players who are exceptionally taller than others, and some people wonder if that gives them an unfair
# advantage. In the coming sections, I will analyze different stats, including points, to see if being taller allows a
# player to do better in the game.
# How has the average height of a player in the NBA changed over the last 20 years, from the 1999-2000 season to the
# 2019-2020 season?
# Here I'll create a dataframe titled "NBA_average_height" which has only the "Season" and "height_in" columns from the
# NBA dataframe we created earlier. To get the average height from every year, I applied an aggregation function that
# grouped the data by season and got the mean height.
NBA_average_height = NBA[['Season', 'height_in']]
aggregation_functions = {'Season': 'first', 'height_in': 'mean'}
NBA_average_height = NBA_average_height.groupby(NBA_average_height['Season']).aggregate(aggregation_functions)
xData = NBA_average_height['Season'].values
yData = NBA_average_height['height_in'].values
plt.plot(xData, yData, 'o')
plt.title("Average player height each season")
plt.xlabel("Season")
plt.ylabel("Average Height")
plt.xticks(rotation = 90)
plt.ylim(50, 100)
plt.show()
# As it turns out, the average height of a player in the NBA has not changed much at all over the years. It goes up some
# years and goes down some years, but the change has never been more than an inch or so.
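# One way to put a number on "has not changed much" is to take the gap between the tallest and shortest season averages.
# The sketch below uses made-up rows; the season labels and the names demo_rows, demo_season_means, and demo_spread are
# assumptions for illustration.

```python
import pandas as pd

# Made-up player rows: two players in each of two seasons.
demo_rows = pd.DataFrame({
    "Season": ["1999 - 2000", "1999 - 2000", "2000 - 2001", "2000 - 2001"],
    "height_in": [78, 80, 79, 80],
})

# Mean height per season, then the gap between the highest and lowest season mean.
demo_season_means = demo_rows.groupby("Season")["height_in"].mean()
demo_spread = demo_season_means.max() - demo_season_means.min()
print(demo_spread)
```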
# How does height affect field goals attempted?
#
# Field goals are any shot taken by a player that was not a free throw. So it includes 3 pointers as well as 2 pointers.
#
# I gathered the information I would need from NBA_player_averages into a dataframe titled height_fieldGoals. I want to
# get player's field goal attempts per game, so I created a new column in height_fieldGoals called "FGA per game" and
# that column will have the player's field goal attempts divided by how many games they played. Finally I only want to
# consider people who have significant play time in games because if you don't play much, your stats will be low for
# that reason instead of any independent variable that we are measuring. I have chosen 32 minutes as the cutoff, so I
# only take people who have 32 minutes or more played per game.
height_fieldGoals = NBA_player_averages[['Player', 'FGA', 'height_in', 'GP', 'MIN']]
height_fieldGoals['FGA per game'] = height_fieldGoals['FGA']/height_fieldGoals['GP']
height_fieldGoals['MINS per game'] = height_fieldGoals['MIN']/height_fieldGoals['GP']
height_fieldGoals = height_fieldGoals[height_fieldGoals['MINS per game'] >= 32]
xData = height_fieldGoals['height_in'].values
yData = height_fieldGoals['FGA per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Field Goal Attempts per Game and Height")
plt.xlabel("Height")
plt.ylabel("Field goal attempts")
plt.show()
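# Correlate the two columns directly; DataFrame.corr() fails on the non-numeric 'Player' column in pandas 2.0 and later.
correlation = height_fieldGoals['FGA per game'].corr(height_fieldGoals['height_in'])
print(round(correlation, 2))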
# With a correlation of -.1, there is not a significant correlation between field goals attempted and height. So taller
# players do not attempt more field goals.
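# The correlation printed above is the Pearson correlation coefficient. As a sketch of what it measures, the hypothetical
# helper pearson_r below computes it by hand with NumPy on made-up numbers and fits the same kind of line that np.polyfit
# draws on the graphs. The demo_ values are not taken from the dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: the co-variation of x and y scaled to the range -1 to 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up heights (inches) and a per-game stat that falls as height rises.
demo_height = [72, 75, 78, 81, 84]
demo_stat = [8.0, 6.5, 5.0, 3.0, 2.5]

demo_r = pearson_r(demo_height, demo_stat)
demo_slope, demo_intercept = np.polyfit(demo_height, demo_stat, 1)
print(round(demo_r, 2), round(demo_slope, 2))
```

# The sign of the correlation always matches the sign of the fitted slope: a line that goes down means a negative
# correlation, and a flat line means a correlation near 0.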
# How does height affect points scored?
#
# Just like for field goals, I created a column titled "PTS per game" which has the players points divided by number of
# games played. And I will again only be considering players who play at least 32 minutes a game.
height_points = NBA_player_averages[['Player', 'PTS', 'height_in', 'GP', 'MIN']]
height_points['PTS per game'] = height_points['PTS']/height_points['GP']
height_points['MINS per game'] = height_points['MIN']/height_points['GP']
height_points = height_points[height_points['MINS per game'] >= 32]
xData = height_points['height_in'].values
yData = height_points['PTS per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Points Scored per Game and Height")
plt.xlabel("Height")
plt.ylabel("Points Scored")
plt.show()
correlation = height_points['PTS per game'].corr(height_points['height_in'])
print(round(correlation, 2))
# With a correlation of .01, there is not a significant correlation between points scored and height.
# Height clearly does not affect points scored. The linear regression line for this graph is flat. Taller players
# do not necessarily score more points.
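# The word "significant" is used informally in this tutorial. A sketch of one way to check it more formally is
# scipy.stats.pearsonr, which returns the correlation together with a p-value. The numbers below are made up, and the 0.05
# cutoff is a common convention that I am assuming here, not something taken from the dataset.

```python
from scipy import stats

# Made-up heights (inches) and points per game for eight players.
demo_height = [72, 74, 75, 77, 78, 80, 81, 83]
demo_points = [21.0, 18.5, 22.0, 19.0, 20.5, 18.0, 21.5, 19.5]

# pearsonr returns the correlation coefficient and the two-sided p-value.
demo_r, demo_p = stats.pearsonr(demo_height, demo_points)

# Assumed cutoff: call the correlation significant only if the p-value is below 0.05.
demo_is_significant = demo_p < 0.05
print(round(demo_r, 2), demo_is_significant)
```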
# How does height affect 3 pointers attempted?
#
# For this statistic, I don't want to show how many 3 pointers players take, but what percentage of their shots are 3
# pointers. So I'll have a column for 3 pointers per game, a column for field goals per game, and using those values,
# I'll create a column for their percentage of shots taken that are 3 pointers. It will have their 3 pointers per game
# divided by field goals per game.
height_3points = NBA_player_averages[['Player', '3PA', 'height_in', 'GP', 'MIN', 'FGA']]
height_3points['3PA per game'] = height_3points['3PA']/height_3points['GP']
height_3points['FGA per game'] = height_3points['FGA']/height_3points['GP']
height_3points['3PA/FGA'] = height_3points['3PA per game']/height_3points['FGA per game']
height_3points['MINS per game'] = height_3points['MIN']/height_3points['GP']
height_3points = height_3points[height_3points['MINS per game'] >= 32]
xData = height_3points['height_in'].values
yData = height_3points['3PA/FGA'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Share of Shots That Are 3 Pointers and Height")
plt.xlabel("Height")
plt.ylabel("Percentage of 3 pointers out of total shots")
plt.show()
correlation = height_3points['3PA/FGA'].corr(height_3points['height_in'])
print(round(correlation, 2))
# With a correlation of -.45, there is a significant correlation between the share of shots that are 3 pointers and height.
# Height does affect 3 pointers attempted. Taller players attempt fewer 3 pointers while shorter players attempt more.
# How does height affect assists made?
#
# This data (assists), as well as steals, blocks, and rebounds which come after, are all measured the same way I measured
# field goals and points. I'll put the stats I need from NBA_player_averages into a new dataframe with a title that
# tells me what I'm measuring (in this case titled "height_assists"). I'll create a new column for their numbers per
# game which I'll get by dividing the statistic by their number of games played. Then I find minutes played per game by
# doing total minutes divided by number of games played, and for the final data, I'll only consider players who play at
# least 32 minutes a game.
height_assists = NBA_player_averages[['Player', 'AST', 'height_in', 'GP', 'MIN']]
height_assists['AST per game'] = height_assists['AST']/height_assists['GP']
height_assists['MINS per game'] = height_assists['MIN']/height_assists['GP']
height_assists = height_assists[height_assists['MINS per game'] >= 32]
xData = height_assists['height_in'].values
yData = height_assists['AST per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Assists Made per Game and Height")
plt.xlabel("Height")
plt.ylabel("Assists Made")
plt.show()
correlation = height_assists['AST per game'].corr(height_assists['height_in'])
print(round(correlation, 2))
# With a correlation of -.63, there is a significant correlation between assists made and height.
# Height does affect assists made. Shorter players make more assists and taller players make fewer. This suggests that
# the shorter players are passing to the taller players more than the other way around.
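# Since assists, steals, blocks, and rebounds all follow the same steps, the steps could be wrapped in one function. This
# is a sketch only: per_game is a hypothetical helper that is not used elsewhere in this tutorial, and the demo_players
# numbers are made up.

```python
import pandas as pd


def per_game(df, stat, min_minutes=32):
    """Add per-game columns for one stat and keep players at or above min_minutes per game."""
    out = df[[stat, "height_in", "GP", "MIN"]].copy()
    out[stat + " per game"] = out[stat] / out["GP"]
    out["MINS per game"] = out["MIN"] / out["GP"]
    return out[out["MINS per game"] >= min_minutes]


# Made-up totals for three players.
demo_players = pd.DataFrame({
    "AST": [400, 100, 300],
    "height_in": [74, 82, 78],
    "GP": [80, 50, 60],
    "MIN": [2800, 1000, 1920],
})

# The second player averages 20 minutes a game, so they are dropped.
demo_assists_pg = per_game(demo_players, "AST")
print(demo_assists_pg)
```

# With a helper like this, each section would only need to change the stat name, for example per_game(df, "STL").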
# How does height affect steals made?
height_steals = NBA_player_averages[['Player', 'STL', 'height_in', 'GP', 'MIN']]
height_steals['STL per game'] = height_steals['STL']/height_steals['GP']
height_steals['MINS per game'] = height_steals['MIN']/height_steals['GP']
height_steals = height_steals[height_steals['MINS per game'] >= 32]
xData = height_steals['height_in'].values
yData = height_steals['STL per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Steals Made per Game and Height")
plt.xlabel("Height")
plt.ylabel("Steals Made")
plt.show()
correlation = height_steals['STL per game'].corr(height_steals['height_in'])
print(round(correlation, 2))
# With a correlation of -.44, there is a significant correlation between steals made and height.
# Height has a moderate effect on steals made. Shorter players make more steals and taller players make fewer.
# How does height affect rebounds made?
height_rebounds = NBA_player_averages[['Player', 'REB', 'height_in', 'GP', 'MIN']]
height_rebounds['REB per game'] = height_rebounds['REB']/height_rebounds['GP']
height_rebounds['MINS per game'] = height_rebounds['MIN']/height_rebounds['GP']
height_rebounds = height_rebounds[height_rebounds['MINS per game'] >= 32]
xData = height_rebounds['height_in'].values
yData = height_rebounds['REB per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Rebounds Made per Game and Height")
plt.xlabel("Height")
plt.ylabel("Rebounds Made")
plt.show()
correlation = height_rebounds['REB per game'].corr(height_rebounds['height_in'])
print(round(correlation, 2))
# With a correlation of .74, there is a significant correlation between rebounds made and height.
# Height has a large effect on rebounds. Taller players make far more rebounds than shorter players.
# How does height affect blocks made?
height_blocks = NBA_player_averages[['Player', 'BLK', 'height_in', 'GP', 'MIN']]
height_blocks['BLK per game'] = height_blocks['BLK']/height_blocks['GP']
height_blocks['MINS per game'] = height_blocks['MIN']/height_blocks['GP']
height_blocks = height_blocks[height_blocks['MINS per game'] >= 32]
xData = height_blocks['height_in'].values
yData = height_blocks['BLK per game'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Blocks Made per Game and Height")
plt.xlabel("Height")
plt.ylabel("Blocks Made")
plt.show()
correlation = height_blocks['BLK per game'].corr(height_blocks['height_in'])
print(round(correlation, 2))
# With a correlation of .65, there is a significant correlation between blocks made and height.
# Height has a large effect on blocks. Taller players make far more blocks than shorter players.
# How does height affect free throw percentage?
#
# For free throw percentage, I have to get both free throw attempts and free throws made from NBA_player_averages, and
# divide free throws made by free throw attempts to get each player's free throw percentage. That is, what percentage of
# their free throws go in.
height_freeThrows = NBA_player_averages[['Player', 'FTA', 'FTM', 'height_in', 'GP', 'MIN']]
height_freeThrows['FTA per game'] = height_freeThrows['FTA']/height_freeThrows['GP']
height_freeThrows['FTM per game'] = height_freeThrows['FTM']/height_freeThrows['GP']
height_freeThrows['FTM/FTA'] = height_freeThrows['FTM']/height_freeThrows['FTA']
height_freeThrows['MINS per game'] = height_freeThrows['MIN']/height_freeThrows['GP']
height_freeThrows = height_freeThrows[height_freeThrows['MINS per game'] >= 32]
xData = height_freeThrows['height_in'].values
yData = height_freeThrows['FTM/FTA'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("Free Throw Percentage and Height")
plt.xlabel("Height")
plt.ylabel("Free Throw Percentage")
plt.show()
correlation = height_freeThrows['FTM/FTA'].corr(height_freeThrows['height_in'])
print(round(correlation, 2))
# With a correlation of -.37, there is a moderate correlation between free throw percentage and height.
# Height has a moderate effect on free throw percentage. Shorter players are slightly better at shooting free throws.
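# A note on the arithmetic: NBA_player_averages holds the mean of each season's totals, so dividing mean FTM by mean FTA
# gives the same number as dividing career makes by career attempts (the number of seasons cancels out). That is not the
# same as averaging each season's percentage. The sketch below shows the difference on two made-up seasons.

```python
# Made-up (free throws made, free throws attempted) for two seasons.
demo_seasons = [(80, 100), (10, 50)]
n = len(demo_seasons)

total_made = sum(made for made, _ in demo_seasons)
total_attempted = sum(attempted for _, attempted in demo_seasons)

# Career percentage: all makes divided by all attempts.
pooled_pct = total_made / total_attempted

# What this tutorial computes: mean makes divided by mean attempts.
ratio_of_means = (total_made / n) / (total_attempted / n)

# A different quantity: the average of the two season percentages.
mean_of_season_pcts = sum(made / attempted for made, attempted in demo_seasons) / n

print(pooled_pct, ratio_of_means, mean_of_season_pcts)
```

# The first two values agree, while the third gives the low-volume season the same weight as the high-volume one.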
# How does height affect three point percentage?
#
# For 3 point percentage I do the same thing as free throw percentage but for 3 pointers. 3 pointers made divided by
# 3 pointers attempted.
height_3pointPercent = NBA_player_averages[['Player', '3PA', '3PM', 'height_in', 'GP', 'MIN']]
height_3pointPercent['3PA per game'] = height_3pointPercent['3PA']/height_3pointPercent['GP']
height_3pointPercent['3PM per game'] = height_3pointPercent['3PM']/height_3pointPercent['GP']
height_3pointPercent['3PM/3PA'] = height_3pointPercent['3PM']/height_3pointPercent['3PA']
height_3pointPercent['MINS per game'] = height_3pointPercent['MIN']/height_3pointPercent['GP']
height_3pointPercent = height_3pointPercent[height_3pointPercent['MINS per game'] >= 32]
xData = height_3pointPercent['height_in'].values
yData = height_3pointPercent['3PM/3PA'].values
m, b = np.polyfit(xData, yData, 1)
plt.plot(xData, m * xData + b)
plt.plot(xData, yData, 'o')
plt.title("3 Point Percentage and Height")
plt.xlabel("Height")
plt.ylabel("3 Point Percentage")
plt.show()
correlation = height_3pointPercent['3PM/3PA'].corr(height_3pointPercent['height_in'])
print(round(correlation, 2))
# With a correlation of -.35, there is a moderate correlation between 3 point percentage and height.
# Height has a moderate effect on 3 point percentage. Shorter players are slightly better at shooting 3 pointers.
# Players who are taller do not necessarily score more points. However, taller players make fewer assists, attempt fewer
# 3 pointers, make a lower percentage of their 3 pointers and free throws, make fewer steals, and make more rebounds and
# blocks. This gives us a lot of insight into the different play styles of short versus tall players.
#
# Taller players attempt fewer 3 pointers because they are better suited for getting closer to the basket and making
# contact with the other players near the basket. Shorter players attempt more 3 pointers because they can easily get
# stopped by the tall players who are close to the basket, so they tend to stay further away.
#
# Taller players make fewer assists and shorter players make more. This is likely because taller players (who stay closer
# to the rim, as we learned from the 3 point shot statistics) are not likely to pass to the shorter players further from
# the rim, but shorter players are encouraged to pass to the bigger players at the basket.
#
# Taller players make far more blocks and rebounds than shorter players because those plays tend to happen closer to the
# rim where taller players spend their time. However, steals, made more by shorter players, can happen anywhere on the
# court. This is why there is a moderate trend of shorter players making more steals, but not a large trend.
#
# Taller players make a slightly lower percentage of their free throws, most likely because they are used to shooting
# close shots to the basket, and therefore are not as accurate from distance. Taller players are also slightly worse at
# shooting 3 pointers for the same reason.
#
#
# What all this data tells us is that taller players are more geared towards defensive play since they are so much better
# at rebounds and blocks, which are important statistics for defense. Shorter players are better at steals, another
# defensive stat, but that moderate trend does not overshadow the strong correlation of tall players to blocks and
# rebounds.
# Shorter players are slightly more geared towards offensive play, as making more assists makes them strong playmakers.
# They are also better shooters for 3 pointers and free throws, which score points and are of course an offensive
# objective.
#
# What we can conclude is that players who are taller do not receive an unfair advantage as a result, because they do not
# score more points, and in fact their shooting percentages are worse than their shorter teammates'. Supporting this claim
# is the fact that the average height of a player has not changed over the years, because if being taller did grant you
# an advantage, we would probably see players slowly getting taller every year as a result.
#
# People who still wonder if being taller makes players better can read this tutorial and find out that that is not true.
# People who also do not know much about basketball and how height affects play style can read this tutorial and learn a
# lot about the subject.
#
#
# From this tutorial, you have learned how to:
# Load a downloaded csv dataset into a Pandas dataframe
# Tidy the data and isolate the data from the dataset that you want
# Analyze the data and create visual representations of the data
# Perform linear regressions on the data and add it to graphs
# Find correlation between two variables and use that to make conclusions
#
# The dataset can be viewed here:
# https://www.kaggle.com/jacobbaruch/basketball-players-stats-per-season-49-leagues?select=players_stats_by_season_full_details.csv
# And another article on height and how it affects basketball players can be found here:
# https://www.livestrong.com/article/363066-is-height-important-in-basketball/