STAT 451 Project Proposal

Names: Jackson Cramer, Joshua DeRuyter, Yingying Xie, Siti Aleeya Nuha Roslee

Read Data

In [2]:

import pandas as pd

df = pd.read_csv("games.csv")
df.head()

Out[2]:

	GAME_DATE_EST	GAME_ID	GAME_STATUS_TEXT	HOME_TEAM_ID	VISITOR_TEAM_ID	SEASON	TEAM_ID_home	PTS_home	FG_PCT_home	FT_PCT_home	...	AST_home	REB_home	TEAM_ID_away	PTS_away	FG_PCT_away	FT_PCT_away	FG3_PCT_away	AST_away	REB_away	HOME_TEAM_WINS
0	2022-12-22	22200477	Final	1610612740	1610612759	2022	1610612740	126.0	0.484	0.926	...	25.0	46.0	1610612759	117.0	0.478	0.815	0.321	23.0	44.0	1
1	2022-12-22	22200478	Final	1610612762	1610612764	2022	1610612762	120.0	0.488	0.952	...	16.0	40.0	1610612764	112.0	0.561	0.765	0.333	20.0	37.0	1
2	2022-12-21	22200466	Final	1610612739	1610612749	2022	1610612739	114.0	0.482	0.786	...	22.0	37.0	1610612749	106.0	0.470	0.682	0.433	20.0	46.0	1
3	2022-12-21	22200467	Final	1610612755	1610612765	2022	1610612755	113.0	0.441	0.909	...	27.0	49.0	1610612765	93.0	0.392	0.735	0.261	15.0	46.0	1
4	2022-12-21	22200468	Final	1610612737	1610612741	2022	1610612737	108.0	0.429	1.000	...	22.0	47.0	1610612741	110.0	0.500	0.773	0.292	20.0	47.0	0

5 rows × 21 columns

Data Description:

This dataset, titled "NBA Games Data," was obtained from Kaggle. Each row corresponds to one game from the 2003 to 2022 seasons, and each column records an identifier or a statistic for that game.

Link to dataset: https://www.kaggle.com/datasets/nathanlauga/nba-games?resource=download

Descriptions of the Questions

Which statistics from an NBA game are most valuable for predicting whether the home team wins?

Can we build a model from these features that predicts whether the home team wins a future game?

Descriptions of the Variables

FG_PCT_home - field goal percentage of the home team

FT_PCT_home - free throw percentage of the home team

FG3_PCT_home - three-point percentage of the home team

AST_home - assists by the home team

REB_home - rebounds by the home team

FG_PCT_away - field goal percentage of the away team

FT_PCT_away - free throw percentage of the away team

FG3_PCT_away - three-point percentage of the away team

AST_away - assists by the away team

REB_away - rebounds by the away team

HOME_TEAM_WINS - 1 if the home team won the game, 0 otherwise (the response variable)
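
As a quick consistency check on the response variable, the sketch below copies PTS_home, PTS_away, and HOME_TEAM_WINS from the five rows displayed in the df.head() output and confirms that the label matches PTS_home > PTS_away. Only these five rows are checked. The name shown_rows is ours, and we are assuming (not verifying) that the same pattern holds for the rest of the file.

```python
# (PTS_home, PTS_away, HOME_TEAM_WINS) copied from the five displayed rows.
shown_rows = [
    (126.0, 117.0, 1),
    (120.0, 112.0, 1),
    (114.0, 106.0, 1),
    (113.0, 93.0, 1),
    (108.0, 110.0, 0),
]

# Recompute the label from the scores and compare it with the stored value.
mismatches = [
    row for row in shown_rows
    if int(row[0] > row[1]) != row[2]
]
print(len(mismatches))  # 0
```

On the full DataFrame, the same comparison could be written as (df["PTS_home"] > df["PTS_away"]).astype(int) and checked against df["HOME_TEAM_WINS"].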

Descriptions of the Methods

Logistic Regression - we will use this to model the probability, between 0 and 1, that HOME_TEAM_WINS equals 1 given the game statistics.
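
To make the 0-to-1 mapping concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to a linear combination of the features. The intercept and coefficients are made-up placeholders, not fitted values, and predict_home_win_probability is a hypothetical helper name introduced only for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Rearranged form avoids overflow in exp() for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_home_win_probability(features, coefficients, intercept):
    """Return P(HOME_TEAM_WINS = 1) for one game under a logistic model."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return sigmoid(z)

# Placeholder coefficients for illustration only (assumed, not estimated).
coefficients = {"FG_PCT_home": 10.0, "FG_PCT_away": -10.0}
# Field goal percentages taken from the first displayed row.
game = {"FG_PCT_home": 0.484, "FG_PCT_away": 0.478}

p = predict_home_win_probability(game, coefficients, intercept=0.0)
print(0.0 < p < 1.0)  # True
```

Fitting the model means choosing the intercept and coefficients that best match the observed HOME_TEAM_WINS values; the function above only shows how a fitted model would turn one game's statistics into a probability.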

Lasso Regression - we will use its L1 penalty, which shrinks some coefficients to exactly zero, to select the features used to predict the probability that the home team wins.
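
The sketch below illustrates why lasso can act as a feature selector. It is a dependency-free coordinate descent on made-up toy data with a numeric response and no intercept, not the procedure we will run on games.csv, where a library implementation would be the natural choice. The function names, the toy values, and the penalty alpha=0.5 are all assumptions for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it does not survive."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_passes=100):
    """Minimize (1/(2n)) * sum((y - Xb)^2) + alpha * sum(|b|), no intercept."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_passes):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                prediction = sum(coef[k] * X[i][k] for k in range(p))
                partial_residual = y[i] - prediction + coef[j] * X[i][j]
                rho += X[i][j] * partial_residual
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coef[j] = soft_threshold(rho, alpha) / scale
    return coef

# Toy, centered data (made up): y = 3 * strong + 0.25 * weak.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.75, -3.25, 0.0, 2.75, 6.25]

coef = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(selected)  # [0]
```

With the penalty in place, the weak second feature receives a coefficient of exactly zero and only the first feature is kept. Setting alpha=0.0 recovers both original coefficients, which shows that the selection comes from the penalty itself.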

Model Selection - after feature selection, we will fit several algorithms (decision tree, logistic regression, SVM) and compare them using a training, validation, and test split to determine which is most effective.
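
One possible way to set up the three-way split is sketched below. The 60/20/20 proportions, the fixed seed, and the helper name train_val_test_split are assumptions rather than decisions we have made. The idea is to fit each candidate model on the training rows, compare the candidates on the validation rows, and report the chosen model's performance once on the test rows.

```python
import random

def train_val_test_split(n_rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The index lists can then be passed to df.iloc to pull out the corresponding rows of the DataFrame.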