
STAT 451 Project Proposal

Group 3: Anas Al-Rasbi, Thilak Raj Murugan, Tyler Avret, Tyler Wilson, ZK Zhao

Topic: Diabetes Health Indicators

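Data:

The dataset comes from Kaggle.com. It is a cleaned dataset of 253,680 survey responses from the 2015 BRFSS. It has 21 feature variables plus the diabetes outcome, for 22 columns in total. The features range from binary indicators (e.g., HighBP, Smoker) to binned ordinal categories (e.g., Age, Education, Income), along with numeric values such as BMI.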

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_health_indicators_BRFSS2015.csv

In [11]:

import pandas as pd

df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')
df.head()

Out[11]:

	HighBP	HighChol	CholCheck	BMI	Smoker	PhysActivity	Fruits	...	AnyHealthcare	NoDocbcCost	GenHlth	MentHlth	PhysHlth	DiffWalk	Age	Education	Income
0	1.0	1.0	1.0	40.0	1.0	0.0	0.0	...	1.0	0.0	5.0	18.0	15.0	1.0	9.0	4.0	3.0
1	0.0	0.0	0.0	25.0	1.0	1.0	0.0	...	0.0	1.0	3.0	0.0	0.0	0.0	7.0	6.0	1.0
2	1.0	1.0	1.0	28.0	0.0	0.0	1.0	...	1.0	1.0	5.0	30.0	30.0	1.0	9.0	4.0	8.0
3	1.0	0.0	1.0	27.0	0.0	1.0	1.0	...	1.0	0.0	2.0	0.0	0.0	0.0	11.0	3.0	6.0
4	1.0	1.0	1.0	24.0	0.0	1.0	1.0	...	1.0	0.0	2.0	3.0	0.0	0.0	11.0	5.0	4.0

5 rows × 22 columns
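
As a quick check on the description above, the number of distinct values per column can separate binary indicators from binned or numeric features. The sketch below is only illustrative: `summarize_columns` is a hypothetical helper, and the small `sample` frame reuses three columns and their values from the preview rather than loading the full CSV.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct values per column and give each column a rough label."""
    n_unique = df.nunique()
    # Exactly two distinct values -> binary indicator; anything else is multi-level.
    kind = n_unique.map(lambda n: "binary" if n == 2 else "multi-level")
    return pd.DataFrame({"n_unique": n_unique, "kind": kind})


# Stand-in frame built from the five preview rows (three columns only).
sample = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0, 1.0, 1.0],
    "BMI": [40.0, 25.0, 28.0, 27.0, 24.0],
    "Age": [9.0, 7.0, 9.0, 11.0, 11.0],
})

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(df)` on the full frame loaded above would give the same summary for all 22 columns.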

Questions:

The question we would like to answer is: which of the features best predict diabetes?

Methods:

Group 3 will use classification algorithms to determine which features best predict whether a respondent has diabetes. The algorithms we intend to use are SVM, decision tree, k-NN, and logistic regression.
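
One possible way to compare the four algorithms is cross-validated scoring with scikit-learn, sketched below. Several choices here are assumptions rather than part of the proposal: a linear SVM (`LinearSVC`) stands in for SVM because kernel SVMs can be slow on a few hundred thousand rows, the hyperparameters are placeholders, and the demo runs on synthetic data from `make_classification` instead of the survey data. The helper names `build_models` and `compare_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_models():
    """Return the four candidate classifiers (placeholder hyperparameters)."""
    return {
        # Scale features for the distance- and margin-based models.
        "SVM (linear)": make_pipeline(StandardScaler(), LinearSVC(random_state=0)),
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }


def compare_models(X, y, cv=5, scoring="accuracy"):
    """Mean cross-validated score for each candidate model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
        for name, model in build_models().items()
    }


# Synthetic stand-in data so the sketch runs without the CSV file.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the outcome classes turn out to be imbalanced, passing `scoring="balanced_accuracy"` or `scoring="roc_auc"` may give a fairer comparison than plain accuracy.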

Variables:

The variables of interest are high blood pressure, high cholesterol, BMI, smoking status, physical activity, heavy alcohol consumption, sex, and age.
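
As a rough illustration of how these variables could be ranked, the sketch below fits a logistic regression on standardized features and sorts the coefficients by absolute size. This is one of several possible ranking approaches, not necessarily the one the group will adopt. The column names `HvyAlcoholConsump`, `Sex`, and `Diabetes_binary` are assumed from the Kaggle file and are not visible in the truncated preview; `rank_features` is a hypothetical helper, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumed column names for the variables of interest and the outcome.
FEATURES = ["HighBP", "HighChol", "BMI", "Smoker",
            "PhysActivity", "HvyAlcoholConsump", "Sex", "Age"]
TARGET = "Diabetes_binary"


def rank_features(df, features=FEATURES, target=TARGET):
    """Rank features by the size of their standardized logistic coefficients."""
    X = StandardScaler().fit_transform(df[features])
    model = LogisticRegression(max_iter=1000).fit(X, df[target])
    coefs = pd.Series(model.coef_[0], index=features)
    # Largest absolute coefficient first; the sign shows the direction of the effect.
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)


# Synthetic stand-in: the outcome depends only on BMI and HighBP by construction.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 2, size=(500, len(FEATURES))).astype(float),
    columns=FEATURES,
)
demo["BMI"] = rng.normal(28, 6, size=500)
demo[TARGET] = ((demo["BMI"] > 30) | (demo["HighBP"] == 1)).astype(int)

ranking = rank_features(demo)
print(ranking)
```

With the real data, `rank_features(df)` would apply the same steps to the frame loaded earlier, provided the assumed column names match.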
