STAT 451 Project Proposal
Group 3: Anas Al-Rasbi, Thilak Raj Murugan, Tyler Avret, Tyler Wilson, ZK Zhao
Topic: Diabetes Health Indicators
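Data:
The dataset comes from Kaggle (file diabetes_binary_health_indicators_BRFSS2015.csv). It is a cleaned dataset of 253,680 survey responses with 22 columns: the target Diabetes_binary and 21 feature variables. The features range from binary indicators (for example HighBP) to binned classes (for example Age, Education, and Income); BMI is stored as a numeric value.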
In [11]:
import pandas as pd
df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')
df.head()
Out[11]:
| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |
5 rows × 22 columns
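To check the mix of binary and multi-level columns described in the Data section, one option is to count the distinct values in each column. The sketch below is only illustrative: the helper name describe_levels and the small hand-made frame are assumptions, not part of the dataset. In the notebook the function would be called on df instead.

```python
import pandas as pd

def describe_levels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its number of distinct values and a rough kind."""
    n_levels = frame.nunique()
    # Two or fewer distinct values is treated as a binary indicator (a rough heuristic).
    kind = n_levels.map(lambda n: "binary" if n <= 2 else "multi-level")
    return pd.DataFrame({"n_levels": n_levels, "kind": kind})

# Tiny made-up frame that mimics the layout of the real data (values are illustrative only).
toy = pd.DataFrame({
    "Diabetes_binary": [0.0, 0.0, 1.0, 0.0],
    "HighBP": [1.0, 0.0, 1.0, 1.0],
    "BMI": [40.0, 25.0, 28.0, 27.0],
    "Age": [9.0, 7.0, 9.0, 11.0],
})
summary = describe_levels(toy)
print(summary)
```

On the full data, describe_levels(df) would give a quick list of which features are indicators and which are binned or numeric.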
Questions:
The question we would like to answer is: which of the features best predict diabetes?
Methods:
Group 3 will use classification algorithms to determine which features best predict whether a survey respondent has diabetes. The algorithms we intend to use are support vector machines (SVM), decision trees, k-nearest neighbors (k-NN), and logistic regression.
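As a rough sketch of how the four classifiers named above could be compared, the following uses scikit-learn with 5-fold cross-validation. The synthetic data from make_classification stands in for the real feature matrix and target, and the hyperparameters (max_depth=5, n_neighbors=5, the default SVC kernel) are placeholder assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and the Diabetes_binary target (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Scale inputs for the distance- and margin-based models; trees do not need scaling.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Kernel SVMs scale poorly with the number of samples, so on the full 253,680 rows a linear SVM (LinearSVC) or a subsample may be more practical. If the classes turn out to be imbalanced, passing scoring="roc_auc" to cross_val_score may be more informative than accuracy.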
Variables:
The variables of interest are high blood pressure (HighBP), high cholesterol (HighChol), BMI, smoking status (Smoker), physical activity (PhysActivity), heavy alcohol consumption, sex (Sex), and age (Age).
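A minimal sketch of how these variables could be separated from the target is shown below. The helper split_features_target and the two-row frame are hypothetical. The column name HvyAlcoholConsump is also an assumption, because that column is hidden behind "..." in the df.head() output and should be checked against df.columns.

```python
import pandas as pd

# Column names as they appear in df.head(); HvyAlcoholConsump is hidden behind "..."
# in that output, so its spelling is an assumption to verify against df.columns.
FEATURES = ["HighBP", "HighChol", "BMI", "Smoker", "PhysActivity",
            "HvyAlcoholConsump", "Sex", "Age"]
TARGET = "Diabetes_binary"

def split_features_target(frame: pd.DataFrame):
    """Return (X, y); raise a KeyError listing any expected column that is missing."""
    missing = [c for c in FEATURES + [TARGET] if c not in frame.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return frame[FEATURES], frame[TARGET]

# Made-up two-row frame for illustration; in the notebook pass df instead.
toy = pd.DataFrame(
    [[0.0, 1.0, 1.0, 40.0, 1.0, 0.0, 0.0, 0.0, 9.0, 4.0],
     [1.0, 0.0, 0.0, 25.0, 1.0, 1.0, 0.0, 0.0, 7.0, 6.0]],
    columns=[TARGET] + FEATURES + ["Education"],
)
X, y = split_features_target(toy)
print(X.shape, y.tolist())
```

Failing early on a missing column makes a misspelled name visible before any model is fitted.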