The dataset has 21 candidate predictors, so we are considering a few preprocessing steps to make predictions more robust. One is principal component analysis (PCA), which would reduce dimensionality while retaining most of the variance in the features. Another is class balancing: heart disease or heart attack is reported in only about 9% of records, so a model trained on the raw data could favor the majority (no-disease) class.
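The two preprocessing ideas above might look roughly like the sketch below. It is an illustration only: the data is synthetic (a stand-in for the survey features), the 95% variance threshold and the helper name `undersample_majority` are assumptions, and random undersampling is just one balancing option (class weights and oversampling are others).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features: 1,000 rows, 21 numeric columns,
# with roughly 10% positives to mimic the class imbalance.
X = pd.DataFrame(rng.normal(size=(1000, 21)),
                 columns=[f"f{i}" for i in range(21)])
y = pd.Series((rng.random(1000) < 0.10).astype(int), name="target")


def undersample_majority(X, y, seed=0):
    """Hypothetical helper: randomly keep as many rows from each class
    as there are rows in the smallest class."""
    counts = y.value_counts()
    n_minority = counts.min()
    keep = pd.concat(
        [y[y == label].sample(n=n_minority, random_state=seed)
         for label in counts.index]
    ).index
    return X.loc[keep], y.loc[keep]


X_bal, y_bal = undersample_majority(X, y)

# Standardize first: PCA is sensitive to feature scale
# (e.g. a BMI-like column next to 0/1 flags).
X_scaled = StandardScaler().fit_transform(X_bal)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (assumed threshold: 95%).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(y_bal.value_counts().to_dict())
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```

In a real workflow, both the resampling and the PCA fit would be applied to the training split only, so that no information from the held-out data leaks into preprocessing.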
We plan to apply the following machine learning models to the data:
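We will assess each model using appropriate evaluation metrics, such as accuracy, precision, recall, and the ROC curve (summarized by the area under the curve, AUC). Because the positive class is rare, accuracy alone can be misleading, so precision and recall will carry particular weight. Cross-validation will be used to make the assessment robust.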
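A minimal sketch of such an evaluation loop follows, assuming scikit-learn. The data is synthetic and imbalanced, and logistic regression is used purely as a placeholder model; neither is taken from the project itself. Stratified folds are one reasonable choice here because they keep the positive rate similar in every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in data (not the real dataset):
# the first feature carries signal, the rest are noise.
X = rng.normal(size=(2000, 5))
logits = 1.5 * X[:, 0] - 2.5
y = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# Placeholder model; scaling lives inside the pipeline so it is
# refit on the training part of each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 5-fold CV preserves the class ratio in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the mean and spread across folds for each metric makes it easier to compare candidate models on the same footing.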
We expect to identify the machine learning algorithm(s) that best predict heart disease from the available risk factors, as judged by our evaluation metrics. The results will indicate which model is most suitable for practical use in a healthcare setting.
import pandas as pd
# Manually download dataset
heart = pd.read_csv("heart_disease_health_indicators.csv")
# info() prints its summary directly and returns None, so no print() is needed
heart.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253661 entries, 0 to 253660
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   HeartDiseaseorAttack  253661 non-null  int64
 1   HighBP                253661 non-null  int64
 2   HighChol              253661 non-null  int64
 3   CholCheck             253661 non-null  int64
 4   BMI                   253661 non-null  int64
 5   Smoker                253661 non-null  int64
 6   Stroke                253661 non-null  int64
 7   Diabetes              253661 non-null  int64
 8   PhysActivity          253661 non-null  int64
 9   Fruits                253661 non-null  int64
 10  Veggies               253661 non-null  int64
 11  HvyAlcoholConsump     253661 non-null  int64
 12  AnyHealthcare         253661 non-null  int64
 13  NoDocbcCost           253661 non-null  int64
 14  GenHlth               253661 non-null  int64
 15  MentHlth              253661 non-null  int64
 16  PhysHlth              253661 non-null  int64
 17  DiffWalk              253661 non-null  int64
 18  Sex                   253661 non-null  int64
 19  Age                   253661 non-null  int64
 20  Education             253661 non-null  int64
 21  Income                253661 non-null  int64
dtypes: int64(22)
memory usage: 42.6 MB
print(heart.describe())
       HeartDiseaseorAttack         HighBP       HighChol      CholCheck  \
count         253661.000000  253661.000000  253661.000000  253661.000000
mean               0.094173       0.428990       0.424113       0.962667
std                0.292070       0.494933       0.494209       0.189578
min                0.000000       0.000000       0.000000       0.000000
25%                0.000000       0.000000       0.000000       1.000000
50%                0.000000       0.000000       0.000000       1.000000
75%                0.000000       1.000000       1.000000       1.000000
max                1.000000       1.000000       1.000000       1.000000

                 BMI         Smoker         Stroke       Diabetes  \
count  253661.000000  253661.000000  253661.000000  253661.000000
mean       28.382475       0.443186       0.040570       0.296904
std         6.608638       0.496763       0.197292       0.698147
min        12.000000       0.000000       0.000000       0.000000
25%        24.000000       0.000000       0.000000       0.000000
50%        27.000000       0.000000       0.000000       0.000000
75%        31.000000       1.000000       0.000000       0.000000
max        98.000000       1.000000       1.000000       2.000000

        PhysActivity         Fruits  ...  AnyHealthcare    NoDocbcCost  \
count  253661.000000  253661.000000  ...  253661.000000  253661.000000
mean        0.756577       0.634264  ...       0.951049       0.084164
std         0.429149       0.481637  ...       0.215766       0.277633
min         0.000000       0.000000  ...       0.000000       0.000000
25%         1.000000       0.000000  ...       1.000000       0.000000
50%         1.000000       1.000000  ...       1.000000       0.000000
75%         1.000000       1.000000  ...       1.000000       0.000000
max         1.000000       1.000000  ...       1.000000       1.000000

             GenHlth       MentHlth       PhysHlth       DiffWalk  \
count  253661.000000  253661.000000  253661.000000  253661.000000
mean        2.511379       3.184778       4.242028       0.168221
std         1.068472       7.412822       8.717905       0.374063
min         1.000000       0.000000       0.000000       0.000000
25%         2.000000       0.000000       0.000000       0.000000
50%         2.000000       0.000000       0.000000       0.000000
75%         3.000000       2.000000       3.000000       0.000000
max         5.000000      30.000000      30.000000       1.000000

                 Sex            Age      Education         Income
count  253661.000000  253661.000000  253661.000000  253661.000000
mean        0.440348       8.032197       5.050461       6.054052
std         0.496430       3.054203       0.985718       2.071036
min         0.000000       1.000000       1.000000       1.000000
25%         0.000000       6.000000       4.000000       5.000000
50%         0.000000       8.000000       5.000000       7.000000
75%         1.000000      10.000000       6.000000       8.000000
max         1.000000      13.000000       6.000000       8.000000

[8 rows x 22 columns]
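Because `HeartDiseaseorAttack` is a 0/1 column, its mean in the summary above is the fraction of positive records. The short calculation below works out what that implies for class imbalance. The figures are copied from the printed output; since the mean is rounded to six decimals, the derived counts are approximate, and the variable names are illustrative.

```python
# Figures copied from the describe() output above.
n_rows = 253661
positive_rate = 0.094173  # mean of the 0/1 column HeartDiseaseorAttack

# Approximate class counts implied by the mean.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

# Accuracy of a trivial model that always predicts "no heart disease".
majority_baseline_accuracy = 1 - positive_rate

# Number of negative records per positive record.
imbalance_ratio = n_negative / n_positive

print(f"positives: {n_positive}, negatives: {n_negative}")
print(f"always-negative accuracy: {majority_baseline_accuracy:.3f}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
```

A model that never predicts heart disease would still score roughly 91% accuracy on this data, so any candidate model needs to be judged against that baseline and on metrics such as precision and recall.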