Project Proposal
Read Data
In [3]:
import pandas as pd
# Replace with your actual file path
file_path = 'Medical_insurance.csv'
df = pd.read_csv(file_path)
# Check first 5 rows
df.head(5)
Out[3]:
| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Variables
- age: age of the person
- sex: sex of the person
- bmi: Body Mass Index (BMI)
- children: number of children
- smoker: smoking status (yes/no)
- region: region of residence
- charges: how much the person's health insurance costs
Questions of interest
- What are the most important factors that affect medical expenses?
- How well can machine learning models predict medical expenses, and which model performs best?
- How can machine learning models be used to improve the efficiency and profitability of health insurance companies?
Methods
Linear Regression: train a model to predict medical expenses, using Lasso or Ridge regularization to deal with collinearity
One-hot encoding: convert the region and sex variables to dummy variables
Decision Tree: obtain feature importances and show them in a bar plot
Graphing: scatter plots, heatmaps, etc.
MSE/R^2: evaluation metrics
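The one-hot encoding, regularized regression, and MSE/R^2 items above could be combined roughly as follows. This is only a sketch under stated assumptions: the function name `compare_linear_models`, the 80/20 split, and the `alpha` values are illustrative choices, not settings fixed by this proposal. It also encodes `smoker` alongside `sex` and `region`, since that column holds yes/no strings in the data shown earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups taken from the dataset preview; encoding `smoker` is an assumption.
CATEGORICAL = ["sex", "smoker", "region"]
NUMERIC = ["age", "bmi", "children"]


def compare_linear_models(df, test_size=0.2, random_state=0):
    """Fit plain, Ridge, and Lasso regression and report test-set MSE and R^2."""
    X = df[CATEGORICAL + NUMERIC]
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Illustrative hyperparameters; in practice alpha would be tuned.
    models = {
        "linear": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=1.0, max_iter=10000),
    }

    rows = []
    for name, model in models.items():
        # One-hot encode the categorical columns (dropping one level each)
        # and standardize the numeric ones so the penalty treats them evenly.
        preprocess = ColumnTransformer(
            [
                ("onehot", OneHotEncoder(drop="first"), CATEGORICAL),
                ("scale", StandardScaler(), NUMERIC),
            ]
        )
        pipeline = make_pipeline(preprocess, model)
        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)
        rows.append(
            {
                "model": name,
                "mse": mean_squared_error(y_test, predictions),
                "r2": r2_score(y_test, predictions),
            }
        )
    return pd.DataFrame(rows, columns=["model", "mse", "r2"])
```

With the DataFrame loaded in the Read Data cell, `compare_linear_models(df)` would return one row per model, which makes the comparison of MSE and R^2 easy to tabulate or plot.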
We will examine how the different factors relate to medical expenses, test several models and compare their performance, and plot graphs to visualize our results.
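As a rough illustration of the decision-tree step, feature importances might be extracted as below. The helper name `tree_feature_importances` and the `max_depth=4` setting are hypothetical choices made for this sketch; here `pd.get_dummies` stands in for the one-hot encoding step.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def tree_feature_importances(df, target="charges", max_depth=4, random_state=0):
    """Fit a decision tree and return feature importances, largest first."""
    # Dummy-encode every string column; numeric columns pass through unchanged.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    y = df[target]

    # A shallow tree (assumed depth) keeps the importances easy to interpret.
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    tree.fit(X, y)

    importances = pd.Series(tree.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)
```

Calling `tree_feature_importances(df).plot.barh()` on the loaded data would then draw the bar plot, assuming matplotlib is installed.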