Project Proposal¶

Read Data¶

In [3]:
import pandas as pd

# Replace with your actual file path
file_path = 'Medical_insurance.csv' 
df = pd.read_csv(file_path)

# Check first 5 rows
df.head(5)
Out[3]:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Variables¶

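The dataset has seven columns: age (age of the person), sex (sex of the person), bmi (Body Mass Index), children (number of children), smoker (smoking status, yes/no), region (region of residence), and charges (how much their health insurance costs, which is the medical expense we want to predict).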
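As a quick check of which variables are numeric and which are categorical (and so need encoding before modeling), here is a minimal sketch. The `sample` frame only re-types the five rows shown in the output above, and the helper name `split_column_types` is our own, not part of pandas.

```python
import pandas as pd

# The five rows displayed by df.head(5) above, re-typed for illustration only.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def split_column_types(df, target="charges"):
    """Return (numeric, categorical) feature names, excluding the target."""
    features = df.drop(columns=target)
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


numeric_cols, categorical_cols = split_column_types(sample)
print(numeric_cols)      # ['age', 'bmi', 'children']
print(categorical_cols)  # ['sex', 'smoker', 'region']
```

The same call should work on the full `df` loaded from the CSV, assuming its columns match the ones shown above.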

Questions of interest¶

  1. What are the most important factors that affect medical expenses?
  2. How well can machine learning models predict medical expenses, and which model performs best?
  3. How can machine learning models be used to improve the efficiency and profitability of health insurance companies?

Methods¶

Linear Regression: fit Lasso or Ridge regression to predict medical expenses; the regularization helps when predictors are collinear.
One-hot encoding: to convert the categorical variables (sex, smoker, region) into dummy variables.
Decision Tree: to obtain feature importances and display them in a bar plot.
Graphing: scatter plots, heatmaps, etc.
MSE/R^2: evaluation metrics used to compare the models.
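The encoding, regression, and evaluation steps listed above could be combined roughly as follows. This is a sketch under assumptions, not a final pipeline: the function names, the `alpha=1.0` penalty, the 80/20 split, and the added standardization step are our own placeholder choices and would need tuning on the real data.

```python
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def encode_features(df, target="charges"):
    """One-hot encode the categorical columns and split off the target."""
    X = pd.get_dummies(
        df.drop(columns=target),
        columns=["sex", "smoker", "region"],
        drop_first=True,  # drop one level per variable to avoid redundant dummies
        dtype=float,
    )
    return X, df[target]


def evaluate_linear_models(df, alpha=1.0, test_size=0.2, seed=0):
    """Fit Ridge and Lasso on a train split; report MSE and R^2 on the test split."""
    X, y = encode_features(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Assumption: features are standardized so the penalty treats them equally.
    models = {
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=alpha)),
        "lasso": make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10_000)),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "mse": mean_squared_error(y_test, pred),
            "r2": r2_score(y_test, pred),
        }
    return results
```

Calling `evaluate_linear_models(df)` on the frame loaded in the Read Data cell returns one MSE and one R^2 per model, which gives a like-for-like comparison because both models are scored on the same held-out rows.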

We will examine how each factor relates to medical expenses, fit several models and compare their performance, and plot graphs to visualize the results.
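To relate the factors to medical expenses, the decision-tree feature importances mentioned in the Methods could be computed along these lines. The helper names and the `max_depth=4` setting are assumptions made for illustration; the plot helper relies on pandas plotting, which needs matplotlib installed.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def tree_feature_importances(df, target="charges", max_depth=4, seed=0):
    """Fit a regression tree and return its feature importances, largest first."""
    # Trees do not need a dropped reference level, so keep every dummy column.
    X = pd.get_dummies(df.drop(columns=target), dtype=float)
    y = df[target]
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=seed)
    tree.fit(X, y)
    importances = pd.Series(tree.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


def plot_importances(importances):
    """Draw a horizontal bar plot of the importances (requires matplotlib)."""
    ax = importances.sort_values().plot.barh()
    ax.set_xlabel("Feature importance")
    ax.set_title("Decision tree feature importances")
    return ax
```

For example, `plot_importances(tree_feature_importances(df))` would draw the bar plot for the full dataset. These importances describe how one fitted tree split the data, so they are best read alongside the regression coefficients rather than as a definitive ranking.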