[Please put your name and NetID here.]
# ... your code here ... (import statements)
(Note: This paragraph is not instructions; it communicates context for this exercise. We use the same Titanic data we used in HW02. There, we used `df.dropna()` to drop any observations with missing values; here we use data imputation instead. There, we one-hot encoded the `Sex` column by making a `Female` column; here we do the same one-hot encoding with the help of pandas's `df.join(pd.get_dummies())`. We evaluate how these strategies can improve model performance by allowing us to use columns with categorical or missing data.)
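As a rough illustration of the `df.join(pd.get_dummies())` pattern mentioned in the note, here is a minimal sketch on a made-up toy frame. The values and the `dtype=int` choice are assumptions for illustration, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (not the real file).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

# pd.get_dummies() makes one indicator column per category of 'Sex';
# df.join() appends those columns to the original frame, aligned on the index.
df = df.join(pd.get_dummies(df["Sex"], dtype=int))

print(df)
```

The joined frame keeps the original `Sex` column and gains one 0/1 column per category.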
These data come from, and are described at, https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the "Data Dictionary").
# ... your code here ...
# ... your code here ...
Use `X.isna().any()` (where `X` is the name of your DataFrame of features) to see that the 'Age' feature has missing values. (You can see the first missing value in the sixth row that you displayed above.)
# ... your code here ...
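To show what `isna().any()` reports, here is a small sketch on a hypothetical two-column frame (the values are invented). It returns one boolean per column, which is `True` when that column contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; only 'Age' contains a missing value.
X = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
})

# isna() marks each missing cell; any() reduces column-wise to one flag per column.
missing = X.isna().any()
print(missing)
```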
Accuracy on training data is 0.500
(0.500 may not be correct).
# ... your code here ...
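One way to print a line in that format is an f-string with three decimal places. The sketch below assumes hypothetical label arrays `y_true` and `y_pred` and uses scikit-learn's `accuracy_score`; in the exercise, the predictions would come from your own fitted model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (invented for illustration).
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# accuracy_score returns the fraction of predictions that match the labels.
accuracy = accuracy_score(y_true, y_pred)

# ':.3f' formats the value with exactly three digits after the decimal point.
message = f"Accuracy on training data is {accuracy:.3f}"
print(message)
```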
to include a binary 'male' feature made from the 'Sex' feature. (Or include a binary 'female' feature, according to your preference. Using both is unnecessary since either is the logical negation of the other.) That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male'.
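A minimal sketch of deriving such a binary feature, assuming the 'Sex' column holds the strings "male" and "female" (the toy rows below are invented):

```python
import pandas as pd

# Hypothetical toy rows with the columns named in the text.
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Sex": ["male", "female", "female"],
})

# The comparison yields booleans; astype(int) converts them to 1 (male) and 0 (not male).
df["male"] = (df["Sex"] == "male").astype(int)

# Select the four features listed above.
X = df[["Pclass", "SibSp", "Parch", "male"]]
print(X)
```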
# ... your code here ...
to include an 'age' feature made from 'Age' but replacing each missing value with the median of the non-missing ages. That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male', 'age'.
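Median imputation can be sketched as follows on a hypothetical 'Age' column (values invented). `Series.median()` skips missing values by default, so it gives the median of the non-missing ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# median() ignores NaN by default, so this is the median of 22, 26, and 35.
median_age = df["Age"].median()

# fillna() replaces each missing value with the median and leaves other values unchanged.
df["age"] = df["Age"].fillna(median_age)
print(df)
```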
# ... your code here ...
Set `X` to the subset consisting of all columns except `mpg`.
Set `y` to the `mpg` column.
Use `train_test_split()` to split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`.
Use `random_state=0` to get reproducible results.
# ... your code here ...
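The split described above might look like the sketch below. The frame is a made-up stand-in with an `mpg` column and two hypothetical feature columns (`hp`, `wt`), and the default test fraction of `train_test_split()` is assumed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up numbers for illustration only; not the assignment's data file.
df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 17.8, 16.4],
    "hp": [110, 93, 175, 245, 62, 123, 123, 180],
    "wt": [2.6, 2.3, 3.4, 3.6, 3.2, 3.4, 3.4, 4.1],
})

X = df.drop(columns="mpg")  # every column except the target
y = df["mpg"]               # the target column

# random_state=0 fixes the shuffle so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```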
LinearRegression()
Lasso()
Ridge()
The evaluation consists of displaying MSE$_\text{train}$, MSE$_\text{test}$, and the coefficients $\mathbf{w}$ for each model.
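A sketch of such an evaluation loop, run on synthetic data rather than the assignment's data, is given below. The default hyperparameters of each model, the synthetic coefficients, and the `results` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (invented): y depends on features 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for model in (LinearRegression(), Lasso(), Ridge()):
    model.fit(X_train, y_train)
    # Mean squared error on the data the model was fit to, and on held-out data.
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    name = type(model).__name__
    results[name] = (mse_train, mse_test, model.coef_)
    print(f"{name}: MSE_train={mse_train:.3f}, MSE_test={mse_test:.3f}, w={model.coef_}")
```

Comparing the printed coefficients side by side makes the shrinkage from the regularized models visible relative to plain least squares.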
# ... your code here ...
... your answers here in a markdown cell ...