HW04: Practice with feature engineering, splitting data, and fitting and regularizing linear models

[Please put your name and NetID here.]

Hello Students:

1. Feature engineering (one-hot encoding and data imputation)

(Note: This paragraph is not an instruction; it provides context for the exercise. We use the same Titanic data we used in HW02. We evaluate how these strategies, one-hot encoding and data imputation, can improve model performance by allowing us to use columns with categorical or missing data.)

1a. Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.

These data come from Kaggle and are described at https://www.kaggle.com/competitions/titanic/data (click the small down-arrow to see the "Data Dictionary").
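A minimal sketch of one way to read the data, assuming pandas is available:

```python
import pandas as pd

# Read the Titanic training data directly from the course URL.
url = 'http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv'
df = pd.read_csv(url)
df.head()
```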

1b. Try to train a $k$NN model to predict $y=$ 'Survived' from $X=$ these features: 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch'.
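A sketch of how this attempt might look, assuming scikit-learn's KNeighborsClassifier and the DataFrame df from 1a; expect the fit to raise a ValueError, since 'Sex' holds strings:

```python
from sklearn.neighbors import KNeighborsClassifier

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
X = df[features]
y = df['Survived']

# Expect a ValueError here: 'Sex' holds strings ('male'/'female'),
# which kNN cannot convert to numeric distances.
KNeighborsClassifier().fit(X, y)
```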

1c. Try to train again, this time without the 'Sex' feature.
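A sketch of the second attempt, under the same assumptions; expect this one to fail as well, since 'Age' contains missing values:

```python
# Drop 'Sex'. Expect another ValueError: 'Age' contains NaN values,
# which KNeighborsClassifier rejects.
X = df[['Pclass', 'Age', 'SibSp', 'Parch']]
KNeighborsClassifier().fit(X, df['Survived'])
```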

1d. Train without the 'Sex' and 'Age' features.
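With both problem columns gone, the fit should succeed; a sketch:

```python
# All remaining features are numeric with no missing values,
# so training succeeds.
X = df[['Pclass', 'SibSp', 'Parch']]
y = df['Survived']
knn = KNeighborsClassifier().fit(X, y)
print(f'training accuracy: {knn.score(X, y):.3f}')
```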

1e. Use one-hot encoding

to include a binary 'male' feature made from the 'Sex' feature. (Or include a binary 'female' feature, according to your preference. Using both is unnecessary since either is the logical negation of the other.) That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male'.
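One way to build the binary 'male' column is with plain pandas (scikit-learn's OneHotEncoder would also work); a sketch, continuing from the cells above:

```python
# Encode 'Sex' as a single binary column: 1 for male, 0 for female.
# A 'female' column would carry exactly the same information.
df['male'] = (df['Sex'] == 'male').astype(int)

X = df[['Pclass', 'SibSp', 'Parch', 'male']]
knn = KNeighborsClassifier().fit(X, df['Survived'])
print(f'training accuracy: {knn.score(X, df["Survived"]):.3f}')
```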

1f. Use data imputation

to include an 'age' feature made from 'Age' but replacing each missing value with the median of the non-missing ages. That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male', 'age'.
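A sketch of median imputation with pandas (scikit-learn's SimpleImputer is an alternative):

```python
# Replace each missing age with the median of the observed ages.
# (Series.median() ignores NaN by default.)
df['age'] = df['Age'].fillna(df['Age'].median())

X = df[['Pclass', 'SibSp', 'Parch', 'male', 'age']]
knn = KNeighborsClassifier().fit(X, df['Survived'])
print(f'training accuracy: {knn.score(X, df["Survived"]):.3f}')
```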

2. Explore model fit, overfit, and regularization in the context of multiple linear regression

2a. Prepare the data:
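The data-preparation details are not restated here. Purely as a hypothetical placeholder, the sketch below builds a small synthetic regression data set with polynomial features (so that an unregularized fit can overfit) and splits it into training and test halves; substitute the actual preparation steps from the assignment:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a noisy linear signal expanded into
# polynomial features x, x^2, ..., x^5, so an unregularized model
# can overfit by assigning weight to the spurious higher powers.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = 1 + 2 * x + rng.normal(scale=0.2, size=x.size)
X = np.column_stack([x**p for p in range(1, 6)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
```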

2b. Train three models on the training data and evaluate each on the test data:

The evaluation consists of displaying MSE$_\text{train}$, MSE$_\text{test}$, and the coefficients $\mathbf{w}$ for each model.
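The three models are not named here; one plausible choice, used purely for illustration, is unregularized linear regression plus ridge and lasso (continuing from the sketch in 2a):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
import numpy as np

# Hypothetical model choice; the assignment may specify different ones.
models = {'linear': LinearRegression(),
          'ridge': Ridge(alpha=1.0),
          'lasso': Lasso(alpha=0.01)}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    print(f'{name}: MSE_train={mse_train:.4f}, MSE_test={mse_test:.4f}, '
          f'w={np.round(model.coef_, 3)}')
```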

2c. Answer a few questions about the models:

... your answers here in a markdown cell ...