HW02: Practice with logistic regression and decision trees

[Please put your name and NetID here.]

Hello Students:

1. Logistic regression

1a. Make a logistic regression model

relating the probability that an iris has Species='virginica' to its 'Petal.Length', and classifying each iris as 'virginica' or not 'virginica' (i.e. 'versicolor').
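Here is a minimal sketch of one possible setup. Loading iris from scikit-learn and rebuilding the R-style column names ('Petal.Length', 'Species') is my assumption; your course data source may differ.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Build a data frame with R-style column names from sklearn's iris,
# then keep only the versicolor vs. virginica examples.
iris = load_iris()
df = pd.DataFrame({'Petal.Length': iris.data[:, 2],
                   'Species': pd.Categorical.from_codes(iris.target, iris.target_names)})
df = df[df['Species'].isin(['versicolor', 'virginica'])]

X = df[['Petal.Length']]
y = (df['Species'] == 'virginica').astype(int)  # 1 = virginica, 0 = versicolor

model = LogisticRegression().fit(X, y)
print(model.predict(X))  # 1 where an iris is classified as virginica
```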

1b. Do some work with logistic regression by hand.

Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}\,.$

Logistic regression is named after the log-odds of success, $\ln \frac{p}{1 - p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $\mathbf{w} \cdot \mathbf{x}_i + b$. (That is, start with $\ln \frac{p}{1 - p}$ and connect it in a series of equalities to $\mathbf{w} \cdot \mathbf{x}_i + b$.)

... your LaTeX math in a Markdown cell here ...

$\begin{align*}
% In this LaTeX context, "&" separates columns and "\\" ends a line.
\ln \frac{p}{1 - p} & = ...\\
& = ...\\
& = ...\\
& = ...\\
& = \mathbf{w} \cdot \mathbf{x}_i + b\\
\end{align*}$
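For reference, here is a sketch of one route through the algebra; it uses only the definition of $p$ above, and your own derivation may arrange the steps differently. Since $p = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}$, we have $1 - p = \frac{e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}$, so

$\begin{align*}
\ln \frac{p}{1 - p} & = \ln \frac{1 / \left(1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}\right)}{e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)} / \left(1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}\right)}\\
& = \ln \frac{1}{e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}\\
& = \ln e^{\mathbf{w} \cdot \mathbf{x}_i + b}\\
& = \mathbf{w} \cdot \mathbf{x}_i + b\\
\end{align*}$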

1c. Do some more work with logistic regression by hand.

I ran some Python/scikit-learn code to make the model pictured here:

From the image, and without running any code, match each line of code in the top list with its output in the bottom list.

  1. model.intercept_
  2. model.coef_
  3. model.predict(X)
  4. model.predict_proba(X)[:, 1]

  A. array([0, 0, 0, 1])
  B. array([0.003, 0.5, 0.5, 0.997])
  C. array([5.832])
  D. array([0.])
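As a reminder of what each attribute returns, here is a sketch on made-up data (X_toy and y_toy are invented for illustration, so the printed numbers will not match the figure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data for illustration only; not the data behind the figure.
X_toy = np.array([[0.], [1.], [2.], [3.]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

print(model.intercept_)                  # shape-(1,) array holding b
print(model.coef_)                       # shape-(1, n_features) array holding w
print(model.predict(X_toy))              # hard 0/1 class predictions
print(model.predict_proba(X_toy)[:, 1])  # P(y = 1) for each example
```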

2. Decision tree

2a. Make a decision tree model on a Titanic data set.

Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.

These data are described at https://www.kaggle.com/competitions/titanic/data (click the small down-arrow to see the "Data Dictionary"), which is the original source of the data.
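A minimal sketch of one way to start; the feature choice, the dropna() cleanup, and the 0/1 encoding of Sex are assumptions on my part, not requirements from the assignment:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv')

# Assumed cleanup: keep a few tree-friendly features, drop rows with
# missing values, and encode Sex as 0/1 so the tree can split on it.
df = df[['Survived', 'Pclass', 'Sex', 'Age']].dropna()
df['Sex'] = (df['Sex'] == 'female').astype(int)  # 1 = female, 0 = male

X = df[['Pclass', 'Sex', 'Age']]
y = df['Survived']
model = DecisionTreeClassifier(max_depth=2).fit(X, y)
```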

2b. Which features does the (max_depth=2) tree use in its decision-making? Answer in a Markdown cell.
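One way to see which features the tree actually splits on, assuming the model and feature names from the sketch in 2a:

```python
from sklearn.tree import export_text

# The features named in these printed rules answer 2b.
print(export_text(model, feature_names=['Pclass', 'Sex', 'Age']))
```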

2c. What proportion (in the cleaned-up data) of females survived? What proportion of males survived?

Answer in two sentences via print(), with each proportion rounded to three decimal places.

Hint: There are many ways to do this. One quick way is to find the average of the Survived column for each subset.
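For example, following the hint and assuming the cleaned df from the 2a sketch (where Sex is 1 for female, 0 for male):

```python
# Mean of the 0/1 Survived column within each Sex group = survival proportion.
props = df.groupby('Sex')['Survived'].mean()
print(f"The proportion of females who survived is {props.loc[1]:.3f}.")
print(f"The proportion of males who survived is {props.loc[0]:.3f}.")
```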

2d. Do some decision tree calculations by hand.

Consider a decision tree node containing the following set of examples $S = \{(\mathbf{x}, y)\}$ where $\mathbf{x} = (x_1, x_2)$:

$\{((4, 9), 1),\ ((2, 6), 0),\ ((5, 7), 0),\ ((3, 8), 1)\}$

Find the entropy of $S$.
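Recall that the entropy here is $H(S) = -\sum_c p_c \log_2 p_c$, summed over the class proportions $p_c$ in the node. A small helper for checking hand work (the function is my own, not part of the assignment):

```python
import numpy as np

def entropy(labels):
    """Entropy, in bits, of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 1]))  # e.g. about 0.918 bits for a 2:1 class split
```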

2e. Do some more decision tree calculations by hand.

Find a (feature, threshold) pair that yields the best split (i.e. the highest information gain) for this node.
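A brute-force check is possible in a few lines, assuming splits of the form $x_j \le t$ with candidate thresholds at the midpoints between adjacent sorted feature values; the best split minimizes the weighted child entropy (equivalently, maximizes information gain). This sketch reuses the entropy() helper defined above:

```python
import numpy as np

S = [((4, 9), 1), ((2, 6), 0), ((5, 7), 0), ((3, 8), 1)]
X = np.array([x for x, _ in S], dtype=float)
y = np.array([label for _, label in S])

def weighted_child_entropy(j, t):
    """Weighted average entropy of the two children of the split x_j <= t."""
    left, right = y[X[:, j] <= t], y[X[:, j] > t]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)

for j in range(X.shape[1]):
    values = np.sort(np.unique(X[:, j]))
    for t in (values[:-1] + values[1:]) / 2:  # midpoints between adjacent values
        print(f"x_{j + 1} <= {t}: weighted child entropy {weighted_child_entropy(j, t):.3f}")
```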