[Please put your name and NetID here.]
# ... your code here ... (import statements)
Train a logistic regression model relating the probability an iris has Species='virginica' to its 'Petal.Length', classifying irises as 'virginica' or not 'virginica' (i.e. 'versicolor'). Use linear_model.LogisticRegression(C=1000) so we all get the same results (they vary with C).

# ... your code here ...
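A minimal sketch of one approach, assuming the iris data comes from scikit-learn's built-in copy (the course may instead supply a CSV with 'Petal.Length' and 'Species' columns; adapt the loading step accordingly):

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_iris

# Load the built-in iris data (assumption: a course-provided CSV would work too).
iris = load_iris()
petal_length = iris.data[:, 2].reshape(-1, 1)  # 'petal length (cm)'
species = iris.target                          # 0=setosa, 1=versicolor, 2=virginica

# Keep only versicolor and virginica, as in the exercise.
mask = species >= 1
X = petal_length[mask]
y = (species[mask] == 2).astype(int)           # 1 = virginica, 0 = versicolor

model = linear_model.LogisticRegression(C=1000)
model.fit(X, y)
print(model.intercept_, model.coef_)
print(model.score(X, y))                       # training accuracy
```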
Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(\mathbf{w x} + b)}}\,.$
Logistic regression is named after the log-odds of success, $\ln \frac{p}{1 - p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $\mathbf{w x} + b$. (That is, start with $\ln \frac{p}{1 - p}$ and connect it in a series of equalities to $\mathbf{w x} + b$.)
$\begin{align*}
% In this LaTeX context, "&" separates columns and "\\" ends a line.
\ln \frac{p}{1 - p} & = ...\\
& = ...\\
& = ...\\
& = ...\\
& = \mathbf{w x} + b\\
\end{align*}$
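For reference, one possible chain of equalities (a sketch; other equally valid routes through the algebra exist):

$\begin{align*}
\ln \frac{p}{1 - p} & = \ln \frac{1/(1 + e^{-(\mathbf{w x} + b)})}{1 - 1/(1 + e^{-(\mathbf{w x} + b)})}\\
& = \ln \frac{1/(1 + e^{-(\mathbf{w x} + b)})}{e^{-(\mathbf{w x} + b)}/(1 + e^{-(\mathbf{w x} + b)})}\\
& = \ln \frac{1}{e^{-(\mathbf{w x} + b)}}\\
& = \ln e^{\mathbf{w x} + b}\\
& = \mathbf{w x} + b\\
\end{align*}$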
I ran some Python/scikit-learn code to make the model pictured here:
From the image and without the help of running code, match each code line from the top list with its output from the bottom list.
1. model.intercept_
2. model.coef_
3. model.predict(X)
4. model.predict_proba(X)[:, 1]
A. array([0, 0, 0, 1])
B. array([0.003, 0.5, 0.5, 0.997])
C. array([5.832])
D. array([0.])
# ... Your answer here in a Markdown cell ...
# For example, "1: A, 2: B, 3: C, 4: D" is wrong but has the right format.
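After committing to an answer, one way to sanity-check the matching is to fit a small one-feature model on made-up data and print each expression (X and y below are hypothetical, not the data behind the pictured model):

```python
import numpy as np
from sklearn import linear_model

# Hypothetical 1-D data; NOT the data from the pictured model.
X = np.array([[-1.0], [0.0], [0.0], [1.0]])
y = np.array([0, 0, 1, 1])

model = linear_model.LogisticRegression(C=1000).fit(X, y)

print(model.intercept_)              # array of shape (1,)
print(model.coef_)                   # array of shape (1, 1)
print(model.predict(X))              # predicted 0/1 labels
print(model.predict_proba(X)[:, 1])  # P(y=1) for each row
```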
Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.
These data are described at https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the "Data Dictionary"), which is where they are from.
Retain only the columns Survived, Pclass, Sex, and Age. Drop rows with missing values using df.dropna(). Display your data frame's shape before and after dropping rows. (It should be (714, 4) after dropping rows.)

Encode Sex numerically via df.Sex == 'female'. This gives bool values True and False, which are interpreted as 1 and 0 when used in an arithmetic context.

Train a decision tree with max_depth=None to decide whether a passenger Survived from the other three columns. Report its accuracy (with 3 decimal places) on training data, along with the tree's depth (which is available in clf.tree_.max_depth).

Train a second tree with max_depth=2. Report its accuracy (with 3 decimal places). Use tree.plot_tree() to display it, including feature_names to make the tree easy to read.

# ... your code here ...
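A minimal sketch of these steps (the four-column subset above is my reading of the data dictionary; adjust if the assignment specifies different columns):

```python
import pandas as pd
from sklearn import tree

url = 'http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv'
df = pd.read_csv(url)[['Survived', 'Pclass', 'Sex', 'Age']]
print(df.shape)     # shape before dropping rows
df = df.dropna()
print(df.shape)     # should be (714, 4)

df['Sex'] = (df.Sex == 'female').astype(int)  # 1 = female, 0 = male

X = df[['Pclass', 'Sex', 'Age']]
y = df.Survived

# Unlimited-depth tree: fits the training data very closely.
clf = tree.DecisionTreeClassifier(max_depth=None).fit(X, y)
print(f'accuracy={clf.score(X, y):.3f}, depth={clf.tree_.max_depth}')

# Shallow tree: easier to read and less prone to overfitting.
clf2 = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)
print(f'accuracy={clf2.score(X, y):.3f}')
tree.plot_tree(clf2, feature_names=X.columns.tolist())
```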
# ... your English text in a Markdown cell here ...
What proportion of female passengers survived, and what proportion of male passengers survived? Answer in two sentences via print(), with each proportion rounded to three decimal places.

Hint: There are many ways to do this. One quick way is to find the average of the Survived column for each subset.
# ... your code here ...
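One way to compute the proportions, continuing from the sketch above (it assumes df with Sex already encoded as 1 = female, 0 = male):

```python
# The mean of a 0/1 column is the proportion of 1s (survivors).
p_female = df.loc[df.Sex == 1, 'Survived'].mean()
p_male = df.loc[df.Sex == 0, 'Survived'].mean()
print(f'The proportion of females who survived is {p_female:.3f}.')
print(f'The proportion of males who survived is {p_male:.3f}.')
```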
Consider a decision tree node containing the following set of examples $S = \{(\mathbf{x}, y)\}$ where $\mathbf{x} = (x_1, x_2)$:
((4, 9), 1)
((2, 6), 0)
((5, 7), 0)
((3, 8), 1)
Find the entropy of $S$.
# ... your brief work and answer here in a markdown cell ...
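A small sketch for checking the entropy arithmetic (a generic helper, not a substitute for showing your work):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 0, 0, 1]))  # entropy of the labels in S
```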
Find a (feature, threshold) pair that yields the best split for this node.
# ... your brief work and answer here in a markdown cell ...
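And a brute-force check of candidate splits, taking thresholds at midpoints between sorted feature values and scoring by weighted child entropy (an assumption: the course may define candidate thresholds differently):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

X = np.array([[4, 9], [2, 6], [5, 7], [3, 8]])
y = np.array([1, 0, 0, 1])

best = None
for j in range(X.shape[1]):                   # feature index: 0 -> x_1, 1 -> x_2
    values = np.sort(np.unique(X[:, j]))
    for t in (values[:-1] + values[1:]) / 2:  # candidate thresholds at midpoints
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or h < best[0]:
            best = (h, j + 1, t)

h, feature, t = best
print(f'best split: feature x_{feature}, threshold {t}, weighted child entropy {h:.3f}')
```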