[Please put your name and NetID here.]
Start by downloading HW05.ipynb from this folder. Then develop it into your solution.
Write code where you see "... your code here ..." below. (You are welcome to use more than one cell.)
If you have questions, please ask them in class or office hours. Our TA and I are very happy to help with the programming (provided you start early enough, and provided we are not helping so much that we undermine your learning).
When you are done, run these Notebook commands: 'Kernel > Restart and Run All' (to re-run your whole notebook from a clean start) and 'File > Download as > HTML' (to produce HW05.html).
Turn in HW05.ipynb and HW05.html to Canvas's HW05 assignment.
As a check, download your files from Canvas to a new 'junk' folder. Try 'Kernel > Restart and Run All' on the '.ipynb' file to make sure it works. Glance through the '.html' file.
Turn in partial solutions to Canvas before the deadline. For example, turn in part 1, then parts 1 and 2, then your whole solution. That way we can award partial credit even if you miss the deadline. We will grade your last submission before the deadline.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm, linear_model, datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
accuracy_score, roc_auc_score, RocCurveDisplay)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler
The digits dataset has 1797 labeled images of handwritten digits.
- digits.data has shape (1797, 64). Each row $\mathbf{x}_i$ of digits.data is a length-64 array that corresponds to an 8x8 photo of a handwritten digit.
- digits.target has shape (1797,). Each $y_i$ is a number from 0 to 9 indicating the handwritten digit that was photographed and stored in $\mathbf{x}_i$.

This step does not need to display any output.
# ... your code here ...
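For reference, here is a rough sketch of one way this step might look. It assumes the task is to load the digits dataset and split it into training, validation, and test sets (later parts refer to X_train, X_valid/y_valid, and X_test/y_test); the 60/20/20 split proportions and random_state are assumptions, not requirements.
# Sketch only; the 60/20/20 proportions and random_state below are assumptions.
digits = datasets.load_digits()
X, y = digits.data, digits.target
# Hold out 20% for testing, then 25% of the remainder (20% of the total) for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train)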
Loop through these four classifiers and corresponding parameters, doing a grid search to find the best hyperparameter setting. Use only the training data for the grid search.
- SVM (svm.SVC): kernel in 'linear', 'rbf'; C in 0.01, 1, 100.
- Logistic regression (linear_model.LogisticRegression): use max_iter=5000 to avoid a nonconvergence warning; C in 0.01, 1, 100.
- Decision tree (DecisionTreeClassifier): use criterion='entropy' to get our ID3 tree; max_depth in 1, 3, 5, 7.
- kNN (KNeighborsClassifier): n_neighbors in 1, 2, 3, 4.

Hint:
- Use clf from clf = GridSearchCV(...) to find the accuracy of the model on the validation data, i.e. find clf.score(X_valid, y_valid).
- Keep track of the best accuracy seen so far (initialize it to -1, -np.Inf, or some other value) and the best clf from clf = GridSearchCV(...) (initialize it to None or some other value).

I needed about 30 lines of code to do this. It took a minute to run.
# ... your code here ...
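A minimal sketch of one way to structure this loop (not necessarily the intended solution); the parameter grids come from the list above, while the variable names and the choice to print each result are assumptions.
# Sketch only: loop over (model, parameter grid) pairs, run GridSearchCV on the
# training data, and keep whichever model scores best on the validation data.
models_and_grids = [
    (svm.SVC(), {'kernel': ['linear', 'rbf'], 'C': [0.01, 1, 100]}),
    (linear_model.LogisticRegression(max_iter=5000), {'C': [0.01, 1, 100]}),
    (DecisionTreeClassifier(criterion='entropy'), {'max_depth': [1, 3, 5, 7]}),
    (KNeighborsClassifier(), {'n_neighbors': [1, 2, 3, 4]}),
]
best_score = -np.inf  # best validation accuracy seen so far
best_clf = None       # the GridSearchCV object that achieved it
for model, grid in models_and_grids:
    clf = GridSearchCV(model, grid)       # cross-validated grid search
    clf.fit(X_train, y_train)             # use only the training data
    score = clf.score(X_valid, y_valid)   # accuracy on the validation data
    print(type(model).__name__, clf.best_params_, score)
    if score > best_score:
        best_score, best_clf = score, clf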
Evaluate your best classifier/hyperparameters on the test data:
- Call .score(X_test, y_test) on your best classifier/hyperparameters.
- Show the y_test values and the corresponding $\hat{y}$ values predicted by your best classifier/hyperparameters on X_test.
- For the errors (where y_test and your $\hat{y}$ values disagree), show:

# ... your code here ...
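A short sketch of one possible approach; exactly what to display for each disagreement is an assumption here.
# Sketch only: test accuracy, predictions, and the cases where y_test and y_hat disagree.
print('test accuracy:', best_clf.score(X_test, y_test))
y_hat = best_clf.predict(X_test)
print('y_test:', y_test)
print('y_hat :', y_hat)
for i in np.where(y_test != y_hat)[0]:     # indices of the misclassified digits
    print(f'index {i}: y_test={y_test[i]}, y_hat={y_hat[i]}')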
Use pd.read_table() to read it into a DataFrame.
Hint: pd.read_table() has many parameters. Check its documentation to find three parameters to:
# ... your code here ...
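A hedged sketch of what such a call could look like: the filename 'scores.txt' is hypothetical (the real file is named earlier in the assignment), and the three parameters shown are guesses, not the required answer. It assumes the file is whitespace-separated with comment lines and two score columns.
# Sketch only: hypothetical filename and parameter choices.
df = pd.read_table('scores.txt',              # hypothetical filename
                   sep=r'\s+',                # assumed: whitespace-separated columns
                   comment='#',               # assumed: skip comment lines
                   names=['Exam1', 'Exam2'])  # assumed: name the columns
df.head()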
Use clf = mixture.GaussianMixture(n_components=1) to make a one-class Gaussian model to decide which $\mathbf{x}=(\text{Exam1}, \text{Exam2})$ are outliers:
Set a matrix X to the first two columns, Exam1 and Exam2.
These exams were worth 125 points each. Transform scores to percentages in $[0, 100]$.
Hint: I tried the MinMaxScaler() first, but it does the wrong thing if there aren't scores of 0 and 125 in each column. So, instead, I just multiplied the whole matrix by 100 / 125.
Fit your classifier to X.
Hint: mixture.GaussianMixture includes a fit(X, y=None) method with the comment that y is ignored (as this is an unsupervised learning algorithm; there is no $y$) but present for API consistency. So we can fit with just X.
Print the center $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$ from the two-variable $N_2(\mathbf{\mu}, \mathbf{\Sigma})$ distribution you estimated.
# ... your code here ...
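A minimal sketch of these steps, assuming the DataFrame from the previous part is named df and its first two columns are the Exam1 and Exam2 scores.
# Sketch only: build X, rescale to percentages, fit, and print mu and Sigma.
X = df.iloc[:, :2].to_numpy() * 100 / 125   # first two columns, rescaled to [0, 100]
clf = mixture.GaussianMixture(n_components=1)
clf.fit(X)                                  # y is ignored; this is unsupervised
print('mu =', clf.means_[0])                # estimated center
print('Sigma =', clf.covariances_[0])       # estimated covariance matrix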
Here I make a contour plot of the log-likelihood of samples from clf.
# make contour plot of log-likelihood of samples from clf.score_samples()
margin = 10
x = np.linspace(0 - margin, 100 + margin)
y = np.linspace(0 - margin, 100 + margin)
grid_x, grid_y = np.meshgrid(x, y)
two_column_grid_x_grid_y = np.array([grid_x.ravel(), grid_y.ravel()]).T
negative_log_pdf_values = -clf.score_samples(two_column_grid_x_grid_y)
grid_z = negative_log_pdf_values
grid_z = grid_z.reshape(grid_x.shape)
plt.contour(grid_x, grid_y, grid_z, levels=10) # X, Y, Z
plt.title('(Exam1, Exam2) pairs')
Paste my code into your code cell below and add more code:
- Add black $x$- and $y$-axes. Label them Exam1 and Exam2.
- Plot the data points in blue.
- Plot $\mathbf{\mu}=$ clf.means_ as a big lime dot.
- Overplot (i.e. plot again) in red the 8 outliers determined by a threshold consisting of the 0.02 quantile of the pdf values $f_{\mathbf{\mu}, \mathbf{\Sigma}}(\mathbf{x})$ for each $\mathbf{x}$ in X.

Hint: clf.score_samples(X) gives log likelihood, so np.exp(clf.score_samples(X)) gives the required $f_{\mathbf{\mu}, \mathbf{\Sigma}}(\mathbf{x})$ values.
# ... your code here ...
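A rough sketch of the additions described above (to be appended after my contour-plot code); the styling details and the use of plt.axhline/plt.axvline for the black axes are assumptions.
# Sketch only: axes, labels, data, mean, and overplotted outliers.
plt.axhline(0, color='black')                 # black x-axis
plt.axvline(0, color='black')                 # black y-axis
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.scatter(X[:, 0], X[:, 1], color='blue')   # data points in blue
plt.scatter(clf.means_[0, 0], clf.means_[0, 1], color='lime', s=200)  # mu as a big lime dot
pdf_values = np.exp(clf.score_samples(X))     # f_{mu, Sigma}(x) for each row of X
threshold = np.quantile(pdf_values, 0.02)     # 0.02 quantile of the pdf values
outliers = X[pdf_values < threshold]
plt.scatter(outliers[:, 0], outliers[:, 1], color='red')  # overplot the outliers in red
plt.show()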
# ... your English text in a Markdown cell here ...
Hint: Compare $f_{\mathbf{\mu}, \mathbf{\Sigma}}(\mathbf{x})$ to your threshold
# ... your code here ...
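Assuming the task here is to report which points fall below the threshold, a short sketch:
# Sketch only: flag each x whose pdf value is below the 0.02-quantile threshold from above.
is_outlier = np.exp(clf.score_samples(X)) < threshold
print(df[is_outlier])   # the original rows for the flagged (Exam1, Exam2) pairs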
Here I make a fake imbalanced data set by randomly sampling $y$ from a distribution with $P(y = 0) = 0.980$ and $P(y = 1) = 0.020$.
X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.98, 0.02],
n_clusters_per_class=1, flip_y=0.01, random_state=0)
print(f'np.bincount(y)={np.bincount(y)}; we expect about 9800 zeros and 200 ones.')
print(f'np.mean(y)={np.mean(y)}; we expect the proportion of ones to be about 0.020.')
np.bincount(y)=[9752 248]; we expect about 9800 zeros and 200 ones.
np.mean(y)=0.0248; we expect the proportion of ones to be about 0.020.
Here I split the data into 50% training and 50% testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0, stratify=y)
print(f'np.bincount(y_train)={np.bincount(y_train)}')
print(f'np.mean(y_train)={np.mean(y_train)}.')
print(f'np.bincount(y_test)={np.bincount(y_test)}.')
print(f'np.mean(y_test)={np.mean(y_test)}.')
np.bincount(y_train)=[4876 124]
np.mean(y_train)=0.0248.
np.bincount(y_test)=[4876 124].
np.mean(y_test)=0.0248.
Train a classifier on the (imbalanced) training data and report its accuracy, precision, recall, and AUC on the test data. Use random_state=0 (to give us all a chance of getting the same results).
# ... your code here ...
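A minimal sketch, assuming GradientBoostingClassifier (imported above) is the intended classifier and that these metrics are the ones to report; both are assumptions.
# Sketch only: fit on the imbalanced training data and report test-set metrics.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_hat))
print('precision:', precision_score(y_test, y_hat))
print('recall   :', recall_score(y_test, y_hat))
print('AUC      :', roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
print(confusion_matrix(y_test, y_hat))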
Note the high accuracy but lousy precision, recall, and AUC.
Note that since the data have about 98% $y = 0$, we could get about 98% accuracy by just always predicting $\hat{y} = 0$. High accuracy alone is not necessarily helpful.
Use RandomOverSampler(random_state=0) to oversample only the training data and get a balanced training data set. Then retrain your classifier on the balanced training data and report the same metrics on the test data.
# ... your code here ...
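A minimal sketch of this step, again assuming GradientBoostingClassifier and the same metrics as before:
# Sketch only: oversample the minority class in the training data only, then refit.
ros = RandomOverSampler(random_state=0)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print(f'np.bincount(y_train_ros)={np.bincount(y_train_ros)}')   # now balanced
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train_ros, y_train_ros)
y_hat = clf.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_hat))
print('precision:', precision_score(y_test, y_hat))
print('recall   :', recall_score(y_test, y_hat))
print('AUC      :', roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))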
Note that we traded a little accuracy for much improved precision, recall, and AUC.
If you do classification in your project and report accuracy, please also report the proportions of $y = 0$ and $y = 1$ in your test data so that we get insight into whether your model improves upon always guessing $\hat{y} = 0$ or always guessing $\hat{y} = 1$.