MNIST

# Fetch the MNIST dataset (70,000 grayscale 28x28 digit images) from OpenML.
from sklearn.datasets import fetch_openml  # was missing: fetch_openml was used without import
import matplotlib as mpl
import matplotlib.pyplot as plt

mnist = fetch_openml('mnist_784', version=1, parser='pandas')
# was missing: X and y were used below but never extracted from the fetched bunch.
# X: (70000, 784) DataFrame of pixel intensities; y: Series of label strings '0'-'9'.
X, y = mnist.data, mnist.target

some_digit = X.iloc[0].to_numpy()  # .iloc[0] returns a Series; to_numpy() flattens to a 784-vector
some_digit_image = some_digit.reshape(28, 28)
Xd = X.iloc[[0]]  # list-of-positions selection returns a one-row DataFrame

print(type(Xd))
plt.imshow(some_digit_image, cmap=mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

# MNIST is already shuffled and conventionally split: first 60,000 train, last 10,000 test.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Training a Binary Classifier

# Reduce the task to binary classification: "is this digit a 5?"
# The labels are strings ('0'..'9'), hence the comparison against the string '5'.
from sklearn.linear_model import SGDClassifier

y_train_5 = y_train == '5'
y_test_5 = y_test == '5'

# Stochastic Gradient Descent classifier; fixed seed makes the shuffling reproducible.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict(Xd)

Measuring Accuracy Using Cross-Validation

from sklearn.model_selection import cross_val_score

# 3-fold cross-validated accuracy of the binary "5-detector".
cross_val_score(sgd_clf, X_train, y_train_5, scoring="accuracy", cv=3)

Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).

Confusion Matrix

  • Counts the number of times instances of class A are classified as class B.
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

import numpy as np


# was missing: never_5_clf was used below but never defined.
class Never5Classifier(BaseEstimator):
    """Dummy baseline that predicts "not a 5" (False) for every instance.

    Subclassing BaseEstimator supplies get_params/set_params so the
    estimator can be cloned by cross_val_predict.
    """

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)


never_5_clf = Never5Classifier()

# Our classifier: out-of-fold predictions via K-fold cross-validation,
# so every training instance is predicted by a model that never saw it.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred, labels=[True, False])

# Nonsense baseline classifier: always predicts "not 5". Its accuracy is high
# (~90%) only because the dataset is skewed — hence the confusion matrix.
y_train_pred_never = cross_val_predict(never_5_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred_never, labels=[True, False])
  • Precision: (true positives) / (true positives + false positives)
  • Recall (sensitivity, true positive rate): (true positives) / (true positives + false negatives)
  • F1 Score: harmonic mean of precision and recall