Lecture 2. Basic machine learning models

HTML Slides │ PDF Slides │ Demo code on GitHub

Topic overview#

  • Some common machine learning tasks and models
  • Evaluating model performance
  • Limitations and assumptions

Machine learning#

  • To appropriately process the data, we need to know why we are doing it and what assumptions we’re making
  • Modern machine learning toolkits (such as scikit-learn) are so easy to use, they’re easy to use inappropriately
  • Goal: just enough understanding to use basic models responsibly

Why are we processing data?#

  • No reinforcement learning in this course, sorry

A selection of common models#

Supervised#

  • Linear/logistic regression
  • Decision trees
  • Support vector machines

Unsupervised#

  • K-means clustering
  • Principal component analysis
Source: Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python. O’Reilly Media, Incorporated. Available at MRU Library.

No free lunch#

  • A theory-heavy 1996 paper (Wolpert's "no free lunch" theorem) showed that no single machine learning algorithm excels in all situations
  • Subsequent work has confirmed this, e.g. a 2018 analysis
  • Tree-based methods, particularly gradient boosted trees, tend to outperform other algorithms most often, but they still have limitations
  • What does it mean to “outperform”?
Further reading: http://www.feat.engineering/important-concepts#no-free-lunch

Model evaluation: Classification#

  • True positive: predicted positive, label was positive ($TP$) ✔️

  • True negative: predicted negative, label was negative ($TN$) ✔️

  • False positive: predicted positive, label was negative ($FP$) ❌ (type I)

  • False negative: predicted negative, label was positive ($FN$) ❌ (type II)

  • Accuracy is the fraction of correct predictions, given as:

    $$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
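As a quick sanity check of the formula, here is a minimal sketch with made-up counts:

```python
# Hypothetical confusion counts, for illustration only
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```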

Precision and recall#

  • Precision: Out of all the positive predictions, how many were correct? $$\mathrm{precision} = \frac{TP}{TP + FP}$$

  • Recall: Out of all the positive labels, how many were correctly identified? $$\mathrm{recall} = \frac{TP}{TP + FN}$$

  • Specificity: Out of all the negative labels, how many were correctly identified? $$\mathrm{specificity} = \frac{TN}{TN + FP}$$

Confusion matrix#

|                 | Predicted Positive | Predicted Negative |
| --------------- | ------------------ | ------------------ |
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |
  • The axes might be reversed, but a good predictor will have strong diagonals
  • There’s also the F1 score, or harmonic mean of precision and recall: $$F1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Check out the Wikipedia page for more ways of describing the same information
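These metrics are all one call away in scikit-learn. A minimal sketch with made-up labels (note that scikit-learn's `confusion_matrix` puts the negative class first, so it prints as [[TN, FP], [FN, TP]]):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))  # [[3 1], [1 3]]
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean = 0.75
```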

ROC Curves#

  • The receiver operating characteristic curve is a plot of the true positive rate (recall or sensitivity) vs. false positive rate (1 - specificity) as the detection threshold changes

  • The diagonal is the same as random guessing

  • A perfect classifier would hug the top left corner

Fun fact: the name comes from WWII radar operators, where true positives were airplanes and false positives were noise
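A sketch of plotting an ROC curve with scikit-learn on synthetic data (the model and dataset choices here are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Sweeps the decision threshold and plots TPR vs. FPR
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.plot([0, 1], [0, 1], "k--", label="random guessing")  # the diagonal
plt.legend()
plt.show()
```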

Classification model: Support Vector Classifier#

  • Linear model that finds the hyperplane (defined by support vectors) that best separates the classes
  • “Kernel trick” allows for nonlinear boundaries
  • Check out the SVM Appendix of Hands-on Machine Learning by Aurélien Géron for more info
Image from Scikit-learn documentation
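A minimal sketch of fitting a support vector classifier with and without the kernel trick (synthetic data, arbitrary hyperparameters):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svc = SVC(kernel="linear").fit(X, y)  # straight decision boundary
rbf_svc = SVC(kernel="rbf").fit(X, y)        # kernel trick: curved boundary

# The RBF kernel should fit this data noticeably better
print(linear_svc.score(X, y), rbf_svc.score(X, y))
```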

Classification Model: Decision Trees#

  • Family of models including:
    • decision trees
    • random forests
    • gradient boosted decision trees
  • Finds feature thresholds that best separate the classes
  • Complexity is controllable through a depth parameter (see the sketch below)
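A minimal sketch of how the depth parameter trades off under- and overfitting (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits how many threshold splits can be stacked;
# None lets the tree grow until the leaves are pure
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_test, y_test))
```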

Model evaluation: regression#

For a predicted $\hat{\mathbf{y}}$ and actual $\mathbf{y}$, metrics include:

  • Mean squared error: $MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$
  • Root mean squared error: $RMSE = \sqrt{MSE}$
  • Mean absolute error: $MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$
  • Mean absolute percentage error: $MAPE = \frac{100}{n}\sum_{i=1}^n \left|\frac{y_i - \hat{y}_i}{y_i}\right|$
  • Coefficient of determination: $R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$ ($R^2$ has some caveats)
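All of these are available in scikit-learn; a minimal sketch with made-up values (note that scikit-learn's MAPE returns a fraction rather than a percentage):

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))                               # MSE, RMSE
print(mean_absolute_error(y_true, y_pred))             # MAE
print(mean_absolute_percentage_error(y_true, y_pred))  # MAPE (as a fraction)
print(r2_score(y_true, y_pred))                        # R^2
```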

Where we left off on January 13#

Regression: ordinary least squares#

  • In 1D, estimate modeled as: $\hat{y_i} = wx_i + b$, where $i$ denotes a sample

  • Vector form: $\mathbf{\hat{y}} = w\mathbf{x} + b$

  • Goal is to minimize the Mean Square Error:

    $$\begin{aligned} MSE &= \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)^2 = \frac{1}{m} \sum_{i=1}^m (b + w x_i - y_i)^2 \\ MSE &= \frac{1}{m} (\hat{\mathbf{y}} - \mathbf{y})^T (\hat{\mathbf{y}} - \mathbf{y}) \end{aligned}$$
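A minimal sketch of a 1D least-squares fit with NumPy (synthetic data with a known slope and intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)  # true w = 2, b = 1

# A degree-1 polynomial fit minimizes the squared error
w, b = np.polyfit(x, y, deg=1)
mse = np.mean((w * x + b - y) ** 2)
print(w, b, mse)  # w ~ 2, b ~ 1
```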

Regression: N-dimensional#

  • Most of the time we have $N > 1$ features

  • N-D: $\hat{y_i} = w_1x_{i1} + w_2x_{i2} + \ldots + w_nx_{in} + b$

  • Common to use a design matrix $\mathbf{X}$ to represent the feature values:

    $$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

    where each row is an instance (sample) and each column is a feature.
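In code, building the design matrix amounts to prepending a column of ones (a sketch with a made-up feature matrix):

```python
import numpy as np

# Hypothetical features: m = 3 samples, n = 2 features
features = np.array([[2.0, 1.0],
                     [4.0, 3.0],
                     [6.0, 5.0]])

# The column of ones pairs with the intercept term
X = np.column_stack([np.ones(len(features)), features])
print(X)
```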

Regression: N-dimensional#

  • We can rewrite the estimate in matrix notation: $$\hat{\mathbf{y}} = \mathbf{X} \mathbf{\theta}$$ where $\mathbf{\theta} = [b, w_1, \ldots, w_n]^T$ stacks the intercept and the weights (matching the column of ones in $\mathbf{X}$)

  • The MSE can be written as:

    $$MSE = \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)^2 = \frac{1}{m} (\mathbf{X} \mathbf{\theta} - \mathbf{y})^T (\mathbf{X} \mathbf{\theta} - \mathbf{y})$$

  • This has a closed-form solution, the normal equation $\hat{\mathbf{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, but it is computationally expensive when the number of features is large
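A sketch of the closed-form solution on synthetic data, checked against scikit-learn (the coefficients and noise level are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
y = features @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Normal equation: solve (X^T X) theta = X^T y, with a bias column prepended
# (solving the linear system is more stable than an explicit matrix inverse)
X = np.column_stack([np.ones(len(features)), features])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # ~ [4.0, 1.5, -2.0, 0.5]

# scikit-learn fits the intercept internally, no column of ones needed
lr = LinearRegression().fit(features, y)
print(lr.intercept_, lr.coef_)
```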

Regression: Decision trees#

Figure 2-32. Comparison of predictions made by a linear model and predictions made by a regression tree on the RAM price data. Source: Introduction to Machine Learning with Python
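The behavior in the figure is easy to reproduce on synthetic data: a regression tree predicts a constant outside the range it was trained on, while a linear model keeps extrapolating the trend. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(100, 1))
y_train = 3.0 * x_train.ravel() + rng.normal(scale=1.0, size=100)

tree = DecisionTreeRegressor(max_depth=3).fit(x_train, y_train)
line = LinearRegression().fit(x_train, y_train)

x_new = np.array([[20.0]])  # well outside the training range
print(tree.predict(x_new))  # stuck at the last leaf's value (~30)
print(line.predict(x_new))  # follows the trend (~60)
```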

Comparison of model families#

Linear models#

Strengths:

  • Very fast, particularly at inference
  • Scalable to a large number of features
  • Can model nonlinearity with the kernel trick
  • Easy to regularize

Weaknesses:

  • Difficult to interpret
  • Sensitive to parameters and preprocessing
  • Data needs to be on the same scale
  • Slow to train on large datasets

Decision Trees#

Strengths:

  • Highly explainable
  • Fast to train
  • Few parameters to tune
  • Little preprocessing needed
  • Provides feature importances

Weaknesses:

  • Prone to overfitting
  • High variance
  • Poor extrapolation

Coming up next#

  • Exploring and understanding your data
  • Splitting your data
    • Repeatability considerations
    • Stratified sampling
  • Assignment 1: Exploring Calgary traffic data