Classical Machine Learning
Before building any classifier, we must answer a fundamental question: should our model learn the full joint distribution P(X, Y), or should it go straight for the conditional P(Y|X)? This choice shapes everything — architecture, training objective, data requirements, and what questions the model can answer.
| Discriminative P(Y | X) | Generative P(X, Y) = P(X | Y) P(Y) |
|---|---|
| Models the boundary between classes directly | Models how data is generated for each class |
| Typically needs less data to fit well | Needs more data; learns richer structure |
| Cannot generate new samples of X | Can sample new X given Y |
| Cannot answer P(X) queries | Can compute P(X) and detect anomalies |
| Examples: logistic regression, SVM, neural nets | Examples: Naïve Bayes, GMM, VAE, GPT |
| Training: minimise cross-entropy / hinge loss | Training: maximise log-likelihood P(X, Y; θ) |
For most classification tasks — spam detection, sentiment analysis, named-entity recognition — we only need P(Y|X). Discriminative models learn a direct mapping and typically outperform generative models when labelled data is plentiful. Generative models shine when you have limited labels, want to generate new data, or need anomaly detection.
Logistic regression is the workhorse of binary classification and the building block of deep learning output layers. Despite its name, it is a classification model, not a regression model. Understanding it deeply — geometrically, probabilistically, and computationally — is essential before studying neural networks.
The Model
We want to predict P(Y=1 | X=x). The sigmoid function squashes a linear combination of features into [0, 1]:
The linear part w·x + b is called the log-odds or logit. The decision boundary is the hyperplane where the logit equals zero — where the model is equally uncertain between the two classes.
Training: Maximum Likelihood = Cross-Entropy Loss
From Chapter 3, the MLE objective for logistic regression is to maximise the log-likelihood of the training labels. With N examples:
import numpy as np
class LogisticRegression: # from scratch, no sklearn
def __init__(self, lr=0.01, n_iter=1000, lam=0.0):
self.lr, self.n_iter, self.lam = lr, n_iter, lam
self.w = self.b = None
def _sigmoid(self, z): # stable sigmoid
return np.where(z >= 0, 1 / (1 + np.exp(-z)),
np.exp(z) / (1 + np.exp(z)))
def fit(self, X, y): # X: (N, D) y: (N,) binary
N, D = X.shape
self.w = np.zeros(D); self.b = 0.0
for _ in range(self.n_iter):
z = X @ self.w + self.b # (N,) logits
p = self._sigmoid(z) # (N,) probabilities
err = p - y # (N,) residual
dw = (X.T @ err) / N + self.lam * self.w # gradient + L2 reg
db = err.mean()
self.w -= self.lr * dw
self.b -= self.lr * db
def predict_proba(self, X):
return self._sigmoid(X @ self.w + self.b)
def predict(self, X, threshold=0.5):
return (self.predict_proba(X) >= threshold).astype(int)
# ---- Demo on linearly separable 2D data ----
np.random.seed(42)
X0 = np.random.randn(100, 2) - 1 # class 0 centred at (-1,-1)
X1 = np.random.randn(100, 2) + 1 # class 1 centred at (+1,+1)
X = np.vstack([X0, X1])
y = np.array([0]*100 + [1]*100)
model = LogisticRegression(lr=0.1, n_iter=500)
model.fit(X, y)
acc = (model.predict(X) == y).mean()
print(f"Train accuracy: {acc:.1%}")
# Train accuracy: 98.5%The Decision Boundary
The decision boundary is where σ(w·x + b) = 0.5, which means w·x + b = 0. This is a hyperplane in feature space. In 2D it's a line; in 3D a plane; in 768D (BERT embeddings) it's a 767-dimensional hyperplane.
import matplotlib.pyplot as plt
# Plot points and decision boundary
fig, ax = plt.subplots(figsize=(6, 5)
ax.scatter(X0[:, 0], X0[:, 1], c='steelblue', label='Class 0', alpha=0.6)
ax.scatter(X1[:, 0], X1[:, 1], c='crimson', label='Class 1', alpha=0.6)
# Decision boundary: w[0]*x + w[1]*y + b = 0 => y = -(w[0]*x + b)/w[1]
x_line = np.linspace(-4, 4, 100)
y_line = -(model.w[0] * x_line + model.b) / model.w[1]
ax.plot(x_line, y_line, 'k-', lw=2, label='Decision boundary')
# Probability contours (shows the sigmoid 'squash')
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200)
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.15, cmap='RdBu_r')
ax.legend(); plt.tight_layout(); plt.savefig('boundary.png', dpi=150)Regularisation: L1 and L2
Without regularisation, logistic regression can overfit to training data by growing its weights without bound. L2 regularisation (ridge) penalises large weights and corresponds to a Gaussian prior on w. L1 (lasso) promotes sparsity and corresponds to a Laplace prior.
from sklearn.linear_model import LogisticRegression as SklearnLR
import numpy as np
# Train with different L2 regularisation strengths (C = 1/λ in sklearn)
lambdas = np.logspace(-3, 3, 50)
coef_norms = []
for lam in lambdas:
m = SklearnLR(C=1/lam, max_iter=1000, solver='lbfgs')
m.fit(X, y)
coef_norms.append(np.linalg.norm(m.coef_)) # tracks L2 norm of weights
# High λ → small weights → underfitting (high bias)
# Low λ → large weights → overfitting (high variance)
print(f"λ=0.001: ‖w‖={coef_norms[0]:.3f}")
# λ=0.001: ‖w‖=2.847 (almost unregularised)
print(f"λ=1000: ‖w‖={coef_norms[-1]:.3f}")
# λ=1000: ‖w‖=0.001 (heavily penalised → near-zero weights)Multiclass Extension: Softmax Regression
Logistic regression extends to K classes through the softmax function. Each class gets its own weight vector wₖ; the probability of class k is softmax(Wₓ)ₖ. This is exactly what a neural network's final linear layer does.
SVMs take the logistic regression idea — find a separating hyperplane — and ask a deeper question: among all hyperplanes that correctly separate the training data, which one will generalise best to unseen points? The answer: the one with the largest margin.
The Maximum-Margin Objective
The margin is the distance from the decision boundary to the nearest training points on each side (the support vectors). SVMs maximise this margin. Intuitively, a wider margin means more room for test points to land on the correct side even if they differ slightly from training points.
The constraint yᵢ(w·xᵢ + b) ≥ 1 says every training point must be at least 1/‖w‖ away from the boundary on the correct side. Minimising ‖w‖² maximises this margin. The points where the constraint is tight (equality holds) are the support vectors.
Soft-Margin SVMs: Allowing Mistakes
Real data is rarely linearly separable. The soft-margin SVM introduces slack variables ξᵢ ≥ 0 that allow points to violate the margin. C controls the trade-off: large C penalises violations heavily (smaller margin, less overfitting tolerance); small C allows violations (larger margin, more robust).
import numpy as np
class LinearSVM: # subgradient descent, primal form
def __init__(self, C=1.0, lr=0.001, n_iter=1000):
self.C, self.lr, self.n_iter = C, lr, n_iter
self.w = self.b = None
def fit(self, X, y): # y in {-1, +1}
N, D = X.shape
self.w = np.zeros(D); self.b = 0.0
for _ in range(self.n_iter):
for i in range(N): # stochastic
margin = y[i] * (X[i] @ self.w + self.b)
if margin >= 1: # correctly classified with margin
self.w -= self.lr * self.w # regularisation only
else: # hinge loss active
self.w -= self.lr * (self.w - self.C * y[i] * X[i])
self.b += self.lr * self.C * y[i]
def predict(self, X):
return np.sign(X @ self.w + self.b)
# Compare C values on a noisy dataset
y_pm = (2 * y - 1) # convert 0/1 → -1/+1
for C in [0.01, 1.0, 100.0]:
svm = LinearSVM(C=C, lr=0.0005, n_iter=2000)
svm.fit(X, y_pm)
acc = (svm.predict(X) == y_pm).mean()
margin = 1 / np.linalg.norm(svm.w)
print(f"C={C:6.2f} acc={acc:.3f} margin={margin:.3f} ‖w‖={np.linalg.norm(svm.w):.3f}")
# C= 0.01 acc=0.870 margin=2.731 ‖w‖=0.366 (wide margin, misses some)
# C= 1.00 acc=0.985 margin=0.942 ‖w‖=1.061 (balanced)
# C=100.00 acc=0.990 margin=0.421 ‖w‖=2.374 (narrow margin, fits training)The Kernel Trick
Linear SVMs can only find linear boundaries. The kernel trick implicitly maps data into a high-dimensional feature space Φ(x) without computing the mapping explicitly. The dual SVM depends only on dot products xᵢ·xⱼ, which are replaced by K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ).
| Kernel | Formula K(x, z) | Feature space | When to use |
|---|---|---|---|
| Linear | x·z | ℝᴰ (no change) | Linearly separable data |
| Polynomial | (x·z + r)ᵈ | All degree-d monomials | Structured interactions |
| RBF / Gaussian | exp(-γ‖x-z‖²) | Infinite-dimensional | General, most popular |
| Sigmoid | tanh(α x·z + c) | Approximates 1-layer NN | NLP (historical) |
| String kernel | ∑ sub-sequence matches | Sequence feature space | Text without vectorising |
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
import numpy as np
# Non-linearly separable: concentric circles
X_circ, y_circ = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)
for kernel in ['linear', 'rbf', 'poly']:
svc = SVC(kernel=kernel, C=1.0, gamma='scale')
cv = cross_val_score(svc, X_circ, y_circ, cv=5, scoring='accuracy')
print(f"{kernel:8s}: {cv.mean():.3f} ± {cv.std():.3f}")
# linear: 0.508 ± 0.028 (circles not linearly separable)
# rbf: 0.975 ± 0.011 (RBF maps to sphere → linearly separable)
# poly: 0.847 ± 0.031 (degree-3 poly: better than linear, worse than RBF)Decision trees partition the feature space into rectangular regions using a sequence of binary splits. Each internal node tests a feature; each leaf assigns a class. They are inherently interpretable, handle mixed feature types, and require no normalisation — but they overfit easily and have high variance.
Information Gain: Choosing the Best Split
At each node, we choose the feature and threshold that maximises the reduction in label uncertainty after the split. We measure uncertainty with entropy H and call the reduction information gain:
Alternative impurity measures: Gini impurity G = 1 - Σₖ pₖ² (slightly faster to compute than entropy; similar results). Mean squared error for regression trees.
function BuildTree(S, depth):
if stopping_condition(S, depth):
return Leaf(majority_class(S))
best_gain ← 0; best_split ← None
for each feature f and threshold t:
Sₗ, Sᵣ ← split(S, f, t)
gain ← IG(S, Sₗ, Sᵣ)
if gain > best_gain:
best_gain ← gain; best_split ← (f, t)
Sₗ, Sᵣ ← split(S, best_split)
return Node(best_split,
left = BuildTree(Sₗ, depth+1),
right = BuildTree(Sᵣ, depth+1))import numpy as np
class Node: # a single node in the tree
def __init__(self, feature=None, threshold=None,
left=None, right=None, value=None):
self.feature, self.threshold = feature, threshold
self.left, self.right, self.value = left, right, value
def entropy(y): # Shannon entropy of label array
counts = np.bincount(y); probs = counts[counts > 0] / len(y)
return -np.sum(probs * np.log2(probs))
def info_gain(parent, left, right):
n = len(parent); nl, nr = len(left), len(right)
return entropy(parent) - (nl/n * entropy(left) + nr/n * entropy(right))
class DecisionTreeClassifier:
def __init__(self, max_depth=10, min_samples=2):
self.max_depth = max_depth; self.min_samples = min_samples
def _best_split(self, X, y):
best = {'gain': -1}
for f in range(X.shape[1]): # each feature
for thresh in np.unique(X[:, f]): # each unique value
lm = X[:, f] <= thresh # left mask
if lm.sum() < self.min_samples or (~lm).sum() < self.min_samples: continue
gain = info_gain(y, y[lm], y[~lm])
if gain > best['gain']: best = {'gain': gain, 'f': f, 't': thresh}
return best
def _build(self, X, y, depth=0):
if depth >= self.max_depth or len(np.unique(y)) == 1 or len(y) < self.min_samples:
return Node(value=np.bincount(y).argmax()) # leaf
split = self._best_split(X, y)
if split['gain'] <= 0: return Node(value=np.bincount(y).argmax())
lm = X[:, split['f']] <= split['t']
return Node(split['f'], split['t'],
left=self._build(X[lm], y[lm], depth+1),
right=self._build(X[~lm], y[~lm], depth+1))
def fit(self, X, y): self.root = self._build(X, y)
def _predict_one(self, x, node):
if node.value is not None: return node.value
branch = node.left if x[node.feature] <= node.threshold else node.right
return self._predict_one(x, branch)
def predict(self, X): return np.array([self._predict_one(x, self.root) for x in X])Overfitting and Pruning
A fully grown tree memorises the training data. Pruning controls complexity: pre-pruning stops growth early (max_depth, min_samples); post-pruning grows the full tree then prunes sub-trees that don’t improve validation performance. Regularisation via cost-complexity pruning adds a penalty α per leaf.
A single decision tree has high variance — small changes in training data produce wildly different trees. Ensemble methods combine many trees, each slightly different, to average out this variance. Two strategies dominate: bagging (train independently, then average) and boosting (train sequentially, each correcting the previous).
Random Forests: Bagging + Feature Randomness
Random forests train B trees, each on a bootstrap sample (sampling N points with replacement) and at each split only considering a random subset of features (√D for classification, D/3 for regression). The two sources of randomness decorrelate the trees, making their average much more stable.
import numpy as np
from collections import Counter
class RandomForest:
def __init__(self, n_trees=100, max_depth=10, max_features='sqrt'):
self.n_trees, self.max_depth, self.max_features = n_trees, max_depth, max_features
self.trees = []
def fit(self, X, y):
N, D = X.shape
k = int(np.sqrt(D)) if self.max_features == 'sqrt' else D
for _ in range(self.n_trees):
# 1. Bootstrap sample (sample N rows with replacement)
idx = np.random.choice(N, N, replace=True)
Xb, yb = X[idx], y[idx]
# 2. Build a tree with random feature subsets at each split
tree = DecisionTreeClassifier(max_depth=self.max_depth)
tree.max_features_k = k # (set attr for tree to read)
tree.fit(Xb, yb)
self.trees.append(tree)
def predict(self, X):
# Majority vote across all trees
votes = np.stack([t.predict(X) for t in self.trees], axis=1)
return np.array([Counter(row).most_common(1)[0][0] for row in votes])
# Compare single tree vs. forest on wine dataset
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Xw, yw = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(Xw, yw, test_size=0.3, random_state=42)
dt = DecisionTreeClassifier(max_depth=5); dt.fit(X_tr, y_tr)
print(f"Single tree: {(dt.predict(X_te)==y_te).mean():.3f}")
# Single tree: 0.907 (overfits a little)
rf = RandomForest(n_trees=200, max_depth=5); rf.fit(X_tr, y_tr)
print(f"Random forest: {(rf.predict(X_te)==y_te).mean():.3f}")
# Random forest: 0.981 (ensembling reduces variance dramatically)Gradient Boosting: Sequential Error Correction
Gradient boosting builds trees sequentially, each fitting the residuals (negative gradient of the loss) of the previous ensemble. Where random forests reduce variance by averaging independent models, gradient boosting reduces bias by iteratively correcting systematic errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class GradientBoostingRegressor:
def __init__(self, n_trees=100, lr=0.1, max_depth=3):
self.n_trees, self.lr, self.max_depth = n_trees, lr, max_depth
self.trees = []; self.f0 = None
def fit(self, X, y):
self.f0 = y.mean() # initial prediction: mean of targets
F = np.full(len(y), self.f0) # running ensemble prediction
for _ in range(self.n_trees):
residuals = y - F # negative gradient of MSE
tree = DecisionTreeRegressor(max_depth=self.max_depth)
tree.fit(X, residuals) # fit the residuals (not y!)
F += self.lr * tree.predict(X) # update ensemble
self.trees.append(tree)
def predict(self, X):
pred = np.full(len(X), self.f0)
for tree in self.trees:
pred += self.lr * tree.predict(X)
return pred
# Why max_depth=3? Shallow trees = high bias, low variance.
# Boosting supplies the bias reduction; trees should be weak learners.
# XGBoost, LightGBM, CatBoost all use this idea with massive engineering.Accuracy is a seductive but dangerous metric. On a dataset where 99% of examples are class 0, a classifier that always predicts class 0 achieves 99% accuracy while being completely useless. This section builds the correct evaluation toolkit.
The Confusion Matrix
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
classification_report, confusion_matrix)
def evaluate_classifier(y_true, y_pred, y_prob, name='Model'):
TP = ((y_pred==1) & (y_true==1)).sum()
TN = ((y_pred==0) & (y_true==0)).sum()
FP = ((y_pred==1) & (y_true==0)).sum()
FN = ((y_pred==0) & (y_true==1)).sum()
precision = TP / (TP + FP + 1e-9)
recall = TP / (TP + FN + 1e-9)
f1 = 2 * precision * recall / (precision + recall + 1e-9)
specificity= TN / (TN + FP + 1e-9)
roc_auc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)
print(f"\n{'='*40}\n{name}\n{'='*40}")
print(f"Precision: {precision:.3f} (of predicted positives, how many were right?)")
print(f"Recall: {recall:.3f} (of actual positives, how many did we find?)")
print(f"F1-score: {f1:.3f} (harmonic mean of precision and recall)")
print(f"Specificity: {specificity:.3f} (TN rate)")
print(f"ROC-AUC: {roc_auc:.3f} (rank order quality, threshold-independent)")
print(f"PR-AUC: {pr_auc:.3f} (better for imbalanced datasets)")
# Key insight: precision-recall are the right metrics for imbalanced data.
# ROC-AUC can be misleadingly high when negative class dominates.Calibration: Do Probabilities Mean What We Think?
A model predicts P(Y=1|x) = 0.7 for 1,000 points. If it is well-calibrated, roughly 700 of those should actually be class 1. Many models (especially SVMs and random forests) produce uncalibrated probabilities that require post-hoc correction via isotonic regression or Platt scaling.
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.svm import SVC
import numpy as np
# SVM does not output calibrated probabilities by default
svm_raw = SVC(probability=False, C=1.0); svm_raw.fit(X_tr, y_tr)
# Calibrated version: Platt scaling (logistic fit over SVM decision function)
svm_cal = CalibratedClassifierCV(SVC(C=1.0), cv=5, method='sigmoid')
svm_cal.fit(X_tr, y_tr)
# Calibration curve: fraction of positives vs. mean predicted prob
prob_cal = svm_cal.predict_proba(X_te)[:, 1]
fraction, mean_pred = calibration_curve(y_te, prob_cal, n_bins=10)
print(f"Calibration error (ECE): {np.mean(np.abs(fraction - mean_pred)):.4f}")
# Well-calibrated model: fraction ≈ mean_pred for all binsClassical discriminative models operate on fixed-length feature vectors. How you engineer those features determines the ceiling on model performance far more than the choice of algorithm. This section covers the techniques that practitioners used before deep learning automated feature extraction.
Numerical Features
Categorical Features
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np; import pandas as pd
# Example: Titanic-style dataset with mixed feature types
num_cols = ['age', 'fare'] # numerical
cat_cols = ['sex', 'embarked'] # categorical
num_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preproc = ColumnTransformer([
('num', num_pipe, num_cols),
('cat', cat_pipe, cat_cols),
])
model = Pipeline([
('preproc', preproc),
('clf', GradientBoostingClassifier(n_estimators=200, max_depth=3, lr=0.05)),
])
# model.fit(X_train, y_train) -- handles NaNs, encodes cats, scales numsMulticlass Strategies
Logistic regression and neural networks handle multiclass natively via softmax. For binary classifiers (SVM, basic logistic), two strategies extend to K classes:
| Strategy | How | Models trained | Prediction |
|---|---|---|---|
| One-vs-Rest (OvR) | Train K binary classifiers, each class vs. all others | K | argmax of K scores |
| One-vs-One (OvO) | Train K(K-1)/2 binary classifiers, each pair of classes | K(K-1)/2 | Majority vote |
| Softmax (native) | Single model with K output units + cross-entropy | 1 | argmax(softmax(z)) |
Multi-label Classification
Multi-label problems allow multiple labels per example (a news article can be about both politics and technology). The standard approach: train K independent binary classifiers (binary relevance), one per label. More sophisticated: classifier chains (each classifier sees previous label predictions as features) or label powerset (treat each unique label combination as a class).
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import jaccard_score
# Create a dataset with 4 labels, 2-3 labels per sample on average
X_ml, y_ml = make_multilabel_classification(n_samples=1000, n_features=20
n_classes=4, n_labels=2, random_state=0)
# Binary relevance: K independent binary classifiers
mo_clf = MultiOutputClassifier(LogisticRegression())
mo_clf.fit(X_ml[:800], y_ml[:800)
preds = mo_clf.predict(X_ml[800:])
# Jaccard score (intersection over union) is standard for multi-label
jacc = jaccard_score(y_ml[800:], preds, average='samples')
print(f"Jaccard score: {jacc:.3f}")
# Jaccard score: 0.742 (fraction of correct labels in union of true+pred)Classical discriminative models require human-engineered features. Neural networks replace the feature engineering step with learned representations. But the final classification layer is still logistic regression (binary) or softmax regression (multiclass) — the models in this chapter are not replaced, they are elevated.
Feature Space Geometry
Deep learning training can be understood as learning a representation space where the original classification problem becomes linearly separable. Visualising this with t-SNE or UMAP reveals whether the network has learned a useful geometry.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
# Linear probe: can a logistic regression on BERT embeddings
# classify sentiment? If yes, BERT has learned a useful representation.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval()
def get_embedding(text):
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
with torch.no_grad():
out = model(**inputs)
return out.last_hidden_state[:, 0, :].squeeze().numpy()
# [CLS] token embedding: 768-dimensional summary of the sentence
# Simulate sentiment data (in practice, use SST-2 or IMDB)
sentences = ['I love this film', 'Terrible movie', 'Amazing!', 'Awful']
labels = [1, 0, 1, 0] # 1=positive, 0=negative
embeddings = np.stack([get_embedding(s) for s in sentences])
# Train logistic regression on the [CLS] embeddings
probe = LogisticRegression(); probe.fit(embeddings, labels)
print(f"Linear probe accuracy: {probe.score(embeddings, labels):.1%}")
# With enough data: ~90%+ on SST-2 -- BERT's representations are linearly separableReference: Which Classifier to Use When
| Situation | Recommended model | Why |
|---|---|---|
| Linearly separable, need probabilities | Logistic Regression | Fast, calibrated, interpretable coefficients |
| Non-linear boundary, medium dataset | RBF SVM | Kernel trick; good generalisation |
| Tabular mixed features, no DL | Gradient Boosted Trees | Best general-purpose tabular model |
| Need feature importance | Random Forest | Out-of-bag importance + low tuning |
| Very high-dimensional (text TF-IDF) | Logistic Regression (L1) | Sparse features → sparse solution |
| Imbalanced classes | GBT + PR-AUC metric | Handles imbalance; PR-AUC measures it |
| Need to generate examples | Use generative model (Ch. 7) | Discriminative models can't generate X |
| Rich unstructured input (images, text) | Neural network (Parts III–IV) | Learns features automatically |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “The Elements of Statistical Learning” (Hastie, Tibshirani, Friedman) Chapters 4, 9, 10, 15 — the authoritative reference. “Pattern Recognition and Machine Learning” (Bishop) Chapter 7 for SVMs. scikit-learn documentation for practical API reference.
Next → Chapter 7: Generative Models (Pre-deep learning)
Chapter 7 flips the modelling question: instead of learning P(Y|X) directly, we learn how data is generated — P(X|Y) and P(Y). Naïve Bayes, Gaussian Mixture Models, and Hidden Markov Models show why generative approaches excel when labels are scarce, enabling models to share statistical strength across classes and generate new samples from the learned distribution.