Part II: Classical ML & Representations
Chapter 7

Word Embeddings & Representations

Word2Vec, GloVe, and the geometry of meaning
18 Exercises
7.1

Chapter 6 found decision boundaries — surfaces that separate classes. This chapter asks a richer question: can we learn a complete model of how data is generated? A model of P(X, Y) can do everything a discriminative model can (predict Y from X via Bayes), plus much more: generate new samples, detect anomalies, handle missing data, and improve when labelled data is scarce.

Discriminative P(Y | X)Generative P(X, Y)
Directly learns the boundaryLearns the full joint distribution
Higher accuracy when N is largeBetter when N is small (uses X structure too)
Cannot generate XCan sample new X from P(X) = Σ_Y P(X,Y)
Cannot detect anomaliesFlags low-P(X) inputs as out-of-distribution
Cannot handle missing featuresCan marginalise over missing dimensions
Examples: LR, SVM, transformersExamples: NB, GMM, HMM, VAE, diffusion, LLMs
The Big Payoff: Generative Pre-training
The most powerful models of 2024 — GPT-4, Gemini, Claude — are trained generatively: they learn P(x₁, …, xₙ) = ∏ P(xₜ | x<t) over internet-scale text. This generative objective is what makes them capable of reasoning, translation, code generation, and classification without task-specific training.
The classical generative models in this chapter (NB, GMM, HMM) share this same DNA. Understanding them is the fastest path to understanding why generative pre-training works so well at scale.

The Generative Classification Recipe

Given a generative model, classification via Bayes' theorem is always available:

P(Y=k | x) = P(x | Y=k) · P(Y=k) / P(x) where P(x) = Σₖ P(x|Y=k) P(Y=k)

The class-conditional density P(x|Y=k) is the generative model for class k. P(Y=k) is the class prior. The denominator is the marginal likelihood — we normalise over all classes. Every generative classifier in this chapter follows this recipe; they differ only in how they model P(x|Y=k).

7.2

Naïve Bayes is the simplest generative classifier. It assumes that all features are conditionally independent given the class — a drastically simplifying assumption that is almost always false in practice, yet produces competitive classification accuracy in many settings, especially text.

The Naïve Bayes Assumption

P(x | Y=k) = P(x₁|Y=k) · P(x₂|Y=k) · … · P(xD|Y=k) = ∏_{j=1}^{D} P(xⱼ|Y=k)

This factorisation means we only need to estimate D class-conditional univariate distributions — one per feature per class — instead of a full D-dimensional joint. For D=10,000 (a small text vocabulary), this reduces the parameter count from astronomically large to manageable.

History: Why 'Naïve'?
The conditional independence assumption is called 'naïve' because it clearly fails for most real data: word 'bank' is not independent of word 'river' given class='finance'. The assumption was called naïve by the 1960s ML community, yet Naïve Bayes remained a dominant text classifier until the 2010s.
The reason it works despite the wrong assumption: it only needs to get the argmax of P(Y=k|x) right, not the exact probabilities. Even badly miscalibrated posteriors can point at the right class.

Three Flavours by Feature Type

VariantFeature typeClass-conditional modelTypical use
Bernoulli NBBinary (0/1)P(xⱼ=1|k) = pⱼₖ (Bernoulli)Bag-of-words presence
Multinomial NBCountsP(x|k) ∝ ∏ θⱼₖ^xⱼ (Multinomial)TF word counts
Gaussian NBContinuousP(xⱼ|k) = N(xⱼ; μⱼₖ, σ²ⱼₖ)Numerical features

MLE Parameter Estimation

All three flavours share the same training procedure: count (or measure) class-conditional statistics from labelled data. For Multinomial NB the MLE estimate is:

θⱼₖ = (count of feature j in class k docs + α) / (total features in class k + α·V)

The +α smoothing (Laplace/add-k) prevents zero probabilities for unseen features. Without it, any word not seen in training for class k makes the entire document probability zero.

PythonMultinomial Naïve Bayes from scratch
import numpy as np
from collections import defaultdict

class MultinomialNB:
    def __init__(self, alpha=1.0):  # alpha: Laplace smoothing
        self.alpha = alpha
        self.class_log_prior_ = {}   # log P(Y=k)
        self.feature_log_prob_ = {}  # log P(xⱼ=1|Y=k)

    def fit(self, X, y):  # X: (N, V) count matrix; y: (N,)
        classes, counts = np.unique(y, return_counts=True)
        N, V = X.shape
        for k, cnt in zip(classes, counts):
            self.class_log_prior_[k] = np.log(cnt / N)
            X_k = X[y == k]              # rows belonging to class k
            # Smoothed MLE: (count + α) / (total + α·V)
            feature_counts = X_k.sum(axis=0) + self.alpha  # (V,)
            self.feature_log_prob_[k] = np.log(feature_counts / feature_counts.sum())
        return self

    def _log_likelihood(self, X, k):
        return X @ self.feature_log_prob_[k]  # dot product of counts & log-probs

    def predict_log_proba(self, X):  # (N, K) log-posteriors (unnormalised)
        classes = sorted(self.class_log_prior_)
        log_posts = np.stack(
            [self.class_log_prior_[k] + self._log_likelihood(X, k) for k in classes], axis=1)
        return log_posts - log_posts.max(axis=1, keepdims=True)

    def predict(self, X):
        return np.array(sorted(self.class_log_prior_))[self.predict_log_proba(X).argmax(axis=1)]

# ---- Spam/ham demo ----
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset='all', categories=['sci.med', 'rec.sport.hockey'])
vec  = CountVectorizer(max_features=5000); X_bow = vec.fit_transform(data.data).toarray()
X_tr, X_te, y_tr, y_te = train_test_split(X_bow, data.target, test_size=0.2, random_state=42)

nb = MultinomialNB(alpha=1.0); nb.fit(X_tr, y_tr)
print(f"Accuracy: {(nb.predict(X_te)==y_te).mean():.3f}")
# Accuracy: 0.978  (98% on 5000-dim bag-of-words)

Gaussian Naïve Bayes

When features are continuous, we model each P(xⱼ|Y=k) as a Gaussian with class-specific mean and variance. Training is just computing per-class, per-feature means and variances.

PythonGaussian NB from scratch
import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.theta_  = {}  # per-class mean vectors
        self.sigma_  = {}  # per-class variance vectors
        self.prior_  = {}
        N = len(y)
        for k in self.classes_:
            Xk = X[y == k]
            self.theta_[k] = Xk.mean(axis=0)
            self.sigma_[k] = Xk.var(axis=0) + 1e-9  # var + epsilon for stability
            self.prior_[k] = len(Xk) / N

    def _log_likelihood(self, X, k):
        mu, var = self.theta_[k], self.sigma_[k]
        # log N(x; mu, var) = -0.5*log(2πvar) - (x-mu)²/(2var)
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu)**2 / var, axis=1)

    def predict(self, X):
        log_posts = np.stack(
            [np.log(self.prior_[k]) + self._log_likelihood(X, k) for k in self.classes_], axis=1)
        return self.classes_[log_posts.argmax(axis=1)]

from sklearn.datasets import load_iris
Xi, yi = load_iris(return_X_y=True)
gnb = GaussianNB(); gnb.fit(Xi[:120], yi[:120)
print(f"Iris accuracy: {(gnb.predict(Xi[120:])==yi[120:]).mean():.3f}")
# Iris accuracy: 0.967
⚠️
Pitfall: When Naïve Bayes Fails Hard
NB catastrophically miscalibrates probabilities when features are strongly correlated. A spam filter might see 'free' and 'money' in the same email and double-count their evidence, driving the posterior to near 1.0 when the true probability is moderate.
The fix is not to avoid NB — its accuracy is still good — but to never trust its probability estimates for decisions that require calibration (e.g., ranking by probability, cost-sensitive thresholds). Use calibration techniques from Chapter 6 if you need well-calibrated posteriors.
7.3

A single Gaussian cannot model multimodal distributions (data with multiple clusters or peaks). A Gaussian Mixture Model (GMM) represents P(x) as a weighted sum of K Gaussians. It is the canonical example of a latent variable model: we observe x but not which Gaussian generated it.

The GMM as a Latent Variable Model

P(x; θ) = Σₖ₌₁ᴷ πₖ · N(x; μₖ, Σₖ) where Σₖ πₖ = 1, πₖ ≥ 0

The latent variable z ∈ {1,…,K} indicates which component generated x. The generative process: (1) sample z ~ Categorical(π), (2) sample x ~ N(μ_z, Σ_z). We observe only x; z is hidden.

Mixture Weights π
πₖ = P(Z=k): the prior probability of component k. Must be non-negative and sum to 1.
Responsibilities γ
γₙₖ = P(Z=k | xₙ; θ): posterior probability that xₙ came from component k. The 'soft cluster assignment'.

The Expectation-Maximisation Algorithm

We cannot directly maximise the log-likelihood of a GMM — the sum inside the log makes it intractable. EM solves this by alternating between two steps: E-step computes the expected value of the latent variables (the responsibilities), and M-step maximises the expected complete-data log-likelihood with respect to the parameters.

textEM for Gaussian Mixture Models (Pseudocode)
Initialise: π, μ₁…μₖ, Σ₁…Σₖ  (random or k-means init)

repeat until convergence:

  E-step  (compute responsibilities)
    for n = 1…N, k = 1…K:
      γₙₖ  ←  πₖ · N(xₙ; μₖ, Σₖ)  /  Σⱼ πⱼ · N(xₙ; μⱼ, Σⱼ)
      # γₙₖ = soft probability that xₙ belongs to component k

  M-step  (update parameters using weighted MLE)
    Nₖ  ←  Σₙ γₙₖ              # effective count for component k
    πₖ  ←  Nₖ / N               # new mixing weight
    μₖ  ←  Σₙ γₙₖ xₙ / Nₖ      # weighted mean
    Σₖ  ←  Σₙ γₙₖ (xₙ-μₖ)(xₙ-μₖ)ᵀ / Nₖ  # weighted covariance

until | log P(X; θᵗ) - log P(X; θᵗ⁻¹) | < ε
PythonGMM from scratch with EM
import numpy as np
from scipy.stats import multivariate_normal

class GaussianMixtureModel:
    def __init__(self, K=3, max_iter=100, tol=1e-6, random_state=0):
        self.K, self.max_iter, self.tol = K, max_iter, tol
        self.rng = np.random.default_rng(random_state)

    def fit(self, X):
        N, D = X.shape
        K   = self.K

        # --- Initialise with k-means++ style ---
        idx      = self.rng.choice(N, K, replace=False)
        self.mu  = X[idx].copy()
        self.sig = [np.eye(D) for _ in range(K)]
        self.pi  = np.ones(K) / K

        prev_ll = -np.inf
        for it in range(self.max_iter):
            # --- E-step: compute responsibilities ---
            log_r = np.zeros((N, K))
            for k in range(K):
                log_r[:, k] = np.log(self.pi[k] + 1e-300) + \
                              multivariate_normal.logpdf(X, self.mu[k], self.sig[k])
            # Normalise with log-sum-exp for stability
            log_r -= log_r.max(axis=1, keepdims=True)
            r     = np.exp(log_r)
            r    /= r.sum(axis=1, keepdims=True)

            # --- M-step: update parameters ---
            Nk = r.sum(axis=0)
            self.pi  = Nk / N
            self.mu  = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - self.mu[k]
                self.sig[k] = (r[:, k:k+1] * diff).T @ diff / Nk[k] + 1e-6*np.eye(D)

            # --- Check convergence via log-likelihood ---
            ll = self._log_likelihood(X)
            if abs(ll - prev_ll) < self.tol: break
            prev_ll = ll
        return self

    def _log_likelihood(self, X):  # sum log P(xₙ; θ) over all n
        log_pdfs = np.stack([np.log(self.pi[k] + 1e-300) +
                             multivariate_normal.logpdf(X, self.mu[k], self.sig[k])
                             for k in range(self.K)], axis=1)
        return np.logaddexp.reduce(log_pdfs, axis=1).sum()

    def predict(self, X):  # hard cluster assignment
        r = np.stack([self.pi[k] * multivariate_normal.pdf(X, self.mu[k], self.sig[k])
                      for k in range(self.K)], axis=1)
        return r.argmax(axis=1)

    def sample(self, n):  # generate new data from the learned model
        z = self.rng.choice(self.K, size=n, p=self.pi)
        return np.stack([self.rng.multivariate_normal(self.mu[k], self.sig[k]) for k in z])
PythonGMM evaluation: BIC and AIC model selection
import numpy as np
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=7)

def bic(gmm, X):  # Bayesian Information Criterion: lower = better
    n_params = gmm.K * (X.shape[1] + X.shape[1)**2 + 1)
    return -2 * gmm._log_likelihood(X) + n_params * np.log(len(X))

def aic(gmm, X):  # Akaike IC: penalises less than BIC
    n_params = gmm.K * (X.shape[1] + X.shape[1)**2 + 1)
    return -2 * gmm._log_likelihood(X) + 2 * n_params

print(f"{'K':>4} {'LL':>10} {'BIC':>10} {'AIC':>10}")
for K in range(1, 8):
    gmm = GaussianMixtureModel(K=K).fit(X_blobs)
    ll  = gmm._log_likelihood(X_blobs)
    print(f"{K:>4} {ll:>10.1f} {bic(gmm,X_blobs):>10.1f} {aic(gmm,X_blobs):>10.1f}")
# K=4 has the best (lowest) BIC and AIC -- correctly identifies 4 clusters
ML Connection: EM → VAEs and Diffusion Models
The EM algorithm structure — alternating between inferring latent variables (E) and optimising parameters (M) — generalises to deep learning.
VAEs replace the E-step with an encoder neural network q_φ(z|x) and the M-step with a decoder p_θ(x|z). The ELBO objective (Chapter 4) is precisely the expected complete-data log-likelihood that EM maximises.
Diffusion models can be viewed as a GMM with infinitely many components along a Markov chain, with EM steps replaced by a score-matching objective.
7.4

EM is guaranteed to never decrease the log-likelihood. This is not obvious from the algorithm description, but it follows directly from Jensen’s inequality (Chapter 4). Understanding this proof pays dividends across VAEs, RLHF, and any model with latent variables.

The Complete-Data Log-Likelihood

Define the complete-data log-likelihood as log P(X, Z; θ) where Z are the latent variables. We cannot maximise this directly (Z is unobserved), but we can take its expectation under the current posterior Q(Z) = P(Z|X; θᵒˡᵈ):

Q(θ; θᵒˡᵈ) = E_{Z~P(Z|X;θᵒˡᵈ)} [log P(X, Z; θ)] ← this is what M-step maximises

EM as Coordinate Ascent on the ELBO

For any distribution Q(Z), we can write:

log P(X; θ) = ELBO(Q, θ) + KL(Q(Z) ‖ P(Z|X; θ)) where ELBO = E_Q[log P(X,Z;θ)] - E_Q[log Q(Z)]

Since KL ≥ 0, the ELBO is always a lower bound on log P(X; θ). EM alternates:

E-step: maximise ELBO over Q (holding θ fixed). Setting Q = P(Z|X; θᵒˡᵈ) makes KL = 0, so ELBO = log P(X; θᵒˡᵈ).
M-step: maximise ELBO over θ (holding Q fixed). Since the bound is tight after E-step, any increase in ELBO also increases log P(X; θ).

Therefore log P(X; θ) is guaranteed to increase (or stay the same) at every EM iteration. This monotonic increase is what makes EM reliable even though it is not guaranteed to find the global maximum.

PythonTracking EM convergence
import numpy as np

# Prove EM never decreases log-likelihood
gmm_track = GaussianMixtureModel(K=3, max_iter=1)  # 1 step at a time
gmm_track.fit(X_blobs)
ll_history = []

# Run 50 EM iterations and record log-likelihood
for step in range(50):
    gmm_track.fit(X_blobs)  # 1 more EM step (single iteration)
    ll_history.append(gmm_track._log_likelihood(X_blobs))

# Verify monotonic increase
diffs = np.diff(ll_history)
print(f"Min LL increase per step: {diffs.min():.6f}")
print(f"All steps non-decreasing: {(diffs >= -1e-9).all()}")
# Min LL increase: -0.000000  (never decreases)
# All steps non-decreasing: True
7.5

A Hidden Markov Model (HMM) is a generative model for sequences. It models a sequence of observations x₁, x₂, …, xₜ as being generated by an underlying sequence of hidden states z₁, z₂, …, zₜ that follows a Markov chain. HMMs were the dominant method for speech recognition and NLP for three decades before being displaced by RNNs and Transformers.

The Three Components

ComponentSymbolDefinition
Initial distributionπP(z₁ = k): probability of starting in each state
Transition matrixAA[i,j] = P(zₜ = j | zₜ₋₁ = i): probability of moving from state i to j
Emission distributionBB[k](x) = P(xₜ = x | zₜ = k): probability of observation x in state k

The Three Fundamental HMM Problems

ProblemQuestionAlgorithm
EvaluationP(X | λ): what is the probability of this sequence?Forward algorithm (α-recursion)
Decodingz* = argmax P(Z | X, λ): most likely state sequence?Viterbi algorithm
Learningλ* = argmax_λ P(X | λ): fit HMM to data?Baum-Welch (EM for HMMs)
textForward Algorithm (Evaluation) (Pseudocode)
Input: observations x₁…xₜ, HMM parameters (π, A, B)
Output: P(X | λ) = P(x₁,…,xₜ | λ)

# Initialisation
for k = 1…K:
  α₁[k]  ←  π[k] · B[k](x₁)           # P(z₁=k, x₁)

# Recursion (dynamic programming, O(K²T))
for t = 2…T:
  for k = 1…K:
    αₜ[k]  ←  B[k](xₜ) · Σⱼ αₜ₋₁[j] · A[j,k]

# Termination
P(X|λ)  ←  Σₖ αₜ[k]
textViterbi Algorithm (Decoding) (Pseudocode)
Input: observations x₁…xₜ, HMM parameters (π, A, B)
Output: z* = argmax_Z P(Z, X | λ)

# Initialisation
for k = 1…K:
  δ₁[k]  ←  log π[k] + log B[k](x₁)
  ψ₁[k]  ←  0                    # no predecessor at t=1

# Recursion (same DP as Forward, but max instead of sum)
for t = 2…T, k = 1…K:
  δₜ[k]  ←  log B[k](xₜ) + max_j [ δₜ₋₁[j] + log A[j,k] ]
  ψₜ[k]  ←  argmax_j [ δₜ₋₁[j] + log A[j,k] ] # backpointer

# Backtrack from best final state
z*ₜ  ←  argmax_k δₜ[k]
for t = T-1 … 1:  z*ₜ ← ψₜ₊₁[z*ₜ₊₁]
PythonHMM from scratch: forward algorithm and Viterbi
import numpy as np

class HMM:
    def __init__(self, pi, A, B):
        # pi: (K,)  A: (K,K)  B: (K,V) emission probs (discrete)
        self.pi = np.log(pi + 1e-300)
        self.A  = np.log(A  + 1e-300)
        self.B  = np.log(B  + 1e-300)
        self.K  = len(pi)

    def forward(self, obs):  # log-domain for stability
        T = len(obs); K = self.K
        alpha = np.full((T, K), -np.inf)
        alpha[0] = self.pi + self.B[:, obs[0])
        for t in range(1, T):
            for k in range(K):
                alpha[t, k] = np.logaddexp.reduce(alpha[t-1] + self.A[:, k]) + self.B[k, obs[t]]
        # log P(obs | HMM) = logsumexp of final alpha row
        return np.logaddexp.reduce(alpha[-1])  # scalar log-likelihood

    def viterbi(self, obs):  # most likely state sequence
        T = len(obs); K = self.K
        delta = np.full((T, K), -np.inf)
        psi   = np.zeros((T, K), dtype=int)
        delta[0] = self.pi + self.B[:, obs[0])
        for t in range(1, T):
            trans = delta[t-1:, None] + self.A  # (K, K)
            psi[t]   = trans.argmax(axis=0)
            delta[t] = trans.max(axis=0) + self.B[:, obs[t]]
        path = np.zeros(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T-2, -1, -1):  path[t] = psi[t+1, path[t+1]]
        return path

# ---- Demo: 2-state HMM (fair/loaded die) ----
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[1/6]*6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])  # biased towards 6

hmm  = HMM(pi, A, B)
obs  = [5, 5, 5, 1, 2, 5, 5]  # lots of 6s (0-indexed as 5)
print(f"Log P(obs): {hmm.forward(obs):.3f}")
print(f"State seq:  {hmm.viterbi(obs)}")
# Log P(obs): -10.82
# State seq:  [1 1 1 0 0 1 1]  -- mostly state 1 (loaded die) as expected

Baum-Welch: Learning HMM Parameters

Learning HMM parameters from data is EM applied to the HMM. The E-step computes the forward-backward variables α and β; the M-step updates π, A, and B using expected transition and emission counts.

ML Connection: HMMs to RNNs to Transformers
HMMs and RNNs both model sequential data by maintaining a state. The key difference: HMM states are discrete and finite (K states total); RNN states are continuous vectors in ℝᴴ, enabling far richer representations.
The attention mechanism in Transformers can be viewed as a soft version of HMM decoding: instead of the Viterbi hard assignment to one state, attention computes a weighted sum over all previous tokens, with weights playing the role of posterior state probabilities.
7.6

All models in this chapter can be described as probabilistic graphical models (PGMs): graphs where nodes are random variables and edges encode dependencies. Understanding the two main families — Bayesian networks (directed) and Markov Random Fields (undirected) — provides a unified language for generative modelling.

FamilyEdgesFactorisationExamples
Bayesian NetworksDirected (DAG)P(X) = ∏ P(Xᵢ | parents(Xᵢ))Naïve Bayes, GMM, HMM, VAE
Markov Random FieldsUndirectedP(X) ∝ ∏ ψ_c(X_c) (clique potentials)CRF, Boltzmann Machine
Factor GraphsBipartiteP(X) ∝ ∏ fₐ(X_∂a)Sum-product algorithm

The d-separation rules in Bayesian networks tell us exactly which variables are conditionally independent given which others — without computing any probabilities. The HMM is a Bayesian network where z₁ → z₂ → … → zₜ forms a chain and each zₜ → xₜ is an emission.

PGMs and Modern Deep Learning
Modern deep learning has largely moved away from explicit PGM inference in favour of end-to-end neural network training. But PGMs remain the conceptual backbone: a Transformer is a (very deep, learnable) Bayesian network; RLHF uses a reward model as a potential function; diffusion models are PGMs over a Markov chain.
The formal tools of PGMs — conditional independence, d-separation, variable elimination — are still essential for understanding what a neural architecture can and cannot represent.
7.7

A generative model that has learned P(X) can directly answer: is this new point x likely under the distribution I learned? Points with very low P(x) are anomalies. This use case is entirely unavailable to discriminative models.

Kernel Density Estimation

KDE is a non-parametric density estimator: place a kernel (e.g., Gaussian) at each training point, then add them up. The bandwidth h controls smoothness — analogous to the variance of the kernel.

P̂(x) = (1/N) Σᵢ K_h(x - xᵢ) K_h(u) = (1/h) K(u/h)
PythonKDE density estimation and anomaly detection
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Fit KDE on normal data; flag low-density points as anomalies
np.random.seed(0)
X_normal  = np.random.randn(500, 2)
X_anomaly = np.random.uniform(-5, 5, (20, 2))
X_all     = np.vstack([X_normal, X_anomaly])

# Select bandwidth via cross-validation
params = {'bandwidth': np.logspace(-1, 1, 10)}
grid   = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(X_normal)
best_bw = grid.best_params_['bandwidth']
print(f"Best bandwidth: {best_bw:.3f}")

kde = KernelDensity(kernel='gaussian', bandwidth=best_bw).fit(X_normal)

# Log-density scores; low score = anomaly
log_density = kde.score_samples(X_all)  # (520,)
threshold   = np.percentile(log_density[:500], 5)  # 5th percentile of normal
flagged     = log_density < threshold

print(f"Normal points flagged: {flagged[:500].sum()} / 500 ({flagged[:500].mean():.1%})")
print(f"Anomalies detected:    {flagged[500:].sum()} / 20  ({flagged[500:].mean():.1%})")
# Normal points flagged: 25 / 500  (5.0%  -- calibrated as expected)
# Anomalies detected:    18 / 20   (90.0% -- correctly flagging outliers)
7.8

Evaluating generative models is harder than evaluating discriminative ones. A generative model that achieves high log-likelihood on held-out data is doing well at modelling the distribution — but high log-likelihood does not always correlate with human-perceived quality in tasks like image or text generation.

MetricFormulaMeasuresLimitation
Log-likelihoodlog P(X_test; θ) = Σ log P(xₙ; θ)Density at test pointsRequires tractable P(x)
PerplexityPP = exp(-log P(X)/T)Avg. surprise per tokenSame as log-lik (exponential)
BIC−2 log L + k log NFit penalised by complexityAssumes Gaussian errors
AIC−2 log L + 2kSame but lighter penaltyCan overfit with large N
FID (images)‖μ_r−μ_g‖² + Tr(Σ_r+Σ_g−2√Σ_rΣ_g)Distribution distanceRequires feature extractor
Held-out classificationProbe accuracy on downstream taskRepresentation qualityTask-specific
⚠️
The Evaluation Gap
A language model can achieve excellent held-out perplexity while generating text that humans find incoherent or repetitive. A diffusion model can have low FID while producing images with subtle artifacts. These gaps motivate human evaluations, red-teaming, and task-based benchmarks.
For classical generative models (GMM, HMM, NB), log-likelihood is usually sufficient. For modern deep generative models (GPT, diffusion), automatic metrics are necessary but not sufficient.
7.9

The asymptotic performance of discriminative models is generally better than generative models on classification tasks with plentiful labelled data (Andrew Ng & Michael Jordan, 2001). But 'better' depends on your N, your task, and what questions you need to answer.

The Ng-Jordan Result

For Naïve Bayes vs. Logistic Regression on the same features, logistic regression converges to a lower classification error as N → ∞. But Naïve Bayes reaches a good classifier faster with small N. There exists a crossover point N* where LR overtakes NB. For typical text problems N* ≈ 30 documents per class.

PythonEmpirical comparison: NB vs. LR as N grows
import numpy as np
from sklearn.naive_bayes import MultinomialNB as SKLearnNB
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

ng = fetch_20newsgroups(subset='all', categories=['sci.med', 'rec.sport.hockey'])
vec = CountVectorizer(max_features=5000)
X   = vec.fit_transform(ng.data).toarray()
y   = ng.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

print(f"{'N_train':>8} {'NB acc':>10} {'LR acc':>10}")
for n in [10, 30, 100, 300, 1000, len(X_tr)]:
    Xs = X_tr[:n]; ys = y_tr[:n]
    nb = SKLearnNB().fit(Xs, ys)
    lr = LogisticRegression(max_iter=1000).fit(Xs, ys)
    print(f"{n:>8} {nb.score(X_te,y_te):>10.3f} {lr.score(X_te,y_te):>10.3f}")
# N_train       NB acc     LR acc
#       10       0.697      0.612  <-- NB wins with tiny data
#       30       0.820      0.763
#      100       0.920      0.910
#      300       0.961      0.965  <-- LR overtakes around here
#     1000       0.972      0.985  <-- LR consistently better

Decision Framework

ConditionRecommended approachWhy
Large labelled dataset (N > N*)Discriminative (LR, GBT, NN)Asymptotically better boundary
Small labelled datasetGenerative (NB, GMM) or semi-supervisedUses P(X) structure
Need to generate new samplesGenerative (GMM, VAE, LLM)Discriminative cannot generate
Anomaly detection without labelsGenerative (GMM, KDE)Needs P(X) to flag low-density
Missing features at test timeGenerative (marginalise)Discriminative needs complete X
Text classification at scalePre-trained LM + fine-tuneGenerative pre-train + disc fine-tune
7.10

Semi-supervised learning combines a small labelled set with a large unlabelled set. The generative model P(X) is learned from all data (labelled + unlabelled); the discriminative classifier P(Y|X) benefits from the richer representation of the data distribution.

The EM Semi-supervised Recipe

EM naturally extends to semi-supervised learning. Unlabelled examples contribute to the M-step parameter estimates via their expected labels from the E-step — exactly like the soft cluster assignments in the GMM.

textSemi-supervised EM for Naïve Bayes (Pseudocode)
Initialise: fit Naïve Bayes on labelled data only → θ₀

repeat until convergence:

  E-step: for each unlabelled xₙ:
    γₙₖ ← P(Y=k | xₙ; θ)
    # soft label assignment using current model

  M-step: refit NB using:
    Labelled data with hard labels
    Unlabelled data with soft labels γₙₖ
    # unlabelled data improves feature estimates
PythonSemi-supervised Naïve Bayes
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.naive_bayes import MultinomialNB as SKLearnNB
import numpy as np

# Simulate semi-supervised scenario: only 50 labelled, 1000 unlabelled
y_semi = y_tr.copy()
unlabelled_mask = np.ones(len(y_tr), dtype=bool)
unlabelled_mask[np.random.choice(len(y_tr), 50, replace=False)] = False
y_semi[unlabelled_mask] = -1  # -1 = unlabelled in sklearn

# Supervised only (50 labels)
sup_only = SKLearnNB().fit(X_tr[~unlabelled_mask], y_tr[~unlabelled_mask])
print(f"Supervised only  (50 labels): {sup_only.score(X_te, y_te):.3f}")

# Self-training: uses unlabelled data via iterative label propagation
st = SelfTrainingClassifier(SKLearnNB(), max_iter=10)
st.fit(X_tr, y_semi)  # y_semi has -1 for unlabelled
print(f"Semi-supervised (50+1000):    {st.score(X_te, y_te):.3f}")
# Supervised only  (50 labels): 0.812
# Semi-supervised (50+1000):    0.921  (unlabelled data helps significantly)
7.11

Model Quick-Reference

ModelP(X) formTrainingKey assumption
Naïve Bayes∏ⱼ P(xⱼ | Y=k)MLE + Laplace smoothingConditional independence
GMMΣₖ πₖ N(x; μₖ, Σₖ)EM algorithmGaussian components
HMMΣ_Z P(Z; A,π) P(X|Z; B)Baum-Welch (EM)First-order Markov
KDE(1/N) Σᵢ K_h(x - xᵢ)Non-parametric (direct)IID samples
LDADirichlet-Multinomial mixtureVariational EMBag-of-words topics

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

Exercise 1: Pen & Paper
Derive the MLE for the parameters of a Bernoulli Naïve Bayes classifier. Show that θⱼₖ = (# docs in class k containing word j) / (# docs in class k).
Exercise 2: Pen & Paper
Prove that the Naïve Bayes decision boundary is linear in log-space. Under what condition on the class-conditional variances does Gaussian NB produce a linear boundary?
Exercise 3: Pen & Paper
Write out the EM update equations for a 1D GMM (univariate Gaussians) and verify that the M-step for μₖ is the weighted sample mean.
Exercise 4: Pen & Paper
Prove that the log-likelihood is non-decreasing under EM using the decomposition log P(X;θ) = ELBO + KL(Q ‖ P(Z|X;θ)) and the properties of KL divergence.
Exercise 5: Pen & Paper
The forward algorithm has time complexity O(K²T). Show that computing P(X|λ) by enumerating all 2ᵀ state sequences is exponential. Why does the Markov assumption enable the dynamic programming recursion?
Exercise 6: Pen & Paper
In the Viterbi algorithm, why do we use max instead of sum (as in the forward algorithm)? What would happen if you used sum in Viterbi?
Exercise 7: Pen & Paper
Show that a GMM with K components can approximate any continuous density P(x) arbitrarily well as K → ∞. (Hint: every continuous density can be approximated by a mixture of Gaussians — this is the universal approximation property of GMMs.)
Exercise 8: Pen & Paper
Under what conditions is Logistic Regression the optimal classifier given a Gaussian NB data-generating process? (Hint: look up the Gaussian NB → LR reduction by Ng & Jordan, 2001.)
Exercise 9: Pen & Paper
A 3-state HMM has transition matrix A = [[0.7,0.2,0.1],[0.1,0.8,0.1],[0.2,0.1,0.7]]. What is the stationary distribution π∞? (Solve Aᵀπ∞ = π∞.)
Exercise 10: Pen & Paper
Derive BIC from the Laplace approximation to the Bayesian evidence. Why does the penalty grow logarithmically with N rather than linearly?
Exercise 11: Code
Implement Multinomial NB from scratch, including Laplace smoothing. Compare to sklearn's MultinomialNB on 20 Newsgroups with α ∈ {0.01, 0.1, 1.0, 10.0}.
Exercise 12: Code
Implement the E-step and M-step of the GMM EM algorithm from scratch. Visualise convergence on a 2D dataset with 3 clearly separated clusters.
Exercise 13: Code Lab
Select the number of GMM components using BIC on the wine dataset (13 features). Plot BIC vs. K for K=1…10. Does BIC correctly identify the true number of wine categories?
Exercise 14: Code
Implement the forward algorithm in log-domain and verify it gives the same result as the naive product-of-probabilities for short sequences (T=10). For long sequences (T=1000), show the naive version underflows.
Exercise 15: Code
Implement Viterbi decoding. Generate sequences from a known 2-state HMM. Compute Viterbi accuracy at recovering the true hidden state sequence as T varies from 10 to 1000.
Exercise 16: Code Lab
Reproduce the Ng-Jordan convergence experiment: plot NB vs. LR accuracy as a function of N on 20 Newsgroups. Find the empirical crossover N*.
Exercise 17: Code
Semi-supervised experiment: train Gaussian NB with 10, 30, 100, 300 labelled examples from the iris dataset. Add unlabelled data via self-training (sklearn). Plot accuracy vs. labelled N for both supervised and semi-supervised settings.
Exercise 18: Code (Challenge)
Implement the Baum-Welch algorithm (EM for HMMs) from scratch. Generate 200 sequences from a known 3-state HMM. Fit a 3-state HMM using Baum-Welch. Compare the learned parameters to the ground truth (up to state permutation).

Further reading: “Machine Learning: A Probabilistic Perspective” (Murphy, 2012) Chapters 3, 11, 17 — the definitive modern treatment of Naïve Bayes, GMMs, and HMMs. “Discriminative vs. Generative Classifiers” (Ng & Jordan, 2001) — the original convergence analysis. The scikit-learn documentation on GaussianMixture and hmmlearn for practical implementations.


Next → Chapter 8: Embeddings & Representation Learning

Chapter 8 bridges classical and deep generative models. Word2Vec and GloVe learn dense vector representations of words by modelling the distribution of word co-occurrences — a generative objective. The resulting embedding spaces have remarkable geometric structure: vector arithmetic captures analogies, and cosine similarity captures semantic relatedness. These ideas directly underpin the embedding layers in every Transformer.

18 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →