Solutions Appendix
Chapter 7

Generative Models

18 Solutions

Detailed solutions for the exercises in Chapter 7. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Derive the MLE for Bernoulli Naive Bayes: θ_jk = (#class-k docs with word j)/(#class-k docs).

Solution

For each class k and word j, the presence of word j is Bernoulli(θ_jk). The log-likelihood over class-k documents is Σ[x_j logθ_jk + (1−x_j)log(1−θ_jk)]. Setting its derivative to zero (as in the Bernoulli MLE of Chapter 3) gives θ_jk = (number of class-k docs containing word j)/(number of class-k docs) — simply the within-class frequency of the word.

Exercise 2Pen & Paper
Prove the Naive Bayes boundary is linear in log-space. When is Gaussian NB linear?

Solution

The decision compares Σ log P(x_j|k) + log P(k) across classes. For Bernoulli/multinomial features these log-likelihoods are linear in the feature counts, so the log-posterior difference between two classes is a linear function of x — a linear boundary. Gaussian NB is linear only when the class-conditional variances are equal across classes (shared covariance): then the quadratic x² terms cancel in the log-ratio, leaving a linear boundary; unequal variances give a quadratic boundary.

Exercise 3Pen & Paper
Write the EM updates for a 1D GMM and verify the M-step for μ_k is the weighted sample mean.

Solution

E-step: responsibilities γ_{ik} = π_k N(x_i|μ_k,σ_k²) / Σ_j π_j N(x_i|μ_j,σ_j²). M-step: maximizing the expected complete-data log-likelihood gives μ_k = Σ_i γ_{ik} x_i / Σ_i γ_{ik} (the responsibility-weighted mean), σ_k² = Σ_i γ_{ik}(x_i−μ_k)² / Σ_i γ_{ik}, and π_k = (1/N)Σ_i γ_{ik}. The μ_k update is exactly the weighted sample mean, with weights equal to each point's soft assignment to component k.

Exercise 4Pen & Paper
Prove EM's log-likelihood is non-decreasing using log P(X;θ) = ELBO + KL(Q‖P(Z|X;θ)).

Solution

The E-step sets Q = P(Z|X;θ_old), making KL = 0 so ELBO = log P(X;θ_old). The M-step then increases the ELBO by maximizing it over θ. Since log P(X;θ) = ELBO + KL ≥ ELBO (because KL ≥ 0), the new log-likelihood is at least the increased ELBO, which is at least the old log-likelihood. Hence log P(X;θ) never decreases — EM monotonically climbs the likelihood (to a local optimum).

Exercise 5Pen & Paper
Forward algorithm is O(K²T); enumerating sequences is exponential. Why does the Markov assumption enable DP?

Solution

Naively, P(X|λ) = Σ over all K^T (here 2^T) state paths — exponential. The Markov assumption means the next state depends only on the current one, so the probability of being in state j at time t (the forward variable α_t(j)) can be computed from the K values α_{t−1}(·) in O(K) work, giving O(K²) per step and O(K²T) overall. The DP works because the past is summarized entirely by the current state distribution — no need to track full histories.

Exercise 6Pen & Paper
In Viterbi, why max instead of sum? What if you used sum?

Solution

Viterbi finds the single most likely state sequence, so it propagates the maximum-probability path to each state (max). The forward algorithm computes the total probability of the observations by summing over all paths (sum). If you used sum in Viterbi you would compute the observation likelihood (the forward quantity), not the best path, and the backpointer-based decoding would be meaningless — you'd lose the argmax path you set out to recover.

Exercise 7Pen & Paper
Show a GMM can approximate any continuous density as K→∞.

Solution

Any continuous density can be approximated arbitrarily well by placing many narrow Gaussians along it: as the number of components K→∞ and their widths shrink, the mixture can match the target density to any desired accuracy (Gaussians form a dense basis for smooth densities, analogous to how kernels reconstruct functions). This universal-approximation property is why GMMs are flexible density estimators, though finite K and EM's local optima limit them in practice.

Exercise 8Pen & Paper
When is logistic regression optimal given a Gaussian-NB generative process (Ng & Jordan)?

Solution

When the Gaussian-NB assumptions hold (class-conditional features Gaussian and conditionally independent, with shared covariance), the implied P(y|x) has exactly the logistic-regression form. In that case LR (the discriminative model of the same family) is asymptotically at least as accurate — it converges to a lower-bias solution. Ng & Jordan showed the trade-off: NB (generative) reaches its (higher) asymptotic error faster with less data, while LR (discriminative) needs more data but ultimately achieves lower error when the model is correct.

Exercise 9Pen & Paper
3-state HMM with given A: find the stationary distribution π∞ (solve Aᵀπ=π).

Solution

Solving π = πA with Σπ = 1 for A = [[.7,.2,.1],[.1,.8,.1],[.2,.1,.7]] gives the balance equations; substituting yields π₁ = 1.25π₃ and π₂ = 1.75π₃. Normalizing (1.25+1.75+1)π₃ = 1 → π₃ = 0.25.

π∞ ≈ [0.3125, 0.4375, 0.25]

This is the long-run fraction of time the chain spends in each state — the left eigenvector of A for eigenvalue 1.

Exercise 10Pen & Paper
Derive BIC from the Laplace approximation to the Bayesian evidence. Why log N, not linear in N?

Solution

The Laplace approximation expands the log-evidence around the MAP estimate θ̂: log P(D) ≈ log P(D|θ̂) + log P(θ̂) + (k/2)log(2π) − ½ log|H|, where H is the Hessian (≈ N times the per-sample information). Since |H| ∝ N^k, the −½ log|H| term contributes −(k/2)log N. Dropping O(1) terms gives BIC = log P(D|θ̂) − (k/2)log N. The penalty is logarithmic in N because each parameter's posterior uncertainty shrinks like 1/√N, so the 'volume' penalty per parameter grows only as ½log N.

Exercise 11Code
Implement Multinomial NB with Laplace smoothing; compare to sklearn on 20 Newsgroups across α.

Solution

Estimate P(word|class) = (count + α)/(class_total + α·V) and classify by the summed log-probabilities plus log prior. Matching sklearn's MultinomialNB, accuracy varies with α: very small α overfits rare words, large α over-smooths; a moderate α (around 0.1–1.0) is usually best on 20 Newsgroups.

Exercise 12Code
Implement GMM E-step and M-step from scratch; visualize convergence on 3 separated clusters.

Solution

Alternate computing responsibilities (E-step) and updating π,μ,Σ as responsibility-weighted statistics (M-step). On three well-separated clusters the ellipses snap to the clusters within a handful of iterations and the log-likelihood increases monotonically (Exercise 4), confirming convergence to the correct components.

Exercise 13Code Lab
Select GMM components by BIC on the wine dataset; plot BIC vs K=1..10.

Solution

Fit GMMs for K=1..10 and compute BIC = −2·logL + k·logN. The BIC curve typically dips to a minimum near the true number of categories (3 for wine) then rises as extra components add parameters without improving the fit — BIC's logN penalty penalizing complexity. The minimum often recovers the true K, illustrating principled model selection.

Exercise 14Code
Implement the forward algorithm in log-domain; show the naive product underflows for T=1000.

Solution

Working in log-space with log-sum-exp (Chapter 5) keeps the forward probabilities representable for long sequences. For T=10 the log-domain and naive product agree; for T=1000 the naive product of thousands of sub-1 probabilities underflows to 0 (losing all information), while the log-domain version returns a finite, correct log-likelihood — a concrete payoff of numerical-stability techniques.

Exercise 15Code
Implement Viterbi; measure accuracy at recovering the true hidden states as T grows 10→1000.

Solution

Viterbi with backpointers recovers the most likely state path. As T grows, more observations constrain the path, so per-state recovery accuracy generally improves and stabilizes (subject to the HMM's inherent ambiguity). This shows decoding benefits from longer context, up to the noise floor set by overlapping emission distributions.

Exercise 16Code Lab
Reproduce Ng–Jordan: NB vs LR accuracy vs N on 20 Newsgroups; find the crossover N*.

Solution

Plotting accuracy against training-set size shows NB ahead for small N (it reaches its asymptote quickly) and LR overtaking beyond a crossover N* (it keeps improving toward a lower-bias solution). The empirical N* is the dataset point where the discriminative model's lower asymptotic error wins out — reproducing the generative-vs-discriminative trade-off.

Exercise 17Code
Semi-supervised: Gaussian NB on iris with 10/30/100/300 labels; add self-training; plot accuracy.

Solution

Self-training labels confident unlabelled points and retrains. With very few labels, adding unlabelled data via self-training noticeably boosts accuracy; as labelled N grows the gap shrinks (the supervised model already has enough signal). The plot shows the largest semi-supervised gains in the low-label regime — where exploiting unlabelled structure matters most.

Exercise 18Code (Challenge)
Implement Baum–Welch (EM for HMMs) from scratch; fit a 3-state HMM; compare to ground truth.

Solution

Baum–Welch alternates the forward–backward E-step (computing state and transition posteriors γ and ξ) with M-step re-estimation of π, A, and emissions as expected-count ratios. Fit on 200 generated sequences, the recovered parameters match the ground truth up to a permutation of the (unidentified) state labels, with the log-likelihood increasing each iteration — EM for HMMs in action.