Generative Models
Detailed solutions for the exercises in Chapter 7. Try solving them yourself before checking the answers.
Solution
For each class k and word j, the presence of word j is Bernoulli(θ_jk). The log-likelihood over class-k documents is Σ[x_j logθ_jk + (1−x_j)log(1−θ_jk)]. Setting its derivative to zero (as in the Bernoulli MLE of Chapter 3) gives θ_jk = (number of class-k docs containing word j)/(number of class-k docs) — simply the within-class frequency of the word.
Solution
The decision compares Σ log P(x_j|k) + log P(k) across classes. For Bernoulli/multinomial features these log-likelihoods are linear in the feature counts, so the log-posterior difference between two classes is a linear function of x — a linear boundary. Gaussian NB is linear only when the class-conditional variances are equal across classes (shared covariance): then the quadratic x² terms cancel in the log-ratio, leaving a linear boundary; unequal variances give a quadratic boundary.
Solution
E-step: responsibilities γ_{ik} = π_k N(x_i|μ_k,σ_k²) / Σ_j π_j N(x_i|μ_j,σ_j²). M-step: maximizing the expected complete-data log-likelihood gives μ_k = Σ_i γ_{ik} x_i / Σ_i γ_{ik} (the responsibility-weighted mean), σ_k² = Σ_i γ_{ik}(x_i−μ_k)² / Σ_i γ_{ik}, and π_k = (1/N)Σ_i γ_{ik}. The μ_k update is exactly the weighted sample mean, with weights equal to each point's soft assignment to component k.
Solution
The E-step sets Q = P(Z|X;θ_old), making KL = 0 so ELBO = log P(X;θ_old). The M-step then increases the ELBO by maximizing it over θ. Since log P(X;θ) = ELBO + KL ≥ ELBO (because KL ≥ 0), the new log-likelihood is at least the increased ELBO, which is at least the old log-likelihood. Hence log P(X;θ) never decreases — EM monotonically climbs the likelihood (to a local optimum).
Solution
Naively, P(X|λ) = Σ over all K^T (here 2^T) state paths — exponential. The Markov assumption means the next state depends only on the current one, so the probability of being in state j at time t (the forward variable α_t(j)) can be computed from the K values α_{t−1}(·) in O(K) work, giving O(K²) per step and O(K²T) overall. The DP works because the past is summarized entirely by the current state distribution — no need to track full histories.
Solution
Viterbi finds the single most likely state sequence, so it propagates the maximum-probability path to each state (max). The forward algorithm computes the total probability of the observations by summing over all paths (sum). If you used sum in Viterbi you would compute the observation likelihood (the forward quantity), not the best path, and the backpointer-based decoding would be meaningless — you'd lose the argmax path you set out to recover.
Solution
Any continuous density can be approximated arbitrarily well by placing many narrow Gaussians along it: as the number of components K→∞ and their widths shrink, the mixture can match the target density to any desired accuracy (Gaussians form a dense basis for smooth densities, analogous to how kernels reconstruct functions). This universal-approximation property is why GMMs are flexible density estimators, though finite K and EM's local optima limit them in practice.
Solution
When the Gaussian-NB assumptions hold (class-conditional features Gaussian and conditionally independent, with shared covariance), the implied P(y|x) has exactly the logistic-regression form. In that case LR (the discriminative model of the same family) is asymptotically at least as accurate — it converges to a lower-bias solution. Ng & Jordan showed the trade-off: NB (generative) reaches its (higher) asymptotic error faster with less data, while LR (discriminative) needs more data but ultimately achieves lower error when the model is correct.
Solution
Solving π = πA with Σπ = 1 for A = [[.7,.2,.1],[.1,.8,.1],[.2,.1,.7]] gives the balance equations; substituting yields π₁ = 1.25π₃ and π₂ = 1.75π₃. Normalizing (1.25+1.75+1)π₃ = 1 → π₃ = 0.25.
This is the long-run fraction of time the chain spends in each state — the left eigenvector of A for eigenvalue 1.
Solution
The Laplace approximation expands the log-evidence around the MAP estimate θ̂: log P(D) ≈ log P(D|θ̂) + log P(θ̂) + (k/2)log(2π) − ½ log|H|, where H is the Hessian (≈ N times the per-sample information). Since |H| ∝ N^k, the −½ log|H| term contributes −(k/2)log N. Dropping O(1) terms gives BIC = log P(D|θ̂) − (k/2)log N. The penalty is logarithmic in N because each parameter's posterior uncertainty shrinks like 1/√N, so the 'volume' penalty per parameter grows only as ½log N.
Solution
Estimate P(word|class) = (count + α)/(class_total + α·V) and classify by the summed log-probabilities plus log prior. Matching sklearn's MultinomialNB, accuracy varies with α: very small α overfits rare words, large α over-smooths; a moderate α (around 0.1–1.0) is usually best on 20 Newsgroups.
Solution
Alternate computing responsibilities (E-step) and updating π,μ,Σ as responsibility-weighted statistics (M-step). On three well-separated clusters the ellipses snap to the clusters within a handful of iterations and the log-likelihood increases monotonically (Exercise 4), confirming convergence to the correct components.
Solution
Fit GMMs for K=1..10 and compute BIC = −2·logL + k·logN. The BIC curve typically dips to a minimum near the true number of categories (3 for wine) then rises as extra components add parameters without improving the fit — BIC's logN penalty penalizing complexity. The minimum often recovers the true K, illustrating principled model selection.
Solution
Working in log-space with log-sum-exp (Chapter 5) keeps the forward probabilities representable for long sequences. For T=10 the log-domain and naive product agree; for T=1000 the naive product of thousands of sub-1 probabilities underflows to 0 (losing all information), while the log-domain version returns a finite, correct log-likelihood — a concrete payoff of numerical-stability techniques.
Solution
Viterbi with backpointers recovers the most likely state path. As T grows, more observations constrain the path, so per-state recovery accuracy generally improves and stabilizes (subject to the HMM's inherent ambiguity). This shows decoding benefits from longer context, up to the noise floor set by overlapping emission distributions.
Solution
Plotting accuracy against training-set size shows NB ahead for small N (it reaches its asymptote quickly) and LR overtaking beyond a crossover N* (it keeps improving toward a lower-bias solution). The empirical N* is the dataset point where the discriminative model's lower asymptotic error wins out — reproducing the generative-vs-discriminative trade-off.
Solution
Self-training labels confident unlabelled points and retrains. With very few labels, adding unlabelled data via self-training noticeably boosts accuracy; as labelled N grows the gap shrinks (the supervised model already has enough signal). The plot shows the largest semi-supervised gains in the low-label regime — where exploiting unlabelled structure matters most.
Solution
Baum–Welch alternates the forward–backward E-step (computing state and transition posteriors γ and ξ) with M-step re-estimation of π, A, and emissions as expected-count ratios. Fit on 200 generated sequences, the recovered parameters match the ground truth up to a permutation of the (unidentified) state labels, with the log-likelihood increasing each iteration — EM for HMMs in action.