Solutions Appendix
Chapter 3

Probability & Statistics

20 Solutions

Detailed solutions for the exercises in Chapter 3. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Bag of 3 red, 7 blue; draw 2 without replacement. P(both red)?

Solution

P(both red) = (3/10)·(2/9) = 6/90 = 1/15 ≈ 0.0667. The second factor uses 2 reds left of 9 balls after one red is drawn (no replacement).

Exercise 2Pen & Paper
Spam filter: sensitivity 95%, specificity 99%, base rate 2%. Given a positive, P(spam)?

Solution

By Bayes: P(+) = 0.02·0.95 + 0.98·0.01 = 0.019 + 0.0098 = 0.0288. So P(spam|+) = 0.019/0.0288 ≈ 0.66. Even a 95%/99% test gives only ~66% confidence because spam is rare — the base-rate effect that makes false positives dominate for rare classes.

Exercise 3Pen & Paper
Prove the Bernoulli MLE is p̂ = k/n.

Solution

The log-likelihood for k successes in n trials is ℓ(p) = k·log p + (n−k)·log(1−p). Setting ℓ'(p) = k/p − (n−k)/(1−p) = 0 gives k(1−p) = (n−k)p, i.e. k = np, so p̂ = k/n. The MLE is simply the observed success frequency.

Exercise 4Pen & Paper
Show L2 regularization equals MAP with a Gaussian prior N(0, 1/(2λ)I) on θ.

Solution

MAP maximizes log p(θ|data) = log p(data|θ) + log p(θ). For a Gaussian prior N(0,σ²I), −log p(θ) = ‖θ‖²/(2σ²) + const. Setting σ² = 1/(2λ) makes this term λ‖θ‖² — exactly L2 weight decay. So L2 regularization is MAP estimation under a zero-mean Gaussian prior; stronger regularization ⇔ tighter prior.

Exercise 5Pen & Paper
Compute H([0.5,0.5]) and H([0.9,0.1]) in bits. Which is higher? Why?

Solution

H([0.5,0.5]) = −2·0.5·log₂ 0.5 = 1 bit. H([0.9,0.1]) = −0.9log₂ 0.9 − 0.1log₂ 0.1 ≈ 0.137 + 0.332 = 0.469 bits. The uniform distribution has higher entropy because it is maximally uncertain; the skewed one is more predictable, carrying less information per sample.

Exercise 6Pen & Paper
Compute KL(P‖Q), P=[0.4,0.6], Q=[0.5,0.5]; verify KL≥0.

Solution

KL = 0.4·log₂(0.4/0.5) + 0.6·log₂(0.6/0.5) = 0.4(−0.322) + 0.6(0.263) = −0.129 + 0.158 = 0.029 bits ≥ 0. The small positive value reflects the mild mismatch between P and Q; KL is zero only when P = Q.

Exercise 7Pen & Paper
Verify H(P,Q) = H(P) + KL(P‖Q) algebraically.

Solution

KL(P‖Q) = Σ p log(p/q) = Σ p log p − Σ p log q. Adding H(P) = −Σ p log p cancels the first term, leaving −Σ p log q = H(P,Q). So cross-entropy = entropy + KL: the cost of coding P with Q's code is P's own entropy plus the penalty for the wrong code.

Exercise 8Pen & Paper
A model assigns 0.01 to the true token. Contribution to perplexity?

Solution

The per-token negative log-likelihood is −log(0.01). Perplexity is the geometric-mean inverse probability, so this token contributes a factor of 1/0.01 = 100 to the perplexity (equivalently NLL = 4.605 nats). A confidently wrong token sharply inflates perplexity.

Exercise 9Pen & Paper
Why does sampling with T→0 converge to greedy? Show softmax(logits/T) as T→0.

Solution

As T→0, dividing logits by T amplifies differences without bound, so the largest logit's exponential dominates the denominator and its probability → 1 while all others → 0. The distribution collapses onto argmax — exactly greedy decoding. Conversely T→∞ flattens toward uniform.

Exercise 10Pen & Paper
Why is D_KL(P‖Q) ≠ D_KL(Q‖P)? Give an ML scenario where the asymmetry matters.

Solution

Forward KL Σ p log(p/q) weights the mismatch by P; reverse KL weights by Q. Forward KL is 'mean-seeking' (penalizes Q putting low mass where P has mass — it covers all modes), used in maximum-likelihood training. Reverse KL is 'mode-seeking' (penalizes Q putting mass where P has none), used in variational inference and some RL objectives. Choosing the wrong direction changes which behaviour you get.

Exercise 11Pen & Paper
Bernoulli(p) variance is p(1−p). Where is it maximized? Minimized?

Solution

d/dp[p(1−p)] = 1−2p = 0 → p = 0.5, giving maximum variance 0.25. Variance is minimized (= 0) at p = 0 or p = 1, where the outcome is certain. Uncertainty, like entropy, peaks at the balanced point.

Exercise 12Pen & Paper
Derive ∂CE/∂logits = p_i − y_i for softmax cross-entropy.

Solution

With p = softmax(z) and CE = −Σ_j y_j log p_j, the softmax Jacobian ∂p_j/∂z_i = p_j(δ_{ij}−p_i). Substituting and summing (using Σ y_j = 1) gives ∂CE/∂z_i = p_i − y_i. This clean form is why softmax + cross-entropy is the standard classification head: the gradient is just predicted minus target.

Exercise 13Code
Build a unigram LM on a text file; compute held-out perplexity; compare to random.

Solution

Estimate token probabilities by frequency, then perplexity = exp(mean NLL) on held-out text. A unigram model scores far better than a uniform random model (perplexity ≈ vocab size) because word frequencies are highly non-uniform, but far worse than models using context.

Exercise 14Code
Bigram LM from conditional counts; generate 50 tokens autoregressively.

Solution

Estimate P(w_t | w_{t−1}) from adjacent-pair counts (with smoothing for unseen pairs), then sample the next token from that conditional repeatedly. The output is locally plausible but quickly loses coherence — the limit of a 1-token context window.

Exercise 15Code Lab 3.1
Compare greedy, T=0.7, top-k=40, top-p=0.9 on GPT-2; measure distinct-1/2.

Solution

Greedy is deterministic and most repetitive (lowest distinct-n). Raising temperature and using top-k/top-p increases diversity (higher distinct-1, distinct-2) at some cost to coherence. top-p (nucleus) adapts the cutoff to the distribution's shape, usually giving the best diversity/quality balance — the standard default.

Exercise 16Code
Plot D_KL(P‖Q) vs Q for fixed P=[0.6,0.4]. What happens as Q→[1,0] or [0,1]?

Solution

KL is convex in Q with a minimum of 0 at Q = P = [0.6,0.4]. As Q→[1,0] the term 0.4·log(0.4/0) → +∞, and similarly as Q→[0,1]. KL blows up whenever Q assigns near-zero probability to an outcome P deems possible — the 'infinite surprise' of ruling out something that happens.

Exercise 17Code
Empirically verify the bias-variance decomposition for polynomial regression of degree 1,5,10,20.

Solution

Fit each degree on many bootstrap samples, then estimate bias² (squared gap of the mean prediction from truth) and variance (spread across fits). Low degrees show high bias, low variance (underfit); high degrees show low bias, high variance (overfit). Total error = bias² + variance + noise is U-shaped, minimized at an intermediate degree.

Exercise 18Code
Implement temperature, top-k, top-p from scratch on random logits; test T=0.001, k=1, p=0.01.

Solution

Temperature divides logits before softmax; top-k zeroes all but the k largest; top-p keeps the smallest set whose cumulative probability ≥ p, renormalizes, and samples. Edge cases: T=0.001 ≈ greedy; k=1 = greedy; p=0.01 keeps only the single most probable token (also ≈ greedy). All three degenerate to argmax in their extremes, confirming correctness.

Exercise 19Code
Perplexity of GPT-2 small on (a) Wikipedia, (b) random English words, (c) random characters. Explain.

Solution

Perplexity is lowest on fluent Wikipedia (the model predicts it well), higher on random English words (valid tokens but no coherent structure), and highest on random characters (off-distribution, near-uniform surprise). Perplexity tracks how 'expected' the text is under the model.

Exercise 20Code (Challenge)
Full MLE training for a trigram LM with add-k smoothing; tune k on validation perplexity.

Solution

Estimate P(w_t | w_{t−2}, w_{t−1}) with add-k smoothing: (count + k)/(context_count + k·V). Sweep k (e.g. 0.01–1.0) and pick the value minimizing validation perplexity — a bias/variance trade-off: small k overfits sparse counts, large k over-smooths toward uniform. The optimum is typically small but nonzero.