Solutions Appendix
Chapter 4

Information Theory

16 Solutions

Detailed solutions for the exercises in Chapter 4. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Surprisal (bits) of rolling a 6 on a fair die. How many fair-die rolls give 1 bit?

Solution

Surprisal = −log₂(1/6) = log₂ 6 ≈ 2.585 bits. One die roll therefore carries ~2.585 bits. To accumulate exactly 1 bit you need an event of probability 1/2 — only about 1/2.585 ≈ 0.39 of a die roll's worth of information; equivalently, one fair coin flip is 1 bit while one die roll exceeds it.

Exercise 2Pen & Paper
Show H(P)=log N for uniform P over N outcomes. Why is this maximal?

Solution

H = −Σ (1/N) log(1/N) = −N·(1/N)·(−log N) = log N. It is the maximum because, by Jensen/Lagrange, entropy is maximized by the uniform distribution: any concentration of probability reduces uncertainty. Uniform = maximal ignorance = maximal entropy.

Exercise 3Pen & Paper
Prove H(P,Q) ≥ H(P) using H(P,Q)=H(P)+KL and KL≥0.

Solution

Since H(P,Q) = H(P) + KL(P‖Q) and KL(P‖Q) ≥ 0 (Gibbs' inequality), it follows immediately that H(P,Q) ≥ H(P), with equality iff Q = P. Coding data with the wrong distribution Q can never beat using its true distribution P — the information-theoretic basis for why minimizing cross-entropy drives Q toward P.

Exercise 4Pen & Paper
Derive ∂CE/∂z_c = p_c − y_c from H(y_onehot, softmax(z)).

Solution

CE = −Σ_j y_j log p_j with p = softmax(z). Using ∂p_j/∂z_c = p_j(δ_{jc} − p_c) and Σ_j y_j = 1, the sum telescopes to ∂CE/∂z_c = p_c − y_c. The gradient is predicted-minus-true probability — zero exactly when the prediction matches the one-hot target.

Exercise 5Pen & Paper
For P=[0.5,0.3,0.2], Q=[0.6,0.3,0.1], compute H(P), H(P,Q), KL and verify the identity.

Solution

In bits: H(P) = −(0.5log₂.5 + 0.3log₂.3 + 0.2log₂.2) ≈ 1.485. H(P,Q) = −(0.5log₂.6 + 0.3log₂.3 + 0.2log₂.1) ≈ 1.554. KL = 0.5log₂(.5/.6) + 0 + 0.2log₂(.2/.1) ≈ −0.131 + 0.200 = 0.069. Indeed H(P)+KL = 1.485+0.069 = 1.554 = H(P,Q). ✓

Exercise 6Pen & Paper
Surprisal of P=0.001; perplexity if that happens 5% of the time and P=0.5 otherwise.

Solution

Surprisal of 0.001 = −log₂ 0.001 ≈ 9.97 bits (6.91 nats). Mean NLL (nats) = 0.05·(−ln 0.001) + 0.95·(−ln 0.5) = 0.345 + 0.658 = 1.004, so perplexity = e^{1.004} ≈ 2.73. The rare confident errors raise perplexity noticeably above the e^{0.693}=2 baseline of the common case.

Exercise 7Pen & Paper
State and prove Jensen's inequality for log; use it to prove KL≥0.

Solution

For a concave f, E[f(X)] ≤ f(E[X]); log is concave. Then −KL(P‖Q) = Σ p log(q/p) = E_P[log(q/p)] ≤ log E_P[q/p] = log Σ q = log 1 = 0. Hence KL(P‖Q) ≥ 0, with equality iff q/p is constant, i.e. P = Q.

Exercise 8Pen & Paper
Show I(X;Y)=0 iff X,Y independent (use the KL form).

Solution

Mutual information is I(X;Y) = KL(p(x,y) ‖ p(x)p(y)). Since KL ≥ 0 with equality iff its two arguments are equal, I(X;Y) = 0 exactly when p(x,y) = p(x)p(y) for all x,y — the definition of independence. So zero mutual information is precisely statistical independence.

Exercise 9Pen & Paper
In the ELBO, why does D_KL(q(z|x)‖p(z)) regularize the encoder? What if β→∞?

Solution

The ELBO = reconstruction − β·KL(q(z|x)‖p(z)). The KL term pulls the encoder's posterior toward the prior, preventing it from memorizing each input as an isolated code and encouraging a smooth, structured latent space. As β→∞ the KL dominates, forcing q(z|x) → p(z) for all x — 'posterior collapse', where the latent ignores the input and reconstruction fails.

Exercise 10Code
Implement entropy, cross-entropy, KL from scratch; verify against scipy for 10 random pairs.

Solution

Use H(P)=−Σ p log p, H(P,Q)=−Σ p log q, KL=Σ p log(p/q), guarding against log(0). For random normalized P,Q these match scipy.stats.entropy(P) and entropy(P,Q) (which returns KL) to numerical precision, confirming the identity H(P,Q)=H(P)+KL holds elementwise.

Exercise 11Code
Plot H(p,1−p) in bits for p∈[0.01,0.99]. Where is cross-entropy minimized?

Solution

For a fixed true label, the binary cross-entropy as a function of the predicted p is minimized when p matches the truth (e.g. p→1 for a positive label, where CE→0) and grows without bound as the prediction approaches the wrong extreme (CE→∞). The plot is the familiar convex log-loss curve.

Exercise 12Code
Plot D_KL(P‖Q) and D_KL(Q‖P) vs Q for P=[0.8,0.2]; explain the asymmetry.

Solution

Both are convex with a zero at Q=P, but they rise differently: forward KL(P‖Q) penalizes Q under-covering P's mass (blows up as Q→[0,1] where P has mass), while reverse KL(Q‖P) penalizes Q placing mass where P is small. The curves are not mirror images — visual proof that KL is not a distance.

Exercise 13Code Lab 4.1
Run analyse_text on a Wikipedia sentence, an ML-paper sentence, and random words; compare surprisal.

Solution

Fluent, common text yields low average surprisal (predictable tokens); technical text is moderate (rarer vocabulary, specialized phrasing); random words are high (little predictability). The per-token surprisal profile spikes at rare or unexpected tokens, visualizing where the model is 'surprised'.

Exercise 14Code
Implement InfoNCE from scratch; verify loss ≈ log(128) for N=128 random embeddings, ≈0 for perfect.

Solution

InfoNCE is cross-entropy over similarity scores where the positive pair is the correct class among N. With random embeddings every candidate is equally likely, so loss ≈ log N = log 128 ≈ 4.85. With perfect embeddings (z1=z2) the positive similarity dominates and loss → 0. This is the contrastive objective behind CLIP and SimCLR (Chapters 8, 30).

Exercise 15Code
Train an MLP on MNIST; log CE loss, model entropy per prediction, and perplexity over 10 epochs.

Solution

As training proceeds CE loss falls, predictive entropy falls (the model grows more confident), and perplexity = exp(CE) falls in lockstep. They are tightly related: perplexity is the exponential of cross-entropy, and confident correct predictions reduce both entropy and loss together.

Exercise 16Code (Challenge)
Implement a β-VAE; train on MNIST with β∈{0.1,1,4,10}; plot ELBO, KL, reconstruction. What does high β do?

Solution

Higher β weights the KL term more, producing a more disentangled but lower-capacity latent: reconstruction loss rises and the KL term shrinks (the posterior hugs the prior). At very high β the latent collapses (ignores the input). Low β gives sharp reconstructions but a less structured, less disentangled latent space — the core β-VAE trade-off.