Part I: Mathematical Foundations

Chapter 4

Information Theory

Entropy, cross-entropy, KL divergence, and mutual information

16 Exercises

Learning Objectives

1.	Derive Shannon entropy from first principles and compute it for any distribution.
2.	Understand cross-entropy as the expected surprise when using the wrong model, and prove it lower-bounds entropy.
3.	Explain why minimising cross-entropy loss is identical to maximising log-likelihood.
4.	Compute KL divergence, relate it to cross-entropy, and interpret its asymmetry.
5.	Define mutual information and explain how it appears in contrastive learning losses.
6.	Derive Jensen’s inequality and use it to prove the non-negativity of KL divergence.
7.	Understand the Evidence Lower Bound (ELBO) and its role in VAEs and RLHF.
8.	Implement all quantities from scratch and verify them against scipy.stats.

Every time a language model predicts the next token, it is implicitly making a bet on how surprising that token will be. The model that wins is the one that assigns the least surprise — in expectation — to the true tokens. Information theory makes this intuition precise.

Claude Shannon founded information theory in 1948 with a deceptively simple question: “How much information is in a message?” His answer — it depends on how surprising it is — turns out to be the correct foundation for understanding every loss function in deep learning.

✧

The Central Thesis of This Chapter

Training a language model with cross-entropy loss is not an arbitrary engineering choice. It follows inevitably from the information-theoretic definition of the optimal code for a source of data.

A model that has learned to assign zero surprise to every true token has achieved perfect compression — it has fully understood the data distribution. Cross-entropy measures exactly how far we are from that ideal.

Three Equivalent Framings

Framing	Question	Mathematical answer
Compression	How many bits does it take to encode a message?	H(P) bits minimum (Shannon’s source coding theorem)
Prediction	How well can a model predict the next symbol?	Cross-entropy H(P,Q) bits per symbol
Surprise	How surprised should you be by an outcome?	Surprisal: -log₂ P(x) bits

Before defining entropy (which averages over an entire distribution), we define surprisal: the information content of a single event. The rarer the event, the more surprising it is, and the more information it carries.

Definition

Surprisal

The surprisal (or self-information) of event x is: I(x) = -log P(x). Units are bits if log₂, nats if ln. A certain event (P=1) has surprisal 0. An impossible event (P=0) has infinite surprisal.

Three properties that uniquely characterise a reasonable measure of surprise:

•Monotone: rarer events are more surprising [P(x) < P(y) ⇒ I(x) > I(y)]

•Non-negative: no event can have negative surprise

•Additive for independent events: I(x,y) = I(x) + I(y) when X ⊥ Y

The only function satisfying all three is the negative logarithm.

Python•Surprisal — computing and interpreting it
import numpy as np

def surprisal(p, base=2):
    """Surprisal (self-information) of an event with probability p."""
    assert 0 < p <= 1, f"Probability must be in (0,1], got {p}"
    return -np.log(p) / np.log(base)  # change of base

# Coin flip: P(heads) = 0.5  →  1 bit of surprise
print(f"Fair coin flip:   {surprisal(0.5):.2f} bits")
print(f"1-in-4 outcome:   {surprisal(0.25):.2f} bits")
print(f"Certain event:    {surprisal(1.0):.2f} bits")
print(f"1-in-million:     {surprisal(1e-6):.2f} bits")

# Token-level surprisal in a language model
# When the model assigns P('the') = 0.40, the surprisal is:
p_the = 0.40
print(f"'the' surprisal: {surprisal(p_the):.3f} bits")
# 1.322 bits -- less than a fair coin because 'the' is common

# Rare technical term: model assigns P('eigenvector') = 0.0001
p_eig = 0.0001
print(f"'eigenvector' surprisal: {surprisal(p_eig):.3f} bits")
# 13.288 bits -- very surprising to a general model

# Independent events: surprisals ADD (probabilities MULTIPLY)
p_seq  = 0.5 * 0.25  # P('heads' then '1-in-4') -- independent
direct = surprisal(p_seq)
summed = surprisal(0.5) + surprisal(0.25)
print(f"Direct: {direct:.2f}  Summed: {summed:.2f}")
# Direct: 3.00  Summed: 3.00  ✔

Surprisal and the NLL Loss

Every cross-entropy training step computes the surprisal of the correct token under the model and tries to reduce it. The negative log-likelihood (NLL) loss is exactly the average surprisal across all training tokens:

NLL = -(1/N) Σ_{t=1}^{N} log P_θ(x_t | x_{<t}) = (1/N) Σ I_θ(x_t)

Python•NLL loss as average surprisal
import numpy as np
import torch
import torch.nn.functional as F

# Simulate model logits for 5 tokens, vocab size 10
torch.manual_seed(42)
logits  = torch.randn(5, 10)  # (T=5, V=10)
targets = torch.tensor([3, 7, 1, 4, 9])  # true next tokens

# Standard cross-entropy loss
ce_loss = F.cross_entropy(logits, targets)

# Manually: average surprisal of true tokens
probs        = F.softmax(logits, dim=-1)
true_probs   = probs[torch.arange(5), targets]
surprisals   = -torch.log(true_probs)
manual_loss  = surprisals.mean()

print(f"CE loss:       {ce_loss.item():.4f}")
print(f"Manual (avg surprisal): {manual_loss.item():.4f}")
print(f"Token surprisals: {surprisals.tolist()}")
print(torch.allclose(ce_loss, manual_loss))
# True -- they are identical!

Entropy is the expected surprisal. Where surprisal tells you how surprising a single outcome was, entropy tells you how surprising events from a distribution are, on average. It is the single most important quantity in information theory.

Definition and Derivation

H(X) = E[I(X)] = -Σ_{x} P(x) log P(x) [nats or bits]

This is not an arbitrary formula. Shannon showed it is the unique function satisfying three natural axioms: continuity in probabilities, maximum for the uniform distribution, and additivity for independent sub-experiments. The log is the only function that delivers all three.

Python•Shannon entropy — building intuition
import numpy as np

def entropy(probs, base=np.e):
    """Shannon entropy. base=np.e → nats; base=2 → bits."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]              # 0*log(0) = 0 by convention
    return -np.sum(probs * np.log(probs)) / np.log(base)

# ---- Case 1: Deterministic (no uncertainty) ----
print(f"Deterministic:  {entropy([1.0, 0.0, 0.0]):.3f} bits")
# 0.000 bits -- knowing the outcome is FREE (no encoding needed)

# ---- Case 2: Fair coin (maximum 2-outcome entropy) ----
print(f"Fair coin:      {entropy([0.5, 0.5], base=2):.3f} bits")
# 1.000 bit -- need exactly 1 bit to transmit the outcome

# ---- Case 3: Biased coin ----
print(f"Biased (0.9):   {entropy([0.9, 0.1], base=2):.3f} bits")
# 0.469 bits -- predictable, so little info per flip

# ---- Case 4: Uniform over vocabulary ----
V = 50257  # GPT-2 tokenizer vocab size
print(f"Uniform vocab:  {entropy([1/V]*V, base=2):.2f} bits")
# 15.61 bits -- a random model needs 15.6 bits per token

# ---- Entropy of a realistic LM output distribution ----
np.random.seed(42)
logits = np.random.randn(V)
def softmax(x): e=np.exp(x-x.max()); return e/e.sum()
probs  = softmax(logits)
print(f"Random logits:  {entropy(probs, base=2):.2f} bits")
# ~15.5 bits -- still near-uniform after random init

# Push one logit high to simulate a confident model
logits[42] += 10.0
probs_conf = softmax(logits)
print(f"Confident model:{entropy(probs_conf, base=2):.2f} bits")
# ~0.06 bits -- nearly certain, very low entropy

Properties of Entropy

Property	Statement
Non-negativity	H(X) ≥ 0, with equality iff X is deterministic
Maximum	H(X) is maximised by the uniform distribution: H ≤ log\|X\|
Concavity	H is concave in P — mixing distributions increases entropy
Chain rule	H(X,Y) = H(X) + H(Y\|X) = H(Y) + H(X\|Y)
Subadditivity	H(X,Y) ≤ H(X) + H(Y), with equality iff X ⊥ Y
Conditioning reduces entropy	H(X\|Y) ≤ H(X): knowing Y can only help

Cross-entropy H(P,Q) measures the average number of bits needed to encode events from distribution P when we use a code optimised for distribution Q. When our model Q differs from the true distribution P, we spend extra bits — exactly the KL divergence worth.

Definition

H(P, Q) = -Σ_{x} P(x) log Q(x) = E_{x ∼ P}[-log Q(x)]

Key inequality: H(P,Q) ≥ H(P) always, with equality iff P = Q. The gap is the KL divergence. Training minimises this gap.

H(P, Q) = H(P) + D_KL(P ∥ Q) ≥ H(P)

Python•Cross-entropy — computed three ways
import numpy as np
import torch
import torch.nn.functional as F
from scipy.special import entr

P = np.array([0.4, 0.3, 0.2, 0.1])  # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])  # uniform model

# ---- Method 1: from definition ----
H_PQ_def  = -np.sum(P * np.log(Q))

# ---- Method 2: H(P) + KL(P||Q) ----
H_P       = -np.sum(P * np.log(P))
KL_PQ     = np.sum(P * np.log(P / Q))
H_PQ_comp = H_P + KL_PQ

# ---- Method 3: torch.nn.functional (one-hot targets) ----
logits_q  = torch.tensor(np.log(Q), dtype=torch.float32)
sample    = torch.tensor([0, 1, 2, 3], dtype=torch.long) # simulate 4 samples
weights   = torch.tensor(P, dtype=torch.float32)
ce_torch  = (weights * F.cross_entropy(logits_q.unsqueeze(0).expand(4,-1), sample, reduction='none')).sum()

print(f"H(P,Q) definition: {H_PQ_def:.4f} nats")
print(f"H(P,Q) = H(P)+KL:  {H_PQ_comp:.4f} nats")
print(f"H(P)   =            {H_P:.4f} nats (irreducible floor)")
print(f"KL gap =            {KL_PQ:.4f} nats (what training closes)")

One-Hot Targets: Why NLL Equals Cross-Entropy

In classification, the “true” distribution P is a one-hot vector: all mass on the correct class. This simplifies cross-entropy dramatically, because the sum collapses to a single term:

H(one_hot_y, Q) = -Σ_c [y=c] log Q(c) = -log Q(y) ← negative log-likelihood

Python•Why NLL and cross-entropy are identical for classification
import numpy as np

def cross_entropy_onehot(logits, true_class):
    """Cross-entropy with one-hot P = NLL of true class."""
    # Numerically stable softmax
    logits = logits - logits.max()
    probs  = np.exp(logits) / np.exp(logits).sum()

    # Cross-entropy: -sum_c P(c) log Q(c)
    # With one-hot P: only the true class contributes
    ce = -np.log(probs[true_class] + 1e-12)  # == NLL
    return ce, probs

logits = np.array([1.0, 3.0, 2.0, 0.5])

for true_class in range(4):
    ce, probs = cross_entropy_onehot(logits, true_class)
    print(f"True={true_class}  P(correct)={probs[true_class]:.3f}  CE={ce:.3f} nats")
# True=0  P(correct)=0.090  CE=2.411 (hard: model favours class 1)
# True=1  P(correct)=0.665  CE=0.408 (easy: model correctly confident)
# True=2  P(correct)=0.245  CE=1.408 (medium)
# True=3  P(correct)=0.049  CE=3.016 (hard: model very wrong)

# Note: lower CE → model is more confident in the right answer

The Gradient of Cross-Entropy w.r.t. Logits

One reason cross-entropy is so popular: its gradient w.r.t. the pre-softmax logits has an elegant closed form that is numerically stable and computationally efficient.

∂ H / ∂ z_c = p_c - y_c

Where p_c = softmax(z)_c and y_c is the one-hot target. The gradient is simply the difference between the model’s prediction and the truth. This is the most elegant result in the chapter.

Python•Gradient of cross-entropy — derivation and verification
import numpy as np

def softmax(z): e=np.exp(z-z.max()); return e/e.sum()

def ce_and_grad(z, y_class):
    """Cross-entropy loss and its gradient w.r.t. logits z."""
    p        = softmax(z)
    loss     = -np.log(p[y_class] + 1e-12)
    # Gradient: p - one_hot(y)
    grad     = p.copy()
    grad[y_class] -= 1.0                  # subtract 1 from correct class
    return loss, grad

z = np.array([1.0, 3.0, 2.0, 0.5])  # logits
y = 2                                 # true class
loss, grad_analytic = ce_and_grad(z, y)

# Verify against numerical gradient
h = 1e-5
grad_numerical = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h; zm[i] -= h
    lp     = -np.log(softmax(zp)[y]+1e-12)
    lm     = -np.log(softmax(zm)[y]+1e-12)
    grad_numerical[i] = (lp - lm) / (2*h)

print(f"Analytic  grad: {np.round(grad_analytic, 4)}")
print(f"Numerical grad: {np.round(grad_numerical, 4)}")
print(np.allclose(grad_analytic, grad_numerical, atol=1e-7))
# True -- gradient is exactly p - y_onehot

✧

The Beautiful Gradient

The gradient ∂CE/∂z = p - y is the most elegant result in practical deep learning. It says: to decrease the loss, push the model’s probabilities toward the true distribution. The gradient is zero only when p = y exactly.

This clean form is why cross-entropy is preferred over mean squared error for classification: MSE on probabilities has a much messier gradient that saturates badly when the sigmoid/softmax output is near 0 or 1.

We met KL divergence in Chapter 3. Here we examine it more deeply: its geometric interpretation, its asymmetry explained carefully, and its central role in variational inference, RLHF, and contrastive learning.

Definition and Decomposition

D_KL(P ∥ Q) = Σ_x P(x) log [P(x)/Q(x)] = H(P,Q) - H(P)

Python•KL divergence — full implementation and properties
import numpy as np

def kl_div(P, Q, eps=1e-12):
    """KL(P || Q): extra bits when using Q to encode messages from P."""
    P, Q = np.asarray(P)+eps, np.asarray(Q)+eps
    P /= P.sum(); Q /= Q.sum()
    return float(np.sum(P * np.log(P/Q)))

# ---- KL is always non-negative ----
for _ in range(10000):
    P = np.random.dirichlet(np.ones(5))
    Q = np.random.dirichlet(np.ones(5))
    assert kl_div(P, Q) >= -1e-9   # always >= 0
print("KL ≥ 0 verified over 10,000 random pairs")

# ---- KL is NOT symmetric ----
P = [0.9, 0.1]   # peaked (true posterior in Bayesian setting)
Q = [0.5, 0.5]   # flat (prior)
print(f"KL(P||Q) = {kl_div(P,Q):.4f}")
print(f"KL(Q||P) = {kl_div(Q,P):.4f}")
# KL(P||Q) = 0.5108   KL(Q||P) = 0.4185  -- different!

# ---- KL = 0 iff P == Q ----
print(f"KL(P||P) = {kl_div(P,P):.6f}")
# 0.000000

Forward vs. Reverse KL: Mode-Seeking vs. Mean-Seeking

The two directions of KL have fundamentally different behaviours when Q is used to approximate P. This is one of the most practically important distinctions in probabilistic ML.

Direction	D_KL(P ∥ Q) — Forward	D_KL(Q ∥ P) — Reverse
Also called	Inclusive KL / I-projection	Exclusive KL / M-projection
Behaviour	Q must cover all of P’s mass (mean-seeking)	Q concentrates on one mode of P (mode-seeking)
Zero-avoiding?	Yes: Q(x)>0 whenever P(x)>0	No: Q can ignore modes of P where it assigns zero
Used in	MLE, VAE decoder (ELBO), next-token CE loss	Variational inference, diffusion reverse process
Training effect	Model spreads mass over all possible outputs	Model picks one high-probability output and sticks to it

Python•Forward vs. reverse KL on a bimodal distribution
import numpy as np
import matplotlib.pyplot as plt

# True distribution P: bimodal (two peaks at 2 and 8)
x    = np.linspace(0, 10, 200)
P    = 0.5*np.exp(-0.5*(x-2)**2) + 0.5*np.exp(-0.5*(x-8)**2)
P   /= P.sum()

# Q: a single Gaussian parameterised by mu
def Q_gaussian(mu, sigma=1.0):
    q = np.exp(-0.5*((x-mu)/sigma)**2)
    return q / q.sum()

def kl(P, Q, eps=1e-12): return float(np.sum(P*np.log((P+eps)/(Q+eps))))

mus  = np.linspace(0.5, 9.5, 100)
fwd  = [kl(P, Q_gaussian(m)) for m in mus]  # KL(P||Q): forward
rev  = [kl(Q_gaussian(m), P) for m in mus]  # KL(Q||P): reverse

fwd_opt = mus[np.argmin(fwd)]
rev_opt = mus[np.argmin(rev)]
print(f"Forward KL minimised at mu={fwd_opt:.1f}")
# mu ≈5.0 -- mean-seeking: splits the difference between modes
print(f"Reverse KL minimised at mu={rev_opt:.1f}")
# mu ≈2.0 or 8.0 -- mode-seeking: locks onto one mode

▶

ML Connection: KL in RLHF and VAEs

RLHF objective: max E[reward] - β D_KL(π_θ ∥ π_ref). This is reverse KL (KL(Q∥P) with the policy as Q). The mode-seeking property means the model may ignore safe-but-suboptimal responses and focus on a narrow high-reward region.

VAE ELBO: -E[log P(x|z)] + D_KL(q(z|x) ∥ p(z)). This is forward KL on the latent space (KL(P∥Q) with the approximate posterior as P). The mean-seeking property forces the encoder to cover all plausible latent codes, preventing posterior collapse.

Jensen’s inequality is a two-line theorem with outsized consequences: it proves KL ≥ 0, justifies the ELBO lower bound in VAEs, and explains why the log of an average is always greater than the average of logs.

Statement

Jensen’s Inequality

For any convex function f and random variable X: f(E[X]) ≤ E[f(X)]. For concave f (like log), the inequality flips: f(E[X]) ≥ E[f(X)].

Geometric interpretation: the chord connecting any two points on a convex function lies above (or on) the curve. Therefore the average of function values exceeds the function of the average.

Python•Jensen’s inequality — verified and applied
import numpy as np

np.random.seed(0)

# ---- Verify for convex f(x) = x^2 ----
X      = np.random.randn(10000)
f      = lambda x: x**2
lhs    = f(X.mean())      # f(E[X])
rhs    = np.mean(f(X))    # E[f(X)]
print(f"f(E[X]) = {lhs:.4f}   E[f(X)] = {rhs:.4f}  LHS <= RHS: {lhs <= rhs+1e-9}")
# f(E[X]) ≈0.0   E[f(X)] ≈1.0  LHS <= RHS: True  (Jensen holds)

# ---- log is concave: E[log X] ≤ log(E[X]) ----
X_pos  = np.abs(X) + 0.1  # keep positive for log
lhs_log= np.log(X_pos.mean())   # log(E[X])  (concave: this is BIGGER)
rhs_log= np.mean(np.log(X_pos)) # E[log X]  (Jensen: this is SMALLER)
print(f"log(E[X]) = {lhs_log:.4f}   E[log X] = {rhs_log:.4f}  log(E) >= E[log]: {lhs_log >= rhs_log-1e-9}")

# ---- Prove KL >= 0 using Jensen ----
# -KL(P||Q) = E_{P}[log(Q/P)] = E_{P}[log r]  where r = Q/P
# By Jensen (log concave):  E[log r] ≤ log E[r] = log Σ P(x)*(Q(x)/P(x))
#                                     = log Σ Q(x) = log 1 = 0
# Therefore -KL ≤ 0  ⇒  KL ≥ 0  ✔
P = np.array([0.1,0.3,0.4,0.2])
Q = np.array([0.2,0.2,0.3,0.3])
r = Q / P  # ratio
e_log_r  = np.sum(P * np.log(r))   # E_P[log r] = -KL(P||Q)
log_e_r  = np.log(np.sum(P * r))   # log(E_P[r]) = log(1) = 0
print(f"E[log r] = {e_log_r:.4f} ≤ log(E[r]) = {log_e_r:.4f}  ⇒ -KL ≤ 0  ⇒ KL ≥ 0")

The ELBO: Jensen Applied to Variational Inference

The Evidence Lower Bound (ELBO) in Variational Autoencoders follows directly from Jensen’s inequality applied to the log-evidence. This is the derivation every VAE paper references but rarely shows fully.

log P(x) = log ∫ P(x,z) dz = log E_{q(z)}[P(x,z)/q(z)]

≥ E_{q(z)}[log P(x,z)/q(z)] = E[log P(x|z)] - D_KL(q(z|x) ∥ p(z))

The first step uses Jensen’s inequality (log is concave, so log E[·] ≥ E[log ·]). The ELBO lower-bounds the log-likelihood; maximizing the ELBO simultaneously learns the encoder q(z|x) and decoder P(x|z).

Python•ELBO for a simple Gaussian VAE
import torch, torch.nn as nn, torch.nn.functional as F

class SimpleVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.encoder_mu  = nn.Linear(input_dim, latent_dim)
        self.encoder_lv  = nn.Linear(input_dim, latent_dim) # log-variance
        self.decoder     = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        # Encode: q(z|x) = N(mu, diag(exp(log_var)))
        mu, log_var = self.encoder_mu(x), self.encoder_lv(x)

        # Reparameterisation trick
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        z   = mu + eps * std

        # Decode: P(x|z)
        x_hat = torch.sigmoid(self.decoder(z))

        # ELBO = reconstruction_term - KL_term
        recon_loss = F.binary_cross_entropy(x_hat, x, reduction='sum') # -E[log P(x|z)]
        kl_loss    = -0.5 * torch.sum(1 + log_var - mu**2 - log_var.exp()) # KL(q||p)
        elbo       = -(recon_loss + kl_loss)  # maximise ELBO = minimise sum
        return x_hat, recon_loss, kl_loss

Mutual information (MI) quantifies the reduction in uncertainty about X when we observe Y. It is symmetric, always non-negative, and zero if and only if X and Y are independent. It appears in contrastive learning, feature selection, and self-supervised representation objectives.

Definition and Decompositions

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

I(X;Y) = D_KL( P(X,Y) ∥ P(X)P(Y) ) ≥ 0

The KL form reveals the geometric meaning: MI measures how different the joint distribution is from the product of marginals. Zero mutual information means the joint factorises — the variables are independent.

Python•Mutual information — from scratch
import numpy as np

def mutual_information(joint, eps=1e-12):
    """
    Compute I(X;Y) from a joint probability table.
    joint: 2D numpy array where joint[i,j] = P(X=i, Y=j).
    """
    joint = np.asarray(joint, float)
    joint = joint / joint.sum()           # normalise
    px    = joint.sum(axis=1, keepdims=True)
    py    = joint.sum(axis=0, keepdims=True)
    product = px * py                    # P(X)*P(Y)
    # I(X;Y) = KL(P_XY || P_X P_Y)
    mask  = joint > eps
    return float(np.sum(joint[mask] * np.log(joint[mask]/product[mask])))

# ---- Independent variables: I = 0 ----
px_ind = np.array([0.4, 0.6])
py_ind = np.array([0.3, 0.7])
joint_ind = np.outer(px_ind, py_ind)  # factorised
print(f"Independent:  I(X;Y) = {mutual_information(joint_ind):.4f} nats")
# ≈0.0000 -- independent, zero MI

# ---- Correlated variables: I > 0 ----
joint_cor = np.array([[0.4, 0.05], [0.05, 0.50]])
print(f"Correlated:   I(X;Y) = {mutual_information(joint_cor):.4f} nats")
# 0.5004 -- knowing one strongly predicts the other

# ---- Deterministic: I = H(X) ----
joint_det = np.array([[0.5, 0.0], [0.0, 0.5]])  # Y always = X
hx_det    = -np.sum([0.5,0.5]*np.log([0.5,0.5]))
print(f"Deterministic: I = {mutual_information(joint_det):.4f} nats  H(X) = {hx_det:.4f}")
# I = H(X) when Y is a deterministic function of X

Mutual Information in Contrastive Learning

Self-supervised representation learning methods like CLIP, SimCLR, and MoCo maximise a lower bound on mutual information between two views of the same data. The InfoNCE loss is this lower bound.

Python•InfoNCE loss — MI lower bound for contrastive learning
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.07):
    """
    InfoNCE (NT-Xent) loss: lower bound on I(z1; z2).
    z1, z2: L2-normalised embeddings (N, D).
    Positive pairs: (z1[i], z2[i]). Negatives: all other pairs.
    """
    N = z1.shape[0]
    # L2 normalise
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # Cosine similarity matrix: (N, N)
    logits = (z1 @ z2.T) / temperature  # (N, N)
    # Positives on diagonal; negatives are off-diagonal
    labels = torch.arange(N, device=z1.device)
    # Cross-entropy: each row is a N-way classification problem
    loss   = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    return loss

# Simulate: 64 samples, 128-dim embeddings
torch.manual_seed(0)
z1 = torch.randn(64, 128)  # view 1 embeddings
z2 = z1 + 0.1 * torch.randn(64, 128)  # view 2 (similar)
z2_rand = torch.randn(64, 128)           # random view (unrelated)

print(f"InfoNCE (similar views): {info_nce_loss(z1, z2):.4f}")
print(f"InfoNCE (random views):  {info_nce_loss(z1, z2_rand):.4f}")
# Similar views: low loss (high MI)
# Random views:  log(N) = log(64) ≈4.16 (chance-level, zero MI)

▶

ML Connection: CLIP and Contrastive Pre-training

CLIP (Contrastive Language-Image Pre-training) trains image and text encoders jointly by maximising InfoNCE loss over (image, caption) pairs from 400M web examples. The model learns to embed matching pairs close in cosine similarity space.

This is MI maximisation: the model learns representations I(image; caption) is maximised — every bit of information shared between image content and caption text is encoded in the representation.

Conditional entropy H(X|Y) measures the remaining uncertainty in X after observing Y. The data processing inequality formalises a crucial fact: any deterministic transformation of data can only lose information, never create it.

Conditional Entropy

H(X|Y) = -Σ_{x,y} P(x,y) log P(x|y) = H(X,Y) - H(Y)

H(X|Y) ≤ H(X) always — conditioning reduces entropy (or leaves it unchanged). Equality holds iff X and Y are independent.

Python•Conditional entropy — the chain rule of information
import numpy as np

def conditional_entropy(joint, eps=1e-12):
    """H(X|Y) from a joint probability table P(X,Y)."""
    joint = np.asarray(joint, float); joint /= joint.sum()
    py    = joint.sum(axis=0, keepdims=True)  # marginal P(Y)
    cond  = joint / (py + eps)              # P(X|Y)
    # H(X|Y) = -sum_{x,y} P(x,y) log P(x|y)
    mask  = joint > eps
    return float(-np.sum(joint[mask] * np.log(cond[mask])))

# Joint table: Weather (rows) x Mood (cols)
joint = np.array([[0.40, 0.10], [0.15, 0.35]])

H_X   = -np.sum(joint.sum(1)*np.log(joint.sum(1)+1e-12)) # H(Weather)
H_Y   = -np.sum(joint.sum(0)*np.log(joint.sum(0)+1e-12)) # H(Mood)
H_XY  = -np.sum(joint*np.log(joint+1e-12))           # H(Weather, Mood)
H_X_Y = conditional_entropy(joint)               # H(Weather|Mood)
H_Y_X = conditional_entropy(joint.T)              # H(Mood|Weather)

print(f"H(X)    = {H_X:.4f}")
print(f"H(X|Y)  = {H_X_Y:.4f}   (reduced by observing Y)")
print(f"Chain rule: H(X|Y) = H(X,Y) - H(Y) = {H_XY-H_Y:.4f}  ✔")
print(f"MI: I(X;Y) = H(X) - H(X|Y) = {H_X - H_X_Y:.4f}")

Data Processing Inequality

If X → Y → Z forms a Markov chain (Z depends on X only through Y), then I(X;Z) ≤ I(X;Y). Processing data can only reduce the mutual information.

Python•Data processing inequality — verified for neural representations
import torch
import torch.nn as nn

# Toy example: information about class label Y in representations
# X → layer1 → layer2 → layer3
# By DPI: I(Y; layer3) ≤ I(Y; layer2) ≤ I(Y; layer1) ≤ I(Y; X)

torch.manual_seed(42)
N = 1000
# 2-class input: class 0 ~ N(-1,1), class 1 ~ N(+1,1)
labels = torch.randint(0, 2, (N,))
X      = (labels.float() * 2 - 1) + 0.8 * torch.randn(N, 1)

# Pass through layers of random (untrained) linear projections
layer1 = nn.Linear(1,  4,  bias=False)
layer2 = nn.Linear(4,  16, bias=False)
layer3 = nn.Linear(16, 64, bias=False)

with torch.no_grad():
    h1 = layer1(X)
    h2 = layer2(h1)
    h3 = layer3(h2)

# Estimate MI via a linear probe accuracy proxy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
for name, rep in [('X',X), ('h1',h1),('h2',h2),('h3',h3)]:
    acc = cross_val_score(LogisticRegression(), rep.numpy(), labels.numpy(), cv=3).mean()
    print(f"{name:>4}: linear probe accuracy = {acc:.3f}")
# All ≈ same -- random projections don't lose class info (linear maps are invertible)
# But nonlinear activations with ReLU would reduce MI in early layers

We now have all the pieces to see language model training as an information-theoretic problem — not just an engineering task. This section ties together entropy, cross-entropy, KL divergence, and mutual information into a single coherent picture.

The Optimal Language Model

The data distribution P_data assigns probabilities to all possible sequences. The language model Q_θ is our parameterised approximation. The training objective is:

min_θ H(P_data, Q_θ) = min_θ [H(P_data) + D_KL(P_data ∥ Q_θ)]

Since H(P_data) is fixed, minimising cross-entropy is exactly minimising KL divergence — bringing the model distribution as close as possible to the data distribution.

Python•Code Lab 4.1 — Computing all information quantities for a real LM
import torch
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2').eval()
tok   = GPT2Tokenizer.from_pretrained('gpt2')

def analyse_text(text):
    ids   = tok.encode(text, return_tensors='pt')
    with torch.no_grad():
        logits = model(ids).logits[0]       # (T, V)
    probs     = torch.softmax(logits, dim=-1).numpy()
    tokens    = [tok.decode([t]) for t in ids[0]]

    # For each position (except last): 
    # surprisal, entropy, KL from uniform
    results = []
    for i in range(1, len(tokens)):
        p       = probs[i-1]            # model dist at position i-1
        true_id = ids[0, i].item()
        surpris = -np.log(p[true_id])   # -log P(true token)
        h       = -np.sum(p * np.log(p + 1e-12)) # H[model at this step]
        results.append((tokens[i], surpris, h))

    avg_surpris = np.mean([r[1] for r in results])
    perplexity  = np.exp(avg_surpris)
    avg_entropy = np.mean([r[2] for r in results])

    print(f"Text: {text[:60]}...")
    print(f"Avg surprisal: {avg_surpris:.3f} nats  Perplexity: {perplexity:.1f}")
    print(f"Avg entropy:   {avg_entropy:.3f} nats (model confidence per step)")
    print(f"Top-5 most surprising tokens:")
    for tok_text, surpris, h in sorted(results, key=lambda r: -r[1])[:5]:
        print(f"  '{tok_text}': {surpris:.2f} nats  (H={h:.2f})")

analyse_text("The transformer architecture was introduced by Vaswani et al. in 2017.")

Reference Card

Quantity	Formula	ML meaning
Surprisal I(x)	-log P(x)	Loss contribution of one token
Entropy H(P)	-Σ P log P	Uncertainty of a distribution
Cross-entropy H(P,Q)	-Σ P log Q	Average loss when model Q encodes P
KL divergence D_KL(P∥Q)	Σ P log(P/Q)	Gap between model and true dist
Mutual info I(X;Y)	H(X)-H(X\|Y)	Info shared between two variables
Perplexity PP	exp(H(P,Q))	Effective vocab size the model sees
ELBO	E[log P(x\|z)] - KL(q∥p)	Tractable VAE training objective
InfoNCE	-E[log e^{sim}/Σ e^{sim}]	MI lower bound; contrastive loss

Exercises

Exercises 1–9 are pen-and-paper; 10–16 require code.

✎

Exercise 1: Pen & Paper

Compute the surprisal (in bits) of rolling a 6 on a fair die. How many fair-die rolls give 1 bit of surprisal?

✎

Exercise 2: Pen & Paper

Show that H(P) = log N when P is uniform over N outcomes. Why is this the maximum entropy?

✎

Exercise 3: Pen & Paper

Prove H(P,Q) ≥ H(P) using the KL decomposition H(P,Q) = H(P) + KL(P∥Q) and the fact that KL ≥ 0.

✎

Exercise 4: Pen & Paper

Derive the gradient ∂CE/∂z_c = p_c - y_c from scratch. Start from H(y_onehot, softmax(z)).

✎

Exercise 5: Pen & Paper

Let P = [0.5, 0.3, 0.2] and Q = [0.6, 0.3, 0.1]. Compute H(P), H(P,Q), D_KL(P∥Q) and verify H(P,Q) = H(P) + KL.

✎

Exercise 6: Pen & Paper

A language model assigns P(correct) = 0.001 to a rare token. What is the surprisal? If this happens 5% of the time and P(correct) = 0.5 otherwise, what is the perplexity?

✎

Exercise 7: Pen & Paper

State and prove Jensen’s inequality for the concave function f(x) = log x. Use it to prove KL(P∥Q) ≥ 0.

✎

Exercise 8: Pen & Paper

Show that mutual information I(X;Y) = 0 iff X and Y are independent. (Hint: use the KL form of MI.)

✎

Exercise 9: Pen & Paper

In the ELBO, explain why the KL term D_KL(q(z|x) ∥ p(z)) acts as a regularizer on the encoder. What happens if β → ∞?

✎

Exercise 10: Code

Implement entropy, cross-entropy, and KL divergence from scratch. Verify each against scipy.stats.entropy for 10 random distribution pairs.

✎

Exercise 11: Code

Plot H(p, 1-p) in bits for p ∈ [0.01, 0.99]. Where is cross-entropy minimised? What is its value when Q is wrong by a lot?

✎

Exercise 12: Code

Plot D_KL(P∥Q) and D_KL(Q∥P) as a function of Q for a fixed P = [0.8, 0.2]. Explain the asymmetry visually.

✎

Exercise 13: Code Lab 4.1

Run the analyse_text function on three texts: (1) a Wikipedia sentence, (2) a sentence from a technical ML paper, (3) random words. Compare surprisal and entropy profiles.

✎

Exercise 14: Code

Implement InfoNCE loss from scratch. Verify that with N=128 and random embeddings, the loss ≈ log(128) ≈4.85. With perfect embeddings (z1 = z2), loss ≈ 0.

✎

Exercise 15: Code

Train a 2-layer MLP on MNIST with cross-entropy loss. Log (a) the CE loss, (b) the model entropy per prediction, (c) perplexity over 10 epochs. How do they relate?

✎

Exercise 16: Code (Challenge)

Implement a β-VAE. Train on MNIST with β ∈ {0.1, 1.0, 4.0, 10.0}. Plot the ELBO, KL term, and reconstruction loss separately. What does high β do to the latent space?

Further reading: “A Mathematical Theory of Communication” (Shannon, 1948) — the original paper, surprisingly readable. “Elements of Information Theory” (Cover & Thomas, 2006) — the standard textbook, Chapters 1–5. 3Blue1Brown’s video “But what is information?” for geometric intuition.

Next: Chapter 5 — Numerical Methods & Stability. We bridge theory and practice: how floating-point arithmetic, log-sum-exp tricks, and mixed precision interact with everything we’ve learned.

✎ 16 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 3. Probability & Statistics

NextCh 5. Numerical Methods & Stability

→

Information Theory

Learning Objectives

Why Information Theory for ML?

Three Equivalent Framings

Surprisal (Self-Information)

Definition

Surprisal and the NLL Loss

Shannon Entropy

Definition and Derivation

Properties of Entropy

Cross-Entropy

Definition

One-Hot Targets: Why NLL Equals Cross-Entropy

The Gradient of Cross-Entropy w.r.t. Logits

KL Divergence — A Deep Dive

Definition and Decomposition

Forward vs. Reverse KL: Mode-Seeking vs. Mean-Seeking

Jensen’s Inequality

Statement

The ELBO: Jensen Applied to Variational Inference

Mutual Information

Definition and Decompositions

Mutual Information in Contrastive Learning

Conditional Entropy & the Data Processing Inequality

Conditional Entropy

Data Processing Inequality

The Information-Theoretic View of Language Models

The Optimal Language Model

Chapter Summary & Exercises

Reference Card

Exercises