Probability & Statistics
A language model assigns probabilities to sequences of tokens. To understand why that’s meaningful and how to train it, we need the vocabulary of probability theory: random variables, distributions, and the rules that govern them.
Random Variables
Random variables come in two flavours that require different mathematical treatment:
Probability Mass & Density Functions
For discrete variables, we use a Probability Mass Function (PMF): P(X = x) directly gives the probability. For continuous variables, we use a Probability Density Function (PDF): probability is the area under the curve over an interval.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# ---- Discrete: Categorical distribution (token probabilities) ----
vocab = ['the', 'cat', 'sat', 'on', 'mat']
probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
print(probs.sum()) # 1.0 — must sum to 1
# PMF: P(X = 'cat') is just probs[1]
print(f"P(cat) = {probs[1]:.2f}")
# Sample from categorical (simulates LM token sampling)
samples = np.random.choice(vocab, size=10, p=probs)
print(samples) # e.g. ['the' 'cat' 'the' 'on' 'the' ...]
# ---- Continuous: Gaussian distribution (weight init) ----
mu, sigma = 0.0, 0.02 # Xavier-like init for small nets
weights = np.random.normal(mu, sigma, size=(768, 768))
print(f"Weight mean: {weights.mean():.4f}") # ≈0.0
print(f"Weight std: {weights.std():.4f}") # ≈0.02
# PDF value at a point (NOT a probability — it’s a density)
pdf_at_zero = stats.norm.pdf(0, mu, sigma) # ≈19.9 (>1 is fine for PDF!)The Essential Distributions for ML
Language modeling is fundamentally about the joint distribution over sequences. The chain rule of probability allows us to factorize this frightening high-dimensional object into a product of tractable conditional distributions — one per token.
Joint Distribution P(X, Y)
The joint distribution captures the probability of two (or more) variables taking specific values simultaneously. A language model’s true goal is to learn the joint distribution over all tokens in a sequence.
import numpy as np
# Joint distribution P(Weather, Mood)
# Rows: Weather = {Sunny, Rainy}
# Cols: Mood = {Happy, Sad}
joint = np.array([
[0.40, 0.10], # Sunny: 40% happy, 10% sad
[0.15, 0.35], # Rainy: 15% happy, 35% sad
])
print(joint.sum()) # 1.0
# Marginal P(Weather): sum over Mood (axis=1)
p_weather = joint.sum(axis=1) # [0.5, 0.5] Sunny/Rainy equally likely
# Marginal P(Mood): sum over Weather (axis=0)
p_mood = joint.sum(axis=0) # [0.55, 0.45] Happy/Sad
# Conditional P(Mood | Weather=Sunny): joint / marginal
p_mood_given_sunny = joint[0] / p_weather[0] # [0.8, 0.2]
print(p_mood_given_sunny) # [0.8 0.2] — sunny → usually happyThe Chain Rule of Probability
Any joint distribution can be factorized into a product of conditionals. The order of factorization is arbitrary — left-to-right is just a convention:
import numpy as np
# Simulate a tiny 4-token language model
# P(sequence) = P(x1) * P(x2|x1) * P(x3|x1,x2) * ...
np.random.seed(42)
vocab = ['<s>', 'the', 'cat', 'sat', '</s>']
def mock_lm_probs(context):
"""Mock language model: returns next-token probs given context."""
if context == ['<s>']: return [0.0, 0.7, 0.2, 0.0, 0.1] # after <s>
if context[-1] == 'the': return [0.0, 0.0, 0.6, 0.1, 0.3] # after 'the'
if context[-1] == 'cat': return [0.0, 0.0, 0.0, 0.8, 0.2] # after 'cat'
return [0.0, 0.0, 0.0, 0.0, 1.0] # default: end
# Generate a sequence and compute its probability
context = ['<s>']
log_prob = 0.0 # work in log space to avoid underflow
for step in range(3):
probs = mock_lm_probs(context)
tok_idx = np.random.choice(len(vocab), p=probs)
tok = vocab[tok_idx]
log_prob += np.log(probs[tok_idx] + 1e-10) # log P(x_t | x_{<t})
context.append(tok)
print(f"Step {step+1}: generated '{tok}', log_prob so far = {log_prob:.3f}")
print(f"Sequence: {context} P(seq) ≈ {np.exp(log_prob):.4f}")Independence & Conditional Independence
Two variables X and Y are independent if knowing one tells you nothing about the other: P(X,Y) = P(X)P(Y). Conditional independence is the weaker (and more practically useful) form: X ⊥ Y | Z means X and Y are independent once we know Z.
Bayes’ theorem is arguably the most important equation in machine learning. It describes how to rationally update a prior belief in the light of new evidence. Every Bayesian model, MAP estimator, and probabilistic neural network is an application of this rule.
The Formula
# Classic example: disease testing
# Shows why base rates (priors) matter enormously
p_disease = 0.001 # P(disease): 1 in 1000 people have it
p_pos_given_d = 0.99 # P(positive | disease): 99% sensitivity
p_pos_given_nd= 0.05 # P(positive | no disease): 5% false positive
# P(positive) by total probability theorem
p_positive = (p_pos_given_d * p_disease +
p_pos_given_nd * (1 - p_disease))
# Bayes: P(disease | positive)
p_disease_given_pos = (p_pos_given_d * p_disease) / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.1%}")
# ≈ 1.9% — despite a 99% accurate test!
# Intuition: 1000 people → 1 has disease (1 likely positive)
# 999 don't → ~50 false positives
# So among ~51 positives, only 1 is truly sick: 1/51 ≈ 2%Maximum A Posteriori (MAP) Estimation
MLE finds the θ that maximises P(data|θ). MAP goes one step further and finds the θ that maximises the posterior P(θ|data) = P(data|θ) × P(θ). The prior acts as a regularizer.
import numpy as np
# Training loss with L2 regularization = MAP with Gaussian prior
# L2 prior: P(θ) ∝ exp(-λ||θ||^2) → log P(θ) = -λ||θ||^2
# MAP objective: maximize log P(data|θ) - λ||θ||^2
# = minimize NLL + λ||θ||^2 ← standard L2-regularized training!
def map_loss(predictions, targets, weights, lam=0.01):
"""MAP loss = NLL + L2 regularization (Gaussian prior on weights)."""
nll = -np.mean(targets * np.log(predictions + 1e-9)) # cross-entropy
reg = lam * np.sum(weights**2) # L2 prior
return nll + reg
# L1 regularization corresponds to a Laplace prior on weights:
# P(θ) ∝ exp(-λ||θ||_1) → promotes sparsity
def map_loss_l1(predictions, targets, weights, lam=0.01):
nll = -np.mean(targets * np.log(predictions + 1e-9))
reg = lam * np.sum(np.abs(weights))
return nll + regRather than working with full distributions, we often summarize them with moments: the expected value (centre of mass), variance (spread), and covariance (linear dependence between two variables). These quantities appear throughout neural network design and analysis.
Expected Value
import numpy as np
# ---- Discrete: expected number of heads in 3 coin flips ----
# X ~ Binomial(n=3, p=0.5)
values = np.array([0, 1, 2, 3])
probs = np.array([0.125, 0.375, 0.375, 0.125])
E_X = np.sum(values * probs)
print(f"E[X] = {E_X}") # 1.5 = n*p = 3*0.5
# ---- Monte Carlo estimation: sample and average ----
# (This is how E[loss] is estimated during training!)
samples = np.random.binomial(n=3, p=0.5, size=100000)
print(f"MC estimate: {samples.mean():.4f}") # ≈1.5
# ---- Continuous: E[X] for Gaussian μ=3, σ=2 ----
samples_g = np.random.normal(loc=3, scale=2, size=1000000)
print(f"E[N(3,4)]: {samples_g.mean():.4f}") # ≈3.0Variance & Standard Deviation
import numpy as np
# Variance measures spread around the mean
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = x.mean() # 5.0
var = np.mean((x - mean)**2) # 4.0 (population variance)
std = np.sqrt(var) # 2.0
# CAUTION: E[X^2] - E[X]^2 is numerically unstable for large values!
# np.var() and np.std() use the safer two-pass formula
print(np.var(x)) # 4.0 (matches)
print(np.var(x, ddof=1)) # 4.57 (sample variance, unbiased)
# Layer norm computes mean and variance per token
def layer_norm(x, eps=1e-5):
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True, ddof=0)
return (x - mu) / np.sqrt(var + eps)
acts = np.random.randn(4, 768) # (batch, dim)
normed = layer_norm(acts)
print(normed.mean(axis=-1)) # [≈0, ≈0, ≈0, ≈0]
print(normed.std(axis=-1)) # [≈1, ≈1, ≈1, ≈1]Covariance & Correlation
import numpy as np
np.random.seed(0)
# Simulate 3 features with known covariance structure
n = 500
x1 = np.random.randn(n)
x2 = 0.8 * x1 + 0.6 * np.random.randn(n) # strongly correlated with x1
x3 = np.random.randn(n) # independent
X = np.stack([x1, x2, x3], axis=1) # (500, 3)
cov = np.cov(X.T) # (3,3) covariance matrix
print(np.round(cov, 2))
# [[ 1.00 0.81 0.01] <- x1-x2 strongly correlated
# [ 0.81 1.01 -0.02]
# [ 0.01 -0.02 0.98]] <- x3 near-zero correlation with x1, x2
# Correlation matrix: normalise to [-1, 1]
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)
print(np.round(corr, 2))The Gaussian (Normal) distribution is the most important continuous distribution in machine learning. It appears in weight initialization, data augmentation, latent spaces of generative models, and noise models. Understanding it deeply — not just its formula — is essential.
PDF and Parameters
μ is the mean (peak of the bell), σ² is the variance (width of the bell), and σ is the standard deviation. The 68-95-99.7 rule tells us that 68% of probability mass lies within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.
import numpy as np
from scipy import stats
mu, sigma = 0.0, 1.0 # standard normal
# 68-95-99.7 rule: verified numerically
within_1s = stats.norm.cdf(mu+sigma) - stats.norm.cdf(mu-sigma)
within_2s = stats.norm.cdf(mu+2*sigma) - stats.norm.cdf(mu-2*sigma)
within_3s = stats.norm.cdf(mu+3*sigma) - stats.norm.cdf(mu-3*sigma)
print(f"±1σ: {within_1s:.1%} ±2σ: {within_2s:.1%} ±3σ: {within_3s:.1%}")
# ±1σ: 68.3% ±2σ: 95.4% ±3σ: 99.7%
# --- Central Limit Theorem: sums of ANY distribution converge to Gaussian ---
# Sum of 30 uniform[0,1] samples ≈ Gaussian
n_experiments = 10000
n_samples = 30
sums = np.random.uniform(0, 1, size=(n_experiments, n_samples)).sum(axis=1)
_, p_value = stats.normaltest(sums)
print(f"Normality test p-value: {p_value:.3f}") # >>0.05: looks Gaussian
# --- Xavier / He weight initialization ---
def xavier_init(fan_in, fan_out):
"""Keeps variance stable through layers of tanh activations."""
std = np.sqrt(2.0 / (fan_in + fan_out))
return np.random.normal(0, std, size=(fan_in, fan_out))
def he_init(fan_in, fan_out):
"""Keeps variance stable through layers of ReLU activations."""
std = np.sqrt(2.0 / fan_in)
return np.random.normal(0, std, size=(fan_in, fan_out))
W_xavier = xavier_init(512, 512) # std ≈ 0.0625
W_he = he_init(512, 512) # std ≈ 0.0625 * sqrt(2) ≈ 0.0884Multivariate Gaussian
For vectors in ℝⁿ, the Gaussian generalizes to the multivariate normal: N(x; μ, Σ) where Σ is the covariance matrix encoding the shape of the distribution. The covariance matrix determines whether the distribution is spherical, elongated, or tilted.
import numpy as np
# Spherical Gaussian: Σ = I (independent, equal variance)
mu_sphere = [0.0, 0.0]
cov_sphere = [[1.0, 0.0], [0.0, 1.0]]
# Elongated Gaussian: high variance in x1, low in x2
mu_elong = [0.0, 0.0]
cov_elong = [[4.0, 0.0], [0.0, 0.25]]
# Correlated Gaussian: positive off-diagonal
mu_corr = [1.0, 2.0]
cov_corr = [[1.0, 0.8], [0.8, 1.0]]
# Sample 500 points from each
samples_sphere = np.random.multivariate_normal(mu_sphere, cov_sphere, 500)
samples_elong = np.random.multivariate_normal(mu_elong, cov_elong, 500)
samples_corr = np.random.multivariate_normal(mu_corr, cov_corr, 500)
# In a VAE, we sample z ~ N(mu(x), diag(sigma^2(x)))
# where mu and sigma are output by the encoder network
def vae_reparameterize(mu, log_var):
"""Reparameterization trick: z = mu + eps*sigma, eps~N(0,I)."""
std = np.exp(0.5 * log_var)
eps = np.random.randn(*mu.shape)
return mu + eps * std # differentiable w.r.t. mu, log_var!MLE is the principle behind training almost every neural network. The fundamental insight: choose model parameters that make the observed training data as probable as possible. It sounds obvious — but tracing this principle through to its mathematical conclusion reveals exactly why we use cross-entropy loss.
The Setup
We have a dataset D = {x₁, x₂, ..., xₙ} of n independent observations. Our model has parameters θ. The likelihood of observing this data under our model is:
MLE finds the θ that maximises L(θ). We work in log-space because: (1) products become sums (numerically stable), and (2) log is monotone so argmax is preserved.
MLE Derivation: Categorical Distribution
For a vocabulary of V tokens, with counts cᵥ for each token v, the MLE solution is simply the empirical frequency. This is the simplest possible language model — a unigram.
import numpy as np
from collections import Counter
# Corpus of tokens
corpus = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
counts = Counter(corpus) # {'the':3, 'cat':2, 'sat':1, 'on':1, 'mat':1}
n = len(corpus)
# MLE solution: p_v = count_v / n
mle_probs = {word: cnt/n for word, cnt in counts.items()}
print(mle_probs) # {'the': 0.375, 'cat': 0.25, ...}
# Log-likelihood of the corpus under MLE model (maximum possible)
log_L = sum(np.log(mle_probs[w]) for w in corpus)
print(f"Log-likelihood: {log_L:.3f}")
# Compare to a uniform model (all words equally likely)
p_uniform = 1.0 / len(counts)
log_L_uni = n * np.log(p_uniform)
print(f"Uniform log-lik: {log_L_uni:.3f}") # always lowerMLE = Minimizing Cross-Entropy Loss
Here is the key derivation. Training a neural language model with cross-entropy loss is exactly MLE over token predictions. Let’s prove it:
# For a classification problem with one-hot targets:
# - Target distribution: q (one-hot, e.g. [0,0,1,0,0])
# - Model distribution: p (softmax output, e.g. [.1,.2,.5,.1,.1])
import numpy as np
def cross_entropy_loss(logits, target_class):
"""Cross-entropy loss for a single example."""
# Numerically stable softmax
logits = logits - logits.max() # subtract max for stability
probs = np.exp(logits) / np.sum(np.exp(logits))
# Loss = -log P(correct class) [one-hot collapses sum]
return -np.log(probs[target_class] + 1e-9)
# Example: 5-class logits, true class = 2
logits = np.array([1.0, 2.0, 3.0, 0.5, 0.2])
loss = cross_entropy_loss(logits, target_class=2)
print(f"Loss = {loss:.4f}") # lower = more confident in class 2
# Connection: minimizing CE == maximizing log P(data|theta)
# CE = H(q, p) = -sum_v q(v) log p(v)
# With one-hot q: CE = -log p(correct_class) = -log-likelihood!
# Training objective: minimize (1/N) sum_i CE_i == maximize log-likelihoodEntropy measures how uncertain a distribution is. KL divergence measures how different two distributions are. Both are fundamental to loss functions, evaluation metrics, and the mathematics of RLHF.
Entropy
Entropy is maximized by the uniform distribution (maximum uncertainty) and equals zero for a deterministic distribution (no uncertainty).
import numpy as np
from scipy.stats import entropy
def shannon_entropy(probs, base=np.e):
"""Shannon entropy. base=np.e gives nats, base=2 gives bits."""
probs = np.array(probs)
probs = probs[probs > 0] # avoid log(0)
return -np.sum(probs * np.log(probs) / np.log(base))
# Deterministic: one event with probability 1
print(shannon_entropy([1.0, 0.0, 0.0])) # 0.0 nats (no uncertainty)
# Uniform over 2 outcomes (fair coin)
print(shannon_entropy([0.5, 0.5], base=2)) # 1.0 bit
# Uniform over 4 outcomes (2 coins)
print(shannon_entropy([.25]*4, base=2)) # 2.0 bits
# Realistic language model output (50257-token vocabulary)
logits_peaked = np.zeros(50257); logits_peaked[42] = 10.0 # very confident
logits_flat = np.zeros(50257) # maximally uncertain
def softmax(x): e = np.exp(x - x.max()); return e / e.sum()
print(f"Peaked H = {shannon_entropy(softmax(logits_peaked)):.2f} nats") # ≈0.0
print(f"Flat H = {shannon_entropy(softmax(logits_flat)):.2f} nats") # ≈10.82KL Divergence
KL divergence measures how many extra nats we spend on average when we use Q to encode messages from P. It is always ≥ 0 (Gibbs’ inequality) and equals zero iff P = Q. Crucially, it is NOT symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P).
import numpy as np
def kl_divergence(P, Q, eps=1e-10):
"""KL(P || Q): how different Q is from P, measured from P's perspective."""
P, Q = np.array(P) + eps, np.array(Q) + eps
P, Q = P / P.sum(), Q / Q.sum() # normalize
return np.sum(P * np.log(P / Q))
# KL is NOT symmetric
P = [0.9, 0.1] # true distribution (confident)
Q = [0.5, 0.5] # model distribution (uncertain)
print(f"KL(P||Q) = {kl_divergence(P, Q):.4f}") # 0.5108
print(f"KL(Q||P) = {kl_divergence(Q, P):.4f}") # 0.4185 (different!)
# KL >= 0: proved empirically
for _ in range(1000):
p = np.random.dirichlet(np.ones(10))
q = np.random.dirichlet(np.ones(10))
assert kl_divergence(p, q) >= -1e-9 # always true
# --- RLHF KL penalty ---
# RLHF objective: maximize E[r(x)] - beta * KL(pi_theta || pi_ref)
# The KL term penalises the fine-tuned policy pi_theta for
# drifting too far from the reference policy pi_ref.
# Without it, the model collapses to degenerate reward-hacking outputs.Once a language model produces logits over the vocabulary, those logits must be converted to a probability distribution and a token sampled from it. This sampling step is the primary knob for controlling the “creativity” of generated text.
Greedy vs. Sampling
Greedy decoding always picks the most probable token. It is fast and deterministic but tends to produce repetitive, conservative text. Sampling introduces randomness that makes outputs more diverse and often more natural.
import numpy as np
def softmax(logits):
e = np.exp(logits - logits.max()) # numerically stable
return e / e.sum()
vocab = ['the', 'a', 'cat', 'dog', 'ran', 'sat', 'leapt']
logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])
probs = softmax(logits)
# Greedy: always pick argmax
greedy_token = vocab[np.argmax(probs)]
print(f"Greedy: '{greedy_token}' (always)")
# Sampling: draw from distribution
np.random.seed(42)
samples = [vocab[np.random.choice(len(vocab), p=probs)] for _ in range(10)]
print(f"Samples: {samples}")
# e.g. ['the', 'the', 'a', 'the', 'cat', 'the', 'a', 'the', 'the', 'a']Temperature Scaling
Temperature T scales the logits before softmax. T < 1 makes the distribution sharper (more deterministic). T > 1 makes it flatter (more random). T = 1 is unmodified sampling.
import numpy as np
def temperature_sample(logits, temperature=1.0):
"""Sample from logits with temperature scaling."""
assert temperature > 0, "Temperature must be positive"
scaled_logits = logits / temperature
probs = softmax(scaled_logits)
return np.random.choice(len(probs), p=probs), probs
vocab = ['the', 'a', 'cat', 'dog', 'ran', 'sat', 'leapt']
logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])
print(f"{'Temp':>6} {'Distribution (top 3)':>40}")
for T in [0.1, 0.5, 1.0, 1.5, 2.0]:
_, probs = temperature_sample(logits, T)
top3 = sorted(zip(vocab, probs), key=lambda x: -x[1])[:3]
top3_str = ' '.join(f"{w}:{p:.2f}" for w, p in top3)
print(f"{T:>6.1f} {top3_str}")
# T=0.1: the:0.99 a:0.00 cat:0.00 (near-greedy)
# T=1.0: the:0.44 a:0.21 cat:0.14 (default sampling)
# T=2.0: the:0.26 a:0.21 cat:0.18 (almost uniform)Top-k Sampling
Top-k sampling restricts sampling to the k most probable tokens, setting all others to zero probability. This prevents very unlikely tokens from ever being sampled, reducing “tail risk” in generation.
import numpy as np
def top_k_sample(logits, k, temperature=1.0):
"""Zero out all but top-k logits, then sample with temperature."""
assert 1 <= k <= len(logits)
# Find k-th largest logit value
threshold = np.partition(logits, -k)[-k]
# Mask out all logits below threshold
masked_logits = np.where(logits >= threshold, logits, -np.inf)
scaled_logits = masked_logits / temperature
probs = softmax(scaled_logits)
return np.random.choice(len(probs), p=probs)
logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])
# With k=3, only tokens 0,1,2 are candidates
np.random.seed(42)
samples_k3 = [vocab[top_k_sample(logits, k=3)] for _ in range(20)]
from collections import Counter
print(Counter(samples_k3)) # only 'the', 'a', 'cat' appearNucleus (Top-p) Sampling
Top-p sampling (also called nucleus sampling) selects the smallest set of tokens whose cumulative probability exceeds p, then samples from that set. This dynamically adjusts the candidate set size — narrower when the model is confident, wider when uncertain.
import numpy as np
def nucleus_sample(logits, p, temperature=1.0):
"""
Nucleus (top-p) sampling.
Selects the smallest set of tokens whose cumulative prob >= p.
"""
scaled_logits = logits / temperature
probs = softmax(scaled_logits)
# Sort by probability descending
sorted_idx = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_idx]
cumulative = np.cumsum(sorted_probs)
# Find cutoff: first index where cumulative >= p
cutoff_idx = np.searchsorted(cumulative, p) + 1
nucleus_idx = sorted_idx[:cutoff_idx]
# Renormalise over nucleus and sample
nucleus_probs = probs[nucleus_idx]
nucleus_probs = nucleus_probs / nucleus_probs.sum()
chosen = np.random.choice(nucleus_idx, p=nucleus_probs)
return chosen
# Comparison: peaked distribution → small nucleus
peaked_logits = np.array([10.0, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0])
flat_logits = np.array([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8])
# Count nucleus size over many samples
def nucleus_size(logits, p):
probs = softmax(logits)
s_probs = np.sort(probs)[::-1]
return int(np.searchsorted(np.cumsum(s_probs), p)) + 1
print(f"Peaked: nucleus size = {nucleus_size(peaked_logits, 0.9)} (p=0.9)") # 1
print(f"Flat: nucleus size = {nucleus_size(flat_logits, 0.9)} (p=0.9)") # 6Perplexity is the exponential of the average cross-entropy per token. It measures how surprised, on average, the model is by each token in the evaluation set. A perplexity of k means the model is as confused as if it were choosing uniformly from k equally likely options at every step.
Definition
import numpy as np
def perplexity_from_log_probs(log_probs):
"""
Compute perplexity from a list of log P(x_t | x_{<t}).
log_probs: list of log-probabilities for each token.
"""
avg_nll = -np.mean(log_probs) # average negative log-likelihood
return np.exp(avg_nll)
# Perfect model: always predicts correct token with prob 1.0
perfect_log_probs = [np.log(1.0)] * 100
print(perplexity_from_log_probs(perfect_log_probs)) # 1.0
# Uniform model: 50257-token vocabulary
uniform_log_probs = [np.log(1/50257)] * 100
print(perplexity_from_log_probs(uniform_log_probs)) # 50257.0
# GPT-2 large on WikiText-103: PP ≈18
# GPT-3 175B on PTB: PP ≈20.5
# Random: PP = vocab_size ≈50257 for GPT tokenizer
# --- Using HuggingFace to compute real perplexity ---
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
def compute_perplexity(text, model, tokenizer, stride=512):
"""Sliding-window perplexity (handles texts longer than max_length)."""
encodings = tokenizer(text, return_tensors='pt')
input_ids = encodings.input_ids
nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
end = min(begin + 1024, input_ids.size(1))
target_len = end - prev_end
chunk = input_ids[:, begin:end]
target_ids = chunk.clone()
target_ids[:, :-target_len] = -100 # ignore context tokens
with torch.no_grad():
loss = model(chunk, labels=target_ids).loss
nlls.append(loss * target_len)
prev_end = end
return torch.exp(torch.stack(nlls).sum() / prev_end).item()Perplexity Across Models — A Reference Table
When does a model memorize training data versus generalize to new examples? The bias–variance framework provides the answer, giving a decomposition of prediction error into three components with very different remedies.
The Decomposition
import numpy as np
np.random.seed(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
def fit_poly(X_train, y_train, degree, x_test):
"""Fit a polynomial of given degree, evaluate at x_test."""
coeffs = np.polyfit(X_train, y_train, degree)
return np.polyval(coeffs, x_test)
n_datasets, n_train = 200, 20
x_test = 0.5
sigma = 0.3 # noise level
print(f"{'Degree':>8} {'Bias^2':>10} {'Variance':>10} {'Total MSE':>12}")
for degree in [1, 3, 7, 15]:
preds = []
for _ in range(n_datasets):
X_tr = np.random.uniform(0, 1, n_train)
y_tr = true_fn(X_tr) + sigma * np.random.randn(n_train)
preds.append(fit_poly(X_tr, y_tr, degree, x_test))
preds = np.array(preds)
y_true = true_fn(x_test)
bias2 = (preds.mean() - y_true) ** 2
var = preds.var()
mse = bias2 + var + sigma**2
print(f"{degree:>8} {bias2:>10.4f} {var:>10.4f} {mse:>12.4f}")
# degree=1: high bias, low variance (underfitting)
# degree=7: low bias, low variance (sweet spot)
# degree=15: low bias, HIGH variance (overfitting)Concept Map
Exercises
Exercises 1–12 are pen-and-paper; 13–20 require code. The coding exercises build a miniature language modelling pipeline from scratch.
Further reading: “Pattern Recognition and Machine Learning” (Bishop, 2006) — Chapters 1–3 for a comprehensive Bayesian treatment. “The Elements of Statistical Learning” (Hastie et al.) for bias-variance and classical estimators. 3Blue1Brown’s Bayes’ theorem video for geometric intuition.
Next: Chapter 4 — Information Theory. We complete the probabilistic picture by formalizing entropy, mutual information, and the connection between compression and prediction.