Language Model Fundamentals
Everything in Part II so far has assumed fixed-length inputs: a feature vector, a bag of words, a single embedding. But language is sequential and variable-length. “Dog bites man” and “Man bites dog” contain identical words yet mean opposite things. A model that ignores order cannot tell them apart.
The Three Requirements of a Sequence Model
A naïve fix — concatenate all word vectors into one giant input — fails all three: it requires a fixed length, learns position-specific weights, and cannot share statistical strength across positions. The recurrent neural network solves all three at once with a single elegant idea: process one token at a time, carrying a hidden state forward.
The Recurrence Relation
The hidden state hₜ is a function of the previous hidden state hₜ₋₁ and the current input xₜ. The weight matrices W_hh, W_xh, W_hy are shared across all time steps — this is the parameter sharing that lets an RNN handle any sequence length with a fixed parameter count.
The vanilla (Elman) RNN implements the recurrence directly with a single tanh nonlinearity. It is beautifully simple and was the workhorse of early neural NLP. But it has a crippling limitation that we will diagnose precisely in the next section.
Input: sequence x₁ … x_T, initial state h₀ = 0
Params: W_xh (D→H), W_hh (H→H), W_hy (H→V), biases b_h, b_y
for t = 1 … T:
aₜ ← W_xh · xₜ + W_hh · hₜ₋₁ + b_h
hₜ ← tanh(aₜ) # new hidden state
ŷₜ ← softmax(W_hy · hₜ + b_y) # output distribution
return h₁ … h_T, ŷ₁ … ŷ_Timport numpy as np
class VanillaRNN:
def __init__(self, vocab, hidden=64, seed=0):
rng = np.random.default_rng(seed)
s = 0.01 # small init to avoid early saturation
self.Wxh = rng.normal(0, s, (hidden, vocab))
self.Whh = rng.normal(0, s, (hidden, hidden))
self.Why = rng.normal(0, s, (vocab, hidden))
self.bh = np.zeros((hidden, 1))
self.by = np.zeros((vocab, 1))
self.H = hidden
def forward(self, inputs, h_prev):
"""inputs: list of one-hot column vectors. Returns states + caches."""
h = h_prev
xs, hs, ys, ps = {}, {-1: h_prev}, {}, {}
for t, x in enumerate(inputs):
xs[t] = x
h = np.tanh(self.Wxh @ x + self.Whh @ hs[t-1] + self.bh)
y = self.Why @ h + self.by
p = np.exp(y - y.max()); p /= p.sum() # stable softmax
hs[t], ys[t], ps[t] = h, y, p
return xs, hs, ps
# A character-level RNN trained on text can generate plausible characters,
# but it forgets context beyond ~10-20 steps. The next section shows why.Backpropagation Through Time (BPTT)
Training an RNN means unrolling it across all time steps into a (very deep) feed-forward graph, then applying backpropagation. Because W_hh is shared across every step, its gradient is the sum of contributions from all time steps — this summation is the source of both the RNN's power and its instability.
The crucial term is ∂hₜ/∂h_k — how the hidden state at time t depends on the hidden state at an earlier time k. Expanding it reveals the vanishing-gradient problem.
def backward(self, xs, hs, ps, targets):
dWxh = np.zeros_like(self.Wxh); dWhh = np.zeros_like(self.Whh)
dWhy = np.zeros_like(self.Why); dbh = np.zeros_like(self.bh); dby = np.zeros_like(self.by)
dh_next = np.zeros((self.H, 1))
for t in reversed(range(len(targets))): # walk backwards in time
dy = ps[t].copy(); dy[targets[t]] -= 1 # softmax-CE gradient: p - y
dWhy += dy @ hs[t].T; dby += dy
dh = self.Why.T @ dy + dh_next # grad flowing into h_t
draw = (1 - hs[t]**2) * dh # tanh'(a) = 1 - tanh²(a)
dbh += draw
dWxh += draw @ xs[t].T
dWhh += draw @ hs[t-1].T
dh_next = self.Whh.T @ draw # propagate to previous step
# Gradient clipping is ESSENTIAL for RNNs (prevents explosion)
for d in [dWxh, dWhh, dWhy, dbh, dby]:
np.clip(d, -5, 5, out=d)
return dWxh, dWhh, dWhy, dbh, dbyWe met vanishing gradients briefly in Chapter 2. In RNNs the problem is acute and structural, because the same recurrent weight matrix is applied at every time step. Understanding this precisely is the key to understanding why LSTMs were invented.
The Gradient Chain
The dependence of hidden state hₜ on an earlier state h_k is a product of Jacobians — one per intervening step:
This product of (t − k) matrices governs how a gradient at time t flows back to time k. Two failure modes:
| Regime | Condition | Consequence |
|---|---|---|
| Vanishing | Largest singular value of W_hh < 1 | Gradient → 0 exponentially; no long-range learning |
| Exploding | Largest singular value of W_hh > 1 | Gradient → ∞; NaN loss, unstable training |
| Stable (rare) | Singular values ≈ 1 | Knife-edge; hard to maintain during training |
The tanh derivative compounds the problem: tanh'(a) = 1 − tanh²(a) ≤ 1, and it is near zero whenever the unit is saturated (|a| large). So even with well-scaled W_hh, repeated multiplication by sub-unit derivatives drives gradients toward zero.
import numpy as np
def gradient_norm_over_time(W_hh, T=100, H=64):
"""Track how a gradient shrinks as it flows back through T steps."""
grad = np.random.randn(H, 1) # gradient at the final step
norms = [np.linalg.norm(grad)]
for _ in range(T):
# Each backward step: multiply by W_hhᵀ and a tanh' factor (~0.5 avg)
grad = (W_hh.T @ grad) * 0.5
norms.append(np.linalg.norm(grad))
return np.array(norms)
rng = np.random.default_rng(0)
H = 64
# Case 1: small weights → vanishing
W_small = rng.normal(0, 0.5/np.sqrt(H), (H, H))
norms_v = gradient_norm_over_time(W_small)
print(f"After 100 steps (small W): {norms_v[-1]/norms_v[0]:.2e} of original")
# After 100 steps (small W): 3.7e-31 of original -- completely vanished
# Case 2: large weights → exploding
W_large = rng.normal(0, 3.0/np.sqrt(H), (H, H))
norms_e = gradient_norm_over_time(W_large)
print(f"After 100 steps (large W): {norms_e[-1]/norms_e[0]:.2e} of original")
# After 100 steps (large W): 1.4e+18 of original -- exploded to NaN territory
# Conclusion: a gradient cannot survive 100 steps of multiplication by W_hh.
# Information from t=1 cannot influence the loss at t=100. The RNN is 'amnesiac'.The Long Short-Term Memory network (Hochretier & Schmidhuber, 1997) solves the vanishing gradient problem with a brilliant architectural change: a separate cell state that flows through time with only minor, additive modifications. Gradients can travel along this cell-state highway nearly unchanged, enabling the network to learn dependencies hundreds of steps long.
The Cell State and Three Gates
The LSTM maintains two states: the hidden state hₜ (the output) and the cell state cₜ (the long-term memory). Three gates — each a sigmoid layer producing values in [0,1] — control the flow of information:
| Gate | Symbol | Role |
|---|---|---|
| Forget | fₜ | What fraction of the old cell state to keep (0 = erase, 1 = retain) |
| Input | iₜ | What fraction of the new candidate to write into the cell |
| Output | oₜ | What fraction of the cell state to expose as the hidden state |
# Concatenate previous hidden state and current input: [hₜ₋₁ ; xₜ]
fₜ ← σ(W_f · [hₜ₋₁, xₜ] + b_f) # forget gate
iₜ ← σ(W_i · [hₜ₋₁, xₜ] + b_i) # input gate
c̃ₜ ← tanh(W_c · [hₜ₋₁, xₜ] + b_c) # candidate cell content
oₜ ← σ(W_o · [hₜ₋₁, xₜ] + b_o) # output gate
cₜ ← fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ # ← THE GRADIENT HIGHWAY
hₜ ← oₜ ⊙ tanh(cₜ) # gated outputThe critical line is the cell-state update: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ. Its gradient with respect to cₜ₋₁ is simply fₜ — an element-wise multiplication, not a matrix multiplication through a saturating nonlinearity. When the forget gate fₜ ≈ 1, the gradient flows backward essentially unchanged. This is the additive, near-identity path that defeats vanishing gradients.
import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))
class LSTMCell:
def __init__(self, input_dim, hidden, seed=0):
rng = np.random.default_rng(seed)
Z = input_dim + hidden # concatenated [h, x] size
s = 1.0 / np.sqrt(Z)
# One weight matrix per gate; rows map [h_prev, x] → gate value
self.Wf = rng.normal(0, s, (hidden, Z)); self.bf = np.ones((hidden,1)) # bf=1: start by remembering
self.Wi = rng.normal(0, s, (hidden, Z)); self.bi = np.zeros((hidden,1))
self.Wc = rng.normal(0, s, (hidden, Z)); self.bc = np.zeros((hidden,1))
self.Wo = rng.normal(0, s, (hidden, Z)); self.bo = np.zeros((hidden,1))
self.H = hidden
def step(self, x, h_prev, c_prev):
z = np.vstack([h_prev, x]) # concatenate
f = sigmoid(self.Wf @ z + self.bf) # forget gate
i = sigmoid(self.Wi @ z + self.bi) # input gate
c_tilde = np.tanh(self.Wc @ z + self.bc) # candidate
o = sigmoid(self.Wo @ z + self.bo) # output gate
c = f * c_prev + i * c_tilde # cell update (the highway)
h = o * np.tanh(c) # gated hidden output
return h, c, (f, i, c_tilde, o, z) # cache for backprop
# Note bf initialised to 1.0: the standard 'forget gate bias' trick.
# It makes the LSTM remember by default at the start of training,
# dramatically improving learning of long-range dependencies (Jozefowicz 2015).The GRU (Cho et al., 2014) simplifies the LSTM: it merges the cell state and hidden state into one, and uses only two gates instead of three. It has fewer parameters, trains faster, and performs comparably to the LSTM on most tasks — making it a popular default.
Two Gates: Reset and Update
rₜ ← σ(W_r · [hₜ₋₁, xₜ] + b_r) # reset gate
zₜ ← σ(W_z · [hₜ₋₁, xₜ] + b_z) # update gate
h̃ₜ ← tanh(W_h · [rₜ ⊙ hₜ₋₁, xₜ] + b_h) # candidate state
hₜ ← (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ # interpolate old & newThe reset gate rₜ controls how much of the previous state contributes to the candidate; when rₜ ≈ 0, the unit ignores its past and behaves like reading a fresh sequence. The update gate zₜ interpolates between keeping the old state and adopting the new candidate — it plays the combined role of the LSTM's forget and input gates.
import numpy as np
class GRUCell:
def __init__(self, input_dim, hidden, seed=0):
rng = np.random.default_rng(seed); Z = input_dim + hidden; s = 1/np.sqrt(Z)
self.Wr = rng.normal(0, s, (hidden, Z)); self.br = np.zeros((hidden,1))
self.Wz = rng.normal(0, s, (hidden, Z)); self.bz = np.zeros((hidden,1))
self.Wh = rng.normal(0, s, (hidden, Z)); self.bh = np.zeros((hidden,1))
def step(self, x, h_prev):
z_cat = np.vstack([h_prev, x])
r = sigmoid(self.Wr @ z_cat + self.br) # reset gate
z = sigmoid(self.Wz @ z_cat + self.bz) # update gate
# Candidate uses reset-gated previous state
h_tilde = np.tanh(self.Wh @ np.vstack([r * h_prev, x]) + self.bh)
h = (1 - z) * h_prev + z * h_tilde # interpolate
return h
# 3 matrices vs LSTM's 4. On a 256-hidden, 300-input cell:
# GRU: 3 × 256 × 556 = 427k params
# LSTM: 4 × 256 × 556 = 569k params (33% more)Two architectural enhancements turned RNNs into the powerful sequence models of 2014–2017: stacking multiple recurrent layers for hierarchical abstraction, and processing sequences in both directions for full bidirectional context.
Stacked (Deep) RNNs
A single recurrent layer can be stacked: the hidden states of layer ℓ become the inputs to layer ℓ+1. Lower layers capture local patterns (morphology, short phrases); higher layers capture longer-range, more abstract structure — mirroring the hierarchy of features in a deep CNN or Transformer.
Bidirectional RNNs
A unidirectional RNN at position t has seen only x₁…xₜ. But for many tasks — named-entity recognition, part-of-speech tagging — the right context matters too. A bidirectional RNN runs two independent RNNs, one forward and one backward, and concatenates their hidden states:
import torch; import torch.nn as nn
# A 2-layer bidirectional LSTM for sequence tagging
bilstm = nn.LSTM(
input_size=300, # word embedding dimension
hidden_size=256, # hidden units per direction
num_layers=2, # stacked depth
bidirectional=True, # forward + backward
batch_first=True,
)
x = torch.randn(32, 50, 300) # (batch, seq_len, emb_dim)
out, (h, c) = bilstm(x)
print(out.shape) # (32, 50, 512) -- 512 = 256 × 2 directions
# A tagging head maps each 512-d state to tag logits
tagger = nn.Linear(512, 9) # e.g. 9 NER tags (BIO scheme)
logits = tagger(out) # (32, 50, 9)Classification produces one label per input. But translation, summarisation, and dialogue must produce a variable-length output sequence from a variable-length input sequence. The sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014; Cho et al., 2014) solved this with two RNNs: an encoder that compresses the input into a context vector, and a decoder that generates the output from it.
The Encoder-Decoder Architecture
ENCODER (reads the source sequence)
h^enc ← 0
for t = 1 … T_src:
h^enc ← LSTM_enc(x_t, h^enc)
context ← h^enc_{T_src} # final state = fixed-size summary
DECODER (generates the target sequence)
h^dec ← context # initialise from encoder
y_0 ← <START>
for t = 1 … T_tgt:
h^dec ← LSTM_dec(embed(y_{t-1}), h^dec)
y_t ← argmax softmax(W · h^dec) # emit next token
if y_t == <END>: breakimport torch; import torch.nn as nn
class Encoder(nn.Module):
def __init__(self, vocab, emb=256, hidden=512):
super().__init__()
self.embed = nn.Embedding(vocab, emb)
self.lstm = nn.LSTM(emb, hidden, batch_first=True)
def forward(self, src):
out, (h, c) = self.lstm(self.embed(src))
return out, (h, c) # out used later for attention
class Decoder(nn.Module):
def __init__(self, vocab, emb=256, hidden=512):
super().__init__()
self.embed = nn.Embedding(vocab, emb)
self.lstm = nn.LSTM(emb, hidden, batch_first=True)
self.proj = nn.Linear(hidden, vocab)
def forward(self, tok, state):
out, state = self.lstm(self.embed(tok), state)
return self.proj(out), state
# Training uses 'teacher forcing': feed the TRUE previous target token,
# not the model's own (possibly wrong) prediction. This stabilises early training
# but creates exposure bias -- the model never practises recovering from its
# own mistakes (a problem scheduled sampling and RL fine-tuning later address).Bahdanau et al. (2014) asked a simple question: why force the decoder to rely on a single context vector? Instead, let it look back at all encoder hidden states and compute a fresh, weighted context vector at every decoding step. This is attention — and it is the conceptual core of the Transformer.
The Attention Mechanism
At decoder step t, attention computes a context vector cₜ as a weighted sum of all encoder states, where the weights depend on how relevant each source position is to the current decoder state:
The alignment score eₜⱼ measures the compatibility between decoder state s_{t-1} and encoder state h_j. The softmax turns scores into a probability distribution (the attention weights αₜⱼ), and the context cₜ is their weighted average of encoder states.
Two Scoring Functions
| Variant | Score function | Notes |
|---|---|---|
| Bahdanau (additive) | vᵀ tanh(W₁ s + W₂ h) | Small MLP; original 2014 formulation |
| Luong (dot) | sᵀ h | Simplest; requires matching dimensions |
| Luong (general) | sᵀ W h | Learned bilinear; flexible dimensions |
| Scaled dot-product | sᵀ h / √d | The Transformer's choice (Ch. 12) |
import torch; import torch.nn as nn; import torch.nn.functional as F
class BahdanauAttention(nn.Module):
def __init__(self, hidden):
super().__init__()
self.W1 = nn.Linear(hidden, hidden) # projects decoder state
self.W2 = nn.Linear(hidden, hidden) # projects encoder states
self.v = nn.Linear(hidden, 1, bias=False)
def forward(self, dec_state, enc_outputs):
# dec_state: (B, H) enc_outputs: (B, T_src, H)
dec = self.W1(dec_state).unsqueeze(1) # (B, 1, H)
enc = self.W2(enc_outputs) # (B, T_src, H)
scores = self.v(torch.tanh(dec + enc)).squeeze(-1) # (B, T_src)
weights = F.softmax(scores, dim=-1) # (B, T_src) attention
context = torch.bmm(weights.unsqueeze(1), enc_outputs) # (B,1,H)
return context.squeeze(1), weights # context + weights for viz
# The attention weights are interpretable: plotting the (T_tgt × T_src)
# weight matrix for a translation reveals soft word alignments -- e.g.
# 'la maison bleue' → 'the blue house' shows the diagonal-ish alignment
# with the adjective-noun swap clearly visible.Attention solved the bottleneck, but the underlying RNN retained two fatal limitations that no amount of gating could fix. Recognising these limits explains precisely why the Transformer abandoned recurrence entirely — keeping only the attention.
| Limitation | Cause | Transformer's fix |
|---|---|---|
| No parallelism | hₜ depends on hₜ₋₁: must compute sequentially | Self-attention: all positions at once |
| Long paths | Info from pos 1 to pos T travels T steps | Attention: any pair in O(1) hops |
| Limited memory | Fixed-size state compresses all history | Attention: direct access to all positions |
| Slow training | Sequential = poor GPU utilisation | Parallel = saturates GPU tensor cores |
The parallelism limitation was the decisive one. An RNN cannot compute hₜ until it has computed hₜ₋₁; a sequence of length 1,000 requires 1,000 sequential steps regardless of how many GPUs you have. The Transformer computes representations for all positions simultaneously, turning a sequential bottleneck into a single large matrix multiplication — perfectly suited to GPU hardware.
The Path Length Argument
To connect information between two positions i and j, an RNN must propagate it through |i − j| recurrent steps, with the gradient attenuating at each step. Attention connects any two positions in a single step — a path length of O(1) rather than O(distance). This is why Transformers learn long-range dependencies that defeated even LSTMs.
Even with LSTMs and attention, training recurrent models well required a collection of practical techniques. Many of these — gradient clipping, learning-rate scheduling, teacher forcing — carry directly into Transformer training.
Truncated Backpropagation Through Time
Full BPTT through a 10,000-token document is infeasible: it requires storing all 10,000 hidden states and backpropagating through all of them. Truncated BPTT processes the sequence in chunks of k steps (e.g., k = 35), backpropagating only within each chunk while carrying the hidden state forward across chunks.
import torch; import torch.nn as nn
model = nn.LSTM(256, 512, batch_first=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bptt = 35 # truncation length
hidden = None
for chunk in data_chunks: # each chunk is (B, 35, 256)
# Detach hidden state: carry VALUES forward but cut the gradient graph
if hidden is not None:
hidden = tuple(h.detach() for h in hidden)
out, hidden = model(chunk, hidden)
loss = compute_loss(out, targets)
opt.zero_grad()
loss.backward() # gradient only flows within this chunk
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0) # ESSENTIAL
opt.step()
# Why detach? Without it, the graph grows unboundedly across chunks and
# you'd backprop through the entire history -- defeating the truncation.Essential RNN Training Checklist
Architecture Quick-Reference
| Model | Key feature | Solves | Limitation |
|---|---|---|---|
| Vanilla RNN | Shared recurrent weights | Variable length | Vanishing gradients |
| LSTM | Cell state + 3 gates | Long-range memory | Sequential; slow |
| GRU | 2 gates, merged state | Memory, fewer params | Sequential; slow |
| BiRNN | Forward + backward pass | Full context | 2× compute; no streaming |
| seq2seq | Encoder-decoder | Variable output | Fixed-context bottleneck |
| seq2seq + attn | Weighted source lookup | Bottleneck removed | Still sequential |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) — the original LSTM. “Learning Phrase Representations using RNN Encoder-Decoder” (Cho et al., 2014) — GRU and seq2seq. “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2014) — the attention paper. “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014). Christopher Olah's blog post “Understanding LSTM Networks” for unmatched visual intuition. Andrej Karpathy's “The Unreasonable Effectiveness of Recurrent Neural Networks.”
Part II Complete: Classical Machine Learning
| Ch. 6 | Discriminative Models | logistic regression, SVMs, trees, and ensembles — learning decision boundaries directly. |
| Ch. 7 | Generative Models | Naïve Bayes, GMMs, HMMs, and the EM algorithm — modelling how data is generated. |
| Ch. 8 | Embeddings | Word2Vec, GloVe, fastText — dense vector spaces where geometry encodes meaning. |
| Ch. 9 | Sequence Models | RNNs, LSTMs, GRUs, and seq2seq with attention — the direct ancestors of the Transformer. |
Part II ends exactly where Part III begins. You have built attention by hand as a soft, differentiable memory lookup, and you have seen the three limitations — no parallelism, long gradient paths, fixed-size memory — that doomed recurrence. The Transformer, the subject of Part III, makes one audacious move: it keeps the attention and discards everything else. Self-attention, multi-head attention, positional encodings, and residual streams — every component you are about to meet is a direct response to a limitation you now understand from first principles. The Transformer will not feel like a black box; it will feel inevitable.