Part II: Classical ML & Representations
Chapter 9

Language Model Fundamentals

N-gram models, perplexity, and the shift to neural LMs
18 Exercises
9.1

Everything in Part II so far has assumed fixed-length inputs: a feature vector, a bag of words, a single embedding. But language is sequential and variable-length. “Dog bites man” and “Man bites dog” contain identical words yet mean opposite things. A model that ignores order cannot tell them apart.

The Three Requirements of a Sequence Model

Variable length: handle a 3-word sentence and a 3,000-word document with the same parameters.
Order sensitivity: the position of each token must matter.
Parameter sharing: the same transformation applied at every position, so learning generalises across positions.

A naïve fix — concatenate all word vectors into one giant input — fails all three: it requires a fixed length, learns position-specific weights, and cannot share statistical strength across positions. The recurrent neural network solves all three at once with a single elegant idea: process one token at a time, carrying a hidden state forward.

Intuition: The Core Idea of Recurrence
Read the sequence left to right. At each step, combine the current token with a running summary of everything seen so far (the hidden state), producing an updated summary. The same update function — the same weights — is applied at every step.
This is exactly how you read this sentence: each word updates your understanding, conditioned on all the words before it. The hidden state is your working memory of the sentence so far.

The Recurrence Relation

hₜ = f(W_hh · hₜ₋₁ + W_xh · xₜ + b) ŷₜ = g(W_hy · hₜ + b_y)

The hidden state hₜ is a function of the previous hidden state hₜ₋₁ and the current input xₜ. The weight matrices W_hh, W_xh, W_hy are shared across all time steps — this is the parameter sharing that lets an RNN handle any sequence length with a fixed parameter count.

9.2

The vanilla (Elman) RNN implements the recurrence directly with a single tanh nonlinearity. It is beautifully simple and was the workhorse of early neural NLP. But it has a crippling limitation that we will diagnose precisely in the next section.

textVanilla RNN — Forward Pass (Pseudocode)
Input:  sequence x₁ … x_T,  initial state h₀ = 0
Params: W_xh (D→H),  W_hh (H→H),  W_hy (H→V),  biases b_h, b_y

for t = 1 … T:
    aₜ  ←  W_xh · xₜ  +  W_hh · hₜ₋₁  +  b_h
    hₜ  ←  tanh(aₜ)                 # new hidden state
    ŷₜ  ←  softmax(W_hy · hₜ + b_y)   # output distribution

return h₁ … h_T, ŷ₁ … ŷ_T
PythonVanilla RNN cell from scratch
import numpy as np

class VanillaRNN:
    def __init__(self, vocab, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        s   = 0.01                      # small init to avoid early saturation
        self.Wxh = rng.normal(0, s, (hidden, vocab))
        self.Whh = rng.normal(0, s, (hidden, hidden))
        self.Why = rng.normal(0, s, (vocab, hidden))
        self.bh  = np.zeros((hidden, 1))
        self.by  = np.zeros((vocab,  1))
        self.H   = hidden

    def forward(self, inputs, h_prev):
        """inputs: list of one-hot column vectors. Returns states + caches."""
        h = h_prev
        xs, hs, ys, ps = {}, {-1: h_prev}, {}, {}
        for t, x in enumerate(inputs):
            xs[t] = x
            h     = np.tanh(self.Wxh @ x + self.Whh @ hs[t-1] + self.bh)
            y     = self.Why @ h + self.by
            p     = np.exp(y - y.max()); p /= p.sum()   # stable softmax
            hs[t], ys[t], ps[t] = h, y, p
        return xs, hs, ps

# A character-level RNN trained on text can generate plausible characters,
# but it forgets context beyond ~10-20 steps. The next section shows why.

Backpropagation Through Time (BPTT)

Training an RNN means unrolling it across all time steps into a (very deep) feed-forward graph, then applying backpropagation. Because W_hh is shared across every step, its gradient is the sum of contributions from all time steps — this summation is the source of both the RNN's power and its instability.

∂L/∂W_hh = Σₜ (∂Lₜ/∂hₜ) · Σ_{k≤t} (∂hₜ/∂h_k) · (∂h_k/∂W_hh)

The crucial term is ∂hₜ/∂h_k — how the hidden state at time t depends on the hidden state at an earlier time k. Expanding it reveals the vanishing-gradient problem.

PythonBPTT for the vanilla RNN
def backward(self, xs, hs, ps, targets):
    dWxh = np.zeros_like(self.Wxh); dWhh = np.zeros_like(self.Whh)
    dWhy = np.zeros_like(self.Why); dbh = np.zeros_like(self.bh); dby = np.zeros_like(self.by)
    dh_next = np.zeros((self.H, 1))

    for t in reversed(range(len(targets))):  # walk backwards in time
        dy = ps[t].copy(); dy[targets[t]] -= 1   # softmax-CE gradient: p - y
        dWhy += dy @ hs[t].T;  dby += dy

        dh  = self.Why.T @ dy + dh_next         # grad flowing into h_t
        draw = (1 - hs[t]**2) * dh            # tanh'(a) = 1 - tanh²(a)
        dbh += draw
        dWxh += draw @ xs[t].T
        dWhh += draw @ hs[t-1].T
        dh_next = self.Whh.T @ draw            # propagate to previous step

    # Gradient clipping is ESSENTIAL for RNNs (prevents explosion)
    for d in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(d, -5, 5, out=d)
    return dWxh, dWhh, dWhy, dbh, dby
9.3

We met vanishing gradients briefly in Chapter 2. In RNNs the problem is acute and structural, because the same recurrent weight matrix is applied at every time step. Understanding this precisely is the key to understanding why LSTMs were invented.

The Gradient Chain

The dependence of hidden state hₜ on an earlier state h_k is a product of Jacobians — one per intervening step:

∂hₜ/∂h_k = ∏_{i=k+1}^{t} ∂hᵢ/∂hᵢ₋₁ = ∏_{i=k+1}^{t} diag(tanh'(aᵢ)) · W_hhᵀ

This product of (t − k) matrices governs how a gradient at time t flows back to time k. Two failure modes:

RegimeConditionConsequence
VanishingLargest singular value of W_hh < 1Gradient → 0 exponentially; no long-range learning
ExplodingLargest singular value of W_hh > 1Gradient → ∞; NaN loss, unstable training
Stable (rare)Singular values ≈ 1Knife-edge; hard to maintain during training

The tanh derivative compounds the problem: tanh'(a) = 1 − tanh²(a) ≤ 1, and it is near zero whenever the unit is saturated (|a| large). So even with well-scaled W_hh, repeated multiplication by sub-unit derivatives drives gradients toward zero.

PythonDemonstrating vanishing gradients empirically
import numpy as np

def gradient_norm_over_time(W_hh, T=100, H=64):
    """Track how a gradient shrinks as it flows back through T steps."""
    grad  = np.random.randn(H, 1)       # gradient at the final step
    norms = [np.linalg.norm(grad)]
    for _ in range(T):
        # Each backward step: multiply by W_hhᵀ and a tanh' factor (~0.5 avg)
        grad = (W_hh.T @ grad) * 0.5
        norms.append(np.linalg.norm(grad))
    return np.array(norms)

rng = np.random.default_rng(0)
H   = 64

# Case 1: small weights → vanishing
W_small = rng.normal(0, 0.5/np.sqrt(H), (H, H))
norms_v = gradient_norm_over_time(W_small)
print(f"After 100 steps (small W): {norms_v[-1]/norms_v[0]:.2e} of original")
# After 100 steps (small W): 3.7e-31 of original  -- completely vanished

# Case 2: large weights → exploding
W_large = rng.normal(0, 3.0/np.sqrt(H), (H, H))
norms_e = gradient_norm_over_time(W_large)
print(f"After 100 steps (large W): {norms_e[-1]/norms_e[0]:.2e} of original")
# After 100 steps (large W): 1.4e+18 of original  -- exploded to NaN territory

# Conclusion: a gradient cannot survive 100 steps of multiplication by W_hh.
# Information from t=1 cannot influence the loss at t=100. The RNN is 'amnesiac'.
⚠️
Pitfall: The Practical Symptom
A vanilla RNN trained on language modelling will learn local patterns (the next character, short word completions) but fail at long-range dependencies. Ask it to close a parenthesis opened 50 tokens ago, or match subject-verb agreement across a long clause, and it fails — the gradient signal connecting those positions has vanished.
Gradient clipping fixes EXPLODING gradients (just cap the norm), but it cannot fix VANISHING gradients — you cannot un-vanish a zero. The solution required a new architecture: the LSTM.
9.4

The Long Short-Term Memory network (Hochretier & Schmidhuber, 1997) solves the vanishing gradient problem with a brilliant architectural change: a separate cell state that flows through time with only minor, additive modifications. Gradients can travel along this cell-state highway nearly unchanged, enabling the network to learn dependencies hundreds of steps long.

History: Ahead of Its Time
The LSTM was published in 1997 but languished for 15 years — the compute and data to train it on real problems did not exist. It became the dominant sequence architecture only around 2014–2015, powering Google Translate, Siri, and Alexa.
Its reign was brief: the Transformer (2017) displaced it within three years. But the LSTM's core insight — additive state updates protect gradients — lives on in the residual connections of every Transformer.

The Cell State and Three Gates

The LSTM maintains two states: the hidden state hₜ (the output) and the cell state cₜ (the long-term memory). Three gates — each a sigmoid layer producing values in [0,1] — control the flow of information:

GateSymbolRole
ForgetfₜWhat fraction of the old cell state to keep (0 = erase, 1 = retain)
InputiₜWhat fraction of the new candidate to write into the cell
OutputoₜWhat fraction of the cell state to expose as the hidden state
textLSTM Cell — Forward Pass (Pseudocode)
# Concatenate previous hidden state and current input: [hₜ₋₁ ; xₜ]

fₜ  ←  σ(W_f · [hₜ₋₁, xₜ] + b_f)        # forget gate
iₜ  ←  σ(W_i · [hₜ₋₁, xₜ] + b_i)        # input gate
c̃ₜ  ←  tanh(W_c · [hₜ₋₁, xₜ] + b_c)      # candidate cell content
oₜ  ←  σ(W_o · [hₜ₋₁, xₜ] + b_o)        # output gate

cₜ  ←  fₜ ⊙ cₜ₋₁  +  iₜ ⊙ c̃ₜ            # ← THE GRADIENT HIGHWAY
hₜ  ←  oₜ ⊙ tanh(cₜ)                    # gated output

The critical line is the cell-state update: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ. Its gradient with respect to cₜ₋₁ is simply fₜ — an element-wise multiplication, not a matrix multiplication through a saturating nonlinearity. When the forget gate fₜ ≈ 1, the gradient flows backward essentially unchanged. This is the additive, near-identity path that defeats vanishing gradients.

PythonLSTM cell from scratch
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))

class LSTMCell:
    def __init__(self, input_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        Z   = input_dim + hidden          # concatenated [h, x] size
        s   = 1.0 / np.sqrt(Z)
        # One weight matrix per gate; rows map [h_prev, x] → gate value
        self.Wf = rng.normal(0, s, (hidden, Z)); self.bf = np.ones((hidden,1))  # bf=1: start by remembering
        self.Wi = rng.normal(0, s, (hidden, Z)); self.bi = np.zeros((hidden,1))
        self.Wc = rng.normal(0, s, (hidden, Z)); self.bc = np.zeros((hidden,1))
        self.Wo = rng.normal(0, s, (hidden, Z)); self.bo = np.zeros((hidden,1))
        self.H  = hidden

    def step(self, x, h_prev, c_prev):
        z  = np.vstack([h_prev, x])       # concatenate
        f  = sigmoid(self.Wf @ z + self.bf)   # forget gate
        i  = sigmoid(self.Wi @ z + self.bi)   # input gate
        c_tilde = np.tanh(self.Wc @ z + self.bc)  # candidate
        o  = sigmoid(self.Wo @ z + self.bo)   # output gate

        c  = f * c_prev + i * c_tilde       # cell update (the highway)
        h  = o * np.tanh(c)                 # gated hidden output
        return h, c, (f, i, c_tilde, o, z)  # cache for backprop

# Note bf initialised to 1.0: the standard 'forget gate bias' trick.
# It makes the LSTM remember by default at the start of training,
# dramatically improving learning of long-range dependencies (Jozefowicz 2015).
ML Connection: From Cell-State Highway to Residual Connection
The LSTM's cell-state update cₜ = fₜ ⊙ cₜ₋₁ + (new info) is structurally identical to the residual connection x + F(x) used in ResNets and every Transformer block. Both create an additive path along which gradients flow nearly unchanged.
When you read in Chapter 13 that 'residual connections enable training of very deep Transformers', recognise it as the same insight the LSTM used to enable training across many time steps. The problem (vanishing gradients through depth) and the solution (additive identity paths) are universal.
9.5

The GRU (Cho et al., 2014) simplifies the LSTM: it merges the cell state and hidden state into one, and uses only two gates instead of three. It has fewer parameters, trains faster, and performs comparably to the LSTM on most tasks — making it a popular default.

Two Gates: Reset and Update

textGRU Cell — Forward Pass (Pseudocode)
rₜ  ←  σ(W_r · [hₜ₋₁, xₜ] + b_r)          # reset gate
zₜ  ←  σ(W_z · [hₜ₋₁, xₜ] + b_z)          # update gate

h̃ₜ  ←  tanh(W_h · [rₜ ⊙ hₜ₋₁, xₜ] + b_h)   # candidate state
hₜ  ←  (1 − zₜ) ⊙ hₜ₋₁  +  zₜ ⊙ h̃ₜ        # interpolate old & new

The reset gate rₜ controls how much of the previous state contributes to the candidate; when rₜ ≈ 0, the unit ignores its past and behaves like reading a fresh sequence. The update gate zₜ interpolates between keeping the old state and adopting the new candidate — it plays the combined role of the LSTM's forget and input gates.

Compare: LSTM vs. GRU
LSTM: 3 gates (forget, input, output), separate cell and hidden state, 4 weight matrices per cell. More expressive; better on tasks needing precise long-term memory control (e.g., counting, very long dependencies). ~33% more parameters.
GRU: 2 gates (reset, update), single state, 3 weight matrices per cell. Faster to train, fewer parameters, less prone to overfitting on small datasets. Comparable accuracy on most NLP and speech tasks. Often the better default when compute or data is limited.
PythonGRU cell from scratch
import numpy as np

class GRUCell:
    def __init__(self, input_dim, hidden, seed=0):
        rng = np.random.default_rng(seed); Z = input_dim + hidden; s = 1/np.sqrt(Z)
        self.Wr = rng.normal(0, s, (hidden, Z)); self.br = np.zeros((hidden,1))
        self.Wz = rng.normal(0, s, (hidden, Z)); self.bz = np.zeros((hidden,1))
        self.Wh = rng.normal(0, s, (hidden, Z)); self.bh = np.zeros((hidden,1))

    def step(self, x, h_prev):
        z_cat = np.vstack([h_prev, x])
        r = sigmoid(self.Wr @ z_cat + self.br)    # reset gate
        z = sigmoid(self.Wz @ z_cat + self.bz)    # update gate
        # Candidate uses reset-gated previous state
        h_tilde = np.tanh(self.Wh @ np.vstack([r * h_prev, x]) + self.bh)
        h = (1 - z) * h_prev + z * h_tilde       # interpolate
        return h

# 3 matrices vs LSTM's 4. On a 256-hidden, 300-input cell:
#   GRU:  3 × 256 × 556 = 427k params
#   LSTM: 4 × 256 × 556 = 569k params   (33% more)
9.6

Two architectural enhancements turned RNNs into the powerful sequence models of 2014–2017: stacking multiple recurrent layers for hierarchical abstraction, and processing sequences in both directions for full bidirectional context.

Stacked (Deep) RNNs

A single recurrent layer can be stacked: the hidden states of layer ℓ become the inputs to layer ℓ+1. Lower layers capture local patterns (morphology, short phrases); higher layers capture longer-range, more abstract structure — mirroring the hierarchy of features in a deep CNN or Transformer.

Bidirectional RNNs

A unidirectional RNN at position t has seen only x₁…xₜ. But for many tasks — named-entity recognition, part-of-speech tagging — the right context matters too. A bidirectional RNN runs two independent RNNs, one forward and one backward, and concatenates their hidden states:

hₜ = [ h⃗ₜ ; h⃖ₜ ] where h⃗ₜ reads x₁…xₜ and h⃖ₜ reads x_T…xₜ
ML Connection: BiLSTM → BERT
The bidirectional LSTM was the dominant architecture for sequence labelling (NER, POS) until 2018. ELMo (2018) used a deep BiLSTM to produce the first widely-used contextual embeddings.
BERT (2018) replaced the BiLSTM with a bidirectional Transformer, trained via masked language modelling. The 'B' in BERT stands for 'Bidirectional' — a direct descendant of the BiLSTM idea, but with attention replacing recurrence. The conceptual lineage is unmistakable.
PythonBidirectional LSTM in PyTorch
import torch; import torch.nn as nn

# A 2-layer bidirectional LSTM for sequence tagging
bilstm = nn.LSTM(
    input_size=300,       # word embedding dimension
    hidden_size=256,      # hidden units per direction
    num_layers=2,         # stacked depth
    bidirectional=True,    # forward + backward
    batch_first=True,
)

x       = torch.randn(32, 50, 300)  # (batch, seq_len, emb_dim)
out, (h, c) = bilstm(x)
print(out.shape)   # (32, 50, 512) -- 512 = 256 × 2 directions

# A tagging head maps each 512-d state to tag logits
tagger = nn.Linear(512, 9)  # e.g. 9 NER tags (BIO scheme)
logits = tagger(out)           # (32, 50, 9)
9.7

Classification produces one label per input. But translation, summarisation, and dialogue must produce a variable-length output sequence from a variable-length input sequence. The sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014; Cho et al., 2014) solved this with two RNNs: an encoder that compresses the input into a context vector, and a decoder that generates the output from it.

The Encoder-Decoder Architecture

textVanilla seq2seq (Pseudocode)
ENCODER  (reads the source sequence)
  h^enc ← 0
  for t = 1 … T_src:
    h^enc ← LSTM_enc(x_t, h^enc)
  context ← h^enc_{T_src}       # final state = fixed-size summary

DECODER  (generates the target sequence)
  h^dec ← context                # initialise from encoder
  y_0 ← <START>
  for t = 1 … T_tgt:
    h^dec ← LSTM_dec(embed(y_{t-1}), h^dec)
    y_t   ← argmax softmax(W · h^dec)  # emit next token
    if y_t == <END>: break
Pythonseq2seq for translation in PyTorch
import torch; import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hidden, batch_first=True)
    def forward(self, src):
        out, (h, c) = self.lstm(self.embed(src))
        return out, (h, c)     # out used later for attention

class Decoder(nn.Module):
    def __init__(self, vocab, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hidden, batch_first=True)
        self.proj  = nn.Linear(hidden, vocab)
    def forward(self, tok, state):
        out, state = self.lstm(self.embed(tok), state)
        return self.proj(out), state

# Training uses 'teacher forcing': feed the TRUE previous target token,
# not the model's own (possibly wrong) prediction. This stabilises early training
# but creates exposure bias -- the model never practises recovering from its
# own mistakes (a problem scheduled sampling and RL fine-tuning later address).
⚠️
Pitfall: The Fixed-Context Bottleneck
Vanilla seq2seq compresses the ENTIRE source sentence into a single fixed-size context vector. For a 5-word sentence this is fine; for a 50-word sentence, the encoder must cram 50 words of meaning into, say, 512 numbers. Translation quality collapses as sentences grow longer (Cho et al., 2014 measured the degradation precisely).
This bottleneck is the single problem that attention was invented to solve — and attention, in turn, became the seed of the entire Transformer revolution.
9.8

Bahdanau et al. (2014) asked a simple question: why force the decoder to rely on a single context vector? Instead, let it look back at all encoder hidden states and compute a fresh, weighted context vector at every decoding step. This is attention — and it is the conceptual core of the Transformer.

The Attention Mechanism

At decoder step t, attention computes a context vector cₜ as a weighted sum of all encoder states, where the weights depend on how relevant each source position is to the current decoder state:

eₜⱼ = score(s_{t-1}, h_j) αₜⱼ = softmax_j(eₜⱼ) cₜ = Σ_j αₜⱼ · h_j

The alignment score eₜⱼ measures the compatibility between decoder state s_{t-1} and encoder state h_j. The softmax turns scores into a probability distribution (the attention weights αₜⱼ), and the context cₜ is their weighted average of encoder states.

Two Scoring Functions

VariantScore functionNotes
Bahdanau (additive)vᵀ tanh(W₁ s + W₂ h)Small MLP; original 2014 formulation
Luong (dot)sᵀ hSimplest; requires matching dimensions
Luong (general)sᵀ W hLearned bilinear; flexible dimensions
Scaled dot-productsᵀ h / √dThe Transformer's choice (Ch. 12)
PythonBahdanau attention from scratch
import torch; import torch.nn as nn; import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden)    # projects decoder state
        self.W2 = nn.Linear(hidden, hidden)    # projects encoder states
        self.v  = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H)   enc_outputs: (B, T_src, H)
        dec = self.W1(dec_state).unsqueeze(1)    # (B, 1, H)
        enc = self.W2(enc_outputs)              # (B, T_src, H)
        scores = self.v(torch.tanh(dec + enc)).squeeze(-1)  # (B, T_src)
        weights = F.softmax(scores, dim=-1)       # (B, T_src) attention
        context = torch.bmm(weights.unsqueeze(1), enc_outputs)  # (B,1,H)
        return context.squeeze(1), weights     # context + weights for viz

# The attention weights are interpretable: plotting the (T_tgt × T_src)
# weight matrix for a translation reveals soft word alignments -- e.g.
# 'la maison bleue' → 'the blue house' shows the diagonal-ish alignment
# with the adjective-noun swap clearly visible.
Attention as Soft, Differentiable Memory Lookup
Attention is a content-addressable memory: the query (decoder state) is compared against keys (encoder states) to retrieve a weighted blend of values (also encoder states). Unlike a hard lookup, it is fully differentiable — the model learns what to attend to via gradient descent.
This query-key-value framing, made explicit here, is EXACTLY the self-attention of the Transformer. The only differences in Chapter 12 will be: (1) queries, keys, and values are all derived from the same sequence (self-attention), (2) the scaled dot-product score replaces the additive MLP, and (3) multiple attention 'heads' run in parallel. The mechanism you just built IS the Transformer's beating heart.
9.9

Attention solved the bottleneck, but the underlying RNN retained two fatal limitations that no amount of gating could fix. Recognising these limits explains precisely why the Transformer abandoned recurrence entirely — keeping only the attention.

LimitationCauseTransformer's fix
No parallelismhₜ depends on hₜ₋₁: must compute sequentiallySelf-attention: all positions at once
Long pathsInfo from pos 1 to pos T travels T stepsAttention: any pair in O(1) hops
Limited memoryFixed-size state compresses all historyAttention: direct access to all positions
Slow trainingSequential = poor GPU utilisationParallel = saturates GPU tensor cores

The parallelism limitation was the decisive one. An RNN cannot compute hₜ until it has computed hₜ₋₁; a sequence of length 1,000 requires 1,000 sequential steps regardless of how many GPUs you have. The Transformer computes representations for all positions simultaneously, turning a sequential bottleneck into a single large matrix multiplication — perfectly suited to GPU hardware.

The Path Length Argument

To connect information between two positions i and j, an RNN must propagate it through |i − j| recurrent steps, with the gradient attenuating at each step. Attention connects any two positions in a single step — a path length of O(1) rather than O(distance). This is why Transformers learn long-range dependencies that defeated even LSTMs.

History: “Attention Is All You Need”
The 2017 Transformer paper made a radical claim in its title: you can throw away the entire recurrent machinery and keep only the attention. The community was sceptical — surely sequences need some notion of recurrence or convolution?
They did not. The Transformer matched and then crushed RNN performance on translation, while training in a fraction of the time thanks to full parallelism. Within three years, recurrence had all but vanished from NLP research. Part III tells this story in full.
9.10

Even with LSTMs and attention, training recurrent models well required a collection of practical techniques. Many of these — gradient clipping, learning-rate scheduling, teacher forcing — carry directly into Transformer training.

Truncated Backpropagation Through Time

Full BPTT through a 10,000-token document is infeasible: it requires storing all 10,000 hidden states and backpropagating through all of them. Truncated BPTT processes the sequence in chunks of k steps (e.g., k = 35), backpropagating only within each chunk while carrying the hidden state forward across chunks.

PythonTruncated BPTT training loop
import torch; import torch.nn as nn

model   = nn.LSTM(256, 512, batch_first=True)
opt     = torch.optim.Adam(model.parameters(), lr=1e-3)
bptt    = 35                            # truncation length
hidden  = None

for chunk in data_chunks:            # each chunk is (B, 35, 256)
    # Detach hidden state: carry VALUES forward but cut the gradient graph
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)

    out, hidden = model(chunk, hidden)
    loss = compute_loss(out, targets)

    opt.zero_grad()
    loss.backward()                    # gradient only flows within this chunk
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # ESSENTIAL
    opt.step()

# Why detach? Without it, the graph grows unboundedly across chunks and
# you'd backprop through the entire history -- defeating the truncation.

Essential RNN Training Checklist

Gradient clipping: clip the global gradient norm to ~5.0. Non-negotiable for RNNs — prevents the exploding-gradient NaNs.
Forget-gate bias = 1: initialise the LSTM forget gate bias to 1 so the network remembers by default early in training.
Orthogonal initialisation of W_hh: keeps recurrent Jacobian singular values near 1 at initialisation, delaying vanishing/exploding.
Variational dropout: apply the SAME dropout mask at every time step, not a fresh one per step (Gal & Ghahramani, 2016).
Teacher forcing with scheduled sampling: gradually replace true previous tokens with model predictions to reduce exposure bias.
Layer normalisation: stabilises hidden-state statistics across time steps, easing optimisation.
9.11

Architecture Quick-Reference

ModelKey featureSolvesLimitation
Vanilla RNNShared recurrent weightsVariable lengthVanishing gradients
LSTMCell state + 3 gatesLong-range memorySequential; slow
GRU2 gates, merged stateMemory, fewer paramsSequential; slow
BiRNNForward + backward passFull context2× compute; no streaming
seq2seqEncoder-decoderVariable outputFixed-context bottleneck
seq2seq + attnWeighted source lookupBottleneck removedStill sequential

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

Exercise 1: Pen & Paper
Write out the full BPTT gradient ∂L/∂W_hh for a 3-step vanilla RNN. Identify which term contains the product of Jacobians responsible for vanishing gradients.
Exercise 2: Pen & Paper
The tanh derivative is 1 − tanh²(a). Show that it is at most 1 and approaches 0 as |a| → ∞. Explain how this compounds the vanishing gradient problem over many time steps.
Exercise 3: Pen & Paper
For the LSTM cell update cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, compute ∂cₜ/∂cₜ₋₁. Explain why this protects gradients when fₜ ≈ 1, contrasting with the vanilla RNN's ∂hₜ/∂hₜ₋₁.
Exercise 4: Pen & Paper
Count the parameters in an LSTM cell with input dim D and hidden dim H. Repeat for a GRU. Verify the GRU has 25% fewer recurrent parameters.
Exercise 5: Pen & Paper
Show that if the LSTM forget gate is always 1 and the input gate always 0, the cell state is perfectly preserved: cₜ = c₀ for all t. What does this 'constant error carousel' achieve?
Exercise 6: Pen & Paper
In Bahdanau attention, the alignment scores eₜⱼ are passed through a softmax. Prove that the resulting context vector cₜ is a convex combination of the encoder states (lies in their convex hull).
Exercise 7: Pen & Paper
Compare the path length between positions i and j (|i−j| apart) in: (a) a vanilla RNN, (b) a bidirectional RNN, (c) a Transformer with self-attention. Why does path length matter for learning long-range dependencies?
Exercise 8: Pen & Paper
A seq2seq encoder compresses a T-token sentence into a fixed H-dimensional vector. Argue information-theoretically why translation quality must degrade as T grows for fixed H.
Exercise 9: Pen & Paper
Derive the gradient of the Luong dot-product attention score sᵀh with respect to the decoder state s and encoder state h. Why is this cheaper to compute than Bahdanau's additive score?
Exercise 10: Pen & Paper
Explain why truncated BPTT with truncation length k cannot learn dependencies longer than k steps within a single backward pass, even though the hidden state carries information forward indefinitely.
Exercise 11: Code
Implement a vanilla RNN character-level language model from scratch (NumPy). Train it on a small text. Demonstrate empirically that it fails to balance parentheses opened more than ~15 characters earlier.
Exercise 12: Code
Implement the LSTM cell forward AND backward pass from scratch. Verify your gradients with the finite-difference checker from Chapter 5.
Exercise 13: Code
Implement a GRU cell from scratch. On a synthetic 'copy task' (reproduce a sequence after a delay of N steps), compare GRU vs. vanilla RNN as N grows from 5 to 100. Plot accuracy vs. N.
Exercise 14: Code Lab
Build a seq2seq model (PyTorch) for a toy translation task (e.g., reversing or sorting number sequences). Train WITHOUT attention. Plot BLEU/accuracy vs. source length.
Exercise 15: Code Lab
Add Bahdanau attention to your exercise-14 model. Re-plot accuracy vs. source length and show the bottleneck is removed. Visualise the attention weight matrix for 3 examples.
Exercise 16: Code
Demonstrate the gradient highway: train an LSTM and a vanilla RNN on a sequence with a dependency 80 steps long (the output depends on input at t=1). Plot the gradient norm reaching t=1 for both during training.
Exercise 17: Code
Implement a bidirectional LSTM tagger for part-of-speech tagging on a small annotated corpus. Compare unidirectional vs. bidirectional F1 score. Which tags benefit most from right-context?
Exercise 18: Code (Challenge)
Build a complete attention-based seq2seq translation system on a small real dataset (e.g., a Tatoeba language pair, ≤10k sentence pairs). Implement teacher forcing, beam search decoding, and attention visualisation. Report BLEU. Then reflect in writing: which limitations from Section 9.9 did you encounter, and how would a Transformer address each?

Further reading: “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) — the original LSTM. “Learning Phrase Representations using RNN Encoder-Decoder” (Cho et al., 2014) — GRU and seq2seq. “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2014) — the attention paper. “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014). Christopher Olah's blog post “Understanding LSTM Networks” for unmatched visual intuition. Andrej Karpathy's “The Unreasonable Effectiveness of Recurrent Neural Networks.”

Part II Complete: Classical Machine Learning

Ch. 6Discriminative Modelslogistic regression, SVMs, trees, and ensembles — learning decision boundaries directly.
Ch. 7Generative ModelsNaïve Bayes, GMMs, HMMs, and the EM algorithm — modelling how data is generated.
Ch. 8EmbeddingsWord2Vec, GloVe, fastText — dense vector spaces where geometry encodes meaning.
Ch. 9Sequence ModelsRNNs, LSTMs, GRUs, and seq2seq with attention — the direct ancestors of the Transformer.

Part II ends exactly where Part III begins. You have built attention by hand as a soft, differentiable memory lookup, and you have seen the three limitations — no parallelism, long gradient paths, fixed-size memory — that doomed recurrence. The Transformer, the subject of Part III, makes one audacious move: it keeps the attention and discards everything else. Self-attention, multi-head attention, positional encodings, and residual streams — every component you are about to meet is a direct response to a limitation you now understand from first principles. The Transformer will not feel like a black box; it will feel inevitable.

18 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →