Part II: Classical ML & Representations

Chapter 9

Sequence Models: RNNs & LSTMs

Vanishing gradients, gating mechanisms, and seq2seq with attention — the direct ancestors of the Transformer.

18 Exercises

Learning Objectives

1.	Understand why feed-forward networks cannot process variable-length sequences and how recurrence solves it.
2.	Derive backpropagation through time (BPTT) and diagnose the vanishing/exploding gradient problem.
3.	Build a vanilla RNN from scratch and observe its memory limitations empirically.
4.	Understand LSTM gating mechanisms and how the cell state creates a gradient highway.
5.	Implement an LSTM and GRU cell from scratch and compare them.
6.	Build the encoder-decoder (seq2seq) architecture for machine translation.
7.	Add Bahdanau and Luong attention to seq2seq and explain why it shattered the fixed-context bottleneck.
8.	Trace the conceptual line from attention-augmented RNNs directly to the Transformer of Part III.

Everything in Part II so far has assumed fixed-length inputs: a feature vector, a bag of words, a single embedding. But language is sequential and variable-length. “Dog bites man” and “Man bites dog” contain identical words yet mean opposite things. A model that ignores order cannot tell them apart.

The Three Requirements of a Sequence Model

•Variable length: handle a 3-word sentence and a 3,000-word document with the same parameters.

•Order sensitivity: the position of each token must matter.

•Parameter sharing: the same transformation applied at every position, so learning generalises across positions.

A naïve fix — concatenate all word vectors into one giant input — fails all three: it requires a fixed length, learns position-specific weights, and cannot share statistical strength across positions. The recurrent neural network solves all three at once with a single elegant idea: process one token at a time, carrying a hidden state forward.

✧

Intuition: The Core Idea of Recurrence

Read the sequence left to right. At each step, combine the current token with a running summary of everything seen so far (the hidden state), producing an updated summary. The same update function — the same weights — is applied at every step.

This is exactly how you read this sentence: each word updates your understanding, conditioned on all the words before it. The hidden state is your working memory of the sentence so far.

The Recurrence Relation

hₜ = f(W_hh · hₜ₋₁ + W_xh · xₜ + b) ŷₜ = g(W_hy · hₜ + b_y)

The hidden state hₜ is a function of the previous hidden state hₜ₋₁ and the current input xₜ. The weight matrices W_hh, W_xh, W_hy are shared across all time steps — this is the parameter sharing that lets an RNN handle any sequence length with a fixed parameter count.

The vanilla (Elman) RNN implements the recurrence directly with a single tanh nonlinearity. It is beautifully simple and was the workhorse of early neural NLP. But it has a crippling limitation that we will diagnose precisely in the next section.

text•Vanilla RNN — Forward Pass (Pseudocode)
Input:  sequence x₁ … x_T,  initial state h₀ = 0
Params: W_xh (D→H),  W_hh (H→H),  W_hy (H→V),  biases b_h, b_y

for t = 1 … T:
    aₜ  ←  W_xh · xₜ  +  W_hh · hₜ₋₁  +  b_h
    hₜ  ←  tanh(aₜ)                 # new hidden state
    ŷₜ  ←  softmax(W_hy · hₜ + b_y)   # output distribution

return h₁ … h_T, ŷ₁ … ŷ_T

Python•Vanilla RNN cell from scratch
import numpy as np

class VanillaRNN:
    def __init__(self, vocab, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        s   = 0.01                      # small init to avoid early saturation
        self.Wxh = rng.normal(0, s, (hidden, vocab))
        self.Whh = rng.normal(0, s, (hidden, hidden))
        self.Why = rng.normal(0, s, (vocab, hidden))
        self.bh  = np.zeros((hidden, 1))
        self.by  = np.zeros((vocab,  1))
        self.H   = hidden

    def forward(self, inputs, h_prev):
        """inputs: list of one-hot column vectors. Returns states + caches."""
        h = h_prev
        xs, hs, ys, ps = {}, {-1: h_prev}, {}, {}
        for t, x in enumerate(inputs):
            xs[t] = x
            h     = np.tanh(self.Wxh @ x + self.Whh @ hs[t-1] + self.bh)
            y     = self.Why @ h + self.by
            p     = np.exp(y - y.max()); p /= p.sum()   # stable softmax
            hs[t], ys[t], ps[t] = h, y, p
        return xs, hs, ps

# A character-level RNN trained on text can generate plausible characters,
# but it forgets context beyond ~10-20 steps. The next section shows why.

Backpropagation Through Time (BPTT)

Training an RNN means unrolling it across all time steps into a (very deep) feed-forward graph, then applying backpropagation. Because W_hh is shared across every step, its gradient is the sum of contributions from all time steps — this summation is the source of both the RNN's power and its instability.

∂L/∂W_hh = Σₜ (∂Lₜ/∂hₜ) · Σ_{k≤t} (∂hₜ/∂h_k) · (∂h_k/∂W_hh)

The crucial term is ∂hₜ/∂h_k — how the hidden state at time t depends on the hidden state at an earlier time k. Expanding it reveals the vanishing-gradient problem.

Python•BPTT for the vanilla RNN
def backward(self, xs, hs, ps, targets):
    dWxh = np.zeros_like(self.Wxh); dWhh = np.zeros_like(self.Whh)
    dWhy = np.zeros_like(self.Why); dbh = np.zeros_like(self.bh); dby = np.zeros_like(self.by)
    dh_next = np.zeros((self.H, 1))

    for t in reversed(range(len(targets))):  # walk backwards in time
        dy = ps[t].copy(); dy[targets[t]] -= 1   # softmax-CE gradient: p - y
        dWhy += dy @ hs[t].T;  dby += dy

        dh  = self.Why.T @ dy + dh_next         # grad flowing into h_t
        draw = (1 - hs[t]**2) * dh            # tanh'(a) = 1 - tanh²(a)
        dbh += draw
        dWxh += draw @ xs[t].T
        dWhh += draw @ hs[t-1].T
        dh_next = self.Whh.T @ draw            # propagate to previous step

    # Gradient clipping is ESSENTIAL for RNNs (prevents explosion)
    for d in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(d, -5, 5, out=d)
    return dWxh, dWhh, dWhy, dbh, dby

We met vanishing gradients briefly in Chapter 2. In RNNs the problem is acute and structural, because the same recurrent weight matrix is applied at every time step. Understanding this precisely is the key to understanding why LSTMs were invented.

The Gradient Chain

The dependence of hidden state hₜ on an earlier state h_k is a product of Jacobians — one per intervening step:

∂hₜ/∂h_k = ∏_{i=k+1}^{t} ∂hᵢ/∂hᵢ₋₁ = ∏_{i=k+1}^{t} diag(tanh'(aᵢ)) · W_hhᵀ

This product of (t − k) matrices governs how a gradient at time t flows back to time k. Two failure modes:

Regime	Condition	Consequence
Vanishing	Largest singular value of W_hh < 1	Gradient → 0 exponentially; no long-range learning
Exploding	Largest singular value of W_hh > 1	Gradient → ∞; NaN loss, unstable training
Stable (rare)	Singular values ≈ 1	Knife-edge; hard to maintain during training

The tanh derivative compounds the problem: tanh'(a) = 1 − tanh²(a) ≤ 1, and it is near zero whenever the unit is saturated (|a| large). So even with well-scaled W_hh, repeated multiplication by sub-unit derivatives drives gradients toward zero.

Python•Demonstrating vanishing gradients empirically
import numpy as np

def gradient_norm_over_time(W_hh, T=100, H=64):
    """Track how a gradient shrinks as it flows back through T steps."""
    grad  = np.random.randn(H, 1)       # gradient at the final step
    norms = [np.linalg.norm(grad)]
    for _ in range(T):
        # Each backward step: multiply by W_hhᵀ and a tanh' factor (~0.5 avg)
        grad = (W_hh.T @ grad) * 0.5
        norms.append(np.linalg.norm(grad))
    return np.array(norms)

rng = np.random.default_rng(0)
H   = 64

# Case 1: small weights → vanishing
W_small = rng.normal(0, 0.5/np.sqrt(H), (H, H))
norms_v = gradient_norm_over_time(W_small)
print(f"After 100 steps (small W): {norms_v[-1]/norms_v[0]:.2e} of original")
# After 100 steps (small W): 3.7e-31 of original  -- completely vanished

# Case 2: large weights → exploding
W_large = rng.normal(0, 3.0/np.sqrt(H), (H, H))
norms_e = gradient_norm_over_time(W_large)
print(f"After 100 steps (large W): {norms_e[-1]/norms_e[0]:.2e} of original")
# After 100 steps (large W): 1.4e+18 of original  -- exploded to NaN territory

# Conclusion: a gradient cannot survive 100 steps of multiplication by W_hh.
# Information from t=1 cannot influence the loss at t=100. The RNN is 'amnesiac'.

⚠️

Pitfall: The Practical Symptom

A vanilla RNN trained on language modelling will learn local patterns (the next character, short word completions) but fail at long-range dependencies. Ask it to close a parenthesis opened 50 tokens ago, or match subject-verb agreement across a long clause, and it fails — the gradient signal connecting those positions has vanished.

Gradient clipping fixes EXPLODING gradients (just cap the norm), but it cannot fix VANISHING gradients — you cannot un-vanish a zero. The solution required a new architecture: the LSTM.

The Long Short-Term Memory network (Hochretier & Schmidhuber, 1997) solves the vanishing gradient problem with a brilliant architectural change: a separate cell state that flows through time with only minor, additive modifications. Gradients can travel along this cell-state highway nearly unchanged, enabling the network to learn dependencies hundreds of steps long.

✧

History: Ahead of Its Time

The LSTM was published in 1997 but languished for 15 years — the compute and data to train it on real problems did not exist. It became the dominant sequence architecture only around 2014–2015, powering Google Translate, Siri, and Alexa.

Its reign was brief: the Transformer (2017) displaced it within three years. But the LSTM's core insight — additive state updates protect gradients — lives on in the residual connections of every Transformer.

The Cell State and Three Gates

The LSTM maintains two states: the hidden state hₜ (the output) and the cell state cₜ (the long-term memory). Three gates — each a sigmoid layer producing values in [0,1] — control the flow of information:

Gate	Symbol	Role
Forget	fₜ	What fraction of the old cell state to keep (0 = erase, 1 = retain)
Input	iₜ	What fraction of the new candidate to write into the cell
Output	oₜ	What fraction of the cell state to expose as the hidden state

text•LSTM Cell — Forward Pass (Pseudocode)
# Concatenate previous hidden state and current input: [hₜ₋₁ ; xₜ]

fₜ  ←  σ(W_f · [hₜ₋₁, xₜ] + b_f)        # forget gate
iₜ  ←  σ(W_i · [hₜ₋₁, xₜ] + b_i)        # input gate
c̃ₜ  ←  tanh(W_c · [hₜ₋₁, xₜ] + b_c)      # candidate cell content
oₜ  ←  σ(W_o · [hₜ₋₁, xₜ] + b_o)        # output gate

cₜ  ←  fₜ ⊙ cₜ₋₁  +  iₜ ⊙ c̃ₜ            # ← THE GRADIENT HIGHWAY
hₜ  ←  oₜ ⊙ tanh(cₜ)                    # gated output

The critical line is the cell-state update: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ. Its gradient with respect to cₜ₋₁ is simply fₜ — an element-wise multiplication, not a matrix multiplication through a saturating nonlinearity. When the forget gate fₜ ≈ 1, the gradient flows backward essentially unchanged. This is the additive, near-identity path that defeats vanishing gradients.

Python•LSTM cell from scratch
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))

class LSTMCell:
    def __init__(self, input_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        Z   = input_dim + hidden          # concatenated [h, x] size
        s   = 1.0 / np.sqrt(Z)
        # One weight matrix per gate; rows map [h_prev, x] → gate value
        self.Wf = rng.normal(0, s, (hidden, Z)); self.bf = np.ones((hidden,1))  # bf=1: start by remembering
        self.Wi = rng.normal(0, s, (hidden, Z)); self.bi = np.zeros((hidden,1))
        self.Wc = rng.normal(0, s, (hidden, Z)); self.bc = np.zeros((hidden,1))
        self.Wo = rng.normal(0, s, (hidden, Z)); self.bo = np.zeros((hidden,1))
        self.H  = hidden

    def step(self, x, h_prev, c_prev):
        z  = np.vstack([h_prev, x])       # concatenate
        f  = sigmoid(self.Wf @ z + self.bf)   # forget gate
        i  = sigmoid(self.Wi @ z + self.bi)   # input gate
        c_tilde = np.tanh(self.Wc @ z + self.bc)  # candidate
        o  = sigmoid(self.Wo @ z + self.bo)   # output gate

        c  = f * c_prev + i * c_tilde       # cell update (the highway)
        h  = o * np.tanh(c)                 # gated hidden output
        return h, c, (f, i, c_tilde, o, z)  # cache for backprop

# Note bf initialised to 1.0: the standard 'forget gate bias' trick.
# It makes the LSTM remember by default at the start of training,
# dramatically improving learning of long-range dependencies (Jozefowicz 2015).

▶

ML Connection: From Cell-State Highway to Residual Connection

The LSTM's cell-state update cₜ = fₜ ⊙ cₜ₋₁ + (new info) is structurally identical to the residual connection x + F(x) used in ResNets and every Transformer block. Both create an additive path along which gradients flow nearly unchanged.

When you read in Chapter 13 that 'residual connections enable training of very deep Transformers', recognise it as the same insight the LSTM used to enable training across many time steps. The problem (vanishing gradients through depth) and the solution (additive identity paths) are universal.

The GRU (Cho et al., 2014) simplifies the LSTM: it merges the cell state and hidden state into one, and uses only two gates instead of three. It has fewer parameters, trains faster, and performs comparably to the LSTM on most tasks — making it a popular default.

Two Gates: Reset and Update

text•GRU Cell — Forward Pass (Pseudocode)
rₜ  ←  σ(W_r · [hₜ₋₁, xₜ] + b_r)          # reset gate
zₜ  ←  σ(W_z · [hₜ₋₁, xₜ] + b_z)          # update gate

h̃ₜ  ←  tanh(W_h · [rₜ ⊙ hₜ₋₁, xₜ] + b_h)   # candidate state
hₜ  ←  (1 − zₜ) ⊙ hₜ₋₁  +  zₜ ⊙ h̃ₜ        # interpolate old & new

The reset gate rₜ controls how much of the previous state contributes to the candidate; when rₜ ≈ 0, the unit ignores its past and behaves like reading a fresh sequence. The update gate zₜ interpolates between keeping the old state and adopting the new candidate — it plays the combined role of the LSTM's forget and input gates.

✧

Compare: LSTM vs. GRU

LSTM: 3 gates (forget, input, output), separate cell and hidden state, 4 weight matrices per cell. More expressive; better on tasks needing precise long-term memory control (e.g., counting, very long dependencies). ~33% more parameters.

GRU: 2 gates (reset, update), single state, 3 weight matrices per cell. Faster to train, fewer parameters, less prone to overfitting on small datasets. Comparable accuracy on most NLP and speech tasks. Often the better default when compute or data is limited.

Python•GRU cell from scratch
import numpy as np

class GRUCell:
    def __init__(self, input_dim, hidden, seed=0):
        rng = np.random.default_rng(seed); Z = input_dim + hidden; s = 1/np.sqrt(Z)
        self.Wr = rng.normal(0, s, (hidden, Z)); self.br = np.zeros((hidden,1))
        self.Wz = rng.normal(0, s, (hidden, Z)); self.bz = np.zeros((hidden,1))
        self.Wh = rng.normal(0, s, (hidden, Z)); self.bh = np.zeros((hidden,1))

    def step(self, x, h_prev):
        z_cat = np.vstack([h_prev, x])
        r = sigmoid(self.Wr @ z_cat + self.br)    # reset gate
        z = sigmoid(self.Wz @ z_cat + self.bz)    # update gate
        # Candidate uses reset-gated previous state
        h_tilde = np.tanh(self.Wh @ np.vstack([r * h_prev, x]) + self.bh)
        h = (1 - z) * h_prev + z * h_tilde       # interpolate
        return h

# 3 matrices vs LSTM's 4. On a 256-hidden, 300-input cell:
#   GRU:  3 × 256 × 556 = 427k params
#   LSTM: 4 × 256 × 556 = 569k params   (33% more)

Two architectural enhancements turned RNNs into the powerful sequence models of 2014–2017: stacking multiple recurrent layers for hierarchical abstraction, and processing sequences in both directions for full bidirectional context.

Stacked (Deep) RNNs

A single recurrent layer can be stacked: the hidden states of layer ℓ become the inputs to layer ℓ+1. Lower layers capture local patterns (morphology, short phrases); higher layers capture longer-range, more abstract structure — mirroring the hierarchy of features in a deep CNN or Transformer.

Bidirectional RNNs

A unidirectional RNN at position t has seen only x₁…xₜ. But for many tasks — named-entity recognition, part-of-speech tagging — the right context matters too. A bidirectional RNN runs two independent RNNs, one forward and one backward, and concatenates their hidden states:

hₜ = [ h⃗ₜ ; h⃖ₜ ] where h⃗ₜ reads x₁…xₜ and h⃖ₜ reads x_T…xₜ

▶

ML Connection: BiLSTM → BERT

The bidirectional LSTM was the dominant architecture for sequence labelling (NER, POS) until 2018. ELMo (2018) used a deep BiLSTM to produce the first widely-used contextual embeddings.

BERT (2018) replaced the BiLSTM with a bidirectional Transformer, trained via masked language modelling. The 'B' in BERT stands for 'Bidirectional' — a direct descendant of the BiLSTM idea, but with attention replacing recurrence. The conceptual lineage is unmistakable.

Python•Bidirectional LSTM in PyTorch
import torch; import torch.nn as nn

# A 2-layer bidirectional LSTM for sequence tagging
bilstm = nn.LSTM(
    input_size=300,       # word embedding dimension
    hidden_size=256,      # hidden units per direction
    num_layers=2,         # stacked depth
    bidirectional=True,    # forward + backward
    batch_first=True,
)

x       = torch.randn(32, 50, 300)  # (batch, seq_len, emb_dim)
out, (h, c) = bilstm(x)
print(out.shape)   # (32, 50, 512) -- 512 = 256 × 2 directions

# A tagging head maps each 512-d state to tag logits
tagger = nn.Linear(512, 9)  # e.g. 9 NER tags (BIO scheme)
logits = tagger(out)           # (32, 50, 9)

Classification produces one label per input. But translation, summarisation, and dialogue must produce a variable-length output sequence from a variable-length input sequence. The sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014; Cho et al., 2014) solved this with two RNNs: an encoder that compresses the input into a context vector, and a decoder that generates the output from it.

The Encoder-Decoder Architecture

text•Vanilla seq2seq (Pseudocode)
ENCODER  (reads the source sequence)
  h^enc ← 0
  for t = 1 … T_src:
    h^enc ← LSTM_enc(x_t, h^enc)
  context ← h^enc_{T_src}       # final state = fixed-size summary

DECODER  (generates the target sequence)
  h^dec ← context                # initialise from encoder
  y_0 ← <START>
  for t = 1 … T_tgt:
    h^dec ← LSTM_dec(embed(y_{t-1}), h^dec)
    y_t   ← argmax softmax(W · h^dec)  # emit next token
    if y_t == <END>: break

Python•seq2seq for translation in PyTorch
import torch; import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hidden, batch_first=True)
    def forward(self, src):
        out, (h, c) = self.lstm(self.embed(src))
        return out, (h, c)     # out used later for attention

class Decoder(nn.Module):
    def __init__(self, vocab, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hidden, batch_first=True)
        self.proj  = nn.Linear(hidden, vocab)
    def forward(self, tok, state):
        out, state = self.lstm(self.embed(tok), state)
        return self.proj(out), state

# Training uses 'teacher forcing': feed the TRUE previous target token,
# not the model's own (possibly wrong) prediction. This stabilises early training
# but creates exposure bias -- the model never practises recovering from its
# own mistakes (a problem scheduled sampling and RL fine-tuning later address).

⚠️

Pitfall: The Fixed-Context Bottleneck

Vanilla seq2seq compresses the ENTIRE source sentence into a single fixed-size context vector. For a 5-word sentence this is fine; for a 50-word sentence, the encoder must cram 50 words of meaning into, say, 512 numbers. Translation quality collapses as sentences grow longer (Cho et al., 2014 measured the degradation precisely).

This bottleneck is the single problem that attention was invented to solve — and attention, in turn, became the seed of the entire Transformer revolution.

Bahdanau et al. (2014) asked a simple question: why force the decoder to rely on a single context vector? Instead, let it look back at all encoder hidden states and compute a fresh, weighted context vector at every decoding step. This is attention — and it is the conceptual core of the Transformer.

The Attention Mechanism

At decoder step t, attention computes a context vector cₜ as a weighted sum of all encoder states, where the weights depend on how relevant each source position is to the current decoder state:

eₜⱼ = score(s_{t-1}, h_j) αₜⱼ = softmax_j(eₜⱼ) cₜ = Σ_j αₜⱼ · h_j

The alignment score eₜⱼ measures the compatibility between decoder state s_{t-1} and encoder state h_j. The softmax turns scores into a probability distribution (the attention weights αₜⱼ), and the context cₜ is their weighted average of encoder states.

Two Scoring Functions

Variant	Score function	Notes
Bahdanau (additive)	vᵀ tanh(W₁ s + W₂ h)	Small MLP; original 2014 formulation
Luong (dot)	sᵀ h	Simplest; requires matching dimensions
Luong (general)	sᵀ W h	Learned bilinear; flexible dimensions
Scaled dot-product	sᵀ h / √d	The Transformer's choice (Ch. 12)

Python•Bahdanau attention from scratch
import torch; import torch.nn as nn; import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden)    # projects decoder state
        self.W2 = nn.Linear(hidden, hidden)    # projects encoder states
        self.v  = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H)   enc_outputs: (B, T_src, H)
        dec = self.W1(dec_state).unsqueeze(1)    # (B, 1, H)
        enc = self.W2(enc_outputs)              # (B, T_src, H)
        scores = self.v(torch.tanh(dec + enc)).squeeze(-1)  # (B, T_src)
        weights = F.softmax(scores, dim=-1)       # (B, T_src) attention
        context = torch.bmm(weights.unsqueeze(1), enc_outputs)  # (B,1,H)
        return context.squeeze(1), weights     # context + weights for viz

# The attention weights are interpretable: plotting the (T_tgt × T_src)
# weight matrix for a translation reveals soft word alignments -- e.g.
# 'la maison bleue' → 'the blue house' shows the diagonal-ish alignment
# with the adjective-noun swap clearly visible.

✧

Attention as Soft, Differentiable Memory Lookup

Attention is a content-addressable memory: the query (decoder state) is compared against keys (encoder states) to retrieve a weighted blend of values (also encoder states). Unlike a hard lookup, it is fully differentiable — the model learns what to attend to via gradient descent.

This query-key-value framing, made explicit here, is EXACTLY the self-attention of the Transformer. The only differences in Chapter 12 will be: (1) queries, keys, and values are all derived from the same sequence (self-attention), (2) the scaled dot-product score replaces the additive MLP, and (3) multiple attention 'heads' run in parallel. The mechanism you just built IS the Transformer's beating heart.

Attention solved the bottleneck, but the underlying RNN retained two fatal limitations that no amount of gating could fix. Recognising these limits explains precisely why the Transformer abandoned recurrence entirely — keeping only the attention.

Limitation	Cause	Transformer's fix
No parallelism	hₜ depends on hₜ₋₁: must compute sequentially	Self-attention: all positions at once
Long paths	Info from pos 1 to pos T travels T steps	Attention: any pair in O(1) hops
Limited memory	Fixed-size state compresses all history	Attention: direct access to all positions
Slow training	Sequential = poor GPU utilisation	Parallel = saturates GPU tensor cores

The parallelism limitation was the decisive one. An RNN cannot compute hₜ until it has computed hₜ₋₁; a sequence of length 1,000 requires 1,000 sequential steps regardless of how many GPUs you have. The Transformer computes representations for all positions simultaneously, turning a sequential bottleneck into a single large matrix multiplication — perfectly suited to GPU hardware.

The Path Length Argument

To connect information between two positions i and j, an RNN must propagate it through |i − j| recurrent steps, with the gradient attenuating at each step. Attention connects any two positions in a single step — a path length of O(1) rather than O(distance). This is why Transformers learn long-range dependencies that defeated even LSTMs.

✧

History: “Attention Is All You Need”

The 2017 Transformer paper made a radical claim in its title: you can throw away the entire recurrent machinery and keep only the attention. The community was sceptical — surely sequences need some notion of recurrence or convolution?

They did not. The Transformer matched and then crushed RNN performance on translation, while training in a fraction of the time thanks to full parallelism. Within three years, recurrence had all but vanished from NLP research. Part III tells this story in full.

Even with LSTMs and attention, training recurrent models well required a collection of practical techniques. Many of these — gradient clipping, learning-rate scheduling, teacher forcing — carry directly into Transformer training.

Truncated Backpropagation Through Time

Full BPTT through a 10,000-token document is infeasible: it requires storing all 10,000 hidden states and backpropagating through all of them. Truncated BPTT processes the sequence in chunks of k steps (e.g., k = 35), backpropagating only within each chunk while carrying the hidden state forward across chunks.

Python•Truncated BPTT training loop
import torch; import torch.nn as nn

model   = nn.LSTM(256, 512, batch_first=True)
opt     = torch.optim.Adam(model.parameters(), lr=1e-3)
bptt    = 35                            # truncation length
hidden  = None

for chunk in data_chunks:            # each chunk is (B, 35, 256)
    # Detach hidden state: carry VALUES forward but cut the gradient graph
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)

    out, hidden = model(chunk, hidden)
    loss = compute_loss(out, targets)

    opt.zero_grad()
    loss.backward()                    # gradient only flows within this chunk
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # ESSENTIAL
    opt.step()

# Why detach? Without it, the graph grows unboundedly across chunks and
# you'd backprop through the entire history -- defeating the truncation.

Essential RNN Training Checklist

•Gradient clipping: clip the global gradient norm to ~5.0. Non-negotiable for RNNs — prevents the exploding-gradient NaNs.

•Forget-gate bias = 1: initialise the LSTM forget gate bias to 1 so the network remembers by default early in training.

•Orthogonal initialisation of W_hh: keeps recurrent Jacobian singular values near 1 at initialisation, delaying vanishing/exploding.

•Variational dropout: apply the SAME dropout mask at every time step, not a fresh one per step (Gal & Ghahramani, 2016).

•Teacher forcing with scheduled sampling: gradually replace true previous tokens with model predictions to reduce exposure bias.

•Layer normalisation: stabilises hidden-state statistics across time steps, easing optimisation.

Architecture Quick-Reference

Model	Key feature	Solves	Limitation
Vanilla RNN	Shared recurrent weights	Variable length	Vanishing gradients
LSTM	Cell state + 3 gates	Long-range memory	Sequential; slow
GRU	2 gates, merged state	Memory, fewer params	Sequential; slow
BiRNN	Forward + backward pass	Full context	2× compute; no streaming
seq2seq	Encoder-decoder	Variable output	Fixed-context bottleneck
seq2seq + attn	Weighted source lookup	Bottleneck removed	Still sequential

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

✎

Exercise 1: Pen & Paper

Write out the full BPTT gradient ∂L/∂W_hh for a 3-step vanilla RNN. Identify which term contains the product of Jacobians responsible for vanishing gradients.

✎

Exercise 2: Pen & Paper

The tanh derivative is 1 − tanh²(a). Show that it is at most 1 and approaches 0 as |a| → ∞. Explain how this compounds the vanishing gradient problem over many time steps.

✎

Exercise 3: Pen & Paper

For the LSTM cell update cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, compute ∂cₜ/∂cₜ₋₁. Explain why this protects gradients when fₜ ≈ 1, contrasting with the vanilla RNN's ∂hₜ/∂hₜ₋₁.

✎

Exercise 4: Pen & Paper

Count the parameters in an LSTM cell with input dim D and hidden dim H. Repeat for a GRU. Verify the GRU has 25% fewer recurrent parameters.

✎

Exercise 5: Pen & Paper

Show that if the LSTM forget gate is always 1 and the input gate always 0, the cell state is perfectly preserved: cₜ = c₀ for all t. What does this 'constant error carousel' achieve?

✎

Exercise 6: Pen & Paper

In Bahdanau attention, the alignment scores eₜⱼ are passed through a softmax. Prove that the resulting context vector cₜ is a convex combination of the encoder states (lies in their convex hull).

✎

Exercise 7: Pen & Paper

Compare the path length between positions i and j (|i−j| apart) in: (a) a vanilla RNN, (b) a bidirectional RNN, (c) a Transformer with self-attention. Why does path length matter for learning long-range dependencies?

✎

Exercise 8: Pen & Paper

A seq2seq encoder compresses a T-token sentence into a fixed H-dimensional vector. Argue information-theoretically why translation quality must degrade as T grows for fixed H.

✎

Exercise 9: Pen & Paper

Derive the gradient of the Luong dot-product attention score sᵀh with respect to the decoder state s and encoder state h. Why is this cheaper to compute than Bahdanau's additive score?

✎

Exercise 10: Pen & Paper

Explain why truncated BPTT with truncation length k cannot learn dependencies longer than k steps within a single backward pass, even though the hidden state carries information forward indefinitely.

✎

Exercise 11: Code

Implement a vanilla RNN character-level language model from scratch (NumPy). Train it on a small text. Demonstrate empirically that it fails to balance parentheses opened more than ~15 characters earlier.

✎

Exercise 12: Code

Implement the LSTM cell forward AND backward pass from scratch. Verify your gradients with the finite-difference checker from Chapter 5.

✎

Exercise 13: Code

Implement a GRU cell from scratch. On a synthetic 'copy task' (reproduce a sequence after a delay of N steps), compare GRU vs. vanilla RNN as N grows from 5 to 100. Plot accuracy vs. N.

✎

Exercise 14: Code Lab

Build a seq2seq model (PyTorch) for a toy translation task (e.g., reversing or sorting number sequences). Train WITHOUT attention. Plot BLEU/accuracy vs. source length.

✎

Exercise 15: Code Lab

Add Bahdanau attention to your exercise-14 model. Re-plot accuracy vs. source length and show the bottleneck is removed. Visualise the attention weight matrix for 3 examples.

✎

Exercise 16: Code

Demonstrate the gradient highway: train an LSTM and a vanilla RNN on a sequence with a dependency 80 steps long (the output depends on input at t=1). Plot the gradient norm reaching t=1 for both during training.

✎

Exercise 17: Code

Implement a bidirectional LSTM tagger for part-of-speech tagging on a small annotated corpus. Compare unidirectional vs. bidirectional F1 score. Which tags benefit most from right-context?

✎

Exercise 18: Code (Challenge)

Build a complete attention-based seq2seq translation system on a small real dataset (e.g., a Tatoeba language pair, ≤10k sentence pairs). Implement teacher forcing, beam search decoding, and attention visualisation. Report BLEU. Then reflect in writing: which limitations from Section 9.9 did you encounter, and how would a Transformer address each?

Further reading: “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) — the original LSTM. “Learning Phrase Representations using RNN Encoder-Decoder” (Cho et al., 2014) — GRU and seq2seq. “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2014) — the attention paper. “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014). Christopher Olah's blog post “Understanding LSTM Networks” for unmatched visual intuition. Andrej Karpathy's “The Unreasonable Effectiveness of Recurrent Neural Networks.”

Part II Complete: Classical Machine Learning

Ch. 6	Discriminative Models	logistic regression, SVMs, trees, and ensembles — learning decision boundaries directly.
Ch. 7	Generative Models	Naïve Bayes, GMMs, HMMs, and the EM algorithm — modelling how data is generated.
Ch. 8	Embeddings	Word2Vec, GloVe, fastText — dense vector spaces where geometry encodes meaning.
Ch. 9	Sequence Models	RNNs, LSTMs, GRUs, and seq2seq with attention — the direct ancestors of the Transformer.

Part II ends exactly where Part III begins. You have built attention by hand as a soft, differentiable memory lookup, and you have seen the three limitations — no parallelism, long gradient paths, fixed-size memory — that doomed recurrence. The Transformer, the subject of Part III, makes one audacious move: it keeps the attention and discards everything else. Self-attention, multi-head attention, positional encodings, and residual streams — every component you are about to meet is a direct response to a limitation you now understand from first principles. The Transformer will not feel like a black box; it will feel inevitable.

✎ 18 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 8. Embeddings & Representation Learning

NextCh 10. Neural Network Fundamentals

→

Sequence Models: RNNs & LSTMs

Learning Objectives

Why Sequences Need Memory

The Three Requirements of a Sequence Model

The Recurrence Relation

The Vanilla RNN

Backpropagation Through Time (BPTT)

The Vanishing & Exploding Gradient Problem

The Gradient Chain

Long Short-Term Memory (LSTM)

The Cell State and Three Gates

The Gated Recurrent Unit (GRU)

Two Gates: Reset and Update

Deep & Bidirectional RNNs

Stacked (Deep) RNNs

Bidirectional RNNs

Sequence-to-Sequence Models

The Encoder-Decoder Architecture

Attention: Breaking the Bottleneck

The Attention Mechanism

Two Scoring Functions

The Limits of Recurrence

The Path Length Argument

Practical RNN Training

Truncated Backpropagation Through Time

Essential RNN Training Checklist

Chapter Summary & Exercises

Architecture Quick-Reference

Exercises

Part II Complete: Classical Machine Learning