Part III: The Transformer
Chapter 11

Attention Mechanisms

Self-attention, multi-head attention, and the attention equation
18 Exercises
11.1

Training a neural network means computing the gradient of a scalar loss with respect to millions or billions of parameters. The only mathematical tool required is the chain rule — but applied at this scale, it demands a systematic algorithm. That algorithm is backpropagation, and the machinery that automates it is automatic differentiation. By the end of this chapter you will have built both from scratch.

The Single-Variable Chain Rule

Recall the elementary chain rule: if y = f(u) and u = g(x), then dy/dx = (dy/du)(du/dx). Gradients multiply along a chain of dependencies. Backpropagation is this rule applied to a graph rather than a line.

textChain rule, scaling up
Single chain:    dy/dx = (dy/du)(du/dx)

Multivariate:    ∂L/∂x = Σᵢ (∂L/∂uᵢ)(∂uᵢ/∂x)    # sum over all paths

Vector form:     ∇ₓ L = Jᵀ · ∇ᵤ L              # J = Jacobian ∂u/∂x

The crucial generalization is the multivariate chain rule: when a variable x influences the loss through multiple intermediate paths, the gradients from all paths sum. This summation is the single most error-prone aspect of hand-written backprop — and the thing autograd handles automatically.

Intuition: Gradients as Blame Assignment
Think of the backward pass as assigning blame. The loss is large; which parameters are responsible, and in which direction should each move to reduce it? The chain rule distributes this blame backward through the network, with each operation passing its share to its inputs.
An operation receives blame from above (the gradient of the loss w.r.t. its output), multiplies by its local Jacobian (how its output depends on its input), and passes the result downstream. That is the entirety of backpropagation.
11.2

A computational graph represents a computation as nodes (variables and operations) connected by edges (data dependencies). It makes the structure of a computation explicit and is the data structure on which automatic differentiation operates.

A Worked Example

Consider the expression L = (a·b + c)². We can decompose it into elementary operations, each a node in the graph:

textDecomposing L = (a·b + c)²
d = a * b        # multiply node
e = d + c        # add node
L = e * e        # square node

The forward pass evaluates these in topological order (inputs first). The backward pass traverses them in reverse, applying the chain rule at each node. Here is the graph drawn as a stack from inputs (bottom) to loss (top):

Arch Stack: Computational graph for L = (a·b + c)²

L = e · esquare
e = d + cadd
d = a · bmultiply
inputs: a, b, cleaves

Local Gradients at Each Node

Each operation knows only its own local gradient — how its output changes with its inputs. Backprop chains these together:

NodeForwardLocal gradients
multiply d=a·bd = a*b∂d/∂a = b, ∂d/∂b = a
add e=d+ce = d+c∂e/∂d = 1, ∂e/∂c = 1
square L=e²L = e*e∂L/∂e = 2e

Composing by the chain rule: ∂L/∂a = (∂L/∂e)(∂e/∂d)(∂d/∂a) = 2e · 1 · b = 2(a·b+c)·b. The backward pass computes exactly this product, but mechanically — each node multiplies the incoming gradient by its local gradient and passes it on.

Static vs Dynamic Graphs
TensorFlow 1.x used static graphs: you defined the entire graph first, then ran data through it. PyTorch popularized dynamic graphs (define-by-run): the graph is built on the fly as operations execute, so it can change every iteration and supports ordinary Python control flow.
The autograd engine you build in this chapter is dynamic — like PyTorch. Each operation, when called, records itself in the graph. This is why PyTorch feels like ordinary NumPy: the graph construction is invisible, happening automatically as a side effect of computation.
11.3

Automatic differentiation comes in two flavors, distinguished by the direction in which they propagate derivatives through the graph. Understanding why deep learning uses reverse mode (backpropagation) reveals a deep efficiency argument.

Forward-mode ADReverse-mode AD (backprop)
Propagates derivatives input → outputPropagates derivatives output → input
Computes ∂(all outputs)/∂(one input)Computes ∂(one output)/∂(all inputs)
One pass per input variableOne pass per output variable
Efficient when few inputs, many outputsEfficient when many inputs, one output
No need to store intermediate valuesMust store forward activations
Used in: sensitivity analysis, JacobiansUsed in: ALL deep learning training

Why Deep Learning Uses Reverse Mode

A neural network has one scalar output (the loss) and millions of inputs (the parameters). Forward mode would require one full pass per parameter — millions of passes. Reverse mode computes the gradient with respect to all parameters in a single backward pass. This asymmetry is decisive.

textThe cost asymmetry (n inputs, m outputs)
Forward mode:  cost ∝ n passes   (one per input)
Reverse mode:  cost ∝ m passes   (one per output)

Neural net training:  n = 10⁹ params,  m = 1 scalar loss
⇒ Reverse mode is ~10⁹× cheaper. This is why backprop won.
ML Connection: The Cost of a Backward Pass
A practical rule of thumb: the backward pass costs roughly 2× the forward pass in compute. So one training step (forward + backward) is about 3× a single forward pass.
This 3× factor underlies the famous '6N' rule for training FLOPs (Chapter 16): training a model with N parameters on D tokens costs approximately 6ND floating-point operations — 2 for the forward pass and 4 for the backward pass, per parameter per token.
11.4

Before automating backprop, we derive it by hand for a two-layer MLP. This grounds the autograd engine you will build next, and the vector-Jacobian products you derive here are exactly what each autograd node implements.

The Network

textTwo-layer MLP with softmax-cross-entropy
z₁ = X W₁ + b₁       # (N, H)   pre-activation
a₁ = ReLU(z₁)       # (N, H)   hidden activation
z₂ = a₁ W₂ + b₂      # (N, K)   logits
p  = softmax(z₂)    # (N, K)   probabilities
L  = CE(p, y)       # scalar   loss

The Backward Pass

We compute gradients in reverse, starting from the loss. The softmax-cross-entropy gradient (derived in Chapter 4) gives the elegant starting point ∂L/∂z₂ = (p − y)/N:

textBackward pass (reverse order)
∂L/∂z₂ = (p - y) / N              # (N, K)  softmax-CE shortcut
∂L/∂W₂ = a₁ᵀ (∂L/∂z₂)             # (H, K)
∂L/∂b₂ = Σ_rows (∂L/∂z₂)           # (K,)
∂L/∂a₁ = (∂L/∂z₂) W₂ᵀ             # (N, H)
∂L/∂z₁ = (∂L/∂a₁) ⊙ [z₁ > 0]      # (N, H)  ReLU mask
∂L/∂W₁ = Xᵀ (∂L/∂z₁)              # (D, H)
∂L/∂b₁ = Σ_rows (∂L/∂z₁)           # (H,)

[Missing Component: shapeNote]

PythonMLP backward pass from the derivation
import numpy as np

def mlp_backward(X, y, cache):
    """y: integer labels. cache holds z1, a1, p from forward pass."""
    z1, a1, p, W2 = cache
    N = len(y)

    # Output layer: softmax-CE gradient
    dz2 = p.copy(); dz2[np.arange(N), y] -= 1; dz2 /= N  # (N, K)
    dW2 = a1.T @ dz2                              # (H, K)
    db2 = dz2.sum(0)                            # (K,)

    # Hidden layer: backprop through ReLU
    da1 = dz2 @ W2.T                              # (N, H)
    dz1 = da1 * (z1 > 0)                        # ReLU mask
    dW1 = X.T @ dz1                               # (D, H)
    db1 = dz1.sum(0)                            # (H,)

    return dW1, db1, dW2, db2

# Verify against numerical gradient (Chapter 5's checker) before trusting it.
# This hand-derivation is what autograd will compute automatically.
11.5

We now build a working autograd engine for scalars, in the spirit of Karpathy's micrograd. Each Value object holds a number, its gradient, and a reference to the function that computes its local backward pass. Operations build the graph automatically; a single backward() call propagates gradients through the whole graph.

PythonCode Lab: a complete scalar autograd engine
class Value:
    """A scalar that tracks its gradient through a computational graph."""
    def __init__(self, data, _children=(), _op=''):
        self.data  = data
        self.grad  = 0.0                  # ∂L/∂self, accumulated
        self._backward = lambda: None  # local backward closure
        self._prev = set(_children)      # graph edges
        self._op   = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad          # ∂(a+b)/∂a = 1
            other.grad += out.grad          # ∂(a+b)/∂b = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad  # ∂(ab)/∂a = b
            other.grad += self.data  * out.grad  # ∂(ab)/∂b = a
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')
        def _backward():
            self.grad += (out.data > 0) * out.grad  # 1 if active
        out._backward = _backward
        return out

    def backward(self):
        # 1. Build topological order of all nodes behind self
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev: build(child)
                topo.append(v)
        build(self)
        # 2. Seed the output gradient and backprop in reverse topo order
        self.grad = 1.0
        for v in reversed(topo): v._backward()

Using the Engine

PythonVerifying autograd against the hand derivative
# Compute L = (a*b + c)^2 and its gradients automatically
a = Value(2.0); b = Value(-3.0); c = Value(10.0)
d = a * b          # -6
e = d + c          #  4
L = e * e          # 16
L.backward()

print(f"L = {L.data}")            # 16.0
print(f"dL/da = {a.grad}")        # 2*e*b = 2*4*(-3) = -24
print(f"dL/db = {b.grad}")        # 2*e*a = 2*4*2    =  16
print(f"dL/dc = {c.grad}")        # 2*e*1 = 2*4      =   8
# All match the hand-derived gradients from Section 11.2. The engine works.
⚠️
Pitfall: Why += and not =
Notice every _backward uses += to accumulate gradients, never =. This is the multivariate chain rule in action: if a node is used in multiple places (e.g., e*e uses e twice), gradients from all uses must sum.
Using = instead of += is the most common autograd bug — it silently discards all but the last gradient contribution. PyTorch has the same behaviour, which is why you must call optimizer.zero_grad() between steps: gradients accumulate by default and must be explicitly reset.
11.6

The backward pass cannot visit nodes in arbitrary order. A node's gradient is only complete once all nodes that depend on it have contributed. Topological sorting guarantees this: process each node only after everything downstream of it has been processed.

Why Order Matters

Consider a node x used by two operations whose results both feed the loss. The gradient ∂L/∂x is the sum of contributions from both paths. If we processed x before both contributors had run, we would propagate an incomplete gradient downstream — corrupting every node below x.

textTopological backward pass (Pseudocode)
# Phase 1: build topological order via DFS post-order
topo ← [];  visited ← {}
function build(v):
    if v not in visited:
        visited.add(v)
        for child in v.prev: build(child)
        topo.append(v)        # post-order: children before parent
build(loss)

# Phase 2: backprop in REVERSE topological order
loss.grad ← 1.0              # seed: ∂L/∂L = 1
for v in reversed(topo):
    v._backward()            # all downstream grads are ready

The Tape (Wengert List)

Reverse-mode AD frameworks record operations on a 'tape' (also called a Wengert list) during the forward pass. The tape is exactly the topological order: replaying it in reverse gives the backward pass. PyTorch builds this tape implicitly through the grad_fn attribute on each tensor.

ML Connection: PyTorch's grad_fn
Every PyTorch tensor created by an operation has a .grad_fn attribute pointing to the function that created it — this is the tape. When you call loss.backward(), PyTorch walks this graph backward in topological order, exactly like our build() function.
Try it: x = torch.randn(3, requires_grad=True); y = (x * 2).sum(); print(y.grad_fn). You will see , and following .next_functions reveals the whole graph. Our Value engine and PyTorch's autograd are the same algorithm at different scales.
11.7

Real networks operate on tensors, not scalars. Extending autograd to tensors requires computing vector-Jacobian products (VJPs) rather than scalar derivatives, and handling the subtleties of broadcasting. The key efficiency insight: we never materialize the full Jacobian — we compute its product with the incoming gradient directly.

The Vector-Jacobian Product

For a tensor operation y = f(x), the Jacobian ∂y/∂x can be enormous (for a 1000×1000 matmul it has a trillion entries). Reverse-mode AD never builds it: it computes the VJP, the product of the upstream gradient with the Jacobian, which usually has a simple closed form.

textVJPs for common operations
matmul  Y = X W:
    ∂L/∂X = (∂L/∂Y) Wᵀ        ∂L/∂W = Xᵀ (∂L/∂Y)

elementwise  y = f(x):
    ∂L/∂x = (∂L/∂y) ⊙ f'(x)

sum  s = Σ x:
    ∂L/∂x = (∂L/∂s) broadcast to shape of x

Broadcasting and Gradient Reduction

When an operation broadcasts a smaller tensor against a larger one (e.g., adding a bias vector to a batch), the backward pass must sum the gradient over the broadcasted dimensions to restore the original shape. Forgetting this is a frequent source of shape-mismatch bugs.

PythonTensor autograd: matmul and broadcast-aware add
import numpy as np

class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._prev = set(_children)

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, (self, other), '@')
        def _backward():
            self.grad  += out.grad @ other.data.T   # (∂L/∂Y) Wᵀ
            other.grad += self.data.T @ out.grad   # Xᵀ (∂L/∂Y)
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += _unbroadcast(out.grad, self.data.shape)
            other.grad += _unbroadcast(out.grad, other.data.shape)
        out._backward = _backward
        return out

def _unbroadcast(grad, shape):
    """Sum grad over broadcasted dims to match `shape`."""
    while grad.ndim > len(shape): grad = grad.sum(0)  # extra leading dims
    for i, s in enumerate(shape):
        if s == 1: grad = grad.sum(i, keepdims=True)  # broadcast dim
    return grad

[Missing Component: shapeNote]

11.8

Two operations have backward passes subtle enough to derive carefully: softmax (where every output depends on every input) and LayerNorm (where the normalization couples all features). Both appear in every Transformer, and both are where hand-written backprop most often goes wrong.

Softmax Backward

The softmax Jacobian is dense: ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ). But the VJP — the product with the upstream gradient — simplifies beautifully:

textSoftmax VJP
p = softmax(z)
∂L/∂z = p ⊙ (∂L/∂p - Σⱼ (∂L/∂p)ⱼ pⱼ)

# In words: subtract the p-weighted mean of the upstream gradient,
# then scale by p. One pass, no Jacobian materialized.
PythonSoftmax and LayerNorm backward from scratch
import numpy as np

def softmax_backward(dout, p):
    """dout: ∂L/∂p (N,K).  p: softmax output (N,K). Returns ∂L/∂z."""
    # Per-row: dz = p * (dout - sum(dout * p))
    weighted = (dout * p).sum(axis=1, keepdims=True)
    return p * (dout - weighted)

def layernorm_backward(dout, x, gamma, eps=1e-5):
    """Backward through y = gamma * (x - mu)/sqrt(var+eps)."""
    N, D = x.shape
    mu  = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    std = np.sqrt(var + eps)
    xhat = (x - mu) / std

    dgamma = (dout * xhat).sum(0)
    dbeta  = dout.sum(0)

    # Gradient w.r.t. x (the coupled part -- derive carefully!)
    dxhat = dout * gamma
    dx = (1.0/D) / std * (
        D * dxhat
        - dxhat.sum(1, keepdims=True)
        - xhat * (dxhat * xhat).sum(1, keepdims=True)
    )
    return dx, dgamma, dbeta

# ALWAYS verify these against the numerical gradient checker (Chapter 5).
# The LayerNorm backward is the single most bug-prone derivation in
# Transformer implementations -- the coupling through mu and var is subtle.
⚠️
Verify Every Hand-Derived Backward
The LayerNorm and softmax backward passes are notoriously easy to get subtly wrong — a missing term might still let the network train, just slightly worse, making the bug nearly invisible.
The only defense is the finite-difference gradient checker from Chapter 5. Run it on every custom backward before trusting it. This is not optional rigor; it is the standard practice that prevents silent, accuracy-eroding bugs.
11.9

Reverse-mode AD has a hidden cost: it must store every intermediate activation from the forward pass to compute gradients in the backward pass. For deep Transformers with long sequences, these stored activations dominate memory — often exceeding the memory for the parameters themselves.

The Activation Memory Problem

A Transformer with L layers stores activations for all L layers during the forward pass, holding them until the backward pass consumes them. Activation memory scales as O(L × batch × seq_len × d) — and for long contexts this becomes the binding constraint on model size.

StrategyMemoryCompute
Store all activationsO(L) — full1× forward (baseline)
Gradient checkpointingO(√L) — sublinear~1.33× forward (recompute)
Recompute everythingO(1) — minimal2× forward (recompute all)

Gradient checkpointing (also called activation checkpointing) stores activations only at a subset of layers — the checkpoints. During the backward pass, the activations between checkpoints are recomputed on the fly from the nearest checkpoint. Storing checkpoints every √L layers gives O(√L) memory at the cost of one extra forward pass per segment.

PythonGradient checkpointing in PyTorch
import torch
from torch.utils.checkpoint import checkpoint

# Without checkpointing: stores all activations (high memory)
def forward_normal(self, x):
    for block in self.blocks:
        x = block(x)            # every activation kept for backward
    return x

# With checkpointing: recompute activations during backward (low memory)
def forward_checkpointed(self, x):
    for block in self.blocks:
        x = checkpoint(block, x, use_reentrant=False)
        # block's internal activations are NOT stored;
        # they are recomputed when backward reaches this block
    return x

# Trade-off: ~33% more compute for training, but enables much larger
# models / longer sequences on the same GPU. Standard for large LLM training.
Train Note: Checkpointing Is Standard at Scale
Nearly every large-scale LLM training run uses gradient checkpointing. The 33% compute overhead is a worthwhile price for fitting a much larger model or much longer sequence into limited GPU memory.
Combined with mixed precision (Chapter 5) and the distributed strategies of Chapter 18 (ZeRO, tensor/pipeline parallelism), checkpointing is part of the standard toolkit that makes training billion-parameter models feasible.
11.10

Backprop Quick-Reference

OperationForwardVJP (backward)
add y=a+ba + bda = dy, db = dy (unbroadcast)
mul y=a⊙ba * bda = dy⊙b, db = dy⊙a
matmul Y=XWX @ WdX = dY@Wᵀ, dW = Xᵀ@dY
relu y=relu(x)max(0,x)dx = dy ⊙ [x>0]
softmax psoftmax(z)dz = p⊙(dp - Σ(dp⊙p))
sum s=Σxx.sum()dx = ds broadcast to x.shape
layernorm(x-μ)/σ·γ+βcoupled — see Section 11.8

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–18 require code.

Exercise 1: Pen & Paper
Draw the computational graph for f(x,y) = (x+y)·(x·y), label each node, and write the local gradient at each.
Exercise 2: Derive
For f(x,y) = (x+y)·(x·y) at x=2, y=3, compute ∂f/∂x and ∂f/∂y by hand using the chain rule. Verify by direct differentiation.
Exercise 3: Pen & Paper
Explain why backprop uses gradient ACCUMULATION (+=) when a variable feeds multiple operations. Give an example where = instead of += gives the wrong answer.
Exercise 4: Derive
Derive the vector-Jacobian product for matmul Y = XW: show ∂L/∂X = (∂L/∂Y)Wᵀ and ∂L/∂W = Xᵀ(∂L/∂Y), checking all shapes.
Exercise 5: Derive
Derive the softmax VJP ∂L/∂z = p⊙(∂L/∂p − Σ(∂L/∂p ⊙ p)) starting from the Jacobian ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ).
Exercise 6: Pen & Paper
Compare the cost of computing a full Jacobian for a 1000×1000 matmul versus computing its VJP. Why does reverse-mode AD never materialize the Jacobian?
Exercise 7: Pen & Paper
For a network with n=10⁹ parameters and one scalar loss, estimate the relative cost of forward-mode vs reverse-mode AD. Justify why deep learning uses reverse mode.
Exercise 8: Derive
Derive the full LayerNorm backward pass ∂L/∂x, accounting for the dependence of both μ and σ² on every element of x.
Exercise 9: Pen & Paper
Explain why the backward pass must visit nodes in reverse topological order. Construct a small graph where a wrong order produces an incorrect gradient.
Exercise 10: Pen & Paper
Gradient checkpointing with checkpoints every √L layers gives O(√L) memory. Sketch why, and state the compute overhead.
Exercise 11: Code
Implement the scalar Value autograd engine from Section 11.5. Add __pow__, __neg__, and tanh. Verify all gradients against finite differences.
Exercise 12: Code
Use your Value engine to build and train a tiny MLP (2→4→1) on XOR. Confirm it reaches near-zero loss using only your autograd.
Exercise 13: Code
Implement the Tensor autograd class with matmul, broadcast-aware add, ReLU, and sum. Verify the gradient-shape-equals-parameter-shape invariant holds.
Exercise 14: Code Lab
Implement softmax_backward and layernorm_backward. Verify BOTH against the Chapter 5 finite-difference checker to within 1e-6 relative error.
Exercise 15: Code
Build a 3-layer MLP entirely with your Tensor autograd and train it on a small classification dataset. Compare its gradients to PyTorch's autograd on the same data.
Exercise 16: Code
Demonstrate the accumulation bug: implement a version of add that uses = instead of += and show it produces wrong gradients for f = x + x.
Exercise 17: Code Lab
Measure activation memory: train a deep MLP (20 layers) with and without gradient checkpointing (use torch.utils.checkpoint). Report peak GPU memory and wall-clock time for both.
Exercise 18: Code (Challenge)
Extend your Tensor autograd engine to support a complete Transformer feed-forward block: Linear → GELU → Linear with LayerNorm and a residual connection. Verify end-to-end gradients against PyTorch. This is the autograd you will rely on conceptually for the rest of the book.

Further reading: Andrej Karpathy's “micrograd” and his lecture “The spelled-out intro to neural networks and backpropagation” — the inspiration for this chapter's scalar engine. “Automatic Differentiation in Machine Learning: a Survey” (Baydin et al., 2018) for the full taxonomy. The PyTorch autograd documentation and the “Autograd mechanics” note. “Training Deep Nets with Sublinear Memory Cost” (Chen et al., 2016) for gradient checkpointing.


Next → Chapter 12: Attention Mechanisms

You can now compute gradients through any computational graph — the engine that trains every neural network. Chapter 12 builds the one component missing from your Transformer block: attention. We will start from the query-key-value memory lookup you met in Chapter 9, derive scaled dot-product attention from first principles, and build multi-head self-attention from scratch. This is the mechanism that defines the Transformer, and with your autograd understanding you will see exactly how gradients flow through it.

18 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →