Part III: The Transformer

Chapter 13

The Transformer Architecture

Positional encodings, residual streams, the full encoder and decoder stacks, and a complete working Transformer built from scratch — every line understood.

20 Exercises

Learning Objectives

1.	Assemble attention, FFN, normalization, and residuals into a complete Transformer block.
2.	Explain why position must be injected and compare sinusoidal, learned, and rotary encodings.
3.	Understand the residual stream as the Transformer's central communication channel.
4.	Distinguish encoder-only, decoder-only, and encoder-decoder architectures and their uses.
5.	Trace data flow from token IDs through embeddings, blocks, and the unembedding to logits.
6.	Implement a complete decoder-only Transformer (GPT-style) from scratch.
7.	Understand weight tying, the LM head, and how generation works at inference.
8.	Map the architecture onto real models: GPT, BERT, T5, and LLaMA.

You have built every component: linear layers, activations, normalization, dropout, residuals (Chapter 10), the autograd to train them (Chapter 11), and multi-head attention (Chapter 12). This chapter assembles them into the complete Transformer. We start with the bird's-eye view, then build each piece, then write the whole thing from scratch.

The Transformer at a Glance

The Three-Stage Structure

Every Transformer, regardless of variant, has the same three-stage structure: an input stage that turns tokens into vectors, a stack of identical processing blocks, and an output stage that turns vectors back into token predictions.

Arch Stack: Transformer: the three stages (decoder-only)

Logits over vocabulary	(T, V)
Unembedding / LM head	(d → V)
Final LayerNorm	normalize
Transformer Block × N	the stack
+ Positional encoding	inject order
Token embedding	(V → d)
input token IDs	(T,)

The middle stage — the stack of N identical blocks — does the heavy lifting. Each block refines the representation, mixing context via attention and processing it via the feed-forward network. The input and output stages are comparatively simple: a lookup table in, a linear projection out.

✧

Same Block, Stacked Deep

A Transformer is mostly the same block repeated. GPT-2 small stacks 12 identical blocks; GPT-3 stacks 96; the largest models stack over 100. The block's design is what this chapter is about — once you understand one block, you understand the whole network, because the rest is just repetition.

This uniformity is also why Transformers scale so predictably (Chapter 16): adding capacity means adding more of the same block, and the scaling laws describe exactly how performance improves as you do.

Token and Positional Embeddings

The input stage converts a sequence of integer token IDs into a sequence of vectors the network can process. This requires two things: a token embedding that maps each token to a learned vector, and a positional encoding that tells the network where each token sits in the sequence.

Token Embeddings

The token embedding is a lookup table: a matrix E of shape (V, d) where row i is the embedding of token i. Looking up a sequence of token IDs gives a sequence of d-dimensional vectors. This is exactly the embedding idea from Chapter 8, now learned end-to-end with the rest of the model.
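
To make the "lookup table" reading concrete, here is a minimal sketch showing that indexing rows of E gives the same result as multiplying one-hot vectors by E. The sizes and the names `looked_up` and `via_matmul` are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                         # toy vocabulary size and model dimension
E = rng.normal(0, 0.02, (V, d))      # embedding table, one row per token

token_ids = np.array([3, 7, 3])      # a 3-token sequence (token 3 repeats)
looked_up = E[token_ids]             # (T, d) via row indexing

one_hot = np.eye(V)[token_ids]       # (T, V) one-hot rows
via_matmul = one_hot @ E             # (T, d) same values, far more arithmetic

print(looked_up.shape)                              # (3, 4)
print(np.allclose(looked_up, via_matmul))           # True
print(np.array_equal(looked_up[0], looked_up[2]))   # True: same token, same vector
```

Row indexing is what implementations use in practice; the one-hot product is only a way to see that the embedding is a linear layer without a bias.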

Why Position Must Be Injected

Here is a subtle but critical fact: self-attention is permutation-equivariant. If you shuffle the input tokens, the outputs shuffle the same way; attention has no inherent notion of order. Without positional information, 'dog bites man' and 'man bites dog' would yield the same set of output vectors, merely reordered, so the model could not tell them apart. Position must be added explicitly.
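
The equivariance claim can be checked numerically. The sketch below uses a single unmasked attention head with random weights; `self_attention` and its weight names are assumptions made for this illustration rather than the Chapter 12 implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head, unmasked self-attention on x of shape (T, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)                       # a random reordering of positions
out = self_attention(x, Wq, Wk, Wv)
out_perm = self_attention(x[perm], Wq, Wk, Wv)

# Shuffling the inputs shuffles the outputs in exactly the same way.
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)  # True
```

Nothing in the computation refers to a row's index, which is why the reordered input produces the same vectors in reordered positions.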


Five Ways to Encode Position

Method	How	Used in
Sinusoidal	Fixed sin/cos of varying frequency	Original Transformer (2017)
Learned absolute	A learned vector per position	BERT, GPT-2
Rotary (RoPE)	Rotate Q,K by position-dependent angle	LLaMA, GPT-NeoX, most modern LLMs
ALiBi	Linear bias on attention scores by distance	BLOOM, some long-context models
Relative	Encode pairwise position differences	T5, Transformer-XL

Sinusoidal Positional Encoding

The original Transformer used fixed sinusoids of geometrically increasing wavelength. Each dimension of the encoding oscillates at a different frequency, giving every position a unique fingerprint. Crucially, for any fixed offset k, PE(pos+k) is a linear function of PE(pos), which makes it easy for the model to attend by relative position.

text•Sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

# pos = position, i = index of the sin/cos pair (0 ≤ i < d/2), d = model dimension
# Low dimensions oscillate fast, high dimensions oscillate slowly

Python•Token + sinusoidal positional embeddings from scratch
import numpy as np

def sinusoidal_encoding(T, d):
    """Returns (T, d) fixed positional encodings."""
    pos = np.arange(T)[:, None]              # (T, 1)
    i   = np.arange(d)[None, :]              # (1, d)
    angle = pos / np.power(10000, (2*(i//2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dims
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dims
    return pe

class EmbeddingLayer:
    def __init__(self, vocab, d, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))  # token table
        self.d = d

    def forward(self, token_ids):  # (T,) integer IDs
        T = len(token_ids)
        tok = self.E[token_ids]             # (T, d) token embeddings
        pos = sinusoidal_encoding(T, self.d)  # (T, d) position
        return tok + pos                    # add them: (T, d)

# Token embedding answers 'what token is this?'
# Positional encoding answers 'where in the sequence is it?'
# Their SUM carries both, and the network learns to disentangle them.
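
The relative-position property of the sinusoidal encoding can also be verified directly: shifting by k positions is a fixed rotation of each sin/cos pair. The sketch below repeats `sinusoidal_encoding` so it runs on its own; `shift_matrix` is a hypothetical helper introduced only for this check.

```python
import numpy as np

def sinusoidal_encoding(T, d):
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def shift_matrix(k, d):
    """(d, d) block-diagonal matrix M such that PE(pos + k) = PE(pos) @ M."""
    M = np.zeros((d, d))
    for j in range(d // 2):
        w = 1.0 / 10000 ** (2 * j / d)          # frequency of the j-th sin/cos pair
        c, s = np.cos(k * w), np.sin(k * w)
        M[2*j:2*j+2, 2*j:2*j+2] = [[c, -s], [s, c]]   # angle-addition identities
    return M

T, d, k = 32, 16, 5
pe = sinusoidal_encoding(T, d)
M = shift_matrix(k, d)

# One matrix, independent of pos, maps every row to the row k steps later.
shift_ok = np.allclose(pe[:-k] @ M, pe[k:])
print(shift_ok)  # True
```

Because M depends only on the offset k and not on the absolute position, a linear layer can in principle learn to look "k tokens back" from any position.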

▶

ML Connection: RoPE Dominates Modern LLMs

Most LLMs since 2022 (LLaMA, GPT-NeoX, Mistral, Gemma) use Rotary Position Embedding (RoPE). Instead of adding a position vector, RoPE rotates the query and key vectors by an angle proportional to their position, so that the dot product q·k naturally depends on the relative distance between positions.

RoPE's advantages: it encodes relative position directly in the attention score, tends to extrapolate better to sequences longer than those seen in training, and adds no parameters. You will implement RoPE in the exercises and meet it again in Chapter 33 on long context.
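
As a rough illustration of the rotation idea, the sketch below rotates consecutive pairs of dimensions and checks that the score between a query and a key depends only on their offset. It uses the interleaved-pair convention; the function names `rope` and `score` and the base of 10000 are assumptions for this example, and production implementations often lay out the pairs differently.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive pairs of x (T, d) by position-dependent angles; d must be even."""
    T, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,) one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(1, d))
k = rng.normal(size=(1, d))

def score(m, n):
    """Dot product between q placed at position m and k placed at position n."""
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

same_offset = np.isclose(score(3, 1), score(10, 8))   # both pairs have m - n = 2
diff_offset = np.isclose(score(3, 1), score(3, 2))    # offsets 2 and 1
print(same_offset, diff_offset)
```

The first comparison holds because rotating q by angle mθ and k by angle nθ leaves a dot product that depends only on (m − n)θ; no position vector is ever added to the stream.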

The Residual Stream

In the Transformer, the residual connections introduced in Chapter 10 are more than an optimization trick: they form the residual stream, a powerful conceptual lens (Elhage et al., 2021) for understanding how the whole network communicates. Every block reads from and writes to this shared stream.

Reframing the Block as Read-Write

Each sublayer (attention or FFN) does not replace the representation — it adds to it. The running sum x flows unchanged through every residual connection; each sublayer reads the current x, computes an update, and adds it back. The stream is a shared workspace that all blocks contribute to.

text•The residual stream view
x₀ = embedding + position           # initial stream
x₁ = x₀ + Attn(LN(x₀))              # attention writes to stream
x₂ = x₁ + FFN(LN(x₁))               # FFN writes to stream
...  (repeat for N blocks)
logits = Unembed(LN(x_final))       # read final stream

✧

Intuition: Why Residuals Make Deep Transformers Trainable

Because each sublayer ADDS to the stream rather than replacing it, the gradient has a direct path from the loss back to every layer — the identity path x → x carries gradient unchanged. This is exactly the LSTM cell-state highway from Chapter 9, generalized to depth instead of time.

Without residuals, a 96-layer Transformer would suffer the vanishing-gradient problem just as a 96-step RNN does. The residual stream is what lets gradients reach the earliest layers, making very deep Transformers trainable.
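
A toy calculation makes the gradient argument tangible. The sketch below assumes purely linear "sublayers" with small random weights, which is a deliberate simplification of real attention and FFN layers, and compares the input-to-output Jacobian of a plain 96-layer stack with that of a residual stack.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 96
# Toy linear sublayers with small weights (assumption: real sublayers are nonlinear).
Ws = [rng.normal(0, 0.1 / np.sqrt(d), (d, d)) for _ in range(n_layers)]

J_plain = np.eye(d)      # Jacobian of x -> W_n ... W_1 x
J_resid = np.eye(d)      # Jacobian of x -> (I + W_n) ... (I + W_1) x
for W in Ws:
    J_plain = W @ J_plain
    J_resid = (np.eye(d) + W) @ J_resid

plain_norm = np.linalg.norm(J_plain)
resid_norm = np.linalg.norm(J_resid)
print(f"plain stack:    {plain_norm:.3e}")   # vanishingly small
print(f"residual stack: {resid_norm:.3e}")   # stays of order one
```

In the plain stack every layer multiplies the gradient by a small matrix, so the product collapses; in the residual stack each factor is the identity plus a small correction, so the gradient survives all 96 layers.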

Pre-LN vs Post-LN

Where you place the LayerNorm relative to the residual add is one of the most consequential design choices. The original Transformer used Post-LN; nearly all modern LLMs use Pre-LN, which keeps the residual stream clean and stabilizes training at depth.

Post-LN (original, 2017)	Pre-LN (modern)
x = LN(x + Sublayer(x))	x = x + Sublayer(LN(x))
Norm AFTER the residual add	Norm BEFORE the sublayer
Residual stream gets normalized	Residual stream stays clean
Needs careful LR warmup	Much less sensitive to LR warmup
Can diverge at great depth	Scales to 100+ layers reliably
Used in: original Transformer, BERT	Used in: GPT-2+, LLaMA, most LLMs
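
The two orderings in the table differ by a single line of code. The sketch below is a simplified illustration: `layer_norm` omits the learnable gain and bias, and `sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_step(x, sublayer):
    return layer_norm(x + sublayer(x))      # the stream itself gets normalized

def pre_ln_step(x, sublayer):
    return x + sublayer(layer_norm(x))      # the stream passes through untouched

rng = np.random.default_rng(0)
d = 8
W = rng.normal(0, 0.1, (d, d))
sublayer = lambda h: np.tanh(h @ W)         # hypothetical stand-in for Attn or FFN

x = rng.normal(size=(4, d)) * 3.0 + 2.0     # deliberately off-center, large scale
pre = pre_ln_step(x, sublayer)
post = post_ln_step(x, sublayer)

# Pre-LN: subtracting the update recovers the input exactly (identity path intact).
identity_kept = np.allclose(pre - sublayer(layer_norm(x)), x)
# Post-LN: every row is re-centered, so the raw input offset and scale are gone.
post_row_means = post.mean(-1)
print(identity_kept)                                   # True
print(np.allclose(post_row_means, 0.0, atol=1e-6))     # True
```

With Pre-LN the input x appears unmodified in the output, which is the "clean stream" property; with Post-LN the normalization sits on the identity path and rescales it at every layer.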

The Feed-Forward Network

After attention mixes information across positions, the feed-forward network (FFN) processes each position independently. It is deceptively simple (two linear layers with a nonlinearity), yet it holds roughly two-thirds of each block's parameters and is increasingly understood to store much of the model's factual knowledge.

text•The standard FFN (per position)
FFN(x) = W₂ · act(W₁ x + b₁) + b₂

W₁: (d → 4d)    expand to a wider hidden dimension
act: GELU (ReLU in the original Transformer); SwiGLU is a gated variant with a third weight matrix
W₂: (4d → d)    project back to model dimension

The FFN expands the representation to a wider hidden dimension (typically 4× the model dimension), applies a nonlinearity, then projects back. The expansion gives the network room to compute rich nonlinear features per position; the projection returns to the residual-stream dimension so the output can be added back.
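
The two-thirds figure follows from simple arithmetic. The sketch below counts the parameters of one block at d=768 with a 4× expansion, assuming biases on every linear layer and two LayerNorms with gain and bias (a GPT-2-style block; other models drop some of these terms).

```python
d = 768
hidden = 4 * d

attn_params = 4 * (d * d + d)                            # W_Q, W_K, W_V, W_O with biases
ffn_params = (d * hidden + hidden) + (hidden * d + d)    # W1, b1, W2, b2
ln_params = 2 * (2 * d)                                  # two LayerNorms, gain and bias each

total = attn_params + ffn_params + ln_params
ffn_fraction = ffn_params / total

print(attn_params, ffn_params, ln_params)   # 2362368 4722432 3072
print(total)                                # 7087872
print(f"{ffn_fraction:.3f}")                # 0.666
```

Ignoring the small bias and normalization terms, attention contributes 4d² weights and the FFN 8d², which is where the ratio of one-third to two-thirds comes from.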

▶

ML Connection: FFNs as Key-Value Memories

Geva et al. (2021) showed that FFN layers act like key-value memories: the first weight matrix W₁ detects patterns (keys), and the second matrix W₂ writes associated information (values) into the residual stream. Specific neurons activate for specific concepts — a 'Canada' neuron, a 'past tense' neuron.

This is why FFNs are thought to store factual knowledge. Editing model facts (the ROME and MEMIT methods) works by surgically modifying FFN weights. The FFN is not just a generic nonlinearity — it is the Transformer's long-term memory.

Python•Feed-forward network (standard and SwiGLU)
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5*x*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x**3)))

class FeedForward:  # standard GELU FFN
    def __init__(self, d, hidden=None, seed=0):
        hidden = hidden or 4*d            # 4x expansion
        rng = np.random.default_rng(seed)
        # scale each matrix by 1/sqrt(fan_in) so activations stay of order one
        self.W1 = rng.normal(0, 1/np.sqrt(d), (d, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 1/np.sqrt(hidden), (hidden, d)); self.b2 = np.zeros(d)

    def forward(self, x):  # (T, d) -> (T, d)
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

class SwiGLU_FFN:  # gated FFN used in LLaMA (no biases)
    def __init__(self, d, seed=0):
        hidden = int(8/3*d)            # 3 matrices of d x (8/3)d = 8d^2, matching the 4d standard FFN
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 1/np.sqrt(d), (d, hidden))        # gate
        self.V = rng.normal(0, 1/np.sqrt(d), (d, hidden))        # value
        self.W2 = rng.normal(0, 1/np.sqrt(hidden), (hidden, d))  # output projection

    def forward(self, x):  # (T, d) -> (T, d)
        g = x @ self.W
        swish = g / (1 + np.exp(-g))             # Swish/SiLU: g * sigmoid(g)
        return (swish * (x @ self.V)) @ self.W2  # gated, then project

The Complete Transformer Block

We now have every piece. The Transformer block combines a multi-head attention sublayer and a feed-forward sublayer, each wrapped in a Pre-LN residual connection. This is the unit that gets stacked N times.

text•Pre-LN Transformer block (Pseudocode)
function TransformerBlock(x, mask):
    # Sublayer 1: multi-head self-attention
    a = MultiHeadAttention(LayerNorm(x), mask)
    x = x + Dropout(a)              # residual add

    # Sublayer 2: position-wise feed-forward
    f = FeedForward(LayerNorm(x))
    x = x + Dropout(f)              # residual add

    return x

Python•The complete Transformer block from scratch
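import numpy as np
# Assumes MultiHeadAttention (Chapter 12) and FeedForward (above) are in scope.
# Dropout from the pseudocode is omitted: this is an inference-mode forward pass.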

class TransformerBlock:
    def __init__(self, d, n_heads, seed=0):
        self.attn = MultiHeadAttention(d, n_heads, seed=seed)
        self.ffn  = FeedForward(d, seed=seed+1)
        self.ln1  = LayerNorm(d)   # before attention
        self.ln2  = LayerNorm(d)   # before FFN

    def forward(self, x, mask=None):  # x: (T, d)
        # Pre-LN: normalize, sublayer, residual add
        attn_out, _ = self.attn.forward(self.ln1.forward(x), mask)
        x = x + attn_out                  # residual 1
        x = x + self.ffn.forward(self.ln2.forward(x))  # residual 2
        return x

class LayerNorm:
    def __init__(self, d, eps=1e-5):
        self.g = np.ones(d); self.b = np.zeros(d); self.eps = eps
    def forward(self, x):
        mu = x.mean(-1, keepdims=True); var = x.var(-1, keepdims=True)
        return self.g * (x - mu) / np.sqrt(var + self.eps) + self.b

# This is the entire block. Stack N of these and you have a Transformer body.
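
The block above relies on the `MultiHeadAttention` class from Chapter 12, which is not reproduced here. For readers who want to run the block in isolation, the following is a minimal stand-in under one assumption taken from the call site: `forward(x, mask)` returns a pair `(output, attention_weights)`, with `mask` a boolean (T, T) array where True means "may attend". It is a sketch, not the Chapter 12 implementation.

```python
import numpy as np

class MultiHeadAttention:
    """Stand-in matching the interface TransformerBlock calls: forward(x, mask) -> (out, weights)."""
    def __init__(self, d, n_heads, seed=0):
        assert d % n_heads == 0, "d must be divisible by n_heads"
        rng = np.random.default_rng(seed)
        s = 1 / np.sqrt(d)
        self.h, self.dk = n_heads, d // n_heads
        self.Wq, self.Wk, self.Wv, self.Wo = (rng.normal(0, s, (d, d)) for _ in range(4))

    def forward(self, x, mask=None):                  # x: (T, d)
        T, d = x.shape
        split = lambda m: m.reshape(T, self.h, self.dk).transpose(1, 0, 2)   # (H, T, dk)
        q, k, v = split(x @ self.Wq), split(x @ self.Wk), split(x @ self.Wv)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.dk)                # (H, T, T)
        if mask is not None:
            scores = np.where(mask, scores, -np.inf)  # mask before softmax; broadcasts over heads
        scores = scores - scores.max(-1, keepdims=True)
        w = np.exp(scores)
        w = w / w.sum(-1, keepdims=True)
        out = (w @ v).transpose(1, 0, 2).reshape(T, d)                      # merge heads
        return out @ self.Wo, w

mha = MultiHeadAttention(d=512, n_heads=8)
x = np.random.default_rng(1).normal(size=(16, 512))
causal = np.tril(np.ones((16, 16), dtype=bool))
out, w = mha.forward(x, causal)
print(out.shape, w.shape)        # (16, 512) (8, 16, 16)
```

The shapes match the first rows of the shape trace for T=16, d=512, H=8: attention returns a tensor of the same shape as its input, so it can be added back to the stream.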

Shape Trace: Data through one block (T=16, d=512, H=8)

Operation	Shape	Note
input x	(16, 512)	from previous block
LayerNorm(x)	(16, 512)	normalized, shape unchanged
MultiHeadAttention	(16, 512)	context mixed across positions
x + attn_out	(16, 512)	residual add
LayerNorm(x)	(16, 512)	normalized again
FeedForward	(16, 512)	per-position processing
x + ffn_out	(16, 512)	residual add → next block

Encoder, Decoder, and Encoder-Decoder

The same block can be arranged in three ways, giving three model families. The difference comes down to two choices: is the self-attention masked (causal) or not, and is there a separate encoder feeding the decoder via cross-attention?

Architecture	Attention	Best for	Examples
Encoder-only	Bidirectional (no mask)	Understanding, classification	BERT, RoBERTa
Decoder-only	Causal (masked)	Generation, language modeling	GPT, LLaMA, Claude
Encoder-decoder	Encoder bidir. + decoder causal + cross-attn	Seq-to-seq (translation)	T5, BART, original Transformer
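
In code, the three families differ mainly in the boolean mask handed to attention. The sketch below builds the three masks for toy lengths, using the convention that True means "may attend"; the variable names are illustrative.

```python
import numpy as np

T = 4
bidirectional = np.ones((T, T), dtype=bool)      # encoder: every position sees every other
causal = np.tril(np.ones((T, T), dtype=bool))    # decoder: position t sees positions <= t

print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Cross-attention (encoder-decoder): each decoder position (row) may attend to
# every encoder position (column), so the mask is rectangular and all True.
T_dec, T_enc = 3, 5
cross = np.ones((T_dec, T_enc), dtype=bool)

visible_per_position = causal.sum(axis=1)
print(visible_per_position)      # [1 2 3 4]
```

An encoder-decoder model uses all three: the bidirectional mask in the encoder, the causal mask in the decoder's self-attention, and the rectangular mask in its cross-attention.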

Why Decoder-Only Won for LLMs

The largest and most capable language models (GPT, LLaMA, Claude, Gemini) are decoder-only. This was not obvious in 2018–2019, when encoder-only (BERT) and encoder-decoder (T5) models were equally prominent. The decoder-only design won for LLMs because of its simplicity and the power of the next-token objective.

•One objective, one architecture: next-token prediction needs only a causal decoder — no separate encoder, no cross-attention, no masking scheme beyond causal.

•Unified interface: every task (translation, Q&A, summarization, code) becomes 'continue this text', so one model serves all tasks via prompting.

•Efficient training: every token position provides a training signal (predict the next one), making maximal use of every sequence.

•Clean scaling: the uniform stack scales predictably, and the scaling laws of Chapter 16 were established on decoder-only models.

✧

History: The Architecture Convergence

In 2018–2020, the field was split: Google bet on encoder-only (BERT) and encoder-decoder (T5), while OpenAI bet on decoder-only (GPT). By 2023, nearly the entire frontier had converged on decoder-only — GPT-4, LLaMA, Claude, Gemini, Mistral are all decoder-only.

Encoder-only models remain dominant for embeddings and classification where you need to encode a fixed input. But for the generative, general-purpose models that define the LLM era, decoder-only is the unchallenged design.

The Output Stage: Unembedding and the LM Head

After the last block, the residual stream holds a rich representation of each position. The output stage converts this back into a probability distribution over the vocabulary: a final LayerNorm, then a linear projection (the unembedding or 'LM head') from model dimension d to vocabulary size V, then softmax.

text•The output stage
x_final = LayerNorm(x_N)            # (T, d)
logits  = x_final @ W_U             # (T, V)   unembedding
probs   = softmax(logits)           # (T, V)   per-position distributions

Weight Tying

A common trick: tie the unembedding matrix W_U to the token embedding matrix E (using Eᵀ as the unembedding). This saves V×d parameters (about 38.6 million for GPT-2 small, with V=50,257 and d=768) and often improves quality, since the same vocabulary geometry serves both reading and writing tokens.

✧

Train Note: Weight Tying Saves Millions of Parameters

For a 50,000-token vocabulary and d=768, the embedding and unembedding matrices each hold about 38M parameters, roughly 77M together. Tying them halves this to 38M in total and links the 'meaning' of a token on input to its prediction on output.

GPT-2 and many models use weight tying. Some very large models untie them, finding that at scale the extra parameters help. Like many architecture choices, the right answer depends on scale.
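
A short check of both the arithmetic and the mechanics of tying. In NumPy, `E.T` is a view of `E`, so the two matrices share storage; the small table sizes and the value 1.23 are arbitrary choices for this illustration.

```python
import numpy as np

V, d = 50257, 768                       # GPT-2 small vocabulary and width
per_matrix = V * d
print(f"{per_matrix:,}")                # 38,597,376 parameters saved by tying

# Small demo that the tied unembedding is the same storage as the embedding.
rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, (100, 16))      # toy (V, d) embedding table
W_U = E.T                               # (d, V) view, not a copy
print(np.shares_memory(E, W_U))         # True

E[7, 0] = 1.23                          # pretend a training step updated the embedding
print(W_U[0, 7])                        # 1.23: the unembedding sees the same update
```

This is why the `LMHead` constructor can simply store `embedding_matrix.T`: any later in-place update to the embedding table is reflected in the output projection.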

Python•The output stage and a complete forward pass
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True); e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LMHead:
    def __init__(self, embedding_matrix):
        self.W_U = embedding_matrix.T   # weight tying: (d, V)
        self.ln_f = LayerNorm(embedding_matrix.shape[1])

    def forward(self, x):  # x: (T, d)
        x = self.ln_f.forward(x)        # final norm
        logits = x @ self.W_U           # (T, V)
        return logits

# The logits at position t are the model's prediction for token t+1.
# Apply softmax for probabilities, or argmax for greedy decoding.

Building a Complete GPT from Scratch

We now assemble everything into a complete, working decoder-only Transformer — a miniature GPT. This is the culmination of Part III: a model you understand line by line, from token IDs to output logits.

Python•Code Lab: a complete GPT from scratch (forward pass)
import numpy as np

class GPT:
    """A minimal decoder-only Transformer (GPT-style)."""
    def __init__(self, vocab, d=256, n_layers=6, n_heads=8, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))   # token embeddings
        self.d = d
        self.blocks = [TransformerBlock(d, n_heads, seed=seed+i)
                       for i in range(n_layers)]
        self.head = LMHead(self.E)           # weight-tied output

    def forward(self, token_ids):  # (T,) integer IDs
        T = len(token_ids)
        # 1. Input stage: embed + position
        x = self.E[token_ids] + sinusoidal_encoding(T, self.d)  # (T, d)
        # 2. Causal mask (decoder-only)
        mask = np.tril(np.ones((T, T), dtype=bool))
        # 3. The block stack
        for block in self.blocks:
            x = block.forward(x, mask)        # (T, d)
        # 4. Output stage
        return self.head.forward(x)           # (T, V) logits

# Instantiate and run a forward pass
model = GPT(vocab=1000, d=256, n_layers=6, n_heads=8)
tokens = np.array([5, 42, 17, 3, 99])
logits = model.forward(tokens)
print(f"logits shape: {logits.shape}")  # (5, 1000)
print(f"next-token prediction at pos 0: {logits[0].argmax()}")
# Each row predicts the next token. Untrained, predictions are random;
# after training on text (Chapter 15), they become coherent language.

Shape Trace: Full GPT forward pass (T=5, d=256, V=1000)

Operation	Shape	Note
token_ids	(5,)	integer token IDs
E[token_ids]	(5, 256)	token embeddings
+ positional	(5, 256)	order injected
block × 6	(5, 256)	context mixed and processed
final LayerNorm	(5, 256)	normalize
@ W_U	(5, 1000)	logits over vocab

✧

You Have Built a Transformer

The GPT class above, together with the components from this chapter and Chapter 12, is a complete Transformer. It is the same architecture as GPT-2, GPT-3, and LLaMA — those models differ only in scale (more layers, wider d, larger vocab), a few refinements (RoPE, RMSNorm, SwiGLU), and the enormous training data and compute of Chapters 15–16.

Everything from here — pretraining, scaling, alignment, inference — builds on this architecture. You now understand, from first principles, the machine at the center of the modern AI revolution.

Autoregressive Generation

A trained decoder-only Transformer generates text autoregressively: it predicts the next token, appends it to the sequence, and repeats. The causal mask ensures each prediction depends only on previous tokens, so this loop is consistent with how the model was trained.

text•Autoregressive generation (Pseudocode)
function generate(model, prompt, max_tokens):
    tokens = prompt
    for _ in range(max_tokens):
        logits = model.forward(tokens)    # (T, V)
        next_logits = logits[-1]          # last position predicts next
        next_token = sample(next_logits)   # greedy / top-k / nucleus
        tokens = tokens + [next_token]      # append and repeat
        if next_token == EOS: break
    return tokens
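
The pseudocode above translates almost line for line into Python. To keep the sketch runnable without a trained network, it uses `ToyModel`, a hypothetical stand-in that exposes the same `forward((T,) ids) -> (T, V) logits` interface as the GPT class and always favors the token after the current one; decoding is greedy.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in: after token t it strongly favors token (t + 1) mod V."""
    def __init__(self, vocab):
        self.vocab = vocab

    def forward(self, token_ids):                     # (T,) -> (T, V)
        logits = np.zeros((len(token_ids), self.vocab))
        logits[np.arange(len(token_ids)), (token_ids + 1) % self.vocab] = 10.0
        return logits

def generate(model, prompt, max_tokens, eos=None):
    tokens = np.asarray(prompt, dtype=int)
    for _ in range(max_tokens):
        logits = model.forward(tokens)           # (T, V): recomputed from scratch each step
        next_token = int(logits[-1].argmax())    # greedy: the last row predicts the next token
        tokens = np.append(tokens, next_token)   # append and repeat
        if eos is not None and next_token == eos:
            break
    return tokens

out = generate(ToyModel(vocab=10), prompt=[5, 6], max_tokens=6, eos=0)
print(out)   # [5 6 7 8 9 0]
```

The loop stops after four steps because the fourth generated token is the end-of-sequence ID, even though `max_tokens` would allow six.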

Sampling Strategies

Strategy	How	Effect
Greedy	Always pick argmax	Deterministic; can be repetitive
Temperature	Scale logits by 1/τ before softmax	τ>1 more random, τ<1 more peaked
Top-k	Sample from k highest-probability tokens	Cuts off the long tail
Nucleus (top-p)	Sample from the smallest set whose cumulative probability reaches p	Adapts cutoff to confidence
Beam search	Track b best sequences	Better for translation, not open text
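
The first four strategies in the table fit into one small function. The sketch below is one possible arrangement, assuming temperature is applied first, then the top-k cutoff, then the nucleus cutoff; the function name `sample` and its parameters are choices made for this example, and ties at the top-k boundary are all kept.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID from a (V,) logit vector."""
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        kth = np.sort(logits)[-top_k]                     # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)  # drop everything below it
    probs = softmax(logits)
    if top_p is not None:
        order = np.argsort(probs)[::-1]                   # most likely first
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]   # smallest set reaching top_p
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask / (probs * mask).sum()       # renormalize the kept tokens
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
greedy = int(np.argmax(logits))                                    # always token 0
top2 = {sample(logits, rng, top_k=2) for _ in range(200)}          # only tokens 0 and 1
nucleus = {sample(logits, rng, top_p=0.5) for _ in range(50)}      # token 0 alone covers 0.5
print(greedy, top2 <= {0, 1}, nucleus)   # 0 True {0}
```

Here token 0 has probability of about 0.61, so a nucleus of p=0.5 contains only that token, while a flatter distribution would keep more candidates. This is the sense in which top-p adapts its cutoff to the model's confidence.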

✧

Train Note: The KV-Cache Makes Generation Fast

Naive generation recomputes attention over the whole sequence at every step — O(T²) per token, O(T³) total. The KV-cache stores the keys and values of past tokens so each new token only computes its own Q, K, V and attends to the cache: O(T) per token.

This is why the causal mask matters for efficiency, not just correctness: because position t never attends to the future, the keys and values of past tokens never change and can be cached. Chapter 27 covers inference optimization in depth.
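
The caching argument can be checked on a single attention head. The sketch below compares full causal attention over a sequence with an incremental loop that appends one key and one value per step; the weights are random and the variable names `K_cache` and `V_cache` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8
Wq, Wk, Wv = (rng.normal(0, 1 / np.sqrt(d), (d, d)) for _ in range(3))
x = rng.normal(size=(T, d))                 # inputs to one attention layer

# Full recomputation: causal attention over the whole sequence at once.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
full = softmax(scores) @ v                  # (T, d)

# Incremental: one token at a time, appending its key and value to a cache.
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
steps = []
for t in range(T):
    xt = x[t:t + 1]                         # (1, d): only the newest token
    K_cache = np.vstack([K_cache, xt @ Wk])
    V_cache = np.vstack([V_cache, xt @ Wv])
    s = (xt @ Wq) @ K_cache.T / np.sqrt(d)  # (1, t+1): the cache holds only the past
    steps.append(softmax(s) @ V_cache)
cached = np.vstack(steps)

match = np.allclose(full, cached)
print(match)  # True
```

Each incremental step does work proportional to the current length rather than its square, and no mask is needed inside the loop because future keys simply do not exist in the cache yet.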

Mapping to Real Models

Every real Transformer is a specific configuration of the architecture you just built. Here is how the major models fill in the blanks: the same skeleton, different sizes and refinements.

Model	Type	Layers	d_model	Refinements
GPT-2 small	Decoder	12	768	Learned pos, GELU, LayerNorm, tied
GPT-3	Decoder	96	12288	Same as GPT-2, vastly scaled
BERT-base	Encoder	12	768	Learned pos, bidirectional, MLM
T5-base	Enc-Dec	12+12	768	Relative pos, cross-attention
LLaMA-2 7B	Decoder	32	4096	RoPE, RMSNorm, SwiGLU, untied
LLaMA-2 70B	Decoder	80	8192	RoPE, RMSNorm, SwiGLU, GQA

Notice the pattern: the architecture is stable, but two things change over time. Scale grows relentlessly (12 → 96 layers; 768 → 12288 dimensions), and a handful of refinements accumulate (Pre-LN replaces Post-LN; RoPE replaces learned positions; RMSNorm replaces LayerNorm; SwiGLU replaces GELU; GQA reduces KV memory). The core (stacked blocks of attention and FFN joined by residual connections) is unchanged from 2017.
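
The layer counts and widths in the table are enough for a back-of-the-envelope size estimate. The sketch below uses the rough rule of 12·d² weights per block (4d² for attention, 8d² for the FFN) plus one tied embedding matrix. It ignores biases, normalization, and positional parameters, and the vocabulary size of 50,257 for both models is an assumption not listed in the table.

```python
def approx_params(n_layers, d, vocab):
    """Rough count: 12*d^2 per block plus a (vocab, d) embedding.
    Ignores biases, LayerNorm, and positional parameters."""
    return 12 * n_layers * d * d + vocab * d

gpt2_small = approx_params(12, 768, 50257)
gpt3 = approx_params(96, 12288, 50257)

print(f"GPT-2 small ~ {gpt2_small / 1e6:.0f}M")   # GPT-2 small ~ 124M
print(f"GPT-3       ~ {gpt3 / 1e9:.0f}B")         # GPT-3       ~ 175B
```

The estimates land close to the commonly quoted sizes of these models, which illustrates how completely depth and width determine parameter count once the block design is fixed.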

▶

ML Connection: The Modern LLM Recipe

The 2024 'default' decoder-only LLM combines: RoPE positional encoding, RMSNorm (Pre-LN), SwiGLU feed-forward, grouped-query attention (GQA) for efficient inference, and untied embeddings at large scale. This recipe — LLaMA-style — is the de facto standard that most open models follow.

Each refinement is a small, empirically-validated improvement over the original 2017 design. None changes the fundamental architecture; together they represent seven years of incremental engineering on a remarkably durable foundation.

Implementation Pitfalls

A Transformer can have a subtle bug and still train — just worse — making errors hard to spot. Here are the most common implementation pitfalls and how to catch them.

Pitfall	Symptom	Fix
Forgot √d_k scaling	Slow/stuck training	Divide scores by √d_k
Wrong mask broadcasting	Model peeks at future	Verify causal mask shape (T,T)
Forgot positional encoding	Bag-of-words behaviour	Add position before block 1
Post-LN at depth	Instability or divergence in deep stacks	Use Pre-LN
No gradient clipping	NaN loss spikes	Clip grad norm to ~1.0
Mask applied after softmax	Future leaks in	Mask BEFORE softmax (set -∞)
Tied weights, wrong transpose	Shape error or garbage	W_U = Eᵀ, check shapes

⚠️

Pitfall: The Most Dangerous Bug: A Mask That Almost Works

A causal mask applied with the wrong broadcasting can leak a little future information — enough to inflate training metrics (the model 'cheats') but invisible until evaluation, when generation quality is mysteriously poor. The model trained fine; it just learned to rely on information it won't have at inference.

Always test generation, not just training loss. A model that cheats during training will have suspiciously low training loss and disappointing generation. The overfit-one-batch test from Chapter 10 plus a generation sanity-check catches most architecture bugs.
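
A leak of this kind can also be caught with a direct test: perturb only the future inputs and confirm that earlier outputs do not move. The sketch below applies the idea to a weightless attention function with a correct mask and with an off-by-one mask; `leaks_future` is a hypothetical helper, and the same check can be pointed at a full model's forward pass.

```python
import numpy as np

def attention(x, mask):
    """Weightless single-head attention, enough to exercise the mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(-1, keepdims=True)) @ x

def leaks_future(forward, x, t):
    """True if changing inputs after position t changes outputs at positions <= t."""
    x2 = x.copy()
    x2[t + 1:] += 1.0                               # perturb only the future
    return not np.allclose(forward(x)[: t + 1], forward(x2)[: t + 1])

T, d = 6, 4
x = np.random.default_rng(0).normal(size=(T, d))
good_mask = np.tril(np.ones((T, T), dtype=bool))
bad_mask = np.tril(np.ones((T, T), dtype=bool), k=1)    # off-by-one: sees one step ahead

good_leaks = leaks_future(lambda h: attention(h, good_mask), x, t=2)
bad_leaks = leaks_future(lambda h: attention(h, bad_mask), x, t=2)
print(good_leaks, bad_leaks)   # False True
```

The off-by-one mask still looks triangular and still trains, which is exactly why a behavioral test of this kind is more reliable than inspecting the loss curve.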

✧

Train Note: Validate Against a Reference

When implementing a Transformer from scratch, validate against a trusted reference (Hugging Face transformers, nanoGPT). Load the same weights into both, run the same input, and confirm the logits match to within floating-point tolerance.

This single test — numerical equivalence to a reference on a fixed input — catches nearly every architecture bug. It is the gold standard for verifying a from-scratch implementation is correct before investing in training.

Chapter Summary & Exercises

Architecture Quick-Reference

Component	Role	Modern choice
Token embedding	Token → vector	Learned table (V, d)
Positional encoding	Inject order	RoPE
Self-attention	Mix across positions	Multi-head + causal mask
Feed-forward	Process per position	SwiGLU, ~8/3 d hidden
Normalization	Stabilize	RMSNorm, Pre-LN
Residual stream	Communication channel	Identity skip connections
LM head	Vector → logits	Linear (often weight-tied)

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

✎

Exercise 1: Pen & Paper

Prove that self-attention without positional encoding is permutation-equivariant: permuting the input rows permutes the output rows identically. Why does this make positional encoding necessary?

✎

Exercise 2: Derive

Show that the sinusoidal encoding lets the model attend to relative positions: express PE(pos+k) as a linear function of PE(pos).

✎

Exercise 3: Pen & Paper

Count the parameters in a single Transformer block (d=512, H=8, FFN hidden=2048). Break down attention vs FFN. What fraction is in the FFN?

✎

Exercise 4: Pen & Paper

Explain the residual stream view. Why does writing x = x + Sublayer(LN(x)) (Pre-LN) keep the stream cleaner than x = LN(x + Sublayer(x)) (Post-LN)?

✎

Exercise 5: Pen & Paper

For a decoder-only model with vocab V=50257, d=768, 12 layers, estimate the total parameter count. Where do most parameters live?

✎

Exercise 6: Pen & Paper

Explain weight tying. How many parameters does it save for V=50257, d=768? What is the conceptual justification?

✎

Exercise 7: Pen & Paper

Compare encoder-only, decoder-only, and encoder-decoder in terms of (a) attention masking, (b) training objective, (c) typical tasks. Why did decoder-only win for LLMs?

✎

Exercise 8: Derive

Implement RoPE on paper: show how rotating q and k by angle proportional to position makes q·k depend only on the relative position difference.

✎

Exercise 9: Pen & Paper

Why does the KV-cache work only because of the causal mask? What property of causal attention makes past keys and values reusable?

✎

Exercise 10: Pen & Paper

A model trains with low loss but generates poorly. List three architecture bugs (from Section 13.11) that could cause this and how to distinguish them.

✎

Exercise 11: Code

Implement sinusoidal positional encoding and visualize it as a heatmap (position × dimension). Confirm low dimensions oscillate fast, high dimensions slowly.

✎

Exercise 12: Code

Implement and compare sinusoidal, learned, and rotary positional encodings on a small sequence task. Which extrapolates best to sequences longer than training length?

✎

Exercise 13: Code

Implement the complete Transformer block (Pre-LN, MHA, FFN, residuals) from scratch. Verify the output shape equals the input shape for any sequence length.

✎

Exercise 14: Code Lab

Build the complete GPT class from Section 13.8. Verify shapes flow correctly end-to-end. Confirm it produces (T, V) logits for any token sequence.

✎

Exercise 15: Code

Implement weight tying: share the embedding matrix between input and output. Verify the parameter count drops by V×d and the forward pass still works.

✎

Exercise 16: Code

Implement autoregressive generation with greedy, temperature, top-k, and nucleus sampling. On an untrained model, confirm the sampling distributions differ as expected.

✎

Exercise 17: Code Lab

Implement the KV-cache for generation. Measure the speedup vs naive recomputation for generating 100 tokens. Confirm it produces identical outputs.

✎

Exercise 18: Code

Convert your decoder-only model to an encoder by removing the causal mask. Verify that with no mask, every position attends to every other (full attention matrix).

✎

Exercise 19: Code

Load GPT-2 weights from Hugging Face into a from-scratch implementation. Verify your logits match the reference to within 1e-4 on a fixed input — the gold-standard correctness test.

✎

Exercise 20: Code (Challenge)

Build a complete trainable nanoGPT: implement the full forward AND backward pass (using your Chapter 11 autograd or PyTorch), train it on a small text corpus (e.g., Shakespeare), and generate samples. This is the capstone of Part III — a Transformer you built and trained from scratch.

Further reading: “Attention Is All You Need” (Vaswani et al., 2017) — the original architecture. “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019) for GPT-2's decoder-only design. “RoFormer” (Su et al., 2021) for RoPE. “LLaMA” (Touvron et al., 2023) for the modern recipe. Andrej Karpathy's nanoGPT and his 'Let's build GPT' video — the best hands-on companion to this chapter. “A Mathematical Framework for Transformer Circuits” (Elhage et al., 2021) for the residual-stream view.

Next → Chapter 14: Tokenization

You have built a complete Transformer that maps token IDs to predictions — but where do token IDs come from? Chapter 14 fills the one remaining gap: how raw text becomes the integer sequences the model consumes. We will build Byte-Pair Encoding from scratch, compare it to WordPiece and Unigram, explore the surprising ways tokenization shapes model behaviour (arithmetic, multilingual fairness, the 'SolidGoldMagikarp' glitch tokens), and understand why tokenization is both essential and a persistent source of model quirks.

