Positional Encoding
You have built every component: linear layers, activations, normalization, dropout, residuals (Chapter 10), the autograd to train them (Chapter 11), and multi-head attention (Chapter 12). This chapter assembles them into the complete Transformer. We start with the bird's-eye view, then build each piece, then write the whole thing from scratch.
The Three-Stage Structure
Every Transformer, regardless of variant, has the same three-stage structure: an input stage that turns tokens into vectors, a stack of identical processing blocks, and an output stage that turns vectors back into token predictions.
Transformer: the three stages (decoder-only), read bottom-up from input to output
| Stage | Shape / role |
|---|---|
| Logits over vocabulary | (T, V) |
| Unembedding / LM head | (d → V) |
| Final LayerNorm | normalize |
| Transformer Block × N | the stack |
| + Positional encoding | inject order |
| Token embedding | (V → d) |
| input token IDs | (T,) |
The middle stage — the stack of N identical blocks — does the heavy lifting. Each block refines the representation, mixing context via attention and processing it via the feed-forward network. The input and output stages are comparatively simple: a lookup table in, a linear projection out.
The input stage converts a sequence of integer token IDs into a sequence of vectors the network can process. This requires two things: a token embedding that maps each token to a learned vector, and a positional encoding that tells the network where each token sits in the sequence.
Token Embeddings
The token embedding is a lookup table: a matrix E of shape (V, d) where row i is the embedding of token i. Looking up a sequence of token IDs gives a sequence of d-dimensional vectors. This is exactly the embedding idea from Chapter 8, now learned end-to-end with the rest of the model.
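The lookup described above can be sketched in a few lines of NumPy. This is an illustration only: the sizes `V = 10` and `d = 4` and the toy sequence are assumptions chosen for readability. It also shows that indexing rows of E is equivalent to multiplying one-hot vectors by E, which is why the embedding can be viewed as a linear layer.

```python
import numpy as np

V, d = 10, 4                      # toy vocabulary size and model dimension (assumed)
rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, (V, d))   # embedding table, one row per token

token_ids = np.array([3, 7, 3])   # toy sequence; token 3 appears twice
x = E[token_ids]                  # integer indexing: (T,) -> (T, d)

# The same result via one-hot rows multiplied by E
one_hot = np.eye(V)[token_ids]    # (T, V)
x_matmul = one_hot @ E            # (T, d)

print(x.shape)                    # (3, 4)
print(np.allclose(x, x_matmul))   # True
```

Both occurrences of token 3 receive the same vector, which is one more reason position has to be supplied separately.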
Why Position Must Be Injected
Here is a subtle but critical fact: self-attention is permutation-equivariant. If you shuffle the input tokens, the outputs shuffle the same way; attention has no inherent notion of order. Without position information, 'dog bites man' and 'man bites dog' would produce the same output vectors, merely in a different order, so the model could not tell who bit whom. Position must be added explicitly.
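The equivariance claim is easy to check numerically. The sketch below assumes a single unmasked attention head with random weights and no positional signal; the helper `self_attention` and the toy sizes are illustrative, not the Chapter 12 implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head, unmasked self-attention with no positional signal."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 3, 8
x = rng.normal(size=(T, d))               # stand-ins for 'dog', 'bites', 'man'
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])                # reorder to 'man', 'bites', 'dog'
out = self_attention(x, Wq, Wk, Wv)
out_perm = self_attention(x[perm], Wq, Wk, Wv)

# Permuting the inputs permutes the outputs in exactly the same way
equivariant = np.allclose(out[perm], out_perm)
print(equivariant)                        # True
```

Each token receives the same output vector in both orderings, so nothing downstream can recover which word came first.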
Five Ways to Encode Position
| Method | How | Used in |
|---|---|---|
| Sinusoidal | Fixed sin/cos of varying frequency | Original Transformer (2017) |
| Learned absolute | A learned vector per position | BERT, GPT-2 |
| Rotary (RoPE) | Rotate Q,K by position-dependent angle | LLaMA, GPT-NeoX, most modern LLMs |
| ALiBi | Linear bias on attention scores by distance | BLOOM, some long-context models |
| Relative | Encode pairwise position differences | T5, Transformer-XL |
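To make the rotary row of the table concrete, here is a minimal sketch of RoPE. It assumes the interleaved pairing of dimensions (2i, 2i+1) and a base of 10000; the function name `rope` and the toy sizes are illustrative and not taken from any particular library. The point it demonstrates is that the query-key dot product depends only on the offset between positions, not on the absolute positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x (T, d) by position-dependent angles."""
    T, d = x.shape
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,) one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same offset (3) at two very different absolute positions
score_a = rope(q, np.array([5])) @ rope(k, np.array([2])).T
score_b = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(score_a, score_b))                 # True
```

Because the rotation is applied to Q and K inside attention rather than added to the embeddings, the residual stream itself carries no explicit position vector in this scheme.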
Sinusoidal Positional Encoding
The original Transformer used fixed sinusoids of geometrically increasing wavelength. Each pair of dimensions oscillates at a different frequency, giving every position a unique fingerprint. Crucially, for any fixed offset k, PE(pos + k) is a linear function of PE(pos) (a rotation of each sin/cos pair), which makes it easy for the model to attend by relative position.
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
# pos = position, i = dimension index, d = model dimension
# Low dimensions oscillate fast, high dimensions oscillate slowly

import numpy as np

def sinusoidal_encoding(T, d):
    """Returns (T, d) fixed positional encodings."""
    pos = np.arange(T)[:, None]                     # (T, 1)
    i = np.arange(d)[None, :]                       # (1, d)
    angle = pos / np.power(10000, (2*(i//2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])            # even dims
    pe[:, 1::2] = np.cos(angle[:, 1::2])            # odd dims
    return pe

class EmbeddingLayer:
    def __init__(self, vocab, d, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))    # token table
        self.d = d

    def forward(self, token_ids):                   # (T,) integer IDs
        T = len(token_ids)
        tok = self.E[token_ids]                     # (T, d) token embeddings
        pos = sinusoidal_encoding(T, self.d)        # (T, d) position
        return tok + pos                            # add them: (T, d)

# Token embedding answers 'what token is this?'
# Positional encoding answers 'where in the sequence is it?'
# Their SUM carries both, and the network learns to disentangle them.

The residual connections introduced in Chapter 10 are more than an optimization trick in the Transformer: they form the residual stream, a powerful conceptual lens (Elhage et al., 2021) for understanding how the whole network communicates. Every block reads from and writes to this shared stream.
Reframing the Block as Read-Write
Each sublayer (attention or FFN) does not replace the representation — it adds to it. The running sum x flows unchanged through every residual connection; each sublayer reads the current x, computes an update, and adds it back. The stream is a shared workspace that all blocks contribute to.
x₀ = embedding + position # initial stream
x₁ = x₀ + Attn(LN(x₀)) # attention writes to stream
x₂ = x₁ + FFN(LN(x₁)) # FFN writes to stream
... (repeat for N blocks)
logits = Unembed(LN(x_final)) # read final stream
Pre-LN vs Post-LN
Where you place the LayerNorm relative to the residual add is one of the most consequential design choices. The original Transformer used Post-LN; nearly all modern LLMs use Pre-LN, which keeps the residual stream clean and stabilizes training at depth.
| Post-LN (original, 2017) | Pre-LN (modern) |
|---|---|
| x = LN(x + Sublayer(x)) | x = x + Sublayer(LN(x)) |
| Norm AFTER the residual add | Norm BEFORE the sublayer |
| Residual stream gets normalized | Residual stream stays clean |
| Needs careful LR warmup | Trains stably without warmup tricks |
| Can diverge at great depth | Scales to 100+ layers reliably |
| Used in: original Transformer, BERT | Used in: GPT-2+, LLaMA, most LLMs |
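The two update rules in the table can be compared side by side. The sketch below is an illustration under stated assumptions: `make_sublayer` is a hypothetical stand-in for attention or an FFN (a fixed random tanh layer), and the LayerNorm has no learned gain or bias. It checks the "clean stream" property of Pre-LN: the final stream is exactly the initial embedding plus the sum of every sublayer's output, whereas Post-LN renormalizes the stream after each add.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def make_sublayer(d, seed):
    """Hypothetical stand-in for attention or an FFN: a fixed random nonlinearity."""
    W = np.random.default_rng(seed).normal(0, 1 / np.sqrt(d), (d, d))
    return lambda x: np.tanh(x @ W)

T, d, n_sublayers = 4, 16, 6
sublayers = [make_sublayer(d, s) for s in range(n_sublayers)]
x0 = np.random.default_rng(42).normal(size=(T, d))

# Pre-LN: x = x + Sublayer(LN(x)); nothing ever rescales the stream itself
x_pre, updates = x0.copy(), []
for f in sublayers:
    u = f(layer_norm(x_pre))
    updates.append(u)
    x_pre = x_pre + u

# Post-LN: x = LN(x + Sublayer(x)); the stream is renormalized after every add
x_post = x0.copy()
for f in sublayers:
    x_post = layer_norm(x_post + f(x_post))

pre_is_plain_sum = np.allclose(x_pre, x0 + sum(updates))
post_row_means = x_post.mean(-1)
print(pre_is_plain_sum)                          # True
print(np.allclose(post_row_means, 0, atol=1e-6)) # True
```

In the Pre-LN case the identity path from x₀ to the output is never interrupted, which is the usual explanation for its better gradient flow at depth.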
After attention mixes information across positions, the feed-forward network (FFN) processes each position independently. It is deceptively simple (two linear layers with a nonlinearity), yet it holds roughly two-thirds of each block's parameters (about 8d² weights, against about 4d² for attention) and is increasingly understood to store much of the model's factual knowledge.
FFN(x) = W₂ · act(W₁ x + b₁) + b₂
W₁: (d → 4d) expand to a wider hidden dimension
act: GELU or SwiGLU
W₂: (4d → d) project back to model dimension
The FFN expands the representation to a wider hidden dimension (typically 4× the model dimension), applies a nonlinearity, then projects back. The expansion gives the network room to compute rich nonlinear features per position; the projection returns to the residual-stream dimension so the output can be added back.
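The two-thirds figure can be checked by counting weights. The sketch below assumes a GPT-2-style attention sublayer (four d×d projections with biases) and ignores normalization parameters; the helper names are illustrative. It also checks that a gated FFN with hidden size 8d/3 has almost the same weight count as the standard 4d FFN.

```python
def attention_params(d):
    # Four d x d projections (Q, K, V, output), each with a bias of size d
    return 4 * (d * d + d)

def gelu_ffn_params(d, hidden=None):
    hidden = hidden or 4 * d
    # W1: d x hidden (+ hidden biases), W2: hidden x d (+ d biases)
    return d * hidden + hidden + hidden * d + d

def swiglu_ffn_params(d):
    hidden = int(8 / 3 * d)
    # Three bias-free matrices: gate, value and output projection
    return 3 * d * hidden

d = 512
attn = attention_params(d)
ffn = gelu_ffn_params(d)
ffn_share = ffn / (attn + ffn)
swiglu_ratio = swiglu_ffn_params(d) / (2 * d * 4 * d)

print(f"FFN share of block parameters: {ffn_share:.3f}")
print(f"SwiGLU weights relative to 4d FFN: {swiglu_ratio:.4f}")
```

The share is per block; once the embedding table is included, the FFN fraction of the whole model is somewhat lower, especially for small models with large vocabularies.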
import numpy as np

def gelu(x):
    return 0.5*x*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x**3)))

class FeedForward:                                  # standard GELU FFN
    def __init__(self, d, hidden=None, seed=0):
        hidden = hidden or 4*d                      # 4x expansion
        rng = np.random.default_rng(seed)
        s = 1/np.sqrt(d)
        self.W1 = rng.normal(0, s, (d, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, s, (hidden, d))
        self.b2 = np.zeros(d)

    def forward(self, x):                           # (T, d) -> (T, d)
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

class SwiGLU_FFN:                                   # gated FFN used in LLaMA
    def __init__(self, d, seed=0):
        hidden = int(8/3*d)                         # param-matched to 4d standard FFN
        rng = np.random.default_rng(seed)
        s = 1/np.sqrt(d)
        self.W = rng.normal(0, s, (d, hidden))      # gate
        self.V = rng.normal(0, s, (d, hidden))      # value
        self.W2 = rng.normal(0, s, (hidden, d))

    def forward(self, x):
        gate = x @ self.W
        swish = gate * (1/(1+np.exp(-gate)))        # Swish/SiLU: z * sigmoid(z)
        return (swish * (x @ self.V)) @ self.W2     # gated, then project

We now have every piece. The Transformer block combines a multi-head attention sublayer and a feed-forward sublayer, each wrapped in a Pre-LN residual connection. This is the unit that gets stacked N times.
function TransformerBlock(x, mask):
    # Sublayer 1: multi-head self-attention
    a = MultiHeadAttention(LayerNorm(x), mask)
    x = x + Dropout(a)                              # residual add
    # Sublayer 2: position-wise feed-forward
    f = FeedForward(LayerNorm(x))
    x = x + Dropout(f)                              # residual add
    return x

import numpy as np

# MultiHeadAttention is the class built in Chapter 12; FeedForward is defined above.
# Dropout is omitted here, so this forward pass matches inference-time behaviour.
class TransformerBlock:
    def __init__(self, d, n_heads, seed=0):
        self.attn = MultiHeadAttention(d, n_heads, seed=seed)
        self.ffn = FeedForward(d, seed=seed+1)
        self.ln1 = LayerNorm(d)                     # before attention
        self.ln2 = LayerNorm(d)                     # before FFN

    def forward(self, x, mask=None):                # x: (T, d)
        # Pre-LN: normalize, sublayer, residual add
        attn_out, _ = self.attn.forward(self.ln1.forward(x), mask)
        x = x + attn_out                            # residual 1
        x = x + self.ffn.forward(self.ln2.forward(x))  # residual 2
        return x

class LayerNorm:
    def __init__(self, d, eps=1e-5):
        self.g = np.ones(d)
        self.b = np.zeros(d)
        self.eps = eps

    def forward(self, x):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return self.g * (x - mu) / np.sqrt(var + self.eps) + self.b

# This is the entire block. Stack N of these and you have a Transformer body.

Shape Trace: Data through one block (T=16, d=512, H=8)
| Operation | Shape | Note |
|---|---|---|
| input x | (16, 512) | from previous block |
| LayerNorm(x) | (16, 512) | normalized, shape unchanged |
| MultiHeadAttention | (16, 512) | context mixed across positions |
| x + attn_out | (16, 512) | residual add |
| LayerNorm(x) | (16, 512) | normalized again |
| FeedForward | (16, 512) | per-position processing |
| x + ffn_out | (16, 512) | residual add → next block |
The same block can be arranged in three ways, giving three model families. The difference comes down to two choices: is the self-attention masked (causal) or not, and is there a separate encoder feeding the decoder via cross-attention?
| Architecture | Attention | Best for | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (no mask) | Understanding, classification | BERT, RoBERTa |
| Decoder-only | Causal (masked) | Generation, language modeling | GPT, LLaMA, Claude |
| Encoder-decoder | Encoder bidir. + decoder causal + cross-attn | Seq-to-seq (translation) | T5, BART, original Transformer |
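The masked versus unmasked distinction in the table comes down to one line of code. The following sketch assumes the convention used elsewhere in this chapter (a boolean lower-triangular mask where True means "may attend"); the helper `attention_weights` is illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw (T, T) scores into attention weights, optionally causal."""
    T = scores.shape[0]
    if causal:
        allowed = np.tril(np.ones((T, T), dtype=bool))   # True = may attend
        scores = np.where(allowed, scores, -np.inf)      # block future positions
    return softmax(scores)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

w_bidir = attention_weights(scores, causal=False)   # encoder-style
w_causal = attention_weights(scores, causal=True)   # decoder-style

print(np.round(w_causal, 2))   # upper triangle is exactly zero
```

Every row still sums to 1 in both cases; the causal version simply redistributes each row's weight over the current and earlier positions.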
Why Decoder-Only Won for LLMs
The largest and most capable language models (GPT, LLaMA, Claude, Gemini) are decoder-only. This was not obvious in 2018 to 2019, when encoder-only (BERT) and encoder-decoder (T5) models were equally prominent. The decoder-only design won for LLMs because of its simplicity (one stack, one mask, no cross-attention) and the power of the next-token objective, which turns every position of every training sequence into a prediction target.
After the last block, the residual stream holds a rich representation of each position. The output stage converts this back into a probability distribution over the vocabulary: a final LayerNorm, then a linear projection (the unembedding or 'LM head') from model dimension d to vocabulary size V, then softmax.
x_final = LayerNorm(x_N) # (T, d)
logits = x_final @ W_U # (T, V) unembedding
probs = softmax(logits) # (T, V) per-position distributions
Weight Tying
A common trick: tie the unembedding matrix W_U to the token embedding matrix E (using Eᵀ as the unembedding). This saves V×d parameters, about 38.6 million for GPT-2 small (V = 50,257, d = 768), and often improves quality, since the same vocabulary geometry serves both reading and writing tokens.
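In NumPy, tying falls out of the fact that `.T` returns a view rather than a copy. The sketch below uses toy sizes (assumed) to show that an update to E is immediately visible through W_U, and computes the saving for GPT-2 small from its published vocabulary and width.

```python
import numpy as np

V, d = 1000, 64                       # toy sizes (assumed)
E = np.random.default_rng(0).normal(0, 0.02, (V, d))

W_U = E.T                             # a view, not a copy: shape (d, V)
shares_memory = np.shares_memory(E, W_U)

E[5, :] += 1.0                        # stand-in for a training update to token 5
column_follows = np.array_equal(W_U[:, 5], E[5])

# Parameters saved by tying, using GPT-2 small's sizes
saved = 50257 * 768

print(shares_memory, column_follows)  # True True
print(f"{saved:,}")                   # 38,597,376
```

Writing `E.T.copy()` instead would silently untie the weights, so the two matrices would drift apart during training.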
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LMHead:
    def __init__(self, embedding_matrix):
        self.W_U = embedding_matrix.T               # weight tying: (d, V)
        self.ln_f = LayerNorm(embedding_matrix.shape[1])

    def forward(self, x):                           # x: (T, d)
        x = self.ln_f.forward(x)                    # final norm
        logits = x @ self.W_U                       # (T, V)
        return logits

# The logits at position t are the model's prediction for token t+1.
# Apply softmax for probabilities, or argmax for greedy decoding.

We now assemble everything into a complete, working decoder-only Transformer: a miniature GPT. This is the culmination of Part III: a model you understand line by line, from token IDs to output logits.
import numpy as np

class GPT:
    """A minimal decoder-only Transformer (GPT-style)."""
    def __init__(self, vocab, d=256, n_layers=6, n_heads=8, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))    # token embeddings
        self.d = d
        self.blocks = [TransformerBlock(d, n_heads, seed=seed+i)
                       for i in range(n_layers)]
        self.head = LMHead(self.E)                  # weight-tied output

    def forward(self, token_ids):                   # (T,) integer IDs
        T = len(token_ids)
        # 1. Input stage: embed + position
        x = self.E[token_ids] + sinusoidal_encoding(T, self.d)  # (T, d)
        # 2. Causal mask (decoder-only)
        mask = np.tril(np.ones((T, T), dtype=bool))
        # 3. The block stack
        for block in self.blocks:
            x = block.forward(x, mask)              # (T, d)
        # 4. Output stage
        return self.head.forward(x)                 # (T, V) logits

# Instantiate and run a forward pass
model = GPT(vocab=1000, d=256, n_layers=6, n_heads=8)
tokens = np.array([5, 42, 17, 3, 99])
logits = model.forward(tokens)
print(f"logits shape: {logits.shape}")              # (5, 1000)
print(f"next-token prediction at pos 0: {logits[0].argmax()}")

# Each row predicts the next token. Untrained, predictions are random;
# after training on text (Chapter 15), they become coherent language.

Shape Trace: Full GPT forward pass (T=5, d=256, V=1000)
| Operation | Shape | Note |
|---|---|---|
| token_ids | (5,) | integer token IDs |
| E[token_ids] | (5, 256) | token embeddings |
| + positional | (5, 256) | order injected |
| block × 6 | (5, 256) | context mixed and processed |
| final LayerNorm | (5, 256) | normalize |
| @ W_U | (5, 1000) | logits over vocab |
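Since row t of the logits predicts token t+1, the (T, V) output lines up with the input sequence shifted by one. The sketch below shows that alignment as a next-token cross-entropy loss. It is an illustration only: the function names are assumptions, and the logits are a uniform placeholder rather than the output of the model above.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: (T, V) model outputs; token_ids: (T,) the input sequence.
    """
    preds = logits[:-1]                  # positions 0..T-2 have a known next token
    targets = token_ids[1:]              # the tokens at positions 1..T-1
    logp = log_softmax(preds)            # (T-1, V)
    return -logp[np.arange(len(targets)), targets].mean()

V = 1000
tokens = np.array([5, 42, 17, 3, 99])
uniform_logits = np.zeros((len(tokens), V))   # placeholder: a model with no preferences
loss = next_token_loss(uniform_logits, tokens)

print(f"{loss:.4f}  vs  ln(V) = {np.log(V):.4f}")
```

A loss close to ln(V) is a useful sanity check for an untrained model: it means the predictions are near uniform, as expected before any learning.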
A trained decoder-only Transformer generates text autoregressively: it predicts the next token, appends it to the sequence, and repeats. The causal mask ensures each prediction depends only on previous tokens, so this loop is consistent with how the model was trained.
function generate(model, prompt, max_tokens):
    tokens = prompt
    for _ in range(max_tokens):
        logits = model.forward(tokens)              # (T, V)
        next_logits = logits[-1]                    # last position predicts next
        next_token = sample(next_logits)            # greedy / top-k / nucleus
        tokens = tokens + [next_token]              # append and repeat
        if next_token == EOS: break
    return tokens

Sampling Strategies
| Strategy | How | Effect |
|---|---|---|
| Greedy | Always pick argmax | Deterministic; can be repetitive |
| Temperature | Divide logits by a temperature τ before softmax | τ>1 more random, τ<1 more peaked |
| Top-k | Sample from k highest-probability tokens | Cuts off the long tail |
| Nucleus (top-p) | Sample from the smallest set whose cumulative probability reaches p | Adapts cutoff to confidence |
| Beam search | Track b best sequences | Better for translation, not open text |
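The first four strategies in the table fit in one small function. This is a sketch under stated assumptions: the names `sample`, `generate` and `toy_forward` are illustrative, a temperature of 0 is treated as greedy decoding by convention, and `toy_forward` is a hypothetical stand-in for a trained model so that the loop can run on its own. Beam search is not covered.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Pick one token ID from a (V,) logit vector."""
    if temperature == 0:                        # convention: 0 means greedy
        return int(np.argmax(logits))
    logits = logits / temperature
    if top_k is not None:
        kth = np.sort(logits)[-top_k]           # k-th largest logit
        logits = np.where(logits >= kth, logits, -np.inf)  # ties may keep more than k
    probs = softmax(logits)
    if top_p is not None:
        order = np.argsort(probs)[::-1]         # most probable first
        cum = np.cumsum(probs[order])
        keep = cum - probs[order] < top_p       # keep until mass p is covered
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs = probs / probs.sum()             # renormalize the survivors
    return int(rng.choice(len(probs), p=probs))

def generate(forward, prompt, max_tokens, rng, eos=None, **sample_kwargs):
    """Autoregressive loop: predict, sample, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = forward(np.array(tokens))      # (T, V)
        next_token = sample(logits[-1], rng, **sample_kwargs)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens

def toy_forward(token_ids, V=6):
    # Hypothetical stand-in model: strongly prefers (current token + 1) mod V
    logits = np.zeros((len(token_ids), V))
    logits[np.arange(len(token_ids)), (token_ids + 1) % V] = 5.0
    return logits

rng = np.random.default_rng(0)
greedy = generate(toy_forward, [0], max_tokens=4, rng=rng, temperature=0)
print(greedy)                                   # [0, 1, 2, 3, 4]
```

Note that this loop recomputes the full forward pass at every step, exactly as the pseudocode above does; it favours clarity over speed.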
Every real Transformer is a specific configuration of the architecture you just built. Here is how the major models fill in the blanks — the same skeleton, different sizes and refinements.
| Model | Type | Layers | d_model | Refinements |
|---|---|---|---|---|
| GPT-2 small | Decoder | 12 | 768 | Learned pos, GELU, LayerNorm, tied |
| GPT-3 | Decoder | 96 | 12288 | Same as GPT-2, vastly scaled |
| BERT-base | Encoder | 12 | 768 | Learned pos, bidirectional, MLM |
| T5-base | Enc-Dec | 12+12 | 768 | Relative pos, cross-attention |
| LLaMA-2 7B | Decoder | 32 | 4096 | RoPE, RMSNorm, SwiGLU, untied |
| LLaMA-2 70B | Decoder | 80 | 8192 | RoPE, RMSNorm, SwiGLU, GQA |
Notice the pattern: the architecture is stable, but two things change over time. Scale grows relentlessly (12 → 96 layers; 768 → 12288 dimensions), and a handful of refinements accumulate (Pre-LN replaces Post-LN; RoPE replaces learned positions; RMSNorm replaces LayerNorm; SwiGLU replaces GELU; GQA reduces KV memory). The core, a stack of residual blocks that alternate attention and FFN, is unchanged from 2017.
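The effect of scale can be made concrete with a back-of-the-envelope parameter count. The sketch below assumes the standard block from this chapter (about 4d² attention weights plus 8d² FFN weights, so 12d² per layer) and ignores biases, normalization parameters and learned position tables. The vocabulary size 50,257 is GPT-2's BPE vocabulary, supplied here for illustration since the table does not list it.

```python
def approx_params(n_layers, d, vocab, tied=True):
    """Rough count: 12*d^2 per block plus the embedding table(s)."""
    blocks = 12 * n_layers * d * d
    embeddings = vocab * d * (1 if tied else 2)
    return blocks + embeddings

gpt2_small = approx_params(12, 768, vocab=50257)
gpt3 = approx_params(96, 12288, vocab=50257)

print(f"GPT-2 small: ~{gpt2_small / 1e6:.0f}M parameters")
print(f"GPT-3:       ~{gpt3 / 1e9:.0f}B parameters")
```

The estimates land close to the published sizes (124M and 175B), which shows how little of a large model lives outside the attention and FFN weight matrices. The same formula is less accurate for the LLaMA rows, whose gated FFN and grouped-query attention change the per-layer count.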
A Transformer can have a subtle bug and still train — just worse — making errors hard to spot. Here are the most common implementation pitfalls and how to catch them.
| Pitfall | Symptom | Fix |
|---|---|---|
| Forgot √d_k scaling | Slow/stuck training | Divide scores by √d_k |
| Wrong mask broadcasting | Model peeks at future | Verify causal mask shape (T,T) |
| Forgot positional encoding | Bag-of-words behaviour | Add position before block 1 |
| Post-LN at depth | Divergence past ~12 layers | Use Pre-LN |
| No gradient clipping | NaN loss spikes | Clip grad norm to ~1.0 |
| Mask applied after softmax | Future leaks in | Mask BEFORE softmax (set -∞) |
| Tied weights, wrong transpose | Shape error or garbage | W_U = Eᵀ, check shapes |
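Several of these pitfalls can be caught with a simple perturbation test: change the last token and check that no earlier output moves. The sketch below applies it to the "mask applied after softmax" row. It assumes a stripped-down attention that uses x itself as Q, K and V; the function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(x, mask_before_softmax=True):
    """Single-head attention using x itself as Q, K and V (no learned weights)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    allowed = np.tril(np.ones((T, T), dtype=bool))
    if mask_before_softmax:
        weights = softmax(np.where(allowed, scores, -np.inf))   # correct
    else:
        weights = softmax(scores) * allowed                     # bug: mask after softmax
    return weights @ x

def leaks_future(attn_fn, T=6, d=8, seed=0):
    """Perturb the last token; report whether any earlier output changes."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(T, d))
    x_changed = x.copy()
    x_changed[-1] += 1.0
    out_a, out_b = attn_fn(x), attn_fn(x_changed)
    return not np.allclose(out_a[:-1], out_b[:-1])

correct_leaks = leaks_future(causal_attention)
buggy_leaks = leaks_future(lambda x: causal_attention(x, mask_before_softmax=False))
print(correct_leaks, buggy_leaks)    # False True
```

In the buggy version the future weights are zeroed, but the softmax denominator has already seen the future scores, so information still leaks. The same test can be pointed at a full model's forward pass.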
Architecture Quick-Reference
| Component | Role | Modern choice |
|---|---|---|
| Token embedding | Token → vector | Learned table (V, d) |
| Positional encoding | Inject order | RoPE |
| Self-attention | Mix across positions | Multi-head + causal mask |
| Feed-forward | Process per position | SwiGLU, ~8/3 d hidden |
| Normalization | Stabilize | RMSNorm, Pre-LN |
| Residual stream | Communication channel | Identity skip connections |
| LM head | Vector → logits | Linear (often weight-tied) |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.
Further reading: “Attention Is All You Need” (Vaswani et al., 2017) — the original architecture. “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019) for GPT-2's decoder-only design. “RoFormer” (Su et al., 2021) for RoPE. “LLaMA” (Touvron et al., 2023) for the modern recipe. Andrej Karpathy's nanoGPT and his 'Let's build GPT' video — the best hands-on companion to this chapter. “A Mathematical Framework for Transformer Circuits” (Elhage et al., 2021) for the residual-stream view.
Next → Chapter 14: Tokenization
You have built a complete Transformer that maps token IDs to predictions — but where do token IDs come from? Chapter 14 fills the one remaining gap: how raw text becomes the integer sequences the model consumes. We will build Byte-Pair Encoding from scratch, compare it to WordPiece and Unigram, explore the surprising ways tokenization shapes model behaviour (arithmetic, multilingual fairness, the 'SolidGoldMagikarp' glitch tokens), and understand why tokenization is both essential and a persistent source of model quirks.