Part IV: Pretraining at Scale
Chapter 19

Training Stability & Monitoring

Loss spikes, gradient norms, and what to watch
20 Exercises
19.1

Chapter 13 built the Transformer, and its core has been remarkably durable: stacked blocks of attention and feed-forward, residual connections, normalization. Yet GPT-2, LLaMA-3, and the latest frontier models differ in dozens of details. This chapter surveys those variations — the architectural choices that, layered on the stable core, distinguish each generation and each family of models.

What Stays, What Changes

Unchanged since 2017 | Refined over time
Self-attention as the mixing operation | Positional encoding (learned → RoPE)
Feed-forward per position | Activation (ReLU → GELU → SwiGLU)
Residual connections | Normalization (LayerNorm → RMSNorm)
Stacked identical blocks | Norm placement (Post-LN → Pre-LN)
Softmax attention weights | Attention efficiency (MHA → GQA, Flash)
Token embeddings | Sparsity (dense → Mixture of Experts)
Refinements, Not Revolutions
Each change in the right-hand column is a small, empirically-validated improvement — a few percent better perplexity, or a large memory saving at equal quality. None overturns the fundamental architecture. The modern LLM is the 2017 Transformer plus seven years of accumulated refinements, each justified by careful ablation.
This chapter is a tour of those refinements. Understanding why each was adopted — the specific problem it solves — is more valuable than memorizing the list, because it teaches the engineering judgment behind architecture design.
19.2

The single most consequential architectural choice is the overall structure, determined by the attention masking and whether there is a separate encoder. Chapter 13 introduced these; here we treat them as a design space and examine why the field converged on decoder-only for generative LLMs.

Family | Attention | Pretraining | Examples
Encoder-only | Bidirectional | Masked LM (MLM) | BERT, RoBERTa, DeBERTa
Decoder-only | Causal | Next-token | GPT, LLaMA, Claude, Mistral
Encoder-decoder | Bidir. enc + causal dec | Span corruption | T5, BART, Flan-T5

Encoder-Only: Understanding

Encoder-only models (BERT; Devlin et al., 2019) use bidirectional attention — every token sees every other — and train by masking random tokens and predicting them (masked language modeling). They excel at understanding tasks: classification, named-entity recognition, retrieval embeddings. They cannot generate text autoregressively, because every position already sees the future.

Decoder-Only: Generation

Decoder-only models use causal attention and next-token prediction. As Chapter 13 argued, they won for general-purpose LLMs because one model and one objective serve every task via prompting, every token provides a training signal, and the architecture scales cleanly. The entire generative frontier — GPT-4, LLaMA, Claude, Gemini — is decoder-only.

Encoder-Decoder: Seq-to-Seq

Encoder-decoder models (T5; Raffel et al., 2020) keep a bidirectional encoder and a causal decoder joined by cross-attention. They are natural for tasks with a clear input-to-output mapping: translation, summarization. T5's 'span corruption' objective masks contiguous spans rather than single tokens. They remain competitive for focused seq-to-seq tasks but have been eclipsed by decoder-only models for general use.
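The three families differ mainly in which positions each token is allowed to see. The sketch below is a minimal illustration in plain Python; the helper name `attention_mask` and the tiny sequence lengths are assumptions made for the example, not part of any library.

```python
def attention_mask(n_query, n_key, causal):
    """Return an n_query x n_key grid of 0/1; 1 means query i may attend to key j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n_key)]
            for i in range(n_query)]

# Encoder-only: every token sees every other token.
encoder_mask = attention_mask(4, 4, causal=False)
# Decoder-only: token i sees only positions 0..i.
decoder_mask = attention_mask(4, 4, causal=True)
# Encoder-decoder cross-attention: each of 3 decoder positions sees all 4 encoder positions.
cross_mask = attention_mask(3, 4, causal=False)

for name, mask in [("encoder", encoder_mask), ("decoder", decoder_mask), ("cross", cross_mask)]:
    print(name)
    for row in mask:
        print(" ", row)
```

In an encoder-decoder model the decoder's self-attention would additionally use the causal mask; only the cross-attention into the encoder is unrestricted.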

ML Connection: Encoders Still Rule Embeddings
Although decoder-only dominates generation, encoder-only models remain a strong and widely used choice for producing text embeddings: the dense vectors used in retrieval, clustering, and RAG (Chapter 29). Their bidirectional attention lets every token inform the final representation, which is ideal for encoding a fixed input into a single vector.
Modern embedding models (E5, BGE, and others) are typically encoder-only or encoder-derived. So the 'decoder-only won' story applies to generation; for understanding and retrieval, the bidirectional encoder is alive and well.
19.3

Chapter 13 introduced positional encoding and previewed RoPE. Here we examine the variants in depth, because the choice profoundly affects a model's ability to handle long context — a central concern for modern LLMs. The trend has moved decisively from absolute positions toward relative schemes that extrapolate better.

Method | Mechanism | Long-context extrapolation
Sinusoidal | Add fixed sin/cos vectors | Poor beyond training length
Learned absolute | Add a learned vector per position | None; fails past max length
RoPE | Rotate Q,K by position angle | Good; extendable via scaling
ALiBi | Add distance penalty to scores | Strong; trains short, tests long
Relative (T5) | Learned bias per relative distance | Good within bucketed range

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) is the dominant choice in modern LLMs. Instead of adding a position vector, it ROTATES the query and key vectors by an angle proportional to their absolute position. The magic: when you take the dot product of a rotated query and a rotated key, the result depends only on their RELATIVE position — absolute position cancels out. This gives relative-position awareness for free, with no added parameters.

[text] RoPE: rotation encodes relative position
Rotate q at position m and k at position n by their position angles:
    q̃_m = R(mθ) q       k̃_n = R(nθ) k

Dot product depends only on the offset (n - m):
    q̃_m · k̃_n = qᵀ R(mθ)ᵀ R(nθ) k = qᵀ R((n-m)θ) k      # relative position!
[Python] Rotary Position Embedding from scratch
import torch

def build_rope(seq_len, dim, base=10000):
    """Precompute cos/sin tables for RoPE. dim must be even."""
    # Frequencies: lower dims rotate fast, higher dims slow
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)       # (seq_len, dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    """Rotate x (..., seq, dim) by the precomputed angles."""
    x1, x2 = x[..., ::2], x[..., 1::2]  # even/odd dims
    # Rotate each (x1, x2) pair by its angle
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    return torch.stack([rot1, rot2], dim=-1).flatten(-2)

# Applied to Q and K (NOT V) just before the attention dot product.
# No learned parameters; position enters purely through rotation.
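
The relative-position property can be checked numerically without any framework. The following is a small sketch using only the standard library; `rope_rotate` is a hypothetical helper that mirrors the even/odd pairing and frequency schedule of the listing above, and the vectors and positions are arbitrary.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position * base**(-i/dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + 1]
        out += [x1 * c - x2 * s, x1 * s + x2 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (n - m = 3) at two very different absolute positions.
score_near = dot(rope_rotate(q, 2), rope_rotate(k, 5))
score_far = dot(rope_rotate(q, 102), rope_rotate(k, 105))
# A different offset (n - m = 4) for contrast.
score_other = dot(rope_rotate(q, 2), rope_rotate(k, 6))

print(score_near, score_far, score_other)
```

The first two scores should agree up to floating-point rounding, while the third differs, because only the offset between the two positions enters the dot product.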

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2021) takes an even simpler approach: it adds NO positional encoding to the embeddings at all. Instead, it adds a distance-proportional penalty directly to the attention scores — the farther back a key is, the more its score is reduced. This biases each query toward nearby tokens and, remarkably, lets a model trained on short sequences generalize to much longer ones at inference.

[text] ALiBi: linear distance penalty
scores[i][j] = q_i · k_j  -  m · (i - j)     # penalize distance

# m is a fixed per-head slope (smaller m = longer effective range)
# No position embeddings; the bias IS the positional signal.
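
As a concrete illustration, the bias for one head can be built in a few lines. This is a sketch under stated assumptions: the slope schedule 2^(-8/n), 2^(-16/n), ... is the geometric schedule described in the ALiBi paper for a power-of-two head count, and the function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to causal scores: -slope * (i - j) for j <= i; None marks masked future keys."""
    return [[slope * (j - i) if j <= i else None for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes)
for row in bias:
    print(row)
```

Heads with small slopes penalize distance only weakly and so keep a long effective range, while heads with large slopes concentrate on the most recent tokens.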
ML Connection: Why Long-Context Models Use RoPE Scaling
RoPE trained at one context length can be extended to longer contexts by SCALING the rotation frequencies — 'position interpolation' (Chen et al., 2023) and methods like YaRN and NTK-aware scaling let a model trained at 4k tokens work at 32k or 128k with brief fine-tuning. This is how most long-context models extend their windows.
RoPE's clean mathematical structure — position as rotation — is precisely what makes this scaling possible. You will meet these long-context techniques again in Chapter 33.
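Position interpolation can be sketched as a single scale factor on the position index. The example below is illustrative only: `rope_angles` is a hypothetical helper, the 4k and 32k lengths are taken from the text, and YaRN or NTK-aware scaling modify the frequencies in more elaborate ways that are not shown.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    return [position * scale * base ** (-i / dim) for i in range(0, dim, 2)]

train_len, target_len = 4096, 32768
scale = train_len / target_len   # 0.125

# Position 32000 in the extended window is mapped onto the angles of position 4000,
# which lies inside the range the model saw during training.
extended = rope_angles(32000, dim=8, scale=scale)
original = rope_angles(4000, dim=8)

largest_scaled_position = (target_len - 1) * scale
print(extended == original, largest_scaled_position)
```

Every position in the longer window now falls between positions the model was trained on, which is why a brief fine-tune is usually enough to adapt.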
19.4

Chapters 10 and 13 introduced normalization and activations; modern LLMs use specific refined choices that improve stability and quality at scale. We consolidate the modern recipe here.

RMSNorm Over LayerNorm

RMSNorm (Chapter 10) drops LayerNorm's mean-centering, normalizing only by the root-mean-square. It is cheaper (no mean computation, no bias) and empirically matches or beats LayerNorm. LLaMA, Gemma, and most recent models use it. The savings are small per call but add up across hundreds of layers and trillions of tokens.
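
A side-by-side sketch makes the difference concrete. It uses plain Python lists; the `eps` value and the omission of LayerNorm's learned gain and bias are simplifying assumptions for the comparison.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps); no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned gain or bias, for comparison."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print(rms_norm(x))     # same direction as x, rescaled to unit root-mean-square
print(layer_norm(x))   # centred on zero, then rescaled
```

RMSNorm needs one reduction over the vector (the mean of squares), whereas LayerNorm needs two (the mean and the variance).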

SwiGLU Over GELU

The SwiGLU feed-forward (Chapter 10) uses a gated activation with three weight matrices instead of two, with the hidden dimension reduced to (8/3)d to keep the parameter count matched. It consistently improves perplexity at equal parameters and is the standard FFN in LLaMA and PaLM.
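
The parameter matching is simple arithmetic: three d × h matrices with h = (8/3)d give 8d² weights, the same as two d × 4d matrices. The sketch below checks this; the function names are hypothetical, and the optional rounding of the hidden size to a hardware-friendly multiple is an assumption about common practice rather than something stated here.

```python
def gelu_ffn_params(d):
    """Two matrices: d -> 4d and 4d -> d (biases ignored)."""
    return 2 * d * (4 * d)

def swiglu_ffn_params(d, hidden):
    """Three matrices: gate (d -> hidden), up (d -> hidden), down (hidden -> d)."""
    return 3 * d * hidden

def swiglu_hidden(d, multiple_of=1):
    """ceil(8d/3), rounded up to a multiple of `multiple_of` (rounding is an assumption)."""
    raw = (8 * d + 2) // 3
    return -(-raw // multiple_of) * multiple_of

d = 3072  # divisible by 3, so (8/3)d is an integer
print(gelu_ffn_params(d), swiglu_ffn_params(d, swiglu_hidden(d)))
print(swiglu_hidden(4096, multiple_of=256))
```

When d is not divisible by 3, or when the hidden size is rounded up, the two counts differ slightly, so "param-matched" should be read as approximate.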

Pre-LN and QK-Norm

Pre-LN (normalize before each sublayer; Chapter 13) keeps the residual stream clean and is essential for training deep models stably. A newer refinement, QK-normalization, applies normalization to the queries and keys before the attention dot product, which stabilizes training of very large models by preventing attention-logit explosion.

Component | Old default | Modern default
Normalization | LayerNorm | RMSNorm
Norm placement | Post-LN | Pre-LN (+ sometimes QK-norm)
FFN activation | ReLU / GELU | SwiGLU
FFN hidden size | 4d | (8/3)d for SwiGLU (param-matched)
Positional | Learned absolute | RoPE
Bias terms | Present | Often removed (LLaMA drops most biases)
Effic Note: Removing Bias Terms
A subtle modern refinement: LLaMA and several other models remove most bias terms from linear layers and normalizations. With large models and good normalization, biases add parameters and a slight instability risk without measurable benefit. Dropping them simplifies the model and very slightly reduces parameters and compute.
This exemplifies the refinement philosophy: each change is small and empirically motivated. None is dramatic alone, but together — RMSNorm, SwiGLU, RoPE, no biases, Pre-LN — they define the 'LLaMA recipe' that most open models now follow.
19.5

The attention-efficiency variants in the next sections are all responses to one problem: the KV-cache. Recall from Chapter 13 that autoregressive generation caches the keys and values of past tokens to avoid recomputing them. This cache is fast, but it is large — and at long context and large batch sizes, it becomes the dominant memory cost and the bottleneck on inference throughput.

The Size of the KV-Cache

[text] KV-cache memory
KV-cache = 2 · n_layers · n_heads · head_dim · seq_len · batch · bytes

# the 2 is for Keys AND Values
Example: 70B model (80 layers, 64 heads, 128 dim), 8k context, batch 32, bf16:
    2 × 80 × 64 × 128 × 8192 × 32 × 2  ≈  687 GB

That is about 687 GB of KV-cache for a single batch, several times the roughly 140 GB that 70B parameters occupy in bf16. This is why long-context, high-throughput inference is hard: the cache, not the parameters, fills the GPU memory. The number of attention heads is a direct multiplier, which points to the fix: reduce the number of distinct keys and values by sharing them across query heads.
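
The formula is easy to wrap in a helper for experimenting with other configurations. This is a minimal sketch; the function name and the small 32-layer configuration are assumptions chosen only to keep the numbers easy to follow.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 7B-class configuration, chosen only for illustration.
small = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(small / 2**30, "GiB")

# Cost of one extra token of context for one sequence.
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=1, batch=1)
print(per_token / 2**10, "KiB per token")
```

Because every factor enters multiplicatively, doubling the context length or the batch size doubles the cache, and halving the number of cached heads halves it.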

Dist Note: Generation Is Memory-Bandwidth-Bound
During generation, each new token requires reading the entire KV-cache from memory to compute attention. The bottleneck is not compute (the matmuls are small for one token) but MEMORY BANDWIDTH — how fast you can stream the cache through the GPU. This is the opposite of training, which is compute-bound.
Because generation is bandwidth-bound, shrinking the KV-cache directly speeds up inference: less data to read per token means faster generation. This is the central motivation for multi-query and grouped-query attention, the subject of the next section.
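A back-of-the-envelope bound shows why cache size translates into latency. In the sketch below every number is an illustrative assumption (bandwidth, weight size, and cache size are not measurements of any particular GPU or model), and the bound ignores compute and overlap entirely.

```python
def min_decode_step_seconds(cache_bytes, weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: every cached byte and weight byte is read once."""
    return (cache_bytes + weight_bytes) / bandwidth_bytes_per_s

# All figures below are assumed values for illustration only.
bandwidth = 2e12                 # 2 TB/s of memory bandwidth
weights = 14e9                   # 7B parameters stored in bf16
cache_full = 64e9                # KV-cache with one K/V head per query head
cache_shared = cache_full / 8    # the same cache with 8x fewer K/V heads

slow = min_decode_step_seconds(cache_full, weights, bandwidth)
fast = min_decode_step_seconds(cache_shared, weights, bandwidth)
print(f"{slow * 1e3:.1f} ms vs {fast * 1e3:.1f} ms per decode step")
```

Under these assumptions the step time is dominated by the cache, so shrinking the cache shortens each step almost proportionally until the weight reads become the larger term.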
19.6

Multi-Query Attention (MQA; Shazeer, 2019) and Grouped-Query Attention (GQA; Ainslie et al., 2023) shrink the KV-cache by sharing keys and values across query heads. Standard multi-head attention gives every head its own K and V; MQA gives ALL heads a single shared K and V; GQA is the middle ground, sharing K and V within small groups of query heads.

The Spectrum

Variant | K/V heads | KV-cache | Quality
Multi-head (MHA) | = query heads (e.g. 64) | Largest | Best
Grouped-query (GQA) | Groups (e.g. 8) | Reduced ~8× | ≈ MHA
Multi-query (MQA) | 1 shared | Smallest | Slight drop

GQA hits the sweet spot and is now standard in LLaMA-2 70B, LLaMA-3, Mistral, and most recent models. With 64 query heads and 8 KV groups, the KV-cache shrinks 8× with negligible quality loss — a large inference win for almost no cost. Here is the head-sharing layout:

Device Grid: Grouped-Query Attention: 8 query heads, 2 KV groups

Query head | Q0 | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7
K/V group | KV-A | KV-A | KV-A | KV-A | KV-B | KV-B | KV-B | KV-B
[Python] Grouped-query attention from scratch
import torch
import torch.nn.functional as F

def grouped_query_attention(Q, K, V, n_groups):
    """Q: (B, n_q_heads, T, d)   K,V: (B, n_groups, T, d)."""
    B, n_q, T, d = Q.shape
    heads_per_group = n_q // n_groups

    # Repeat each K/V group to match its query heads
    K = K.repeat_interleave(heads_per_group, dim=1)  # (B, n_q, T, d)
    V = V.repeat_interleave(heads_per_group, dim=1)

    # Standard attention, but K/V were shared within groups
    scores = (Q @ K.transpose(-2, -1)) / d**0.5
    return (F.softmax(scores, dim=-1) @ V)

# n_groups = n_q  -> standard MHA (no sharing)
# n_groups = 1    -> MQA (all heads share one K/V)
# n_groups = 8    -> GQA (LLaMA-3 uses this)

# The KV-cache stores only n_groups K/V heads instead of n_q,
# shrinking it by n_q/n_groups -- e.g. 64/8 = 8x smaller.
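
The head-to-group assignment implied by `repeat_interleave` is plain integer division. The sketch below spells it out; the helper names are hypothetical, and the contiguous grouping is the layout assumed by the listing above.

```python
def kv_group_for_head(q_head, n_q_heads, n_groups):
    """Index of the K/V group that a given query head reads from (contiguous grouping)."""
    if n_q_heads % n_groups != 0:
        raise ValueError("n_q_heads must be divisible by n_groups")
    return q_head // (n_q_heads // n_groups)

def cache_reduction(n_q_heads, n_groups):
    """Factor by which the KV-cache shrinks relative to full multi-head attention."""
    return n_q_heads / n_groups

layout = [kv_group_for_head(h, n_q_heads=8, n_groups=2) for h in range(8)]
print(layout)                    # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction(64, 8))    # 8.0
```

Setting `n_groups` equal to the number of query heads gives each head its own group (MHA), and setting it to 1 sends every head to group 0 (MQA).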
Effic Note: GQA Is Now the Default
Grouped-query attention has become the standard attention variant for inference-efficient LLMs. The 8× KV-cache reduction directly translates to longer context, larger batch sizes, and faster generation — the memory-bandwidth bottleneck of Section 19.5 is the thing being relieved.
The quality cost is minimal: with enough groups (typically 8), GQA matches full multi-head attention on benchmarks. It is a near-free lunch for inference, which is why essentially every recent open model adopts it.
19.7

FlashAttention (Dao et al., 2022) is one of the most impactful systems contributions to LLMs. It computes EXACT attention — no approximation — but does so without ever materializing the full (T×T) attention score matrix in slow GPU memory. By restructuring the computation to keep data in fast on-chip SRAM, it dramatically reduces memory traffic and speeds up both training and inference.

The Key Idea: Tiling and Online Softmax

Standard attention writes the entire T×T score matrix to GPU high-bandwidth memory (HBM), then reads it back for the softmax and the value multiplication, which is a huge amount of slow memory traffic. FlashAttention instead processes attention in tiles that fit in fast SRAM, computing a running 'online softmax' that never needs the full matrix in HBM. The result is the same up to floating-point rounding, but the extra memory drops from O(T²) to O(T), and HBM traffic falls sharply because each tile of scores is produced and consumed on-chip instead of making a round trip through HBM.

[text] FlashAttention: the IO win
Standard: write & read T×T score matrix to/from HBM  →  O(T²) extra HBM storage and traffic

FlashAttention: tile Q,K,V into SRAM-sized blocks,
    compute attention block-by-block with online softmax,
    never store the full T×T matrix      →  O(T) extra HBM storage, far fewer HBM reads/writes

Same exact result; far less slow-memory traffic → 2-4× faster.
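
The online-softmax trick can be demonstrated for a single query in plain Python. This is a sketch of the idea only, not of the real kernel: it ignores the 1/√d scaling, masking, and all GPU details, and the function names and toy data are assumptions.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: softmax over all scores, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time with a running softmax."""
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[0.1, 0.2], [1.0, -0.5], [0.3, 0.8], [-0.7, 0.4], [2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

The tiled version only ever holds one block of scores plus a running maximum, a running sum, and an output accumulator, which is the state FlashAttention keeps on-chip.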
Effic Note: It's About the Memory Hierarchy, Not the FLOPs
FlashAttention does NOT reduce the number of floating-point operations — attention is still O(T²) compute. What it reduces is MEMORY TRAFFIC: the movement of data between slow HBM and fast SRAM. On modern GPUs, attention is memory-bound, so reducing traffic directly speeds it up by 2–4×.
This echoes the lesson of gradient checkpointing (Chapter 11): sometimes the win is not fewer operations but better use of the memory hierarchy. FlashAttention-style kernels are now the default attention implementation in essentially every serious framework; in PyTorch, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention kernel when the hardware, dtype, and arguments allow it.

Why It Also Saves Memory

Because FlashAttention never materializes the T×T matrix, its memory footprint is linear in sequence length rather than quadratic. This is what made long-context training practical: a 32k-token sequence would need a 32k×32k = 1-billion-entry score matrix per head with standard attention, but FlashAttention needs only the linear-sized tiles. Long context owes much of its feasibility to this single kernel.

19.8

FlashAttention makes exact attention faster but does not change its O(T²) compute. For very long contexts, even that is too expensive, so a family of variants APPROXIMATES attention by having each token attend to only a subset of positions, trading some expressiveness for sub-quadratic cost.

Pattern | Each token attends to | Cost
Sliding window | A fixed window of w nearby tokens | O(T·w) linear
Dilated / strided | Every k-th token (gaps) | Sub-quadratic
Global + local | A few global tokens + local window | O(T·w + T·g)
Block-sparse | Selected blocks of the matrix | Configurable
Random (BigBird) | Window + global + random tokens | O(T) linear

Sliding-Window Attention

The simplest and most widely used sparse pattern is the sliding window (used in Longformer and Mistral): each token attends only to the w previous tokens, not the whole history. This makes attention linear in sequence length. Crucially, stacking layers gives an effective receptive field that grows with depth — just as in a CNN — so information can still propagate across the full sequence, indirectly, through the layer stack.

[text] Sliding-window receptive field
Each layer attends to w nearby tokens (window size w).
After L layers, the effective receptive field ≈ L × w tokens.

Mistral 7B: w = 4096, 32 layers  →  effective ~131k token reach,
    while each attention op stays O(T · 4096), not O(T²).
[Python] Sliding-window attention mask
import torch

def sliding_window_mask(seq_len, window):
    """Causal mask: each token sees itself and the `window - 1` tokens before it."""
    i = torch.arange(seq_len)[:, None]    # query positions
    j = torch.arange(seq_len)[None, :]    # key positions
    # Attend to j where: j <= i (causal) AND i - j < window
    mask = (j <= i) & (i - j < window)
    return mask

m = sliding_window_mask(6, window=3)
print(m.int())
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]   <- token 3 no longer sees token 0 (outside window)
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
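
The growth of the receptive field with depth can be traced directly. The sketch below follows the mask convention of the listing above (a token sees itself and the `window - 1` tokens before it); the function name is hypothetical, and it tracks only which positions can influence a target, not how strongly.

```python
def reachable_positions(window, n_layers, target):
    """Positions whose input can influence `target` after n_layers of windowed attention."""
    reach = {target}
    for _ in range(n_layers):
        # Each position i reads positions max(0, i - window + 1) .. i in the layer below.
        reach = {j for i in reach for j in range(max(0, i - window + 1), i + 1)}
    return reach

for layers in (1, 2, 3):
    span = reachable_positions(window=4, n_layers=layers, target=63)
    print(layers, "layers ->", len(span), "positions, earliest", min(span))
```

With this convention the span grows by `window - 1` positions per layer, which for large windows is the L × w rule of thumb.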
Warning: Sparse Attention Trades Expressiveness for Cost
Sparse and windowed attention are approximations: a token genuinely cannot directly attend to everything. For tasks needing precise long-range lookups (retrieving an exact fact from far back), this can hurt. The deep-stack receptive field helps, but indirect multi-hop attention is weaker than direct attention.
This is why many frontier models still use full (FlashAttention-accelerated) attention up to long context lengths, reserving sparsity for extreme lengths. The right choice depends on the context length and task — there is no universal winner, which Chapter 33 explores further.
19.9

The most significant architectural departure from the dense Transformer is the Mixture of Experts (MoE). It addresses a fundamental tension: more parameters mean a more capable model, but also more compute per token. MoE breaks this link by activating only a small subset of parameters for each token. This section previews the idea; Chapter 32 covers it in full.

The Core Idea

In a dense model, every token passes through every parameter. In an MoE, the feed-forward layer is replaced by many parallel 'expert' FFNs plus a small 'router' network. For each token, the router selects only the top-k experts (typically 1 or 2 of many), so each token uses only a fraction of the total parameters. The model has a huge parameter count but a small ACTIVE parameter count per token.

[text] MoE: total vs active parameters
Dense FFN:  every token uses all N_ffn parameters.

MoE:  E experts, router picks top-k per token.
    total params  = E × N_expert    (huge)
    active params = k × N_expert    (small, fixed per token)

Mixtral 8×7B: 47B total params, but only ~13B active per token.
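
Top-k routing for a single token can be sketched in plain Python. The example makes several assumptions for illustration: the gate weights are a softmax over only the selected experts' logits (one common convention; others normalize before selecting), the "experts" are toy scaling functions rather than FFNs, and all names are hypothetical.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    order = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = order[:k]
    top = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - top) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}

def moe_ffn(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for e, gate in gates.items():
        y = experts[e](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, sorted(gates)

# Four toy "experts" that just scale their input; real experts are feed-forward networks.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_ffn([1.0, -1.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(used, output)
```

Only the experts listed in `used` are evaluated for this token, which is the sense in which active parameters are a fraction of total parameters.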
ML Connection: Why MoE Matters for Scaling
MoE lets a model have far more parameters — more knowledge capacity — without proportionally more compute per token. This is a direct response to the scaling and inference economics of Chapter 16: you get the capability of a huge model at the inference cost of a much smaller one. Mixtral, and reportedly several frontier models, use MoE for exactly this reason.
The 6ND FLOP rule of Chapter 16 must be adjusted for MoE: compute scales with ACTIVE parameters, not total. This decoupling is why MoE is one of the most important frontier techniques — and why it earns a full chapter (32) in Part VII.

The Challenges (Previewed)

Routing: the router must learn to send tokens to the right experts, and must balance load so no expert is overwhelmed or starved.
Training instability: the discrete routing decision is hard to train; auxiliary load-balancing losses are needed.
Memory: although compute is low, ALL experts' parameters must be stored, so MoE models have large memory footprints.
Communication: in distributed training, routing tokens to experts on other GPUs adds all-to-all communication (Chapter 32).
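
One widely used form of auxiliary load-balancing loss multiplies, for each expert, the fraction of tokens routed to it by the mean router probability it receives. The sketch below follows that form as described for top-1 routing in the Switch Transformer paper; it is an assumption about which loss is meant here, and the function name and toy inputs are hypothetical.

```python
def load_balance_loss(assignments, router_probs, n_experts):
    """n_experts * sum_e f_e * p_e, where f_e is the fraction of tokens sent to
    expert e and p_e is the mean router probability assigned to expert e."""
    n_tokens = len(assignments)
    frac = [sum(1 for a in assignments if a == e) / n_tokens for e in range(n_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Four tokens spread evenly over four experts with a uniform router.
balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, n_experts=4)
# Four tokens all sent to expert 0 by a router that strongly prefers it.
collapsed = load_balance_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, n_experts=4)
print(balanced, collapsed)
```

The loss is smallest when tokens and router probability are spread evenly, so adding it to the training objective with a small coefficient nudges the router away from overloading a few experts.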
19.10

We can now assemble the refinements into the anatomy of a contemporary decoder-only LLM — the 'LLaMA-3 recipe' that most open models follow. Compare this to the Chapter 13 baseline to see how the refinements layer on the stable core.

Arch Stack: A modern decoder-only block (LLaMA-3 style), read bottom to top

+ residual | x + FFN_out
SwiGLU FFN | d → (8/3)d → d, no bias
RMSNorm | pre-norm
+ residual | x + Attn_out
Grouped-Query Attention | RoPE + GQA + FlashAttn
RMSNorm | pre-norm
input x | (B, T, d)
Choice | Chapter 13 baseline | Modern (LLaMA-3)
Positional | Sinusoidal/learned | RoPE (with scaling for long context)
Normalization | LayerNorm | RMSNorm, pre-norm
FFN | GELU, 4d | SwiGLU, (8/3)d, no bias
Attention | Multi-head | Grouped-query + FlashAttention
Biases | Present | Removed
Vocab | ~50k | 128k (tiktoken)
Context | ~2k | 8k–128k+ (RoPE scaling)
Same Skeleton, Better Everywhere
Every modern refinement sits on the same skeleton you built in Chapter 13: embed, stack of (norm → attention → residual → norm → FFN → residual) blocks, unembed. RoPE refines position, RMSNorm refines normalization, SwiGLU refines the FFN, GQA and FlashAttention refine attention's efficiency. None changes the fundamental data flow.
This is the deep lesson of the chapter: progress in LLM architecture has been evolutionary, not revolutionary. Understanding the Chapter 13 core plus these targeted refinements is understanding the architecture of every model at the frontier today.
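The shared data flow can be written down independently of what the sublayers compute. The sketch below is a structural illustration only: the sublayers are passed in as callables, the placeholders stand in for real attention, FFN, and normalization, and a single vector stands in for the (B, T, d) tensor.

```python
def decoder_block(x, attention, ffn, norm_attn, norm_ffn):
    """Pre-norm residual block: each sublayer reads a normalised copy of x,
    and its output is added back onto the untouched residual stream."""
    attn_out = attention(norm_attn(x))
    x = [a + b for a, b in zip(x, attn_out)]
    ffn_out = ffn(norm_ffn(x))
    return [a + b for a, b in zip(x, ffn_out)]

def decoder_stack(x, blocks):
    """Apply a list of blocks in order; each block is a dict of its four sublayers."""
    for block in blocks:
        x = decoder_block(x, **block)
    return x

# Placeholder sublayers so the data flow can be traced by hand.
identity = lambda v: v
halve = lambda v: [0.5 * t for t in v]
zero = lambda v: [0.0] * len(v)

block = dict(attention=halve, ffn=zero, norm_attn=identity, norm_ffn=identity)
print(decoder_stack([2.0, 4.0], [block, block]))   # [4.5, 9.0]
```

Swapping RMSNorm for LayerNorm, SwiGLU for GELU, or GQA for MHA changes only the callables passed in; `decoder_block` itself stays the same, which is the sense in which the skeleton is unchanged.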
19.11

Variants Quick-Reference

Variant | Solves | Used in
Decoder-only | General-purpose generation | GPT, LLaMA, Claude
RoPE | Relative position, long context | LLaMA, Mistral, most LLMs
ALiBi | Length extrapolation | BLOOM, some long-context models
RMSNorm / SwiGLU | Cheaper, better stability/quality | LLaMA, Gemma, PaLM
GQA | Smaller KV-cache, fast inference | LLaMA-2/3 70B, Mistral
FlashAttention | Faster exact attention | Essentially universal
Sliding window | Linear-cost long context | Longformer, Mistral
Mixture of Experts | Capacity without compute | Mixtral (full detail Ch. 32)

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

Exercise 1: Pen & Paper
Compare encoder-only, decoder-only, and encoder-decoder in terms of attention masking, pretraining objective, and best-fit tasks. Why did decoder-only win for generative LLMs?
Exercise 2: Derive
Show that for RoPE, the dot product of a query at position m and a key at position n depends only on (m−n). Use the rotation-matrix identity R(mθ)ᵀR(nθ)=R((n−m)θ).
Exercise 3: Pen & Paper
Explain why learned absolute positional embeddings cannot handle sequences longer than training length, while RoPE and ALiBi can extrapolate.
Exercise 4: Pen & Paper
Describe ALiBi's linear bias. How does penalizing distant tokens enable a model trained on short sequences to work on longer ones?
Exercise 5: Pen & Paper
List the components of the modern 'LLaMA recipe' and, for each, state the old default it replaced and the problem it addresses.
Exercise 6: Derive
Derive the KV-cache size formula. For a 13B model (40 layers, 40 heads, 128 dim), 16k context, batch 16, bf16, compute the cache size in GB.
Exercise 7: Pen & Paper
Explain why generation is memory-bandwidth-bound while training is compute-bound. How does this make KV-cache size a throughput bottleneck?
Exercise 8: Pen & Paper
Compare MHA, GQA, and MQA in KV-cache size and quality. For 64 query heads and 8 groups, what is the cache reduction factor?
Exercise 9: Pen & Paper
Explain why FlashAttention is faster despite having the same O(T²) FLOPs. What does it actually reduce, and why does that help on a GPU?
Exercise 10: Pen & Paper
For sliding-window attention with window w over L layers, derive the effective receptive field. Why can a windowed model still propagate long-range information?
Exercise 11: Code
Implement RoPE from scratch (cos/sin tables + rotation). Apply it to Q and K, and verify that the attention score between positions depends only on their difference.
Exercise 12: Code
Implement ALiBi: add the per-head linear distance bias to attention scores. Compare the attention distributions of ALiBi vs no positional encoding.
Exercise 13: Code
Implement RMSNorm and SwiGLU from scratch. Build a modern FFN block and verify the parameter count matches a 4d GELU FFN when hidden = (8/3)d.
Exercise 14: Code Lab
Implement grouped-query attention with configurable groups. Measure KV-cache size for MHA, GQA-8, and MQA on a fixed model, and confirm the reduction factors.
Exercise 15: Code
Implement the KV-cache for generation and measure memory growth with sequence length. Then switch to GQA and show the cache shrinks by the expected factor.
Exercise 16: Code
Compare your naive attention against torch.nn.functional.scaled_dot_product_attention (which can dispatch to a FlashAttention kernel on supported GPUs). Verify that the outputs agree to within floating-point tolerance and measure the speedup and memory difference at T = 4096.
Exercise 17: Code
Implement a sliding-window attention mask and apply it. Verify the receptive field grows with the number of stacked layers on a synthetic propagation task.
Exercise 18: Code Lab
Build a toy Mixture-of-Experts FFN with a top-1 router over 4 experts. Verify that each token activates only one expert and measure active vs total parameters.
Exercise 19: Code
Assemble a modern decoder block (RMSNorm + RoPE + GQA + SwiGLU, no biases) and verify shapes flow correctly. Compare its parameter count to the Chapter 13 baseline block.
Exercise 20: Code (Challenge)
Take a small GPT-2-style model and modernize it: replace learned positions with RoPE, LayerNorm with RMSNorm, the GELU FFN with SwiGLU, and multi-head with grouped-query attention. Train both the baseline and modernized versions on the same data and compare perplexity, memory, and generation speed.

Further reading: “RoFormer” (Su et al., 2021) for RoPE and “Train Short, Test Long” (Press et al., 2021) for ALiBi. “Root Mean Square Layer Normalization” (Zhang & Sennrich, 2019) and “GLU Variants Improve Transformer” (Shazeer, 2020) for RMSNorm and SwiGLU. “Fast Transformer Decoding” (Shazeer, 2019, MQA) and “GQA” (Ainslie et al., 2023). “FlashAttention” and “FlashAttention-2” (Dao et al., 2022, 2023). “Longformer” (Beltagy et al., 2020) and “BigBird” (Zaheer et al., 2020) for sparse attention. The LLaMA, Mistral, and Mixtral technical reports for the modern recipe in practice.


Next → Chapter 20: Efficient Training

You now know the architectural variants that make models capable and inference-efficient. Chapter 20 turns to making the TRAINING itself efficient: low-precision number formats (fp8 and beyond), parameter-efficient fine-tuning (LoRA and adapters), the optimized kernels and compilers that raise hardware utilization, and the techniques — quantization-aware training, activation recomputation, fused operations — that squeeze more useful work out of every GPU-hour. These are the methods that determine how much model you can train for a given budget.
