Part III: Deep Learning & the Transformer
Chapter 16

Scaling Laws

Chinchilla, compute-optimal training, and emergent abilities
18 Exercises
16.1

One of the most consequential discoveries in modern machine learning is also one of the simplest to state: as you scale up model size, data, and compute, the loss of a language model decreases in a smooth, predictable power law. This regularity is so reliable that you can forecast the performance of a model costing millions of dollars from a handful of small, cheap experiments. Scaling laws turned the question 'how big should the model be?' from guesswork into engineering.

What a Power Law Looks Like

A power law relates loss L to a resource X (parameters, data, or compute) as L ≈ (X₀/X)^α, in practice sitting on top of an irreducible floor. On a log-log plot the power-law term is a straight line with slope −α, and the empirical loss of language models falls almost exactly on such lines across many orders of magnitude. This is the central empirical fact of the chapter.
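To see why a log-log plot is the natural lens, the sketch below generates a synthetic power-law curve and recovers its exponent from a straight-line fit. The constants `X0`, `ALPHA`, and `FLOOR` are illustrative assumptions, not fitted values from any paper. Note that the line is only straight once the irreducible floor has been subtracted.

```python
import numpy as np

# Illustrative constants (assumptions for this sketch, not published fits)
X0, ALPHA, FLOOR = 1e14, 0.08, 1.7

def power_law_loss(x):
    """Loss = irreducible floor + power-law term (X0 / x)^ALPHA."""
    return FLOOR + (X0 / x) ** ALPHA

# Resource values spanning six orders of magnitude
x = np.logspace(6, 12, 13)
reducible = power_law_loss(x) - FLOOR

# Straight-line fit in log-log space:
# log10(reducible) = -ALPHA * log10(x) + ALPHA * log10(X0)
slope, intercept = np.polyfit(np.log10(x), np.log10(reducible), deg=1)
print(f"fitted slope = {slope:.3f} (expected {-ALPHA})")
```

Fitting the raw loss without subtracting the floor would bend the line upward at large x, which is one reason the floor has to be estimated jointly with the exponent in real forecasts.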

The three scaling power laws (Kaplan et al., 2020)
L(N) ≈ (N_c / N)^α_N      # loss vs model size N (params)
L(D) ≈ (D_c / D)^α_D      # loss vs dataset size D (tokens)
L(C) ≈ (C_c / C)^α_C      # loss vs compute C (FLOPs)

# Each α ≈ 0.05-0.10: a small but relentless improvement with scale
Loss Is Predictable; Capabilities Are the Surprise
Scaling laws predict the LOSS — the average next-token cross-entropy — with remarkable precision. What they do not directly predict is which capabilities will emerge at which scale. A model's loss might improve smoothly while its ability to do arithmetic, or follow instructions, appears suddenly.
This gap — smooth, predictable loss versus seemingly discontinuous capability — is one of the most important and debated phenomena in the field, which we examine in Section 16.8.
History: From Kaplan to Chinchilla
Kaplan et al. (OpenAI, 2020) first established power-law scaling and concluded that, given more compute, you should mostly make the model bigger. Two years later, Hoffmann et al. (DeepMind, 2022) — the 'Chinchilla' paper — corrected this: at fixed compute, model size and data should grow together, roughly equally. Chinchilla showed that the large models of 2020–2021 (like GPT-3) were significantly under-trained on data.
This correction reshaped the field. It is why models got 'smaller but better-trained' after 2022, and why the data-to-parameter ratio became a central design variable.
16.2

Before relating compute to loss, we need to measure compute. The fundamental unit is the floating-point operation (FLOP), and there is a beautifully simple rule for how many FLOPs it takes to train a Transformer: approximately 6 times the number of parameters, per token of training data.

The 6ND rule for training FLOPs
C ≈ 6 · N · D

C = total training compute (FLOPs)
N = number of model parameters
D = number of training tokens

Where the 6 Comes From

The factor of 6 decomposes cleanly. Each parameter participates in roughly 2 FLOPs per token in the forward pass (one multiply, one add in the matrix multiplications). The backward pass costs about twice the forward pass — it computes gradients with respect to both inputs and weights — adding 4 FLOPs. Total: 2 (forward) + 4 (backward) = 6 FLOPs per parameter per token.

Pass | FLOPs per param per token | Why
Forward | 2 | One multiply + one add per weight
Backward (input grad) | 2 | Gradient w.r.t. activations
Backward (weight grad) | 2 | Gradient w.r.t. weights
Total | 6 | 2 forward + 4 backward
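The table's accounting can be checked on a single dense layer. The sketch below counts multiply-add FLOPs for the three matrix products involved in one layer's forward and backward pass. The helper names and layer shapes are hypothetical, and biases, activations, and attention are ignored.

```python
def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) product: one multiply + one add per term."""
    return 2 * m * k * n

def dense_layer_flops(tokens, d_in, d_out):
    """Forward/backward FLOPs for Y = X @ W with X: (tokens, d_in), W: (d_in, d_out)."""
    forward = matmul_flops(tokens, d_in, d_out)        # Y  = X @ W
    grad_input = matmul_flops(tokens, d_out, d_in)     # dX = dY @ W.T
    grad_weight = matmul_flops(d_in, tokens, d_out)    # dW = X.T @ dY
    return forward, grad_input, grad_weight

tokens, d_in, d_out = 1024, 4096, 4096   # hypothetical shapes
n_params = d_in * d_out
fwd, g_in, g_w = dense_layer_flops(tokens, d_in, d_out)

# Normalize by (parameters x tokens) to get FLOPs per parameter per token
norm = n_params * tokens
per_param_per_token = (fwd + g_in + g_w) / norm
print(fwd / norm, g_in / norm, g_w / norm, per_param_per_token)
```

Each of the three products touches every weight once per token, which is why the per-parameter cost is independent of the layer's shape.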
Python: Using the 6ND rule to budget a training run
def training_flops(n_params, n_tokens):
    """Approximate total training FLOPs via the 6ND rule."""
    return 6 * n_params * n_tokens

# GPT-3: 175B params, 300B tokens
C = training_flops(175e9, 300e9)
print(f"GPT-3: {C:.2e} FLOPs")   # 3.15e+23 FLOPs

# How long would this take on 1024 A100s at 50% utilization?
# A100 peak: ~312 TFLOP/s (bf16); effective ~156 TFLOP/s at 50%
gpu_flops_per_sec = 156e12
n_gpus = 1024
seconds = C / (gpu_flops_per_sec * n_gpus)
print(f"Time: {seconds/86400:.1f} days of wall-clock")
# Time: ~22.8 days (a hypothetical A100 cluster, not GPT-3's actual hardware or schedule)

# Cost estimate at an assumed ~$1/GPU-hour
gpu_hours = n_gpus * (seconds / 3600)
cost = gpu_hours * 1.0
print(f"Approx cost: ${cost/1e6:.1f}M")   # ~$0.6M of compute
Train Note: The 6ND Rule Is the Field's Mental Arithmetic
Every practitioner uses the 6ND rule constantly to sanity-check budgets. Given a compute budget C, you can immediately read off the constraint N × D = C/6, the trade-off curve between how big and how long. Given a target model size and dataset, you can estimate cost and wall-clock time with a few seconds of arithmetic.
The rule ignores attention's O(T²) term, which is negligible when the model dimension dwarfs the sequence length (the usual regime), and it assumes dense matmuls. For MoE models (Chapter 32) the active-parameter count replaces N. As a first-order estimate, though, 6ND is astonishingly accurate.
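A rough count shows how small the ignored attention term usually is. The sketch below assumes a standard Transformer block with about 12·d² parameters per layer, and counts 4·T·d forward FLOPs per token per layer for the attention scores and the weighted sum over values, with no causal-masking savings. Both counts are simplifying assumptions, and the function name is hypothetical.

```python
def attention_overhead(d_model, seq_len):
    """Ratio of context-dependent attention FLOPs to parameter FLOPs (forward pass).

    Assumptions of this sketch:
      - ~12 * d_model^2 parameters per layer -> 2 * 12 * d_model^2 FLOPs per token
      - QK^T and (attention @ V) each cost 2 * seq_len * d_model FLOPs per token
    """
    param_flops = 2 * 12 * d_model**2
    attn_flops = 2 * (2 * seq_len * d_model)
    return attn_flops / param_flops      # simplifies to seq_len / (6 * d_model)

# GPT-3-like shape: d_model = 12288, context of 2048 tokens
print(f"wide model, short context: {attention_overhead(12288, 2048):.1%}")
# A narrow model with a long context: the 6ND approximation degrades
print(f"narrow model, long context: {attention_overhead(1024, 8192):.1%}")
```

Under these assumptions the overhead is a few percent for a wide model with a short context, but it exceeds the parameter cost once the sequence length is several times the model dimension.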
16.3

Kaplan et al. (2020) trained hundreds of models across orders of magnitude in size and data, and fit power laws to the results. Their findings established the quantitative foundation of large-model development and held up remarkably well — with one important correction we cover next.

The Key Findings

Loss follows clean power laws in N, D, and C over many orders of magnitude — the relationship is smooth and predictable, not jagged.
Architecture details (depth vs width, number of heads) matter far less than the total parameter count, within reasonable ranges. Scale dominates shape.
Larger models are more sample-efficient: they reach a given loss with fewer tokens than smaller models.
Performance is bottlenecked by whichever resource is scarcest — a large model trained on too little data plateaus, and vice versa.

The Combined Law

Kaplan proposed a combined formula capturing how loss depends jointly on model size and data, with the loss limited by whichever is the binding constraint:

Kaplan combined loss surface
L(N, D) ≈ [ (N_c/N)^(α_N/α_D) + D_c/D ]^α_D

# Loss is high if EITHER N is small OR D is small.
# To improve, you must grow the binding constraint.
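The limiting behaviour of the combined law is easy to check numerically. In the sketch below the constants are approximate values of the kind reported by Kaplan et al. and should be treated as illustrative assumptions. The point is that with abundant data the surface collapses to the pure model-size law, and with an enormous model it collapses to the pure data law.

```python
# Illustrative constants (assumed for this sketch; see Kaplan et al. for fitted values)
N_C, D_C = 8.8e13, 5.4e13
ALPHA_N, ALPHA_D = 0.076, 0.095

def kaplan_loss(n_params, n_tokens):
    """Combined law: L(N, D) = [ (N_c/N)^(alpha_N/alpha_D) + D_c/D ]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

N, D = 1e9, 1e10
HUGE = 1e30   # stand-in for "effectively unlimited"

print(f"L(N, D)        = {kaplan_loss(N, D):.3f}")
print(f"L(N, D -> inf) = {kaplan_loss(N, HUGE):.3f}  vs (N_c/N)^alpha_N = {(N_C / N) ** ALPHA_N:.3f}")
print(f"L(N -> inf, D) = {kaplan_loss(HUGE, D):.3f}  vs (D_c/D)^alpha_D = {(D_C / D) ** ALPHA_D:.3f}")
```

The joint loss is always at least as large as either single-resource limit, which is the formal version of "loss is high if either N or D is small".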

Kaplan's compute-allocation conclusion: given a 10× increase in compute, spend most of it on a bigger model and relatively little on more data. This recommendation — 'make the model much bigger, train on modestly more data' — guided the GPT-3 generation. It was reasonable given the data available, but it turned out to be suboptimal, as Chinchilla revealed.

⚠️
Pitfall: Why Kaplan's Allocation Was Off
Kaplan's experiments used a fixed learning-rate schedule that was not re-tuned for each training duration. This subtly disadvantaged models trained on more data, biasing the conclusion toward 'bigger model, less data.' The Chinchilla team re-ran the experiments with properly tuned schedules and reached a very different allocation.
The lesson: scaling-law conclusions are only as good as the experimental methodology behind them. A subtle confound in the protocol propagated into a field-wide misallocation of compute for two years.
16.4

Hoffmann et al. (2022) revisited the compute-allocation question with more careful methodology and reached a landmark conclusion: at a fixed compute budget, model size N and dataset size D should scale roughly in equal proportion. The implication was that the flagship models of the era were drastically under-trained — too big for the amount of data they saw.

The Chinchilla Loss Law

Chinchilla parametric loss (Hoffmann et al., 2022)
L(N, D) = E + A/N^α + B/D^β

E ≈ 1.69    # irreducible loss (entropy of natural language)
A, α        # model-size term  (α ≈ 0.34)
B, β        # data-size term   (β ≈ 0.28)

This additive form is more interpretable than Kaplan's. There is an irreducible loss E — the entropy of language itself, which no model can beat. Above that, two terms decrease with model size and data size respectively. Minimizing the loss for a fixed compute budget C = 6ND becomes a clean constrained optimization.

The Headline Result: ~20 Tokens per Parameter

Hoffmann et al.'s analysis yields the compute-optimal recipe: at the budgets they studied, train on roughly 20 tokens for every parameter. A compute-optimal 10B-parameter model should see about 200B tokens; a 70B model about 1.4T tokens. This 20:1 token-to-parameter ratio is the single most cited number in modern LLM training.

Model | Params | Tokens (actual) | Tokens/param
GPT-3 (2020) | 175B | 300B | ~1.7 (under-trained)
Gopher (2021) | 280B | 300B | ~1.1 (under-trained)
Chinchilla (2022) | 70B | 1.4T | 20 (compute-optimal)
LLaMA-1 (2023) | 65B | 1.4T | ~22
LLaMA-2 (2023) | 70B | 2.0T | ~29 (past optimal)
LLaMA-3 (2024) | 70B | 15T | ~214 (far past optimal)
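The tokens-per-parameter column is simple arithmetic on the listed sizes. A minimal sketch that recomputes it from the table's own numbers:

```python
# (params, training tokens) as listed in the table above
models = {
    "GPT-3 (2020)":      (175e9, 300e9),
    "Gopher (2021)":     (280e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "LLaMA-1 (2023)":    (65e9, 1.4e12),
    "LLaMA-2 (2023)":    (70e9, 2.0e12),
    "LLaMA-3 (2024)":    (70e9, 15e12),
}

RULE_OF_THUMB = 20  # tokens per parameter

ratios = {name: tokens / params for name, (params, tokens) in models.items()}
for name, ratio in ratios.items():
    # Second column: how far the run sits from the 20:1 rule of thumb
    print(f"{name:<18} {ratio:6.1f} tokens/param  ({ratio / RULE_OF_THUMB:.2f}x the 20:1 rule)")
```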
Chinchilla Beat a Model 4× Its Size
The Chinchilla model (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) despite being one-quarter the size — because it was trained compute-optimally. Same compute budget, far better allocation, dramatically better model. This was the proof that allocation, not just total compute, determines quality.
The practical upshot: a smaller, well-trained model is cheaper to serve at inference AND better. This is why the post-Chinchilla era favored 7B–70B models trained on trillions of tokens over the giant-but-undertrained models that preceded them.
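Plugging both models into the parametric loss from this section gives a feel for the gap. The constants A = 406.4 and B = 410.7 are the fitted values used in this chapter's code lab. The 6ND budgets are approximations, and the predicted losses are outputs of the fitted formula, not measured results.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def training_flops(n_params, n_tokens):
    """6ND approximation of training compute."""
    return 6 * n_params * n_tokens

# (params, tokens) for the two runs compared above
runs = {"Gopher": (280e9, 300e9), "Chinchilla": (70e9, 1.4e12)}
for name, (n, d) in runs.items():
    print(f"{name:<10} C ≈ {training_flops(n, d):.2e} FLOPs, "
          f"predicted loss ≈ {chinchilla_loss(n, d):.3f}")
```

Under this formula the two budgets are within about 20% of each other, yet the smaller model's predicted loss is lower because its data term shrinks by more than its model-size term grows.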
16.5

The core practical use of scaling laws is allocation: given a fixed compute budget C, how do you split it between a bigger model (more N) and more training (more D)? The Chinchilla law turns this into a solvable optimization, and the answer is a specific N and D for every budget.

The Optimization

Compute-optimal allocation
minimize   L(N, D) = E + A/N^α + B/D^β
subject to  C = 6 N D

Solution:  N_opt ∝ C^a,  D_opt ∝ C^b   with  a = β/(α+β),  b = α/(α+β),  so a ≈ b ≈ 0.5 when α ≈ β
⇒ split each 10× of compute as ~3.2× bigger model, ~3.2× more data

The key result: both the optimal model size and the optimal dataset size grow as roughly the square root of compute. Each time you get 10× more compute, you should make the model about 3.2× bigger AND train on about 3.2× more data — not pour it all into size as Kaplan suggested.
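For this loss the constrained optimum has a closed form, which makes the exponents explicit. Substituting D = C/(6N) and setting the derivative with respect to N to zero gives N_opt = G·(C/6)^(β/(α+β)) with G = (αA/(βB))^(1/(α+β)). The sketch below implements that formula with the chapter's fitted constants (A = 406.4, B = 410.7); the function name is hypothetical.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def closed_form_allocation(compute):
    """Minimize A/N^alpha + B/D^beta subject to compute = 6*N*D (closed form)."""
    a = BETA / (ALPHA + BETA)     # exponent of N_opt in C
    b = ALPHA / (ALPHA + BETA)    # exponent of D_opt in C
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = G * (compute / 6.0) ** a
    d_opt = (compute / 6.0) ** b / G
    return n_opt, d_opt, a, b

n1, d1, a, b = closed_form_allocation(1e22)
n2, d2, _, _ = closed_form_allocation(1e23)
print(f"exponents: a = {a:.2f}, b = {b:.2f}")
print(f"10x compute -> model x{n2 / n1:.2f}, data x{d2 / d1:.2f}")
```

Because the fitted α and β are not exactly equal, the split is not exactly even: the exponents land near 0.45 and 0.55, close to the square-root rule but tilted slightly toward data.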

Python: Code Lab: compute-optimal model sizing
from scipy.optimize import minimize_scalar

# Chinchilla parametric loss (fitted constants from Hoffmann et al., 2022)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(compute):
    """Given compute budget C, find N, D minimizing loss s.t. C=6ND."""
    # Parametrize by log10(N); D is then determined by the constraint
    def loss_at(log_N):
        N = 10**log_N
        D = compute / (6 * N)        # from C = 6ND
        return loss(N, D)
    # Search model sizes between 1e7 and 1e12 parameters
    res = minimize_scalar(loss_at, bounds=(7, 12), method='bounded')
    N_opt = 10**res.x
    D_opt = compute / (6 * N_opt)
    return N_opt, D_opt

# Allocate three compute budgets
for C in [1e21, 1e23, 1e25]:
    N, D = optimal_allocation(C)
    print(f"C={C:.0e}: N={N/1e9:.1f}B params, D={D/1e9:.0f}B tokens, ratio={D/N:.0f}")
# Approximate output with these fitted constants:
# C=1e+21: N≈1.8B params,  D≈91B tokens,    ratio≈50
# C=1e+23: N≈14.6B params, D≈1140B tokens,  ratio≈78
# C=1e+25: N≈117B params,  D≈14300B tokens, ratio≈122
# With α=0.34 and β=0.28 the exponents are a = β/(α+β) ≈ 0.45 and b ≈ 0.55, so this
# particular fit drifts toward more tokens per parameter as compute grows. The ~20:1
# rule of thumb reflects Hoffmann et al.'s other estimation approaches, which gave a ≈ b ≈ 0.5.
Train Note: Inference Changes the Calculus
Chinchilla optimizes TRAINING compute. But a model is trained once and served billions of times — so total lifetime compute is dominated by inference. This justifies training a SMALLER model on MORE data than Chinchilla-optimal: you pay extra training cost once to get a cheaper-to-serve model forever.
This is exactly why LLaMA models are trained far past the 20:1 ratio (LLaMA-3 at ~214:1). They are deliberately 'over-trained' relative to Chinchilla because the cheaper inference more than pays for the extra training. Compute-optimal for TRAINING is not compute-optimal for DEPLOYMENT.
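The trade-off can be made concrete by holding predicted loss fixed and comparing total lifetime compute. The sketch below assumes training costs 6ND FLOPs, inference costs about 2N FLOPs per processed token (the forward-pass share of the 6ND rule), and loss follows the Chinchilla parametric form with the chapter's constants. The helper names and the inference volumes are hypothetical.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def optimal_allocation(compute):
    """Closed-form training-optimal (N, D) for the parametric loss, with C = 6ND."""
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n = G * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    return n, compute / (6.0 * n)

def tokens_for_same_loss(n_params, target_loss):
    """Tokens D that a model of size N needs to reach target_loss."""
    data_term = target_loss - E - A / n_params**ALPHA
    if data_term <= 0:
        raise ValueError("model too small to reach this loss at any data size")
    return (B / data_term) ** (1.0 / BETA)

def lifetime_flops(n_params, n_tokens, inference_tokens):
    """Training (6ND) plus inference (~2N per token) compute."""
    return 6 * n_params * n_tokens + 2 * n_params * inference_tokens

# Training-optimal model for a 1e23 FLOP budget, and its predicted loss
n_opt, d_opt = optimal_allocation(1e23)
target = E + A / n_opt**ALPHA + B / d_opt**BETA

# Half-size alternative, over-trained until it matches the same predicted loss
n_small = n_opt / 2
d_small = tokens_for_same_loss(n_small, target)

for served in [0, 1e12, 1e13]:            # hypothetical inference volumes (tokens)
    big = lifetime_flops(n_opt, d_opt, served)
    small = lifetime_flops(n_small, d_small, served)
    print(f"served={served:.0e}: optimal-size {big:.2e} FLOPs, half-size {small:.2e} FLOPs")
```

With no inference the training-optimal model is cheaper by construction. As the served-token count grows past roughly the size of the training set, the half-size model's lower per-token cost overtakes its extra training bill.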
16.6

The most powerful practical application of scaling laws is forecasting. Before committing to a huge, expensive training run, you train a series of small models, fit a scaling law to their losses, and extrapolate to predict the loss of the large model. This de-risks enormous investments and is now standard practice at every frontier lab.

The Forecasting Procedure

Scaling-law forecasting (pseudocode)
# 1. Train a ladder of small models at increasing scale
for scale in [tiny, small, medium, ...]:
    train compute-optimally; record (compute, loss)

# 2. Fit a power law to the (compute, loss) points
L(C) ≈ E + (C_c / C)^α_C        # nonlinear least squares; linear on log-log only after subtracting E

# 3. Extrapolate to the target large-scale compute
predicted_loss = L(C_target)

# 4. Decide: is the predicted gain worth the cost?
Python: Fitting and extrapolating a scaling law
import numpy as np
from scipy.optimize import curve_fit

# Suppose we trained 5 small models and measured their loss
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss    = np.array([3.21, 2.88, 2.61, 2.39, 2.21])

# Fit L(C) = E + (Cc/C)^alpha. Cc can span many orders of magnitude, so we fit
# log10(Cc) rather than Cc itself to keep the optimizer well-conditioned.
def scaling_law(C, E, log10_Cc, alpha):
    return E + 10.0 ** (alpha * (log10_Cc - np.log10(C)))

params, _ = curve_fit(scaling_law, compute, loss,
                      p0=[1.7, 15.0, 0.05], maxfev=100000)
E, log10_Cc, alpha = params
print(f"Fit: E={E:.2f}, log10(Cc)={log10_Cc:.1f}, alpha={alpha:.3f}")

# Extrapolate to a run 1000x larger than our biggest experiment
C_target = 1e25
predicted = scaling_law(C_target, *params)
print(f"Predicted loss at C={C_target:.0e}: {predicted:.3f}")
# The extrapolated value is sensitive to the fitted floor E, so treat it as a
# forecast with error bars rather than a guarantee.
# Frontier labs have reported that such forecasts can closely match the final loss of large runs.
ML Connection: GPT-4's Predicted Performance
OpenAI reported that they predicted GPT-4's final training loss from models trained with as little as 1/1000th the compute, and the prediction was accurate. This forecasting ability is why frontier labs can confidently commit to nine-figure training runs — they know approximately what they will get before they start.
Scaling-law forecasting transformed large-model development from a gamble into an investment with a predictable return. It is arguably as important as any architectural innovation in enabling the modern era of large models.
16.7

Scaling laws tell us that three things reliably move the loss: parameters, data, and compute. Equally important is what does NOT much affect the loss — because that tells you where not to spend your effort.

Factor | Effect on loss
Parameters (N) | Strong power-law improvement
Training tokens (D) | Strong power-law improvement
Compute (C) | Strong power-law (the master variable)
Depth vs width | Weak: many shapes give similar loss at fixed N
Number of attention heads | Weak within a sensible range
Exact activation/norm choice | Weak: a few % at most
Data QUALITY | Strong, but hard to put on the same axis

The Quality Dimension

The classic scaling laws treat all tokens as equal, but data quality matters enormously. Better-filtered, deduplicated, higher-quality data shifts the entire scaling curve downward — you reach a lower loss at the same compute. The Phi model series demonstrated that carefully curated 'textbook-quality' data can let a small model punch far above its parameter count. Data curation (Chapter 17) is the lever that scaling laws assume away but practitioners obsess over.

Architecture Shape Barely Matters; Scale and Data Dominate
A striking and humbling implication of scaling laws: the architectural choices researchers agonize over — exactly how deep, how wide, how many heads — mostly do not matter for loss, as long as the total parameter count is fixed and the choices are reasonable. What matters is scale (N, D, C) and data quality.
This does not mean architecture is irrelevant — efficiency, inference cost, and trainability all depend on it. But for the headline metric of loss, the message of scaling laws is blunt: make it bigger, train it longer, on better data. The rest is second-order.
16.8

Scaling laws predict loss smoothly. Yet some capabilities — multi-step arithmetic, instruction following, chain-of-thought reasoning — appear to switch on suddenly at a certain scale, absent below it and present above. These 'emergent abilities' (Wei et al., 2022) are among the most discussed and contested phenomena in the field.

The Two Sides of the Debate

Emergence is real | Emergence is a measurement artifact
Some tasks show sharp capability jumps | Sharp jumps come from harsh metrics
Below threshold: ~0%; above: high | Exact-match accuracy is all-or-nothing
Qualitatively new behaviour at scale | Smooth metrics reveal smooth improvement
Hard to predict from small models | Per-token loss improves continuously
Wei et al. (2022) | Schaeffer et al. (2023): 'a mirage'

The skeptical view (Schaeffer et al., 2023) is compelling: many 'emergent' jumps are artifacts of the metric. If you score a multi-step task with exact-match (all steps correct or zero credit), then a model whose per-step accuracy improves smoothly will show a sudden jump in exact-match once per-step accuracy crosses a threshold. Switch to a smooth, partial-credit metric and the emergence often dissolves into a smooth curve.
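The mechanism is a few lines of arithmetic. The sketch below assumes a hypothetical task with 10 independent steps and a per-step accuracy that rises linearly across model scales; exact-match accuracy is then the per-step accuracy raised to the 10th power. All numbers are synthetic.

```python
import numpy as np

N_STEPS = 10                                   # hypothetical steps per problem
per_step = np.linspace(0.50, 0.99, 8)          # smooth improvement across 8 model scales
exact_match = per_step ** N_STEPS              # every step must be right (independence assumed)

for scale, (p, em) in enumerate(zip(per_step, exact_match)):
    print(f"scale {scale}: per-step accuracy {p:.2f}, exact match {em:.3f}")
```

The per-step column climbs in equal increments, while the exact-match column sits near zero for the smaller scales and then rises steeply, even though nothing discontinuous happened to the underlying model.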

⚠️
Emergence Depends on How You Measure
The emergent-abilities debate is a cautionary tale about metrics. A discontinuous-looking capability curve may reflect a discontinuous METRIC applied to a smoothly-improving model. Before claiming a capability 'emerged' at some scale, check whether a smoother metric tells a smoother story.
That said, even if loss improves smoothly, the DOWNSTREAM USEFULNESS of a model can still cross practical thresholds suddenly — a model that is 'almost' able to follow instructions is useless, while one just over the line is transformative. Smooth loss does not imply smooth utility.
ML Connection: Why This Matters for Safety and Forecasting
If capabilities can appear suddenly and unpredictably with scale, then forecasting what a larger model will be ABLE to do is much harder than forecasting its loss — with implications for AI safety, since a dangerous capability might appear without warning. If, instead, capabilities track smooth metrics, they are more forecastable.
This is why the emergence debate is not merely academic. Whether frontier capabilities are predictable bears directly on how much we can anticipate — and prepare for — the behaviour of the next generation of models. Chapters 25 and 35 return to these safety implications.
16.9

Scaling laws are power laws, and power laws have a sobering property: improvements shrink as you climb. Each halving of the loss-above-irreducible multiplies the required compute by a constant factor of 2^(1/α), which for α ≈ 0.05 is about a million. And there are hard limits (finite data, finite compute, the irreducible entropy floor) that bound how far pure scaling can go.
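The cost of each improvement follows directly from the power law: if the reducible loss is (C_c/C)^α, halving it requires multiplying compute by 2^(1/α). A minimal sketch, with exponents chosen as illustrative values in the range quoted earlier in the chapter:

```python
def compute_multiplier_to_halve(alpha):
    """Factor by which compute must grow to halve the reducible loss (C_c/C)^alpha."""
    return 2.0 ** (1.0 / alpha)

for alpha in [0.05, 0.075, 0.10]:
    factor = compute_multiplier_to_halve(alpha)
    print(f"alpha = {alpha}: compute must grow by ~{factor:,.0f}x")
```

The multiplier is the same for every successive halving, so the reducible loss never stops falling, but each further halving costs the same enormous factor again.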

The Data Wall

The most discussed limit is data. Chinchilla-optimal training of ever-larger models demands ever more high-quality tokens — but the supply of high-quality human text is finite. Estimates (Villalobos et al., 2022) suggest the stock of high-quality public text could be exhausted by training runs in the mid-to-late 2020s. This 'data wall' is driving intense interest in synthetic data, multi-epoch training, and data efficiency.

Limit | Nature | Response
Irreducible loss E | Entropy of language itself | Cannot beat; aim to approach it
Data wall | Finite high-quality text | Synthetic data, multi-epoch, quality
Compute cost | Each halving of reducible loss multiplies compute by 2^(1/α) | Efficiency, better algorithms
Diminishing returns | Power-law flattening | New capabilities beyond loss
Inference economics | Serving cost at scale | Smaller models, distillation, MoE

These limits are reshaping the field's direction. The frontier is shifting from pure scale toward better data (Chapter 17), test-time compute and reasoning (Chapter 24), sparse architectures like Mixture-of-Experts (Chapter 32), and post-training techniques (Part V) that extract more capability from a fixed base model. Scaling laws are not dead, but the era of 'just make it bigger' is giving way to a more multidimensional optimization.

History: The Bitter Lesson — and Its Limits
Rich Sutton's 'Bitter Lesson' (2019) argued that general methods leveraging computation ultimately beat hand-crafted approaches — a thesis scaling laws seemed to vindicate spectacularly. For years, scale won every bet.
Yet the data wall and inference economics suggest the next chapter is more nuanced: scale remains essential, but extracting maximum capability per FLOP — through data quality, post-training, and reasoning — is the new frontier. The Bitter Lesson endures, but 'compute' increasingly means compute spent cleverly, not just abundantly.
16.10

Scaling laws are not just descriptive science — they are a decision-making tool. Here is how a practitioner actually uses them, from a fixed budget to a concrete training plan.

The Decision Workflow

Question | How scaling laws answer it
How big a model can I afford? | From budget C and target tokens D: N = C/(6D)
How much data do I need? | Chinchilla: D ≈ 20N for compute-optimal training
What loss will I get? | Fit a small-scale ladder, extrapolate L(C)
Should I train longer or bigger? | Compute-optimal split: grow both as √C
Is over-training worth it? | Yes if inference volume is high (cheaper serving)
Will it be good enough? | Predict loss; map loss to downstream metrics
Python: A complete scaling-law planning tool
import numpy as np

def plan_training_run(budget_usd, gpu_flops_per_sec=156e12,
                      gpu_cost_per_hour=1.0, tokens_per_param=20):
    """From a dollar budget, derive a compute-optimal training plan."""
    # 1. Budget -> compute
    gpu_hours = budget_usd / gpu_cost_per_hour
    C = gpu_hours * 3600 * gpu_flops_per_sec   # total FLOPs

    # 2. Compute-optimal N, D with C = 6ND and D = 20N
    #    => C = 6 * N * 20N = 120 N^2  =>  N = sqrt(C/120)
    N = np.sqrt(C / (6 * tokens_per_param))
    D = tokens_per_param * N

    # 3. Predicted loss (Chinchilla parametric)
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    L = E + A/N**alpha + B/D**beta

    print(f"Budget: ${budget_usd:,.0f}")
    print(f"Compute: {C:.1e} FLOPs")
    print(f"Optimal model: {N/1e9:.1f}B params")
    print(f"Optimal data:  {D/1e9:.0f}B tokens")
    print(f"Predicted loss: {L:.3f}  (ppl {np.exp(L):.1f})")

plan_training_run(100_000)   # a $100k budget
# Approximate output under the default (optimistic) assumptions of
# $1/GPU-hour and 156 TFLOP/s sustained per GPU:
# Budget: $100,000
# Compute: 5.6e+22 FLOPs
# Optimal model: ~21.6B params
# Optimal data:  ~433B tokens
# Predicted loss: ~2.04  (ppl ~7.7)
16.11

Scaling Laws Quick-Reference

Concept | Formula / value | Use
Training FLOPs | C ≈ 6 N D | Budget any training run
Loss vs scale | L = E + A/N^α + B/D^β | Chinchilla parametric law
Irreducible loss | E ≈ 1.69 nats | Floor; entropy of language
Compute-optimal ratio | D ≈ 20 N | ~20 tokens per parameter
Optimal scaling | N, D ∝ √C | Grow both with compute
Forecasting | Fit ladder, extrapolate L(C) | Predict before training
Over-training | D ≫ 20N | Cheaper inference, pay once

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–18 require code.

Exercise 1: Pen & Paper
Derive the 6ND rule. Account for 2 FLOPs/param in the forward pass and 4 in the backward pass, and explain each.
Exercise 2: Pen & Paper
A model has 13B parameters and is trained on 2T tokens. Compute the training FLOPs. How many A100-days at 50% utilization (156 TFLOP/s effective)?
Exercise 3: Derive
Starting from the Chinchilla loss L = E + A/N^α + B/D^β and the constraint C = 6ND, derive N_opt ∝ C^(β/(α+β)) and D_opt ∝ C^(α/(α+β)). Show that both reduce to √C when α = β.
Exercise 4: Pen & Paper
Show that when α ≈ β the compute-optimal token-to-parameter ratio D/N is approximately constant (independent of compute budget). Why does this give the ~20:1 rule?
Exercise 5: Pen & Paper
GPT-3 used 175B params and 300B tokens. Compute its tokens/param ratio. By Chinchilla, was it over- or under-trained, and by how much?
Exercise 6: Pen & Paper
Explain why Kaplan and Chinchilla reached different allocation conclusions. What methodological difference caused the discrepancy?
Exercise 7: Pen & Paper
Why do LLaMA models train far past the Chinchilla-optimal ratio? Frame your answer in terms of total lifetime compute (training + inference).
Exercise 8: Pen & Paper
A power law gives L(C) = 1.7 + (1e15/C)^0.05. Compute the loss at C = 1e21, 1e23, 1e25. How much compute to halve the loss-above-irreducible?
Exercise 9: Pen & Paper
Explain the emergent-abilities debate. Give a concrete example of how a harsh metric can make a smoothly-improving model look like it has a sudden capability jump.
Exercise 10: Pen & Paper
Describe the data wall. Why does Chinchilla-optimal scaling make it more pressing, and what are three responses to it?
Exercise 11: Code
Implement the 6ND FLOPs calculator. Verify it reproduces the reported training compute of GPT-3, Chinchilla, and LLaMA-2 from their public N and D.
Exercise 12: Code
Fit a power law L(C) = E + (Cc/C)^alpha to a synthetic ladder of (compute, loss) points. Plot the fit on log-log axes and report the fitted exponent.
Exercise 13: Code Lab
Implement compute-optimal allocation: given a compute budget, solve for the N and D that minimize the Chinchilla loss. Report how the token-to-parameter ratio behaves across budgets and compare it with the ~20:1 rule.
Exercise 14: Code
Reproduce the Chinchilla table: for compute budgets from 1e19 to 1e25, compute optimal N, D, and predicted loss. Plot N and D vs compute on log-log axes.
Exercise 15: Code
Build the planning tool from Section 16.10. For budgets of $10k, $100k, $1M, and $10M, report the optimal model size, data, and predicted perplexity.
Exercise 16: Code Lab
Train your own mini scaling ladder: train 4-5 small Transformers at increasing sizes on a fixed corpus, fit a scaling law to their losses, and extrapolate. Compare your prediction for the next size up to the actual trained result.
Exercise 17: Code
Demonstrate the emergence-as-artifact effect: simulate a model whose per-token accuracy improves smoothly, then plot both a smooth metric and an exact-match metric over a multi-step task. Show the exact-match metric looks discontinuous.
Exercise 18: Code (Challenge)
Compare training-optimal vs inference-aware optimal sizing. For a model that will serve a given number of inference tokens, compute the model size that minimizes TOTAL lifetime compute (training + inference), and show it is smaller than Chinchilla-optimal. Quantify how the optimal size shrinks as inference volume grows.

Further reading: “Scaling Laws for Neural Language Models” (Kaplan et al., 2020) — the founding paper. “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022) — the Chinchilla paper. “Emergent Abilities of Large Language Models” (Wei et al., 2022) and “Are Emergent Abilities a Mirage?” (Schaeffer et al., 2023) for the emergence debate. “Will we run out of data?” (Villalobos et al., 2022) on the data wall. Sutton's “The Bitter Lesson” (2019) for the philosophical backdrop.

Part III Complete: Deep Learning & the Transformer

Ch. 10 | Neural Network Fundamentals | perceptrons to MLPs, activations, init, normalization, dropout: the components.
Ch. 11 | Backpropagation in Depth | computational graphs and a from-scratch autograd engine: how gradients flow.
Ch. 12 | Attention Mechanisms | scaled dot-product, self-attention, multi-head: the Transformer's core operation.
Ch. 13 | The Transformer Architecture | positional encodings, residual stream, the full model built from scratch.
Ch. 14 | Tokenization | BPE and friends: how text becomes the tokens the model consumes.
Ch. 15 | Training Transformers | warmup, AdamW, clipping, mixed precision: the recipe for a stable run.
Ch. 16 | Scaling Laws | Chinchilla, the 6ND rule, compute-optimal allocation: sizing as a science.

You have built the Transformer from first principles, learned to train it, and learned the scaling laws that tell you how big to build. But everything so far has lived on a single machine, on modest data. Part IV — Pretraining LLMs — confronts the realities of frontier-scale training: where the trillions of tokens come from and how they are curated (Chapter 17), how training is distributed across thousands of GPUs (Chapter 18), the architecture variants that make large models efficient (Chapter 19), the techniques that squeeze more out of every FLOP (Chapter 20), and how progress is measured during a months-long run (Chapter 21). The clean training loop of Chapter 15 becomes a distributed, fault-tolerant, data-hungry industrial process — and you now have the foundation to understand every part of it.
