Part IV: Pretraining at Scale

Chapter 17

Data Collection & Curation

Common Crawl, deduplication, quality filtering, toxic-content removal, and data-mixing strategies — the unglamorous pipeline that determines what a model knows and how well it works.

18 Exercises

Learning Objectives

1.	Understand where pretraining data comes from and the scale required by scaling laws.
2.	Process Common Crawl: WARC/WET formats, text extraction, and language identification.
3.	Implement exact and fuzzy (MinHash/LSH) deduplication and explain why dedup matters so much.
4.	Build quality filters: heuristic rules and model-based classifiers.
5.	Detect and remove toxic, private, and benchmark-contaminating content.
6.	Design a data mixture across domains and reason about mixing weights and epochs.
7.	Understand the data wall and the role of synthetic data.
8.	Appreciate why data curation is the highest-leverage, least-glamorous part of LLM training.

Data Is the Model

Part III built and trained the Transformer. But a model is only as good as the data it learns from. The architecture has been remarkably stable since 2017; the dramatic capability gains since then have come overwhelmingly from more and better data, plus scale. Yet data curation receives a fraction of the attention that architecture does. This chapter corrects that imbalance: the pipeline that turns the raw web into clean training tokens is the highest-leverage, least-glamorous work in building an LLM.

The Scale Required

Chapter 16's Chinchilla recipe demands roughly 20 tokens per parameter, and over-training for inference efficiency pushes this far higher. A modern 70B model trains on 15 trillion tokens: at ~0.75 words per token that is roughly 11 trillion words, the equivalent of more than a hundred million 100,000-word books. There is no curated dataset of that size; it must be assembled and cleaned from the open web, and the cleaning is where most of the value is created.

Model or dataset	Training tokens	Primary sources
GPT-3 (2020)	300B	Common Crawl, WebText2, books, Wikipedia
The Pile (2020)	~400B	22 curated sources (academic, code, web)
LLaMA-1 (2023)	1.4T	CommonCrawl, C4, GitHub, Wikipedia, books, arXiv
RefinedWeb (2023)	5T	Common Crawl only, heavily filtered
LLaMA-3 (2024)	15T	Web + code + multilingual, heavily curated
FineWeb (2024)	15T	Common Crawl, open recipe, deduplicated
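
The scale figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the ~20 tokens-per-parameter rule of thumb from Chapter 16, ~0.75 words per token, and a 100,000-word book; the function names are illustrative, not from any library.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param

def books_equivalent(n_tokens, words_per_token=0.75, words_per_book=100_000):
    """Convert a token count into a rough number of book-length texts."""
    return n_tokens * words_per_token / words_per_book

n_params = 70e9        # 70B-parameter model
actual_tokens = 15e12  # 15T training tokens

optimal = chinchilla_tokens(n_params)
print(f"Chinchilla-optimal budget: {optimal / 1e12:.1f}T tokens")
print(f"Actual tokens per parameter: {actual_tokens / n_params:.0f}")
print(f"Book equivalents: {books_equivalent(actual_tokens) / 1e6:.0f} million")
```

Under these assumptions the compute-optimal budget is about 1.4T tokens, so a 15T-token run is over-trained by roughly a factor of ten, at a bit over 200 tokens per parameter.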

✧

Data Note: The RefinedWeb Result

RefinedWeb (Penedo et al., 2023) made a striking claim: a model trained purely on heavily-filtered Common Crawl could match or beat models trained on curated mixtures including books and academic text. The key was aggressive filtering and deduplication, not exotic data sources.

This reframed the field's understanding: the web contains enough high-quality text; the challenge is FINDING it. Curation — separating the signal from the vast noise — became recognized as the central data problem, more important than acquiring rare curated sources.

The Curation Pipeline at a Glance

Turning the raw web into training tokens is a multi-stage pipeline. Each stage discards a large fraction of the input; the survivors of all stages become training data. Here is the canonical flow:

Pipeline Flow: The data curation pipeline

1	Acquire	Download Common Crawl WARC/WET archives (petabytes of raw web)
2	Extract	Pull readable text from HTML; strip boilerplate, nav, ads
3	Language ID	Detect and route by language; filter to target languages
4	Quality filter	Heuristic rules + model classifiers remove low-quality text
5	Deduplicate	Remove exact and near-duplicate documents (often 50%+ of data)
6	Safety filter	Remove toxic content, PII, and benchmark contamination
7	Mix & shuffle	Weight domains, set epoch counts, shuffle into training shards

✧

Each Stage Discards Most of Its Input

The curation pipeline is brutally subtractive. Starting from petabytes of raw Common Crawl, text extraction keeps maybe 10–20%, quality filtering keeps a fraction of that, and deduplication removes much of the rest. FineWeb's pipeline distilled ~15T clean tokens from far more raw input: a single Common Crawl snapshot alone holds roughly 400TB of uncompressed page content.

The art is in WHAT you keep. Every filter trades quantity for quality, and the scaling laws make this trade worthwhile: better data shifts the entire loss curve down (Chapter 16), so cleaner tokens are worth more than raw token count alone.

Common Crawl and Web Data

Common Crawl is a non-profit that has crawled the web monthly since 2008, releasing the archives freely. It is the single largest source of pretraining data, and nearly every major LLM is built on it. Understanding its formats and quirks is the practical starting point for data curation.

WARC, WAT, and WET

Format	Contains	Use
WARC	Raw HTTP responses (full HTML + headers)	Custom extraction, maximum fidelity
WAT	Metadata (links, headers, structure)	Link graphs, metadata analysis
WET	Pre-extracted plain text	Quick start, but lower-quality extraction

Serious pipelines prefer WARC files and run their own text extraction, because the pre-extracted WET text is noisy — it includes navigation, boilerplate, and poorly-formatted content. High-quality extractors like trafilatura or resiliparse pull the main article text and discard the chrome, which substantially improves the resulting data quality.

Python•Extracting text from Common Crawl WARC
# pip install warcio trafilatura
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path):
    """Yield clean main-text from each HTML record in a WARC file."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
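            if record.rec_type != 'response': continue
            # Keep only HTML payloads; crawls also contain PDFs, images, JSON
            headers = record.http_headers
            ctype = headers.get_header('Content-Type', '') if headers else ''
            if 'html' not in ctype.lower(): continue
            html = record.content_stream().read()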
            # trafilatura strips nav/ads/boilerplate, keeps main text
            text = trafilatura.extract(html, include_comments=False,
                                       include_tables=False)
            if text and len(text) > 200:  # skip tiny fragments
                yield text

# A single WARC file holds ~30-50k web pages.
# A monthly Common Crawl snapshot has ~90k WARC files (~400TB uncompressed).
# This is why the pipeline runs on large distributed clusters (Spark/Ray).

⚠️

Pitfall: WET Text Is a Trap for Beginners

It is tempting to use Common Crawl's pre-extracted WET files — they skip the HTML-parsing step. But WET extraction is crude: it retains menus, footers, cookie banners, and SEO spam. Models trained on WET data are measurably worse than those trained on text extracted with a quality extractor from WARC.

The RefinedWeb and FineWeb teams both found that extraction quality was one of the single biggest levers on final model quality. Invest in good extraction; do not start from WET.

Language Identification and Routing

The web is multilingual, and most pipelines need to route documents by language: to filter to target languages, to apply language-specific quality rules, and to control the multilingual mixture. Language identification is a fast, early-stage filter that also doubles as a quality signal: text that no language detector can confidently classify is often garbage.

Python•Language identification with fastText
# Meta's fastText lid.176 model: 176 languages, very fast
import fasttext

model = fasttext.load_model('lid.176.bin')

def detect_language(text):
    """Return (lang_code, confidence) for a document."""
    text = text.replace('\n', ' ')[:1000]  # sample, strip newlines
    labels, probs = model.predict(text, k=1)
    lang = labels[0].replace('__label__', '')
    return lang, float(probs[0])

# Filter: keep English documents with high confidence
def keep_document(text, target='en', min_conf=0.65):
    lang, conf = detect_language(text)
    return lang == target and conf >= min_conf

# Low confidence often signals: code-switching, gibberish, or
# machine-generated spam -- so the confidence threshold is also a
# quality filter, not just a language router.
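
Routing, as opposed to filtering to a single language, can be sketched as bucketing documents by detected language with a per-language confidence floor. This is a minimal sketch under stated assumptions: `route_by_language`, the threshold table, and the `toy_detect` stand-in are all hypothetical. In a real pipeline the `detect` argument would be a function like `detect_language` above, and the thresholds would be tuned per language.

```python
from collections import defaultdict

def route_by_language(docs, detect, min_conf):
    """Group documents by detected language.

    detect(doc) must return (lang_code, confidence). Documents in a
    non-target language, or below that language's confidence floor,
    land in the 'rejected' bucket.
    """
    buckets = defaultdict(list)
    for doc in docs:
        lang, conf = detect(doc)
        if lang in min_conf and conf >= min_conf[lang]:
            buckets[lang].append(doc)
        else:
            buckets['rejected'].append(doc)
    return dict(buckets)

# Toy stand-in detector for demonstration only (NOT a real language ID model):
# confidence = fraction of words that are known function words.
TOY_MARKERS = {'en': {'the', 'and', 'of'}, 'de': {'der', 'und', 'die'}}

def toy_detect(text):
    words = text.lower().split()
    if not words:
        return 'unknown', 0.0
    scores = {lang: sum(w in marks for w in words) / len(words)
              for lang, marks in TOY_MARKERS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

docs = ['the cat and the dog', 'der Hund und die Katze', 'zxcv qwer asdf']
routed = route_by_language(docs, toy_detect, min_conf={'en': 0.3, 'de': 0.3})
print(routed)
```

Keeping the rejected bucket, rather than silently dropping it, makes it easy to audit what the confidence threshold is actually removing.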

✧

Data Note: The Multilingual Trade-off

How much non-English data to include is a real design decision. More multilingual data improves the model's coverage of other languages but, at a fixed token budget, means less English — potentially weakening English performance. Most English-centric models keep a small multilingual fraction; truly multilingual models like BLOOM deliberately balance dozens of languages.

Language ID also interacts with tokenization fairness (Chapter 14): under-represented languages both fragment into more tokens AND appear in less training data, a double disadvantage that careful data mixing can partially address.

Deduplication

The web is extraordinarily repetitive: the same articles are syndicated across thousands of sites, boilerplate recurs everywhere, and popular text is quoted endlessly. Deduplication (removing exact and near-duplicate documents) is one of the most impactful curation steps. It typically removes 30–70% of the data, and the resulting model is BETTER, not just cheaper to train.

Why Deduplication Helps So Much

•Wasted compute: training repeatedly on the same text spends FLOPs without new learning.

•Memorization: duplicated data is memorized verbatim, increasing privacy risk and regurgitation.

•Train/test contamination: duplicates of benchmark data inflate evaluation scores misleadingly.

•Distribution skew: over-represented text (e.g. licenses, spam) distorts the learned distribution.

Lee et al. (2022) showed that deduplicating training data improves model quality and reduces memorization, with no downside. It is one of the rare 'free lunch' interventions: less data, less compute, AND a better model.

Exact vs Fuzzy Deduplication

Exact deduplication	Fuzzy (near-duplicate) deduplication
Removes byte-identical documents	Removes documents that are mostly similar
Hash each doc, drop documents whose hash was already seen	MinHash + LSH on shingles
Fast, simple, catches obvious copies	Catches near-copies, edits, reformatting
Misses minor variations	Catches templated/syndicated content
O(N) with a hash set	O(N) with LSH bucketing
First pass	Second, more powerful pass
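
The exact-deduplication column can be sketched in a few lines. This is a minimal single-process sketch, not a production recipe. The light normalization step (Unicode NFKC, lowercasing, whitespace collapsing) is an assumption added here so that trivially reformatted copies hash identically; strict byte-identical dedup would hash the raw text instead.

```python
import hashlib
import re
import unicodedata

def normalize(text):
    """Canonicalize text so trivially different copies hash the same."""
    text = unicodedata.normalize('NFKC', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

def exact_dedup(docs):
    """Yield each document the first time its normalized hash is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode('utf-8')).digest()
        if digest in seen:
            continue          # an identical document was already kept
        seen.add(digest)
        yield doc

docs = ['Hello  World', 'hello world\n', 'Goodbye']
unique_docs = list(exact_dedup(docs))
print(unique_docs)  # ['Hello  World', 'Goodbye']
```

Storing fixed-size digests rather than full documents keeps the memory cost per document small, which is what makes the single O(N) pass practical.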

MinHash + LSH for Near-Duplicates

Fuzzy deduplication needs to find documents that are similar but not identical. The standard approach: represent each document as a set of shingles (overlapping n-grams), estimate the Jaccard similarity between documents using MinHash signatures, and use Locality-Sensitive Hashing (LSH) to find candidate pairs efficiently without comparing all N² pairs.

text•MinHash + LSH deduplication (Pseudocode)
# 1. Shingle each document
shingles(doc) = { all overlapping k-word sequences }

# 2. MinHash signature estimates Jaccard similarity
for each of n hash functions h_i:
    sig[i] = min over shingles of h_i(shingle)
# Pr[sig_A[i] == sig_B[i]] = Jaccard(A, B)

# 3. LSH: band the signature, hash bands into buckets
split signature into b bands of r rows each
documents sharing a band-bucket are CANDIDATE duplicates

# 4. Verify candidates, drop near-duplicates above threshold
for each candidate pair: if Jaccard > 0.8, remove one

Python•MinHash deduplication from scratch
import zlib
import numpy as np

def shingles(text, k=5):
    """Set of k-word shingles (overlapping word n-grams)."""
    words = text.split()
    return {' '.join(words[i:i+k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, n_hashes=128, seed=0):
    """Compute an n_hashes-dim MinHash signature."""
    # Same seed for every document, so all signatures share one hash family
    rng = np.random.default_rng(seed)
    # Random hash coefficients (a*x + b mod prime)
    P = 2**31 - 1
    a = rng.integers(1, P, n_hashes)
    b = rng.integers(0, P, n_hashes)
    sig = np.full(n_hashes, P, dtype=np.int64)  # P exceeds every hash value
    for sh in shingle_set:
        # crc32 is stable across processes; built-in hash() of str is salted per run
        x = zlib.crc32(sh.encode('utf-8')) % P
        sig = np.minimum(sig, (a * x + b) % P)  # vectorized min
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries ≈ Jaccard similarity."""
    return (sig_a == sig_b).mean()

# Demo: two near-duplicate documents
doc1 = 'the quick brown fox jumps over the lazy dog every morning'
doc2 = 'the quick brown fox jumps over the lazy dog each morning'
doc3 = 'machine learning models require enormous amounts of training data'

s1, s2, s3 = minhash(shingles(doc1)), minhash(shingles(doc2)), minhash(shingles(doc3))
print(f"sim(near-dup):  {estimate_jaccard(s1, s2):.2f}")  # high: true Jaccard is 5/9 ≈ 0.56
print(f"sim(unrelated): {estimate_jaccard(s1, s3):.2f}")  # low: no shared shingles, so ≈ 0.00
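
Step 3 of the pseudocode, the LSH banding that turns signatures into candidate pairs, can be sketched as follows. This is a minimal in-memory sketch under stated assumptions: the function names are illustrative, the signatures are hand-made so the result is easy to verify by eye, and `candidate_probability` uses the standard banding formula 1-(1-s^r)^b. In practice the inputs would be MinHash signatures like those computed above.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """Return index pairs whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        assert len(sig) == bands * rows, "signature length must equal bands * rows"
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].append(doc_id)   # same band + same values -> same bucket
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))          # every co-bucketed pair is a candidate
    return pairs

def candidate_probability(s, bands, rows):
    """Chance that two documents with Jaccard similarity s become candidates."""
    return 1 - (1 - s ** rows) ** bands

# Hand-made 8-value signatures: 4 bands of 2 rows each
sigs = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 9, 5, 6, 0, 8],   # matches doc 0 on bands 0 and 2
    [9, 9, 9, 9, 9, 9, 9, 9],   # matches nothing
]
print(lsh_candidates(sigs, bands=4, rows=2))          # {(0, 1)}
print(round(candidate_probability(0.8, 20, 5), 4))
print(round(candidate_probability(0.2, 20, 5), 4))
```

Candidates still need the verification step from the pseudocode: banding only narrows the search, and a pair that shares one band can have a low overall similarity.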

✧

Data Note: Deduplication Is Often the Single Biggest Win

Across the public data recipes (C4, The Pile, RefinedWeb, FineWeb), deduplication consistently emerges as one of the most impactful steps. FineWeb's ablations showed that MinHash deduplication applied within each Common Crawl snapshot was essential to matching state-of-the-art data quality.

There is nuance: TOO aggressive global deduplication can hurt. When FineWeb deduplicated across all snapshots at once, downstream results did not improve, and blanket removal can also discard genuinely useful repeated content (common phrases, important facts). The current best practice favors deduplication within snapshots and careful, threshold-tuned fuzzy dedup rather than blanket removal of everything that repeats.

Quality Filtering

Most of the web is not useful training data: spam, auto-generated text, keyword-stuffed SEO pages, broken markup, and incoherent fragments. Quality filtering removes this noise. Two complementary approaches dominate: fast heuristic rules and slower model-based classifiers.

Heuristic Filters

Heuristic filters are cheap rules that catch obvious garbage. Gopher (Rae et al., 2021) published an influential set of such rules, now widely adopted. They are fast enough to run on every document and catch a large fraction of low-quality content.

Filter	Removes
Document length bounds	Too-short fragments and absurdly long dumps
Mean word length 3–10 chars	Gibberish, encoding errors, code masquerading as text
Symbol-to-word ratio < 0.1	Math dumps, markup soup, ASCII art
Stop-word presence	Keyword lists and SEO spam lack natural function words
Fraction of lines ending in an ellipsis	Boilerplate, truncated listings
Fraction of lines starting with a bullet	Navigation menus, link farms
Duplicate line/paragraph fraction	Templated or repetitive spam

Python•Heuristic quality filters (Gopher-style)
import re
from collections import Counter

STOP_WORDS = {'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'it'}

def passes_quality(text):
    """Apply Gopher-style heuristic quality filters."""
    words = text.split()
    n = len(words)
    if not (50 <= n <= 100_000): return False  # length

    # Mean word length in [3, 10]
    mean_len = sum(len(w) for w in words) / n
    if not (3 <= mean_len <= 10): return False

    # Must contain common stop words (natural language signal)
    lower = {w.lower() for w in words}
    if len(STOP_WORDS & lower) < 2: return False

    # Symbol-to-word ratio (hash/ellipsis)
    n_symbols = text.count('#') + text.count('...')
    if n_symbols / n > 0.1: return False

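    # Duplicate-line fraction (skip blank lines: paragraph breaks would
    # otherwise all count as duplicates of each other)
    lines = [ln.strip() for ln in text.split('\n') if ln.strip()]
    if lines:
        dup_frac = 1 - len(set(lines)) / len(lines)
        if dup_frac > 0.3: return False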

    return True

Model-Based Quality Classifiers

Heuristics catch obvious garbage but cannot judge subtle quality. Model-based filters train a classifier to distinguish 'high-quality' from 'low-quality' text. The classic approach (used by GPT-3 and LLaMA): train a classifier to recognize text resembling a high-quality reference corpus (e.g. Wikipedia, books, or curated web), and keep documents the classifier rates highly.

Python•Model-based quality classifier
import fasttext

# Train: positive = curated text (Wikipedia, books), negative = random web
# Label format for fastText: '__label__high <text>' / '__label__low <text>'
model = fasttext.train_supervised('quality_train.txt', epoch=5, wordNgrams=2)

def quality_score(text):
    """Probability the document is high-quality (0-1)."""
    text = text.replace('\n', ' ')
    labels, probs = model.predict(text, k=2)
    scores = dict(zip(labels, probs))
    return scores.get('__label__high', 0.0)

# Pareto-style filtering: keep documents above a quality threshold,
# OR sample stochastically so some borderline documents survive --
# hard thresholds can remove useful diversity (GPT-3 used a Pareto sampler).
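
The stochastic alternative to a hard threshold can be sketched as below. This is a minimal sketch of the Pareto-style rule described for GPT-3 (keep a document when a Pareto draw exceeds 1 minus its score, with shape α = 9). The function names and the use of Python's standard `random` module are choices made here for illustration, and `score` stands for the output of a classifier such as `quality_score` above.

```python
import random

def keep_probability(score, alpha=9.0):
    """Closed-form chance that a document with this score survives."""
    return (2.0 - score) ** (-alpha)

def pareto_keep(score, alpha=9.0, rng=random):
    """Keep the document if a shifted-Pareto draw exceeds 1 - score."""
    # random.paretovariate returns values >= 1; subtracting 1 gives the
    # shifted (Lomax) form that numpy.random.pareto produces.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score

rng = random.Random(0)
for score in (0.1, 0.5, 0.9):
    kept = sum(pareto_keep(score, rng=rng) for _ in range(20_000))
    print(f"score={score:.1f}  empirical={kept / 20_000:.3f}  "
          f"expected={keep_probability(score):.3f}")
```

High-scoring documents are almost always kept, while low-scoring ones survive with a small but non-zero probability, which is how the sampler preserves some diversity that a hard cutoff would remove.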

⚠️

Quality Filters Encode Bias

A classifier trained to recognize 'Wikipedia-like' text will favour the dialects, topics, and registers of Wikipedia's editors — systematically down-weighting informal text, minority dialects, and under-represented communities. Dodge et al. (2021) documented how C4's filters disproportionately removed text associated with minority voices.

Quality filtering is never neutral: it imposes a definition of 'quality' that reflects the reference corpus and its biases. The choice of what counts as high-quality is an editorial decision with real consequences for whose language the model learns to model well.

Toxic Content, PII, and Safety Filtering

Beyond quality, some content should be removed for safety and legal reasons: toxic and abusive text, personally identifiable information (PII), and illegal material. This filtering reduces harmful model behaviour and memorization of private data, though it involves genuine trade-offs.

Categories of Safety Filtering

Category	Examples	Method
Toxicity	Hate speech, harassment, abuse	Classifier (e.g. trained on toxicity labels)
PII	Emails, phone numbers, SSNs, addresses	Regex + NER detection and redaction
CSAM / illegal	Illegal material	Hash matching, strict removal
Benchmark data	Test sets for evaluations	N-gram overlap detection (decontamination)
Copyright-flagged	Some explicitly protected works	Source exclusion lists

PII Detection and Redaction

Python•Basic PII redaction
import re

PII_PATTERNS = {
    'email':  r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'phone':  r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'ssn':    r'\b\d{3}-\d{2}-\d{4}\b',
    'ip':     r'\b(?:\d{1,3}\.){3}\d{1,3}\b',
}

def redact_pii(text):
    """Replace detected PII with type placeholders."""
    for kind, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f'<{kind.upper()}>', text)
    return text

print(redact_pii('Contact me at jane@example.com or 555-123-4567'))
# Contact me at <EMAIL> or <PHONE>

# Production systems use NER models (e.g. Presidio) for names and
# addresses that regex cannot reliably catch. Redaction reduces
# memorization of private data (Carlini et al., 2021 showed LLMs
# memorize and can regurgitate training-set PII verbatim).

Benchmark Decontamination

A subtle but critical safety filter is decontamination: removing copies of evaluation benchmarks from the training data. If test questions leak into training, evaluation scores are inflated and meaningless. Pipelines detect benchmark text via n-gram overlap and remove matching documents — but contamination is pervasive and hard to fully eliminate, which is why Chapter 21 treats evaluation contamination as a first-class concern.
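
N-gram overlap detection can be sketched as a set lookup. This is a minimal sketch under stated assumptions: the function names are illustrative, tokenization is plain lowercased whitespace splitting, and the window length is a parameter (a 13-word window is a commonly cited choice; the demo uses 5 only because the example strings are short).

```python
def ngrams(text, n=13):
    """Set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_texts, n=13):
    """Union of all n-grams that appear in any benchmark item."""
    index = set()
    for item in benchmark_texts:
        index |= ngrams(item, n)
    return index

def is_contaminated(doc, index, n=13):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(index)

benchmark = ['What is the capital of France? The capital of France is Paris.']
index = build_benchmark_index(benchmark, n=5)

leaked = 'Quiz time: what is the capital of france? the capital of france is paris. Easy!'
clean = 'Paris is a large city on the Seine with many museums'
print(is_contaminated(leaked, index, n=5))  # True
print(is_contaminated(clean, index, n=5))   # False
```

Exact n-gram matching misses paraphrased or translated copies of a benchmark, which is one reason contamination is hard to eliminate completely.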

⚠️

Filtering Trade-offs Are Real

Aggressive toxicity filtering can backfire. Removing all text mentioning slurs also removes counter-speech, academic discussion of hate, and content from communities that reclaim such terms — and a model that never saw toxic language may be worse at DETECTING and refusing it. Welbl et al. (2021) found that naive toxicity filtering can amplify bias against marginalized groups.

The current understanding: heavy filtering of the worst content (CSAM, clear abuse) is essential, but moderate exposure to difficult content, handled at the alignment stage (Part V) rather than scrubbed from pretraining, often produces safer models. Filtering and alignment are complementary, not interchangeable.

Data Mixing and Domain Weighting

Pretraining data is not monolithic: it spans web text, code, books, academic papers, Wikipedia, and more. The data mixture (how much of each domain, and how many times each is repeated) is a powerful lever on the model's capabilities. More code tends to improve reasoning and coding; more books improve long-form coherence; more academic text improves technical knowledge.

Typical Domain Weights

Domain	Typical weight	What it provides
Filtered web (CC)	60–80%	Breadth, general knowledge, fluency
Code	5–20%	Reasoning, structure, tool use
Books	5–15%	Long-form coherence, narrative, depth
Academic (arXiv, etc.)	2–10%	Technical knowledge, formal reasoning
Wikipedia	1–5%	Factual grounding, structured knowledge
Q&A / forums	1–5%	Conversational, instructional patterns

Weights are usually expressed as sampling probabilities: when assembling a training batch, sample each domain with its weight. High-quality but small domains (Wikipedia, books) are often UP-weighted relative to their raw size — meaning the model sees them multiple times (multiple epochs) while seeing abundant web data once. This is where the epoch question becomes central.

How Many Epochs?

Should the model see each token once, or repeat the data? Muennighoff et al. (2023) studied repeating data under the data wall and found that up to ~4 epochs of repetition is nearly as good as fresh data, but returns diminish sharply beyond that. So small high-quality domains can be safely repeated a few times; the abundant web is typically seen once or twice.

Python•Implementing a weighted data mixture
import numpy as np

DOMAINS = {
    'web':   {'weight': 0.67, 'epochs': 1},
    'code':  {'weight': 0.15, 'epochs': 1},
    'books': {'weight': 0.10, 'epochs': 2},
    'arxiv': {'weight': 0.05, 'epochs': 2},
    'wiki':  {'weight': 0.03, 'epochs': 3},
}

def sample_domain(rng):
    """Sample a domain according to its mixing weight."""
    names   = list(DOMAINS)
    weights = np.array([DOMAINS[n]['weight'] for n in names])
    weights /= weights.sum()                  # normalize
    return rng.choice(names, p=weights)

# Each training batch draws domains by weight; small high-quality
# domains repeat across epochs while web data is seen ~once.
# The mixture is a key hyperparameter -- DoReMi (Xie et al., 2023)
# even learns optimal weights with a small proxy model.
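
The link between mixing weights, epochs, and how much unique data each domain must supply is simple arithmetic: a domain's share of the token budget divided by its epoch count. This is a minimal sketch using the same weights and epochs as the configuration above; the 1T-token budget and the function name are hypothetical.

```python
MIXTURE = {
    'web':   {'weight': 0.67, 'epochs': 1},
    'code':  {'weight': 0.15, 'epochs': 1},
    'books': {'weight': 0.10, 'epochs': 2},
    'arxiv': {'weight': 0.05, 'epochs': 2},
    'wiki':  {'weight': 0.03, 'epochs': 3},
}

def unique_tokens_needed(total_tokens, mixture):
    """Unique tokens per domain = (budget * weight) / epochs."""
    return {name: total_tokens * cfg['weight'] / cfg['epochs']
            for name, cfg in mixture.items()}

budget = 1e12  # hypothetical 1T-token training run
needed = unique_tokens_needed(budget, MIXTURE)
for name, n in needed.items():
    print(f"{name:>5}: {n / 1e9:6.1f}B unique tokens")
```

Read in the other direction, the same relation tells you how many epochs a weight implies: if a domain has fewer unique tokens than its budget share, it will be repeated.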

✧

Data Note: Learning the Mixture: DoReMi

Choosing mixture weights by hand is laborious and suboptimal. DoReMi (Xie et al., 2023) automates it: train a small proxy model, use group-distributionally-robust optimization to find weights that minimize worst-case loss across domains, then apply those weights to train the large model. It improved downstream accuracy while reaching target loss faster.

This reflects a broader trend: data decisions that were once hand-tuned heuristics are increasingly optimized with small-scale experiments — the same forecasting philosophy as scaling laws, applied to the data mixture.

Synthetic Data and the Data Wall

Chapter 16 introduced the data wall: the finite supply of high-quality human text, which Chinchilla-optimal scaling threatens to exhaust. One major response is synthetic data, meaning text generated by existing models, which can be produced in unlimited quantities and tailored for quality and diversity.

Forms of Synthetic Data

Approach	How
Textbook generation	Prompt a strong model to write clean, pedagogical text (Phi series)
Rephrasing	Rewrite web text into cleaner, more consistent form
Distillation	Train a small model on a larger model's outputs
Self-instruct	Generate instruction-response pairs for fine-tuning
Reasoning traces	Generate step-by-step solutions to augment reasoning data

The Phi model series (Gunasekar et al., 2023, 'Textbooks Are All You Need') demonstrated that a small model trained largely on synthetic 'textbook-quality' data could rival much larger models on reasoning and coding. This showed that data QUALITY and structure can substitute for raw scale — a direct lever against the data wall.

⚠️

Pitfall: Model Collapse: The Risk of Synthetic Data

Training on model-generated data has a documented failure mode: model collapse (Shumailov et al., 2024). When models are trained recursively on their own outputs, they progressively lose the tails of the distribution — rare events, diverse styles — converging toward bland, repetitive text. The variance of the data shrinks generation after generation.

The mitigation is to keep a substantial fraction of real human data in the mixture and to use synthetic data to AUGMENT rather than REPLACE it. Synthetic data is a powerful supplement, but a model trained purely on synthetic data risks degrading toward its own mean.
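
The shrinking-variance mechanism can be illustrated with a toy that has nothing to do with language: fit a Gaussian to a small sample, draw the next generation's data from the fit, and repeat. This is a minimal sketch under stated assumptions (a one-dimensional Gaussian "model", 10 samples per generation, hypothetical function name), intended as an analogy rather than a reproduction of the experiments in the cited paper.

```python
import random
import statistics

def simulate_collapse(generations=500, n_samples=10, seed=0):
    """Refit a Gaussian on its own samples, generation after generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0            # generation 0: the "real" distribution
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.mean(samples)      # the next "model" is fit only
        sigma = statistics.stdev(samples)  # to the previous model's output
        history.append(sigma)
    return history

history = simulate_collapse()
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 500: {history[-1]:.3g}")
```

Each refit loses a little of the spread through sampling noise, and because nothing reinjects the original distribution, the losses compound. Mixing fresh samples from the real distribution into every generation is the toy analogue of keeping human data in the mixture.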

▶

ML Connection: The Frontier Is Shifting to Data

As architecture stabilizes and the data wall looms, the competitive frontier of LLM development is shifting from scale toward data: better filtering, better mixtures, and high-quality synthetic data. Much of the gap between frontier models and open models is now a data gap, not an architecture gap.

This is why leading labs treat their data pipelines as core intellectual property and rarely disclose them in detail. The open FineWeb and Dolma datasets are valuable precisely because they make state-of-the-art curation recipes public.

Assembling the Full Pipeline

We now assemble the stages into a complete pipeline. In practice this runs as a distributed job (Spark, Ray, or custom MapReduce) over thousands of machines, because the data volume is far too large for a single node. The logical flow, however, is the sequence of filters we have built.

Python•Code Lab: a complete (single-node) curation pipeline
import hashlib

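def curate(raw_documents, target_lang='en', is_toxic=lambda doc: False):
    """Run the full curation pipeline over a stream of raw documents.

    Reuses detect_language, passes_quality, and redact_pii from earlier
    sections. is_toxic is a placeholder that keeps everything by default;
    pass in a real toxicity classifier (Section 17.6).
    """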
    seen_hashes = set()
    kept = []
    stats = {'raw': 0, 'lang': 0, 'quality': 0, 'dedup': 0, 'safe': 0}

    for doc in raw_documents:
        stats['raw'] += 1

        # 1. Language filter
        lang, conf = detect_language(doc)
        if lang != target_lang or conf < 0.65: continue
        stats['lang'] += 1

        # 2. Quality filter
        if not passes_quality(doc): continue
        stats['quality'] += 1

        # 3. Exact dedup (hash)
        h = hashlib.md5(doc.encode()).hexdigest()
        if h in seen_hashes: continue
        seen_hashes.add(h)
        stats['dedup'] += 1
        # (fuzzy dedup via MinHash/LSH would run as a separate pass)

        # 4. Safety: redact PII, drop toxic
        doc = redact_pii(doc)
        if is_toxic(doc): continue       # classifier from 17.6
        stats['safe'] += 1

        kept.append(doc)

    print("Funnel:", stats)
    return kept

# Typical funnel from raw Common Crawl (illustrative):
#   raw: 1,000,000  ->  lang: 400,000  ->  quality: 120,000
#   ->  dedup: 65,000  ->  safe: 60,000   (6% survive)

✧

Efficiency Note: This Runs Distributed at Scale

The single-node pipeline above is for understanding. Real curation processes hundreds of terabytes per Common Crawl snapshot and runs on distributed frameworks: text extraction and filtering parallelize trivially (each document is independent), while global deduplication requires a distributed shuffle to bring potential duplicates together.

Tools like Meta's CCNet, the FineWeb pipeline (built on datatrove), and Dolma's toolkit implement these stages at scale. The logical filters are exactly what you built here; the engineering challenge is running them over petabytes reliably and reproducibly.

Data Documentation and Ethics

A model inherits everything about its training data: its knowledge, its blind spots, its biases, and its legal exposure. Responsible data curation includes documenting what went in and grappling with the ethical and legal questions that web-scale data raises.

Datasheets and Documentation

Gebru et al. (2018) proposed datasheets for datasets: structured documentation of a dataset's composition, collection process, intended uses, and known limitations. For pretraining data this includes the sources, the filters applied, the languages covered, and the deduplication and decontamination steps. Documentation enables reproducibility, accountability, and informed downstream use.
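
One lightweight way to keep such documentation next to the data is a machine-readable record written by the pipeline itself. This is a minimal sketch: the `DatasetCard` class and every field name and value in it are hypothetical, chosen to mirror the items listed above, and are not a schema from the datasheets paper.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetCard:
    """Hypothetical minimal record of how a pretraining corpus was built."""
    name: str
    sources: list
    languages: list
    filters: list
    deduplication: str
    decontaminated_against: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

card = DatasetCard(
    name='toy-web-corpus',
    sources=['Common Crawl (WARC)'],
    languages=['en'],
    filters=['fastText language ID >= 0.65', 'Gopher-style heuristics'],
    deduplication='exact hash + MinHash/LSH, Jaccard > 0.8',
    decontaminated_against=['(list the evaluation sets checked)'],
    known_limitations=['English only', 'quality classifier favours formal text'],
)

print(json.dumps(asdict(card), indent=2))
```

Emitting the record from the same code that runs the filters keeps the documentation from drifting away from what the pipeline actually did.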

The Hard Questions

Issue	The tension
Copyright	Web text is copyrighted; fair-use status of training is legally unsettled
Consent	Authors did not consent to their text training commercial models
Attribution	Models reproduce content without crediting sources
Bias	The web over-represents some voices and under-represents others
Privacy	Personal data on the web ends up in model weights
Labor	Data filtering and annotation often rely on low-paid workers

⚠️

These Are Unsettled Questions

The legal and ethical status of web-scale training data is genuinely unresolved. Lawsuits over copyright and training data are ongoing; regulations are emerging; norms are contested. This book does not adjudicate these questions, but a competent practitioner must be AWARE of them — they shape what data you can use, what you must document, and what risks you assume.

At minimum: respect robots.txt and opt-out signals where applicable, exclude clearly protected sources when required, document your pipeline, and stay current with the evolving legal landscape. Data curation is not just a technical task — it is a domain with real legal and ethical stakes.
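
Checking robots.txt can be done with Python's standard library. This is a minimal sketch: the robots.txt content, the `ExampleTrainingBot` and `SomeCrawler` user-agent names, and the `allowed` helper are made up for illustration, and a real crawler would fetch each site's own file and also honour whatever other opt-out mechanisms apply.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: everyone is barred from /private/, and one
# hypothetical training crawler is barred from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleTrainingBot
Disallow: /
"""

def allowed(robots_txt, user_agent, url):
    """True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, 'SomeCrawler', 'https://example.com/blog/post'))         # True
print(allowed(ROBOTS_TXT, 'SomeCrawler', 'https://example.com/private/notes'))     # False
print(allowed(ROBOTS_TXT, 'ExampleTrainingBot', 'https://example.com/blog/post'))  # False
```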

Chapter Summary & Exercises

Curation Quick-Reference

Stage	Method	Typical effect
Extraction	trafilatura on WARC	Web HTML → clean text
Language ID	fastText lid.176	Route/filter by language
Quality	Heuristics + classifier	Remove spam/garbage
Exact dedup	Hashing	Drop identical copies
Fuzzy dedup	MinHash + LSH	Drop near-duplicates (30–70%)
Safety	Toxicity clf + PII redaction	Remove harmful/private content
Decontamination	N-gram overlap	Remove benchmark leakage
Mixing	Weighted sampling + epochs	Balance domains

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

✎

Exercise 1: Pen & Paper

A 70B model trains on 15T tokens. Using ~0.75 words/token, estimate the equivalent number of 100,000-word books. Comment on why no curated corpus is this large.

✎

Exercise 2: Pen & Paper

Explain why deduplication improves model quality rather than merely saving compute. Give three distinct mechanisms.

✎

Exercise 3: Pen & Paper

Derive why MinHash signatures estimate Jaccard similarity: show P(min hash of A = min hash of B) = |A∩B|/|A∪B|.

✎

Exercise 4: Pen & Paper

In LSH with b bands of r rows, the probability that two documents with Jaccard similarity s share at least one band bucket is 1-(1-s^r)^b. Sketch this S-curve for b=20, r=5 and identify the similarity threshold.

✎

Exercise 5: Pen & Paper

List five Gopher-style heuristic quality filters and explain what kind of low-quality content each targets.

✎

Exercise 6: Pen & Paper

Explain how a Wikipedia-based quality classifier can encode bias against minority dialects. Propose one mitigation.

✎

Exercise 7: Pen & Paper

Why is benchmark decontamination necessary? Describe how n-gram overlap detection works and one reason it can miss contamination.

✎

Exercise 8: Pen & Paper

A data mixture up-weights Wikipedia to 3 epochs but web to 1. Given 15T total tokens and the weights in Section 17.7, estimate the unique-token count from each domain.

✎

Exercise 9: Pen & Paper

Explain model collapse. Why does recursive training on synthetic data shrink the distribution's tails, and what mitigation preserves diversity?

✎

Exercise 10: Pen & Paper

Describe the trade-off in toxicity filtering: why can removing all toxic text make a model WORSE at handling toxicity? How does the alignment stage (Part V) change the calculus?

✎

Exercise 11: Code

Implement WARC text extraction with trafilatura (or parse a sample HTML file). Compare the extracted text to the raw HTML and to naive tag-stripping.

✎

Exercise 12: Code

Implement the Gopher-style heuristic quality filter from Section 17.5. Run it on a mix of clean text and synthetic spam; report precision and recall.

✎

Exercise 13: Code

Implement MinHash signatures and Jaccard estimation from scratch. Verify your estimate against exact Jaccard on 100 document pairs of varying similarity.

✎

Exercise 14: Code Lab

Implement LSH bucketing on top of your MinHash. On a corpus with planted near-duplicates, measure how many duplicate pairs LSH recovers vs an exhaustive O(N²) comparison.

✎

Exercise 15: Code

Implement PII redaction with regex for emails, phones, and IPs. Then add a simple NER-based name detector (e.g. spaCy) and compare coverage.

✎

Exercise 16: Code

Implement a weighted domain sampler. Verify empirically that over many draws the sampled domain frequencies match the configured mixing weights.

✎

Exercise 17: Code Lab

Build the complete single-node curation pipeline from Section 17.9. Run it on a realistic mixed corpus and report the funnel statistics at each stage.

✎

Exercise 18: Code (Challenge)

Reproduce a mini data-quality ablation: train two tiny language models on (a) unfiltered web text and (b) the same text after your full curation pipeline. Compare validation perplexity and a few generated samples. Demonstrate that curation improves the model at equal token count.

Further reading: “The RefinedWeb Dataset for Falcon LLM” (Penedo et al., 2023) and the FineWeb technical report (2024) for state-of-the-art open curation recipes. “The Pile” (Gao et al., 2020) and “Dolma” (Soldaini et al., 2024) for documented open datasets. “Deduplicating Training Data Makes Language Models Better” (Lee et al., 2022). “Documenting Large Webtext Corpora” (Dodge et al., 2021) on C4's filters and their biases. “Scaling Data-Constrained Language Models” (Muennighoff et al., 2023) on repeating data. “Datasheets for Datasets” (Gebru et al., 2018) on documentation.

Next → Chapter 18: Distributed Training

You now have a clean, curated dataset of trillions of tokens — far more than any single GPU could process. Chapter 18 confronts the engineering reality of frontier-scale training: how to spread one model and one dataset across thousands of GPUs. We will build up data parallelism, tensor parallelism, pipeline parallelism, and the ZeRO family of optimizer-sharding techniques, and see how they combine into the 3D-parallel strategies that train the largest models. The clean training loop of Chapter 15 becomes a distributed, fault-tolerant industrial system.

