Data Curation & Preprocessing
Part III built and trained the Transformer. But a model is only as good as the data it learns from. The architecture has been remarkably stable since 2017; the dramatic capability gains since then have come overwhelmingly from more and better data, plus scale. Yet data curation receives a fraction of the attention that architecture does. This chapter corrects that imbalance: the pipeline that turns the raw web into clean training tokens is the highest-leverage, least-glamorous work in building an LLM.
The Scale Required
Chapter 16's Chinchilla recipe demands roughly 20 tokens per parameter, and over-training for inference efficiency pushes this far higher. A modern 70B model trains on 15 trillion tokens, about ten times its Chinchilla-optimal budget of 1.4 trillion and the equivalent of roughly a hundred million typical-length books. There is no curated dataset of that size; it must be assembled and cleaned from the open web, and the cleaning is where most of the value is created.
| Model / dataset | Tokens | Primary sources |
|---|---|---|
| GPT-3 (2020) | 300B | Common Crawl, WebText2, books, Wikipedia |
| The Pile (2020) | ~400B | 22 curated sources (academic, code, web) |
| LLaMA-1 (2023) | 1.4T | Common Crawl, C4, GitHub, Wikipedia, books, arXiv |
| RefinedWeb (2023) | 5T | Common Crawl only, heavily filtered |
| LLaMA-3 (2024) | 15T | Web + code + multilingual, heavily curated |
| FineWeb (2024) | 15T | Common Crawl, open recipe, deduplicated |
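To make the gap between the Chinchilla ratio and actual practice concrete, the short sketch below computes tokens per parameter for three rows of the table. The token counts come from the table; the parameter counts (175B, 65B, 70B) are assumptions added here for illustration and are not stated in this chapter.

```python
# Tokens-per-parameter arithmetic for models in the table above.
# Token counts are from the table; parameter counts are ASSUMED values
# (the largest or most commonly cited size of each model).
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb from Chapter 16

MODELS = {
    # name: (parameters, training tokens)
    "GPT-3": (175e9, 300e9),
    "LLaMA-1 65B": (65e9, 1.4e12),
    "LLaMA-3 70B": (70e9, 15e12),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(params, tokens):
    """Multiple of the 20-tokens-per-parameter budget actually used."""
    return tokens_per_param(params, tokens) / CHINCHILLA_TOKENS_PER_PARAM


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    factor = overtraining_factor(params, tokens)
    print(f"{name:12s} {ratio:7.1f} tokens/param ({factor:.2f}x Chinchilla)")
```

Under these assumed sizes, the earliest model sits far below the 20:1 ratio, the 2023 model sits close to it, and the 2024 model is roughly an order of magnitude beyond it, which is the over-training trend the paragraph describes.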
The Curation Pipeline at a Glance
Turning the raw web into training tokens is a multi-stage pipeline. Each stage discards a large fraction of the input; the survivors of all stages become training data. Here is the canonical flow:
Pipeline Flow: The data curation pipeline
| Step | Stage | What happens |
|---|---|---|
| 1 | Acquire | Download Common Crawl WARC/WET archives (petabytes of raw web) |
| 2 | Extract | Pull readable text from HTML; strip boilerplate, nav, ads |
| 3 | Language ID | Detect and route by language; filter to target languages |
| 4 | Quality filter | Heuristic rules + model classifiers remove low-quality text |
| 5 | Deduplicate | Remove exact and near-duplicate documents (often 50%+ of data) |
| 6 | Safety filter | Remove toxic content, PII, and benchmark contamination |
| 7 | Mix & shuffle | Weight domains, set epoch counts, shuffle into training shards |
Common Crawl is a non-profit that has crawled the web since 2008 and releases its archives freely, in recent years as roughly monthly snapshots. It is the single largest source of pretraining data, and nearly every major LLM is built on it. Understanding its formats and quirks is the practical starting point for data curation.
WARC, WAT, and WET
| Format | Contains | Use |
|---|---|---|
| WARC | Raw HTTP responses (full HTML + headers) | Custom extraction, maximum fidelity |
| WAT | Metadata (links, headers, structure) | Link graphs, metadata analysis |
| WET | Pre-extracted plain text | Quick start, but lower-quality extraction |
Serious pipelines prefer WARC files and run their own text extraction, because the pre-extracted WET text is noisy — it includes navigation, boilerplate, and poorly-formatted content. High-quality extractors like trafilatura or resiliparse pull the main article text and discard the chrome, which substantially improves the resulting data quality.
# pip install warcio trafilatura
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path):
    """Yield clean main text from each HTML response record in a WARC file."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            # Skip non-HTML payloads (images, PDFs, JSON, ...)
            headers = record.http_headers
            ctype = headers.get_header('Content-Type', '') if headers else ''
            if 'html' not in ctype.lower():
                continue
            html = record.content_stream().read()
            # trafilatura strips nav/ads/boilerplate, keeps main text
            text = trafilatura.extract(html, include_comments=False,
                                       include_tables=False)
            if text and len(text) > 200:  # skip tiny fragments
                yield text

# A single WARC file holds ~30-50k web pages.
# A monthly Common Crawl snapshot has ~90k WARC files (~400TB).
# This is why the pipeline runs on large distributed clusters (Spark/Ray).

The web is multilingual, and most pipelines need to route documents by language: to filter to target languages, to apply language-specific quality rules, and to control the multilingual mixture. Language identification is a fast, early-stage filter that also doubles as a quality signal: text that no language detector can confidently classify is often garbage.
# Meta's fastText lid.176 model: 176 languages, very fast
import fasttext

model = fasttext.load_model('lid.176.bin')

def detect_language(text):
    """Return (lang_code, confidence) for a document."""
    # fastText predicts one line at a time, so strip newlines; sample the first 1000 chars
    text = text.replace('\n', ' ')[:1000]
    labels, probs = model.predict(text, k=1)
    lang = labels[0].replace('__label__', '')
    return lang, float(probs[0])

# Filter: keep English documents with high confidence
def keep_document(text, target='en', min_conf=0.65):
    lang, conf = detect_language(text)
    return lang == target and conf >= min_conf

# Low confidence often signals code-switching, gibberish, or
# machine-generated spam, so the confidence threshold is also a
# quality filter, not just a language router.

The web is extraordinarily repetitive: the same articles are syndicated across thousands of sites, boilerplate recurs everywhere, and popular text is quoted endlessly. Deduplication, the removal of exact and near-duplicate documents, is one of the most impactful curation steps. It typically removes 30–70% of the data, and the resulting model is BETTER, not just cheaper to train.
Why Deduplication Helps So Much
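Lee et al. (2022) showed that deduplicating training data improves model quality and reduces memorization: models trained on deduplicated data emitted memorized text roughly ten times less often and reached the same or better accuracy in fewer training steps. It is one of the rare 'free lunch' interventions: less data, less compute, AND a better model.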
Exact vs Fuzzy Deduplication
| Exact deduplication | Fuzzy (near-duplicate) deduplication |
|---|---|
| Removes byte-identical documents | Removes documents that are mostly similar |
| Hash each doc, drop docs whose hash was already seen | MinHash + LSH on shingles |
| Fast, simple, catches obvious copies | Catches near-copies, edits, reformatting |
| Misses minor variations | Catches templated/syndicated content |
| O(N) with a hash set | O(N) with LSH bucketing |
| First pass | Second, more powerful pass |
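The exact-deduplication column can be sketched in a few lines. The version below is a minimal illustration, not a production recipe: it adds an optional normalization step (Unicode NFKC, lowercasing, whitespace collapsing) before hashing, which is an assumption beyond the strictly byte-identical matching described in the table, and the function names are hypothetical.

```python
import hashlib
import unicodedata


def normalize(text):
    """Canonicalize text so trivially different copies hash identically."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())  # collapse every run of whitespace


def content_hash(text):
    """SHA-256 of the normalized text, used as the dedup key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def exact_dedup(docs):
    """Yield the first occurrence of each distinct (normalized) document."""
    seen = set()
    for doc in docs:
        key = content_hash(doc)
        if key not in seen:
            seen.add(key)
            yield doc


docs = ["Hello   World\n", "hello world", "Hello, world"]
unique = list(exact_dedup(docs))
print(unique)  # the second document is dropped as a copy of the first
```

The third document survives because a single added comma changes the hash, which is exactly the kind of minor variation that the fuzzy pass is meant to catch.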
MinHash + LSH for Near-Duplicates
Fuzzy deduplication needs to find documents that are similar but not identical. The standard approach: represent each document as a set of shingles (overlapping n-grams), estimate the Jaccard similarity between documents using MinHash signatures, and use Locality-Sensitive Hashing (LSH) to find candidate pairs efficiently without comparing all N² pairs.
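The banding step is the least intuitive part of this recipe, so here is a minimal sketch of it using only the standard library. It assumes signatures are already computed and passed in as integer sequences; the function names, the toy signatures, and the choice of 16 bands of 8 rows are illustrative assumptions. The helper `candidate_probability` evaluates the standard LSH collision formula, 1 - (1 - s^r)^b, for two documents with Jaccard similarity s.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(s, bands, rows):
    """Chance that two docs with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def lsh_candidates(signatures, bands, rows):
    """Return the set of (id_a, id_b) pairs that collide in any band.

    signatures: dict mapping doc id -> sequence of bands * rows integers.
    """
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band_idx in range(bands):
            band = tuple(sig[band_idx * rows:(band_idx + 1) * rows])
            # Real systems hash the band to a fixed-size key; the tuple itself
            # works as a dictionary key for a small in-memory sketch.
            buckets[(band_idx, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for id_a, id_b in combinations(sorted(ids), 2):
            pairs.add((id_a, id_b))
    return pairs


# Toy signatures (hypothetical values): A and B agree on the first band only.
sigs = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [1, 2, 3, 9, 9, 9],
    "C": [7, 7, 7, 8, 8, 8],
}
print(lsh_candidates(sigs, bands=2, rows=3))

# With 128 hashes split as 16 bands x 8 rows, the collision curve is steep:
for s in (0.5, 0.8, 0.9):
    print(s, round(candidate_probability(s, bands=16, rows=8), 3))
```

Only colliding pairs are passed on to the expensive verification step, so the number of bands and rows sets the similarity level at which documents start to be compared at all.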
# 1. Shingle each document
shingles(doc) = { all overlapping k-word sequences }

# 2. MinHash signature estimates Jaccard similarity
for each of the n hash functions h_i:
    sig[i] = min over shingles of h_i(shingle)
# Pr[sig_A[i] == sig_B[i]] = Jaccard(A, B)

# 3. LSH: band the signature, hash bands into buckets
split signature into b bands of r rows each
documents sharing a band-bucket are CANDIDATE duplicates

# 4. Verify candidates, drop near-duplicates above threshold
for each candidate pair: if Jaccard > 0.8, remove one

import zlib
import numpy as np

def shingles(text, k=5):
    """Set of k-word shingles (a single shingle if the text has fewer than k words)."""
    words = text.split()
    return {' '.join(words[i:i+k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set, n_hashes=128, seed=0):
    """Compute an n_hashes-dim MinHash signature."""
    rng = np.random.default_rng(seed)
    P = 2**31 - 1  # prime modulus
    # Random hash coefficients for h_i(x) = (a_i * x + b_i) mod P
    a = rng.integers(1, P, n_hashes)
    b = rng.integers(0, P, n_hashes)
    sig = np.full(n_hashes, P, dtype=np.int64)  # P exceeds every hash value
    for sh in shingle_set:
        # zlib.crc32 is stable across runs; the built-in hash() is salted per process
        x = zlib.crc32(sh.encode('utf-8')) % P
        sig = np.minimum(sig, (a * x + b) % P)  # vectorized min
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries ≈ Jaccard similarity."""
    return (sig_a == sig_b).mean()

# Demo: two near-duplicate documents
doc1 = 'the quick brown fox jumps over the lazy dog every morning'
doc2 = 'the quick brown fox jumps over the lazy dog each morning'
doc3 = 'machine learning models require enormous amounts of training data'
s1, s2, s3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
print(f"sim(near-dup):  {estimate_jaccard(s1, s2):.2f}")  # ~0.6 high (true Jaccard is 5/9)
print(f"sim(unrelated): {estimate_jaccard(s1, s3):.2f}")  # ~0.0 low

Most of the web is not useful training data: spam, auto-generated text, keyword-stuffed SEO pages, broken markup, and incoherent fragments. Quality filtering removes this noise. Two complementary approaches dominate: fast heuristic rules and slower model-based classifiers.
Heuristic Filters
Heuristic filters are cheap rules that catch obvious garbage. Gopher (Rae et al., 2021) published an influential set of such rules, now widely adopted. They are fast enough to run on every document and catch a large fraction of low-quality content.
| Filter | Removes |
|---|---|
| Document length bounds | Too-short fragments and absurdly long dumps |
| Mean word length 3–10 chars | Gibberish, encoding errors, code masquerading as text |
| Symbol-to-word ratio < 0.1 | Math dumps, markup soup, ASCII art |
| Stop-word presence | Keyword lists and SEO spam lack natural function words |
| Fraction of lines ending in an ellipsis (...) | Boilerplate, truncated listings |
| Fraction of lines starting with a bullet | Navigation menus, link farms |
| Duplicate line/paragraph fraction | Templated or repetitive spam |
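The two line-level rules in the table (bullet lines and ellipsis lines) can be sketched as follows. The thresholds of 0.9 and 0.3 follow the values commonly attributed to the Gopher rules and should be treated as assumptions to tune; the function names and bullet markers are hypothetical choices for this illustration.

```python
BULLETS = ("-", "*", "\u2022")    # ASCII and Unicode bullet markers (assumed set)
ELLIPSES = ("...", "\u2026")      # three dots or the single ellipsis character


def line_fractions(text):
    """Return (bullet_fraction, ellipsis_fraction) over non-empty lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0, 0.0
    bullets = sum(ln.startswith(BULLETS) for ln in lines)
    ellipses = sum(ln.endswith(ELLIPSES) for ln in lines)
    return bullets / len(lines), ellipses / len(lines)


def passes_line_rules(text, max_bullet=0.9, max_ellipsis=0.3):
    """Reject documents dominated by bullet lines or truncated lines."""
    bullet_frac, ellipsis_frac = line_fractions(text)
    return bullet_frac <= max_bullet and ellipsis_frac <= max_ellipsis


menu = "- Home\n- About\n- Contact\n- Login"
prose = ("Deduplication matters.\nIt removes repeated text.\n"
         "It also cuts memorization.\nRead more...")
print(passes_line_rules(menu))   # every line is a bullet, so the menu is rejected
print(passes_line_rules(prose))  # one truncated line in four is tolerated
```

A line such as "-5 degrees" would be counted as a bullet here, so a real filter would want a stricter bullet pattern.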
STOP_WORDS = {'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'it'}

def passes_quality(text):
    """Apply Gopher-style heuristic quality filters."""
    words = text.split()
    n = len(words)
    if not (50 <= n <= 100_000):  # length bounds (also guards against n == 0)
        return False
    # Mean word length in [3, 10]
    mean_len = sum(len(w) for w in words) / n
    if not (3 <= mean_len <= 10):
        return False
    # Must contain common stop words (natural language signal)
    lower = {w.lower() for w in words}
    if len(STOP_WORDS & lower) < 2:
        return False
    # Symbol-to-word ratio (hash/ellipsis)
    n_symbols = text.count('#') + text.count('...')
    if n_symbols / n > 0.1:
        return False
    # Duplicate-line fraction, ignoring blank lines so that ordinary
    # paragraph spacing is not counted as repetition
    lines = [ln for ln in text.split('\n') if ln.strip()]
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False
    return True

Model-Based Quality Classifiers
Heuristics catch obvious garbage but cannot judge subtle quality. Model-based filters train a classifier to distinguish 'high-quality' from 'low-quality' text. The classic approach (used by GPT-3 and LLaMA): train a classifier to recognize text resembling a high-quality reference corpus (e.g. Wikipedia, books, or curated web), and keep documents the classifier rates highly.
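Keeping "documents the classifier rates highly" does not have to mean a hard cutoff. The sketch below illustrates the stochastic rule reported for GPT-3, which keeps a document when a Pareto draw exceeds one minus its score. The shape parameter of 9 is taken from that report and should be treated as an assumption here, and the function names are hypothetical. Because Python's `random.paretovariate` returns values of at least 1, we subtract 1 to obtain the shifted (Lomax) draw that the rule expects.

```python
import random

ALPHA = 9  # shape parameter reported for GPT-3's filter; treat as tunable


def keep_probability(score, alpha=ALPHA):
    """Closed-form chance that a document with this score is kept."""
    # For a Lomax variable X, Pr[X > t] = (1 + t) ** -alpha, with t = 1 - score.
    return (2.0 - score) ** (-alpha)


def pareto_keep(score, rng, alpha=ALPHA):
    """Keep the document if a Lomax (Pareto II) draw exceeds 1 - score."""
    return rng.paretovariate(alpha) - 1.0 > 1.0 - score


rng = random.Random(0)
trials = 20_000
for score in (0.2, 0.8, 0.95):
    kept = sum(pareto_keep(score, rng) for _ in range(trials)) / trials
    print(f"score={score:.2f} empirical={kept:.3f} "
          f"exact={keep_probability(score):.3f}")
```

High-scoring documents are almost always kept, while low-scoring ones survive occasionally, which preserves some of the diversity that a hard threshold would discard.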
import fasttext

# Train: positive = curated text (Wikipedia, books), negative = random web
# Label format for fastText: '__label__high <text>' / '__label__low <text>'
model = fasttext.train_supervised('quality_train.txt', epoch=5, wordNgrams=2)

def quality_score(text):
    """Probability the document is high-quality (0-1)."""
    text = text.replace('\n', ' ')
    labels, probs = model.predict(text, k=2)
    scores = dict(zip(labels, probs))
    return float(scores.get('__label__high', 0.0))

# Pareto-style filtering: keep documents above a quality threshold,
# OR sample stochastically so some borderline documents survive;
# hard thresholds can remove useful diversity (GPT-3 used a Pareto sampler).

Beyond quality, some content should be removed for safety and legal reasons: toxic and abusive text, personally identifiable information (PII), and illegal material. This filtering reduces harmful model behaviour and memorization of private data, though it involves genuine trade-offs.
Categories of Safety Filtering
| Category | Examples | Method |
|---|---|---|
| Toxicity | Hate speech, harassment, abuse | Classifier (e.g. trained on toxicity labels) |
| PII | Emails, phone numbers, SSNs, addresses | Regex + NER detection and redaction |
| CSAM / illegal | Illegal material | Hash matching, strict removal |
| Benchmark data | Test sets for evaluations | N-gram overlap detection (decontamination) |
| Copyright-flagged | Some explicitly protected works | Source exclusion lists |
PII Detection and Redaction
import re

PII_PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'ip': r'\b(?:\d{1,3}\.){3}\d{1,3}\b',
}

def redact_pii(text):
    """Replace detected PII with type placeholders."""
    for kind, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f'<{kind.upper()}>', text)
    return text

print(redact_pii('Contact me at jane@example.com or 555-123-4567'))
# Contact me at <EMAIL> or <PHONE>

# Production systems add NER-based tools (e.g. Presidio) for names and
# addresses that regex cannot reliably catch. Redaction reduces
# memorization of private data (Carlini et al., 2021 showed LLMs
# memorize and can regurgitate training-set PII verbatim).

Benchmark Decontamination
A subtle but critical safety filter is decontamination: removing copies of evaluation benchmarks from the training data. If test questions leak into training, evaluation scores are inflated and meaningless. Pipelines detect benchmark text via n-gram overlap and remove matching documents — but contamination is pervasive and hard to fully eliminate, which is why Chapter 21 treats evaluation contamination as a first-class concern.
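The n-gram overlap check described above can be sketched with plain sets. This is a minimal illustration: the function names and example strings are hypothetical, and the window of 8 words is chosen only to fit the toy example (longer windows, such as 13 words, are a common choice in practice).

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_texts, n):
    """Union of all n-grams that appear in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc, index, n):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(index)


N = 8  # illustrative window size
benchmark = ["What is the capital of France ? The capital of France is Paris ."]
index = build_benchmark_index(benchmark, N)

leaked = ("Quiz answers: what is the capital of france ? "
          "the capital of france is paris .")
clean = "Paris is a large city on the Seine with many museums and cafes to visit ."

print(is_contaminated(leaked, index, N))  # shares the benchmark's word sequence
print(is_contaminated(clean, index, N))   # same topic, but no shared 8-gram
```

Exact n-gram matching misses paraphrased or translated test items, which is one reason contamination is hard to eliminate completely.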
Pretraining data is not monolithic: it spans web text, code, books, academic papers, Wikipedia, and more. The data mixture (how much of each domain, and how many times each is repeated) is a powerful lever on the model's capabilities. As a rule of thumb, more code tends to improve reasoning and coding, more books tend to improve long-form coherence, and more academic text tends to improve technical knowledge.
Typical Domain Weights
| Domain | Typical weight | What it provides |
|---|---|---|
| Filtered web (CC) | 60–80% | Breadth, general knowledge, fluency |
| Code | 5–20% | Reasoning, structure, tool use |
| Books | 5–15% | Long-form coherence, narrative, depth |
| Academic (arXiv, etc.) | 2–10% | Technical knowledge, formal reasoning |
| Wikipedia | 1–5% | Factual grounding, structured knowledge |
| Q&A / forums | 1–5% | Conversational, instructional patterns |
Weights are usually expressed as sampling probabilities: when assembling a training batch, sample each domain with its weight. High-quality but small domains (Wikipedia, books) are often UP-weighted relative to their raw size — meaning the model sees them multiple times (multiple epochs) while seeing abundant web data once. This is where the epoch question becomes central.
How Many Epochs?
Should the model see each token once, or repeat the data? Muennighoff et al. (2023) studied repeating data under the data wall and found that up to ~4 epochs of repetition is nearly as good as fresh data, but returns diminish sharply beyond that. So small high-quality domains can be safely repeated a few times; the abundant web is typically seen once or twice.
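Mixture weights and epoch counts are two views of the same quantity: the number of epochs a domain is seen equals the tokens drawn from it divided by the tokens it contains. The sketch below works through that arithmetic. The corpus sizes and weights are hypothetical numbers chosen for illustration, and the threshold of 4 epochs is the rough limit quoted above.

```python
def effective_epochs(total_tokens, weights, domain_sizes):
    """Epochs per domain = tokens sampled from it / tokens it contains."""
    norm = sum(weights.values())
    return {
        name: total_tokens * (weights[name] / norm) / domain_sizes[name]
        for name in weights
    }


# Hypothetical corpus sizes (in tokens) and mixing weights.
SIZES = {"web": 1.0e12, "code": 150e9, "books": 50e9, "wiki": 10e9}
WEIGHTS = {"web": 0.70, "code": 0.15, "books": 0.10, "wiki": 0.05}

epochs = effective_epochs(1.0e12, WEIGHTS, SIZES)
for name, value in epochs.items():
    flag = "  <- beyond ~4 epochs" if value > 4 else ""
    print(f"{name:6s} {value:5.2f} epochs{flag}")
```

In this made-up mixture a 5% weight on a 10B-token Wikipedia-sized domain already implies five passes over it in a 1T-token run, so up-weighting small domains runs into the repetition limit quickly.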
import numpy as np

DOMAINS = {
    'web':   {'weight': 0.67, 'epochs': 1},
    'code':  {'weight': 0.15, 'epochs': 1},
    'books': {'weight': 0.10, 'epochs': 2},
    'arxiv': {'weight': 0.05, 'epochs': 2},
    'wiki':  {'weight': 0.03, 'epochs': 3},
}

def sample_domain(rng):
    """Sample a domain according to its mixing weight."""
    names = list(DOMAINS)
    weights = np.array([DOMAINS[n]['weight'] for n in names])
    weights /= weights.sum()  # normalize
    return rng.choice(names, p=weights)

# Each training batch draws domains by weight; small high-quality
# domains repeat across epochs while web data is seen ~once.
# The mixture is a key hyperparameter: DoReMi (Xie et al., 2023)
# even learns the domain weights with a small proxy model.

Chapter 16 introduced the data wall: the finite supply of high-quality human text, which Chinchilla-optimal scaling threatens to exhaust. One major response is synthetic data (text generated by existing models), which can be produced in effectively unlimited quantities and tailored for quality and diversity.
Forms of Synthetic Data
| Approach | How |
|---|---|
| Textbook generation | Prompt a strong model to write clean, pedagogical text (Phi series) |
| Rephrasing | Rewrite web text into cleaner, more consistent form |
| Distillation | Train a small model on a larger model's outputs |
| Self-instruct | Generate instruction-response pairs for fine-tuning |
| Reasoning traces | Generate step-by-step solutions to augment reasoning data |
The Phi model series (Gunasekar et al., 2023, 'Textbooks Are All You Need') demonstrated that a small model trained largely on synthetic 'textbook-quality' data could rival much larger models on reasoning and coding. This showed that data QUALITY and structure can substitute for raw scale — a direct lever against the data wall.
We now assemble the stages into a complete pipeline. In practice this runs as a distributed job (Spark, Ray, or custom MapReduce) over thousands of machines, because the data volume is far too large for a single node. The logical flow, however, is the sequence of filters we have built.
import hashlib

# Relies on detect_language, passes_quality, and redact_pii defined earlier
# in this chapter, plus an is_toxic classifier that is assumed to be available.
def curate(raw_documents, target_lang='en'):
    """Run the full curation pipeline over a stream of raw documents."""
    seen_hashes = set()
    kept = []
    stats = {'raw': 0, 'lang': 0, 'quality': 0, 'dedup': 0, 'safe': 0}
    for doc in raw_documents:
        stats['raw'] += 1
        # 1. Language filter
        lang, conf = detect_language(doc)
        if lang != target_lang or conf < 0.65:
            continue
        stats['lang'] += 1
        # 2. Quality filter
        if not passes_quality(doc):
            continue
        stats['quality'] += 1
        # 3. Exact dedup (hash)
        h = hashlib.md5(doc.encode()).hexdigest()
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        stats['dedup'] += 1
        # (fuzzy dedup via MinHash/LSH would run as a separate pass)
        # 4. Safety: redact PII, drop toxic
        doc = redact_pii(doc)
        if is_toxic(doc):  # classifier from 17.6
            continue
        stats['safe'] += 1
        kept.append(doc)
    print("Funnel:", stats)
    return kept

# Typical funnel from raw Common Crawl (illustrative):
# raw: 1,000,000 -> lang: 400,000 -> quality: 120,000
# -> dedup: 65,000 -> safe: 60,000 (6% survive)

A model inherits everything about its training data: its knowledge, its blind spots, its biases, and its legal exposure. Responsible data curation includes documenting what went in and grappling with the ethical and legal questions that web-scale data raises.
Datasheets and Documentation
Gebru et al. (2018) proposed datasheets for datasets: structured documentation of a dataset's composition, collection process, intended uses, and known limitations. For pretraining data this includes the sources, the filters applied, the languages covered, and the deduplication and decontamination steps. Documentation enables reproducibility, accountability, and informed downstream use.
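One lightweight way to keep such documentation next to the data is a machine-readable record that the pipeline writes alongside its output. The sketch below is an illustration only: the field names and example values are hypothetical and cover just the items listed in the paragraph above, not the full questionnaire that Gebru et al. propose.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    """Minimal, hypothetical datasheet record for a pretraining corpus."""
    name: str
    sources: list
    languages: list
    filters: list = field(default_factory=list)
    deduplication: str = "none"
    decontaminated_against: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record so it can be stored with the dataset."""
        return json.dumps(asdict(self), indent=2)


sheet = Datasheet(
    name="example-web-corpus",  # hypothetical dataset name
    sources=["Common Crawl"],
    languages=["en"],
    filters=["fastText language ID", "Gopher-style heuristics"],
    deduplication="exact hash + MinHash/LSH",
    decontaminated_against=["(list evaluation sets here)"],
    known_limitations=["English only", "web text over-represents some voices"],
)
print(sheet.to_json())
```

Because the record is generated by the same job that applies the filters, it is less likely to drift from what the pipeline actually did.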
The Hard Questions
| Issue | The tension |
|---|---|
| Copyright | Web text is copyrighted; fair-use status of training is legally unsettled |
| Consent | Authors did not consent to their text training commercial models |
| Attribution | Models reproduce content without crediting sources |
| Bias | The web over-represents some voices and under-represents others |
| Privacy | Personal data on the web ends up in model weights |
| Labor | Data filtering and annotation often rely on low-paid workers |
Curation Quick-Reference
| Stage | Method | Typical effect |
|---|---|---|
| Extraction | trafilatura on WARC | Web HTML → clean text |
| Language ID | fastText lid.176 | Route/filter by language |
| Quality | Heuristics + classifier | Remove spam/garbage |
| Exact dedup | Hashing | Drop identical copies |
| Fuzzy dedup | MinHash + LSH | Drop near-duplicates (30–70%) |
| Safety | Toxicity clf + PII redaction | Remove harmful/private content |
| Decontamination | N-gram overlap | Remove benchmark leakage |
| Mixing | Weighted sampling + epochs | Balance domains |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “The RefinedWeb Dataset for Falcon LLM” (Penedo et al., 2023) and the FineWeb technical report (2024) for state-of-the-art open curation recipes. “The Pile” (Gao et al., 2020) and “Dolma” (Soldaini et al., 2024) for documented open datasets. “Deduplicating Training Data Makes Language Models Better” (Lee et al., 2022). “Documenting Large Webtext Corpora” (Dodge et al., 2021) on C4's filters and their biases. “Scaling Data-Constrained Language Models” (Muennighoff et al., 2023) on repeating data. “Datasheets for Datasets” (Gebru et al., 2018) on documentation.
Next → Chapter 18: Distributed Training
You now have a clean, curated dataset of trillions of tokens — far more than any single GPU could process. Chapter 18 confronts the engineering reality of frontier-scale training: how to spread one model and one dataset across thousands of GPUs. We will build up data parallelism, tensor parallelism, pipeline parallelism, and the ZeRO family of optimizer-sharding techniques, and see how they combine into the 3D-parallel strategies that train the largest models. The clean training loop of Chapter 15 becomes a distributed, fault-tolerant industrial system.