Tokenization
In Chapter 13 you built a Transformer that maps token IDs to predictions — but where do token IDs come from? Tokenization is the process that turns raw text into the sequence of integers a model consumes, and converts the model's integer outputs back into text. It is the unglamorous interface layer that every other part of the model depends on, and its design choices ripple through everything from arithmetic ability to multilingual fairness.
The Fundamental Tension
Tokenization must choose a unit of text. The two obvious choices both fail, and the failure of each motivates the subword compromise that all modern tokenizers use.
| Character-level tokens | Word-level tokens |
|---|---|
| Tiny vocabulary (~256 bytes) | Huge vocabulary (millions of words) |
| Never out-of-vocabulary | Out-of-vocabulary words break it |
| Sequences are very long | Sequences are short |
| Model must learn spelling from scratch | No sharing across word forms |
| Wastes context on trivial structure | 'run' and 'running' unrelated |
| Slow: O(T²) attention over long T | Cannot handle typos, new words |
Characters give a tiny vocabulary and never fail on unseen text, but produce very long sequences — and attention is quadratic in length. Words give short sequences but an unbounded vocabulary, and shatter on any unseen word, typo, or morphological variant. Subword tokenization splits the difference: common words stay whole, rare words decompose into meaningful pieces.
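A quick way to see the trade-off is to tokenize one sentence both ways. The sketch below is illustrative only: the sentence, the whitespace split standing in for word-level tokenization, and the name `word_vocab` are assumptions, not part of any real tokenizer.

```python
text = "The runner was running and reran the run."

# Character-level (here: UTF-8 bytes): tiny fixed vocabulary, long sequence.
byte_tokens = list(text.encode("utf-8"))

# Word-level (naive whitespace split): short sequence, open-ended vocabulary.
word_tokens = text.split()

print(len(byte_tokens))  # 41
print(len(word_tokens))  # 8

# A word-level vocabulary built from this sentence cannot represent new forms.
word_vocab = {w: i for i, w in enumerate(sorted(set(word_tokens)))}
print("runs" in word_vocab)     # False: out-of-vocabulary
print(max(byte_tokens) < 256)   # True: bytes never fall outside the vocabulary
```

The byte sequence is about five times longer than the word sequence, while the word vocabulary already fails on 'runs' even though 'run.', 'runner', and 'running' were all seen.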
Byte-Pair Encoding, originally a 1994 data-compression algorithm, was adapted for tokenization by Sennrich et al. (2016). It is the most widely used tokenization method, powering GPT-2, GPT-3, GPT-4, and many others. The idea is elegantly simple: start with individual characters and repeatedly merge the most frequent adjacent pair into a new token.
The BPE Training Algorithm
```
# Start: every word is a sequence of characters
vocab ← all individual characters in the corpus
represent each word as a list of characters + end marker
repeat until vocab reaches target size:
    count all adjacent symbol pairs across the corpus
    pair ← most frequent adjacent pair
    merge pair into a new symbol; add to vocab
    record the merge rule (order matters!)
return vocab and ordered list of merge rules
```

A worked example on a tiny corpus shows the mechanism. Suppose the word 'low' appears 5 times and 'lower' twice. BPE counts adjacent pairs and finds that 'l'+'o' and 'o'+'w' tie for most frequent at 7 occurrences each. It merges 'l'+'o' into 'lo', then 'lo'+'w' into 'low', and so on, building up common substrings into single tokens.
```python
import collections
import re

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in word_freqs.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only at symbol boundaries, never inside a longer symbol
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in word_freqs:
        new = pattern.sub(''.join(pair), word)
        merged[new] = word_freqs[word]
    return merged

def train_bpe(word_freqs, n_merges):
    """word_freqs: {space-separated chars: count}. Returns merge list."""
    merges = []
    for _ in range(n_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# Tiny corpus: words split into chars with end-of-word marker </w>
corpus = {
    'l o w </w>': 5, 'l o w e r </w>': 2,
    'n e w e s t </w>': 6, 'w i d e s t </w>': 3,
}
merges = train_bpe(corpus, n_merges=10)
print(merges[:5])
# [('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')]
# 'est</w>' became one token because it recurs in 'newest' and 'widest'
```

Encoding New Text
Once trained, encoding applies the learned merges in order: split the text into characters, then apply each merge rule in turn wherever it matches. The order matters, because merges learned earlier have higher priority. Decoding is trivial: concatenate the token strings and turn each end-of-word marker back into a space.
```python
def encode_bpe(word, merges):
    """Apply learned merges in priority order to tokenize a word."""
    symbols = list(word) + ['</w>']
    for pair in merges:  # merges are in priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [''.join(pair)]  # merge in place
            else:
                i += 1
    return symbols

print(encode_bpe('lowest', merges))
# ['low', 'est</w>'] -- 'low' and 'est</w>' were learned as units
# Even though 'lowest' never appeared in training, it decomposes cleanly.
```

Character-level BPE has a hidden flaw: the set of possible characters is enormous (Unicode assigns roughly 150,000 characters) and open-ended (new emoji appear). If a character was never seen in training, it cannot be tokenized. GPT-2 solved this elegantly: run BPE over raw bytes instead of characters.
The Byte-Level Guarantee
Every piece of text, in any language or script, is ultimately a sequence of bytes — and there are only 256 possible byte values. By running BPE on bytes, the base vocabulary is exactly 256 tokens, and any text whatsoever can be represented. There are no out-of-vocabulary tokens, ever, by construction.
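The guarantee can be checked directly with Python's built-in UTF-8 codec. This is a minimal sketch of the base layer only (no merges); the helper names `to_byte_tokens` and `from_byte_tokens` are made up for illustration.

```python
def to_byte_tokens(text):
    """Map any string to base token IDs in the range 0..255."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Invert to_byte_tokens: bytes back to the original string."""
    return bytes(ids).decode("utf-8")

# ASCII, accented Latin, Chinese, and an emoji (written as escapes for safety)
samples = ["hello", "na\u00efve", "\u4f60\u597d", "\U0001F642"]

for s in samples:
    ids = to_byte_tokens(s)
    assert all(0 <= i <= 255 for i in ids)   # always inside the base vocabulary
    assert from_byte_tokens(ids) == s        # lossless round trip
    print(f"{len(s)} code point(s) -> {len(ids)} byte token(s)")
# 5 code point(s) -> 5 byte token(s)
# 5 code point(s) -> 6 byte token(s)
# 2 code point(s) -> 6 byte token(s)
# 1 code point(s) -> 4 byte token(s)
```

Nothing here can fail on unseen input; the cost is that non-ASCII characters start out as two to four tokens each until learned merges shorten them.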
```
Any text → UTF-8 bytes → sequence in {0, ..., 255}
Base vocab = 256 byte tokens (always complete)
+ learned merges of frequent byte sequences
⇒ zero out-of-vocabulary tokens, any language, any emoji, any symbol
```

```python
# pip install tiktoken
import tiktoken

# GPT-4's tokenizer
enc = tiktoken.get_encoding('cl100k_base')
text = 'Tokenization shapes everything.'
ids = enc.encode(text)
print(ids)  # [3404, 2065, 13745, 5238, 13]
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', ' shapes', ' everything', '.']
# Note: ' shapes' includes the leading space -- spaces attach to words

# Token count varies wildly by content type
print(len(enc.encode('hello world')))  # 2 tokens
print(len(enc.encode('你好世界')))      # 6 tokens (Chinese costs more)
print(len(enc.encode('1234567890')))   # 4 tokens (digits split oddly)
# Rule of thumb for English: ~1 token ≈ 0.75 words ≈ 4 characters
```

BPE is the most common tokenizer, but two important alternatives exist. WordPiece (used by BERT) merges by likelihood rather than raw frequency. Unigram (used by many SentencePiece models) takes the opposite approach: it starts with a large vocabulary and prunes it down probabilistically.
WordPiece: Merge by Likelihood
WordPiece is nearly identical to BPE but changes the merge criterion. Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data — effectively the pair whose merged frequency most exceeds what you'd expect from its parts appearing independently.
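To see how the two criteria can disagree, the sketch below scores every adjacent pair in a toy corpus both ways, using the ratio count(a,b) / (count(a) · count(b)) as the WordPiece score. The corpus and the helper names are hypothetical, and real WordPiece trainers differ in details such as the `##` continuation prefix.

```python
from collections import Counter

# Hypothetical toy corpus: each word is a tuple of symbols with a frequency.
word_freqs = {
    ("t", "h", "e"): 10,
    ("t", "h", "a", "t"): 4,
    ("q", "u", "i", "z"): 2,
}

def count_symbols_and_pairs(word_freqs):
    """Frequency-weighted counts of single symbols and adjacent pairs."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return symbol_counts, pair_counts

symbol_counts, pair_counts = count_symbols_and_pairs(word_freqs)

def wordpiece_score(pair):
    """Pair count normalized by the counts of its two parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

bpe_choice = max(pair_counts, key=pair_counts.get)      # raw frequency
wordpiece_choice = max(pair_counts, key=wordpiece_score)  # normalized score

print(bpe_choice)        # ('t', 'h')
print(wordpiece_choice)  # ('q', 'u')
```

BPE picks ('t', 'h') because it is simply the most common pair. The normalized score instead favors ('q', 'u'): 'q' and 'u' are rare, but in this corpus they never appear apart.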
```
BPE:       merge argmax count(a, b)
WordPiece: merge argmax count(a, b) / (count(a) · count(b))
# WordPiece prefers pairs that co-occur more than chance would predict
# (this is the ratio inside pointwise mutual information, from Chapter 4)
```

Unigram: Prune Down, Don't Build Up
Unigram (Kudo, 2018) inverts the process. It starts with a large candidate vocabulary, assigns each token a probability, and iteratively removes the tokens whose removal lowers the likelihood of the training data the least, pruning until the target size is reached. At encoding time, it finds the most probable segmentation of the text under the unigram language model.
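The encoding step, finding the most probable segmentation, is a small dynamic program. The sketch below covers only that step (not the pruning loop) and uses made-up token probabilities; `viterbi_segment` is an illustrative name, not the SentencePiece API.

```python
import math

# Hypothetical unigram probabilities for a tiny vocabulary (they sum to 1).
token_probs = {
    "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "s": 0.05, "t": 0.05,
    "lo": 0.10, "low": 0.20, "est": 0.20, "west": 0.05, "owe": 0.15,
}

def viterbi_segment(text, token_probs):
    """Return the highest-probability segmentation of `text` and its log-prob."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any split of text[:i]
    best_start = [0] * (n + 1)          # where the last token of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in token_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(token_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string to the start.
    tokens, end = [], n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best_score[n]

tokens, logp = viterbi_segment("lowest", token_probs)
print(tokens)  # ['low', 'est']
```

Under these probabilities 'low' + 'est' scores 0.2 × 0.2 = 0.04, which beats 'lo' + 'west' (0.1 × 0.05 = 0.005) and every split into smaller pieces.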
| Method | Direction | Merge/keep criterion | Used in |
|---|---|---|---|
| BPE | Build up | Most frequent pair | GPT-2/3/4, RoBERTa |
| WordPiece | Build up | Highest likelihood gain | BERT, DistilBERT, Electra |
| Unigram | Prune down | Smallest likelihood loss | T5, ALBERT, mBART, XLNet |
| SentencePiece | Wrapper (library) | Raw text → BPE or Unigram | Many multilingual models |
Vocabulary size is one of the most important tokenizer hyperparameters, and it trades off two competing costs. A larger vocabulary tokenizes text into fewer tokens (shorter sequences, cheaper attention) but requires a larger embedding matrix and softmax (more parameters, more compute in the output layer).
The Two Competing Costs
| Larger vocabulary | Smaller vocabulary |
|---|---|
| Fewer tokens per text (shorter sequences) | More tokens per text (longer sequences) |
| Cheaper attention (smaller T) | More expensive attention (larger T) |
| Larger embedding & softmax matrices | Smaller embedding & softmax matrices |
| Rare tokens get few training updates | Tokens are more frequent, better trained |
| More of each language fits in context | Context fills with subword fragments |
Typical Vocabulary Sizes
| Model | Vocab size | Tokenizer |
|---|---|---|
| BERT | 30,522 | WordPiece |
| GPT-2 | 50,257 | Byte-level BPE |
| GPT-3 | 50,257 | Byte-level BPE |
| GPT-4 (cl100k) | 100,277 | Byte-level BPE |
| LLaMA-2 | 32,000 | SentencePiece BPE |
| LLaMA-3 | 128,256 | Byte-level BPE (tiktoken) |
| Gemma | 256,000 | SentencePiece (large multilingual) |
The trend over time is toward larger vocabularies. LLaMA-3 quadrupled LLaMA-2's vocabulary (32k → 128k), and Gemma uses 256k. Larger vocabularies improve multilingual coverage and tokenization efficiency, and at large model scale the extra embedding parameters are a small fraction of the total.
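The claim about embedding parameters is easy to check with arithmetic: an embedding matrix has V × d entries, and an untied output head adds another V × d. The sketch below uses hypothetical model sizes (the `d_model` values and non-embedding parameter counts are round numbers chosen for illustration, not published configurations).

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token embedding, plus a separate LM head if untied."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

def embedding_fraction(vocab_size, d_model, other_params, tied=True):
    """Share of all parameters that sits in the vocabulary-sized matrices."""
    emb = embedding_params(vocab_size, d_model, tied)
    return emb / (emb + other_params)

# Hypothetical configurations: (label, d_model, non-embedding parameter count)
models = [("small", 768, 100_000_000), ("large", 8192, 70_000_000_000)]

for label, d_model, other in models:
    for vocab_size in (32_000, 128_000):
        frac = embedding_fraction(vocab_size, d_model, other)
        print(f"{label:>5} model, V={vocab_size:>7,}: {frac:.1%} of parameters")
```

Under these assumed sizes, quadrupling the vocabulary moves the small model's embedding share from roughly 20% to roughly 50%, while the large model stays below 2%.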
LLMs are famously unreliable at arithmetic, and a surprising amount of the blame falls on tokenization. The way numbers get split into tokens is often inconsistent and counterintuitive, making it hard for the model to learn the place-value structure that arithmetic requires.
The Number-Splitting Problem
Consider how GPT-2's tokenizer splits numbers: a short number such as '127' may be a single token, while longer numbers break into pieces of one, two, or three digits, depending on which digit sequences happened to be frequent in training. The model sees no consistent representation of place value, so it cannot easily learn that the '1' in '127' means one hundred.
```python
import tiktoken

enc = tiktoken.get_encoding('gpt2')
# Numbers split inconsistently -- no place-value structure
for n in ['127', '128', '1234', '12345', '1000000']:
    ids = enc.encode(n)
    print(f"{n:>8}: {len(ids)} tokens {[enc.decode([i]) for i in ids]}")
#      127: 1 tokens ['127']
#      128: 1 tokens ['128']
#     1234: 2 tokens ['12', '34']
#    12345: 2 tokens ['123', '45']
#  1000000: 3 tokens ['1', '000', '000']
# No consistent digit grouping -- place value is scrambled.
```

Modern tokenizers mitigate this. LLaMA-1 and LLaMA-2 split every digit into its own token, giving a consistent per-digit representation. GPT-4's tokenizer groups digits left to right in chunks of up to three. Both choices give numbers a more regular representation, which has been reported to improve arithmetic accuracy.
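Both mitigations amount to a pre-tokenization rule that cuts digit runs before any merges are applied. The sketch below shows the two rules as plain regular expressions; it is an illustration of the idea with hypothetical function names, not the actual pattern used by any particular tokenizer.

```python
import re

def split_single_digits(text):
    """Per-digit scheme: every digit becomes its own piece."""
    return re.findall(r"[0-9]|[^0-9]+", text)

def split_digit_chunks(text, max_len=3):
    """Chunk scheme: digit runs are cut left to right into pieces of at most max_len."""
    return re.findall(r"[0-9]{1,%d}|[^0-9]+" % max_len, text)

print(split_single_digits("127"))       # ['1', '2', '7']
print(split_digit_chunks("1234567"))    # ['123', '456', '7']
print(split_digit_chunks("x=1000000"))  # ['x=', '100', '000', '0']
```

Note that left-to-right chunking does not line up with thousands separators: 1,000,000 is written with groups counted from the right, while the chunks here are '100', '000', '0'.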
Beyond ordinary text tokens, every tokenizer reserves special tokens that carry structural meaning: marking the start and end of sequences, padding, masking, and — for chat models — delineating turns between user and assistant. These tokens are how the model knows where a document begins, where a user's message ends, and when to stop generating.
| Special token | Purpose |
|---|---|
| <\|endoftext\|> / </s> | Marks document or sequence boundaries |
| <bos> / <s> | Beginning of sequence |
| [PAD] | Padding to align batch lengths |
| [MASK] | Masked position for BERT-style training |
| [CLS] / [SEP] | Classification token / segment separator (BERT) |
| <\|im_start\|> / <\|im_end\|> | Chat turn boundaries (ChatML format) |
| <\|system\|> <\|user\|> <\|assistant\|> | Role markers in chat templates |
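Special tokens have to be matched as whole units rather than run through the ordinary merge rules. One common way to arrange this, assumed here, is to split the text on the special strings first and encode only the remaining pieces normally. In the sketch below the token IDs (256, 257) and the byte-level stand-in encoder are hypothetical.

```python
import re

# Hypothetical special tokens, given IDs just above the 256 byte tokens.
SPECIAL_TOKENS = {"<|im_start|>": 256, "<|im_end|>": 257}

def byte_encode(text):
    """Stand-in for a real subword encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def split_on_special(text, special_tokens):
    """Cut text into pieces so each special token is a piece of its own."""
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

def encode_with_special(text, special_tokens, encode_ordinary):
    """Special tokens map straight to their IDs; everything else is encoded normally."""
    ids = []
    for piece in split_on_special(text, special_tokens):
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(encode_ordinary(piece))
    return ids

print(split_on_special("<|im_start|>hi<|im_end|>", SPECIAL_TOKENS))
# ['<|im_start|>', 'hi', '<|im_end|>']
print(encode_with_special("<|im_start|>hi<|im_end|>", SPECIAL_TOKENS, byte_encode))
# [256, 104, 105, 257]
```

Because the special strings are matched before ordinary encoding, text supplied by a user that happens to contain them would also be turned into control tokens, so applications typically decide explicitly whether special tokens are allowed in a given input.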
Chat Templates
Instruction-tuned chat models wrap conversations in a specific template of special tokens. The model is trained to recognize these markers, so getting the template exactly right at inference is essential — a mismatched template can degrade quality dramatically. This is why you should use the tokenizer's built-in apply_chat_template rather than constructing prompts by hand.
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
messages = [
    {'role': 'system', 'content': 'You are helpful.'},
    {'role': 'user', 'content': 'What is 2+2?'},
]
# Let the tokenizer apply the model's exact chat template
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <s>[INST] <<SYS>>
# You are helpful.
# <</SYS>>
# What is 2+2? [/INST]
# NEVER hand-build this -- the exact special tokens and whitespace matter.
# A wrong template can severely degrade a chat model's responses.
```

Tokenizers are trained on a corpus, and that corpus is overwhelmingly English. The consequence is a quiet but real unfairness: the same meaning costs far more tokens in some languages than others, which translates directly into higher latency, higher cost, and less effective context for speakers of under-represented languages.
The Token Inflation Problem
Because BPE merges are learned from frequency, English subwords dominate the vocabulary. Languages with different scripts (Thai, Burmese, Telugu) or that were under-represented in training fragment into many more tokens for the same content. A sentence that costs 10 tokens in English might cost 30–50 in a low-resource language.
| Language | Relative token cost | Why |
|---|---|---|
| English | 1.0× (baseline) | Tokenizer trained mostly on English |
| Spanish, French | ~1.1–1.3× | Latin script, well-represented |
| Chinese, Japanese | ~1.5–2.5× | Multi-byte chars, fewer merges |
| Hindi, Arabic | ~2–3× | Different scripts, under-represented |
| Burmese, Telugu | ~5–10× | Rare scripts, near byte-level |
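The "multi-byte chars" and "near byte-level" entries can be made concrete. With a byte-level tokenizer, a script with few learned merges falls back toward one token per UTF-8 byte, so bytes per character gives a rough worst case. The sketch below measures only that worst case on a few sample words; it does not call any real tokenizer, and the samples are chosen for illustration.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

# Illustrative greetings; any text in the same scripts gives the same ratios.
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
}

for language, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{language:>8}: {len(text)} code points, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes/char")
```

ASCII text costs one byte per character, while Chinese, Devanagari, and Thai characters cost three each. Learned merges recover most of that gap for well-represented languages and much less of it for rare scripts.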
One of the strangest tokenization phenomena is glitch tokens: tokens that exist in the vocabulary but were almost never seen during model training, leaving their embeddings essentially random. When prompted with such a token, the model behaves erratically — refusing to repeat it, hallucinating, or producing nonsense.
The SolidGoldMagikarp Phenomenon
The most famous example is 'SolidGoldMagikarp' — a Reddit username that became a single GPT-2/GPT-3 token because it appeared frequently in the tokenizer-training data (Reddit) but rarely in the model-training data. Its embedding was barely updated, so the model could not process it: asked to repeat 'SolidGoldMagikarp', GPT-3 would output unrelated words, evade, or break.
| Pathology | Cause |
|---|---|
| Glitch tokens | Token frequent in tokenizer data, rare in training data |
| Trailing-space sensitivity | ' the' and 'the' are different tokens; prompts can mismatch |
| Tokenization of own output | Model output re-tokenized differently than intended |
| Prompt-boundary effects | Where a word splits depends on surrounding context |
| Repeated-token degeneration | Some token sequences trigger repetitive loops |
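The glitch-token cause in the first row suggests a simple check: count how often each vocabulary entry actually occurs in the tokenized model-training data and flag the ones that almost never do. The sketch below is a toy version with a hypothetical four-token vocabulary and a made-up function name; it is not how glitch tokens were originally discovered.

```python
from collections import Counter

def find_undertrained_tokens(vocab, training_token_ids, min_count=1):
    """Return tokens that occur fewer than min_count times in the training IDs."""
    counts = Counter(training_token_ids)
    return [token for token, token_id in vocab.items()
            if counts[token_id] < min_count]

# Hypothetical vocabulary: the last entry came from the tokenizer's corpus
# but never shows up in the data the model itself was trained on.
vocab = {"the": 0, " cat": 1, " sat": 2, " SolidGoldMagikarp": 3}
training_ids = [0, 1, 2, 0, 1, 0]

print(find_undertrained_tokens(vocab, training_ids))
# [' SolidGoldMagikarp']
```

A token flagged this way has an embedding that received few or no gradient updates, which is the condition the table describes.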
Tokenization is the first and last step of every interaction with a language model. It is worth seeing the complete round trip — from raw text to token IDs to embeddings, through the model, and back to text — to appreciate how the tokenizer interfaces with everything you built in Chapters 12 and 13.
Arch Stack: The full text round-trip
| Stage (output at top, input at bottom) | Operation / shape |
|---|---|
| output text | decode |
| token IDs (argmax/sample) | (T,) |
| logits over vocabulary | (T, V) |
| Transformer (Ch. 13) | the model |
| embeddings | (T, d) |
| token IDs | (T,) |
| input text | encode (BPE) |
Shape Trace: Encode → model → decode (T tokens, vocab V, dim d)
| Operation | Shape | Note |
|---|---|---|
| raw text | string | human input |
| BPE encode | (T,) | integer token IDs |
| embedding lookup | (T, d) | E[token_ids] |
| Transformer | (T, d) | contextual representations |
| LM head | (T, V) | logits over vocabulary |
| sample / argmax | (T,) | predicted token IDs |
| BPE decode | string | human-readable output |
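The shape trace can be followed end to end with plain Python lists. Everything in the sketch below is a stand-in: the six-token vocabulary is hypothetical, the embeddings are random, the "Transformer" is the identity function, and the LM head reuses the embedding matrix. Only the shapes are meaningful, not the predictions.

```python
import random

random.seed(0)
vocab = ["<unk>", "low", "est", "new", "wid", "er"]  # hypothetical toy vocabulary
token_to_id = {t: i for i, t in enumerate(vocab)}
V, d = len(vocab), 4

def shape(x):
    """Shape of a nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

# Embedding matrix E with shape (V, d), randomly initialized
E = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]

token_ids = [token_to_id[t] for t in ["low", "est"]]  # encode         -> (T,)
x = [E[i] for i in token_ids]                          # embedding      -> (T, d)
h = x                                                  # model stand-in -> (T, d)
logits = [[sum(a * b for a, b in zip(row, e)) for e in E]
          for row in h]                                # LM head        -> (T, V)
pred_ids = [max(range(V), key=lambda v: row[v])
            for row in logits]                         # argmax         -> (T,)
text = "".join(vocab[i] for i in pred_ids)             # decode         -> string

for name, value in [("token_ids", token_ids), ("x", x), ("h", h),
                    ("logits", logits), ("pred_ids", pred_ids)]:
    print(f"{name:>9}: {shape(value)}")
# token_ids: (2,)
#         x: (2, 4)
#         h: (2, 4)
#    logits: (2, 6)
#  pred_ids: (2,)
```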
Tokenizer Quick-Reference
| Method | Builds by | Used in | Note |
|---|---|---|---|
| BPE | Merging frequent pairs | GPT family | Most common |
| Byte-level BPE | BPE over 256 bytes | GPT-2+, LLaMA-3 | No OOV ever |
| WordPiece | Merging by likelihood | BERT | ## prefix for subwords |
| Unigram | Pruning a large vocab | T5, ALBERT | Probabilistic segmentation |
| SentencePiece | Wraps BPE/Unigram | Multilingual | Language-agnostic |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “Neural Machine Translation of Rare Words with Subword Units” (Sennrich et al., 2016) — BPE for NLP. “Subword Regularization” and “SentencePiece” (Kudo, 2018; Kudo & Richardson, 2018) for Unigram and the library. “Language Model Tokenizers Introduce Unfairness Between Languages” (Petrov et al., 2023). Andrej Karpathy's “Let's build the GPT Tokenizer” video and his minbpe repository — the best hands-on companion. The tiktoken and Hugging Face tokenizers documentation for production use.
Next → Chapter 15: Training Transformers
You now have a complete Transformer (Chapter 13) and know how to feed it text (this chapter). Chapter 15 covers how to actually train one: the next-token prediction objective at scale, learning-rate schedules with warmup and decay, the AdamW optimizer in practice, gradient accumulation and clipping, mixed-precision training, and the practical recipe for a stable training run. We bring together the optimization of Chapter 2, the numerical stability of Chapter 5, and the architecture of Chapter 13 into a working training pipeline.