Part VI: Productionization

Chapter 27

Inference Optimization

Making a trained model fast and cheap to run: the KV cache, prefill vs decode, quantization (GPTQ, AWQ, GGUF), PagedAttention, continuous batching, and speculative decoding.

22 Exercises

Learning Objectives

1.	Explain why inference is a distinct problem from training, and the cost shift it creates.
2.	Distinguish the prefill and decode phases and why each is bottlenecked differently.
3.	Understand the KV cache as the central object of inference optimization.
4.	Measure inference with the right metrics: TTFT, TPOT, latency, and throughput.
5.	Apply quantization (int8/int4) and understand GPTQ, AWQ, and GGUF.
6.	Understand PagedAttention and how it eliminates KV-cache memory waste.
7.	Explain continuous batching and why it dramatically raises throughput.
8.	Understand speculative decoding and why it speeds up generation for free.
9.	Choose appropriate decoding/sampling strategies.
10.	Assemble an optimized serving setup with engines like vLLM.

We have spent the whole book building and aligning a model. Now we have to RUN it — serve it to users, fast and cheaply. This turns out to be a completely different engineering challenge from training, with its own bottlenecks, its own metrics, and its own bag of tricks. Part VI begins here, with the techniques that make a trained model practical to deploy.

The Cost Shift: Training Is One-Time, Inference Is Forever

Here is the key economic fact. Training a model is a ONE-TIME cost — expensive, but paid once. Inference is a RECURRING cost — paid for every single token the model ever generates, for every user, forever. A popular model serves billions of tokens per day. Over a model's deployed life, the total inference cost dwarfs the training cost. This is why inference optimization matters so much: shaving 20% off inference cost saves far more, in absolute terms, than almost any training optimization.

✧

Infer Note: Inference Dominates Total Cost

For a widely-deployed model, the lifetime inference compute vastly exceeds the training compute. A model trained once for a few weeks may then serve queries for years. This is the opposite of the intuition many beginners start with — training feels like 'the expensive part', but for any successful deployed model, inference is where the money goes.

This cost structure drives the entire field of inference optimization. Every technique in this chapter — quantization, batching, caching, speculative decoding — exists to reduce the per-token cost of serving, because that cost is paid astronomically many times.

Two Goals in Tension: Latency and Throughput

Inference optimization juggles two goals that often conflict. LATENCY is how fast a single user gets their response — it matters for interactive use (a chatbot must feel responsive). THROUGHPUT is how many tokens you can generate across all users per second — it determines cost-efficiency (more throughput means serving more users per GPU). Many techniques trade one for the other, and the right balance depends on the application.

Latency (single-user speed)	Throughput (total tokens/sec)
How fast ONE response arrives	How many tokens across ALL users
Matters for interactivity	Matters for cost-efficiency
Small batches are better	Large batches are better
A chatbot feels snappy	Serve more users per GPU
Optimize per-request	Optimize aggregate

✧

Intuition: Why Latency and Throughput Conflict

Imagine a bus versus a taxi. A taxi (small batch) gets one passenger to their destination fastest — low latency. A bus (large batch) moves the most people per trip — high throughput — but each passenger waits longer for the bus to fill and stop along the way. Serving LLMs faces the same trade-off: batching many requests together uses the GPU efficiently (throughput) but can make any single request wait (latency).

Much of inference optimization is about getting the best of both — high throughput without sacrificing too much latency. Continuous batching (Section 27.8) is the clever scheduling that makes this possible, like a bus that picks up and drops off passengers without ever stopping.

To optimize inference, you must understand that generating a response happens in TWO distinct phases with very different performance characteristics. Confusing them is a common beginner mistake; distinguishing them is the foundation for everything else in this chapter.

Phase 1: Prefill (Processing the Prompt)

When you send a prompt, the model first PROCESSES the entire prompt in one go — this is prefill. All the prompt's tokens are fed through the model together, in parallel, to produce the first output token and to populate the KV cache (Section 27.3). Because all prompt tokens are processed at once, prefill is COMPUTE-BOUND: it does a lot of matrix multiplication and keeps the GPU's compute units busy.

Phase 2: Decode (Generating Tokens One by One)

After prefill, the model GENERATES the response one token at a time — this is decode. Each step takes the single most recent token, runs it through the model, and produces the next token, repeating until done. Because each step processes only ONE token, decode is MEMORY-BANDWIDTH-BOUND: the tiny computation for one token is dwarfed by the time spent reading the model weights and KV cache from memory.

Tool Trace: The two phases of generating a response

User	Sends prompt: 'Explain photosynthesis' (5 tokens)	→
Prefill	Process all 5 prompt tokens at once → first output token + KV cache	•
Decode	Generate token 2, attending to the cached prompt	•
Decode	Generate token 3, attending to all previous	•
Decode	... continue one token at a time until <end>	•
User	Receives the full streamed response	←

Property	Prefill	Decode
Processes	All prompt tokens at once	One token at a time
Bottleneck	Compute (matmul)	Memory bandwidth
GPU utilization	High	Low (underutilized)
Parallelism	Across prompt tokens	None (sequential)
Determines	Time to first token	Time per output token
Optimized by	Chunking, fusion	Batching, smaller weights

✧

Infer Note: Decode Is the Expensive Part — and It Wastes the GPU

Here is the crucial insight: decode generates one token per forward pass, and that single-token forward pass barely uses the GPU's enormous compute capacity — it spends almost all its time READING the model weights from memory. The GPU's compute units sit mostly idle. This memory-bound nature of decode is THE central problem of LLM inference, and most optimization techniques exist to address it.

Two big families of fixes follow directly: (1) make the weights SMALLER so there is less to read (quantization, Sections 27.5–27.6), and (2) read the weights ONCE but use them for MANY requests at the same time (batching, Section 27.8). Keep this framing in mind — it explains why every technique in this chapter works.
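
A back-of-the-envelope calculation makes the compute-bound versus memory-bound split concrete. The sketch below assumes a dense model in which every weight is read once per forward pass and contributes about 2 FLOPs per token processed; KV-cache traffic is ignored. The hardware figures (300 TFLOP/s, 2 TB/s) are illustrative placeholders, not the specification of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (multiply + add) per weight per token, and that the
    weights are read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


# Hypothetical accelerator (illustrative numbers only).
PEAK_FLOPS = 300e12                      # 300 TFLOP/s
PEAK_BANDWIDTH = 2e12                    # 2 TB/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH      # FLOPs/byte needed to keep compute busy


def bound(tokens_per_pass: int, bytes_per_weight: float = 2.0) -> str:
    """Classify a forward pass as compute-bound or memory-bound."""
    ai = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute-bound" if ai >= RIDGE else "memory-bound"


print("ridge point:", RIDGE, "FLOPs/byte")
print("prefill, 2048-token prompt:", bound(2048))
print("decode, batch of 1:        ", bound(1))
print("decode, batch of 64:       ", bound(64))
```

Under these assumptions, batching is what raises tokens_per_pass during decode, which is why it moves decode toward the compute-bound regime, while smaller weights reduce the bytes that must be read in the first place.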

We met the KV cache in Chapters 13 and 19. It is so central to inference that we revisit it here as the object around which all of inference optimization revolves. Understanding it deeply makes the rest of the chapter click into place.

Why the Cache Exists

Recall how attention works: each new token attends to ALL previous tokens, using their keys (K) and values (V). Without a cache, generating each new token would require recomputing the keys and values for every previous token, which is enormously wasteful, since those don't change. The KV CACHE stores the keys and values of all previous tokens, so each new token only computes ITS OWN K and V and reads the rest from the cache. At sequence length T, this cuts the attention cost of each generated token from O(T²) (re-running attention over the whole sequence) to O(T) (one new query against T cached keys), and it is what makes generation practical.

KV cache

A store of the key and value vectors of all previously-processed tokens, kept so that generating each new token does not require recomputing them. It grows by one entry per generated token.
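
To see that caching changes the amount of work but not the result, here is a minimal single-head sketch in plain Python. The "projections" are toy elementwise scalings with made-up weights (W_K, W_V are hypothetical), and each token doubles as its own query; a real model uses learned matrices across many layers and heads.

```python
import math

W_K, W_V = [0.5, -1.0, 2.0], [1.0, 0.3, -0.7]   # hypothetical fixed "weights"

def project(x, w):
    """Tiny stand-in for a K or V projection: elementwise scaling of x."""
    return [xi * wi for xi, wi in zip(x, w)]

def attend(q, keys, values):
    """Single-head softmax attention of one query over all keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]

def step_without_cache(tokens):
    """Re-project K and V for every token seen so far, then attend."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(tokens[-1], keys, values), len(tokens)   # output, tokens projected

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Project only the new token, append it, attend over the cache."""
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(token, self.keys, self.values), 1

tokens = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [2.0, -0.5, 0.0], [0.0, 1.0, 1.0]]
cache, work_naive, work_cached = KVCache(), 0, 0
for t in range(1, len(tokens) + 1):
    out_naive, n_naive = step_without_cache(tokens[:t])
    out_cached, n_cached = cache.step(tokens[t - 1])
    assert all(abs(a - b) < 1e-12 for a, b in zip(out_naive, out_cached))
    work_naive += n_naive
    work_cached += n_cached
print("tokens projected without cache:", work_naive, "with cache:", work_cached)
```

The outputs match at every step; only the number of projections differs, growing quadratically with sequence length without the cache and linearly with it.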

The Problem: The Cache Is Huge

The KV cache solves a compute problem but creates a MEMORY problem. It grows with every token generated, and at long context lengths and large batch sizes it becomes the dominant consumer of GPU memory — often larger than the model weights themselves. Recall the formula from Chapter 19:

text•KV-cache memory
cache = 2 · layers · kv_heads · head_dim · seq_len · batch · bytes

# the 2 is for Keys AND Values
Example: 13B model (40 layers, 40 heads, 128 dim), 4k ctx, batch 16, fp16:
    2 × 40 × 40 × 128 × 4096 × 16 × 2 ≈ 54 GB

About 54 GB of KV cache for one batch: roughly twice the ~26 GB that the 13B model's fp16 weights occupy, and together with them enough to fill an 80 GB GPU. This is why so much of inference optimization targets the cache: shrinking it (GQA from Chapter 19, quantizing it), managing its memory efficiently (PagedAttention, Section 27.7), and reusing it across requests (prefix caching, Section 27.10). The KV cache is the resource that everything fights over.
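
The formula is easy to wrap in a small helper so you can plug in your own model. The function name and the GQA variant with 8 KV heads are illustrative assumptions; the 13B configuration is the one from the example above, for which the product comes to about 53.7 GB.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV-cache size in bytes; the leading 2 covers keys AND values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# 13B example: 40 layers, 40 KV heads, head_dim 128, 4k context, batch 16, fp16
full = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=4096, batch=16)
print(f"full multi-head cache: {full / 1e9:.1f} GB")

# Cost of a single token of a single sequence
per_token = kv_cache_bytes(40, 40, 128, seq_len=1, batch=1)
print(f"per token: {per_token / 1e6:.2f} MB")

# Hypothetical GQA variant of the same model with only 8 KV heads
gqa = kv_cache_bytes(40, 8, 128, 4096, 16)
print(f"with 8 KV heads: {gqa / 1e9:.1f} GB")
```

Because the formula is a plain product, every factor is a lever: fewer KV heads, fewer bytes per value, shorter contexts, or smaller batches each shrink the cache proportionally.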

✧

Infer Note: Three Things Compete for GPU Memory at Inference

At inference, GPU memory holds three things: (1) the MODEL WEIGHTS (fixed size), (2) the KV CACHE (grows with tokens and batch size), and (3) ACTIVATIONS (small at decode time). The weights are shrunk by quantization; the cache is the variable that limits how many requests you can batch and how long the context can be.

More memory freed from weights (via quantization) means more room for KV cache, which means larger batches and longer contexts — higher throughput. This is why quantization and batching are deeply connected: shrinking the weights directly enables more concurrent requests.
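
A rough budget calculation shows the connection between weight size and batch size. This is a sketch under stated assumptions: a hypothetical 80 GB accelerator, a 10% reserve for activations and allocator overhead (an arbitrary margin chosen for illustration), the 13B model's per-token cache cost of 2 · 40 · 40 · 128 · 2 bytes, and every sequence growing to the full 4k context.

```python
def max_concurrent_sequences(gpu_bytes, weight_bytes, kv_bytes_per_token,
                             context_len, reserve_fraction=0.1):
    """How many full-length sequences fit once the weights are loaded.

    reserve_fraction is an assumed safety margin for activations and
    allocator overhead; real engines expose their own settings for this.
    """
    free = gpu_bytes * (1 - reserve_fraction) - weight_bytes
    if free <= 0:
        return 0
    return int(free // (kv_bytes_per_token * context_len))


GPU = 80e9                  # hypothetical 80 GB accelerator
KV_PER_TOKEN = 819_200      # 13B example: 2 * 40 * 40 * 128 * 2 bytes

fp16 = max_concurrent_sequences(GPU, 26e9, KV_PER_TOKEN, 4096)    # ~26 GB weights
int4 = max_concurrent_sequences(GPU, 6.5e9, KV_PER_TOKEN, 4096)   # ~6.5 GB weights
print("fp16 weights:", fp16, "sequences; int4 weights:", int4, "sequences")
```

In this setup, quantizing the weights alone raises the number of concurrent full-length sequences by roughly half, before any change to the cache itself.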

You cannot optimize what you cannot measure. Inference has a specific vocabulary of metrics, and using the right ones is essential. Let us define them carefully, because they map directly onto the prefill/decode phases from Section 27.2.

Metric	Meaning	Driven by
TTFT	Time To First Token — prompt sent to first token out	Prefill speed
TPOT	Time Per Output Token — gap between successive tokens	Decode speed
Latency	Total time for the full response	TTFT + TPOT × tokens
Throughput	Total output tokens/sec across all requests	Batching, efficiency
Goodput	Throughput meeting latency targets	Both, balanced

How the Metrics Connect

These metrics fit together simply. The total time a user waits for a complete response is roughly the time to the first token (TTFT, set by prefill) plus the time per subsequent token (TPOT, set by decode) times the number of tokens generated. For an interactive chatbot, a low TTFT makes it feel responsive (the answer starts quickly), and a low TPOT makes it read smoothly (tokens stream fast enough to read).

text•Total latency
latency ≈ TTFT  +  TPOT × (output tokens)

TTFT: how long until the FIRST token appears (prefill)
TPOT: time between each token after that (decode)

# A chatbot wants low TTFT (feels responsive) AND low TPOT (reads smoothly).
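
In practice these metrics are computed from timestamps recorded as tokens stream back. The helper below is a minimal sketch; the function name and the idea of passing raw wall-clock times in seconds are assumptions made for illustration, not the interface of any particular benchmarking tool.

```python
def latency_metrics(request_sent: float, token_times: list) -> dict:
    """Compute TTFT, mean TPOT and total latency from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_sent
    # TPOT is the average gap between successive output tokens
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "latency": token_times[-1] - request_sent}


# Request sent at t = 0; five tokens arrive at these times
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Note that the first token's arrival is already counted in TTFT, so the exact identity is latency = TTFT + TPOT × (output tokens − 1); for long responses the difference from the approximate formula is negligible.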

✧

Infer Note: Different Apps Want Different Metrics

The right metric depends on the use case. An interactive chatbot cares most about TTFT and TPOT — user-perceived responsiveness. A batch job processing millions of documents overnight cares only about THROUGHPUT — total tokens per dollar, with latency irrelevant. A coding autocomplete needs ultra-low TTFT. Knowing which metric matters for YOUR application tells you which optimizations to prioritize.

This is why there is no single 'fastest' configuration: the optimal setup for a low-latency chatbot (small batches, speculative decoding) differs from the optimal setup for a high-throughput batch job (large batches, maximum quantization). Match the optimization to the metric that matters.

The single most impactful inference optimization is QUANTIZATION: representing the model's weights (and sometimes activations and the KV cache) with fewer bits. A model trained in 16-bit precision can often be run in 8-bit or even 4-bit with little quality loss — halving or quartering its memory footprint and, because decode is memory-bound (Section 27.2), speeding it up substantially.

Why Quantization Helps So Much

Recall that decode is bottlenecked by READING the weights from memory, not by computing with them. If you store the weights in 4 bits instead of 16, there is 4× LESS data to read per token — so decode gets up to 4× faster, AND the model takes 4× less memory (leaving more room for KV cache and bigger batches). Quantization attacks the exact bottleneck of inference.

Precision	Bits/weight	Memory (7B model)	Quality
fp16 / bf16	16	~14 GB	Baseline (full)
int8	8	~7 GB	Near-lossless
int4	4	~3.5 GB	Small loss, usually fine
int3 / int2	3 / 2	~2.6 / 1.8 GB	Noticeable degradation

The Basic Idea: Map Floats to a Small Set of Integers

Quantization maps a range of floating-point values onto a small set of integers. Instead of storing each weight as a 16-bit float, you store a scale factor per group of weights and represent each weight as a small integer that, multiplied by the scale, approximates the original. The art is choosing the mapping so that the approximation loses as little quality as possible.

text•Basic quantization (per-group)
For a group of weights w with max absolute value M:
    scale = M / (2^(bits-1) - 1)
    q = round(w / scale)        # small integer
    w ≈ q × scale               # dequantized approximation

# Store q (few bits) + scale (one per group). Reconstruct w when needed.

Python•Simple int8 quantization from scratch
import torch

def quantize_int8(weights, group_size=128):
    """Per-group symmetric int8 quantization."""
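    w = weights.float().reshape(-1, group_size)   # groups of weights (numel must divide evenly)
    # One scale per group, from the group's max magnitude; the clamp keeps
    # an all-zero group from producing a zero scale and a 0/0 division.
    scale = (w.abs().max(dim=1, keepdim=True).values / 127).clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)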
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate fp weights for computation."""
    return (q.float() * scale)

# int8: store q (1 byte) + a scale per 128 weights, vs 2 bytes/weight.
# ~2x smaller, near-lossless. int4 (4 bits) is ~4x smaller with care.
# The trick is choosing scales/grouping to minimize error -- that's what
# the methods in the next section (GPTQ, AWQ) do cleverly.
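
It helps to measure what the round trip actually costs. The sketch below redoes the same per-group symmetric scheme in plain Python for any bit-width and checks the error against its theoretical bound of half a quantization step. The weights are synthetic (Gaussian with an arbitrary standard deviation of 0.02), so the numbers illustrate the trend rather than any real model.

```python
import random

def quantize_group(weights, bits):
    """Symmetric quantization of one group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                              # 127 for int8, 7 for int4
    scale = (max(abs(w) for w in weights) / qmax) or 1.0    # guard an all-zero group
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def roundtrip_error(weights, bits):
    """Worst-case |w - q * scale| over the group, plus the scale used."""
    q, scale = quantize_group(weights, bits)
    worst = max(abs(w - qi * scale) for w, qi in zip(weights, q))
    return worst, scale

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]       # synthetic weights

for bits in (8, 4):
    worst, scale = roundtrip_error(group, bits)
    print(f"int{bits}: worst error {worst:.6f}, bound scale/2 = {scale / 2:.6f}")
```

Each bit removed doubles the step size and therefore the worst-case error, which is why naive rounding degrades quickly below 8 bits and why smarter methods are needed at 4 bits.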

Weight-Only vs Weight-and-Activation

Most LLM inference quantization is WEIGHT-ONLY: only the weights are stored in low precision, while computation happens in higher precision (the weights are dequantized on the fly). This is because weights are the memory bottleneck at decode time, and quantizing activations too is harder (activations have outliers that are sensitive to precision loss). Weight-only int4 is the sweet spot for most deployments.

✧

Infer Note: Quantization Is Almost Always Worth It

For most deployments, int8 quantization is essentially free — near-lossless quality, half the memory, faster decode. int4 costs a small, often imperceptible quality drop for another 2× saving. Given how dramatically it cuts the dominant cost (memory bandwidth at decode), quantization is the first optimization to reach for, and the methods in the next section make it remarkably effective.

The one caution: quantization can hurt more on tasks requiring precise reasoning or on already-small models, where every bit of precision matters more. Always evaluate quality on YOUR task after quantizing — but for most uses, the trade is overwhelmingly favorable.

The naive quantization of Section 27.5 works, but smarter methods lose far less quality at the same bit-width. Three names dominate practice — GPTQ, AWQ, and GGUF — and beginners are often confused about how they differ. Let us clear that up.

Method	What it is	Best for
GPTQ	Post-training quantization minimizing layer-wise error	GPU inference, 4-bit
AWQ	Activation-aware: protects the most important weights	GPU inference, quality
GGUF	A file FORMAT for quantized models (llama.cpp)	CPU / local / Mac inference
bitsandbytes	On-the-fly int8/int4 (used in QLoRA)	Easy, training + inference

GPTQ: Minimizing Quantization Error

GPTQ (Frantar et al., 2022) is a clever post-training quantization method. Instead of quantizing each weight independently, it quantizes them in a way that MINIMIZES the resulting error on a small calibration dataset, adjusting the remaining weights to compensate for the error introduced by each quantized weight. This careful, error-aware approach lets GPTQ quantize to 4 bits with minimal quality loss. It is one of the most widely-used 4-bit methods for GPU inference.

AWQ: Protecting the Important Weights

AWQ (Activation-aware Weight Quantization; Lin et al., 2023) is based on a key observation: not all weights are equally important. A small fraction of weights — those that multiply large activations — matter disproportionately for quality. AWQ identifies these important weights (by looking at activation magnitudes) and protects them from quantization error by scaling, while aggressively quantizing the rest. This activation-awareness often gives slightly better quality than GPTQ at the same bit-width.

✧

Intuition: Why 'Activation-Aware' Matters

Imagine compressing a photo. You would not compress every pixel equally — you'd preserve the detail in the important regions (a face) and compress the unimportant ones (a blurry background) more. AWQ does this for weights: it spends precision where it matters (weights that drive large activations) and saves it where it doesn't. The result is better quality at the same average bit-width.

GPTQ and AWQ both exploit the same underlying truth — that quantization error is not uniformly costly — just in different ways. GPTQ compensates for error as it goes; AWQ protects the weights where error would hurt most. Both dramatically outperform naive uniform quantization.
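
The scaling idea can be demonstrated on a toy example. This is NOT the AWQ algorithm, which searches for per-channel scales using calibration data; it is a simplified sketch of the underlying effect, with made-up activations, a fixed scale of 2 on the one salient channel, and one weight pinned at 1.0 so the group's quantization step stays the same in both cases.

```python
import random

def quant_dequant(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def output_error(w, x, s):
    """|w.x - w_hat.x| when weights are multiplied by s before quantization
    and activations are divided by s to compensate."""
    scaled_w = [wi * si for wi, si in zip(w, s)]
    scaled_x = [xi / si for xi, si in zip(x, s)]
    w_hat = quant_dequant(scaled_w)
    exact = sum(wi * xi for wi, xi in zip(w, x))
    approx = sum(wi * xi for wi, xi in zip(w_hat, scaled_x))
    return abs(exact - approx)

random.seed(0)
x = [20.0, 0.1, 0.1, 0.1]            # channel 0 carries a large activation
trials = 2000
plain = scaled = 0.0
for _ in range(trials):
    # Last weight pinned at 1.0 so the group's max (and step size) is unchanged
    w = [random.uniform(-0.4, 0.4) for _ in range(3)] + [1.0]
    plain += output_error(w, x, s=[1.0, 1.0, 1.0, 1.0])
    scaled += output_error(w, x, s=[2.0, 1.0, 1.0, 1.0])
print(f"mean output error, plain: {plain / trials:.4f}, scaled: {scaled / trials:.4f}")
```

Scaling the salient weight up before rounding leaves its absolute rounding error about the same, but dividing the activation by the same factor shrinks that error's effect on the output. The catch, which the real method handles by searching for the scale, is that too large a factor raises the group's maximum and coarsens the grid for every other weight.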

GGUF: A Format, Not an Algorithm

GGUF is frequently confused with GPTQ and AWQ, but it is a different KIND of thing: it is a FILE FORMAT (used by llama.cpp), not a quantization algorithm. GGUF files store quantized models in a portable format optimized for running on CPUs, laptops, and Apple Silicon — enabling LLMs to run locally on consumer hardware without a GPU. GGUF supports many quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.), each a different bit-width and scheme. When you download a model to run locally on a Mac or PC, it is usually GGUF.

You want to...	Use
Run on a GPU, 4-bit, good quality	GPTQ or AWQ
Run locally on a Mac/PC/CPU	GGUF (via llama.cpp / Ollama)
Fine-tune with quantization (QLoRA)	bitsandbytes (NF4)
Maximum throughput on GPU	AWQ/GPTQ with a serving engine (vLLM)

✧

Infer Note: GPTQ/AWQ vs GGUF: Algorithm vs Format

The cleanest way to keep these straight: GPTQ and AWQ are ALGORITHMS that decide HOW to quantize (which weights, what scales) for GPU inference. GGUF is a FILE FORMAT that stores a quantized model for CPU/local inference with llama.cpp. They answer different questions — 'how do I quantize well?' versus 'what file do I ship?' — and are not mutually exclusive.

For deployment: on GPUs in a datacenter, you'll typically use GPTQ or AWQ quantized models served by an engine like vLLM. On a laptop or for local/hobbyist use, you'll typically download a GGUF file and run it with llama.cpp or Ollama. Knowing this map saves a lot of beginner confusion.

We saw in Section 27.3 that the KV cache is a huge consumer of memory. But there is a second, subtler problem: traditional serving WASTES much of the memory it allocates to the cache. PagedAttention (Kwon et al., 2023), the technique behind the vLLM serving engine, solves this and was a major leap in serving efficiency.

The Waste Problem

Traditionally, when a request arrives, the server allocates one big CONTIGUOUS block of memory for its KV cache, sized for the maximum possible length. But most responses are far shorter than the maximum, so most of that reserved block sits empty and unusable by other requests. This is INTERNAL FRAGMENTATION: memory reserved but wasted. Kwon et al. (2023) measured that existing serving systems used only about 20–40% of their KV-cache memory for actual token state; the remaining 60–80% was lost to over-reservation and fragmentation.

✧

Intuition: The Operating-System Analogy

PagedAttention borrows a 50-year-old idea from operating systems: VIRTUAL MEMORY and PAGING. Your computer does not give each program one giant contiguous block of RAM; it hands out small fixed-size PAGES on demand and uses a page table to map them. This lets many programs share memory efficiently, with no need to pre-reserve huge contiguous blocks.

PagedAttention does exactly this for the KV cache. Instead of one big contiguous block per request, it stores the cache in small fixed-size BLOCKS (pages), allocated on demand as the response grows, with a block table mapping logical positions to physical blocks. A request uses only the blocks it actually needs — no waste from over-reservation.

How PagedAttention Works

text•PagedAttention (conceptual) (Pseudocode)
# KV cache stored in small fixed-size blocks (like OS memory pages)
on each new token:
    if the current block is full:
        allocate a NEW block on demand (from a shared pool)
    append the token's K,V to the current block
    update the block table: logical position -> physical block

# Attention reads K,V through the block table (gather from blocks)
# No request over-reserves; freed blocks return to the shared pool

The payoff is large. By allocating cache memory in small blocks on demand, PagedAttention nearly eliminates the wasted memory, letting the server fit FAR more concurrent requests in the same GPU memory. The vLLM authors reported up to 24× higher throughput than a plain Hugging Face Transformers baseline, and 2–4× over earlier dedicated serving systems at comparable latency, largely from this. More requests fit, so larger batches are possible, so throughput soars.
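
The bookkeeping behind the pseudocode can be sketched in a few lines. This toy allocator tracks only block ids and lengths, not real K/V tensors, and its class and method names are invented for illustration; it is not vLLM's implementation.

```python
class PagedKVCache:
    """Toy block-table allocator for KV-cache slots."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
        self.block_tables = {}                       # request id -> [physical block ids]
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, request_id: str) -> int:
        """Reserve a slot for one more token; returns the physical block used."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:            # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[request_id] = length + 1
        return table[-1]

    def locate(self, request_id: str, position: int):
        """Map a logical token position to (physical block, offset in block)."""
        table = self.block_tables[request_id]
        return table[position // self.block_size], position % self.block_size

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("A")          # 6 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("B")          # 3 tokens occupy 1 block
print(cache.block_tables, "free blocks:", len(cache.free_blocks))
cache.free("A")
print("after A finishes, free blocks:", len(cache.free_blocks))
```

The most a request can waste is the unused tail of its last block, so a smaller block size means less waste at the price of a longer block table.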

✧

Infer Note: Sharing Pages: Another Win

PagedAttention enables a bonus: multiple requests that share a common prefix (e.g. the same system prompt, or the same few-shot examples) can SHARE the physical blocks for that prefix — 'copy-on-write' style, just like OS shared memory. This avoids storing duplicate KV cache for shared prefixes, saving even more memory. It is the foundation of prefix caching (Section 27.10).

The deeper lesson: many hard systems problems have already been solved in other domains. The KV-cache memory problem looked novel, but the operating-systems community solved memory fragmentation decades ago. Borrowing that solution — paging — was the key insight behind one of the most impactful inference optimizations.

Batching — processing many requests together — is how we get throughput, because it lets us read the weights ONCE and use them for MANY requests (recall the decode bottleneck from Section 27.2). But naive batching wastes the GPU. Continuous batching, the other half of vLLM's magic, fixes this and is one of the most important throughput optimizations.

The Problem with Static Batching

With STATIC (naive) batching, you collect a batch of requests, run them all together until ALL of them finish, then start the next batch. The problem: requests finish at different times — one needs 10 tokens, another needs 500. With static batching, the finished short requests sit idle, wasting their GPU slots, while the batch waits for the longest request. The GPU is underused, and new requests wait for the whole batch to clear.

Static batching	Continuous batching
Batch runs until ALL finish	Each request leaves when IT finishes
Finished requests waste slots	Freed slots immediately reused
New requests wait for whole batch	New requests join mid-flight
GPU often underutilized	GPU stays full
Simple	Higher throughput, more complex

How Continuous Batching Works

Continuous batching (also called in-flight or dynamic batching) operates at the level of individual DECODE STEPS. After every single token-generation step, the scheduler checks: did any request finish? If so, remove it and immediately admit a waiting request into the freed slot. The batch is continuously reshuffled, so the GPU is always working on a full batch of ACTIVE requests — no idle slots waiting for stragglers.

Tool Trace: Continuous batching: the scheduler reshuffles every step

Scheduler	Batch has slots for 4 requests; all 4 active	•
Request A	Finishes at step 12 → leaves the batch	←
Scheduler	Slot freed → admit waiting Request E immediately	→
Request E	Begins prefill and joins the active batch	→
Scheduler	GPU stays full; no slot idles waiting for stragglers	•
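
A small simulation makes the difference measurable. The sketch counts decode steps only, under simplifying assumptions: every request needs at least one output token, prefill is free, and a freed slot can be refilled on the very next step. The request lengths are invented for illustration.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """After every decode step, finished requests leave and waiting ones join."""
    waiting = deque(lengths)
    active = []                                   # tokens still owed to each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:    # fill any freed slots
            active.append(waiting.popleft())
        active = [r - 1 for r in active]          # one decode step for the whole batch
        active = [r for r in active if r > 0]     # finished requests leave immediately
        steps += 1
    return steps

lengths = [2, 8, 3, 4, 2, 3]                      # output tokens needed per request
print("static:    ", static_batching_steps(lengths, slots=2), "steps")
print("continuous:", continuous_batching_steps(lengths, slots=2), "steps")
```

The gap widens as request lengths become more uneven, since static batching pays for the longest request in every batch while continuous batching pays only for work actually done.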

✧

Infer Note: Continuous Batching Is the Throughput Workhorse

Continuous batching, combined with PagedAttention's efficient memory, is why modern serving engines achieve such high throughput. The GPU is kept busy on a full batch at virtually all times, and memory is used efficiently so the batch can be large. Together they can deliver an order of magnitude more throughput than naive serving — the same hardware serving 10×+ more users.

This is the 'bus that never stops' from Section 27.1: requests board and alight continuously while the bus keeps moving at full capacity. It captures much of the throughput benefit of large batches without forcing every request to wait for the slowest one.

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) is a clever technique that speeds up generation with NO loss in output quality: the output provably follows the same distribution as normal decoding, and under greedy decoding it is token-for-token identical. It directly attacks the decode bottleneck (one token per expensive forward pass) by generating SEVERAL tokens per pass of the big model. It feels like magic the first time you see it.

The Core Idea: Draft, Then Verify

The insight: a small, fast 'draft' model can GUESS the next several tokens cheaply, and the big model can VERIFY all those guesses in a SINGLE forward pass (verifying tokens in parallel is cheap: it is just one prefill-like pass). If the guesses are right, you got several tokens for the price of one big-model pass. If a guess is wrong, you discard it from that point and continue. Because the big model only ever ACCEPTS tokens it would have generated anyway (under sampling, via an acceptance rule that preserves its distribution), the output matches normal decoding.

Tool Trace: Speculative decoding: draft then verify

Draft model	Cheaply guesses next 4 tokens: 'the cat sat on'	→
Big model	Verifies all 4 in ONE forward pass	•
Big model	Accepts 'the cat sat', rejects 'on' (would've said 'down')	•
Big model	Got 3 tokens + 1 correction = 4 tokens, 1 big pass	←
Draft model	Drafts the next batch from 'down', repeat	→

Why It Works and When It Helps

Speculative decoding helps because many tokens are EASY to predict — common words, completions of obvious phrases, code boilerplate. The small draft model gets these right most of the time, so the big model verifies several at once. The speedup depends on the ACCEPTANCE RATE: how often the draft's guesses are correct. A good draft model (well-matched to the big one) achieves high acceptance and 2–3× speedups.

text•Speculative decoding speedup
Draft k tokens, big model verifies in 1 pass.
If α = average acceptance rate, expected tokens produced per big-model pass
(accepted drafts plus the one token the big model itself supplies) ≈
    (1 - α^(k+1)) / (1 - α)

# High acceptance (easy tokens) → many tokens per big-model pass → big speedup.
# Output is IDENTICAL to normal decoding — quality is never traded away.
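
Evaluating the formula for a few settings shows how quickly the returns from longer drafts flatten out. The function below assumes, as the formula does, that each drafted token is accepted independently with probability α; the particular values of α and k are arbitrary examples.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per big-model pass with draft length k.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)      # every draft accepted, plus one token from the big model
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    row = [round(expected_tokens_per_pass(alpha, k), 2) for k in (2, 4, 8)]
    print(f"alpha={alpha}: k=2,4,8 -> {row}")
```

This counts tokens per big-model pass, not wall-clock speedup: the draft model's own k forward passes also take time, so the realized speedup is lower and depends on how cheap the drafter is.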

text•Speculative decoding (Pseudocode)
# big model M, small draft model D, draft length k
repeat until done:
    1. D cheaply generates k candidate tokens (autoregressively)
    2. M verifies all k in ONE forward pass (parallel)
    3. accept the longest prefix M agrees with
    4. on the first disagreement, take M's token instead
       (if all k match, M's pass still yields one extra token)
    5. continue drafting from there
# net: several tokens per big-model pass, same output as plain decoding
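
The loop above can be made runnable for the greedy case with two stand-in "models". Both target_next and draft_next are hypothetical deterministic functions invented for this sketch (the drafter deliberately disagrees with the target at every fourth position); a real system would call neural networks and verify all drafted positions in one batched forward pass.

```python
def target_next(context):
    """Hypothetical 'big model': deterministic greedy next token."""
    return (3 * context[-1] + len(context)) % 10

def draft_next(context):
    """Hypothetical 'draft model': agrees with the target most of the time."""
    guess = target_next(context)
    return (guess + 1) % 10 if len(context) % 4 == 0 else guess

def greedy_decode(prompt, n_tokens):
    """Plain decoding: one big-model call per token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]

def speculative_decode(prompt, n_tokens, k=4):
    seq, big_passes = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One big-model pass scores every drafted position (done in
        #    parallel by a real model; emulated here position by position).
        big_passes += 1
        verified = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest agreeing prefix, then take the target's own
        #    token: a correction on mismatch, or a bonus if all k matched.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        seq.extend(draft[:n_ok] + [verified[n_ok]])
    return seq[len(prompt):len(prompt) + n_tokens], big_passes

prompt = [1, 2, 3]
out, passes = speculative_decode(prompt, n_tokens=20)
assert out == greedy_decode(prompt, 20)      # same tokens as plain greedy decoding
print(f"20 tokens in {passes} big-model passes instead of 20")
```

The equality check is the point of the exercise: every emitted token is either a draft the target agreed with or the target's own choice, so the sequence cannot differ from plain greedy decoding. Matching a sampling distribution requires the probabilistic acceptance rule from the cited papers, which this sketch does not implement.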

✧

Infer Note: Variants: Self-Speculation and Medusa

The draft model does not have to be a separate model. Medusa (Cai et al., 2024) adds extra lightweight 'heads' to the big model itself that predict several future tokens, which the model then verifies — no separate draft model needed. Other variants use the model's own earlier layers, or n-gram lookups, as the drafter. All share the draft-then-verify principle.

Speculative decoding is especially valuable for LATENCY-sensitive single-user serving, where batching cannot help (one request can't be batched with itself). It is one of the few techniques that speeds up a single request's decode without any quality cost — which is why it is now standard in serving engines.

Beyond the big techniques, a collection of smaller but valuable optimizations round out a modern inference stack. Each targets a specific inefficiency, and together they add up to substantial gains.

Technique	What it does
Prefix caching	Reuse the KV cache for shared prompt prefixes across requests
Chunked prefill	Split long-prompt prefill into chunks, interleaved with decode
KV-cache quantization	Store the KV cache in int8/int4 to fit more / longer contexts
FlashAttention	IO-aware exact attention kernel (Chapter 20)
Tensor parallelism	Split the model across GPUs for models too big for one (Ch. 18)
CUDA graphs	Capture and replay the decode step to cut kernel-launch overhead
Multi-token prediction	Predict several tokens per step (Medusa-style)

Prefix Caching: Don't Recompute Shared Prompts

Many requests share a common prefix — the same long system prompt, the same few-shot examples, the same document being asked about repeatedly. Prefix caching stores the KV cache for that shared prefix and REUSES it across all requests that share it, instead of recomputing the prefill every time. For applications with a large fixed system prompt, this can dramatically cut TTFT and cost. It builds directly on PagedAttention's block-sharing (Section 27.7).
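
The lookup logic can be sketched with a dictionary keyed on block-aligned prompt prefixes. This is a toy model with invented names: it stores placeholder strings instead of KV tensors, and it keys on the full token tuple, where a real engine would typically use a hash and would also evict old entries.

```python
class PrefixCache:
    """Toy prefix cache keyed on whole blocks of prompt tokens."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks = {}    # tuple(prompt tokens up to a block boundary) -> KV handle

    def prefill(self, prompt):
        """Return (tokens reused from cache, tokens that still need prefill)."""
        reused = 0
        full = len(prompt) - len(prompt) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(prompt[:end])     # a block's KV depends on its entire prefix
            if key in self.blocks:
                reused = end              # hit: this block's KV need not be recomputed
            else:
                self.blocks[key] = f"kv-block-{len(self.blocks)}"   # placeholder handle
        return reused, len(prompt) - reused


cache = PrefixCache(block_size=4)
system_prompt = list(range(1, 11))                        # 10 shared tokens
print(cache.prefill(system_prompt + [100, 101]))          # first request: nothing cached yet
print(cache.prefill(system_prompt + [200, 201, 202]))     # second request reuses 2 blocks
```

Only whole blocks are shared, so the second request reuses 8 of the 10 shared tokens; the block containing tokens 9 and 10 also contains request-specific tokens and must be recomputed.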

Chunked Prefill: Smoothing Out Long Prompts

A very long prompt makes prefill take a long time, which delays the decode of other requests sharing the batch (and spikes TTFT). Chunked prefill splits a long prefill into smaller chunks and INTERLEAVES them with the decode steps of other requests. This keeps decode latency smooth for everyone, rather than letting one giant prompt monopolize the GPU. It balances the compute-heavy prefill against the latency-sensitive decode.

✧

Infer Note: These Compose

The techniques in this chapter are not alternatives: they STACK. A production setup typically uses quantization (smaller weights) + PagedAttention (efficient memory) + continuous batching (full GPU) + prefix caching (reuse shared prompts) + speculative decoding (more tokens per pass) + FlashAttention (fast attention) all at once. Each targets a different inefficiency, so their benefits largely compound, though not always multiplicatively: speculative decoding, for example, helps most at small batch sizes, where the GPU has spare compute for verification.

This is why modern serving engines deliver such enormous gains over a naive implementation: they layer a dozen optimizations, each well-engineered. You rarely implement these yourself — you use an engine that bundles them, which is the subject of Section 27.12.

One more inference topic every practitioner must understand: HOW the next token is chosen from the model's output distribution. The model outputs a probability for every possible next token; the DECODING STRATEGY decides which one to actually emit. This affects output quality, diversity, and even speed, and the knobs (temperature, top-k, top-p) are ones you will tune constantly.

Strategy	How it picks	Effect
Greedy	Always the highest-probability token	Deterministic, can be dull/repetitive
Temperature	Scales the distribution before sampling	Higher = more random/creative
Top-k	Sample from the k most likely tokens	Caps how wild it can get
Top-p (nucleus)	Sample from the smallest set of tokens whose cumulative probability reaches p	Adaptive cutoff
Beam search	Keep top-b sequences, pick the best	Better for closed-ended tasks

Temperature: The Creativity Dial

Temperature is the most important sampling knob. It scales the logits before the softmax: a temperature below 1 SHARPENS the distribution (more confident, more deterministic), while above 1 FLATTENS it (more random, more diverse). As the temperature approaches 0, sampling converges to greedy decoding (always the top token); because dividing by exactly 0 is undefined, implementations treat T = 0 as a special case meaning argmax. For factual tasks you want low temperature (precise, consistent); for creative writing you want higher (varied, surprising).

text•Temperature scaling
P(token) = softmax(logits / T)

T < 1: sharper  → more deterministic, picks likely tokens (factual)
T = 1: the model's natural distribution
T > 1: flatter  → more random, more diverse (creative)
T → 0: equivalent to greedy (always the top token)
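
A three-token example shows the effect numerically. The logits are arbitrary values chosen for illustration, and the function is a plain-Python sketch of the formula above rather than any library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T), computed stably by subtracting the max."""
    if temperature <= 0:
        raise ValueError("use argmax for T = 0; the softmax needs T > 0")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The ranking of tokens never changes with temperature; only how much probability mass the top token takes from the rest.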

Python•Sampling with temperature, top-k, and top-p
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Pick the next token with temperature + top-k + top-p (nucleus)."""
    # 0. Temperature 0 means greedy: take the argmax (avoids dividing by zero)
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    # 1. Temperature: scale logits (lower = sharper)
    logits = logits / temperature

    # 2. Top-k: keep only the k highest-logit tokens
    if top_k:
        k = min(top_k, logits.size(-1))        # k cannot exceed the vocabulary size
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))

    # 3. Top-p (nucleus): keep the smallest set summing to >= p
    probs = F.softmax(logits, dim=-1)
    sorted_p, idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_p, dim=-1)
    mask = cumulative - sorted_p > top_p       # drop the tail beyond p
    sorted_p = sorted_p.masked_fill(mask, 0.0)
    sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True)

    # 4. Sample from the filtered distribution, then map back to vocabulary ids
    choice = torch.multinomial(sorted_p, 1)
    return idx.gather(-1, choice)

# Low temperature + greedy-ish: factual, consistent (Q&A, code).
# Higher temperature + top-p: creative, varied (stories, brainstorming).
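
The strategy table calls top-p an "adaptive cutoff". The following small sketch, in plain Python with two invented distributions, shows what that means: the same `top_p` keeps one token when the model is confident and several when it is unsure. The helper name `nucleus` and the probability values are assumptions for illustration.

```python
def nucleus(probs, top_p):
    """Return indices of the smallest set of tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

peaked = [0.90, 0.05, 0.03, 0.01, 0.01]   # hypothetical: model is confident
flat   = [0.25, 0.22, 0.20, 0.18, 0.15]   # hypothetical: model is unsure

print(len(nucleus(peaked, 0.8)))   # 1 token survives
print(len(nucleus(flat, 0.8)))     # 4 tokens survive
```

A fixed top-k cannot do this: k = 4 would admit three unlikely tokens in the peaked case, while k = 1 would remove all diversity in the flat case.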

✧

Infer Note: Sampling Also Interacts With Speed

Decoding strategy is not purely a quality choice — it touches speed too. Greedy decoding is the cheapest. Beam search multiplies cost by the beam width (it tracks several sequences at once), which is why it is rarely used for open-ended LLM generation. And speculative decoding (Section 27.9) interacts with sampling: the draft-and-verify procedure must be adapted to preserve the exact sampling distribution, which the published methods carefully do.
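
As a rough illustration of how that adaptation works, the speculative sampling papers listed in Further reading accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target model's distribution and q the draft model's, and on rejection resample from the normalized residual max(0, p − q). The sketch below computes the resulting output distribution exactly for one position, using two invented three-token distributions; the function name is hypothetical.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the emitted token for one draft-and-verify step.

    p: target model's distribution, q: draft model's distribution (same vocabulary).
    """
    n = len(p)
    # Probability that the draft proposes token x AND the target accepts it.
    accept = [q[x] * min(1.0, p[x] / q[x]) if q[x] > 0 else 0.0 for x in range(n)]
    reject_prob = 1.0 - sum(accept)
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p[x] - q[x]) for x in range(n)]
    z = sum(residual)
    resample = [r / z if z > 0 else 0.0 for r in residual]
    return [accept[x] + reject_prob * resample[x] for x in range(n)]

p = [0.6, 0.3, 0.1]   # hypothetical target distribution
q = [0.3, 0.5, 0.2]   # hypothetical draft distribution
out = speculative_output_distribution(p, q)
print([round(v, 6) for v in out])   # matches p
```

The emitted token is distributed exactly as p even though the draft q was quite different; a poor draft lowers the acceptance rate (and so the speedup), not the output quality.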

For most chat and generation, temperature with top-p sampling is the default. Reserve greedy for when you want determinism (reproducible outputs, some code tasks) and beam search for narrow closed-ended tasks like translation where a single best sequence is wanted.
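
For comparison, here is a minimal beam search sketch over a toy next-token table, written in plain Python. The `toy_model` function and its probabilities are invented so that the greedy first choice leads to a worse overall sequence; a real implementation would score batched model outputs instead.

```python
import math

def beam_search(step_fn, beam_width, max_len, eos):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [([], 0.0)]                          # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:     # finished sequences carry over unchanged
                candidates.append((tokens, score))
                continue
            for tok, prob in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0]

def toy_model(prefix):
    """Hypothetical next-token probabilities; 'a' looks best first but leads nowhere good."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix == ["a"]:
        return {"x": 0.55, "y": 0.45}
    if prefix == ["b"]:
        return {"x": 0.9, "y": 0.1}
    return {"<eos>": 1.0}

print(beam_search(toy_model, 1, 4, "<eos>")[0])   # beam width 1 is greedy: a, x, <eos>
print(beam_search(toy_model, 2, 4, "<eos>")[0])   # beam width 2 finds:    b, x, <eos>
```

Greedy commits to "a" (0.6) and ends with a sequence of probability 0.33, while a beam of width 2 keeps "b" alive and finds the 0.36 sequence. The price is that every step expands and scores every beam, which is the cost multiplier noted above.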

You will almost never implement these optimizations yourself. Instead, you use a SERVING ENGINE — software that bundles quantization, PagedAttention, continuous batching, speculative decoding, and the rest into a single high-performance system. Knowing the major engines and what each is for completes the practical picture.

Engine	Strengths	Best for
vLLM	PagedAttention, continuous batching, high throughput	GPU serving at scale
TGI (Text Generation Inference)	Hugging Face's production server	Easy HF model deployment
TensorRT-LLM	NVIDIA's heavily-optimized engine	Max performance on NVIDIA
llama.cpp	CPU/Mac/local, GGUF, lightweight	Local & on-device inference
Ollama	Simple local model running (wraps llama.cpp)	Hobbyist / local dev
SGLang	Fast, with strong prefix caching	Complex prompting workloads

The Request Lifecycle in a Serving Engine

Tying the chapter together, here is the journey of a request through a modern serving engine, touching the optimizations we covered:

Pipeline Flow: A request through an optimized serving engine

1	Arrive	Request enters the queue; prefix cache checked for shared prompt
2	Schedule	Continuous-batching scheduler admits it into the active batch
3	Prefill	Prompt processed (chunked if long); KV cache allocated in pages
4	Decode	Tokens generated; speculative decoding drafts ahead; quantized weights
5	Stream	Tokens streamed back to the user as they are produced
6	Free	On completion, KV-cache pages return to the shared pool
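
The lifecycle above can be mimicked with a toy simulation. The sketch below is plain Python with no real model: it only tracks KV-cache pages and decode steps, and the block size, request sizes, and all names (`Request`, `serve`) are assumptions chosen for illustration. It also simply raises an error when the pool runs dry, where a real engine would preempt or swap a request.

```python
from collections import deque

BLOCK_SIZE = 16          # tokens per KV-cache page (assumed value)

class Request:
    def __init__(self, rid, prompt_len, max_new_tokens):
        self.rid = rid
        self.prompt_len = prompt_len
        self.max_new_tokens = max_new_tokens
        self.generated = 0
        self.blocks = 0                      # KV-cache pages currently held

    def blocks_needed(self):
        total = self.prompt_len + self.generated
        return -(-total // BLOCK_SIZE)       # ceiling division

def serve(requests, total_blocks):
    """Toy loop: admit, prefill, decode one token per step, free pages on completion."""
    queue, active = deque(requests), []
    free_blocks, step, finish_step = total_blocks, 0, {}
    while queue or active:
        # Arrive / Schedule / Prefill: admit waiting requests while their prompt pages fit.
        while queue and queue[0].blocks_needed() <= free_blocks:
            req = queue.popleft()
            req.blocks = req.blocks_needed()
            free_blocks -= req.blocks
            active.append(req)
        if not active:
            raise RuntimeError("page pool too small for the next request")
        step += 1
        # Decode / Stream: every active request produces one token this step.
        for req in list(active):
            req.generated += 1
            extra = req.blocks_needed() - req.blocks   # allocate a new page on demand
            if extra > free_blocks:
                raise RuntimeError("out of pages; a real engine would preempt here")
            req.blocks += extra
            free_blocks -= extra
            # Free: on completion, pages return to the shared pool.
            if req.generated == req.max_new_tokens:
                free_blocks += req.blocks
                active.remove(req)
                finish_step[req.rid] = step
    return finish_step, free_blocks

reqs = [Request("A", 20, 4), Request("B", 20, 10), Request("C", 10, 3)]
finish_step, free_blocks = serve(reqs, total_blocks=4)
print(finish_step, free_blocks)
```

With four pages, A and B fill the pool and C waits. A finishes at step 4, its two pages return to the pool, and C is admitted on the very next step, finishing at step 7 while B is still decoding (B finishes at step 10). Under static batching C would have had to wait for the whole first batch to finish.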

Python•Serving with vLLM (the common case)
# pip install vllm
from vllm import LLM, SamplingParams

# Load a model; the engine handles PagedAttention and continuous
# batching automatically.
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct',
          # quantization='awq',          # enable only when `model` points to an AWQ-quantized checkpoint
          gpu_memory_utilization=0.9)    # fraction of GPU memory the engine may claim

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Submit many prompts at once -- the engine batches them continuously
prompts = ['Explain photosynthesis.', 'Write a haiku.']   # add more prompts as needed
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)

# PagedAttention and continuous batching are active under the hood;
# quantization and speculative decoding are opt-in settings.
# You write a few lines; the engine delivers production-grade throughput.

✧

Infer Note: Don't Reinvent the Engine

The practical advice: unless you are doing research on inference itself, do NOT implement these optimizations from scratch for production. Use a battle-tested engine (vLLM for GPU serving, llama.cpp/Ollama for local). They have implemented, tuned, and tested every technique in this chapter, and they keep up with new ones. Your job is to choose the right engine and configure it well.

Understanding the techniques (as this chapter teaches) is what lets you configure the engine intelligently, debug performance problems, and choose the right trade-offs — even though you let the engine do the heavy lifting. Know the concepts; use the tools.

Inference Quick-Reference

Concept	Key idea	Remember
Cost shift	Inference >> training over a model's life	Per-token cost paid forever
Prefill vs decode	Compute-bound vs memory-bound	Decode is the bottleneck
KV cache	Stores past keys/values	Dominates memory; the central object
Metrics	TTFT, TPOT, throughput	Match metric to use case
Quantization	Fewer bits per weight	int4 ~4x smaller, faster decode
GPTQ/AWQ/GGUF	Algorithms vs a file format	GPU algos vs local format
PagedAttention	Page the KV cache like OS memory	Eliminates fragmentation
Continuous batching	Reshuffle batch every step	Keeps the GPU full
Speculative decoding	Draft small, verify big	Faster, same output distribution
Serving engines	Bundle all optimizations	Use vLLM / llama.cpp

Exercises

Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.

✎

Exercise 1: Pen & Paper

Explain why inference cost dominates training cost over a deployed model's lifetime. What does this imply about where to focus optimization effort?

✎

Exercise 2: Pen & Paper

Distinguish the prefill and decode phases. Why is prefill compute-bound and decode memory-bound? Which determines TTFT and which TPOT?

✎

Exercise 3: Pen & Paper

Explain why decode underutilizes the GPU. Connect this to the two main fix families (smaller weights, batching).

✎

Exercise 4: Derive

Compute the KV-cache size for a 7B model (32 layers, 32 attention heads, head dimension 128) with an 8k-token context, batch size 32, in fp16. How does GQA with 8 KV heads change it?

✎

Exercise 5: Pen & Paper

Define TTFT, TPOT, latency, and throughput, and write the latency formula. For a chatbot vs an overnight batch job, which metrics matter most?

✎

Exercise 6: Derive

Show how per-group int8 quantization stores weights. For a 7B model, compute the memory at fp16, int8, and int4, and the decode speedup you'd expect.

✎

Exercise 7: Pen & Paper

Explain GPTQ and AWQ at a high level and how they differ. Why does 'activation-aware' quantization preserve quality better than uniform quantization?

✎

Exercise 8: Pen & Paper

Clarify the difference between GPTQ/AWQ and GGUF. For GPU datacenter serving vs running on a MacBook, which would you use and why?

✎

Exercise 9: Pen & Paper

Explain the KV-cache fragmentation problem and how PagedAttention solves it using the operating-system paging analogy.

✎

Exercise 10: Pen & Paper

Contrast static and continuous batching. Why does continuous batching keep the GPU fuller, and what does it require of the scheduler?

✎

Exercise 11: Derive

Explain why speculative decoding produces output identical to normal decoding (token for token under greedy decoding, and identical in distribution under sampling). Derive the expected accepted tokens per pass as a function of acceptance rate α and draft length k.

✎

Exercise 12: Code

Implement the KV cache for generation. Measure the speedup of cached vs uncached generation as the sequence length grows.

✎

Exercise 13: Code

Implement per-group int8 and int4 weight quantization. Quantize a small model's weights, then measure the memory saved and the error introduced.

✎

Exercise 14: Code

Measure quantization quality: run a small model at fp16 and int4 on a benchmark, and report the quality drop and the memory/speed gains.

✎

Exercise 15: Code

Implement temperature, top-k, and top-p sampling from scratch. Show how each knob changes the diversity and quality of generated text.

✎

Exercise 16: Code

Measure TTFT and TPOT: instrument a generation loop to report time to first token and time per output token separately. Vary prompt length and observe TTFT.

✎

Exercise 17: Code Lab

Simulate static vs continuous batching: model a stream of requests with varying lengths and measure GPU utilization and throughput under each scheme.

✎

Exercise 18: Code Lab

Implement a simplified PagedAttention KV cache: store the cache in fixed-size blocks allocated on demand, with a block table. Measure memory waste vs contiguous allocation.

✎

Exercise 19: Code Lab

Implement speculative decoding with a small draft model and a large model. Measure the acceptance rate and the resulting speedup, and verify that under greedy decoding the output matches plain decoding token for token.

✎

Exercise 20: Code

Serve a model with vLLM. Submit a batch of prompts, measure throughput, then enable an AWQ-quantized model and compare memory and throughput.

✎

Exercise 21: Code

Implement prefix caching: detect a shared prompt prefix across requests and reuse its KV cache. Measure the TTFT reduction for a fixed long system prompt.

✎

Exercise 22: Code (Challenge)

Build a mini inference server that combines several optimizations: a paged KV cache, continuous batching of a request stream, int4-quantized weights, and temperature/top-p sampling. Benchmark it against a naive one-request-at-a-time fp16 baseline, and report the throughput and latency improvement from each optimization added in turn.

Further reading: “Efficient Memory Management for LLM Serving with PagedAttention” (Kwon et al., 2023, vLLM). “GPTQ” (Frantar et al., 2022) and “AWQ” (Lin et al., 2023) for quantization. “Fast Inference from Transformers via Speculative Decoding” (Leviathan et al., 2023) and “Accelerating LLM Decoding with Speculative Sampling” (Chen et al., 2023). “Medusa” (Cai et al., 2024) for multi-head speculation. “Orca” (Yu et al., 2022) for continuous batching. The vLLM, TensorRT-LLM, and llama.cpp documentation. “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019) for top-p sampling.

Next → Chapter 28: Tool Calling & Function Calling

You can now run a model fast and cheaply. But a model alone is limited to what is in its weights — it cannot look up today's weather, run a calculation reliably, query a database, or take actions in the world. Chapter 28 gives the model TOOLS: the ability to call functions, APIs, and external systems. We will see how a model is trained and prompted to decide WHEN to use a tool, format the call, and incorporate the result — the foundation of agents. The actor-to-actor message flow you saw in this chapter's diagrams becomes the agentic loop, where the model and its tools converse to accomplish tasks beyond what any single forward pass could.


