Part VI: Productionization
Chapter 27

Parameter-Efficient Fine-Tuning

LoRA, QLoRA, adapters, and prefix tuning
22 Exercises
27.1

We have spent the whole book building and aligning a model. Now we have to RUN it — serve it to users, fast and cheaply. This turns out to be a completely different engineering challenge from training, with its own bottlenecks, its own metrics, and its own bag of tricks. Part VI begins here, with the techniques that make a trained model practical to deploy.

The Cost Shift: Training Is One-Time, Inference Is Forever

Here is the key economic fact. Training a model is a ONE-TIME cost — expensive, but paid once. Inference is a RECURRING cost — paid for every single token the model ever generates, for every user, forever. A popular model serves billions of tokens per day. Over a model's deployed life, the total inference cost dwarfs the training cost. This is why inference optimization matters so much: shaving 20% off inference cost saves far more, in absolute terms, than almost any training optimization.

Infer Note: Inference Dominates Total Cost
For a widely-deployed model, the lifetime inference compute vastly exceeds the training compute. A model trained once for a few weeks may then serve queries for years. This is the opposite of the intuition many beginners start with — training feels like 'the expensive part', but for any successful deployed model, inference is where the money goes.
This cost structure drives the entire field of inference optimization. Every technique in this chapter — quantization, batching, caching, speculative decoding — exists to reduce the per-token cost of serving, because that cost is paid astronomically many times.

Two Goals in Tension: Latency and Throughput

Inference optimization juggles two goals that often conflict. LATENCY is how fast a single user gets their response — it matters for interactive use (a chatbot must feel responsive). THROUGHPUT is how many tokens you can generate across all users per second — it determines cost-efficiency (more throughput means serving more users per GPU). Many techniques trade one for the other, and the right balance depends on the application.

| Latency (single-user speed) | Throughput (total tokens/sec) |
|---|---|
| How fast ONE response arrives | How many tokens across ALL users |
| Matters for interactivity | Matters for cost-efficiency |
| Small batches are better | Large batches are better |
| A chatbot feels snappy | Serve more users per GPU |
| Optimize per-request | Optimize aggregate |

Intuition: Why Latency and Throughput Conflict
Imagine a bus versus a taxi. A taxi (small batch) gets one passenger to their destination fastest — low latency. A bus (large batch) moves the most people per trip — high throughput — but each passenger waits longer for the bus to fill and stop along the way. Serving LLMs faces the same trade-off: batching many requests together uses the GPU efficiently (throughput) but can make any single request wait (latency).
Much of inference optimization is about getting the best of both — high throughput without sacrificing too much latency. Continuous batching (Section 27.8) is the clever scheduling that makes this possible, like a bus that picks up and drops off passengers without ever stopping.
27.2

To optimize inference, you must understand that generating a response happens in TWO distinct phases with very different performance characteristics. Confusing them is a common beginner mistake; distinguishing them is the foundation for everything else in this chapter.

Phase 1: Prefill (Processing the Prompt)

When you send a prompt, the model first PROCESSES the entire prompt in one go — this is prefill. All the prompt's tokens are fed through the model together, in parallel, to produce the first output token and to populate the KV cache (Section 27.3). Because all prompt tokens are processed at once, prefill is COMPUTE-BOUND: it does a lot of matrix multiplication and keeps the GPU's compute units busy.

Phase 2: Decode (Generating Tokens One by One)

After prefill, the model GENERATES the response one token at a time — this is decode. Each step takes the single most recent token, runs it through the model, and produces the next token, repeating until done. Because each step processes only ONE token, decode is MEMORY-BANDWIDTH-BOUND: the tiny computation for one token is dwarfed by the time spent reading the model weights and KV cache from memory.

Tool Trace: The two phases of generating a response

User: Sends prompt: 'Explain photosynthesis' (5 tokens)
Prefill: Process all 5 prompt tokens at once → first output token + KV cache
Decode: Generate token 2, attending to the cached prompt
Decode: Generate token 3, attending to all previous
Decode: ... continue one token at a time until <end>
User: Receives the full streamed response

| Property | Prefill | Decode |
|---|---|---|
| Processes | All prompt tokens at once | One token at a time |
| Bottleneck | Compute (matmul) | Memory bandwidth |
| GPU utilization | High | Low (underutilized) |
| Parallelism | Across prompt tokens | None (sequential) |
| Determines | Time to first token | Time per output token |
| Optimized by | Chunking, fusion | Batching, smaller weights |

Infer Note: Decode Is the Expensive Part — and It Wastes the GPU
Here is the crucial insight: decode generates one token per forward pass, and that single-token forward pass barely uses the GPU's enormous compute capacity — it spends almost all its time READING the model weights from memory. The GPU's compute units sit mostly idle. This memory-bound nature of decode is THE central problem of LLM inference, and most optimization techniques exist to address it.
Two big families of fixes follow directly: (1) make the weights SMALLER so there is less to read (quantization, Sections 27.5–27.6), and (2) read the weights ONCE but use them for MANY requests at the same time (batching, Section 27.8). Keep this framing in mind — it explains why every technique in this chapter works.
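
The memory-bound claim can be checked with a back-of-envelope roofline estimate. The sketch below is a simplification under assumptions that are not from the text: roughly 2 FLOPs per parameter per token, every weight read once per decode step, KV-cache traffic ignored, and illustrative hardware figures (`PEAK_FLOPS`, `MEM_BW`) that do not describe any particular GPU.

```python
def decode_step_time(n_params, bytes_per_weight, batch, peak_flops, mem_bandwidth):
    """Rough time (seconds) for one decode step under a simple roofline model."""
    compute_time = 2 * n_params * batch / peak_flops            # ~2 FLOPs per weight per token
    memory_time = n_params * bytes_per_weight / mem_bandwidth   # weights read once per step
    return max(compute_time, memory_time), compute_time, memory_time

# Hypothetical accelerator (illustrative numbers only).
PEAK_FLOPS = 1e15   # FLOP/s
MEM_BW = 2e12       # bytes/s

for batch in (1, 16, 256, 1024):
    step, compute, memory = decode_step_time(7e9, 2, batch, PEAK_FLOPS, MEM_BW)
    bound = "memory" if memory >= compute else "compute"
    print(f"batch={batch:5d}  step={step * 1e3:6.2f} ms  "
          f"tokens/s={batch / step:10.0f}  ({bound}-bound)")
```

Under these assumptions the step time stays flat while the batch grows, because the weight read dominates, so every extra request in the batch is nearly free until the compute term catches up (at a batch of about 500 for these made-up numbers). That is the quantitative reason batching and smaller weights are the two main levers.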
27.3

We met the KV cache in Chapters 13 and 19. It is so central to inference that we revisit it here as the object around which all of inference optimization revolves. Understanding it deeply makes the rest of the chapter click into place.

Why the Cache Exists

Recall how attention works: each new token attends to ALL previous tokens, using their keys (K) and values (V). Without a cache, generating each new token would require recomputing the keys and values for every previous token, which is enormously wasteful since those don't change. The KV CACHE stores the keys and values of all previous tokens, so each new token only computes ITS OWN K and V and reads the rest from the cache. Over a T-token generation, this cuts the key/value computation from O(T²) to O(T); each step still attends over every cached entry, but nothing is recomputed. It is what makes generation practical.

KV cache
A store of the key and value vectors of all previously-processed tokens, kept so that generating each new token does not require recomputing them. It grows by one entry per generated token.

The Problem: The Cache Is Huge

The KV cache solves a compute problem but creates a MEMORY problem. It grows with every token generated, and at long context lengths and large batch sizes it becomes the dominant consumer of GPU memory — often larger than the model weights themselves. Recall the formula from Chapter 19:

KV-cache memory
cache = 2 · layers · kv_heads · head_dim · seq_len · batch · bytes

# the 2 is for Keys AND Values
Example: 13B model (40 layers, 40 heads, 128 dim), 4k ctx, batch 16, fp16:
    2 × 40 × 40 × 128 × 4096 × 16 × 2 ≈ 54 GB

About 54 GB of KV cache for one batch: roughly twice the 26 GB that the 13B model's fp16 weights occupy, and together the two fill an 80 GB GPU. This is why so much of inference optimization targets the cache: shrinking it (GQA from Chapter 19, quantizing it), managing its memory efficiently (PagedAttention, Section 27.7), and reusing it across requests (prefix caching, Section 27.10). The KV cache is the resource that everything fights over.
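
The formula translates directly into a small helper, which makes it easy to try other configurations. This is a minimal sketch: the function name is illustrative, and the 8-KV-head variant is a hypothetical GQA configuration, not one taken from the text.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV-cache size in bytes; the leading 2 counts keys AND values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# The 13B example: 40 layers, 40 KV heads, head_dim 128, 4k context, batch 16, fp16.
full = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=4096, batch=16)

# Hypothetical GQA variant of the same model with only 8 KV heads.
gqa = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, seq_len=4096, batch=16)

print(f"40 KV heads: {full / 1e9:.1f} GB, 8 KV heads: {gqa / 1e9:.1f} GB")
```

Because every factor enters multiplicatively, cutting the KV heads from 40 to 8 shrinks the cache by exactly 5×, and halving `bytes_per_value` (an int8 cache) halves it again.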

Infer Note: Three Things Compete for GPU Memory at Inference
At inference, GPU memory holds three things: (1) the MODEL WEIGHTS (fixed size), (2) the KV CACHE (grows with tokens and batch size), and (3) ACTIVATIONS (small at decode time). The weights are shrunk by quantization; the cache is the variable that limits how many requests you can batch and how long the context can be.
More memory freed from weights (via quantization) means more room for KV cache, which means larger batches and longer contexts — higher throughput. This is why quantization and batching are deeply connected: shrinking the weights directly enables more concurrent requests.
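
The link between weight size and batch size can be made concrete with a rough memory budget. The sketch below assumes a hypothetical 80 GB accelerator, a 10% safety reserve for activations and fragmentation (`reserve_frac` is an illustrative knob), and the 13B configuration from the KV-cache example (40 layers, 40 heads, 128 dim, fp16 cache).

```python
def max_concurrent_sequences(gpu_bytes, weight_bytes, kv_bytes_per_token, seq_len,
                             reserve_frac=0.1):
    """How many sequences of seq_len tokens fit after weights and a reserve are set aside."""
    free = gpu_bytes * (1 - reserve_frac) - weight_bytes
    if free <= 0:
        return 0
    return int(free // (kv_bytes_per_token * seq_len))

GPU = 80e9                               # hypothetical 80 GB accelerator
KV_PER_TOKEN = 2 * 40 * 40 * 128 * 2     # bytes of KV cache per token (K and V, fp16)

fits = {}
for label, weights in (("fp16", 26e9), ("int8", 13e9), ("int4", 6.5e9)):
    fits[label] = max_concurrent_sequences(GPU, weights, KV_PER_TOKEN, seq_len=4096)
    print(f"{label}: {fits[label]} sequences of 4096 tokens")
```

Each step down in weight precision frees memory that turns directly into more concurrent 4k-token sequences, which is the quantization-to-throughput connection described above.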
27.4

You cannot optimize what you cannot measure. Inference has a specific vocabulary of metrics, and using the right ones is essential. Let us define them carefully, because they map directly onto the prefill/decode phases from Section 27.2.

| Metric | Meaning | Driven by |
|---|---|---|
| TTFT | Time To First Token: prompt sent to first token out | Prefill speed |
| TPOT | Time Per Output Token: gap between successive tokens | Decode speed |
| Latency | Total time for the full response | TTFT + TPOT × tokens |
| Throughput | Total output tokens/sec across all requests | Batching, efficiency |
| Goodput | Throughput meeting latency targets | Both, balanced |

How the Metrics Connect

These metrics fit together simply. The total time a user waits for a complete response is roughly the time to the first token (TTFT, set by prefill) plus the time per subsequent token (TPOT, set by decode) times the number of tokens generated. For an interactive chatbot, a low TTFT makes it feel responsive (the answer starts quickly), and a low TPOT makes it read smoothly (tokens stream fast enough to read).

Total latency
latency ≈ TTFT  +  TPOT × (output tokens)

TTFT: how long until the FIRST token appears (prefill)
TPOT: time between each token after that (decode)

# A chatbot wants low TTFT (feels responsive) AND low TPOT (reads smoothly).
Infer Note: Different Apps Want Different Metrics
The right metric depends on the use case. An interactive chatbot cares most about TTFT and TPOT — user-perceived responsiveness. A batch job processing millions of documents overnight cares only about THROUGHPUT — total tokens per dollar, with latency irrelevant. A coding autocomplete needs ultra-low TTFT. Knowing which metric matters for YOUR application tells you which optimizations to prioritize.
This is why there is no single 'fastest' configuration: the optimal setup for a low-latency chatbot (small batches, speculative decoding) differs from the optimal setup for a high-throughput batch job (large batches, maximum quantization). Match the optimization to the metric that matters.
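
These metrics are easy to measure from a streamed response: record when the request was sent and when each token arrived. A minimal sketch, where the function name and the timestamps are hypothetical:

```python
def stream_metrics(request_sent, token_times):
    """Compute (TTFT, mean TPOT, total latency) from token arrival timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    latency = token_times[-1] - request_sent
    return ttft, tpot, latency

# Hypothetical trace: request at t=0, first token after 0.5 s, then one token every 0.25 s.
times = [0.5 + 0.25 * i for i in range(5)]
ttft, tpot, latency = stream_metrics(0.0, times)
print(f"TTFT={ttft:.2f}s  TPOT={tpot:.2f}s  latency={latency:.2f}s")
```

Measured this way, total latency is exactly TTFT plus TPOT times the number of tokens after the first; the chapter's formula rounds that to the full token count, which is a fine approximation for long responses.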
27.5

The single most impactful inference optimization is QUANTIZATION: representing the model's weights (and sometimes activations and the KV cache) with fewer bits. A model trained in 16-bit precision can often be run in 8-bit or even 4-bit with little quality loss — halving or quartering its memory footprint and, because decode is memory-bound (Section 27.2), speeding it up substantially.

Why Quantization Helps So Much

Recall that decode is bottlenecked by READING the weights from memory, not by computing with them. If you store the weights in 4 bits instead of 16, there is 4× LESS data to read per token — so decode gets up to 4× faster, AND the model takes 4× less memory (leaving more room for KV cache and bigger batches). Quantization attacks the exact bottleneck of inference.

| Precision | Bits/weight | Memory (7B model) | Quality |
|---|---|---|---|
| fp16 / bf16 | 16 | ~14 GB | Baseline (full) |
| int8 | 8 | ~7 GB | Near-lossless |
| int4 | 4 | ~3.5 GB | Small loss, usually fine |
| int3 / int2 | 3 / 2 | ~2.6 / 1.8 GB | Noticeable degradation |
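
The memory column, and the "up to 4× faster" claim, both follow from simple arithmetic. The sketch below assumes a hypothetical 1 TB/s of memory bandwidth and that every weight is read exactly once per generated token; it ignores per-group scales, the KV cache, and dequantization cost, so treat the speeds as upper bounds only.

```python
def weight_bytes(n_params, bits):
    """Memory for the weights alone (ignores per-group scales and other overhead)."""
    return n_params * bits / 8

def decode_tokens_per_sec_bound(n_params, bits, mem_bandwidth):
    """Upper bound on single-stream decode speed if all weights are read once per token."""
    return mem_bandwidth / weight_bytes(n_params, bits)

BANDWIDTH = 1e12  # hypothetical: 1 TB/s

for bits in (16, 8, 4):
    gb = weight_bytes(7e9, bits) / 1e9
    tps = decode_tokens_per_sec_bound(7e9, bits, BANDWIDTH)
    print(f"{bits:2d}-bit: {gb:4.1f} GB, at most ~{tps:.0f} tokens/s per stream")
```

The bound scales inversely with bits per weight, which is why going from 16 to 4 bits can give up to a 4× decode speedup when memory bandwidth is the only limit.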

The Basic Idea: Map Floats to a Small Set of Integers

Quantization maps a range of floating-point values onto a small set of integers. Instead of storing each weight as a 16-bit float, you store a scale factor per group of weights and represent each weight as a small integer that, multiplied by the scale, approximates the original. The art is choosing the mapping so that the approximation loses as little quality as possible.

Basic quantization (per-group)
For a group of weights w with max absolute value M:
    scale = M / (2^(bits-1) - 1)
    q = round(w / scale)        # small integer
    w ≈ q × scale               # dequantized approximation

# Store q (few bits) + scale (one per group). Reconstruct w when needed.
Python: Simple int8 quantization from scratch
import torch

def quantize_int8(weights, group_size=128):
    """Per-group symmetric int8 quantization.

    Assumes weights.numel() is divisible by group_size.
    """
    # Work in float32 so the scale arithmetic does not underflow for fp16 inputs
    w = weights.float().reshape(-1, group_size)   # groups of weights
    # One scale per group, from the group's max magnitude.
    # clamp(min=...) keeps an all-zero group from giving scale 0 (0/0 -> NaN).
    scale = (w.abs().max(dim=1, keepdim=True).values / 127).clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale, shape=None):
    """Reconstruct approximate fp weights (optionally in the original shape)."""
    w = q.float() * scale
    return w if shape is None else w.reshape(shape)

# int8: store q (1 byte) + a scale per 128 weights, vs 2 bytes/weight.
# ~2x smaller, near-lossless. int4 (4 bits) is ~4x smaller with care.
# The trick is choosing scales/grouping to minimize error -- that's what
# the methods in the next section (GPTQ, AWQ) do cleverly.
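
int4 needs one step that int8 does not: there is no 4-bit storage type, so two values must be packed into each byte. The sketch below is pure Python (no torch) and assumes symmetric per-group quantization to the range [-7, 7] with a low-nibble-first layout; that layout is a hypothetical choice, and real kernels use their own.

```python
def quantize_int4(group):
    """Symmetric int4 quantization of one group of floats: returns (ints in [-7, 7], scale)."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0   # avoid dividing by zero for an all-zero group
    q = [max(-7, min(7, round(w / scale))) for w in group]
    return q, scale

def pack_int4(q):
    """Pack pairs of signed 4-bit ints into bytes (low nibble first); len(q) must be even."""
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit ints."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)   # sign-extend 4 bits
    return out

weights = [0.9, -0.5, 0.3, 0.2, -0.05, 0.0, 0.7, -0.9]
q, scale = quantize_int4(weights)
packed = pack_int4(q)                      # 8 weights stored in 4 bytes
restored = [v * scale for v in unpack_int4(packed)]
print(len(packed), "bytes;", [round(r, 3) for r in restored])
```

Each restored weight is within half a quantization step (`scale / 2`) of the original, which is the best a uniform grid can do; the methods in the next section work on making that error matter less.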

Weight-Only vs Weight-and-Activation

Most LLM inference quantization is WEIGHT-ONLY: only the weights are stored in low precision, while computation happens in higher precision (the weights are dequantized on the fly). This is because weights are the memory bottleneck at decode time, and quantizing activations too is harder (activations have outliers that are sensitive to precision loss). Weight-only int4 is the sweet spot for most deployments.

Infer Note: Quantization Is Almost Always Worth It
For most deployments, int8 quantization is essentially free — near-lossless quality, half the memory, faster decode. int4 costs a small, often imperceptible quality drop for another 2× saving. Given how dramatically it cuts the dominant cost (memory bandwidth at decode), quantization is the first optimization to reach for, and the methods in the next section make it remarkably effective.
The one caution: quantization can hurt more on tasks requiring precise reasoning or on already-small models, where every bit of precision matters more. Always evaluate quality on YOUR task after quantizing — but for most uses, the trade is overwhelmingly favorable.
27.6

The naive quantization of Section 27.5 works, but smarter methods lose far less quality at the same bit-width. Three names dominate practice — GPTQ, AWQ, and GGUF — and beginners are often confused about how they differ. Let us clear that up.

| Method | What it is | Best for |
|---|---|---|
| GPTQ | Post-training quantization minimizing layer-wise error | GPU inference, 4-bit |
| AWQ | Activation-aware: protects the most important weights | GPU inference, quality |
| GGUF | A file FORMAT for quantized models (llama.cpp) | CPU / local / Mac inference |
| bitsandbytes | On-the-fly int8/int4 (used in QLoRA) | Easy, training + inference |

GPTQ: Minimizing Quantization Error

GPTQ (Frantar et al., 2022) is a clever post-training quantization method. Instead of quantizing each weight independently, it quantizes them in a way that MINIMIZES the resulting error on a small calibration dataset, adjusting the remaining weights to compensate for the error introduced by each quantized weight. This careful, error-aware approach lets GPTQ quantize to 4 bits with minimal quality loss. It is one of the most widely-used 4-bit methods for GPU inference.

AWQ: Protecting the Important Weights

AWQ (Activation-aware Weight Quantization; Lin et al., 2023) is based on a key observation: not all weights are equally important. A small fraction of weights, those that multiply large activations, matter disproportionately for quality. AWQ identifies these important weight channels (by looking at activation magnitudes on calibration data) and protects them by scaling them up before quantization, with the inverse scale folded into the activations, so the relative rounding error on them shrinks. Every weight is still stored at the same low bit-width; nothing is kept in higher precision. This activation-awareness often gives slightly better quality than GPTQ at the same bit-width.

Intuition: Why 'Activation-Aware' Matters
Imagine compressing a photo. You would not compress every pixel equally — you'd preserve the detail in the important regions (a face) and compress the unimportant ones (a blurry background) more. AWQ does this for weights: it spends precision where it matters (weights that drive large activations) and saves it where it doesn't. The result is better quality at the same average bit-width.
GPTQ and AWQ both exploit the same underlying truth — that quantization error is not uniformly costly — just in different ways. GPTQ compensates for error as it goes; AWQ protects the weights where error would hurt most. Both dramatically outperform naive uniform quantization.
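
The effect of activation-aware scaling can be seen on a single toy neuron. This is a hand-picked illustration, not the AWQ algorithm itself: the weights, activations, and per-channel scales `s` are hypothetical values chosen so the effect is visible, whereas AWQ searches for the scales using calibration data.

```python
def fake_quant(weights, bits=4):
    """Quantize and immediately dequantize a group with one shared symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [0.2, 0.9, -0.5, 0.3]     # one output neuron's weights (toy values)
x = [10.0, 0.1, 0.1, 0.1]     # channel 0 carries a large activation, so w[0] is "salient"
s = [2.0, 1.0, 1.0, 1.0]      # hypothetical per-channel scales

exact = dot(w, x)
naive = dot(fake_quant(w), x)

# Activation-aware: scale the salient weight up before quantizing,
# and scale the matching activation down so the product is unchanged.
scaled_w = fake_quant([wi * si for wi, si in zip(w, s)])
aware = dot(scaled_w, [xi / si for xi, si in zip(x, s)])

print(f"naive error: {abs(naive - exact):.3f}   "
      f"activation-aware error: {abs(aware - exact):.3f}")
```

In this example the quantization step is unchanged by the scaling (the largest weight is still 0.9), so the salient weight lands on a relatively finer grid and its rounding error, which the large activation would otherwise amplify, shrinks.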

GGUF: A Format, Not an Algorithm

GGUF is frequently confused with GPTQ and AWQ, but it is a different KIND of thing: it is a FILE FORMAT (used by llama.cpp), not a quantization algorithm. GGUF files store quantized models in a portable format optimized for running on CPUs, laptops, and Apple Silicon — enabling LLMs to run locally on consumer hardware without a GPU. GGUF supports many quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.), each a different bit-width and scheme. When you download a model to run locally on a Mac or PC, it is usually GGUF.

| You want to... | Use |
|---|---|
| Run on a GPU, 4-bit, good quality | GPTQ or AWQ |
| Run locally on a Mac/PC/CPU | GGUF (via llama.cpp / Ollama) |
| Fine-tune with quantization (QLoRA) | bitsandbytes (NF4) |
| Maximum throughput on GPU | AWQ/GPTQ with a serving engine (vLLM) |

Infer Note: GPTQ/AWQ vs GGUF: Algorithm vs Format
The cleanest way to keep these straight: GPTQ and AWQ are ALGORITHMS that decide HOW to quantize (which weights, what scales) for GPU inference. GGUF is a FILE FORMAT that stores a quantized model for CPU/local inference with llama.cpp. They answer different questions — 'how do I quantize well?' versus 'what file do I ship?' — and are not mutually exclusive.
For deployment: on GPUs in a datacenter, you'll typically use GPTQ or AWQ quantized models served by an engine like vLLM. On a laptop or for local/hobbyist use, you'll typically download a GGUF file and run it with llama.cpp or Ollama. Knowing this map saves a lot of beginner confusion.
27.7

We saw in Section 27.3 that the KV cache is a huge consumer of memory. But there is a second, subtler problem: traditional serving WASTES much of the memory it allocates to the cache. PagedAttention (Kwon et al., 2023), the technique behind the vLLM serving engine, solves this and was a major leap in serving efficiency.

The Waste Problem

Traditionally, when a request arrives, the server allocates one big CONTIGUOUS block of memory for its KV cache, sized for the maximum possible length. But most responses are far shorter than the maximum — so most of that reserved block sits empty and unusable by other requests. This is INTERNAL FRAGMENTATION: memory reserved but wasted. Studies found traditional serving wasted 60–80% of KV-cache memory this way.

Intuition: The Operating-System Analogy
PagedAttention borrows a 50-year-old idea from operating systems: VIRTUAL MEMORY and PAGING. Your computer does not give each program one giant contiguous block of RAM; it hands out small fixed-size PAGES on demand and uses a page table to map them. This lets many programs share memory efficiently, with no need to pre-reserve huge contiguous blocks.
PagedAttention does exactly this for the KV cache. Instead of one big contiguous block per request, it stores the cache in small fixed-size BLOCKS (pages), allocated on demand as the response grows, with a block table mapping logical positions to physical blocks. A request uses only the blocks it actually needs — no waste from over-reservation.

How PagedAttention Works

PagedAttention (conceptual pseudocode)
# KV cache stored in small fixed-size blocks (like OS memory pages)
on each new token:
    if the current block is full:
        allocate a NEW block on demand (from a shared pool)
    append the token's K,V to the current block
    update the block table: logical position -> physical block

# Attention reads K,V through the block table (gather from blocks)
# No request over-reserves; freed blocks return to the shared pool
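
The pseudocode above can be sketched as a toy allocator in Python. It tracks only the bookkeeping (which physical blocks each request owns), not the K and V tensors themselves; the class and method names are illustrative and are not vLLM's API.

```python
class PagedKVCache:
    """Toy block-table allocator: tracks which physical blocks each request owns."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared pool of physical block ids
        self.block_tables = {}                       # request id -> list of physical block ids
        self.lengths = {}                            # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:            # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the scheduler must wait or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[-1], length % self.block_size

    def lookup(self, request_id, position):
        """Translate a logical token position into (physical_block, offset)."""
        table = self.block_tables[request_id]
        return table[position // self.block_size], position % self.block_size

    def free(self, request_id):
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("A")     # 6 tokens occupy 2 blocks, not a max-length reservation
for _ in range(3):
    cache.append_token("B")     # 3 tokens occupy 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))
```

The most a request can waste is the unused tail of its last block, so the waste is bounded by `block_size - 1` tokens per request instead of growing with the maximum sequence length.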

The payoff is large. By allocating cache memory in small blocks on demand, PagedAttention nearly eliminates the wasted memory, letting the server fit FAR more concurrent requests in the same GPU memory. The vLLM team reported up to 24× higher throughput than serving with Hugging Face Transformers, and the paper reports 2–4× over prior serving systems at the same latency, largely from this. More requests fit, so larger batches are possible, so throughput soars.

Infer Note: Sharing Pages: Another Win
PagedAttention enables a bonus: multiple requests that share a common prefix (e.g. the same system prompt, or the same few-shot examples) can SHARE the physical blocks for that prefix — 'copy-on-write' style, just like OS shared memory. This avoids storing duplicate KV cache for shared prefixes, saving even more memory. It is the foundation of prefix caching (Section 27.10).
The deeper lesson: many hard systems problems have already been solved in other domains. The KV-cache memory problem looked novel, but the operating-systems community solved memory fragmentation decades ago. Borrowing that solution — paging — was the key insight behind one of the most impactful inference optimizations.
27.8

Batching — processing many requests together — is how we get throughput, because it lets us read the weights ONCE and use them for MANY requests (recall the decode bottleneck from Section 27.2). But naive batching wastes the GPU. Continuous batching, the other half of vLLM's magic, fixes this and is one of the most important throughput optimizations.

The Problem with Static Batching

With STATIC (naive) batching, you collect a batch of requests, run them all together until ALL of them finish, then start the next batch. The problem: requests finish at different times — one needs 10 tokens, another needs 500. With static batching, the finished short requests sit idle, wasting their GPU slots, while the batch waits for the longest request. The GPU is underused, and new requests wait for the whole batch to clear.

| Static batching | Continuous batching |
|---|---|
| Batch runs until ALL finish | Each request leaves when IT finishes |
| Finished requests waste slots | Freed slots immediately reused |
| New requests wait for whole batch | New requests join mid-flight |
| GPU often underutilized | GPU stays full |
| Simple | Higher throughput, more complex |

How Continuous Batching Works

Continuous batching (also called in-flight or dynamic batching) operates at the level of individual DECODE STEPS. After every single token-generation step, the scheduler checks: did any request finish? If so, remove it and immediately admit a waiting request into the freed slot. The batch is continuously reshuffled, so the GPU is always working on a full batch of ACTIVE requests — no idle slots waiting for stragglers.

Tool Trace: Continuous batching: the scheduler reshuffles every step

Scheduler: Batch has slots for 4 requests; all 4 active
Request A: Finishes at step 12 → leaves the batch
Scheduler: Slot freed → admit waiting Request E immediately
Request E: Begins prefill and joins the active batch
Scheduler: GPU stays full; no slot idles waiting for stragglers

Infer Note: Continuous Batching Is the Throughput Workhorse
Continuous batching, combined with PagedAttention's efficient memory, is why modern serving engines achieve such high throughput. The GPU is kept busy on a full batch at virtually all times, and memory is used efficiently so the batch can be large. Together they can deliver an order of magnitude more throughput than naive serving — the same hardware serving 10×+ more users.
This is the 'bus that never stops' from Section 27.1: requests board and alight continuously while the bus keeps moving at full capacity. It captures much of the throughput benefit of large batches without forcing every request to wait for the slowest one.
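
A small simulation makes the difference between the two schedulers concrete. It is a sketch under simplifying assumptions: every decode step costs the same regardless of batch occupancy, prefill is free, and the output lengths are hypothetical.

```python
def static_batching_steps(lengths, slots):
    """Decode steps when each batch of `slots` requests runs until its longest member ends."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled from the queue after every step."""
    queue = list(lengths)
    active = []                                       # remaining tokens per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:          # admit waiting requests into free slots
            active.append(queue.pop(0))
        steps += 1                                    # one decode step for the whole batch
        active = [r - 1 for r in active if r > 1]     # finished requests leave immediately
    return steps

lengths = [10, 500, 20, 30, 500, 15, 25, 40]          # hypothetical output lengths (tokens)
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

With static batching, the second 500-token request cannot start until the first batch drains; with continuous batching it takes over the slot freed by the 10-token request and runs almost entirely in parallel with the first long request.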
27.9

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) is a clever technique that speeds up generation with NO loss in output quality: with greedy decoding the tokens are exactly those normal decoding would produce, and with sampling the output distribution is provably unchanged. It directly attacks the decode bottleneck (one token per expensive forward pass) by generating SEVERAL tokens per pass of the big model. It feels like magic the first time you see it.

The Core Idea: Draft, Then Verify

The insight: a small, fast 'draft' model can GUESS the next several tokens cheaply, and the big model can VERIFY all those guesses in a SINGLE forward pass (because verifying tokens in parallel is cheap — it is just one prefill-like pass). If the guesses are right, you got several tokens for the price of one big-model pass. If a guess is wrong, you discard it from that point and continue. Because the big model only ever ACCEPTS tokens it would have generated anyway, the output is identical to normal decoding.

Tool Trace: Speculative decoding: draft then verify

Draft model: Cheaply guesses next 4 tokens: 'the cat sat on'
Big model: Verifies all 4 in ONE forward pass
Big model: Accepts 'the cat sat', rejects 'on' (would've said 'down')
Big model: Got 3 tokens + 1 correction = 4 tokens, 1 big pass
Draft model: Drafts the next batch from 'down', repeat

Why It Works and When It Helps

Speculative decoding helps because many tokens are EASY to predict — common words, completions of obvious phrases, code boilerplate. The small draft model gets these right most of the time, so the big model verifies several at once. The speedup depends on the ACCEPTANCE RATE: how often the draft's guesses are correct. A good draft model (well-matched to the big one) achieves high acceptance and 2–3× speedups.

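Speculative decoding speedup
Draft k tokens, big model verifies in 1 pass.
If α = average acceptance rate (each draft token assumed accepted independently), the expected number of tokens produced per big-model pass, counting accepted drafts plus the one token the big model itself supplies, is ≈
    (1 - α^(k+1)) / (1 - α)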

# High acceptance (easy tokens) → many tokens per big-model pass → big speedup.
# Output is IDENTICAL to normal decoding — quality is never traded away.
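
To get a feel for the formula, it helps to evaluate it for a few acceptance rates and draft lengths. The helper name below is illustrative; note that this counts tokens per big-model pass only and ignores the draft model's own cost, so the wall-clock speedup is somewhat lower.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per big-model pass for draft length k and acceptance rate alpha.

    Counts accepted draft tokens plus the one token the big model itself supplies.
    """
    if alpha >= 1.0:
        return float(k + 1)          # every draft accepted, plus one bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.9):
    row = [round(expected_tokens_per_pass(alpha, k), 2) for k in (2, 4, 8)]
    print(f"alpha={alpha}: k=2,4,8 -> {row}")
```

Two limits are worth keeping in mind: at α = 0 every pass still yields exactly one token (plain decoding), and as α approaches 1 each pass yields k + 1 tokens. Longer drafts only pay off when the acceptance rate is high.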
Speculative decoding (pseudocode)
# big model M, small draft model D, draft length k
repeat until done:
    1. D cheaply generates k candidate tokens (autoregressively)
    2. M verifies all k in ONE forward pass (parallel)
    3. accept the longest prefix M agrees with
    4. on the first disagreement, take M's token instead
    5. continue drafting from there
# net: several tokens per big-model pass, same output as plain decoding
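
The pseudocode above can be sketched for the greedy case with toy models. Here `big` and `draft` are hypothetical functions mapping a token sequence to its next token; a real engine would obtain the big model's k+1 predictions from one batched forward pass, which the sketch only counts rather than performs.

```python
def plain_decode(big, prompt, n_new):
    """Reference: one big-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(big(seq))
    return seq

def speculative_decode(big, draft, prompt, n_new, k=4):
    """Greedy draft-then-verify. Returns (sequence, number of big-model passes)."""
    seq, passes = list(prompt), 0
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        # 1. Draft k candidate tokens autoregressively with the cheap model.
        candidates = []
        for _ in range(k):
            candidates.append(draft(seq + candidates))
        # 2. Verify. A real engine scores all k+1 positions in ONE forward pass;
        #    here each position is a separate call to the toy `big` function.
        passes += 1
        for token in candidates:
            expected = big(seq)
            seq.append(expected)              # always emit what the big model would emit
            if expected != token or len(seq) == target_len:
                break                         # first disagreement: drop the rest of the draft
        else:
            seq.append(big(seq))              # all k accepted: the pass yields a bonus token
    return seq[:target_len], passes

# Toy "models" over integer tokens: the draft is wrong whenever the last token is a multiple of 5.
def big(seq):
    return (seq[-1] * 3 + 1) % 17

def draft(seq):
    return big(seq) if seq[-1] % 5 else (big(seq) + 1) % 17

out, passes = speculative_decode(big, draft, [2], n_new=20)
print(out == plain_decode(big, [2], 20), passes)
```

Because every emitted token is the big model's own choice, the result matches plain greedy decoding exactly; the draft only determines how many of those tokens each pass can confirm. Sampling requires the rejection-sampling rule from the cited papers, which this greedy sketch does not implement.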
Infer Note: Variants: Self-Speculation and Medusa
The draft model does not have to be a separate model. Medusa (Cai et al., 2024) adds extra lightweight 'heads' to the big model itself that predict several future tokens, which the model then verifies — no separate draft model needed. Other variants use the model's own earlier layers, or n-gram lookups, as the drafter. All share the draft-then-verify principle.
Speculative decoding is especially valuable for LATENCY-sensitive single-user serving, where batching cannot help (one request can't be batched with itself). It is one of the few techniques that speeds up a single request's decode without any quality cost — which is why it is now standard in serving engines.
27.10

Beyond the big techniques, a collection of smaller but valuable optimizations round out a modern inference stack. Each targets a specific inefficiency, and together they add up to substantial gains.

| Technique | What it does |
|---|---|
| Prefix caching | Reuse the KV cache for shared prompt prefixes across requests |
| Chunked prefill | Split long-prompt prefill into chunks, interleaved with decode |
| KV-cache quantization | Store the KV cache in int8/int4 to fit more / longer contexts |
| FlashAttention | IO-aware exact attention kernel (Chapter 20) |
| Tensor parallelism | Split the model across GPUs for models too big for one (Ch. 18) |
| CUDA graphs | Capture and replay the decode step to cut kernel-launch overhead |
| Multi-token prediction | Predict several tokens per step (Medusa-style) |

Prefix Caching: Don't Recompute Shared Prompts

Many requests share a common prefix — the same long system prompt, the same few-shot examples, the same document being asked about repeatedly. Prefix caching stores the KV cache for that shared prefix and REUSES it across all requests that share it, instead of recomputing the prefill every time. For applications with a large fixed system prompt, this can dramatically cut TTFT and cost. It builds directly on PagedAttention's block-sharing (Section 27.7).
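
One common way to find a reusable prefix is to key each cache block by a hash of its tokens together with everything before it, so two prompts share a block only if they agree on the entire prefix up to that point. The sketch below assumes that scheme; the names and the block size are hypothetical, and it tracks only which blocks match, not the cached tensors.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; engines choose their own)

def block_keys(token_ids):
    """One key per FULL block; each key hashes the block plus all tokens before it."""
    keys, running = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, usable, BLOCK_SIZE):
        running.update(repr(token_ids[start:start + BLOCK_SIZE]).encode())
        keys.append(running.copy().hexdigest())
    return keys

class PrefixCache:
    """Maps block keys to already-computed KV blocks (represented here by plain ids)."""

    def __init__(self):
        self.blocks = {}

    def match_and_insert(self, token_ids):
        """Return how many prompt tokens can skip prefill, then register this prompt's blocks."""
        keys = block_keys(token_ids)
        reused = 0
        for key in keys:
            if key not in self.blocks:
                break
            reused += BLOCK_SIZE
        for key in keys:
            self.blocks.setdefault(key, len(self.blocks))
        return reused

cache = PrefixCache()
system_prompt = list(range(100, 112))                  # 12 shared tokens = 3 full blocks
first = cache.match_and_insert(system_prompt + [1, 2, 3, 4, 5])
second = cache.match_and_insert(system_prompt + [9, 8, 7, 6])
print(first, second)
```

The first request finds nothing to reuse and pays for the full prefill; the second skips prefill for the shared system prompt and computes only its own suffix, which is where the TTFT saving comes from.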

Chunked Prefill: Smoothing Out Long Prompts

A very long prompt makes prefill take a long time, which delays the decode of other requests sharing the batch (and spikes TTFT). Chunked prefill splits a long prefill into smaller chunks and INTERLEAVES them with the decode steps of other requests. This keeps decode latency smooth for everyone, rather than letting one giant prompt monopolize the GPU. It balances the compute-heavy prefill against the latency-sensitive decode.
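
One simple way to implement this is a per-step token budget: requests that are decoding each get their one token first, and whatever budget remains goes to the next chunk of the pending prefill. The policy, the function name, and the numbers below are assumptions for illustration; real schedulers have more rules.

```python
def schedule_step(decoding, prefill_remaining, token_budget):
    """Plan one engine step under a per-step token budget.

    decoding: number of requests currently decoding (each needs 1 token this step)
    prefill_remaining: prompt tokens still to be prefilled for a newly admitted request
    Returns (decode_tokens, prefill_chunk).
    """
    decode_tokens = min(decoding, token_budget)        # decodes go first to keep TPOT smooth
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk

# Hypothetical load: 8 requests decoding while a 2,000-token prompt arrives; budget 512/step.
remaining, steps = 2000, []
while remaining > 0:
    decode_tokens, chunk = schedule_step(decoding=8, prefill_remaining=remaining,
                                         token_budget=512)
    steps.append((decode_tokens, chunk))
    remaining -= chunk
print(steps)
```

The long prompt is spread over several steps, and in every one of them the eight decoding requests still receive their token, so their TPOT stays bounded by the budget rather than by the prompt length. The price is a somewhat higher TTFT for the long request itself.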

Infer Note: These Compose
The techniques in this chapter are not alternatives — they STACK. A production setup typically uses quantization (smaller weights) + PagedAttention (efficient memory) + continuous batching (full GPU) + prefix caching (reuse shared prompts) + speculative decoding (more tokens per pass) + FlashAttention (fast attention) all at once. Each targets a different inefficiency, so their benefits multiply.
This is why modern serving engines deliver such enormous gains over a naive implementation: they layer a dozen optimizations, each well-engineered. You rarely implement these yourself — you use an engine that bundles them, which is the subject of Section 27.12.
27.11

One more inference topic every practitioner must understand: HOW the next token is chosen from the model's output distribution. The model outputs a probability for every possible next token; the DECODING STRATEGY decides which one to actually emit. This affects output quality, diversity, and even speed, and the knobs (temperature, top-k, top-p) are ones you will tune constantly.

| Strategy | How it picks | Effect |
|---|---|---|
| Greedy | Always the highest-probability token | Deterministic, can be dull/repetitive |
| Temperature | Scales the distribution before sampling | Higher = more random/creative |
| Top-k | Sample from the k most likely tokens | Caps how wild it can get |
| Top-p (nucleus) | Sample from the smallest set summing to p | Adaptive cutoff |
| Beam search | Keep top-b sequences, pick the best | Better for closed-ended tasks |

Temperature: The Creativity Dial

Temperature is the most important sampling knob. It scales the logits before the softmax: a temperature below 1 SHARPENS the distribution (more confident, more deterministic), while above 1 FLATTENS it (more random, more diverse). As the temperature approaches 0, sampling becomes greedy decoding (always the top token); since dividing by zero is undefined, implementations treat a setting of exactly 0 as a special case meaning greedy. For factual tasks you want low temperature (precise, consistent); for creative writing you want higher (varied, surprising).

Temperature scaling
P(token) = softmax(logits / T)

T < 1: sharper  → more deterministic, picks likely tokens (factual)
T = 1: the model's natural distribution
T > 1: flatter  → more random, more diverse (creative)
T → 0: equivalent to greedy (always the top token)
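
The sharpening and flattening can be checked numerically with a few lines of standard-library Python. The three logits are hypothetical, and the helper name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T, computed stably; T must be > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                               # subtract the max to avoid overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                             # hypothetical logits for three tokens
for t in (0.25, 1.0, 4.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all the probability mass sits on the top token, and at high temperature the three tokens approach equal probability, while the ranking of the tokens never changes.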
Python: Sampling with temperature, top-k, and top-p
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Pick the next token with temperature + top-k + top-p (nucleus)."""
    # 0. Temperature 0 means greedy; handle it explicitly to avoid dividing by zero
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    # 1. Temperature: scale logits (lower = sharper)
    logits = logits / temperature

    # 2. Top-k: keep only the k highest-logit tokens
    if top_k:
        k = min(top_k, logits.size(-1))            # guard against k > vocabulary size
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))

    # 3. Top-p (nucleus): keep the smallest set summing to >= p
    probs = F.softmax(logits, dim=-1)
    sorted_p, idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_p, dim=-1)
    mask = cumulative - sorted_p > top_p       # drop the tail beyond p
    sorted_p[mask] = 0.0
    sorted_p /= sorted_p.sum(dim=-1, keepdim=True)

    # 4. Sample from the filtered distribution
    choice = torch.multinomial(sorted_p, 1)
    return idx.gather(-1, choice)

# Low temperature + greedy-ish: factual, consistent (Q&A, code).
# Higher temperature + top-p: creative, varied (stories, brainstorming).
Infer Note: Sampling Also Interacts With Speed
Decoding strategy is not purely a quality choice; it affects speed too. Greedy decoding is the cheapest. Beam search multiplies per-step compute and KV-cache memory by roughly the beam width (it tracks several sequences at once), which is one reason it is rarely used for open-ended LLM generation. And speculative decoding (Section 27.9) interacts with sampling: the draft-and-verify procedure must be adapted to preserve the exact sampling distribution, which the published methods do with a rejection-sampling acceptance rule.
For most chat and generation, temperature with top-p sampling is the default. Reserve greedy for when you want determinism (reproducible outputs, some code tasks) and beam search for narrow closed-ended tasks like translation where a single best sequence is wanted.
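The claim that speculative decoding can preserve the sampling distribution can be checked for a single token. The sketch below assumes the acceptance rule described in the speculative sampling papers (Leviathan et al., 2023; Chen et al., 2023): a token x drafted from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p − q). The function name and the toy distributions are illustrative assumptions. Instead of sampling, the code computes the output distribution exactly, so it can be compared with the target p.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-and-verify step.

    p: target-model probabilities; q: draft-model probabilities (same vocabulary).
    """
    # Probability that token x is drafted (q[x]) and then accepted (min(1, p/q)).
    accepted = [qx * min(1.0, px / qx) if qx > 0 else 0.0
                for px, qx in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    z = sum(residual)
    if z > 0:
        residual = [r / z for r in residual]

    return [a + reject_prob * r for a, r in zip(accepted, residual)]

p = [0.6, 0.3, 0.1]   # toy target distribution (assumed for illustration)
q = [0.2, 0.5, 0.3]   # toy draft distribution (assumed for illustration)
out = speculative_output_distribution(p, q)
print([round(x, 6) for x in out])
```

For these toy inputs the result matches p up to floating-point rounding, even though the draft distribution q is quite different. A poor draft model lowers the acceptance rate (and therefore the speedup), but under this rule it does not change what is sampled.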
27.12

You will almost never implement these optimizations yourself. Instead, you use a SERVING ENGINE — software that bundles quantization, PagedAttention, continuous batching, speculative decoding, and the rest into a single high-performance system. Knowing the major engines and what each is for completes the practical picture.

Engine | Strengths | Best for
vLLM | PagedAttention, continuous batching, high throughput | GPU serving at scale
TGI | Hugging Face's production server | Easy HF model deployment
TensorRT-LLM | NVIDIA's heavily-optimized engine | Max performance on NVIDIA
llama.cpp | CPU/Mac/local, GGUF, lightweight | Local & on-device inference
Ollama | Simple local model running (wraps llama.cpp) | Hobbyist / local dev
SGLang | Fast, with strong prefix caching | Complex prompting workloads

The Request Lifecycle in a Serving Engine

Tying the chapter together, here is the journey of a request through a modern serving engine, touching the optimizations we covered:

Pipeline Flow: A request through an optimized serving engine

1. Arrive: Request enters the queue; prefix cache checked for shared prompt
2. Schedule: Continuous-batching scheduler admits it into the active batch
3. Prefill: Prompt processed (chunked if long); KV cache allocated in pages
4. Decode: Tokens generated; speculative decoding drafts ahead; quantized weights
5. Stream: Tokens streamed back to the user as they are produced
6. Free: On completion, KV-cache pages return to the shared pool
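Steps 3 and 6 of the pipeline (allocating KV-cache pages on demand, then returning them to a shared pool) can be sketched with a toy allocator. This is a simplified illustration and not how vLLM or any other engine is actually implemented: it tracks page ids only, holds no tensors, and the class name `PagePool`, the request id, and the page size of 16 tokens are all assumptions made for the example.

```python
class PagePool:
    """Toy pool of fixed-size KV-cache pages (page ids only, no tensors)."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size              # tokens per page (assumed value)
        self.free_pages = list(range(num_pages))
        self.block_tables = {}                  # request id -> list of page ids
        self.lengths = {}                       # request id -> tokens stored so far

    def append_tokens(self, request_id, num_tokens):
        """Reserve extra pages only when the sequence outgrows its current ones."""
        table = self.block_tables.setdefault(request_id, [])
        new_len = self.lengths.get(request_id, 0) + num_tokens
        pages_required = -(-new_len // self.page_size)   # ceiling division
        needed = pages_required - len(table)
        if needed > len(self.free_pages):
            raise MemoryError("no free KV-cache pages; the request must wait")
        for _ in range(needed):
            table.append(self.free_pages.pop())
        self.lengths[request_id] = new_len

    def free(self, request_id):
        """Return all of a finished request's pages to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

pool = PagePool(num_pages=8)
pool.append_tokens("req-1", 40)    # prefill: 40 prompt tokens need 3 pages
pool.append_tokens("req-1", 1)     # decode: token 41 still fits in the 3rd page
pool.append_tokens("req-1", 8)     # token 49 spills into a 4th page
print(len(pool.block_tables["req-1"]), "pages in use,", len(pool.free_pages), "free")
pool.free("req-1")                 # completion: pages go back to the pool
print(len(pool.free_pages), "free after completion")
```

The point of the design is that a request never reserves memory for tokens it has not produced yet, so the unused capacity stays in the pool for other requests in the batch.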
Python: Serving with vLLM (the common case)
# pip install vllm
from vllm import LLM, SamplingParams

# Load a model; the engine handles PagedAttention, continuous
# batching, etc. automatically.
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct',
          # quantization='awq',          # set only when `model` is an AWQ-quantized checkpoint
          gpu_memory_utilization=0.9)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Submit many prompts at once -- the engine batches them continuously
prompts = ['Explain photosynthesis.', 'Write a haiku.']  # ... more prompts
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)

# PagedAttention and continuous batching are on by default; quantization and
# speculative decoding are opt-in through engine configuration.
# You write a few lines; the engine delivers production-grade throughput.
Infer Note: Don't Reinvent the Engine
The practical advice: unless you are doing research on inference itself, do NOT implement these optimizations from scratch for production. Use a battle-tested engine (vLLM for GPU serving, llama.cpp/Ollama for local). Between them, these engines implement, tune, and test the techniques in this chapter, and they keep up with new ones. Your job is to choose the right engine and configure it well.
Understanding the techniques (as this chapter teaches) is what lets you configure the engine intelligently, debug performance problems, and choose the right trade-offs — even though you let the engine do the heavy lifting. Know the concepts; use the tools.
27.13

Inference Quick-Reference

Concept | Key idea | Remember
Cost shift | Inference >> training over a model's life | Per-token cost paid forever
Prefill vs decode | Compute-bound vs memory-bound | Decode is the bottleneck
KV cache | Stores past keys/values | Dominates memory; the central object
Metrics | TTFT, TPOT, throughput | Match metric to use case
Quantization | Fewer bits per weight | int4 ~4x smaller, faster decode
GPTQ/AWQ/GGUF | Algorithms vs a file format | GPU algos vs local format
PagedAttention | Page the KV cache like OS memory | Eliminates fragmentation
Continuous batching | Reshuffle batch every step | Keeps the GPU full
Speculative decoding | Draft small, verify big | Faster, same output
Serving engines | Bundle all optimizations | Use vLLM / llama.cpp

Exercises

Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.

Exercise 1: Pen & Paper
Explain why inference cost dominates training cost over a deployed model's lifetime. What does this imply about where to focus optimization effort?
Exercise 2: Pen & Paper
Distinguish the prefill and decode phases. Why is prefill compute-bound and decode memory-bound? Which determines TTFT and which TPOT?
Exercise 3: Pen & Paper
Explain why decode underutilizes the GPU. Connect this to the two main fix families (smaller weights, batching).
Exercise 4: Derive
Compute the KV-cache size for a 7B model (32 layers, 32 attention heads, head dimension 128) at 8k context, batch size 32, in fp16. How does GQA with 8 KV heads change it?
Exercise 5: Pen & Paper
Define TTFT, TPOT, latency, and throughput, and write the latency formula. For a chatbot vs an overnight batch job, which metrics matter most?
Exercise 6: Derive
Show how per-group int8 quantization stores weights. For a 7B model, compute the memory at fp16, int8, and int4, and the decode speedup you'd expect.
Exercise 7: Pen & Paper
Explain GPTQ and AWQ at a high level and how they differ. Why does 'activation-aware' quantization preserve quality better than uniform quantization?
Exercise 8: Pen & Paper
Clarify the difference between GPTQ/AWQ and GGUF. For GPU datacenter serving vs running on a MacBook, which would you use and why?
Exercise 9: Pen & Paper
Explain the KV-cache fragmentation problem and how PagedAttention solves it using the operating-system paging analogy.
Exercise 10: Pen & Paper
Contrast static and continuous batching. Why does continuous batching keep the GPU fuller, and what does it require of the scheduler?
Exercise 11: Derive
Explain why speculative decoding produces output with the same distribution as normal decoding (and token-for-token identical output under greedy decoding). Derive the expected accepted tokens per pass as a function of acceptance rate α and draft length k.
Exercise 12: Code
Implement the KV cache for generation. Measure the speedup of cached vs uncached generation as the sequence length grows.
Exercise 13: Code
Implement per-group int8 and int4 weight quantization. Quantize a small model's weights, measure the memory saved, and the error introduced.
Exercise 14: Code
Measure quantization quality: run a small model at fp16 and int4 on a benchmark, and report the quality drop and the memory/speed gains.
Exercise 15: Code
Implement temperature, top-k, and top-p sampling from scratch. Show how each knob changes the diversity and quality of generated text.
Exercise 16: Code
Measure TTFT and TPOT: instrument a generation loop to report time to first token and time per output token separately. Vary prompt length and observe TTFT.
Exercise 17: Code Lab
Simulate static vs continuous batching: model a stream of requests with varying lengths and measure GPU utilization and throughput under each scheme.
Exercise 18: Code Lab
Implement a simplified PagedAttention KV cache: store the cache in fixed-size blocks allocated on demand, with a block table. Measure memory waste vs contiguous allocation.
Exercise 19: Code Lab
Implement speculative decoding with a small draft model and a large model. Measure the acceptance rate and the resulting speedup, and verify the output matches plain decoding (use greedy decoding so the comparison is exact).
Exercise 20: Code
Serve a model with vLLM. Submit a batch of prompts, measure throughput, then enable an AWQ-quantized model and compare memory and throughput.
Exercise 21: Code
Implement prefix caching: detect a shared prompt prefix across requests and reuse its KV cache. Measure the TTFT reduction for a fixed long system prompt.
Exercise 22: Code (Challenge)
Build a mini inference server that combines several optimizations: a paged KV cache, continuous batching of a request stream, int4-quantized weights, and temperature/top-p sampling. Benchmark it against a naive one-request-at-a-time fp16 baseline, and report the throughput and latency improvement from each optimization added in turn.

Further reading: “Efficient Memory Management for Large Language Model Serving with PagedAttention” (Kwon et al., 2023, vLLM). “GPTQ” (Frantar et al., 2022) and “AWQ” (Lin et al., 2023) for quantization. “Fast Inference from Transformers via Speculative Decoding” (Leviathan et al., 2023) and “Accelerating Large Language Model Decoding with Speculative Sampling” (Chen et al., 2023). “Medusa” (Cai et al., 2024) for multi-head speculation. “Orca” (Yu et al., 2022) for continuous batching. The vLLM, TensorRT-LLM, and llama.cpp documentation. “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019) for top-p sampling.


Next → Chapter 28: Tool Calling & Function Calling

You can now run a model fast and cheaply. But a model alone is limited to what is in its weights — it cannot look up today's weather, run a calculation reliably, query a database, or take actions in the world. Chapter 28 gives the model TOOLS: the ability to call functions, APIs, and external systems. We will see how a model is trained and prompted to decide WHEN to use a tool, format the call, and incorporate the result — the foundation of agents. The actor-to-actor message flow you saw in this chapter's diagrams becomes the agentic loop, where the model and its tools converse to accomplish tasks beyond what any single forward pass could.
