Part VI: Productionization

Chapter 31

Serving at Scale

Turning a model into a dependable product: API design, load balancing and autoscaling, SLAs and reliability, model versioning and A/B testing, and optimizing cost at the scale of millions of users.

20 Exercises

Learning Objectives

1.	Explain the systems challenges of serving an LLM to many users.
2.	Design a clean, robust LLM API.
3.	Load-balance and autoscale across many GPU replicas.
4.	Define and meet SLAs, and measure reliability with the right metrics.
5.	Route and schedule requests across a heterogeneous fleet.
6.	Version models and roll out changes safely.
7.	Run A/B tests and evaluate models in production.
8.	Build reliability: redundancy, failover, graceful degradation.
9.	Optimize and control cost at scale.
10.	Assemble the full production serving stack.

Chapter 27 made a single model run fast on a single machine. But a real product serves thousands of requests per second, from users around the world, around the clock, reliably, within a budget. That is no longer a model problem — it is a SYSTEMS problem. This final chapter of Part VI is about the engineering that turns an optimized model into a dependable, scalable SERVICE.

What Changes at Scale

A demo serving one request at a time hides nearly everything that matters in production. At scale, you must handle many concurrent users, spread load across many expensive GPUs, survive hardware failures without downtime, meet latency promises even under load spikes, roll out new models without breaking anything, and keep the enormous compute bill under control. None of this is about the model itself; all of it determines whether the product works.

Concern	One request (Ch. 27)	At scale (this chapter)
Concurrency	One at a time	Thousands per second
Hardware	One GPU	Fleets of GPUs across regions
Failure	Restart and retry	Survive failures, no downtime
Latency	Best effort	Guaranteed by an SLA
Updates	Reload the model	Safe rollouts, versioning
Cost	Run it	Optimize across the fleet

✧

Scale Note: The Model Is the Easy Part

A recurring theme of Part VI reaches its peak here: the trained model is the EASY part of a production system. The hard part is everything around it — the API, the load balancer, the autoscaler, the monitoring, the failover, the cost controls, the deployment pipeline. A brilliant model behind a fragile, slow, or expensive serving stack is a failed product; a good model behind excellent infrastructure is a great one.

This chapter is the least about machine learning and the most about systems engineering — and that is precisely the point. Deploying AI at scale is a software-and-infrastructure discipline. The ML got you a capable model; the systems engineering is what makes it a service people can depend on.

The Three Forces: Latency, Throughput, Cost

Everything in production serving balances three forces, building on Chapter 27's metrics: LATENCY (responses must be fast enough for users), THROUGHPUT (the system must handle the total load), and COST (it must fit a budget). These pull against each other — lower latency often means more cost, higher throughput can hurt latency — and the art of serving at scale is balancing all three under real-world conditions.

The API is how everyone (your own apps, external developers, other services) talks to the model. A well-designed API is the foundation of a usable service; a poorly designed one creates friction forever. Let us cover the essentials of designing an LLM API.

Core API Design Choices

Decision	Options / considerations
Streaming	Stream tokens as generated (low perceived latency) vs return all at once
Sync vs async	Immediate response vs submit-and-poll for long jobs
Stateless vs stateful	Send full context each time vs server keeps conversation state
Batching interface	Single requests vs a batch endpoint for bulk jobs
Parameters	Expose temperature, max tokens, stop sequences, etc.
Error format	Clear, consistent, actionable error responses
Versioning	Version the API so changes don't break clients

Streaming: The Most Important Choice

For interactive use, STREAMING is essential. Instead of waiting for the entire response (which could be many seconds for a long answer), the API streams tokens to the user AS THEY ARE GENERATED. The user sees text appearing immediately, dramatically improving perceived responsiveness — the response starts at the TTFT (Chapter 27), not after the full generation. Almost every interactive LLM product streams.

Python•A streaming LLM API endpoint (sketch)
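import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 1.0

class EchoModel:
    """Stand-in engine (an assumption); swap in a real one with generate_stream()."""
    async def generate_stream(self, prompt, max_tokens, temperature):
        for token in prompt.split()[:max_tokens]:
            yield token + ' '

model = EchoModel()

@app.post('/v1/generate')
async def generate(request: GenerateRequest):
    """Stream tokens to the client as they are generated."""
    async def token_stream():
        async for token in model.generate_stream(
                request.prompt, max_tokens=request.max_tokens,
                temperature=request.temperature):
            # JSON-encode so a newline inside a token cannot break SSE framing
            yield f'data: {json.dumps(token)}\n\n'   # server-sent events

    return StreamingResponse(token_stream(), media_type='text/event-stream')

# The user sees tokens appear immediately (at TTFT), not after the
# whole response is done. This is why chat UIs feel responsive.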

✧

Scale Note: Stateless Servers Scale Better

A key design principle: make the serving layer STATELESS where possible. If each request carries its full context (the whole conversation), then ANY server can handle ANY request — you can freely load-balance across the fleet and add or remove servers at will. If servers hold conversation state, requests must be 'sticky' to a particular server, which complicates load balancing and failure recovery. Statelessness is what makes horizontal scaling clean.

The trade-off is that stateless servers re-process the conversation context each turn (more compute), which prefix caching (Chapter 27) mitigates by reusing the cached prefix. The common pattern: stateless servers plus a shared cache — the best of both, scalable and efficient.

One GPU cannot serve millions of users. Production systems run many REPLICAS — copies of the model on many GPUs — with a LOAD BALANCER spreading requests across them and an AUTOSCALER adding or removing replicas as demand changes. This is the backbone of scalable serving.

Load Balancing: Spreading the Work

A load balancer sits in front of the replicas and routes each incoming request to one of them. The goal is to keep all replicas evenly busy — no replica overloaded while others sit idle. Naive strategies like round-robin (rotate through replicas) work poorly for LLMs because requests vary wildly in cost (a 10-token reply vs a 2,000-token reply). LLM-aware load balancing routes based on actual LOAD — current queue depth, KV-cache usage, or number of active requests per replica.

Device Grid: Load balancer spreading requests across GPU replicas

	GPU 0	GPU 1	GPU 2	GPU 3
Replicas	model	model	model	model
Load	60%	55%	62%	58%
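A minimal sketch of load-aware routing next to the round-robin baseline. It assumes each replica reports its number of in-flight requests; `Replica`, `pick_least_loaded`, and `pick_round_robin` are illustrative names, not the API of any particular load balancer.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Replica:
    name: str
    active_requests: int = 0  # hypothetical load signal reported by the replica


def pick_least_loaded(replicas):
    """Route to the replica with the fewest in-flight requests."""
    return min(replicas, key=lambda r: r.active_requests)


def pick_round_robin(rotation):
    """Baseline: ignore load and take the next replica in the rotation."""
    return next(rotation)


replicas = [Replica("gpu0", 7), Replica("gpu1", 2), Replica("gpu2", 5)]
rotation = cycle(replicas)

print(pick_round_robin(rotation).name)   # gpu0: next in rotation, although it is the busiest
chosen = pick_least_loaded(replicas)
chosen.active_requests += 1              # account for the request just assigned
print(chosen.name)                       # gpu1: the least-loaded replica
```

In a real fleet the load signal could equally be queue depth or KV-cache usage; the point is only that the decision uses measured load rather than position in a rotation.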

Autoscaling: Matching Capacity to Demand

Demand is not constant — it spikes during the day, drops at night, surges with viral events. AUTOSCALING adds replicas when demand rises (to maintain latency) and removes them when demand falls (to save cost). The challenge for LLMs: spinning up a new replica is SLOW — loading a large model onto a GPU can take minutes — so autoscaling must be PREDICTIVE (scale up before the spike, based on trends) rather than purely reactive, or users hit slow responses while new capacity warms up.

Strategy	How it decides
Reactive scaling	Add replicas when load/latency crosses a threshold (can lag)
Predictive scaling	Forecast demand and scale ahead of it
Scheduled scaling	Scale on known patterns (e.g. up at 9am, down at night)
Queue-based	Scale to keep request queue depth bounded
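The difference between reactive and predictive scaling can be sketched in a few lines. This is a toy under stated assumptions: the per-replica capacity, the 20% headroom, the three-minute cold start, and the naive linear forecast are all made up for illustration, and `desired_replicas` and `linear_forecast` are hypothetical helpers.

```python
import math


def desired_replicas(forecast_rps, per_replica_rps, headroom=0.2,
                     min_replicas=1, max_replicas=64):
    """Replicas needed to serve forecast_rps with a safety margin."""
    needed = math.ceil(forecast_rps * (1 + headroom) / per_replica_rps)
    return max(min_replicas, min(max_replicas, needed))


def linear_forecast(recent_rps, steps_ahead):
    """Naive trend forecast: extend the average recent slope forward."""
    if len(recent_rps) < 2:
        return recent_rps[-1]
    slope = (recent_rps[-1] - recent_rps[0]) / (len(recent_rps) - 1)
    return max(0.0, recent_rps[-1] + slope * steps_ahead)


recent = [100, 120, 140, 160]   # requests/s, sampled once per minute
cold_start_minutes = 3          # assumed time to bring a new replica online

# Reactive: size the fleet for the load seen right now.
reactive = desired_replicas(recent[-1], per_replica_rps=20)
# Predictive: size it for the load expected once new replicas are warm.
predictive = desired_replicas(linear_forecast(recent, cold_start_minutes),
                              per_replica_rps=20)
print(reactive, predictive)     # 10 14
```

The reactive policy asks for 10 replicas, which is already too few by the time they finish loading; the predictive policy asks for 14 because it sizes for demand one cold start into the future.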

✧

Scale Note: Cold Starts Are the Autoscaling Challenge

The biggest practical headache in LLM autoscaling is the COLD START: a new replica must load gigabytes of model weights onto a GPU before it can serve, which often takes tens of seconds to several minutes. By the time a reactive autoscaler notices high load and starts a replica, users have already suffered slow responses. Mitigations include keeping warm spare replicas, predictive scaling, faster model loading, and over-provisioning slightly to absorb spikes.

This is a key difference from traditional web services, where a new server starts in seconds. The sheer size of LLM weights makes capacity slow to add, which is why LLM serving leans heavily on prediction and warm pools rather than purely reactive scaling. Plan capacity for the spike, not the average.

A production service makes PROMISES about its performance — a Service Level Agreement (SLA) — and must measure whether it keeps them. Defining and meeting SLAs is central to running a dependable service, and it requires the right metrics.

What an SLA Specifies

An SLA defines the level of service users can expect, typically covering: AVAILABILITY (uptime, e.g. 99.9%), LATENCY (e.g. 'p95 time-to-first-token under 500ms'), and sometimes throughput or error rates. The service is engineered, monitored, and over-provisioned to meet these targets. Falling short has consequences — unhappy users, broken contracts, financial penalties.

SLA (Service Level Agreement)

A commitment to a measurable level of service — such as 99.9% availability and p95 latency under a threshold — that the service is engineered and monitored to meet.

Percentiles, Not Averages

A crucial lesson: measure latency with PERCENTILES, not averages. The average hides the bad cases — if 99% of requests are fast but 1% take 30 seconds, the average looks fine while some users have a terrible experience. The p95 (95th percentile) and p99 (99th percentile) latencies capture the TAIL — how bad the slow requests are. SLAs are written in percentiles ('p99 latency < 2s') because they reflect the worst experiences real users actually have.

text•Why percentiles beat averages
100 requests: 98 take 0.5s, 2 take 30s
    average  = (98×0.5 + 2×30) / 100 = 1.09s   <- looks fine!
    p99      = 30s                             <- the truth: the slowest 2% are awful

# SLAs use p95/p99 because they capture the TAIL users actually feel.
# Tail latency, not average, determines user experience at scale.
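The same comparison can be checked in code. The sketch below uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values), and a made-up sample with 2 slow requests in 100 so that the 99th-percentile rank falls inside the slow tail; `percentile` is an illustrative helper, not a library function.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies = [0.5] * 98 + [30.0] * 2   # seconds; 2% of requests are very slow

print(round(statistics.mean(latencies), 2))  # 1.09
print(percentile(latencies, 50))             # 0.5
print(percentile(latencies, 95))             # 0.5
print(percentile(latencies, 99))             # 30.0
```

Note that the median and even the p95 look perfect here; only the p99 exposes the slow requests, which is why the percentile named in an SLA has to be chosen to match the tail you care about.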

Metric	Meaning
Availability	Fraction of time the service is up (99.9% ≈ 8.8 hr/yr of downtime)
p50 / p95 / p99 latency	Median / tail latency — the slow-case experience
Error rate	Fraction of requests that fail
Throughput	Requests or tokens served per second
Saturation	How close the fleet is to capacity

✧

Scale Note: Tail Latency Is the Enemy

At scale, the TAIL (p99, p99.9) is what hurts. With millions of requests, even a tiny fraction of slow ones is a large number of frustrated users. A single user's session may also involve many requests, so the chance of hitting at least one slow one compounds. Much of serving engineering is about taming the tail: avoiding stragglers, handling load spikes, and ensuring no request waits too long behind others.

This is why LLM-aware scheduling and load balancing matter so much: a long request can block short ones (head-of-line blocking), spiking tail latency. Continuous batching (Chapter 27), fair scheduling, and routing all exist partly to keep the tail under control.
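How quickly the tail compounds over a session can be made concrete with a short calculation. It assumes, as a simplification, that requests are slow independently of one another with the same probability; `p_hit_slow` is an illustrative name.

```python
def p_hit_slow(p_slow, n_requests):
    """Probability that at least one of n independent requests is slow."""
    return 1 - (1 - p_slow) ** n_requests


# With 1% of requests in the slow tail:
for n in (1, 10, 50, 100):
    print(n, round(p_hit_slow(0.01, n), 3))
```

Under that assumption, a session of 10 requests has roughly a 10% chance of hitting the tail, and a session of 100 requests roughly a 63% chance, so a "rare" p99 event is something most heavy users experience.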

Beyond simple load balancing, a sophisticated serving system ROUTES requests intelligently and SCHEDULES them to balance competing goals. With diverse requests (short and long, cheap and expensive, free and paid) and a heterogeneous fleet (different GPUs, different models), good routing and scheduling are key to both performance and cost.

Request Routing

Routing decides WHICH replica or model handles a request. Smart routing considers: which replica is least loaded, which already has the relevant prefix cached (route a follow-up to the same replica — 'cache-aware routing'), which model size suits the request (route easy queries to a small fast model, hard ones to a large model — 'model routing'), and the request's priority tier. Good routing can dramatically improve both latency and cost.

Tool Trace: Intelligent request routing

User	Sends a follow-up message in an ongoing chat	→
Router	Sees the conversation prefix is cached on Replica 2	•
Router	Routes to Replica 2 to reuse its prefix cache (fast TTFT)	→
Replica 2	Reuses cached prefix, generates only the new turn	•
User	Gets a fast response — no re-processing the whole history	←
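The trace above can be sketched as a small router. This is a simplification under stated assumptions: it uses a conversation id as a stand-in for "which replica holds the cached prefix" (real engines typically match on hashed token prefixes), and the `max_load` cut-off and replica names are hypothetical.

```python
class CacheAwareRouter:
    """Route follow-ups to the replica that served the conversation before."""

    def __init__(self, replica_loads, max_load=8):
        self.loads = dict(replica_loads)  # replica name -> in-flight requests
        self.owner = {}                   # conversation id -> replica holding its prefix
        self.max_load = max_load          # above this, a cache hit is not worth the queueing

    def route(self, conversation_id):
        replica = self.owner.get(conversation_id)
        cache_hit = replica in self.loads and self.loads[replica] < self.max_load
        if not cache_hit:
            # No usable cached prefix: fall back to the least-loaded replica.
            replica = min(self.loads, key=self.loads.get)
            self.owner[conversation_id] = replica
        self.loads[replica] += 1
        return replica, cache_hit

    def finish(self, replica):
        """Call when a request completes so load counts stay accurate."""
        self.loads[replica] -= 1


router = CacheAwareRouter({"replica-1": 3, "replica-2": 1, "replica-3": 2})
first = router.route("chat-42")       # no cache yet: goes to the least-loaded replica
router.finish(first[0])
follow_up = router.route("chat-42")   # prefix cached: goes back to the same replica
print(first, follow_up)               # ('replica-2', False) ('replica-2', True)
```

The load cap matters: cache affinity is a preference, not a rule, and an overloaded replica should lose the follow-up even though it holds the prefix.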

Scheduling: Who Goes First

Within a replica, the SCHEDULER decides the order requests are served and how they share the GPU (building on continuous batching from Chapter 27). It balances fairness (no request starves), priority (paid or interactive requests first), and efficiency (keep the batch full). A key concern is preventing a few huge requests from monopolizing the GPU and starving everyone else — so schedulers may preempt, chunk, or cap long generations to protect the tail latency of short ones.

Routing/scheduling idea	Benefit
Cache-aware routing	Route to the replica with the prefix cached → fast TTFT
Model routing	Easy queries to small models, hard to large → save cost
Priority tiers	Interactive/paid requests served first
Fair scheduling	No request starves behind big ones
Preemption / chunking	Long requests don't block short ones
Geographic routing	Serve from the nearest region → lower latency

✧

Scale Note: Model Routing Saves Real Money

A powerful cost lever: not every query needs your biggest, most expensive model. MODEL ROUTING sends easy queries (simple questions, short completions) to a small, fast, cheap model, and reserves the large, expensive model for genuinely hard queries. A classifier or the small model itself can decide. Since most queries are easy, routing the bulk of traffic to a cheaper model can cut costs dramatically while preserving quality where it matters.

This echoes the adaptive-compute idea from reasoning (Chapter 25): spend expensive capability only where it is needed. At the fleet level, model routing is one of the highest-impact cost optimizations — matching each request to the cheapest model that can handle it well.
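A toy illustration of the saving, under loudly stated assumptions: the model names, the relative prices, the keyword heuristic, and the 80/20 traffic mix are all invented for this sketch. A production router would use a trained classifier and would also measure the quality impact.

```python
SMALL, LARGE = "small-model", "large-model"      # hypothetical model names
COST_PER_1K_TOKENS = {SMALL: 0.10, LARGE: 1.00}  # made-up relative prices

HARD_HINTS = ("prove", "derive", "debug", "step by step", "analyze")


def choose_model(prompt, max_easy_words=40):
    """Toy difficulty heuristic: long prompts or 'hard' keywords go to the large model."""
    text = prompt.lower()
    is_hard = len(text.split()) > max_easy_words or any(h in text for h in HARD_HINTS)
    return LARGE if is_hard else SMALL


def fleet_cost(prompts, tokens_per_reply=500, router=choose_model):
    """Total cost if every reply is tokens_per_reply tokens long."""
    return sum(COST_PER_1K_TOKENS[router(p)] * tokens_per_reply / 1000 for p in prompts)


prompts = (["What is the capital of France?"] * 8
           + ["Debug this race condition step by step."] * 2)
routed = fleet_cost(prompts)
all_large = fleet_cost(prompts, router=lambda p: LARGE)
print(round(routed, 2), round(all_large, 2))     # 1.4 5.0
```

With these invented numbers, routing cuts the bill from 5.0 to 1.4 units; the real saving depends entirely on the price gap and on how much traffic is genuinely easy.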

Models are updated regularly — improved versions, fine-tunes, fixes. But a new model can behave differently, and a bad update can degrade quality or break downstream systems for millions of users at once. MODEL VERSIONING and SAFE ROLLOUT practices let you change models without risking the whole product.

Why Versioning Matters

Users and downstream systems may depend on a model's specific behaviour. Silently swapping in a new model can break prompts that were tuned for the old one, change output formats that code parses, or shift quality in unexpected ways. VERSIONING means each model version is named and addressable, old versions remain available for a transition period, and clients can pin to a specific version. This lets the system evolve without yanking the ground out from under anyone.

Safe Rollout Strategies

Strategy	How it works
Canary release	Send a small % of traffic to the new model; watch metrics; expand if good
Blue-green	Run old (blue) and new (green) in parallel; switch over instantly; roll back fast
Shadow / mirror	Send traffic to the new model WITHOUT using its output, to compare safely
Gradual rollout	Ramp traffic to the new model slowly (1% → 10% → 50% → 100%)
Feature flags	Toggle the new model on/off per user or cohort instantly

Pipeline Flow: A safe model rollout

1	Shadow test	Run the new model on real traffic without serving its output; compare
2	Canary	Serve the new model to 1% of users; watch quality, latency, errors
3	Ramp	Gradually increase: 1% → 10% → 50% as metrics stay healthy
4	Full + monitor	Roll out to 100%, keep the old version ready to roll back
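One common way to implement the canary and ramp stages is deterministic bucketing by user id, so a given user consistently sees the same model as the percentage grows. The sketch below assumes that approach; the salt, model names, stage list, and error-rate tolerance are hypothetical choices, not values from any specific system.

```python
import hashlib


def bucket(user_id, salt="rollout-v2"):
    """Map a user to a stable bucket in [0, 100) so assignment survives restarts."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def model_for(user_id, canary_percent, old="model-v1", new="model-v2"):
    """Users whose bucket falls below canary_percent get the new model."""
    return new if bucket(user_id) < canary_percent else old


def next_stage(canary_percent, error_rate, baseline_error_rate, tolerance=0.005):
    """Advance the ramp if the canary looks healthy; otherwise roll back to 0%."""
    if error_rate > baseline_error_rate + tolerance:
        return 0
    later = [s for s in (1, 10, 50, 100) if s > canary_percent]
    return later[0] if later else 100


share = sum(model_for(f"user-{i}", 10) == "model-v2" for i in range(10_000)) / 10_000
print(share)                                                       # close to 0.10
print(next_stage(1, error_rate=0.010, baseline_error_rate=0.010))  # 10: healthy, advance
print(next_stage(10, error_rate=0.050, baseline_error_rate=0.010)) # 0: degraded, roll back
```

Because the bucket is a pure function of the user id, anyone in the 1% canary is also in the 10% and 50% stages, and rolling back is just setting the percentage to zero.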

✧

Scale Note: Always Be Able to Roll Back

The golden rule of safe rollouts: NEVER deploy a change you cannot quickly undo. Keep the previous model version warm and ready, and have a one-click (or automatic) rollback if metrics degrade. Even with careful shadow testing and canaries, a new model can surprise you in production at full scale. The ability to roll back in seconds turns a potential disaster into a minor blip.

This is standard practice from software deployment, applied to models: gradual rollout, continuous monitoring, instant rollback. A model is just another component being deployed — treat its rollout with the same discipline as any production change, because at scale the blast radius of a bad model is enormous.

How do you know a new model or prompt is actually BETTER, not just different? Offline benchmarks (Chapter 21) help, but the real test is how it performs with REAL users on REAL traffic. A/B TESTING and production evaluation answer this rigorously, building on the evaluation discipline from Part IV.

A/B Testing

In an A/B test, you serve the OLD model (A) to one randomly chosen group of users and the NEW model (B) to another, then compare outcomes: user satisfaction, task completion, engagement, thumbs-up rates, retention. Because users are randomly assigned, differences in outcomes that are larger than chance alone would explain can be attributed to the model change. This is the gold standard for deciding whether a change genuinely improves the product, not just the benchmark.

Tool Trace: An A/B test in production

User pool	Randomly split into group A and group B	•
Group A	Served the current model (control)	→
Group B	Served the new model (treatment)	→
Metrics	Compare satisfaction, completion, retention between A and B	•
Decision	If B is significantly better, roll it out; else, don't	←
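What "significantly better" means in the decision step can be sketched with a standard two-proportion z-test on a binary outcome such as thumbs-up rate. This is one reasonable choice of test, not the only one; the counts below are invented, and `two_proportion_z_test` is an illustrative helper written with the standard library only.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: thumbs-up out of 10,000 rated responses per group.
z, p_value = two_proportion_z_test(success_a=4000, n_a=10_000,
                                   success_b=4200, n_b=10_000)
print(round(z, 2), p_value < 0.05)
```

With these counts the two-point lift gives a z-score of about 2.9 and a p-value well under 0.05; the same lift measured on a few hundred users per group would not be distinguishable from noise, which is why sample size is part of the experiment design.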

What to Measure

Production evaluation goes beyond benchmark scores to real outcomes: explicit feedback (thumbs up/down, ratings), implicit signals (did the user retry, rephrase, or abandon?), task completion, latency, and cost. The key discipline — echoing the 'distrust the proxy' lesson of Chapter 23 — is to measure what actually MATTERS to users, not just what is easy to measure. A model that scores higher on a benchmark but frustrates real users is not an improvement.

✧

Scale Note: Online and Offline Evaluation Complement

Offline benchmarks (Chapter 21) are fast, cheap, and reproducible but may not reflect real use. Online A/B tests measure real-user impact but are slower, costlier, and noisier. The two complement each other: use offline evals to catch regressions and screen candidates quickly, then A/B test the promising ones to confirm real-world benefit before full rollout. Neither alone is sufficient.

And both connect to the deployment safety of Section 31.6: shadow testing, canaries, and A/B tests form a graduated evaluation pipeline — each stage exposes the new model to more real traffic with more confidence, so problems surface before they reach everyone. Evaluation is not a one-time gate but a continuous part of safe operation.

At scale, failures are not exceptional — they are constant. GPUs fail, networks hiccup, replicas crash, dependencies time out, traffic spikes. A reliable service is designed to keep working DESPITE failures, not to assume they won't happen. Fault tolerance is what separates a service that has occasional bad days from one users can depend on.

Designing for Failure

Technique	What it protects against
Redundancy	Many replicas — one failing doesn't take down the service
Health checks	Detect and remove unhealthy replicas automatically
Failover	Reroute requests from a failed replica to healthy ones
Retries (with backoff)	Transient failures — retry, but not in a way that amplifies load
Timeouts	Don't let one stuck request hang forever
Circuit breakers	Stop hammering a failing dependency; fail fast
Graceful degradation	Fall back to a smaller model / cached / simpler response
Rate limiting	Protect the system from overload and abuse

Graceful Degradation

A crucial reliability idea: when the system is overloaded or partially failing, DEGRADE GRACEFULLY rather than collapse. Under extreme load, it is better to serve everyone a slightly worse response (a smaller, faster model, a cached answer, a shorter generation) than to serve some users perfectly while others get errors or time out. A service that bends under pressure beats one that breaks.
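Two rows of the table, circuit breakers and graceful degradation, combine naturally: when the primary path keeps failing, stop calling it for a while and serve a degraded fallback instead. The sketch below is a minimal single-threaded version; the thresholds are arbitrary, and `CircuitBreaker` is an illustrative class, not a library API.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: skip the failing dependency entirely
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = fn()
        except Exception:           # sketch only: real code should catch specific errors
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()       # degrade gracefully instead of erroring
        self.failures = 0
        return result


def large_model():
    raise TimeoutError("primary model overloaded")   # simulated failure


breaker = CircuitBreaker(failure_threshold=2)
answers = [breaker.call(large_model, lambda: "short answer from small model")
           for _ in range(3)]
print(answers[0], breaker.opened_at is not None)
```

After the second failure the breaker opens, so the third call never touches the overloaded primary at all; every caller still gets an answer, just a simpler one.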

⚠️

Retries Can Make Things Worse

A subtle but important danger: naive retries can turn a small problem into a catastrophe. If a service slows down and every client immediately retries, the retries MULTIPLY the load, pushing the struggling service further over the edge — a 'retry storm' that can cause a full outage. Always use retries with EXPONENTIAL BACKOFF (wait longer between each retry) and JITTER (randomize the timing), and cap the number of retries, so recovery is helped, not hindered.

This is a classic distributed-systems lesson that applies fully to LLM serving: the mechanisms meant to improve reliability (retries) can destroy it if implemented carelessly. Reliability engineering is full of such counterintuitive traps, which is why it is its own discipline — and why serving at scale is far more than just running the model.
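A minimal sketch of capped retries with exponential backoff and jitter. It uses the "full jitter" variant (each wait is drawn uniformly between zero and the current exponential ceiling), which is one common choice among several; the base delay, cap, and retry count are arbitrary, and `backoff_delays` and `call_with_retries` are illustrative names.

```python
import random
import time


def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_retries):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call fn(); on failure wait a jittered, growing delay, then retry up to a cap."""
    for delay in backoff_delays(max_retries, **backoff_kwargs):
        try:
            return fn()
        except Exception:   # sketch only: real code should retry transient errors only
            sleep(delay)
    return fn()             # final attempt: let any exception reach the caller


attempts = []


def flaky():
    """Simulated dependency that fails twice, then recovers."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient failure")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, len(attempts), len(waits))               # ok 3 2
```

The jitter is what breaks up a retry storm: without it, every client that failed at the same moment retries at the same moment, and the exponential schedule merely spaces out the stampedes.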

LLM serving is expensive — GPUs cost a lot, and at scale the bill is enormous. Cost optimization is not an afterthought; it can be the difference between a viable product and an unprofitable one. Fortunately, many levers reduce cost, layering on top of the per-request optimizations of Chapter 27.

The Levers of Cost

Lever	How it cuts cost
Quantization	Smaller models → cheaper GPUs, more throughput (Ch. 27)
Continuous batching	More requests per GPU → fewer GPUs needed (Ch. 27)
Model routing	Cheap model for easy queries, big model only when needed
Caching	Reuse results for repeated/similar queries; prefix caching
Right-sizing	Match GPU type to the workload; don't over-provision
Autoscaling	Don't pay for idle capacity at off-peak times
Spot / preemptible	Use cheaper interruptible instances for batch work
Distillation	Serve a smaller distilled model where quality allows

The Cost-per-Token Mindset

The fundamental unit of LLM serving cost is COST PER TOKEN (or per request). Every optimization ultimately aims to lower it: quantization and batching lower the GPU cost of producing each token; routing and caching avoid producing tokens with an expensive model when a cheaper path suffices. Tracking cost per token across the fleet, and per feature, reveals where the money goes and where optimization pays off most.

text•Cost per token (the key unit)
cost_per_token ≈ (GPU $/hour) / (tokens/hour per GPU)

Lower it by:
  ↑ tokens/hour: batching, quantization, better kernels (Ch. 27)
  ↓ GPU $/hour: right-sizing, spot instances, cheaper hardware
  avoid costly tokens: caching (skip generation), routing easy queries to cheap models
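The formula can be worked through with numbers. The GPU price and throughput figures below are invented purely to show the arithmetic; `cost_per_million_tokens` is an illustrative helper.

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Dollar cost to generate one million tokens on one GPU at steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Invented figures: a $2/hour GPU before and after a 2.5x throughput improvement.
baseline = cost_per_million_tokens(gpu_dollars_per_hour=2.0, tokens_per_second=1000)
batched = cost_per_million_tokens(gpu_dollars_per_hour=2.0, tokens_per_second=2500)
print(round(baseline, 3), round(batched, 3))   # 0.556 0.222
```

Because throughput sits in the denominator, a 2.5x throughput gain at the same hourly price cuts cost per token by the same factor of 2.5.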

✧

Scale Note: Caching Is Free Money

Caching deserves special emphasis as a cost lever. Many real workloads have repeated or near-repeated queries (the same question, the same system prompt, common requests). A response-cache hit (exact or semantic) means NOT running the model at all: the cheapest possible token is one you never generate. Reusing prefix KV caches (Chapter 27) still runs the model, but skips recomputing the shared prefix. For workloads with repetition, caching can cut costs substantially. Exact-match and prefix caching do so with no quality loss, while semantic caching risks returning the answer to a question that was only similar, so its match threshold needs care.

Combine the levers and the savings multiply: quantization + batching make each token cheaper, routing + caching avoid expensive or redundant tokens, and autoscaling avoids paying for idle GPUs. Cost optimization at scale is about stacking many such savings, each modest, into a large total reduction.
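A minimal sketch of an exact-match response cache. It makes two assumptions worth stating: only deterministic requests (temperature 0) are cached, since a sampled response is not a stable answer, and the key includes the model version so a rollout never serves stale outputs. `ResponseCache` is an illustrative class, not a library API.

```python
from collections import OrderedDict


class ResponseCache:
    """Exact-match LRU cache for deterministic requests (temperature == 0)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def get_or_generate(self, model_version, prompt, temperature, generate):
        if temperature != 0:
            return generate(prompt)           # sampled output: do not cache
        key = (model_version, prompt)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        result = self.entries[key] = generate(prompt)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return result


model_calls = []


def fake_generate(prompt):
    """Stand-in for an expensive model call."""
    model_calls.append(prompt)
    return f"answer to: {prompt}"


cache = ResponseCache(capacity=2)
for _ in range(5):
    cache.get_or_generate("v1", "What are your opening hours?", 0, fake_generate)
print(len(model_calls), cache.hits, cache.misses)   # 1 4 1
```

Five identical requests cost one model call. In a multi-tenant service the key would also need a tenant identifier so that one customer's cached answers are never served to another.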

A production service must be OBSERVABLE — you must be able to see what it is doing, detect problems quickly, and diagnose them. Monitoring and observability are what let you operate a service reliably, catch issues before users do, and understand what is happening across a large fleet.

What to Monitor

Category	What to watch
Performance	Latency (p50/p95/p99), TTFT, TPOT, throughput
Reliability	Error rates, availability, failed/timed-out requests
Capacity	GPU utilization, queue depth, KV-cache usage, saturation
Cost	Tokens served, cost per token, spend by feature/customer
Quality	Feedback signals, refusal rates, output anomalies
Traffic	Request volume, patterns, geographic distribution

Alerting and Dashboards

Monitoring data feeds two things: DASHBOARDS (live views of the system's health for humans to inspect) and ALERTS (automatic notifications when something crosses a threshold — latency spiking, errors rising, a replica down). Good alerting catches problems early, ideally before users notice; good dashboards let engineers diagnose and resolve them quickly. The aim is to know about and fix issues faster than users experience them.

Monitoring Quality, Not Just Systems

Beyond system metrics (latency, errors), production LLM serving must monitor OUTPUT QUALITY — which is harder. Watch for spikes in refusals (the model suddenly declining too much), output anomalies, drops in user feedback, and shifts in behaviour after a deployment. Quality regressions can be subtle and invisible to system metrics: the service is 'up' and fast, but the model's answers got worse. Monitoring quality signals — not just whether requests succeed — is essential and distinctive to ML serving.

✧

Scale Note: Watch Quality, Not Just Uptime

Traditional service monitoring asks 'is it up and fast?'. LLM serving must also ask 'are the answers still good?'. A model can be perfectly available and fast while silently producing worse outputs — after a bad deployment, a data shift, or a prompt change. Because quality is hard to measure automatically, teams combine proxy signals (feedback rates, refusal rates, output-length distributions, sample audits) to detect quality regressions that system metrics miss.

This is the observability frontier unique to AI products: monitoring not just the SYSTEM but the MODEL'S BEHAVIOUR. It ties back to evaluation (Section 31.7 and Chapter 21) — production quality monitoring is continuous evaluation on live traffic, the last line of defense against shipping a regression to everyone.

A public LLM service faces security, privacy, and abuse challenges beyond performance and reliability. Serving at scale means defending against misuse, protecting user data, and enforcing usage policies — responsibilities that grow with the service's reach.

Concern	Defense
Abuse / misuse	Safety filtering (Ch. 26), usage policies, monitoring, bans
Prompt injection	Treat inputs as untrusted; sandbox tools (Ch. 28)
Data privacy	Encrypt data; minimize retention; honor deletion; isolate tenants
DoS / overload	Rate limiting, quotas, authentication
Data leakage	Prevent one user's data leaking to another; careful caching
Cost attacks	Quotas and limits so abuse can't run up huge bills

Privacy and Multi-tenancy

When a service handles many users' or organizations' data ('multi-tenant'), strict ISOLATION is essential: one tenant's data, cache entries, and context must never leak to another. This shapes caching (don't share caches across tenants carelessly), logging (be careful what you store), and data handling (encryption, retention limits, deletion on request). Privacy is both an ethical obligation and, increasingly, a legal requirement.

Rate Limiting and Quotas

Rate limiting protects the service from overload and abuse, and controls cost. By capping how many requests or tokens a user can consume per time window, the service prevents any single user (malicious or buggy) from overwhelming capacity or running up unbounded cost. Tiered quotas (more for paying customers) also implement the business model. Rate limiting is a basic but essential layer of any public LLM API.
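A token bucket is one common way to implement such a cap, because it allows short bursts while bounding the sustained rate. The sketch below is single-threaded and in-memory; the rates and capacities are arbitrary, and `TokenBucket` is an illustrative class rather than any framework's API.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` units per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost=1):
        """Spend `cost` units (requests or LLM tokens) if the budget allows."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


# One bucket per user; a tiered plan simply gets a larger rate and capacity.
bucket = TokenBucket(rate=1, capacity=3)
print([bucket.allow() for _ in range(5)])   # [True, True, True, False, False]
```

Passing `cost` as the number of LLM tokens requested, rather than 1 per request, turns the same structure into a token quota, which tracks actual compute cost more closely than a request count does.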

✧

Scale Note: Safety Is Part of Serving

The safety techniques of Chapter 26 (content filtering, refusal, harmlessness) are not just training concerns — they are enforced at SERVING time too. A production system layers input and output safety filters around the model, monitors for abuse patterns, and enforces usage policies. Serving safely at scale means combining the model's trained-in safety with system-level guardrails: filtering, monitoring, rate limits, and the ability to respond quickly to newly-discovered misuse.

This completes the safety picture from Part V: training instills safety into the model, and serving enforces it operationally. Neither alone is sufficient — robust real-world safety comes from defense in depth across both the model and the system that serves it.

Let us assemble the whole chapter, and much of Part VI, into a picture of a complete production LLM serving stack, following a request from the user through to the streamed response.

Pipeline Flow: A request through the full production stack

1	Gateway	Authentication, rate limiting, input safety filtering
2	Router	Cache-aware + model routing to the right replica/model
3	Load balancer	Spread across healthy replicas in the autoscaled fleet
4	Serving engine	e.g. vLLM: paged KV cache, continuous batching, quantized weights (Ch. 27)
5	Generate & stream	Tokens streamed back through output safety filtering
6	Observe	Log latency, cost, quality signals; alert on anomalies

The Layers of a Serving System

Arch Stack: The production serving stack, layer by layer

API / Gateway	auth, rate limits, safety filtering, versioning
Routing & load balancing	cache-aware, model routing, autoscaling
Serving engines	vLLM/TGI: batching, paging, quantization
GPU fleet	many replicas across regions, with failover
Observability	monitoring, alerting, A/B testing, cost tracking

✧

Scale Note: It All Composes — and It's a Team Sport

The production stack layers everything from Part VI: the per-request optimizations of Chapter 27 (inside the serving engine), the tool and RAG capabilities of Chapters 28–29, the multi-modal handling of Chapter 30, and this chapter's systems engineering around them. Each layer has a job, and together they turn a model into a service that is fast, reliable, safe, and affordable at scale.

And building this is a TEAM effort spanning ML, systems, infrastructure, and operations — a reminder that deploying AI in the real world is a multidisciplinary engineering endeavor, not just a modeling exercise. The model is where it starts; the production stack is what makes it matter to real users.

Serving-at-Scale Quick-Reference

Concept	Key idea	Remember
Scale = systems	Serving is a systems problem	The model is the easy part
API design	Stream tokens; stateless servers	Streaming = responsive UX
Load balancing	Spread load across replicas	LLM-aware, not round-robin
Autoscaling	Match capacity to demand	Cold starts → scale predictively
SLAs	Promise & measure service	Percentiles, not averages
Routing	Send to the right replica/model	Cache-aware + model routing
Versioning	Safe, reversible rollouts	Always be able to roll back
A/B testing	Measure real-user impact	Online + offline complement
Reliability	Survive failures	Graceful degradation; careful retries
Cost	Lower cost per token	Quantize, batch, route, cache

Exercises

Exercises 1–10 are pen-and-paper or design; 11–20 require code.

✎

Exercise 1: Pen & Paper

Explain why serving at scale is a systems problem, not a model problem. List five things that change between serving one request and serving millions.

✎

Exercise 2: Pen & Paper

Why is streaming essential for interactive LLM products? Connect it to TTFT (Chapter 27) and perceived responsiveness.

✎

Exercise 3: Pen & Paper

Explain why stateless servers scale better than stateful ones. What trade-off do they create, and how does prefix caching address it?

✎

Exercise 4: Pen & Paper

Why does round-robin load balancing work poorly for LLMs? Describe an LLM-aware load-balancing strategy and why it's better.

✎

Exercise 5: Pen & Paper

Explain the cold-start problem in LLM autoscaling and why it forces predictive rather than purely reactive scaling.

✎

Exercise 6: Pen & Paper

Why are SLAs written in percentiles (p95/p99) rather than averages? Construct an example where the average looks fine but the tail is terrible.

✎

Exercise 7: Pen & Paper

Describe cache-aware routing and model routing. How does each improve latency or cost?

✎

Exercise 8: Pen & Paper

Compare canary, blue-green, and shadow rollout strategies. Why must you always be able to roll back?

✎

Exercise 9: Pen & Paper

Explain A/B testing for models. Why is it considered the gold standard compared with offline benchmarks, and how do the two complement each other?

✎

Exercise 10: Pen & Paper

Explain how naive retries can cause a retry storm, and how exponential backoff with jitter prevents it.

✎

Exercise 11: Code

Build a streaming LLM API endpoint that sends tokens as server-sent events. Measure the perceived latency improvement vs returning the full response.
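
One possible starting point, sketched with the standard library only: it shows the server-sent-event framing (a `data:` line followed by a blank line) and a way to time the first event against the full response. The names `fake_model`, `sse_stream`, and `measure`, and the `[DONE]` end marker, are assumptions made for illustration. Wiring the generator into a real HTTP framework is left to the exercise.

```python
import json
import time
from typing import Iterable, Iterator, Tuple


def fake_model(n_tokens: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model that emits one token every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"


def sse_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one server-sent event: a `data:` line plus a blank line."""
    for tok in tokens:
        yield f"data: {json.dumps({'token': tok})}\n\n"
    # End-of-stream marker: an assumed convention, not part of the SSE format itself.
    yield "data: [DONE]\n\n"


def measure(n_tokens: int, delay: float) -> Tuple[float, float]:
    """Return (time to first event, time to last event) in seconds."""
    start = time.perf_counter()
    first = None
    for _event in sse_stream(fake_model(n_tokens, delay)):
        if first is None:
            first = time.perf_counter() - start
    return first, time.perf_counter() - start


first, total = measure(n_tokens=20, delay=0.01)
print(f"first event after {first:.3f}s, full response after {total:.3f}s")
```

A non-streaming endpoint would make the user wait for roughly the second number before seeing anything; a streaming one shows output after roughly the first.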

✎

Exercise 12: Code

Implement an LLM-aware load balancer that routes to the replica with the lowest current load (queue depth or active requests). Compare its tail latency to round-robin under variable request sizes.
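
A minimal simulation to build on, under simplifying assumptions that should be relaxed in a full solution: each replica is modeled as a single first-in-first-out queue (no batching), request sizes are bimodal (mostly short, occasionally long), and the balancer can see each replica's outstanding work exactly. A real balancer would use a proxy such as queue depth or active requests. The function names and parameters are hypothetical.

```python
import random


def simulate(policy: str, n_replicas: int = 4, n_requests: int = 5000, seed: int = 0):
    """Return per-request latency (queueing wait + service time) under a routing policy."""
    rng = random.Random(seed)
    free_at = [0.0] * n_replicas  # time at which each replica finishes its queued work
    now, latencies = 0.0, []
    for i in range(n_requests):
        now += rng.expovariate(3.0)  # random arrival gap, 3 requests/s on average
        # Variable request sizes: 5% long generations, 95% short ones (assumed mix).
        service = 10.0 if rng.random() < 0.05 else 0.5
        if policy == "round_robin":
            r = i % n_replicas
        else:
            # "least_loaded": pick the replica with the least outstanding work.
            r = min(range(n_replicas), key=lambda k: max(free_at[k] - now, 0.0))
        start = max(now, free_at[r])
        free_at[r] = start + service
        latencies.append(free_at[r] - now)
    return latencies


def p99(values):
    """99th-percentile latency by simple rank lookup."""
    ordered = sorted(values)
    return ordered[int(0.99 * len(ordered))]


for policy in ("round_robin", "least_loaded"):
    print(policy, "p99 latency:", round(p99(simulate(policy)), 2))
```

Both policies see the identical arrival stream because the random generator is seeded and the routing decision consumes no random numbers, so any difference in the tail comes from routing alone.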

✎

Exercise 13: Code Lab

Simulate autoscaling: model a fluctuating request stream and an autoscaler with a cold-start delay. Compare reactive vs predictive scaling on latency and cost.

✎

Exercise 14: Code

Compute and plot p50, p95, and p99 latency from a stream of request timings. Show how an average can hide a bad tail.
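
A small sketch of the computation, using the nearest-rank definition of a percentile (other definitions interpolate between samples and give slightly different values). The latency data are invented for illustration; plotting is left to the exercise.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented example: 98 fast requests at 100 ms and 2 stalled requests at 5000 ms.
latencies_ms = [100] * 98 + [5000] * 2

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)
for p in (50, 95, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

Here the mean is 198 ms and both p50 and p95 are 100 ms, yet p99 is 5000 ms: one user in fifty waits five seconds, which neither the average nor the median reveals.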

✎

Exercise 15: Code

Implement cache-aware routing: route follow-up requests in a conversation to the replica that holds the cached prefix, and measure the TTFT improvement.
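
A toy model to start from, assuming that TTFT is proportional to the number of prompt tokens a replica must recompute, that caches are never evicted, and that each turn's prompt is the whole conversation so far. `Replica`, `CacheAwareRouter`, and `run` are hypothetical names; a full solution would add eviction, load limits, and real timing.

```python
from itertools import cycle


class Replica:
    def __init__(self, name):
        self.name = name
        self.cached = {}    # conversation id -> prefix tokens held in this replica's KV cache
        self.assigned = 0   # number of conversations homed here

    def prefill(self, conv_id, prompt_len):
        """Return the tokens that must be recomputed (a proxy for TTFT), then cache the prompt."""
        hit = min(self.cached.get(conv_id, 0), prompt_len)
        self.cached[conv_id] = prompt_len
        return prompt_len - hit


class CacheAwareRouter:
    def __init__(self, replicas):
        self.replicas = replicas
        self.home = {}      # conversation id -> replica that holds its prefix

    def route(self, conv_id):
        if conv_id not in self.home:
            # New conversation: home it on the replica with the fewest conversations.
            replica = min(self.replicas, key=lambda r: r.assigned)
            replica.assigned += 1
            self.home[conv_id] = replica
        return self.home[conv_id]


def run(policy, turns=8, tokens_per_turn=200, n_replicas=4):
    """Total recomputed prefill tokens for one conversation under a routing policy."""
    replicas = [Replica(f"r{i}") for i in range(n_replicas)]
    router = CacheAwareRouter(replicas)
    round_robin = cycle(replicas)
    total = 0
    for turn in range(1, turns + 1):
        replica = router.route("conv-1") if policy == "cache_aware" else next(round_robin)
        total += replica.prefill("conv-1", turn * tokens_per_turn)
    return total


print("cache-aware:", run("cache_aware"), "tokens recomputed")
print("round-robin:", run("round_robin"), "tokens recomputed")
```

With these settings the cache-aware router recomputes 1600 tokens over eight turns (only the 200 new tokens each turn), while round-robin recomputes 5200 because each replica holds a stale prefix.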

✎

Exercise 16: Code

Implement model routing: a classifier (or heuristic) sends easy queries to a small model and hard ones to a large model. Measure the cost savings and any quality change.

✎

Exercise 17: Code Lab

Simulate an A/B test: assign simulated users to model A or B, generate outcome metrics with a real difference, and run a significance test to decide if B is better.
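
One way to begin, assuming a binary success metric (for example, whether the user accepted the response) and a one-sided two-proportion z-test; other metrics and tests are equally valid choices. The success rates, sample size, and function names are invented for illustration. Hashing the user ID gives each user a stable assignment across sessions.

```python
import hashlib
import math
import random
from statistics import NormalDist


def assign(user_id: str) -> str:
    """Deterministically assign a user to arm A or B by hashing the ID."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"


def one_sided_p_value(succ_a, n_a, succ_b, n_b):
    """p-value for H1 'B has a higher success rate than A' (pooled two-proportion z-test)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (succ_b / n_b - succ_a / n_a) / se
    return 1.0 - NormalDist().cdf(z)


def simulate(rates, n_users=40_000, seed=0):
    """Return {arm: [successes, users]} for simulated users with the given true rates."""
    rng = random.Random(seed)
    counts = {"A": [0, 0], "B": [0, 0]}
    for uid in range(n_users):
        arm = assign(f"user-{uid}")
        counts[arm][0] += rng.random() < rates[arm]
        counts[arm][1] += 1
    return counts


counts = simulate({"A": 0.10, "B": 0.12})  # assumed true success rates
p = one_sided_p_value(counts["A"][0], counts["A"][1], counts["B"][0], counts["B"][1])
print(counts, "p-value:", p)
```

A useful extension is to rerun the simulation with identical rates for both arms (an A/A test) and check how often the test falsely declares a winner.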

✎

Exercise 18: Code

Implement retries with exponential backoff and jitter. Simulate a struggling service and show that naive retries cause a storm while backoff allows recovery.
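
A minimal retry helper to build the simulation around, using the "full jitter" variant (sleep for a uniformly random time between zero and an exponentially growing, capped bound). The parameter defaults and function names are assumptions. The `sleep` and `rng` arguments are injectable so that a simulation or test can run without real waiting.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep, rng=random):
    """Call fn(); on failure wait a jittered, exponentially growing delay and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Sketch only: a real client should retry transient errors, not every exception.
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))


# Upper bounds grow 0.1, 0.2, 0.4, ... seconds; the jitter spreads clients across each window.
print([round(backoff_delay(a, rng=random.Random(a)), 3) for a in range(5)])
```

In the naive variant every client retries immediately, so a struggling service receives its failed load again at once. Here the growing bound reduces retry pressure over time, and the randomness keeps clients that failed together from retrying together.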

✎

Exercise 19: Code

Build a cost dashboard: track tokens served and cost per token across simulated replicas, broken down by request type. Identify the biggest cost driver.

✎

Exercise 20: Code (Challenge)

Build a mini serving system that ties Part VI together: a streaming API gateway with rate limiting, an LLM-aware load balancer over several simulated replicas (each using continuous batching from Chapter 27), cache-aware and model routing, autoscaling with cold-start modeling, p99-latency and cost-per-token monitoring, and a safe canary rollout of a 'new model'. Drive it with a realistic fluctuating workload and report how latency, throughput, and cost respond — then deliberately fail a replica and show your failover keeps the SLA.

Further reading: “Orca: A Distributed Serving System for Transformer-Based Generative Models” (Yu et al., 2022). The vLLM, TensorRT-LLM, and Ray Serve documentation for production serving. Google's Site Reliability Engineering (SRE) book for SLAs, monitoring, and reliability principles. “The Tail at Scale” (Dean & Barroso, 2013) for tail-latency engineering. Literature on A/B testing and online experimentation (e.g. Kohavi et al.). Cloud providers' guidance on GPU autoscaling and cost optimization for inference workloads.

Part VI Complete: Inference, Tools & Deployment

Ch. 27	Inference Optimization	KV cache, quantization, PagedAttention, continuous batching, speculative decoding — making a model fast and cheap.
Ch. 28	Tool Calling & Function Use	JSON-schema tools, structured output, ReAct, reliable agents — letting the model act in the world.
Ch. 29	Retrieval-Augmented Generation	dense retrieval, vector DBs, chunking, hybrid search, reranking — grounding answers in external knowledge.
Ch. 30	Multi-modal LLMs	vision encoders, CLIP, LLaVA, audio — teaching the model to see and hear via a shared embedding space.
Ch. 31	Serving at Scale	API design, load balancing, SLAs, versioning, A/B testing, cost — turning the model into a dependable service.

You have now taken a model all the way from raw mathematics to a deployed, scalable, multi-modal, tool-using service. Across six Parts you have built the foundations (Part I), classical methods (Part II), the Transformer (Part III), pretraining (Part IV), alignment (Part V), and deployment (Part VI). Part VII — Frontier Techniques — turns to the cutting edge and the open horizon: Mixture-of-Experts architectures that scale models efficiently (Chapter 32), long-context and memory methods that extend how much a model can attend to (Chapter 33), agents and multi-agent systems that push tool use to its limits (Chapter 34), and the open problems that remain unsolved at the frontier of the field (Chapter 35). Having mastered how LLMs work and how to deploy them, you are ready to explore where they are going — and to contribute to what comes next.
