Serving at Scale
Chapter 27 made a single model run fast on a single machine. But a real product serves thousands of requests per second, from users around the world, around the clock, reliably, within a budget. That is no longer a model problem — it is a SYSTEMS problem. This final chapter of Part VI is about the engineering that turns an optimized model into a dependable, scalable SERVICE.
What Changes at Scale
A demo serving one request at a time hides nearly everything that matters in production. At scale, you must handle many concurrent users, spread load across many expensive GPUs, survive hardware failures without downtime, meet latency promises even under load spikes, roll out new models without breaking anything, and keep the enormous compute bill under control. None of this is about the model itself; all of it determines whether the product works.
| Concern | One request (Ch. 27) | At scale (this chapter) |
|---|---|---|
| Concurrency | One at a time | Thousands per second |
| Hardware | One GPU | Fleets of GPUs across regions |
| Failure | Restart and retry | Survive failures, no downtime |
| Latency | Best effort | Guaranteed by an SLA |
| Updates | Reload the model | Safe rollouts, versioning |
| Cost | Run it | Optimize across the fleet |
The Three Forces: Latency, Throughput, Cost
Everything in production serving balances three forces, building on Chapter 27's metrics: LATENCY (responses must be fast enough for users), THROUGHPUT (the system must handle the total load), and COST (it must fit a budget). These pull against each other — lower latency often means more cost, higher throughput can hurt latency — and the art of serving at scale is balancing all three under real-world conditions.
The API is how everyone — your own apps, external developers, other services — talks to the model. A well-designed API is the foundation of a usable service; a poorly-designed one creates friction forever. Let us cover the essentials of designing an LLM API.
Core API Design Choices
| Decision | Options / considerations |
|---|---|
| Streaming | Stream tokens as generated (low perceived latency) vs return all at once |
| Sync vs async | Immediate response vs submit-and-poll for long jobs |
| Stateless vs stateful | Send full context each time vs server keeps conversation state |
| Batching interface | Single requests vs a batch endpoint for bulk jobs |
| Parameters | Expose temperature, max tokens, stop sequences, etc. |
| Error format | Clear, consistent, actionable error responses |
| Versioning | Version the API so changes don't break clients |
Streaming: The Most Important Choice
For interactive use, STREAMING is essential. Instead of waiting for the entire response, which can take many seconds for a long answer, the API streams tokens to the user AS THEY ARE GENERATED. The user sees text appear almost immediately, which dramatically improves perceived responsiveness: the first text arrives at the time to first token (TTFT, Chapter 27), not after the full generation. Almost every interactive LLM product streams.
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel

    app = FastAPI()

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 256      # illustrative default
        temperature: float = 1.0   # illustrative default

    # `model` is assumed to be an already-loaded engine that exposes an
    # async generator `generate_stream`; it is not defined in this snippet.

    @app.post('/v1/generate')
    async def generate(request: GenerateRequest):
        """Stream tokens to the client as they are generated."""
        async def token_stream():
            async for token in model.generate_stream(
                    request.prompt, max_tokens=request.max_tokens,
                    temperature=request.temperature):
                # Server-sent events; tokens containing newlines need escaping.
                yield f'data: {token}\n\n'
        return StreamingResponse(token_stream(), media_type='text/event-stream')

    # The user sees tokens appear immediately (at TTFT), not after the
    # whole response is done. This is why chat UIs feel responsive.

One GPU cannot serve millions of users. Production systems run many REPLICAS (copies of the model on many GPUs), with a LOAD BALANCER spreading requests across them and an AUTOSCALER adding or removing replicas as demand changes. This is the backbone of scalable serving.
Load Balancing: Spreading the Work
A load balancer sits in front of the replicas and routes each incoming request to one of them. The goal is to keep all replicas evenly busy — no replica overloaded while others sit idle. Naive strategies like round-robin (rotate through replicas) work poorly for LLMs because requests vary wildly in cost (a 10-token reply vs a 2,000-token reply). LLM-aware load balancing routes based on actual LOAD — current queue depth, KV-cache usage, or number of active requests per replica.
Device Grid: Load balancer spreading requests across GPU replicas
| | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| Replicas | model | model | model | model |
| Load | 60% | 55% | 62% | 58% |
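Load-aware selection can be sketched in a few lines. This is a minimal illustration, assuming each replica reports its active request count and KV-cache utilization; the `Replica` class, the field names, and the equal weighting in `load_score` are assumptions made for the example, not the interface of any particular serving framework.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    active_requests: int   # requests currently being generated
    kv_cache_used: float   # fraction of KV-cache memory in use, 0.0 to 1.0
    healthy: bool = True

def load_score(r: Replica, max_active: int = 32) -> float:
    """Blend two load signals into one score (lower is better).
    The 50/50 weighting is an arbitrary choice for illustration."""
    return 0.5 * (r.active_requests / max_active) + 0.5 * r.kv_cache_used

def pick_replica(replicas: list[Replica]) -> Replica:
    """Route to the healthy replica with the lowest load score."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replicas available")
    return min(candidates, key=load_score)

fleet = [
    Replica("gpu-0", active_requests=19, kv_cache_used=0.60),
    Replica("gpu-1", active_requests=12, kv_cache_used=0.55),
    Replica("gpu-2", active_requests=20, kv_cache_used=0.62),
    Replica("gpu-3", active_requests=3, kv_cache_used=0.10, healthy=False),
]
chosen = pick_replica(fleet)
print(chosen.name)
```

Round-robin would ignore both signals; here the unhealthy replica is skipped even though it looks idle, and the request goes to the least-loaded healthy one.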
Autoscaling: Matching Capacity to Demand
Demand is not constant — it spikes during the day, drops at night, surges with viral events. AUTOSCALING adds replicas when demand rises (to maintain latency) and removes them when demand falls (to save cost). The challenge for LLMs: spinning up a new replica is SLOW — loading a large model onto a GPU can take minutes — so autoscaling must be PREDICTIVE (scale up before the spike, based on trends) rather than purely reactive, or users hit slow responses while new capacity warms up.
| Strategy | How it decides |
|---|---|
| Reactive scaling | Add replicas when load/latency crosses a threshold (can lag) |
| Predictive scaling | Forecast demand and scale ahead of it |
| Scheduled scaling | Scale on known patterns (e.g. up at 9am, down at night) |
| Queue-based | Scale to keep request queue depth bounded |
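The strategies in the table can be combined in one sizing rule. The sketch below is an assumption-laden illustration: demand and capacity are measured in concurrent requests, the forecast is supplied by some external predictor, and the function name, the 25% headroom, and the replica limits are all invented for the example.

```python
import math

def desired_replicas(current_demand: int, forecast_demand: int,
                     per_replica_capacity: int, headroom: float = 0.25,
                     min_replicas: int = 1, max_replicas: int = 64) -> int:
    """Replicas needed for the larger of current and forecast demand.

    Sizing for the forecast lets new replicas finish loading the model
    before the load arrives; headroom absorbs short bursts."""
    demand = max(current_demand, forecast_demand)
    needed = math.ceil(demand * (1 + headroom) / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(200, 0, per_replica_capacity=16))    # reactive view only
print(desired_replicas(200, 400, per_replica_capacity=16))  # forecast predicts a spike
```

Passing a forecast of zero reduces this to purely reactive scaling, which makes the lag problem visible: capacity is only requested after demand has already risen.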
A production service makes PROMISES about its performance — a Service Level Agreement (SLA) — and must measure whether it keeps them. Defining and meeting SLAs is central to running a dependable service, and it requires the right metrics.
What an SLA Specifies
An SLA defines the level of service users can expect, typically covering: AVAILABILITY (uptime, e.g. 99.9%), LATENCY (e.g. 'p95 time-to-first-token under 500ms'), and sometimes throughput or error rates. The service is engineered, monitored, and over-provisioned to meet these targets. Falling short has consequences — unhappy users, broken contracts, financial penalties.
Percentiles, Not Averages
A crucial lesson: measure latency with PERCENTILES, not averages. The average hides the bad cases — if 99% of requests are fast but 1% take 30 seconds, the average looks fine while some users have a terrible experience. The p95 (95th percentile) and p99 (99th percentile) latencies capture the TAIL — how bad the slow requests are. SLAs are written in percentiles ('p99 latency < 2s') because they reflect the worst experiences real users actually have.
    100 requests: 98 take 0.5s, 2 take 30s
    average = (98×0.5 + 2×30) / 100 = 1.09s   <- looks tolerable
    p50     = 0.5s                            <- the typical request is fast
    p99     = 30s                             <- the truth: the slow tail is awful
    # SLAs use p95/p99 because they capture the TAIL users actually feel.
    # Tail latency, not average, determines user experience at scale.

| Metric | Meaning |
|---|---|
| Availability | Fraction of time the service is up (99.9% ≈ 8.8 hours of downtime per year) |
| p50 / p95 / p99 latency | Median / tail latency — the slow-case experience |
| Error rate | Fraction of requests that fail |
| Throughput | Requests or tokens served per second |
| Saturation | How close the fleet is to capacity |
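The gap between the average and the tail is easy to check numerically. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions) on made-up latencies: 100 requests, two of which are very slow, so that the 99th percentile falls unambiguously inside the slow tail.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies = [0.5] * 98 + [30.0] * 2   # seconds; illustrative data
mean = sum(latencies) / len(latencies)

print(f"mean = {mean:.2f}s")
print(f"p50  = {percentile(latencies, 50)}s")
print(f"p95  = {percentile(latencies, 95)}s")
print(f"p99  = {percentile(latencies, 99)}s")
```

Note that p95 still reports 0.5s here: a percentile only exposes a tail that is at least as large as the fraction it leaves out, which is why services with strict targets track p99 or beyond.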
Beyond simple load balancing, a sophisticated serving system ROUTES requests intelligently and SCHEDULES them to balance competing goals. With diverse requests (short and long, cheap and expensive, free and paid) and a heterogeneous fleet (different GPUs, different models), good routing and scheduling are key to both performance and cost.
Request Routing
Routing decides WHICH replica or model handles a request. Smart routing considers: which replica is least loaded, which already has the relevant prefix cached (route a follow-up to the same replica — 'cache-aware routing'), which model size suits the request (route easy queries to a small fast model, hard ones to a large model — 'model routing'), and the request's priority tier. Good routing can dramatically improve both latency and cost.
Tool Trace: Intelligent request routing
| Actor | Action | |
|---|---|---|
| User | Sends a follow-up message in an ongoing chat | → |
| Router | Sees the conversation prefix is cached on Replica 2 | • |
| Router | Routes to Replica 2 to reuse its prefix cache (fast TTFT) | → |
| Replica 2 | Reuses cached prefix, generates only the new turn | • |
| User | Gets a fast response, with no re-processing of the whole history | ← |
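The trace above can be sketched as a small router. This is a simplification under stated assumptions: the conversation ID stands in for real prefix matching, load is a simple count of active requests, and the class name, `max_load`, and the replica names are hypothetical.

```python
class CacheAwareRouter:
    """Prefer the replica that last served a conversation, unless it is overloaded."""

    def __init__(self, replicas: list[str], max_load: int = 8):
        self.load = {name: 0 for name in replicas}  # active requests per replica
        self.owner: dict[str, str] = {}             # conversation id -> replica
        self.max_load = max_load

    def route(self, conversation_id: str) -> str:
        cached_on = self.owner.get(conversation_id)
        if cached_on is not None and self.load[cached_on] < self.max_load:
            chosen = cached_on                          # reuse the cached prefix
        else:
            chosen = min(self.load, key=self.load.get)  # fall back to least loaded
            self.owner[conversation_id] = chosen
        self.load[chosen] += 1
        return chosen

    def finish(self, replica: str) -> None:
        self.load[replica] -= 1                         # call when a request completes

router = CacheAwareRouter(["replica-0", "replica-1", "replica-2"])
first = router.route("chat-42")
router.route("chat-7")
follow_up = router.route("chat-42")
print(first, follow_up)
```

The overload check matters: without it, a popular conversation prefix would pin ever more traffic to one replica, trading a cache hit for a long queue.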
Scheduling: Who Goes First
Within a replica, the SCHEDULER decides the order requests are served and how they share the GPU (building on continuous batching from Chapter 27). It balances fairness (no request starves), priority (paid or interactive requests first), and efficiency (keep the batch full). A key concern is preventing a few huge requests from monopolizing the GPU and starving everyone else — so schedulers may preempt, chunk, or cap long generations to protect the tail latency of short ones.
| Routing/scheduling idea | Benefit |
|---|---|
| Cache-aware routing | Route to the replica with the prefix cached → fast TTFT |
| Model routing | Easy queries to small models, hard to large → save cost |
| Priority tiers | Interactive/paid requests served first |
| Fair scheduling | No request starves behind big ones |
| Preemption / chunking | Long requests don't block short ones |
| Geographic routing | Serve from the nearest region → lower latency |
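Priority and fairness can be combined with a single ordering key. The sketch below is one simple approach, not a description of how any specific engine schedules: each request is ordered by its arrival number plus a per-tier penalty, so paid traffic jumps ahead, but a free request can only be overtaken by a bounded number of later arrivals. The class, tier numbering, and penalty value are assumptions for illustration.

```python
import heapq
import itertools

class PriorityScheduler:
    """Lower tier numbers are served first, but nothing waits forever."""

    def __init__(self, tier_penalty: int = 3):
        self._heap = []
        self._arrivals = itertools.count()
        self.tier_penalty = tier_penalty

    def submit(self, request_id: str, tier: int) -> None:
        seq = next(self._arrivals)
        # A tier-1 request is treated as if it had arrived `tier_penalty`
        # positions later, so at most that many later tier-0 requests pass it.
        key = seq + tier * self.tier_penalty
        heapq.heappush(self._heap, (key, seq, request_id))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
sched.submit("free-1", tier=1)
for i in range(1, 5):
    sched.submit(f"paid-{i}", tier=0)
order = [sched.next_request() for _ in range(5)]
print(order)
```

A strict priority queue would serve all four paid requests first; with the penalty key, the free request is served once it has waited its bounded turn.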
Models are updated regularly — improved versions, fine-tunes, fixes. But a new model can behave differently, and a bad update can degrade quality or break downstream systems for millions of users at once. MODEL VERSIONING and SAFE ROLLOUT practices let you change models without risking the whole product.
Why Versioning Matters
Users and downstream systems may depend on a model's specific behaviour. Silently swapping in a new model can break prompts that were tuned for the old one, change output formats that code parses, or shift quality in unexpected ways. VERSIONING means each model version is named and addressable, old versions remain available for a transition period, and clients can pin to a specific version. This lets the system evolve without yanking the ground out from under anyone.
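Version pinning can be illustrated with a tiny resolver. Everything named here is hypothetical: the version strings, the aliases, and the function are invented for the example and do not correspond to a real model registry.

```python
# Hypothetical registry contents, for illustration only.
AVAILABLE_VERSIONS = {"chat-v1", "chat-v2", "chat-v3"}
ALIASES = {"chat-latest": "chat-v3", "chat-stable": "chat-v2"}

def resolve_model(requested: str) -> str:
    """Return the concrete model version that should serve a request.

    Clients that pin an exact version keep getting it until it is retired;
    clients that use an alias follow whatever the alias currently points to."""
    version = ALIASES.get(requested, requested)
    if version not in AVAILABLE_VERSIONS:
        raise ValueError(f"unknown or retired model version: {requested!r}")
    return version

print(resolve_model("chat-v1"))      # pinned client
print(resolve_model("chat-latest"))  # client that opts in to updates
```

Moving an alias is then an explicit, reversible operation, and a pinned client is never affected by it.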
Safe Rollout Strategies
| Strategy | How it works |
|---|---|
| Canary release | Send a small % of traffic to the new model; watch metrics; expand if good |
| Blue-green | Run old (blue) and new (green) in parallel; switch over instantly; roll back fast |
| Shadow / mirror | Send traffic to the new model WITHOUT using its output, to compare safely |
| Gradual rollout | Ramp traffic to the new model slowly (1% → 10% → 50% → 100%) |
| Feature flags | Toggle the new model on/off per user or cohort instantly |
Pipeline Flow: A safe model rollout
| Step | Stage | What happens |
|---|---|---|
| 1 | Shadow test | Run the new model on real traffic without serving its output; compare |
| 2 | Canary | Serve the new model to 1% of users; watch quality, latency, errors |
| 3 | Ramp | Gradually increase: 1% → 10% → 50% as metrics stay healthy |
| 4 | Full + monitor | Roll out to 100%, keep the old version ready to roll back |
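The canary and ramp steps need a stable way to decide which users see the new model. A common approach, sketched here under assumptions, is to hash each user into one of 100 buckets and compare the bucket against the current rollout percentage. The salt, the model names, and the function names are illustrative.

```python
import hashlib

def bucket(user_id: str, salt: str = "rollout-v2") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def model_for(user_id: str, canary_percent: int,
              stable: str = "model-v1", canary: str = "model-v2") -> str:
    """Serve the canary model to users whose bucket is below the threshold."""
    return canary if bucket(user_id) < canary_percent else stable

users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    share = sum(model_for(u, percent) == "model-v2" for u in users) / len(users)
    print(f"{percent:>3}% target -> {share:.1%} of users on the new model")
```

Because the bucket depends only on the user and the salt, a user who enters the canary at 1% stays in it as the percentage ramps up, and setting the percentage back to 0 is an immediate rollback.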
How do you know a new model or prompt is actually BETTER, not just different? Offline benchmarks (Chapter 21) help, but the real test is how it performs with REAL users on REAL traffic. A/B TESTING and production evaluation answer this rigorously, building on the evaluation discipline from Part IV.
A/B Testing
In an A/B test, you serve the OLD model (A) to one randomly-chosen group of users and the NEW model (B) to another, then compare outcomes — user satisfaction, task completion, engagement, thumbs-up rates, retention. Because users are randomly assigned, differences in outcomes can be attributed to the model change. This is the gold standard for deciding whether a change genuinely improves the product, not just the benchmark.
Tool Trace: An A/B test in production
| Actor | Action | |
|---|---|---|
| User pool | Randomly split into group A and group B | • |
| Group A | Served the current model (control) | → |
| Group B | Served the new model (treatment) | → |
| Metrics | Compare satisfaction, completion, retention between A and B | • |
| Decision | If B is significantly better, roll it out; else, don't | ← |
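"Significantly better" in the decision step needs a statistical test. One standard choice for a rate metric such as thumbs-up rate is the two-proportion z-test, sketched below with made-up counts. It relies on a normal approximation that is reasonable for large samples, and a real experiment would also fix the sample size and metric in advance.

```python
import math

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, standard normal
    return z, p_value

# Illustrative counts: thumbs-up out of 10,000 rated responses per group.
z, p_value = two_proportion_z_test(4000, 10_000, 4200, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these counts, a two-point lift (40% to 42%) yields a p-value well below the conventional 0.05 threshold; the same lift on a few hundred users per group would not.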
What to Measure
Production evaluation goes beyond benchmark scores to real outcomes: explicit feedback (thumbs up/down, ratings), implicit signals (did the user retry, rephrase, or abandon?), task completion, latency, and cost. The key discipline — echoing the 'distrust the proxy' lesson of Chapter 23 — is to measure what actually MATTERS to users, not just what is easy to measure. A model that scores higher on a benchmark but frustrates real users is not an improvement.
At scale, failures are not exceptional — they are constant. GPUs fail, networks hiccup, replicas crash, dependencies time out, traffic spikes. A reliable service is designed to keep working DESPITE failures, not to assume they won't happen. Fault tolerance is what separates a service that has occasional bad days from one users can depend on.
Designing for Failure
| Technique | What it protects against |
|---|---|
| Redundancy | Many replicas — one failing doesn't take down the service |
| Health checks | Detect and remove unhealthy replicas automatically |
| Failover | Reroute requests from a failed replica to healthy ones |
| Retries (with backoff) | Transient failures — retry, but not in a way that amplifies load |
| Timeouts | Don't let one stuck request hang forever |
| Circuit breakers | Stop hammering a failing dependency; fail fast |
| Graceful degradation | Fall back to a smaller model / cached / simpler response |
| Rate limiting | Protect the system from overload and abuse |
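The "retries (with backoff)" row deserves a concrete shape, because naive retries are a common way to turn a brief outage into a sustained overload. The sketch below uses capped exponential backoff with random jitter; `TransientError` and the parameter defaults are assumptions for the example.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying transient failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # jitter: wait a random time in [0, cap)
```

The growing cap gives a struggling dependency time to recover, the jitter keeps many clients from retrying in lockstep, and the attempt limit bounds the extra load any one request can create. The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting.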
Graceful Degradation
A crucial reliability idea: when the system is overloaded or partially failing, DEGRADE GRACEFULLY rather than collapse. Under extreme load, it is better to serve everyone a slightly-worse response (a smaller faster model, a cached answer, a shorter generation) than to serve some users perfectly while others get errors or time out. A service that bends under pressure beats one that breaks.
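Graceful degradation often takes the form of a fallback chain. The sketch below is a minimal illustration: the three backends are stand-in functions invented for the example, and a real system would catch narrower error types and enforce timeouts on each call.

```python
def generate_with_fallback(prompt, backends):
    """Try backends in order of preference; return (name, text) from the first that works."""
    failures = []
    for name, generate in backends:
        try:
            return name, generate(prompt)
        except Exception as exc:  # in practice, catch narrower error types
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(failures))

# Hypothetical backends standing in for real model clients.
def large_model(prompt):
    raise TimeoutError("large model is overloaded")

def small_model(prompt):
    return f"[small model] short answer to: {prompt}"

def canned_response(prompt):
    return "The service is busy. Please try again shortly."

served_by, text = generate_with_fallback(
    "Summarize this ticket",
    [("large", large_model), ("small", small_model), ("canned", canned_response)],
)
print(served_by, "->", text)
```

Returning the backend name alongside the text lets monitoring count how often the service is running in a degraded mode.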
LLM serving is expensive — GPUs cost a lot, and at scale the bill is enormous. Cost optimization is not an afterthought; it can be the difference between a viable product and an unprofitable one. Fortunately, many levers reduce cost, layering on top of the per-request optimizations of Chapter 27.
The Levers of Cost
| Lever | How it cuts cost |
|---|---|
| Quantization | Smaller models → cheaper GPUs, more throughput (Ch. 27) |
| Continuous batching | More requests per GPU → fewer GPUs needed (Ch. 27) |
| Model routing | Cheap model for easy queries, big model only when needed |
| Caching | Reuse results for repeated/similar queries; prefix caching |
| Right-sizing | Match GPU type to the workload; don't over-provision |
| Autoscaling | Don't pay for idle capacity at off-peak times |
| Spot / preemptible | Use cheaper interruptible instances for batch work |
| Distillation | Serve a smaller distilled model where quality allows |
The Cost-per-Token Mindset
The fundamental unit of LLM serving cost is COST PER TOKEN (or per request). Every optimization ultimately aims to lower it: quantization and batching lower the GPU cost of producing each token; routing and caching avoid producing tokens with an expensive model when a cheaper path suffices. Tracking cost per token across the fleet, and per feature, reveals where the money goes and where optimization pays off most.
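To make the cost-per-token idea concrete, here is a small calculation. The GPU price and throughput figures are made-up round numbers chosen for easy arithmetic, not measurements or quotes from any provider.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second_per_gpu: float) -> float:
    """Dollars spent on GPU time to produce one million tokens."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(2.00, 1000)       # illustrative numbers
better_batching = cost_per_million_tokens(2.00, 2000)
cheaper_gpu = cost_per_million_tokens(1.00, 1000)

print(f"baseline:        ${baseline:.3f} per 1M tokens")
print(f"2x throughput:   ${better_batching:.3f} per 1M tokens")
print(f"half-price GPU:  ${cheaper_gpu:.3f} per 1M tokens")
```

Doubling throughput and halving the hourly price have the same effect on cost per token, which is why serving-engine efficiency and hardware choice are both worth tracking.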
    cost_per_token ≈ (GPU $/hour) / (tokens/hour per GPU)
    Lower it by:
      ↑ tokens/hour:  batching, quantization, better kernels (Ch. 27)
      ↓ GPU $/hour:   right-sizing, spot instances, cheaper hardware
      avoid tokens:   caching, routing easy queries to cheap models

A production service must be OBSERVABLE: you must be able to see what it is doing, detect problems quickly, and diagnose them. Monitoring and observability are what let you operate a service reliably, catch issues before users do, and understand what is happening across a large fleet.
What to Monitor
| Category | What to watch |
|---|---|
| Performance | Latency (p50/p95/p99), TTFT, TPOT, throughput |
| Reliability | Error rates, availability, failed/timed-out requests |
| Capacity | GPU utilization, queue depth, KV-cache usage, saturation |
| Cost | Tokens served, cost per token, spend by feature/customer |
| Quality | Feedback signals, refusal rates, output anomalies |
| Traffic | Request volume, patterns, geographic distribution |
Alerting and Dashboards
Monitoring data feeds two things: DASHBOARDS (live views of the system's health for humans to inspect) and ALERTS (automatic notifications when something crosses a threshold — latency spiking, errors rising, a replica down). Good alerting catches problems early, ideally before users notice; good dashboards let engineers diagnose and resolve them quickly. The aim is to know about and fix issues faster than users experience them.
Monitoring Quality, Not Just Systems
Beyond system metrics (latency, errors), production LLM serving must monitor OUTPUT QUALITY — which is harder. Watch for spikes in refusals (the model suddenly declining too much), output anomalies, drops in user feedback, and shifts in behaviour after a deployment. Quality regressions can be subtle and invisible to system metrics: the service is 'up' and fast, but the model's answers got worse. Monitoring quality signals — not just whether requests succeed — is essential and distinctive to ML serving.
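A refusal-rate spike is one quality signal that can be watched mechanically. The sketch below assumes some upstream component already labels each response as a refusal or not; the class name, window size, baseline rate, and alert factor are all illustrative choices rather than recommended values.

```python
from collections import deque

class RefusalRateMonitor:
    """Alert when the recent refusal rate far exceeds an expected baseline."""

    def __init__(self, window: int = 500, baseline: float = 0.02,
                 factor: float = 3.0, min_samples: int = 100):
        self.outcomes = deque(maxlen=window)  # 1 = refusal, 0 = normal answer
        self.threshold = baseline * factor
        self.min_samples = min_samples

    def record(self, refused: bool) -> bool:
        """Record one response; return True if an alert should fire."""
        self.outcomes.append(1 if refused else 0)
        if len(self.outcomes) < self.min_samples:
            return False  # too little data to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold

monitor = RefusalRateMonitor()
quiet = [monitor.record(False) for _ in range(200)]   # normal traffic
spike = [monitor.record(True) for _ in range(20)]     # sudden run of refusals
print(any(quiet), spike[0], spike[-1])
```

System metrics would show nothing unusual during such a spike, since every request still succeeds quickly; only a signal derived from the outputs themselves reveals it.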
A public LLM service faces security, privacy, and abuse challenges beyond performance and reliability. Serving at scale means defending against misuse, protecting user data, and enforcing usage policies — responsibilities that grow with the service's reach.
| Concern | Defense |
|---|---|
| Abuse / misuse | Safety filtering (Ch. 26), usage policies, monitoring, bans |
| Prompt injection | Treat inputs as untrusted; sandbox tools (Ch. 28) |
| Data privacy | Encrypt data; minimize retention; honor deletion; isolate tenants |
| DoS / overload | Rate limiting, quotas, authentication |
| Data leakage | Prevent one user's data leaking to another; careful caching |
| Cost attacks | Quotas and limits so abuse can't run up huge bills |
Privacy and Multi-tenancy
When a service handles many users' or organizations' data ('multi-tenant'), strict ISOLATION is essential: one tenant's data, cache entries, and context must never leak to another. This shapes caching (don't share caches across tenants carelessly), logging (be careful what you store), and data handling (encryption, retention limits, deletion on request). Privacy is both an ethical obligation and, increasingly, a legal requirement.
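One concrete consequence for caching is that the tenant must be part of every cache key. The sketch below shows one way to build such a key; the function and its fields are assumptions for illustration, and real systems may scope caches per tenant in other ways, such as separate cache instances.

```python
import hashlib

def cache_key(tenant_id: str, model_version: str, prompt: str) -> str:
    """Build a response-cache key that can never collide across tenants."""
    h = hashlib.sha256()
    for part in (tenant_id, model_version, prompt):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

same_prompt = "What is our refund policy?"
key_a = cache_key("tenant-a", "model-v1", same_prompt)
key_b = cache_key("tenant-b", "model-v1", same_prompt)
print(key_a == key_b)
```

Two tenants asking the identical question get different keys, so one tenant's cached answer, which may reflect its private context, is never served to the other.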
Rate Limiting and Quotas
Rate limiting protects the service from overload and abuse, and controls cost. By capping how many requests or tokens a user can consume per time window, the service prevents any single user (malicious or buggy) from overwhelming capacity or running up unbounded cost. Tiered quotas (more for paying customers) also implement the business model. Rate limiting is a basic but essential layer of any public LLM API.
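A token bucket is one widely used way to implement such a cap. In the sketch below, each user would get their own bucket; the refill rate and capacity are illustrative, and the `cost` argument lets a request be charged by the number of LLM tokens it consumes rather than counted as a single unit.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` units per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A fake clock makes the behaviour easy to follow without real waiting.
fake_now = [0.0]
limiter = TokenBucket(rate=1.0, capacity=3, clock=lambda: fake_now[0])
burst = [limiter.allow() for _ in range(4)]
fake_now[0] = 2.0                      # two seconds pass
later = [limiter.allow() for _ in range(3)]
print(burst, later)
```

Tiered quotas then amount to giving paying customers a bucket with a higher rate or capacity.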
Let us assemble the whole chapter — and much of Part VI — into a picture of a complete production LLM serving stack, from the user's request to the response and back.
Pipeline Flow: A request through the full production stack
| Step | Stage | What happens |
|---|---|---|
| 1 | Gateway | Authentication, rate limiting, input safety filtering |
| 2 | Router | Cache-aware + model routing to the right replica/model |
| 3 | Load balancer | Spread across healthy replicas in the autoscaled fleet |
| 4 | Serving engine | vLLM: paged KV cache, continuous batching, quantized (Ch. 27) |
| 5 | Generate & stream | Tokens streamed back through output safety filtering |
| 6 | Observe | Log latency, cost, quality signals; alert on anomalies |
The Layers of a Serving System
Arch Stack: The production serving stack, layer by layer
| Layer | Responsibilities |
|---|---|
| API / Gateway | auth, rate limits, safety filtering, versioning |
| Routing & load balancing | cache-aware, model routing, autoscaling |
| Serving engines | vLLM/TGI: batching, paging, quantization |
| GPU fleet | many replicas across regions, with failover |
| Observability | monitoring, alerting, A/B testing, cost tracking |
Serving-at-Scale Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Scale = systems | Serving is a systems problem | The model is the easy part |
| API design | Stream tokens; stateless servers | Streaming = responsive UX |
| Load balancing | Spread load across replicas | LLM-aware, not round-robin |
| Autoscaling | Match capacity to demand | Cold starts → scale predictively |
| SLAs | Promise & measure service | Percentiles, not averages |
| Routing | Send to the right replica/model | Cache-aware + model routing |
| Versioning | Safe, reversible rollouts | Always be able to roll back |
| A/B testing | Measure real-user impact | Online + offline complement |
| Reliability | Survive failures | Graceful degradation; careful retries |
| Cost | Lower cost per token | Quantize, batch, route, cache |
Exercises
Exercises 1–10 are pen-and-paper or design; 11–20 require code.
Further reading: “Orca: A Distributed Serving System for Transformer-Based Generative Models” (Yu et al., 2022). The vLLM, TensorRT-LLM, and Ray Serve documentation for production serving. Google's Site Reliability Engineering (SRE) book for SLAs, monitoring, and reliability principles. “The Tail at Scale” (Dean & Barroso, 2013) for tail-latency engineering. Literature on A/B testing and online experimentation (e.g. Kohavi et al.). Cloud providers' guidance on GPU autoscaling and cost optimization for inference workloads.
Part VI Complete: Inference, Tools & Deployment
| Chapter | Title | Topics | Outcome |
|---|---|---|---|
| Ch. 27 | Inference Optimization | KV cache, quantization, PagedAttention, continuous batching, speculative decoding | Making a model fast and cheap |
| Ch. 28 | Tool Calling & Function Use | JSON-schema tools, structured output, ReAct, reliable agents | Letting the model act in the world |
| Ch. 29 | Retrieval-Augmented Generation | Dense retrieval, vector DBs, chunking, hybrid search, reranking | Grounding answers in external knowledge |
| Ch. 30 | Multi-modal LLMs | Vision encoders, CLIP, LLaVA, audio | Teaching the model to see and hear via a shared embedding space |
| Ch. 31 | Serving at Scale | API design, load balancing, SLAs, versioning, A/B testing, cost | Turning the model into a dependable service |
You have now taken a model all the way from raw mathematics to a deployed, scalable, multi-modal, tool-using service. Across six Parts you have built the foundations (Part I), classical methods (Part II), the Transformer (Part III), pretraining (Part IV), alignment (Part V), and deployment (Part VI). Part VII — Frontier Techniques — turns to the cutting edge and the open horizon: Mixture-of-Experts architectures that scale models efficiently (Chapter 32), long-context and memory methods that extend how much a model can attend to (Chapter 33), agents and multi-agent systems that push tool use to its limits (Chapter 34), and the open problems that remain unsolved at the frontier of the field (Chapter 35). Having mastered how LLMs work and how to deploy them, you are ready to explore where they are going — and to contribute to what comes next.