Key takeaways
- Capacity is constrained by memory, active tokens, queueing, bandwidth, tool dependencies, and latency, not only by model parameter count.
- Size replicas from Goodput at target SLOs, not from maximum benchmark throughput.
- Context/output distributions and cache hit rates materially change GPU memory and execution cost.
- Warm reserve and regional redundancy are intentional capacity costs.
- Cost per successful task is often more useful than cost per token.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Traffic distribution, SLOs, model/runtime configuration, memory profile, benchmark curves, cache behavior, scaling limits, prices, and reliability targets.
Owns
Capacity model, assumptions, safety margin, cost attribution, and forecast validation.
Emits
Replica/device plan, admission limits, headroom, scaling thresholds, budgets, forecasts, and sensitivity scenarios.
Does not own
A guarantee that vendor quotas or future prices remain unchanged.
Failure modes
Underprovisioned memory, queue collapse, overprovisioned idle fleet, ignored warmup, cache miss shock, quota exhaustion, and misleading unit economics.
Evidence and metrics
Arrival rate, service time, Goodput/device, active tokens, memory headroom, queue depth, utilization, scale-up time, idle reserve, cost per success, and forecast error.
Workload model
Capacity planning begins with arrivals, prompt/output distributions, concurrency, tenant/model mix, cache state, tools, cancellation, and time-of-day.
Implementation
Use recent production traces or a documented synthetic model; segment by request class.
Operational implications
Averages hide bursts and long-context tails.
Measure
Arrival percentiles, burst duration, length distribution, active tokens, cache hit rate, and seasonality.
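As a minimal sketch of why averages hide bursts, the arrival percentiles listed above could be derived from a trace of request timestamps as follows. The function names and the one-minute bucket size are illustrative assumptions, not part of any particular tool.

```python
import math
from collections import Counter


def per_minute_rates(timestamps_s):
    """Bucket arrival timestamps (seconds) into requests/s, one value per minute."""
    counts = Counter(int(t // 60) for t in timestamps_s)
    first, last = min(counts), max(counts)
    # Keep empty minutes so quiet periods stay in the distribution.
    return [counts.get(m, 0) / 60.0 for m in range(first, last + 1)]


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(timestamps_s):
    """Return mean, median, and p99 arrival rate in requests/s."""
    rates = per_minute_rates(timestamps_s)
    return {
        "mean_rps": sum(rates) / len(rates),
        "p50_rps": percentile(rates, 50),
        "p99_rps": percentile(rates, 99),
    }
```

Running the same summary per request class (for example, short chat versus long-context requests) keeps one class's tail from being averaged away by another.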
Model and memory budget
Resident weights, runtime workspace, activations, KV cache, communication, adapters, fragmentation, and headroom share device memory.
Implementation
Measure actual runtime usage and model the worst approved context/concurrency combination.
Operational implications
A theoretical parameter-size estimate undercounts runtime overhead.
Measure
Resident/peak bytes, KV bytes/token, fragmentation, free headroom, and OOM events.
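The memory budget can be sketched as simple arithmetic. The example below assumes a standard transformer attention layout, where each token stores one key and one value vector per layer per KV head; the class and parameter names are hypothetical, and the resident, workspace, and headroom figures should come from measured runtime usage rather than estimates.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int
    kv_dtype_bytes: int  # e.g. 2 for fp16/bf16 KV cache entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.kv_dtype_bytes


def max_active_tokens(device_bytes: int, resident_bytes: int,
                      workspace_bytes: int, headroom_bytes: int,
                      m: ModelShape) -> int:
    """Tokens that fit in the KV cache after weights, workspace, and headroom."""
    kv_budget = device_bytes - resident_bytes - workspace_bytes - headroom_bytes
    if kv_budget <= 0:
        return 0
    return kv_budget // kv_bytes_per_token(m)
```

Dividing the result by the worst approved context length gives a rough upper bound on concurrent sequences; fragmentation and allocator behavior usually push the practical limit lower.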
Service curve and Goodput
Load tests establish Goodput and latency/error curves per device or replica.
Implementation
Choose an operating point below the saturation knee and validate mixed workload/co-location.
Operational implications
Peak throughput is not a safe sustained capacity target.
Measure
Offered/achieved load, Goodput, p95/p99 latency against SLO, errors, queue depth, and utilization.
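One way to pick an operating point from load-test results is sketched below. It walks the curve in order of offered load, stops at the first point that violates the SLO, and derates the last passing point. The `LoadPoint` fields and the default derate of 0.8 are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    offered_rps: float
    achieved_rps: float
    p99_ms: float
    error_rate: float  # fraction of requests that failed, 0.0 to 1.0


def goodput_operating_point(points, p99_slo_ms, max_error_rate, derate=0.8):
    """Return a sustained Goodput target (requests/s) below the saturation knee."""
    best = None
    for p in sorted(points, key=lambda p: p.offered_rps):
        if p.p99_ms <= p99_slo_ms and p.error_rate <= max_error_rate:
            best = p
        else:
            break  # first SLO violation: points past the knee are not trusted
    if best is None:
        return 0.0
    # Count only successful requests, then leave margin below the knee.
    return best.achieved_rps * (1.0 - best.error_rate) * derate
```

The curve should be re-measured with mixed workloads and co-located models, since a knee found with uniform synthetic requests may not hold in production.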
Queueing and burst absorption
Queues absorb short bursts but increase user latency and memory.
Implementation
Set bounded queues, deadlines, priorities, backpressure, and a burst budget tied to scale-up time.
Operational implications
Long queues create timeout/retry amplification.
Measure
Queue depth/age, timeout, reject, and retry rates, and burst duration.
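The relationship between queue bound, burst size, and scale-up time can be sketched with basic arithmetic. The sketch assumes a roughly constant service rate (Goodput) and a constant burst arrival rate, which real traffic will not match exactly; all names are illustrative.

```python
import math


def max_queue_depth(goodput_rps: float, max_wait_s: float) -> int:
    """Deepest queue whose last request can still start within max_wait_s."""
    return math.floor(goodput_rps * max_wait_s)


def burst_seconds_absorbed(queue_depth: int, burst_rps: float,
                           goodput_rps: float) -> float:
    """Seconds a burst can last before the bounded queue fills and rejects begin."""
    excess = burst_rps - goodput_rps
    if excess <= 0:
        return math.inf  # arrivals do not exceed service rate; queue never fills
    return queue_depth / excess


def covers_scale_up(queue_depth: int, burst_rps: float,
                    goodput_rps: float, time_to_ready_s: float) -> bool:
    """True if the queue alone can bridge the gap until new capacity is ready."""
    return burst_seconds_absorbed(queue_depth, burst_rps, goodput_rps) >= time_to_ready_s
```

When `covers_scale_up` is false, the gap has to be closed by warm reserve, admission limits, or load shedding rather than by a deeper queue, because a deeper queue only raises latency and retry pressure.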
Replica and reserve planning
Required replicas depend on class-specific Goodput, failure reserve, rollout reserve, and regional capacity.
Implementation
Plan failure reserve (N+1 or more), canary, maintenance, and warm reserve separately from steady demand.
Operational implications
Running at 100% utilization removes headroom for failure and rollout.
Measure
Steady replicas, reserve, failover capacity, utilization target, and SLO under loss.
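A replica plan that keeps steady demand and each reserve as separate line items might look like the following sketch. The request classes, the utilization target, and the reserve counts are placeholders; the Goodput figures are assumed to come from SLO-qualified load tests per class.

```python
import math


def plan_replicas(demand_rps_by_class, goodput_rps_by_class, utilization_target,
                  failure_reserve, rollout_reserve, warm_reserve):
    """Return a replica plan with steady demand and reserves listed separately."""
    # Replica-equivalents needed per class at 100% of measured Goodput.
    raw = sum(demand / goodput_rps_by_class[cls]
              for cls, demand in demand_rps_by_class.items())
    # Run below 100% so normal variance does not consume the reserves.
    steady = math.ceil(raw / utilization_target)
    plan = {
        "steady": steady,
        "failure_reserve": failure_reserve,
        "rollout_reserve": rollout_reserve,
        "warm_reserve": warm_reserve,
    }
    plan["total"] = sum(plan.values())
    return plan
```

Keeping the reserves as named fields makes it visible when a cost reduction is actually a reduction in failover or rollout capacity.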
Autoscaling model
Scale decisions require a signal that anticipates capacity demand and time-to-ready.
Implementation
Use queue, concurrency, active tokens, memory, or custom Goodput indicators; model node/model warmup.
Operational implications
Reactive scaling cannot absorb every spike when accelerator provisioning is slow.
Measure
Decision lag, time to ready, queue during scale, over/undershoot, and thrash.
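A scaling decision that accounts for time to ready can be sketched as a projection: estimate where the signal will be when new capacity actually becomes usable, then size for that. The example uses active tokens as the signal and a one-replica-per-decision scale-down limit to reduce thrash; both choices, and every parameter name, are assumptions for illustration rather than the behavior of any specific autoscaler.

```python
import math


def desired_replicas(active_tokens, token_growth_per_s, time_to_ready_s,
                     tokens_per_replica, current, min_replicas, max_replicas):
    """Size for projected load at the moment new replicas would become ready."""
    # Only positive growth is projected forward; a falling signal is not trusted
    # to keep falling.
    projected = active_tokens + max(0.0, token_growth_per_s) * time_to_ready_s
    needed = math.ceil(projected / tokens_per_replica)
    if needed < current:
        # Scale down one replica per decision to limit thrash.
        needed = current - 1
    return max(min_replicas, min(max_replicas, needed))
```

If the result is frequently clamped at `max_replicas`, the limit is a quota or budget problem rather than an autoscaling problem.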
Cost model
Include compute reservation/usage, storage, network, provider tokens, tool APIs, observability, evaluation, idle reserve, failed work, and engineering operations.
Implementation
Attribute by product/tenant/model with controlled dimensions; record pricing date/region and amortization assumptions.
Operational implications
Do not equate provider token price with total task cost.
Measure
Cost per request, per token, and per successful task; idle reserve; failure cost; egress; and unallocated cost.
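The difference between cost per request and cost per successful task can be shown with a small calculation. The cost categories below are a simplified subset of the list above, the prices are made-up inputs, and `device_hours` is assumed to include idle reserve hours so that reserve is not silently excluded.

```python
def cost_summary(device_hours, device_hour_price, provider_tokens,
                 price_per_million_tokens, tool_calls, price_per_tool_call,
                 other_cost, requests, successes):
    """Return total cost plus cost per request and per successful task."""
    total = (
        device_hours * device_hour_price                       # includes idle reserve
        + provider_tokens / 1_000_000 * price_per_million_tokens
        + tool_calls * price_per_tool_call
        + other_cost                                           # storage, egress, observability, ...
    )
    return {
        "total": total,
        "cost_per_request": total / requests,
        # Failed work still costs money, so it is spread over successes only.
        "cost_per_success": total / successes if successes else float("inf"),
    }
```

A record of the pricing date, region, and success criteria alongside these numbers keeps them comparable over time.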
Sensitivity and uncertainty
Traffic, cache hit rate, output length, model mix, quality routing, and prices all change over time.
Implementation
Run best/base/worst scenarios and update assumptions against observed forecast error.
Operational implications
A single deterministic forecast hides risk.
Measure
Forecast error, scenario range, assumption age, and capacity shortfall frequency.
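Scenario ranges and forecast error can be sketched as follows. The replica formula is a deliberately simple toy model that considers only decode throughput and ignores prefill, cache hits, and memory limits; it is an assumption for illustration, as are the scenario field names. Forecast error is computed as mean absolute percentage error (MAPE).

```python
import math


def replicas_for(rps, mean_output_tokens, decode_tokens_per_s_per_replica):
    """Toy model: replicas needed to sustain decode token demand."""
    return math.ceil(rps * mean_output_tokens / decode_tokens_per_s_per_replica)


def scenario_table(scenarios, decode_tokens_per_s_per_replica):
    """Map each named scenario to its required replica count."""
    return {
        name: replicas_for(s["rps"], s["mean_output_tokens"],
                           decode_tokens_per_s_per_replica)
        for name, s in scenarios.items()
    }


def mape(forecast, actual):
    """Mean absolute percentage error, skipping periods with zero actuals."""
    errors = [abs(f - a) / a for f, a in zip(forecast, actual) if a != 0]
    return sum(errors) / len(errors)
```

A widening gap between best and worst scenarios, or a rising MAPE, is a signal that the assumptions are stale and should be refreshed from production traces.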
Reference tables
| Input | Example unit | Why it matters |
|---|---|---|
| Arrival distribution | Requests/s by minute/percentile | Burst and baseline demand |
| Prompt/output distribution | Tokens/request | Prefill, decode, KV and latency |
| Goodput/device | SLO-qualified requests/s | Safe capacity |
| Memory profile | GB per model; bytes per active token | Concurrency fit |
| Time to ready | Minutes | Warm reserve and queue budget |
| Failure reserve | Devices/region | Availability |
| Unit cost | $/device-hour, $/token, $/tool call | Cost/success forecast |
Decision checklist
- What workload distributions and bursts must be supported?
- What device memory is reserved for weights, KV, buffers, and safety margin?
- What Goodput operating point meets p95/p99 SLOs?
- How much reserve covers failure, rollout, and burst?
- How long does new capacity take to become ready?
- Which costs are fixed, variable, shared, or external?
- What sensitivity variables can change the plan most?
- How often is the forecast reconciled with production?
Common mistakes
- Sizing from average request length.
- Using maximum throughput as sustainable capacity.
- Ignoring KV cache and runtime workspace.
- Assuming autoscaling is instantaneous.
- Excluding failed requests and idle reserve from cost.
- Attributing shared GPU cost only by request count.
- Publishing cost without date, region, model, and success criteria.
Sources and further reading
- MLPerf Inference: Datacenter
- KServe autoscaling
- Kubernetes resource management
Last reviewed: 2026-06-21 UTC
