Cost and Capacity Planning

Key takeaways

Capacity is constrained by memory, active tokens, queueing, bandwidth, tool dependencies, and latency, not only by model parameter count.
Use Goodput at target SLOs, not maximum benchmark throughput, for replica sizing.
Context/output distributions and cache hit rates materially change GPU memory and execution cost.
Warm reserve and regional redundancy are intentional capacity costs.
Cost per successful task is often more useful than cost per token.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Traffic distribution, SLOs, model/runtime configuration, memory profile, benchmark curves, cache behavior, scaling limits, prices, and reliability targets.

Owns

Capacity model, assumptions, safety margin, cost attribution, and forecast validation.

Emits

Replica/device plan, admission limits, headroom, scaling thresholds, budgets, forecasts, and sensitivity scenarios.

Does not own

A guarantee that vendor quotas or future prices remain unchanged.

Failure modes

Underprovisioned memory, queue collapse, overprovisioned idle fleet, ignored warmup, cache miss shock, quota exhaustion, and misleading unit economics.

Evidence and metrics

Arrival rate, service time, Goodput/device, active tokens, memory headroom, queue depth, utilization, scale-up time, idle reserve, cost/success, and forecast error.

Workload model

Capacity planning begins with arrivals, prompt/output distributions, concurrency, tenant/model mix, cache state, tools, cancellation, and time-of-day.

Implementation

Use recent production traces or a documented synthetic model; segment by request class.

Operational implications

Averages hide bursts and long-context tails.

Measure

Arrival percentiles, burst duration, length distribution, active tokens, cache hit, and seasonality.
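A minimal sketch of segmenting a trace by request class and comparing the mean with the tail. The trace format, the class names, and the helper names `percentile` and `summarize_by_class` are assumptions made for illustration; a real trace would also carry timestamps, cache state, and tenant.

```python
import math
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_by_class(trace):
    """trace: iterable of (request_class, prompt_tokens, output_tokens)."""
    totals_by_class = defaultdict(list)
    for request_class, prompt_tokens, output_tokens in trace:
        totals_by_class[request_class].append(prompt_tokens + output_tokens)
    return {
        cls: {
            "count": len(totals),
            "mean": sum(totals) / len(totals),
            "p50": percentile(totals, 50),
            "p99": percentile(totals, 99),
        }
        for cls, totals in totals_by_class.items()
    }

# Hypothetical trace: mostly short chat requests plus a small long-context tail.
trace = (
    [("chat", 200, 100)] * 98
    + [("chat", 8000, 2000)] * 2
    + [("batch", 1000, 500)] * 10
)
summary = summarize_by_class(trace)
print(summary["chat"])
```

In this made-up trace the chat class has a mean of 494 tokens per request but a p99 of 10,000, which is the kind of long-context tail an average conceals.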

Model and memory budget

Resident weights, runtime workspace, activations, KV cache, communication, adapters, fragmentation, and headroom share device memory.

Implementation

Measure actual runtime usage and model the worst approved context/concurrency combination.

Operational implications

A theoretical parameter-size estimate undercounts runtime overhead.

Measure

Resident/peak bytes, KV bytes/token, fragmentation, free headroom, and OOM.
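The memory budget can be sketched as simple arithmetic. This example assumes a standard transformer that caches one key and one value vector per layer per KV head; the model dimensions, the device size, and the function names are hypothetical, and measured runtime figures should replace them in practice.

```python
GIB = 2**30

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache bytes for one token, assuming standard K and V caching."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_active_tokens(device_bytes, weight_bytes, workspace_bytes,
                      headroom_bytes, kv_per_token):
    """Tokens that fit in the memory left after fixed reservations."""
    free = device_bytes - weight_bytes - workspace_bytes - headroom_bytes
    return max(0, free // kv_per_token)

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 2-byte values.
kv_per_token = kv_bytes_per_token(32, 8, 128, 2)

# Hypothetical device: 80 GiB total, 16 GiB weights, 4 GiB runtime workspace,
# 8 GiB kept free as headroom for fragmentation and spikes.
token_budget = max_active_tokens(80 * GIB, 16 * GIB, 4 * GIB, 8 * GIB, kv_per_token)

print(kv_per_token, token_budget)
```

Dividing the token budget by the worst approved context length gives a rough upper bound on concurrent sequences, before accounting for fragmentation beyond the reserved headroom.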

Service curve and Goodput

Load tests establish Goodput and latency/error curves per device or replica.

Implementation

Choose an operating point below the saturation knee and validate mixed workload/co-location.

Operational implications

Peak throughput is not a safe sustained capacity target.

Measure

Offered/achieved load, Goodput, p95/p99 SLO, errors, queue, and utilization.
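One way to turn a load-test curve into an operating point is sketched below. The curve values, the `derate` factor of 0.8, and the function name are assumptions for illustration; the derating stands in for "below the saturation knee" and should be set from your own mixed-workload tests.

```python
def pick_operating_point(curve, p99_slo_ms, derate=0.8):
    """Pick the highest-Goodput point that meets the p99 SLO, then derate it.

    curve: list of dicts with offered_rps, goodput_rps, and p99_ms.
    Returns None when no tested point meets the SLO.
    """
    within_slo = [point for point in curve if point["p99_ms"] <= p99_slo_ms]
    if not within_slo:
        return None
    best = max(within_slo, key=lambda point: point["goodput_rps"])
    return {
        "offered_rps": best["offered_rps"],
        "planned_goodput_rps": best["goodput_rps"] * derate,
    }

# Hypothetical per-replica load-test results. The last row has the highest raw
# throughput but violates the SLO, so it is not a valid sizing point.
curve = [
    {"offered_rps": 10, "goodput_rps": 10, "p99_ms": 400},
    {"offered_rps": 20, "goodput_rps": 20, "p99_ms": 600},
    {"offered_rps": 30, "goodput_rps": 30, "p99_ms": 1100},
    {"offered_rps": 40, "goodput_rps": 33, "p99_ms": 4000},
]
plan = pick_operating_point(curve, p99_slo_ms=1500)
print(plan)
```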

Queueing and burst absorption

Queues absorb short bursts but increase user latency and memory use.

Implementation

Set bounded queues, deadlines, priorities, backpressure, and a burst budget tied to scale-up time.

Operational implications

Long queues create timeout/retry amplification.

Measure

Queue depth/age, timeout, reject, retry, and burst duration.
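The link between queue bounds and scale-up time can be checked with back-of-envelope arithmetic. This sketch assumes a FIFO queue, constant rates during the burst, and no extra capacity until `time_to_ready_s` has passed; all numbers and function names are hypothetical.

```python
def queue_depth_bound(service_rps, max_queue_wait_s):
    """Largest FIFO depth whose drain time stays within the wait budget."""
    # A request at the back of the queue waits roughly depth / service rate.
    return service_rps * max_queue_wait_s

def burst_outcome(burst_rps, service_rps, time_to_ready_s, max_depth):
    """Backlog built while waiting for new capacity, split into queued and rejected."""
    excess_rps = max(0, burst_rps - service_rps)
    backlog = excess_rps * time_to_ready_s
    return {
        "backlog": backlog,
        "queued": min(backlog, max_depth),
        "rejected": max(0, backlog - max_depth),
    }

# Hypothetical: 50 req/s of service capacity and a 2 s queue-wait budget.
max_depth = queue_depth_bound(service_rps=50, max_queue_wait_s=2)

# A 60 req/s burst that lasts the full 120 s it takes new capacity to become ready.
outcome = burst_outcome(burst_rps=60, service_rps=50,
                        time_to_ready_s=120, max_depth=max_depth)
print(max_depth, outcome)
```

With these made-up figures the bounded queue holds 100 requests while the burst produces a backlog of 1,200, so most of the excess has to be rejected or served from warm reserve rather than queued.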

Replica and reserve planning

Required replicas depend on class-specific Goodput, failure reserve, rollout reserve, and regional capacity.

Implementation

Plan N+failure, canary, maintenance, and warm reserve separately from steady demand.

Operational implications

Running at 100% utilization leaves no headroom for failure or rollout.

Measure

Steady replicas, reserve, failover capacity, utilization target, and SLO under loss.
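A minimal sketch of replica sizing with reserves counted separately from steady demand. It assumes all classes share one replica pool and that class loads add linearly, which should be validated with mixed-workload tests; the demand figures, Goodput figures, and reserve counts are hypothetical.

```python
import math

def plan_replicas(peak_rps_by_class, goodput_rps_by_class,
                  failure_reserve, rollout_reserve, warm_reserve):
    """Return steady and total replica counts.

    goodput_rps_by_class holds the SLO-qualified requests/s one replica
    sustains for each request class.
    """
    # Sum the fraction of a replica each class needs, then round up once.
    replica_load = sum(
        peak_rps / goodput_rps_by_class[cls]
        for cls, peak_rps in peak_rps_by_class.items()
    )
    steady = math.ceil(replica_load)
    total = steady + failure_reserve + rollout_reserve + warm_reserve
    return {"steady": steady, "total": total, "steady_share": steady / total}

replica_plan = plan_replicas(
    peak_rps_by_class={"chat": 120, "long_context": 6},
    goodput_rps_by_class={"chat": 24, "long_context": 1.5},
    failure_reserve=2,
    rollout_reserve=1,
    warm_reserve=1,
)
print(replica_plan)
```

Keeping the reserve terms as separate inputs makes it visible how much of the fleet is an intentional availability cost rather than steady demand.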

Autoscaling model

Scale decisions require a signal that anticipates capacity demand and time-to-ready.

Implementation

Use queue, concurrency, active tokens, memory, or custom Goodput indicators; model node/model warmup.

Operational implications

Reactive scaling cannot prevent every spike when accelerator supply is slow.

Measure

Decision lag, time to ready, queue during scale, over/undershoot, and thrash.
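The effect of time-to-ready on a scaling threshold can be illustrated with a simple model. This sketch assumes load grows linearly at a fixed fraction per second while new capacity is pending; the growth rate, the lag values, and the function name are hypothetical, and real ramps are rarely this regular.

```python
def scale_up_utilization_threshold(growth_per_s, decision_lag_s, time_to_ready_s):
    """Utilization at which scale-up must start so load still fits on arrival.

    Under linear growth, load after T seconds is load_now * (1 + growth_per_s * T).
    Requiring that to stay within current capacity gives
    load_now / capacity <= 1 / (1 + growth_per_s * T).
    """
    lead_time_s = decision_lag_s + time_to_ready_s
    return 1.0 / (1.0 + growth_per_s * lead_time_s)

# Hypothetical: load grows 0.1% per second during a ramp, the scaling decision
# takes 30 s, and a new replica needs 220 s for node start and model warmup.
threshold = scale_up_utilization_threshold(
    growth_per_s=0.001, decision_lag_s=30, time_to_ready_s=220
)
print(round(threshold, 3))
```

Under these assumptions scale-up has to begin at roughly 80% utilization. A longer warmup or a steeper ramp pushes the threshold lower, which is capacity that stays idle by design.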

Cost model

Include compute reservation/usage, storage, network, provider tokens, tool APIs, observability, evaluation, idle reserve, failed work, and engineering operations.

Implementation

Attribute cost by product, tenant, and model using a bounded set of dimensions; record the pricing date and region and the amortization assumptions.

Operational implications

Do not equate provider token price with total task cost.

Measure

Cost/request/token/success, idle reserve, failure cost, egress, and unallocated cost.
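A sketch of cost per request versus cost per successful task for one reporting period. Every price and volume here is a made-up placeholder, not a real vendor rate, and the function and field names are hypothetical; the point is that idle reserve and failed work stay in the numerator while only successes count in the denominator.

```python
def unit_costs(device_hours, price_per_device_hour,
               provider_tokens, price_per_1k_tokens,
               tool_calls, price_per_tool_call,
               requests, successes):
    """Total cost for a period and its per-request and per-success unit costs."""
    # device_hours includes idle reserve; tokens and tool calls include failed work.
    compute = device_hours * price_per_device_hour
    tokens = provider_tokens / 1000 * price_per_1k_tokens
    tools = tool_calls * price_per_tool_call
    total = compute + tokens + tools
    return {
        "total": total,
        "cost_per_request": total / requests,
        "cost_per_success": total / successes,
    }

# Placeholder figures for illustration only.
costs = unit_costs(
    device_hours=100, price_per_device_hour=2.50,
    provider_tokens=2_000_000, price_per_1k_tokens=0.01,
    tool_calls=5_000, price_per_tool_call=0.002,
    requests=10_000, successes=8_000,
)
print(costs)
```

With an 80% success rate in this example, cost per success is 25% higher than cost per request, a gap that cost-per-token reporting does not show.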

Sensitivity and uncertainty

Traffic, cache hit rate, output length, model mix, quality routing, and prices all change over time.

Implementation

Run best/base/worst scenarios and update assumptions against observed forecast error.

Operational implications

A single deterministic forecast hides risk.

Measure

Forecast error, scenario range, assumption age, and capacity shortfall frequency.
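Best, base, and worst scenarios can be run through the same sizing function to expose the range. The toy model below treats uncached prompt tokens and output tokens as equal units of work, which is a deliberate simplification; the scenario values, the per-replica token rate, and the function name are all assumptions for illustration.

```python
def replicas_needed(peak_rps, prompt_tokens, cache_hit_pct, output_tokens,
                    tokens_per_s_per_replica):
    """Toy sizing: tokens of work per second divided by per-replica token rate."""
    # Integer arithmetic throughout, so the ceiling is not affected by float rounding.
    work_x100 = peak_rps * (prompt_tokens * (100 - cache_hit_pct) + output_tokens * 100)
    capacity_x100 = tokens_per_s_per_replica * 100
    return -(-work_x100 // capacity_x100)  # ceiling division

# Hypothetical scenarios; only traffic, cache hit rate, and output length vary.
scenarios = {
    "best": {"peak_rps": 80, "cache_hit_pct": 60, "output_tokens": 200},
    "base": {"peak_rps": 100, "cache_hit_pct": 40, "output_tokens": 300},
    "worst": {"peak_rps": 150, "cache_hit_pct": 10, "output_tokens": 500},
}
scenario_replicas = {
    name: replicas_needed(prompt_tokens=1000, tokens_per_s_per_replica=8000, **inputs)
    for name, inputs in scenarios.items()
}
print(scenario_replicas)
```

In this example the worst case needs more than twice the base-case replicas, a spread that a single deterministic forecast would not reveal.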

Reference tables

Capacity planning worksheet
Input	Example unit	Why it matters
Arrival distribution	Requests/s by minute/percentile	Burst and baseline demand
Prompt/output distribution	Tokens/request	Prefill, decode, KV and latency
Goodput/device	SLO-qualified requests/s	Safe capacity
Memory profile	GB per model, bytes per active token	Concurrency fit
Time to ready	Minutes	Warm reserve and queue budget
Failure reserve	Devices/region	Availability
Unit cost	$/device-hour, $/token, $/tool	Cost/success forecast

Decision checklist

What workload distributions and bursts must be supported?
What device memory is reserved for weights, KV, buffers, and safety margin?
What Goodput operating point meets p95/p99 SLOs?
How much reserve covers failure, rollout, and burst?
How long does new capacity take to become ready?
Which costs are fixed, variable, shared, or external?
What sensitivity variables can change the plan most?
How often is the forecast reconciled with production?

Common mistakes

Sizing from average request length.
Using maximum throughput as sustainable capacity.
Ignoring KV cache and runtime workspace.
Assuming autoscaling is instantaneous.
Excluding failed requests and idle reserve from cost.
Attributing shared GPU cost only by request count.
Publishing cost without date, region, model, and success criteria.

Sources and further reading

MLPerf Inference: Datacenter
MLCommons · Benchmark specification · accessed 2026-06-21 UTC

KServe autoscaling
KServe · Official documentation · accessed 2026-06-21 UTC

Kubernetes resource management
Kubernetes · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
