Cost and Capacity Planning

Cost and capacity planning converts workload distributions and service objectives into model, cache, compute, network, storage, tool, and human-review resources.

Key takeaways

Plan from successful outcomes and latency constraints, not average token counts alone.
KV cache, context length, concurrency, and model residency often set capacity.
Retries, tools, approvals, evaluation, and evidence add material agentic cost.

Demand model

Record requests by traffic class, arrival distribution, input/output lengths, concurrency, session continuation, tool frequency, deadline, region, and growth. Include bursts and scheduled jobs. Separate interactive and background demand.
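As a rough illustration of why interactive and background demand are recorded separately, the sketch below applies Little's law (mean concurrency = arrival rate x mean time in system) per traffic class. The class names, rates, latencies, and burst factors are hypothetical, and the burst estimate assumes latency does not degrade at peak, which real systems rarely guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    name: str
    arrivals_per_s: float   # mean arrival rate
    mean_latency_s: float   # mean time a request spends in the system
    burst_factor: float     # peak-to-mean arrival ratio (assumed, not measured)


def mean_concurrency(tc: TrafficClass) -> float:
    """Little's law: requests in flight = arrival rate * mean time in system."""
    return tc.arrivals_per_s * tc.mean_latency_s


def burst_concurrency(tc: TrafficClass) -> float:
    """Optimistic burst estimate: assumes latency stays flat under burst load."""
    return mean_concurrency(tc) * tc.burst_factor


# Hypothetical traffic classes for illustration only.
classes = [
    TrafficClass("interactive", arrivals_per_s=20.0, mean_latency_s=2.5, burst_factor=3.0),
    TrafficClass("background", arrivals_per_s=5.0, mean_latency_s=30.0, burst_factor=1.0),
]

for tc in classes:
    print(tc.name, mean_concurrency(tc), burst_concurrency(tc))
```

In this made-up example the background class holds more requests in flight on average than the interactive class despite a quarter of the arrival rate, which is the kind of effect a single blended average hides.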

Resource model

Account for weights, runtime workspace, KV cache, CPU preprocessing, network transfer, storage, model-load time, tool dependencies, evidence, and human approval. Resource relationships become nonlinear when batch size, cache reuse, or model parallelism changes, so do not extrapolate linearly across those changes.
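A minimal sketch of the KV cache term, assuming a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token, with no cache quantization or allocator overhead. The model shape and the 40 GiB cache budget are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(cache_budget_bytes: int, tokens_per_seq: int, per_token: int) -> int:
    """How many sequences of a given length fit in the cache budget."""
    return cache_budget_bytes // (tokens_per_seq * per_token)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

# Hypothetical budget: memory left for cache after weights and workspace.
budget = 40 * 2**30

long_ctx = max_concurrent_sequences(budget, tokens_per_seq=8192, per_token=per_token)
short_ctx = max_concurrent_sequences(budget, tokens_per_seq=2048, per_token=per_token)
print(per_token, long_ctx, short_ctx)
```

Under these assumptions, quadrupling the context length cuts the number of resident sequences by a factor of four, before any batching or parallelism effects are considered.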

Capacity envelope

Benchmark at representative load and identify the point where an SLO first fails. Capacity is the workload that meets objectives, not the maximum accepted queue. Model several context/concurrency mixes because identical average tokens can produce different cache pressure.
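One way to read capacity off a load sweep is sketched below: walk the benchmark points in order of offered load and stop at the first one that violates an objective. The result tuples, the p95 objective, and the error-rate objective are hypothetical; a real sweep would also track the other SLOs that apply to the traffic class.

```python
from typing import Optional

# Each benchmark point: (offered requests per second, p95 latency in seconds, error rate).
BenchPoint = tuple[float, float, float]


def capacity_under_slo(
    results: list[BenchPoint], p95_slo_s: float, max_error_rate: float
) -> Optional[float]:
    """Return the highest offered load that met every objective before the
    first failure, or None if even the lowest load failed."""
    capacity = None
    for rps, p95, err in sorted(results):
        if p95 > p95_slo_s or err > max_error_rate:
            break  # the SLO first fails here; higher points are ignored
        capacity = rps
    return capacity


# Hypothetical sweep. The 50 rps point shows lower latency than 40 rps only
# because more requests were rejected, so it must not count as capacity.
sweep = [
    (10, 0.8, 0.0),
    (20, 1.1, 0.0),
    (30, 1.9, 0.001),
    (40, 3.4, 0.02),
    (50, 2.9, 0.05),
]

print(capacity_under_slo(sweep, p95_slo_s=2.0, max_error_rate=0.01))
```

Stopping at the first failure, rather than taking the highest passing point anywhere in the sweep, keeps an overloaded system that sheds work from being mistaken for a healthy one.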

Cost model

Include provisioned or usage-based compute, storage, interconnect, managed-service fees, model API charges, observability, engineering operations, and idle headroom. Report cost per quality-valid request or completed workflow, not only per token.
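The difference between cost per request and cost per quality-valid request can be sketched as follows. All line items and counts are invented for illustration, and the definition of "quality-valid" is assumed to come from whatever evaluation the team already runs.

```python
def cost_per_unit(total_cost: float, units: int) -> float:
    """Total cost divided by a count of outcomes; rejects an empty denominator."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical monthly costs in one currency unit.
monthly_costs = {
    "compute": 42_000.0,
    "storage": 1_500.0,
    "observability": 2_500.0,
    "engineering_operations": 9_000.0,
}

total = sum(monthly_costs.values())

requests_served = 2_000_000   # hypothetical: every request that returned a response
quality_valid = 1_760_000     # hypothetical: responses that also passed quality checks

per_request = cost_per_unit(total, requests_served)
per_valid = cost_per_unit(total, quality_valid)
print(per_request, per_valid)
```

The per-valid figure is always at least as high as the per-request figure, and the gap widens as quality drops, which is why it is the more honest number to report.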

Agentic cost

Agent workflows add repeated model calls, tool APIs, sandbox startup, retries, checkpoints, evidence storage, and human attention. A cheaper model can increase total cost if it causes more attempts or review. Optimize the full workflow.
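To make the cheaper-model point concrete, the sketch below estimates expected cost per completed workflow under deliberately simple assumptions: attempts are independent, each succeeds with a fixed probability, retries continue until success (so expected attempts = 1 / success probability), and a fixed fraction of workflows goes to human review. The two configurations and all their numbers are hypothetical.

```python
def expected_workflow_cost(
    attempt_cost: float,
    success_prob: float,
    review_rate: float,
    review_cost: float,
) -> float:
    """Expected cost of one completed workflow.

    Assumes independent attempts retried until success (geometric
    distribution), plus human review on a fraction of workflows.
    """
    if not 0.0 < success_prob <= 1.0:
        raise ValueError("success_prob must be in (0, 1]")
    expected_attempts = 1.0 / success_prob
    return attempt_cost * expected_attempts + review_rate * review_cost


# Hypothetical configurations: attempt_cost covers model calls plus tool APIs.
large_model = expected_workflow_cost(
    attempt_cost=0.40, success_prob=0.9, review_rate=0.05, review_cost=4.0
)
small_model = expected_workflow_cost(
    attempt_cost=0.15, success_prob=0.5, review_rate=0.20, review_cost=4.0
)
print(round(large_model, 4), round(small_model, 4))
```

With these invented numbers, the model that is cheaper per attempt is more expensive per completed workflow, because it needs more attempts and sends more work to review.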

Headroom and failure

Reserve capacity for failures, rollout, maintenance, and correlated bursts. Test loss of a device, node, zone, provider, cache tier, and tool dependency. A fallback model or region needs its own capacity and quality validation.
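A minimal sizing sketch for one of the failure cases above, loss of a zone, assuming replicas are spread evenly across zones and that per-replica capacity comes from the measured capacity envelope. The peak load, per-replica capacity, and zone count are hypothetical, and the sketch ignores rollout overlap and correlated bursts.

```python
import math


def replicas_for_peak(peak_rps: float, per_replica_rps: float) -> int:
    """Replicas needed to serve peak load with no failures."""
    return math.ceil(peak_rps / per_replica_rps)


def replicas_surviving_zone_loss(peak_rps: float, per_replica_rps: float, zones: int) -> int:
    """Total replicas so that the remaining zones still serve peak load
    after any single zone is lost. Assumes an even spread across zones."""
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    needed = replicas_for_peak(peak_rps, per_replica_rps)
    per_zone = math.ceil(needed / (zones - 1))
    return per_zone * zones


# Hypothetical inputs: 300 rps peak, 30 rps per replica within SLO, 3 zones.
baseline = replicas_for_peak(300, 30)
with_zone_headroom = replicas_surviving_zone_loss(300, 30, zones=3)
print(baseline, with_zone_headroom)
```

The same calculation applies separately to a fallback model or region: its per-replica capacity has to be measured on its own, not borrowed from the primary.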

Planning process

Define traffic classes and SLOs.
Measure workload distributions.
Benchmark configurations under representative load.
Model capacity, headroom, and failure scenarios.
Estimate total cost per successful outcome.
Validate with canary traffic and update forecasts from telemetry.

Sensitivity and scenario analysis

Use distributions rather than one average request. Model prompt length, generated length, concurrency, tool latency, retry rate, approval delay, and cache hit rate as separate variables. Run at least steady-state, burst, dependency-degraded, and recovery scenarios. A small change in context length or cache residency can alter the feasible batch size; an increase in tool retries can consume human and external-system capacity even when GPU utilization appears healthy.
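The sketch below illustrates why a distribution matters more than an average: it simulates the total tokens held by a batch of concurrent requests for two hypothetical context-length distributions with the same mean (2,000 tokens) and compares the 99th percentile. The distributions, concurrency, and trial count are assumptions chosen for illustration, not measurements.

```python
import random
from typing import Callable


def p99_batch_tokens(
    sample_len: Callable[[random.Random], int],
    concurrency: int,
    trials: int,
    seed: int = 0,
) -> int:
    """Simulate `trials` batches of `concurrency` requests and return the
    99th-percentile total token count across batches."""
    rng = random.Random(seed)
    totals = sorted(
        sum(sample_len(rng) for _ in range(concurrency)) for _ in range(trials)
    )
    return totals[int(0.99 * (trials - 1))]


def narrow(rng: random.Random) -> int:
    """Every request carries 2,000 tokens of context."""
    return 2000


def wide(rng: random.Random) -> int:
    """90% of requests carry 1,000 tokens, 10% carry 11,000 (mean 2,000)."""
    return 11000 if rng.random() < 0.1 else 1000


narrow_p99 = p99_batch_tokens(narrow, concurrency=32, trials=2000)
wide_p99 = p99_batch_tokens(wide, concurrency=32, trials=2000)
print(narrow_p99, wide_p99)
```

Both workloads would look identical in a report of average tokens per request, yet the long-tailed one needs noticeably more cache at the 99th percentile, which can reduce the feasible batch size.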

Publish assumptions with every estimate: model and runtime version, hardware, precision, region, reserved or on-demand pricing basis, utilization target, workload window, and excluded costs. Revisit the model when any of those assumptions changes.

Capacity review cadence

Compare forecast and actual demand by traffic class.
Review tail latency, rejection, queue age, cache pressure, and fallback rate.
Attribute cost to completed outcomes as well as model calls.
Include evidence storage, observability, tool APIs, and human review.
Record the headroom required for failure recovery and rollout overlap.
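The first review item can be sketched as a relative-error check per traffic class. The class names, volumes, and the 15% tolerance are hypothetical; a team would pick a tolerance that matches how much headroom it actually holds.

```python
def forecast_errors(forecast: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Relative error per traffic class: positive means demand exceeded forecast."""
    return {
        name: (actual.get(name, 0.0) - predicted) / predicted
        for name, predicted in forecast.items()
    }


def classes_needing_review(
    forecast: dict[str, float], actual: dict[str, float], tolerance: float = 0.15
) -> list[str]:
    """Traffic classes whose forecast missed by more than the tolerance."""
    errors = forecast_errors(forecast, actual)
    return sorted(name for name, err in errors.items() if abs(err) > tolerance)


# Hypothetical peak requests per minute for one review window.
forecast = {"interactive": 1000.0, "background": 500.0, "batch": 200.0}
actual = {"interactive": 1300.0, "background": 520.0, "batch": 150.0}

print(forecast_errors(forecast, actual))
print(classes_needing_review(forecast, actual))
```

Flagging over-forecasts as well as under-forecasts matters here, since unused reserved capacity is a cost that should be attributed just like a shortfall.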
