Cost and Capacity Planning

Plan AI runtime capacity and cost using model memory, KV cache, concurrency, Goodput, arrival rates, queueing, accelerator utilization, warm pools, provider cost, and task outcomes.

Audience: Technical readers · Reading time: 5 minutes · Status: Production guidance

Key takeaways

  • Capacity is constrained by memory, active tokens, queueing, bandwidth, tool dependencies, and latency, not only by model parameter count.
  • Use Goodput at target SLOs, not maximum benchmark throughput, for replica sizing.
  • Context/output distributions and cache hit rates materially change GPU memory and execution cost.
  • Warm reserve and regional redundancy are intentional capacity costs.
  • Cost per successful task is often more useful than cost per token.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Traffic distribution, SLOs, model/runtime configuration, memory profile, benchmark curves, cache behavior, scaling limits, prices, and reliability targets.

Owns

Capacity model, assumptions, safety margin, cost attribution, and forecast validation.

Emits

Replica/device plan, admission limits, headroom, scaling thresholds, budgets, forecasts, and sensitivity scenarios.

Does not own

A guarantee that vendor quotas or future prices remain unchanged.

Failure modes

Underprovisioned memory, queue collapse, overprovisioned idle fleet, ignored warmup, cache miss shock, quota exhaustion, and misleading unit economics.

Evidence and metrics

Arrival rate, service time, Goodput/device, active tokens, memory headroom, queue depth, utilization, scale-up time, idle reserve, cost/success, and forecast error.

Workload model

Capacity planning begins with arrivals, prompt/output distributions, concurrency, tenant/model mix, cache state, tools, cancellation, and time-of-day.

Implementation

Use recent production traces or a documented synthetic model; segment by request class.

Operational implications

Averages hide bursts and long-context tails.

Measure

Arrival percentiles, burst duration, length distribution, active tokens, cache hit rate, and seasonality.
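
A minimal sketch of why averages hide long-context tails, assuming prompt lengths have already been extracted from traces for one request class. The sample data, the nearest-rank percentile method, and the function names are illustrative assumptions, not part of any specific tool.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(values):
    """Mean next to tail percentiles, so the gap between them is visible."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "max": max(values),
    }

# Hypothetical prompt lengths (tokens): mostly short, a few long-context requests.
prompt_tokens = [400] * 90 + [2_000] * 8 + [30_000] * 2
summary = summarize(prompt_tokens)
print(summary)
```

In this made-up sample the mean is 1,120 tokens while p99 is 30,000 tokens, so a plan sized from the mean would miss the requests that dominate KV cache use. Running the same summary per request class keeps one class from masking another.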

Model and memory budget

Resident weights, runtime workspace, activations, KV cache, communication, adapters, fragmentation, and headroom share device memory.

Implementation

Measure actual runtime usage and model the worst approved context/concurrency combination.

Operational implications

A theoretical parameter-size estimate undercounts runtime overhead.

Measure

Resident/peak bytes, KV bytes/token, fragmentation, free headroom, and OOM events.
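
The memory budget can be roughed out before measurement. The sketch below assumes a standard transformer that caches one key and one value vector per layer and KV head, and it ignores fragmentation and allocator block granularity. All model dimensions and byte figures are hypothetical placeholders; measured runtime usage should replace them.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    """KV cache bytes for one token: a key and a value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element

def max_active_tokens(device_bytes, resident_bytes, workspace_bytes,
                      headroom_percent, kv_per_token):
    """Tokens that fit in the memory left after weights, workspace, and reserved headroom."""
    usable = device_bytes * (100 - headroom_percent) // 100
    usable -= resident_bytes + workspace_bytes
    return max(0, usable // kv_per_token)

# Hypothetical figures; replace with measured values for the actual runtime.
kv_per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_element=2)
tokens = max_active_tokens(
    device_bytes=80 * GIB,
    resident_bytes=16 * GIB,   # weights and adapters
    workspace_bytes=4 * GIB,   # runtime workspace, activations, communication buffers
    headroom_percent=10,
    kv_per_token=kv_per_token,
)
worst_context = 32_768  # worst approved context length, in tokens
print(kv_per_token, tokens, tokens // worst_context)
```

With these placeholder figures the sketch yields 131,072 bytes of KV cache per token, room for 425,984 active tokens, and 13 concurrent requests at the worst approved context length. A parameter-size estimate alone would not surface that limit.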

Service curve and Goodput

Load tests establish Goodput and latency/error curves per device or replica.

Implementation

Choose an operating point below the saturation knee and validate mixed workload/co-location.

Operational implications

Peak throughput is not a safe sustained capacity target.

Measure

Offered/achieved load, Goodput, p95/p99 SLO attainment, errors, queue depth, and utilization.
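
One way to pick an operating point below the knee, sketched under stated assumptions: each load-test step is summarized by offered load, achieved throughput, p99 latency, and error rate, and a step counts toward Goodput only while every step up to it meets the SLO. The `LoadPoint` type, the thresholds, the 0.8 margin, and the sample curve are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoadPoint:
    offered_rps: float
    achieved_rps: float
    p99_ms: float
    error_rate: float

def goodput_operating_point(points, p99_slo_ms, max_error_rate, margin=0.8):
    """Throughput of the last step before the first SLO violation, scaled by a safety margin."""
    qualified = 0.0
    for point in sorted(points, key=lambda p: p.offered_rps):
        if point.p99_ms > p99_slo_ms or point.error_rate > max_error_rate:
            break  # past the knee: higher steps are not safe sustained targets
        qualified = point.achieved_rps
    return qualified * margin

# Hypothetical load-test curve for one replica.
curve = [
    LoadPoint(2, 2.0, 400, 0.0),
    LoadPoint(4, 4.0, 600, 0.0),
    LoadPoint(6, 5.9, 900, 0.001),
    LoadPoint(8, 6.5, 2500, 0.02),   # higher raw throughput, but the SLO is violated
    LoadPoint(10, 6.2, 6000, 0.08),
]
operating_rps = goodput_operating_point(curve, p99_slo_ms=1000, max_error_rate=0.01)
print(operating_rps)
```

The step at 8 requests/s offered achieves the highest raw throughput in this sample, yet the sketch sizes from the 6 requests/s step because that is the last one meeting the latency and error targets.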

Queueing and burst absorption

Queues absorb short bursts but increase user-facing latency and memory use.

Implementation

Set bounded queues, deadlines, priorities, backpressure, and a burst budget tied to scale-up time.

Operational implications

Long queues create timeout/retry amplification.

Measure

Queue depth/age, timeout, reject, retry, and burst duration.
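
A bounded queue and a burst budget can be related with simple arithmetic. The sketch below uses a deterministic fluid approximation (FIFO order, constant service rate), which is an assumption and understates variance in real traffic. Function names and numbers are illustrative.

```python
import math

def max_queue_depth(service_rps, max_wait_s):
    """Deepest FIFO queue whose last entry is still expected to start within max_wait_s."""
    return math.floor(service_rps * max_wait_s)

def burst_seconds_absorbed(queue_depth, arrival_rps, service_rps):
    """How long a burst can last before the bounded queue fills and rejection starts."""
    excess = arrival_rps - service_rps
    if excess <= 0:
        return math.inf  # arrivals do not exceed service rate, so the queue never fills
    return queue_depth / excess

# Hypothetical numbers: 20 requests/s of Goodput and a 2 s queueing deadline.
depth = max_queue_depth(service_rps=20, max_wait_s=2)
absorbed_s = burst_seconds_absorbed(depth, arrival_rps=30, service_rps=20)
print(depth, absorbed_s)
```

Here the queue holds 40 requests and absorbs a 30 requests/s burst for 4 seconds. If new capacity takes minutes to become ready, the remainder of the burst has to be covered by warm reserve, backpressure, or rejection rather than by a longer queue.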

Replica and reserve planning

Required replicas depend on class-specific Goodput, failure reserve, rollout reserve, and regional capacity.

Implementation

Plan N+failure, canary, maintenance, and warm reserve separately from steady demand.

Operational implications

Running at 100% utilization removes headroom for failure and rollout.

Measure

Steady replicas, reserve, failover capacity, utilization target, and SLO under loss.
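
A minimal replica-count sketch, assuming each request class runs on its own replica pool and that reserves are counted in whole replicas. The class names, Goodput values, utilization target, and reserve sizes are hypothetical.

```python
import math

def steady_replicas(peak_rps, goodput_per_replica, utilization_target):
    """Replicas needed to serve peak demand while staying at the utilization target."""
    return math.ceil(peak_rps / (goodput_per_replica * utilization_target))

def plan(classes, utilization_target, reserves):
    """Steady demand per class, with reserves kept as separate line items."""
    steady = {
        name: steady_replicas(peak, goodput, utilization_target)
        for name, (peak, goodput) in classes.items()
    }
    total = sum(steady.values()) + sum(reserves.values())
    return {"steady": steady, "reserve": dict(reserves), "total": total}

# Hypothetical inputs: (peak requests/s, Goodput per replica) for each class.
classes = {"chat": (80.0, 5.0), "long_context": (6.0, 0.5)}
reserves = {"failure": 2, "rollout": 2, "warm": 1}
replica_plan = plan(classes, utilization_target=0.75, reserves=reserves)
print(replica_plan)
```

Keeping reserves as named items rather than folding them into one padded number makes it easier to see which reserve is being spent during a failure or rollout.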

Autoscaling model

Scale decisions require a signal that anticipates capacity demand and time-to-ready.

Implementation

Use queue, concurrency, active tokens, memory, or custom Goodput indicators; model node/model warmup.

Operational implications

Reactive scaling cannot prevent every spike when accelerator supply is slow.

Measure

Decision lag, time to ready, queue during scale, over/undershoot, and thrash.
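
The effect of time to ready can be explored with a small simulation. This sketch steps a fluid queue one second at a time, triggers a single scale-up when the queue crosses a threshold, and adds capacity only after the warmup delay. The single-step scaling policy and all parameters are simplifying assumptions for illustration.

```python
def simulate_scale_up(arrivals_rps, base_capacity_rps, added_capacity_rps,
                      trigger_queue, time_to_ready_s):
    """Return (peak queue, second at which scale-up was triggered or None)."""
    queue = 0.0
    peak = 0.0
    triggered_at = None
    for t, arrival in enumerate(arrivals_rps):
        capacity = base_capacity_rps
        # New capacity only counts once node and model warmup have finished.
        if triggered_at is not None and t >= triggered_at + time_to_ready_s:
            capacity += added_capacity_rps
        queue = max(0.0, queue + arrival - capacity)
        peak = max(peak, queue)
        if triggered_at is None and queue >= trigger_queue:
            triggered_at = t
    return peak, triggered_at

# Hypothetical spike: 10 requests/s for 10 s, then 16 requests/s for 120 s.
arrivals = [10] * 10 + [16] * 120
peak_queue, triggered_at = simulate_scale_up(
    arrivals, base_capacity_rps=12, added_capacity_rps=8,
    trigger_queue=20, time_to_ready_s=60,
)
print(peak_queue, triggered_at)
```

In this run the trigger fires at second 14 with 20 requests queued, yet the queue still climbs to 256 before the new capacity is ready. Rerunning with a shorter `time_to_ready_s` or a lower `trigger_queue` shows how much of the peak comes from warmup rather than from decision lag.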

Cost model

Include compute reservation/usage, storage, network, provider tokens, tool APIs, observability, evaluation, idle reserve, failed work, and engineering operations.

Implementation

Attribute by product/tenant/model with controlled dimensions; record pricing date/region and amortization assumptions.

Operational implications

Do not equate provider token price with total task cost.

Measure

Cost/request/token/success, idle reserve, failure cost, egress, and unallocated cost.
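
A sketch of cost per successful task, assuming costs have already been attributed to one product over one period. Every price and volume below is a made-up placeholder, not a quoted rate; real figures should carry their pricing date and region.

```python
def cost_per_successful_task(cost_lines, successful_tasks):
    """Total attributed cost divided by tasks that met the success criteria."""
    if successful_tasks <= 0:
        raise ValueError("successful_tasks must be positive")
    return sum(cost_lines.values()) / successful_tasks

# Hypothetical daily figures for one product.
cost_lines = {
    "devices_incl_idle_reserve": 240 * 2.50,           # device-hours x $/device-hour
    "provider_tokens": 50_000_000 / 1_000_000 * 4.00,  # tokens x $/million tokens
    "tool_calls": 20_000 * 0.01,                       # calls x $/call
    "observability_and_evaluation": 100.00,
}
total_requests = 110_000    # includes failed and retried work
successful_tasks = 100_000

per_request = sum(cost_lines.values()) / total_requests
per_success = cost_per_successful_task(cost_lines, successful_tasks)
print(per_request, per_success)
```

Failed requests still consume devices, tokens, and tool calls, so they stay in the numerator while dropping out of the denominator. In this sample that raises the unit cost from $0.010 per request to $0.011 per successful task, and provider tokens account for less than a fifth of the total.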

Sensitivity and uncertainty

Traffic, cache hit rate, output length, model mix, quality routing, and prices all change over time.

Implementation

Run best/base/worst scenarios and update assumptions against observed forecast error.

Operational implications

A single deterministic forecast hides risk.

Measure

Forecast error, scenario range, assumption age, and capacity shortfall frequency.
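
Scenario ranges and forecast error can be tracked with a few lines. The sketch below assumes each scenario is reduced to a peak rate and a Goodput per replica, and uses mean absolute percentage error as the forecast-error measure. Both choices, and all numbers, are illustrative assumptions.

```python
import math

def replicas_needed(peak_rps, goodput_per_replica, utilization_target=0.75):
    """Replicas required for a scenario's peak at the given utilization target."""
    return math.ceil(peak_rps / (goodput_per_replica * utilization_target))

def mean_absolute_percentage_error(forecast, actual):
    """Average of |forecast - actual| / actual across periods; actual values must be non-zero."""
    if len(forecast) != len(actual) or not actual:
        raise ValueError("forecast and actual must be non-empty and the same length")
    return sum(abs(f - a) / a for f, a in zip(forecast, actual)) / len(actual)

# Hypothetical scenarios: (peak requests/s, Goodput per replica).
scenarios = {"best": (90.0, 6.0), "base": (120.0, 5.0), "worst": (180.0, 4.0)}
scenario_replicas = {
    name: replicas_needed(peak, goodput) for name, (peak, goodput) in scenarios.items()
}

# Hypothetical weekly peak forecasts compared with observed peaks.
forecast_error = mean_absolute_percentage_error([100, 120, 150], [110, 120, 125])
print(scenario_replicas, round(forecast_error, 3))
```

The spread between 20 and 60 replicas in this sample is the risk that a single base-case figure of 32 would hide. A rising forecast error is a signal to refresh the assumptions behind each scenario.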

Reference tables

Capacity planning worksheet
Input                      | Example unit                    | Why it matters
Arrival distribution       | Requests/s by minute/percentile | Burst and baseline demand
Prompt/output distribution | Tokens/request                  | Prefill, decode, KV, and latency
Goodput/device             | SLO-qualified requests/s        | Safe capacity
Memory profile             | GB/model/active token           | Concurrency fit
Time to ready              | Minutes                         | Warm reserve and queue budget
Failure reserve            | Devices/region                  | Availability
Unit cost                  | $/device-hour, $/token, $/tool  | Cost/success forecast

Decision checklist

  1. What workload distributions and bursts must be supported?
  2. What device memory is reserved for weights, KV, buffers, and safety margin?
  3. What Goodput operating point meets p95/p99 SLOs?
  4. How much reserve covers failure, rollout, and burst?
  5. How long does new capacity take to become ready?
  6. Which costs are fixed, variable, shared, or external?
  7. Which sensitivity variables can change the plan most?
  8. How often is the forecast reconciled with production?

Common mistakes

  • Sizing from average request length.
  • Using maximum throughput as sustainable capacity.
  • Ignoring KV cache and runtime workspace.
  • Assuming autoscaling is instantaneous.
  • Excluding failed requests and idle reserve from cost.
  • Attributing shared GPU cost only by request count.
  • Publishing cost without date, region, model, and success criteria.

Sources and further reading


  1. MLPerf Inference: Datacenter

    MLCommons · Benchmark specification · accessed 2026-06-21 UTC

  2. KServe autoscaling

    KServe · Official documentation · accessed 2026-06-21 UTC

  3. Kubernetes resource management

    Kubernetes · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
