ARuntime Reference

Disaggregated Inference

Separating prefill, decode, cache, and other serving stages allows each resource pool to scale against different compute, memory, and latency characteristics. It also turns cache transfer and network placement into first-class runtime problems.

Audience: Technical readers | Reading time: 4 minutes | Status: Research and production guidance | Last reviewed:

Disaggregated inference separates execution phases or state services that a conventional LLM engine colocates. The most studied pattern assigns prompt prefill and token decode to different worker pools and moves KV state between them.

Key takeaways

  • Disaggregation can reduce interference and allow asymmetric scaling.
  • It replaces a local-memory problem with routing, transport, consistency, and failure-recovery problems.
  • Use it when measured workload shape and service objectives justify the additional system complexity.

Why disaggregate

Prompt prefill performs parallel computation over the input and tends to be compute-intensive. Decode advances one or a few token steps per active sequence and repeatedly reads accumulated KV state. When both share the same device and scheduler, a large prefill can delay decode and create inter-token jitter. Separating the pools can isolate these resource profiles and scale them independently. DistServe formalizes this approach around goodput constrained by time-to-first-token and time-per-output-token objectives. [DistServe]
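To make the goodput idea concrete, the following is a minimal sketch that counts only requests meeting both latency objectives. It is a simplified per-request reading, not DistServe's exact definition; `RequestTiming`, `goodput`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds


def goodput(timings, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that met BOTH latency objectives in the window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    within_slo = sum(
        1 for t in timings
        if t.ttft_s <= ttft_slo_s and t.tpot_s <= tpot_slo_s
    )
    return within_slo / window_s


# Illustrative measurements (assumed values, not benchmark results).
samples = [
    RequestTiming(ttft_s=0.20, tpot_s=0.03),
    RequestTiming(ttft_s=0.90, tpot_s=0.03),  # misses the TTFT objective
    RequestTiming(ttft_s=0.25, tpot_s=0.08),  # misses the TPOT objective
    RequestTiming(ttft_s=0.30, tpot_s=0.04),
]
print(goodput(samples, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))  # 1.0
```

Raw throughput for the same window would be 2.0 requests per second; the gap between the two numbers is the traffic that completed but violated an objective.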

The pattern is not automatically faster. It adds a mandatory state handoff, extra routing decisions, more failure points, and a capacity-balancing problem between pools.

Prefill and decode pools

Prefill pool

Optimized for prompt processing, compute utilization, chunked prefill, and creation of KV state.

Decode pool

Optimized for memory capacity/bandwidth, sequence scheduling, cache residency, and inter-token latency.

Shared or encode pool

Multimodal encoders or embedding stages may be separated when their resource profile competes with decode.

Transfer and event services

Move state, advertise cache locality, and coordinate ownership across workers.

Asymmetric pools are useful only when the planner can estimate demand for each phase. Too few prefill workers starve decode; too few decode workers let completed prefills pile up waiting for a decode slot, which delays the first token.
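A first-order estimate of pool sizes can come from token demand per phase. The sketch below assumes steady traffic and measured per-worker token rates; `plan_pools`, its parameters, and the 0.7 utilization target are illustrative assumptions, and a real planner would also account for burstiness and cache hits.

```python
import math


def workers_needed(demand_tok_per_s, worker_tok_per_s, target_utilization=0.7):
    """Smallest worker count that keeps each worker at or under the target."""
    if worker_tok_per_s <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rates must be positive and utilization in (0, 1]")
    usable = worker_tok_per_s * target_utilization
    return max(1, math.ceil(demand_tok_per_s / usable))


def plan_pools(req_per_s, mean_prompt_tokens, mean_output_tokens,
               prefill_tok_per_s, decode_tok_per_s, target_utilization=0.7):
    """Size each pool from its own token demand (steady-state assumption)."""
    return {
        "prefill": workers_needed(req_per_s * mean_prompt_tokens,
                                  prefill_tok_per_s, target_utilization),
        "decode": workers_needed(req_per_s * mean_output_tokens,
                                 decode_tok_per_s, target_utilization),
    }


# Assumed workload: 10 req/s, 2000-token prompts, 200-token outputs.
plan = plan_pools(req_per_s=10, mean_prompt_tokens=2000, mean_output_tokens=200,
                  prefill_tok_per_s=20_000, decode_tok_per_s=1_000)
print(plan)  # {'prefill': 2, 'decode': 3}
```

The ratio between the two counts shifts with prompt and output length, which is why the estimate needs to be recomputed per traffic class rather than fixed at deployment time.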

KV-aware routing

A router should consider more than immediate queue length. Relevant inputs include prefix or cache locality, expected prefill work, active sequence load, worker memory pressure, network topology, tenant placement, and deadline. Multi-turn traffic may benefit from returning to a worker or cache tier that already holds reusable state.

Cache affinity can create hotspots. The policy should make visible, for each decision, whether it favored locality, load balance, or deadline, and should cap the penalty one tenant can impose by retaining large cache state.
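One way to combine these inputs is a weighted score with a hard memory cutoff. The following is a sketch only: `WorkerState`, the weights, and the normalization constant are hypothetical, and it omits topology, tenant placement, and deadlines. It returns the per-term breakdown so the reason for a decision can be logged.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    cached_prefix_tokens: int    # tokens of this prompt's prefix already resident
    queued_tokens: int           # outstanding work on the worker
    memory_used_fraction: float  # 0.0 to 1.0


def score(worker, prompt_tokens, w_locality=1.0, w_load=1.0, w_memory=0.5,
          queue_norm_tokens=8192):
    """Return (total, terms); higher totals are better placements."""
    hit = min(worker.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    load = min(worker.queued_tokens / queue_norm_tokens, 1.0)
    terms = {
        "locality": w_locality * hit,
        "load": -w_load * load,
        "memory": -w_memory * worker.memory_used_fraction,
    }
    return sum(terms.values()), terms


def pick_worker(workers, prompt_tokens, memory_limit=0.95):
    """Skip workers over the memory limit, then take the best score."""
    eligible = [w for w in workers if w.memory_used_fraction < memory_limit]
    if not eligible:
        raise RuntimeError("no eligible worker; apply backpressure instead")
    best = max(eligible, key=lambda w: score(w, prompt_tokens)[0])
    return best, score(best, prompt_tokens)[1]


busy = WorkerState("decode-a", cached_prefix_tokens=3000,
                   queued_tokens=8000, memory_used_fraction=0.9)
idle = WorkerState("decode-b", cached_prefix_tokens=0,
                   queued_tokens=1000, memory_used_fraction=0.4)
chosen, terms = pick_worker([busy, idle], prompt_tokens=4000)
print(chosen.name)  # decode-b
```

In this example the worker holding the prefix loses because its queue and memory penalties outweigh the cache hit, which is the hotspot behavior the policy is meant to avoid.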

Cache transport

The handoff payload can be large, so transport efficiency is central. Implementations may use high-speed device interconnects, RDMA-capable networking, host staging, or a shared cache service. The runtime must bind transferred blocks to model version, layer layout, precision, sequence identity, and token position. A cache generated by an incompatible engine or model build must not be reused.
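The binding requirement can be sketched as a header that travels with each block and is checked before the receiver accepts it. The field names and the `validate_block` helper below are hypothetical; a real engine would carry additional identifiers, such as a build or tokenizer hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLayout:
    model_version: str
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype: str          # precision of the stored K/V values
    block_tokens: int   # tokens per transferred block


@dataclass(frozen=True)
class KVBlockHeader:
    layout: CacheLayout
    sequence_id: str
    start_token: int    # position of the first token in this block


def validate_block(header, expected_layout, sequence_id, next_token):
    """Return 'ok' or the reason the block must be rejected."""
    if header.layout != expected_layout:
        return "layout_mismatch"   # different model build, precision, or shape
    if header.sequence_id != sequence_id:
        return "wrong_sequence"
    if header.start_token != next_token:
        return "position_gap"      # a block is missing or duplicated
    return "ok"


layout = CacheLayout("model-v1", 32, 8, 128, "float16", 16)
other = CacheLayout("model-v1", 32, 8, 128, "float8", 16)
header = KVBlockHeader(layout, sequence_id="seq-7", start_token=32)

print(validate_block(header, layout, "seq-7", next_token=32))  # ok
print(validate_block(header, other, "seq-7", next_token=32))   # layout_mismatch
```

Frozen dataclasses compare by value, so any difference in a layout field fails the equality check without field-by-field code.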

Transport telemetry should separate serialization or packing, queue delay, network transfer, validation, and device placement. Without that breakdown, a “decode latency” problem may actually be a state-transfer problem.
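A per-stage timer is one simple way to get this breakdown. The sketch below uses hypothetical stage names that mirror the list above; the scripted clock exists only to make the example deterministic, and production code would use the default `time.perf_counter`.

```python
import time
from contextlib import contextmanager


class TransferTrace:
    """Accumulates wall-clock time per handoff stage."""

    STAGES = ("pack", "queue", "network", "validate", "place")

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.seconds = {stage: 0.0 for stage in self.STAGES}

    @contextmanager
    def stage(self, name):
        if name not in self.seconds:
            raise KeyError(f"unknown stage: {name}")
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            self.seconds[name] += self._clock() - start

    def dominant_stage(self):
        return max(self.seconds, key=self.seconds.get)


# Scripted clock readings (assumed values) in place of real work.
ticks = iter([0.0, 0.002, 0.002, 0.040, 0.040, 0.041])
trace = TransferTrace(clock=lambda: next(ticks))
with trace.stage("pack"):
    pass
with trace.stage("network"):
    pass
with trace.stage("validate"):
    pass
print(trace.dominant_stage())  # network
```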

KV-centric storage

Mooncake describes a KV-cache-centric serving architecture that treats distributed cache as a shared runtime resource rather than state permanently attached to one GPU worker. [Mooncake] Hierarchical tiers can include device memory, host memory, local storage, and remote cache services. A global namespace improves reuse and elasticity but requires explicit policies for ownership, eviction, consistency, access control, and secure deletion.
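The tier hierarchy can be pictured as a directory that records where each block currently lives and demotes it one tier at a time under pressure. This is a toy sketch with hypothetical names, not Mooncake's design; it omits ownership, access control, and secure deletion entirely.

```python
TIERS = ("device", "host", "local_storage", "remote")  # fastest first


class TieredCacheDirectory:
    """Tracks the current tier of each cache block key."""

    def __init__(self):
        self._location = {}  # block key -> tier name

    def publish(self, key, tier):
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        self._location[key] = tier

    def lookup(self, key):
        """Return the tier holding the block, or None if it must be recomputed."""
        return self._location.get(key)

    def demote(self, key):
        """Move a block to the next slower tier; drop it past the slowest."""
        tier = self._location.get(key)
        if tier is None:
            return None
        next_index = TIERS.index(tier) + 1
        if next_index == len(TIERS):
            del self._location[key]
            return None
        self._location[key] = TIERS[next_index]
        return TIERS[next_index]


directory = TieredCacheDirectory()
directory.publish("prefix-abc", "device")
directory.demote("prefix-abc")
print(directory.lookup("prefix-abc"))  # host
```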

Elasticity and placement

A planner can scale prefill and decode pools independently, but scaling decisions must account for model-load time, cache warmup, outstanding state, and network capacity. Placement should prefer fast paths for expected state movement while respecting failure domains and tenant constraints. Draining a decode worker may require sequences to finish or state to migrate before termination.
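The drain decision can be sketched per sequence: let it finish if its remaining work fits the grace period, otherwise migrate its state. The sketch assumes an upper bound on remaining tokens is known (for example, the request's remaining output limit) and that per-token latency is roughly stable; `drain_plan` and its parameters are hypothetical.

```python
def drain_plan(remaining_tokens_by_seq, tpot_s, grace_s):
    """Map each sequence id to 'finish' or 'migrate' for a draining worker."""
    if tpot_s <= 0 or grace_s < 0:
        raise ValueError("tpot_s must be positive and grace_s non-negative")
    plan = {}
    for seq_id, remaining in remaining_tokens_by_seq.items():
        estimated_s = remaining * tpot_s
        plan[seq_id] = "finish" if estimated_s <= grace_s else "migrate"
    return plan


# Assumed values: 40 ms per output token and a 10-second grace period.
active = {"seq-1": 50, "seq-2": 2000}
print(drain_plan(active, tpot_s=0.04, grace_s=10.0))
# {'seq-1': 'finish', 'seq-2': 'migrate'}
```

The worker should stop admitting new sequences before this plan is computed; otherwise the set of active sequences keeps changing underneath it.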

Failure modes

Disaggregated-inference failure model

| Failure | Required response |
| --- | --- |
| Prefill worker fails before publish | Discard incomplete state and retry the prefill within request budget. |
| State transfer fails | Retry transport or re-prefill; never expose partial cache as valid. |
| Decode worker fails mid-sequence | Resume only from a validated checkpoint/cache position or restart with explicit output semantics. |
| Cache directory is stale | Invalidate location, reconcile ownership, and route without assuming reuse. |
| Network partition | Protect against duplicate ownership and split-brain cache updates. |
| Pool imbalance | Apply backpressure, admission limits, or temporary colocation rather than unbounded queueing. |
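The transfer-failure response can be sketched as a bounded retry that falls back to recomputing the prefill. In this sketch `transfer` and `reprefill` are hypothetical callables supplied by the caller, `transfer` is assumed to return only a fully validated cache handle, and the retry cap stands in for a real request-budget check.

```python
class TransferError(Exception):
    """Raised when a KV handoff fails or fails validation."""


def hand_off(transfer, reprefill, max_retries=2):
    """Return (cache_handle, how). A failed transfer never yields a handle."""
    for _ in range(1 + max_retries):
        try:
            return transfer(), "transferred"
        except TransferError:
            continue  # nothing partial escapes; try again or fall through
    return reprefill(), "reprefilled"


attempts = {"count": 0}


def flaky_transfer():
    # Fails on the first call, succeeds on the second (illustrative only).
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise TransferError("timeout")
    return "kv-handle"


result = hand_off(flaky_transfer, reprefill=lambda: "recomputed-kv")
print(result)  # ('kv-handle', 'transferred')
```

Because the handle is only returned from a successful `transfer()` call, the decode side never observes a partially received cache.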

Adoption checklist

  • Measure prefill/decode interference in the current workload.
  • Define TTFT, TPOT, deadline, and goodput objectives by traffic class.
  • Profile cache size and transport time at representative context lengths.
  • Validate model/engine/cache compatibility identifiers.
  • Test worker loss, stale directory, duplicate transfer, and network degradation.
  • Compare against a simpler colocated baseline at equal hardware and quality.
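For the cache-size and transport-time item, a back-of-envelope estimate is a useful starting point before profiling. The sketch assumes a standard attention cache that stores one key and one value vector per layer, KV head, and token, with no compression; the model shape and the 25 GB/s effective bandwidth are illustrative assumptions, not measurements.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Size of the KV state for one sequence; the factor 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, effective_gb_per_s):
    """Ideal transfer time, ignoring packing, queueing, and validation."""
    if effective_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (effective_gb_per_s * 1e9)


# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(tokens=8192, num_layers=32, num_kv_heads=8,
                      head_dim=128, bytes_per_value=2)
print(size)  # 1073741824 bytes (1 GiB)
print(round(transfer_seconds(size, effective_gb_per_s=25) * 1000, 1), "ms")
```

Measured transfer time will be higher than this ideal figure; the difference is what the per-stage transport telemetry should explain.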
