Distributed Inference Runtimes

A distributed inference runtime coordinates model execution across devices or hosts. It owns parallelism, collective communication, placement, remote cache, phase disaggregation, elasticity, and failure handling that a single-process engine cannot provide.

Key takeaways

Tensor, pipeline, expert, and data parallelism solve different capacity and throughput constraints.
Disaggregating prefill and decode removes one interference source but creates cache-transfer and routing problems.
Cluster goodput depends on the network, state placement, scheduling, and the SLO definition, not on GPU utilization alone.

Definition

Distribution begins when one model request or service requires coordinated work beyond one device process. Replicating independent servers is distributed serving; sharding a model or transferring execution state is distributed model execution. Both may coexist.

Parallelism

Tensor parallelism

Splits tensor operations across devices and requires frequent collective communication.
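As a minimal sketch of the idea, the following pure-Python example splits a weight matrix by columns, computes each shard's partial output separately, and concatenates the results. The function names are illustrative, and the list concatenation only stands in for an all-gather collective; a real runtime would run each shard on its own device and use a communication library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product: x is (m x k), w is (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal column blocks, one per simulated device."""
    n = len(w[0])
    if n % parts != 0:
        raise ValueError("column count must divide evenly across parts")
    width = n // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Compute x @ w from column shards of w."""
    # Each partial product would run on a separate device.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating column blocks row by row stands in for an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The sharded result equals the unsharded product, which is why the split is safe; the price is one collective per sharded operation.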

Pipeline parallelism

Splits model stages and schedules microbatches through them.
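One way to reason about the cost of pipelining is the idle ("bubble") fraction. The sketch below assumes a simple synchronous schedule in which every stage takes the same time per microbatch, so that filling and draining the pipeline cost `num_stages - 1` extra time slots. Real schedules and uneven stages will differ, and the function name is illustrative.

```python
def pipeline_idle_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage time spent idle while the pipeline fills and drains.

    Assumes equal per-stage time and a synchronous schedule (a simplification).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be positive")
    # With p stages and m microbatches, the schedule spans m + p - 1 time slots,
    # of which each stage is busy for m.
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots
```

Under these assumptions, 4 stages with 12 microbatches leave each stage idle for 3 of 15 slots; more microbatches shrink the bubble but add queueing before the last microbatch completes.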

Expert parallelism

Distributes mixture-of-experts components and routes tokens to selected experts.
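The routing step can be sketched as follows. This example assumes top-k selection over per-token router scores and a contiguous expert-to-device mapping (`expert // experts_per_device`); both are illustrative choices, and production routers typically add capacity limits and load balancing that are omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def route_tokens(
    token_scores: Sequence[Sequence[float]],
    top_k: int,
    experts_per_device: int,
) -> Dict[int, List[Tuple[int, int]]]:
    """Group (token_id, expert_id) pairs by the device that hosts the expert."""
    per_device: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for token_id, scores in enumerate(token_scores):
        # Pick the top_k experts with the highest router score for this token.
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        for expert in ranked[:top_k]:
            # Hypothetical placement: experts are laid out contiguously by device.
            per_device[expert // experts_per_device].append((token_id, expert))
    return dict(per_device)
```

The per-device lists correspond to what each device must receive in the dispatch step, so their sizes show how uneven routing turns into uneven communication and compute.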

Data parallel or replica serving

Replicates model execution and routes independent requests.

Parallelism should be selected from model size, interconnect, batch/concurrency, latency objective, and failure domain. More devices can reduce capacity pressure while increasing communication and coordination cost.
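A first-pass capacity check can be sketched as below. It assumes weights shard evenly across tensor and pipeline dimensions, reserves a fraction of device memory for KV cache and activations, and breaks ties by preferring fewer pipeline stages. All names and the tie-break rule are assumptions for illustration; interconnect, latency objective, and failure domain are deliberately left out and would need to be weighed separately.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def devices(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel


def weight_bytes_per_device(param_count: int, bytes_per_param: int, plan: Plan) -> float:
    # Simplification: weights divide evenly across every device in the plan.
    return param_count * bytes_per_param / plan.devices


def smallest_fitting_plan(
    param_count: int,
    bytes_per_param: int,
    device_bytes: float,
    reserve_fraction: float,
    candidates: Sequence[Plan],
) -> Optional[Plan]:
    """Return the candidate using the fewest devices whose weight shard fits."""
    # Memory left for weights after reserving room for cache and activations.
    budget = device_bytes * (1.0 - reserve_fraction)
    fitting = [
        p for p in candidates
        if weight_bytes_per_device(param_count, bytes_per_param, p) <= budget
    ]
    if not fitting:
        return None
    # Arbitrary tie-break for this sketch: fewer devices, then fewer stages.
    return min(fitting, key=lambda p: (p.devices, p.pipeline_parallel))
```

For example, 70 billion parameters at 2 bytes each need 140 GB of weights, so with 80 GB devices and 30% reserved, the smallest fitting candidate uses four devices.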

Prefill-decode disaggregation

Prefill processes the whole prompt in one pass and is typically compute-bound; decode generates one token per step and is typically bound by memory bandwidth, because every step rereads the model weights and the accumulated KV cache. Systems such as DistServe study separating these phases to improve goodput under latency constraints. [DistServe] The trade is explicit: cache state must reach the decode worker, and the scheduler must coordinate two pools without increasing time to first token.
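The size of the cache-transfer problem can be estimated with simple arithmetic. The sketch below assumes a standard attention layout with one key and one value tensor per layer; the function names, the fixed-latency term, and the example dimensions are illustrative and not taken from any particular model or system.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_element: int,
) -> int:
    """Bytes of KV state for one sequence of num_tokens tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


def transfer_seconds(
    num_bytes: int,
    bandwidth_bytes_per_s: float,
    fixed_latency_s: float = 0.0,
) -> float:
    """Simple transfer-time model: fixed setup latency plus size over bandwidth."""
    return fixed_latency_s + num_bytes / bandwidth_bytes_per_s


def fits_ttft_budget(prefill_s: float, transfer_s: float, ttft_budget_s: float) -> bool:
    """Check that prefill plus cache transfer stays within the TTFT objective."""
    return prefill_s + transfer_s <= ttft_budget_s
```

With hypothetical dimensions of 32 layers, 8 KV heads, head dimension 128, 2-byte elements, and a 4096-token prompt, the cache is 512 MiB, so the link between the prefill and decode pools sits directly on the time-to-first-token path.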

Communication and placement

Placement considers device topology, interconnect bandwidth, collective algorithms, cache locality, failure domain, and tenant policy. A scheduler may prefer a loaded worker with a reusable prefix over an idle worker requiring a large transfer. Network-aware planning should expose when locality, load, or deadline drove the decision.
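The trade-off between load and locality, and the idea of exposing the reason for a decision, can be sketched as follows. The cost model is an assumption: each worker's readiness time is its queue delay plus a per-token cost for prompt tokens it does not already hold (whether that cost is a transfer or a recompute). All names are hypothetical, and deadlines, topology, and tenant policy are omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Worker:
    name: str
    queue_delay_s: float       # estimated wait before new work can start
    cached_prefix_tokens: int  # prompt tokens whose state is already local


@dataclass(frozen=True)
class Decision:
    worker: str
    estimated_ready_s: float
    reason: str  # "locality" or "load"


def place(
    prompt_tokens: int,
    workers: Sequence[Worker],
    missing_token_cost_s: float,
) -> Decision:
    """Pick the worker with the lowest estimated readiness time and say why."""
    if not workers:
        raise ValueError("no workers available")

    def ready_time(w: Worker) -> float:
        missing = max(prompt_tokens - w.cached_prefix_tokens, 0)
        return w.queue_delay_s + missing * missing_token_cost_s

    best = min(workers, key=ready_time)
    least_loaded = min(workers, key=lambda w: w.queue_delay_s)
    # If the winner is not the least-loaded worker, cached state drove the choice.
    reason = "load" if best is least_loaded else "locality"
    return Decision(best.name, ready_time(best), reason)
```

Returning the reason alongside the choice makes it possible to log and audit how often locality overrides load, which is the kind of visibility network-aware planning needs.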

Remote cache and state

Distributed caches can span GPU memory, host memory, local storage, and remote pools. Mooncake presents a KV-cache-centric architecture in which cache storage and transfer become shared runtime services. [Mooncake] Remote state increases reuse and elasticity but adds consistency, transport, eviction, and security boundaries.
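A tiered lookup can be sketched as below. This is a toy model, not the design of Mooncake or any other system: tiers are in-process dictionaries ordered from fastest to slowest, a hit in a slower tier is copied into the fastest one, and eviction, consistency, and access control are deliberately omitted. The class and tier names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


class TieredCache:
    """Toy cache whose tiers are searched from fastest to slowest."""

    def __init__(self, tier_names: Sequence[str]) -> None:
        if not tier_names:
            raise ValueError("at least one tier is required")
        # Dict insertion order preserves the fastest-to-slowest search order.
        self._tiers: Dict[str, Dict[str, bytes]] = {name: {} for name in tier_names}
        self._fastest = tier_names[0]

    def put(self, tier: str, key: str, value: bytes) -> None:
        self._tiers[tier][key] = value

    def get(self, key: str) -> Tuple[Optional[str], Optional[bytes]]:
        """Return (tier that served the hit, value), or (None, None) on a miss."""
        for name, store in self._tiers.items():
            if key in store:
                value = store[key]
                if name != self._fastest:
                    # Promote so the next lookup is served from the fastest tier.
                    self._tiers[self._fastest][key] = value
                return name, value
        return None, None
```

Reporting which tier served each hit gives the per-tier hit rates needed to judge whether remote state is actually paying for its transport cost.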

Failure handling

Detect worker, device, collective, and transfer failure separately.
Cancel peers when a coordinated step cannot complete safely.
Reconstruct or invalidate remote cache with version and ownership checks.
Drain stateful sequences before rollout where possible.
Do not repeat partially streamed application output without explicit semantics.
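The version and ownership checks mentioned above can be sketched as a small validation function. The scheme here is an assumption for illustration: each cache entry records the model version it was built with and an owner epoch, and a membership table maps live owners to their current epoch. None of these names come from a specific runtime.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class CacheAction(Enum):
    REUSE = "reuse"
    INVALIDATE = "invalidate"


@dataclass(frozen=True)
class CacheEntry:
    model_version: str
    owner: str
    owner_epoch: int  # hypothetical: bumped whenever the owner restarts or changes


def check_entry(
    entry: CacheEntry,
    expected_model_version: str,
    live_owner_epochs: Mapping[str, int],
) -> CacheAction:
    """Decide whether a remote cache entry is safe to reuse."""
    # State built by a different model version must never be reused.
    if entry.model_version != expected_model_version:
        return CacheAction.INVALIDATE
    # The owner must still be alive and in the same epoch that wrote the entry.
    current_epoch = live_owner_epochs.get(entry.owner)
    if current_epoch is None or current_epoch != entry.owner_epoch:
        return CacheAction.INVALIDATE
    return CacheAction.REUSE
```

Defaulting to invalidation on any mismatch trades some recomputation for safety, which is usually the right direction after a failure or rollout.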

Metrics and selection

Measure request goodput under defined TTFT/TPOT objectives, queueing, transfer time, cache hit and reuse, collective time, batch composition, GPU and memory utilization, failure recovery, and cost. Record model, version, hardware, runtime, precision, input/output length, concurrency, cache state, and warm/cold condition.
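Goodput under TTFT and TPOT objectives can be computed from per-request records, as in this minimal sketch. It assumes goodput means completed requests per second that meet both objectives, and that TPOT is the decode time divided by the number of tokens after the first; the record fields and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    ttft_s: float       # time to first token
    decode_s: float     # time from first token to last token
    output_tokens: int

    def tpot_s(self) -> float:
        """Mean time per output token after the first."""
        if self.output_tokens <= 1:
            return 0.0
        return self.decode_s / (self.output_tokens - 1)


def goodput(
    records: Sequence[RequestRecord],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    """Requests per second that met both the TTFT and TPOT objectives."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    met = sum(
        1 for r in records
        if r.ttft_s <= ttft_slo_s and r.tpot_s() <= tpot_slo_s
    )
    return met / window_s
```

A request that misses either objective counts toward throughput but not goodput, which is why the two numbers can diverge sharply on a saturated cluster.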
