A distributed inference runtime coordinates model execution across devices or hosts. It owns parallelism, collective communication, placement, remote cache, phase disaggregation, elasticity, and failure handling that a single-process engine cannot provide.
Key takeaways
- Tensor, pipeline, expert, and data parallelism solve different capacity and throughput constraints.
- Disaggregating prefill and decode removes one interference source but creates cache-transfer and routing problems.
- Cluster goodput depends on network, state placement, scheduler, and SLO—not GPU utilization alone.
Definition
Distribution begins when one model request or service requires coordinated work beyond one device process. Replicating independent servers is distributed serving; sharding a model or transferring execution state is distributed model execution. Both may coexist.
Parallelism
Tensor parallelism
Splits tensor operations across devices and requires frequent collective communication.
Pipeline parallelism
Splits model stages and schedules microbatches through them.
Expert parallelism
Distributes mixture-of-experts components and routes tokens to selected experts.
Data parallel or replica serving
Replicates model execution and routes independent requests.
Parallelism should be selected from model size, interconnect, batch/concurrency, latency objective, and failure domain. More devices can reduce capacity pressure while increasing communication and coordination cost.
Prefill-decode disaggregation
Prefill is compute-oriented; decode repeatedly accesses memory state. Systems such as DistServe study separating these phases to improve goodput under latency constraints. [ar_cite id=”distserve” label=”DistServe”] The trade is explicit: cache state must reach the decode worker, and the scheduler must coordinate two pools without increasing time to first token.
Communication and placement
Placement considers device topology, interconnect bandwidth, collective algorithms, cache locality, failure domain, and tenant policy. A scheduler may prefer a loaded worker with a reusable prefix over an idle worker requiring a large transfer. Network-aware planning should expose when locality, load, or deadline drove the decision.
Remote cache and state
Distributed caches can span GPU memory, host memory, local storage, and remote pools. Mooncake presents a KV-cache-centric architecture in which cache storage and transfer become shared runtime services. [ar_cite id=”mooncake” label=”Mooncake”] Remote state increases reuse and elasticity but adds consistency, transport, eviction, and security boundaries.
Failure handling
- Detect worker, device, collective, and transfer failure separately.
- Cancel peers when a coordinated step cannot complete safely.
- Reconstruct or invalidate remote cache with version and ownership checks.
- Drain stateful sequences before rollout where possible.
- Do not repeat partially streamed application output without explicit semantics.
Metrics and selection
Measure request goodput under defined TTFT/TPOT objectives, queueing, transfer time, cache hit and reuse, collective time, batch composition, GPU and memory utilization, failure recovery, and cost. Record model, version, hardware, runtime, precision, input/output length, concurrency, cache state, and warm/cold condition.
