Search ARuntime.com

Find runtime definitions and implementation guidance

Search page titles, summaries, headings, glossary terms, use cases, and runtime-directory entries.

Enter at least two characters.

Architectures

Distributed Inference

Guide to tensor, pipeline, data, expert, context, sequence, and disaggregated prefill/decode inference, including interconnects, KV transfer, locality, scheduling, and failure handling.

Audience: Technical readers Reading time: 5 minutes Status: Architecture Last reviewed:

Key takeaways

  • Choose a distribution pattern for a specific bottleneck: model fit, throughput, expert routing, long context, or phase interference.
  • Communication, synchronization, and failure domains can erase theoretical compute gains.
  • Tensor and pipeline parallelism divide model execution; data parallelism replicates models for independent requests.
  • Disaggregated prefill/decode transfers KV state and requires fast fabrics and phase-aware routing.
  • Observability must attribute queue, compute, collective, transfer, and retry time across every participant.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Model shards or replicas, parallelism plan, topology, request routing, cache location, collective configuration, and worker health.

Owns

Sharding, replica placement, communication, synchronization, locality-aware routing, recovery boundaries, and distributed scheduling.

Emits

Distributed results, KV transfers, collective events, placement decisions, partial failures, and end-to-end traces.

Does not own

Model-quality approval, business workflow, or agent-tool authorization.

Failure modes

Collective timeout, straggler, topology mismatch, node loss, cache-transfer delay, load imbalance, shard/version skew, and stale routing.

Evidence and metrics

Per-stage queue/compute, collective time, bytes transferred, link utilization, straggler ratio, HBM headroom, locality, Goodput, and recovery.

Data parallel serving

Full model replicas serve independent requests.

Implementation

Route using queue, active tokens, memory, health, and optional cache locality.

Operational implications

Simplest pattern when the model fits one device; duplicates weight memory but provides horizontal capacity and failure isolation.

Measure

Requests/replica, load spread, queue, cache locality, Goodput, and failover.

Tensor parallelism

Large tensor operations are split across devices and partial results are combined with frequent collectives.

Implementation

Choose a degree matching physical topology and supported kernels; keep ranks and shard versions consistent.

Operational implications

Fast intra-node or rack links are critical. Communication can dominate small batches or poor topology.

Measure

Collective time/bytes, compute/communication ratio, scaling efficiency, and stragglers.

Pipeline parallelism

Model layers are assigned to stages and activations move between them.

Implementation

Balance stage time and memory, select microbatching, and manage pipeline bubbles.

Operational implications

A single slow stage limits throughput; failures often invalidate the whole pipeline group.

Measure

Stage latency, bubble fraction, activation transfer, throughput, and stage memory.

Expert parallelism

Mixture-of-experts weights are distributed and tokens route to selected experts.

Implementation

Use expert-aware all-to-all communication, capacity controls, and load balancing.

Operational implications

Hot experts and skew cause stragglers and memory pressure even when average routing is balanced.

Measure

Tokens/expert, all-to-all time, dropped/rerouted tokens, skew, and Goodput.

Context and sequence parallelism

Long-sequence work and memory are partitioned across devices.

Implementation

Use only with supported model/kernels and account for communication introduced by attention or sequence operations.

Operational implications

Benefits can disappear on short prompts; configuration is model-specific.

Measure

Sequence length, communication, memory reduction, latency, and scaling efficiency.

Prefill/decode disaggregation

Prefill workers build KV state; decode workers receive it and generate tokens.

Implementation

Use a phase-aware router, optimized KV transfer, capacity ratios, and locality metadata.

Operational implications

It helps when avoided interference exceeds transfer/queue cost; ordinary networks may negate it.

Measure

Prefill/decode queue, KV bytes/transfer time, TTFT, ITL, and Goodput.

Topology and collectives

Logical device groups execute over physical PCIe roots, high-speed links, switches, and networks.

Implementation

Record physical topology, communication library/version, operation type, message size, and participants.

Operational implications

Aggregate utilization can hide workers stalled on communication.

Measure

Link utilization, collective p95/p99, retries, timeouts, and compute idle.

Cache-aware routing

Routers can preserve KV/prefix locality across replicas and phase pools.

Implementation

Combine locality, queue, memory, health, and tenant policy. Expire location metadata on eviction/restart.

Operational implications

Locality-only routing creates hotspots; stale metadata increases TTFT.

Measure

Matched tokens, stale hit, queue delta, load balance, and transfer/recompute.

Elasticity and failure recovery

Replacement workers must load shards, build/load engines, join groups, allocate cache, and warm up.

Implementation

Classify request, rank, worker, group, node, and control-plane failures; retain retry lineage.

Operational implications

Distributed scale-up is not instant and retries can duplicate expensive work.

Measure

Join/ready time, group rebuild, retry success, unavailable devices, and recovery objective.

Reference tables

Distributed inference patterns
Pattern Solves Primary cost Failure mode
Data parallel Request throughput/availability Duplicate model memory Load imbalance/cold replica
Tensor parallel Layers too large/slow for one device Frequent collectives Collective timeout/topology mismatch
Pipeline parallel Model depth across devices Bubbles/activation transfer Straggler stage
Expert parallel Large sparse MoE experts All-to-all routing Hot experts/skew
Context/sequence parallel Long-context memory/attention Communication/complexity Poor scaling on short inputs
Prefill/decode disaggregation Phase interference and independent scaling KV transfer Transfer slower than co-location

Decision checklist

  1. Which single-device limit requires distribution?
  2. Which pattern matches the bottleneck and model architecture?
  3. What physical topology and bandwidth are guaranteed?
  4. How much communication occurs per token or request?
  5. How are KV state and cache locality routed and invalidated?
  6. What happens when one rank, stage, or worker fails?
  7. How long does a replacement group take to become ready?
  8. Can the result be compared with a simpler replicated baseline?

Common mistakes

  • Adding tensor parallelism when independent replicas meet the goal.
  • Reporting device count without topology and parallel configuration.
  • Assuming KV transfer is small relative to prompt execution.
  • Balancing request count while ignoring active tokens and cache locality.
  • Scaling down cache-rich workers without accounting for lost reuse.
  • Retrying partially executed distributed work without idempotency.
  • Using average network throughput instead of tail transfer latency.

Sources and further reading


  1. Distributed serving
    (opens in a new tab)

    vLLM · Official documentation · accessed 2026-06-21 UTC

  2. TensorRT-LLM parallelism
    (opens in a new tab)

    NVIDIA · Official documentation · accessed 2026-06-21 UTC

  3. NVIDIA Dynamo
    (opens in a new tab)

    NVIDIA · Official documentation · accessed 2026-06-21 UTC

  4. KServe multi-node inference
    (opens in a new tab)

    KServe · Official documentation · accessed 2026-06-21 UTC

  5. NCCL user guide
    (opens in a new tab)

    NVIDIA · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC

Maintenance record

Found an error, outdated capability, or unclear category boundary? Submit a correction with a supporting source.