Key takeaways
- Choose a distribution pattern for a specific bottleneck: model fit, throughput, expert routing, long context, or phase interference.
- Communication, synchronization, and failure domains can erase theoretical compute gains.
- Tensor and pipeline parallelism divide model execution; data parallelism replicates models for independent requests.
- Disaggregated prefill/decode transfers KV state and requires fast fabrics and phase-aware routing.
- Observability must attribute queue, compute, collective, transfer, and retry time across every participant.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Model shards or replicas, parallelism plan, topology, request routing, cache location, collective configuration, and worker health.
Owns
Sharding, replica placement, communication, synchronization, locality-aware routing, recovery boundaries, and distributed scheduling.
Emits
Distributed results, KV transfers, collective events, placement decisions, partial failures, and end-to-end traces.
Does not own
Model-quality approval, business workflow, or agent-tool authorization.
Failure modes
Collective timeout, straggler, topology mismatch, node loss, cache-transfer delay, load imbalance, shard/version skew, and stale routing.
Evidence and metrics
Per-stage queue/compute, collective time, bytes transferred, link utilization, straggler ratio, HBM headroom, locality, Goodput, and recovery.
Data parallel serving
Full model replicas serve independent requests.
Implementation
Route using queue, active tokens, memory, health, and optional cache locality.
Operational implications
Simplest pattern when the model fits one device; duplicates weight memory but provides horizontal capacity and failure isolation.
Measure
Requests/replica, load spread, queue, cache locality, Goodput, and failover.
Tensor parallelism
Large tensor operations are split across devices and partial results are combined with frequent collectives.
Implementation
Choose a degree matching physical topology and supported kernels; keep ranks and shard versions consistent.
Operational implications
Fast intra-node or rack links are critical. Communication can dominate small batches or poor topology.
Measure
Collective time/bytes, compute/communication ratio, scaling efficiency, and stragglers.
Pipeline parallelism
Model layers are assigned to stages and activations move between them.
Implementation
Balance stage time and memory, select microbatching, and manage pipeline bubbles.
Operational implications
A single slow stage limits throughput; failures often invalidate the whole pipeline group.
Measure
Stage latency, bubble fraction, activation transfer, throughput, and stage memory.
Expert parallelism
Mixture-of-experts weights are distributed and tokens route to selected experts.
Implementation
Use expert-aware all-to-all communication, capacity controls, and load balancing.
Operational implications
Hot experts and skew cause stragglers and memory pressure even when average routing is balanced.
Measure
Tokens/expert, all-to-all time, dropped/rerouted tokens, skew, and Goodput.
Context and sequence parallelism
Long-sequence work and memory are partitioned across devices.
Implementation
Use only with supported model/kernels and account for communication introduced by attention or sequence operations.
Operational implications
Benefits can disappear on short prompts; configuration is model-specific.
Measure
Sequence length, communication, memory reduction, latency, and scaling efficiency.
Prefill/decode disaggregation
Prefill workers build KV state; decode workers receive it and generate tokens.
Implementation
Use a phase-aware router, optimized KV transfer, capacity ratios, and locality metadata.
Operational implications
It helps when avoided interference exceeds transfer/queue cost; ordinary networks may negate it.
Measure
Prefill/decode queue, KV bytes/transfer time, TTFT, ITL, and Goodput.
Topology and collectives
Logical device groups execute over physical PCIe roots, high-speed links, switches, and networks.
Implementation
Record physical topology, communication library/version, operation type, message size, and participants.
Operational implications
Aggregate utilization can hide workers stalled on communication.
Measure
Link utilization, collective p95/p99, retries, timeouts, and compute idle.
Cache-aware routing
Routers can preserve KV/prefix locality across replicas and phase pools.
Implementation
Combine locality, queue, memory, health, and tenant policy. Expire location metadata on eviction/restart.
Operational implications
Locality-only routing creates hotspots; stale metadata increases TTFT.
Measure
Matched tokens, stale hit, queue delta, load balance, and transfer/recompute.
Elasticity and failure recovery
Replacement workers must load shards, build/load engines, join groups, allocate cache, and warm up.
Implementation
Classify request, rank, worker, group, node, and control-plane failures; retain retry lineage.
Operational implications
Distributed scale-up is not instant and retries can duplicate expensive work.
Measure
Join/ready time, group rebuild, retry success, unavailable devices, and recovery objective.
Reference tables
| Pattern | Solves | Primary cost | Failure mode |
|---|---|---|---|
| Data parallel | Request throughput/availability | Duplicate model memory | Load imbalance/cold replica |
| Tensor parallel | Layers too large/slow for one device | Frequent collectives | Collective timeout/topology mismatch |
| Pipeline parallel | Model depth across devices | Bubbles/activation transfer | Straggler stage |
| Expert parallel | Large sparse MoE experts | All-to-all routing | Hot experts/skew |
| Context/sequence parallel | Long-context memory/attention | Communication/complexity | Poor scaling on short inputs |
| Prefill/decode disaggregation | Phase interference and independent scaling | KV transfer | Transfer slower than co-location |
Decision checklist
- Which single-device limit requires distribution?
- Which pattern matches the bottleneck and model architecture?
- What physical topology and bandwidth are guaranteed?
- How much communication occurs per token or request?
- How are KV state and cache locality routed and invalidated?
- What happens when one rank, stage, or worker fails?
- How long does a replacement group take to become ready?
- Can the result be compared with a simpler replicated baseline?
Common mistakes
- Adding tensor parallelism when independent replicas meet the goal.
- Reporting device count without topology and parallel configuration.
- Assuming KV transfer is small relative to prompt execution.
- Balancing request count while ignoring active tokens and cache locality.
- Scaling down cache-rich workers without accounting for lost reuse.
- Retrying partially executed distributed work without idempotency.
- Using average network throughput instead of tail transfer latency.
Sources and further reading
-
Distributed serving
(opens in a new tab)
-
TensorRT-LLM parallelism
(opens in a new tab)
-
NVIDIA Dynamo
(opens in a new tab)
-
KServe multi-node inference
(opens in a new tab)
-
NCCL user guide
(opens in a new tab)
Last reviewed: 2026-06-21 UTC
