Key takeaways
- Batching improves utilization only when queueing and memory pressure remain within SLOs.
- Continuous batching fits variable-length generation better than fixed completion barriers.
- Prefill and decode compete for device time; one long prompt can degrade many streaming requests.
- Fairness, priorities, tenant quotas, cancellation, and backpressure require explicit policy.
- Goodput under latency and quality constraints is a safer optimization target than maximum raw throughput.
Runtime boundary
A useful architecture states what the scheduling layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Admitted requests, prompt/output estimates, priority, deadline, tenant quota, memory requirement, cache locality, and worker capacity.
Owns
Admission, ordering, batch composition, phase mix, fairness, quotas, deferral, and overload signaling.
Emits
Execution batches, queue state, reservations, rejection/routing decisions, cancellation actions, and scheduler telemetry.
Does not own
User authorization, model correctness, or product-level retry semantics.
Failure modes
Unbounded queues, head-of-line blocking, starvation, unfair tenants, prefill interference, OOM admission, stale cancellation, and retry storms.
Evidence and metrics
Queue depth/time, admitted/rejected counts, active sequences/tokens, batch composition, phase share, goodput, fairness, and release time.
Fixed and dynamic batching
Fixed batches complete together; dynamic batching waits briefly to combine compatible independent requests.
Implementation
Define maximum queue delay, compatibility, preferred/max size, and per-model instance capacity.
Operational implications
Useful for predictive serving and short uniform work. Completion barriers hurt variable-length generation because every request in the batch waits for the slowest one.
Measure
Batch size, queue delay, throughput, p95 latency, and rejected requests.
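The windowing rule above can be sketched as an offline simulation over arrival timestamps. This is a minimal illustration, not any server's actual batcher: `Request`, `form_batches`, `max_delay`, and `max_size` are hypothetical names, and compatibility checks are assumed to have already happened.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival: float  # arrival time in seconds


def form_batches(requests, max_delay, max_size):
    """Group requests into batches in arrival order.

    A batch closes when it is full, or when the next arrival would push
    the oldest request in the batch past max_delay.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival):
        too_late = current and req.arrival - current[0].arrival > max_delay
        if current and (len(current) >= max_size or too_late):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


arrivals = [("a", 0.000), ("b", 0.001), ("c", 0.002), ("d", 0.050), ("e", 0.051)]
reqs = [Request(i, t) for i, t in arrivals]
for batch in form_batches(reqs, max_delay=0.005, max_size=2):
    print([r.req_id for r in batch])
```

A live batcher would close the window on a timer rather than on the next arrival, but the trade-off is the same: a larger `max_delay` fills batches at the cost of queue time for the oldest request.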
Continuous batching
The active generation batch changes as sequences finish and new requests join.
Implementation
Track active tokens, memory blocks, sequence state, and per-iteration phase work.
Operational implications
Improves utilization for variable output lengths but complicates fairness, cancellation, and memory admission.
Measure
Active sequences/tokens, iteration duration, batch churn, goodput, and inter-token latency (ITL).
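A toy iteration loop shows why a changing active set helps with variable output lengths. The sketch assumes one decode token per active sequence per iteration, ignores prefill and memory, and uses hypothetical names (`run_continuous`, `max_active`); every request is assumed to produce at least one token.

```python
from collections import deque


def run_continuous(pending, max_active):
    """Simulate iteration-level scheduling.

    pending: list of (req_id, output_tokens) in arrival order.
    Returns (total_iterations, {req_id: iteration at which it finished}).
    """
    waiting = deque(pending)
    active = {}  # req_id -> tokens still to generate
    finished_at = {}
    iteration = 0
    while waiting or active:
        # Free slots are refilled at every iteration boundary.
        while waiting and len(active) < max_active:
            req_id, tokens = waiting.popleft()
            active[req_id] = tokens
        iteration += 1
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                del active[req_id]
                finished_at[req_id] = iteration
    return iteration, finished_at


total, done = run_continuous([("a", 2), ("b", 5), ("c", 3)], max_active=2)
print(total, done)
```

In this run `c` joins as soon as `a` finishes, so all three requests complete in 5 iterations. With a fixed completion barrier and the same two slots, `c` would wait for `b` and the total would be 5 + 3 = 8 iterations.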
Admission and backpressure
The scheduler estimates whether a request can fit memory, queue, deadline, and quota before activating it.
Implementation
Use prompt length, max output, cache state, model/adapter, tenant, and request class; bound queues and signal overload upstream.
Operational implications
Unbounded queues convert a capacity shortage into timeouts, memory pressure, and retry amplification.
Measure
Queue age/depth, admitted/deferred/rejected counts, estimate error, Retry-After use, and timeouts.
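One way to picture the admission check is a pure function over a worst-case token estimate. The following is a sketch under stated assumptions: `Capacity`, `admit`, the 16-token block size, and reserving `prompt_tokens + max_output_tokens` up front are all illustrative choices, not a description of any particular engine.

```python
from dataclasses import dataclass


@dataclass
class Capacity:
    free_kv_blocks: int
    queue_depth: int
    max_queue_depth: int
    block_size: int = 16  # tokens per KV block (assumed value)


def admit(prompt_tokens, max_output_tokens, tenant_tokens_left, cap):
    """Return (decision, reason); decision is 'admit', 'defer', or 'reject'."""
    need_tokens = prompt_tokens + max_output_tokens  # worst-case estimate
    need_blocks = -(-need_tokens // cap.block_size)  # ceiling division
    if need_tokens > tenant_tokens_left:
        return "reject", "tenant quota exceeded"
    if need_blocks <= cap.free_kv_blocks:
        return "admit", "fits memory"
    if cap.queue_depth < cap.max_queue_depth:
        return "defer", "queued until blocks free up"
    # Bounded queue is full: fail fast so the caller can back off.
    return "reject", "queue full; signal overload upstream"


cap = Capacity(free_kv_blocks=10, queue_depth=0, max_queue_depth=2)
print(admit(100, 50, 1000, cap))
print(admit(200, 50, 1000, cap))
```

The final branch is where an upstream overload signal such as a Retry-After hint would be attached. Comparing the worst-case estimate with actual usage gives the estimate-error metric listed above.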
Prefill/decode interference
Long prompt prefill can monopolize compute and cause token gaps for active decodes.
Implementation
Cap prefill tokens per iteration, chunk long prompts, reserve decode capacity, or disaggregate phases.
Operational implications
Balance time to first token (TTFT) for new requests against ITL for active streams; do not optimize one invisibly at the other’s expense.
Measure
Prefill/decode time share, chunk count, TTFT, ITL jitter, and active streams.
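A per-iteration token budget is one way to express "cap prefill tokens and reserve decode capacity". The sketch below is hypothetical (`plan_iteration`, `token_budget`, and `max_prefill_tokens` are illustrative names): decodes claim one token each first, and prefill receives whatever remains, up to the cap, in FIFO order.

```python
def plan_iteration(decode_seqs, prefill_remaining, token_budget, max_prefill_tokens):
    """Split one iteration's token budget between decode and chunked prefill.

    decode_seqs: number of active decoding sequences (one token each).
    prefill_remaining: dict req_id -> prompt tokens not yet prefilled,
        in FIFO order (dicts preserve insertion order).
    Returns (decode_tokens, {req_id: prefill chunk size for this iteration}).
    """
    decode_tokens = min(decode_seqs, token_budget)
    prefill_budget = min(token_budget - decode_tokens, max_prefill_tokens)
    chunks = {}
    for req_id, remaining in prefill_remaining.items():
        if prefill_budget <= 0:
            break
        take = min(remaining, prefill_budget)
        if take > 0:
            chunks[req_id] = take
            prefill_budget -= take
    return decode_tokens, chunks


print(plan_iteration(8, {"long": 1000, "short": 50}, 512, 256))
print(plan_iteration(8, {"short": 50, "long": 1000}, 512, 256))
```

With a 256-token cap, a 1000-token prompt is spread over four iterations instead of stalling eight active streams for one long step. The price is a later first token for that prompt, which is the TTFT versus ITL trade-off described above.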
Priority, fairness, and quotas
Requests differ by user impact, deadline, token cost, and tenant allocation.
Implementation
Use bounded priority classes, aging or deficit mechanisms, token/memory quotas, and starvation monitoring.
Operational implications
Request count is a poor fairness unit when one long request consumes far more cache and compute.
Measure
Wait by class, service share, active tokens/tenant, starvation, and SLO misses.
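Deficit round robin is one of the deficit mechanisms mentioned above, and it accounts in tokens rather than request count. The sketch assumes each request carries a known token cost, and its names (`deficit_round_robin`, `quantum`) are illustrative. Note that it consumes the input queues.

```python
from collections import deque


def deficit_round_robin(queues, quantum, rounds):
    """Dispatch requests so tenants receive similar token service over time.

    queues: dict tenant -> deque of request token costs (mutated in place).
    Each round, every backlogged tenant earns `quantum` tokens of credit and
    dispatches from the head of its queue while credit covers the cost.
    Returns the dispatch order as a list of (tenant, cost).
    """
    deficit = {tenant: 0 for tenant in queues}
    order = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0  # idle tenants do not bank credit
                continue
            deficit[tenant] += quantum
            while q and q[0] <= deficit[tenant]:
                cost = q.popleft()
                deficit[tenant] -= cost
                order.append((tenant, cost))
            if not q:
                deficit[tenant] = 0
    return order


queues = {"a": deque([300]), "b": deque([100, 100, 100])}
print(deficit_round_robin(queues, quantum=100, rounds=3))
```

Tenant `a` waits two rounds to accumulate credit for its 300-token request while `b` sends one 100-token request per round. After three rounds both tenants have received 300 tokens of service, even though `b` dispatched three requests to `a`'s one.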
Cache-aware routing
Routing can preserve prefix/KV locality while considering queue, memory, health, and failure domain.
Implementation
Use a multi-factor score and expire locality metadata after eviction or restart.
Operational implications
Locality-only routing can overload one worker; load-only routing wastes expensive cache.
Measure
Cache hit length, queue delta, stale locality, load spread, and TTFT.
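A multi-factor score can be as simple as a weighted sum. The weights and field names below (`Worker`, `pick_worker`, `w_cache`, `w_queue`, `w_mem`) are assumptions chosen for illustration and would need tuning against real traffic. The sketch also assumes `cached_prefix_tokens` is reset to zero when a worker evicts the prefix or restarts.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int  # reusable prefix length for this prompt
    queue_depth: int
    free_memory_frac: float    # 0.0 to 1.0
    healthy: bool = True


def pick_worker(workers, prompt_tokens, w_cache=1.0, w_queue=0.1, w_mem=0.5):
    """Return the healthy worker with the best combined score, or None."""

    def score(w):
        hit = min(w.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return w_cache * hit - w_queue * w.queue_depth + w_mem * w.free_memory_frac

    candidates = [w for w in workers if w.healthy]
    if not candidates:
        return None
    return max(candidates, key=score)


hot = Worker("hot", cached_prefix_tokens=800, queue_depth=12, free_memory_frac=0.1)
cold = Worker("cold", cached_prefix_tokens=0, queue_depth=1, free_memory_frac=0.8)
print(pick_worker([hot, cold], prompt_tokens=1000).name)
```

Here the cache-warm worker loses because its queue is deep. With `queue_depth=2` on `hot`, the same call prefers locality, which is the balance between locality-only and load-only routing described above.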
Cancellation and timeouts
Cancelled queued or active work must leave batches and release cache and reservations.
Implementation
Propagate client disconnect and deadline; make cleanup idempotent and observable.
Operational implications
Slow cleanup creates invisible capacity leaks and unfairness.
Measure
Disconnect-to-stop, block release, orphaned requests, late tokens, and cancellation success.
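Idempotent cleanup means a second cancel for the same request changes nothing and double-frees nothing. The sketch below uses a hypothetical `BlockLedger` that tracks queued requests and KV blocks held by active ones; a real scheduler would also have to remove the sequence from the running batch.

```python
class BlockLedger:
    """Tracks queued requests and KV blocks held by active requests."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.queued = set()
        self.active = {}    # req_id -> blocks held
        self.cancelled = 0  # counter exported as telemetry

    def enqueue(self, req_id):
        self.queued.add(req_id)

    def activate(self, req_id, blocks):
        if blocks > self.free_blocks:
            raise RuntimeError("not enough free blocks")
        self.queued.discard(req_id)
        self.active[req_id] = blocks
        self.free_blocks -= blocks

    def cancel(self, req_id):
        """Remove a request wherever it is; safe to call repeatedly.

        Returns True only when this call actually removed something.
        """
        was_queued = req_id in self.queued
        self.queued.discard(req_id)
        blocks = self.active.pop(req_id, None)
        if blocks is not None:
            self.free_blocks += blocks  # release exactly once
        removed = was_queued or blocks is not None
        if removed:
            self.cancelled += 1
        return removed


ledger = BlockLedger(total_blocks=10)
ledger.enqueue("r1")
ledger.activate("r1", blocks=4)
print(ledger.cancel("r1"), ledger.cancel("r1"), ledger.free_blocks)
```

Because `cancel` keys off the current state rather than the caller's belief about it, a client disconnect and a deadline expiry can both fire for the same request without corrupting the free-block count.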
Load testing the scheduler
Scheduler behavior emerges from traffic distributions, not one saturated steady state.
Implementation
Vary arrivals, prompt/output lengths, tenant mix, cache hit, priority, bursts, cancellation, and worker loss.
Operational implications
Use an external load generator and report tail latencies, errors, queue time, goodput, and fairness.
Measure
Goodput curve, p95/p99 queue time, TTFT, and time per output token (TPOT), reject/error rates, fairness, and recovery time.
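Goodput can be computed from load-test results by counting only requests that succeeded and met their latency targets. This is a minimal sketch: the `Result` fields and the two-threshold SLO are assumptions, and a fuller definition would add any quality constraints.

```python
from dataclasses import dataclass


@dataclass
class Result:
    ttft_s: float  # time to first token, seconds
    tpot_s: float  # mean time per output token, seconds
    ok: bool       # request completed without error


def throughput(results, duration_s):
    """Successfully completed requests per second, regardless of latency."""
    return sum(1 for r in results if r.ok) / duration_s


def goodput(results, duration_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that succeeded and met both latency SLOs."""
    good = sum(
        1
        for r in results
        if r.ok and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / duration_s


results = [
    Result(0.2, 0.03, True),
    Result(0.9, 0.03, True),   # misses the TTFT target
    Result(0.2, 0.08, True),   # misses the TPOT target
    Result(0.1, 0.02, False),  # failed outright
]
print(throughput(results, 2.0), goodput(results, 2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Plotting `goodput` against offered load, rather than `throughput`, shows where the scheduler starts trading SLO compliance for raw volume.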
Reference tables
| Model | Strength | Latency cost | Best fit |
|---|---|---|---|
| Fixed batch | Simple dense execution | Waits for slowest item | Offline and uniform workloads |
| Dynamic batch | Combines arrivals within a window | Configurable queue delay | Predictive serving/short requests |
| Microbatch | Controls memory/pipeline granularity | More scheduling overhead | Pipelines and long prefill chunks |
| Continuous batch | Handles variable sequence completion | Complex fairness and memory policy | LLM token generation |

| Workload | Primary objective | Scheduling emphasis | Risk |
|---|---|---|---|
| Short chat | TTFT and stable ITL | Small queue budget, decode reservation | Underutilization if too strict |
| Long RAG | Bounded TTFT | Token admission, prefix reuse, chunked prefill | Prefill interference/KV pressure |
| Batch summary | Cost and aggregate throughput | Large batches/deadlines | Interactive starvation |
| Tool-heavy agent | Task completion | Release model capacity during tool waits | Retry loops/stale state |
Decision checklist
- What bounded queue and deadline policy applies to each class?
- Which resource—requests, active tokens, KV bytes, or compute—drives admission?
- How are prefill and decode shares controlled?
- What fairness and starvation protections exist?
- How do cache locality and load balance interact?
- How are cancelled requests removed from active batches?
- What overload response prevents retry storms?
Common mistakes
- Using maximum concurrency as the operating target.
- Counting requests equally when token and memory costs differ.
- Allowing large prefill to block active decode without telemetry.
- Implementing priority without starvation protection.
- Hiding queue time from latency reports.
- Retrying overload immediately and synchronously.
- Keeping accelerator reservations while an agent waits on external tools.
Sources and further reading
- vLLM documentation
- Triton dynamic batcher
- Ray Serve production guide
- KServe autoscaling
Last reviewed: 2026-06-21 UTC
