Scheduling and Batching

Key takeaways

Batching improves utilization only when queueing and memory pressure remain within SLOs.
Continuous batching fits variable-length generation better than fixed completion barriers.
Prefill and decode compete for device time; one long prompt can degrade many streaming requests.
Fairness, priorities, tenant quotas, cancellation, and backpressure require explicit policy.
Goodput measured under latency and quality constraints is a safer optimization target than maximum raw throughput.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Admitted requests, prompt/output estimates, priority, deadline, tenant quota, memory requirement, cache locality, and worker capacity.

Owns

Admission, ordering, batch composition, phase mix, fairness, quotas, deferral, and overload signaling.

Emits

Execution batches, queue state, reservations, rejection/routing decisions, cancellation actions, and scheduler telemetry.

Does not own

User authorization, model correctness, or product-level retry semantics.

Failure modes

Unbounded queues, head-of-line blocking, starvation, unfair tenants, prefill interference, OOM admission, stale cancellation, and retry storms.

Evidence and metrics

Queue depth/time, admitted/rejected counts, active sequences/tokens, batch composition, phase share, goodput, fairness, and resource release time.

Fixed and dynamic batching

Fixed batches complete together; dynamic batching waits briefly to combine compatible independent requests.

Implementation

Define maximum queue delay, compatibility, preferred/max size, and per-model instance capacity.
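The parameters above can be sketched as a small dynamic batcher. This is an illustrative sketch, not the API of any particular serving framework: `Request`, `DynamicBatcher`, and the default values are hypothetical, compatibility is assumed to mean "same model", and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    request_id: str
    model: str          # compatibility key: only same-model requests batch together
    arrival_s: float    # arrival time in seconds


@dataclass
class DynamicBatcher:
    max_queue_delay_s: float = 0.005   # longest the head request waits for batch-mates
    preferred_size: int = 4            # dispatch as soon as this many are waiting
    max_size: int = 8                  # hard cap per batch
    queue: List[Request] = field(default_factory=list)

    def submit(self, req: Request) -> None:
        self.queue.append(req)

    def poll(self, now_s: float) -> Optional[List[Request]]:
        """Return a batch if one is ready at time now_s, else None."""
        if not self.queue:
            return None
        head = self.queue[0]
        # Requests compatible with the head, in arrival order, up to max_size.
        compatible = [r for r in self.queue if r.model == head.model][: self.max_size]
        waited_long_enough = now_s - head.arrival_s >= self.max_queue_delay_s
        if len(compatible) >= self.preferred_size or waited_long_enough:
            chosen = {r.request_id for r in compatible}
            self.queue = [r for r in self.queue if r.request_id not in chosen]
            return compatible
        return None
```

The batch is released either when the preferred size is reached or when the head request has waited the maximum queue delay, so the delay bound caps the latency cost of batching.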

Operational implications

Useful for predictive serving and short uniform work; completion barriers hurt variable-length generation.

Measure

Batch size, queue delay, throughput, p95 latency, and rejected requests.

Continuous batching

The active generation batch changes as sequences finish and new requests join.

Implementation

Track active tokens, memory blocks, sequence state, and per-iteration phase work.
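A toy simulation can make the iteration-level behavior concrete. The sketch below assumes each sequence holds a fixed number of KV-cache blocks and that its output length is known in advance, which is not true in real serving; `Sequence` and `run_continuous_batch` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Sequence:
    seq_id: str
    remaining_tokens: int   # decode steps left (fixed here for illustration)
    blocks: int             # KV-cache blocks held while the sequence is active


def run_continuous_batch(waiting: Deque[Sequence], block_budget: int) -> List[List[str]]:
    """Simulate decode iterations; return the active sequence ids per iteration."""
    active: List[Sequence] = []
    history: List[List[str]] = []
    while waiting or active:
        # Admit waiting sequences in FIFO order while their blocks fit.
        used = sum(s.blocks for s in active)
        while waiting and used + waiting[0].blocks <= block_budget:
            seq = waiting.popleft()
            active.append(seq)
            used += seq.blocks
        if not active:
            raise ValueError("head-of-queue sequence can never fit the block budget")
        history.append([s.seq_id for s in active])
        # One decode step: every active sequence emits one token.
        for s in active:
            s.remaining_tokens -= 1
        # Finished sequences leave immediately and release their blocks.
        active = [s for s in active if s.remaining_tokens > 0]
    return history
```

With a budget of 4 blocks and three 2-block sequences, the third sequence joins as soon as the shortest one finishes instead of waiting for the whole batch to complete.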

Operational implications

Improves utilization for variable output lengths but complicates fairness, cancellation, and memory admission.

Measure

Active sequences/tokens, iteration duration, batch churn, goodput, and inter-token latency (ITL).

Admission and backpressure

The scheduler estimates whether a request can fit memory, queue, deadline, and quota before activating it.

Implementation

Use prompt length, max output, cache state, model/adapter, tenant, and request class; bound queues and signal overload upstream.
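One way to express such a check is a pure function over the request estimate and the scheduler state. The sketch below is an assumption-laden simplification: all names are hypothetical, memory is counted in KV tokens, cached prefix tokens are treated as free, and the footprint uses the full output allowance as a worst case.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DEFER = "defer"     # keep in the bounded queue
    REJECT = "reject"   # signal overload upstream, e.g. with a retry hint


@dataclass
class SchedulerState:
    free_kv_tokens: int        # unreserved KV-cache capacity, in tokens
    queue_depth: int
    max_queue_depth: int
    est_queue_wait_s: float    # current estimate of queueing delay


@dataclass
class AdmissionRequest:
    prompt_tokens: int
    max_output_tokens: int
    cached_prefix_tokens: int  # prompt tokens assumed already resident in cache
    deadline_s: float          # time budget remaining for the request
    tenant_tokens_left: int    # remaining tenant quota, in tokens


def decide(req: AdmissionRequest, state: SchedulerState) -> Decision:
    # Worst-case footprint: uncached prompt plus the full output allowance.
    needed = (req.prompt_tokens - req.cached_prefix_tokens) + req.max_output_tokens
    if needed > req.tenant_tokens_left:
        return Decision.REJECT      # tenant quota exhausted
    if state.est_queue_wait_s >= req.deadline_s:
        return Decision.REJECT      # would time out while queued
    if needed <= state.free_kv_tokens:
        return Decision.ADMIT
    if state.queue_depth < state.max_queue_depth:
        return Decision.DEFER       # bounded queue still has room
    return Decision.REJECT          # queue full: push back instead of growing
```

Rejecting when the queue is full, rather than queueing indefinitely, is what turns a capacity shortage into an explicit upstream signal.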

Operational implications

Unbounded queues convert capacity shortage into timeout, memory pressure, and retry amplification.

Measure

Queue age/depth, admitted/deferred/rejected counts, estimate error, Retry-After use, and timeouts.

Prefill/decode interference

Long prompt prefill can monopolize compute and cause token gaps for active decodes.

Implementation

Cap prefill tokens per iteration, chunk long prompts, reserve decode capacity, or disaggregate phases.
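The first three options can be combined in a per-iteration token budget. The sketch below is a minimal illustration with hypothetical names: decode sequences are scheduled first at one token each, and the remaining budget, capped by `max_prefill_tokens`, is handed to pending prompts as chunks.

```python
from typing import Dict, List, Tuple


def plan_iteration(
    decode_seqs: List[str],
    pending_prefill: Dict[str, int],
    token_budget: int,
    max_prefill_tokens: int,
) -> Tuple[List[str], Dict[str, int]]:
    """Pick decode sequences and prefill chunk sizes for one iteration.

    Each decode sequence costs one token and is scheduled first, so active
    streams keep producing tokens. Leftover budget, capped at
    max_prefill_tokens, goes to prompt chunks in insertion order.
    """
    decodes = decode_seqs[:token_budget]
    prefill_left = min(max_prefill_tokens, token_budget - len(decodes))
    chunks: Dict[str, int] = {}
    for seq_id, remaining in pending_prefill.items():
        if prefill_left <= 0:
            break
        take = min(remaining, prefill_left)
        chunks[seq_id] = take
        prefill_left -= take
    return decodes, chunks
```

A 5,000-token prompt is then spread over many iterations at 256 tokens each instead of stalling every active stream for one long prefill.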

Operational implications

Balance time to first token (TTFT) for new requests against ITL for active streams; do not optimize one invisibly at the other’s expense.

Measure

Prefill/decode time share, chunk count, TTFT, ITL jitter, and active streams.

Priority, fairness, and quotas

Requests differ by user impact, deadline, token cost, and tenant allocation.

Implementation

Use bounded priority classes, aging or deficit mechanisms, token/memory quotas, and starvation monitoring.
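As one example of a deficit mechanism, the sketch below applies deficit round robin to per-tenant queues, with cost measured in tokens rather than request count. The function name, the quantum values, and the use of a plain token cost per request are illustrative assumptions.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def deficit_round_robin(
    queues: Dict[str, Deque[int]],
    quantum: Dict[str, int],
    rounds: int,
) -> List[Tuple[str, int]]:
    """Serve requests, given as token costs, from per-tenant FIFO queues.

    Each round a tenant earns quantum[tenant] tokens of credit and may
    dispatch head-of-queue requests while its credit covers their cost.
    """
    deficit = {tenant: 0 for tenant in queues}
    served: List[Tuple[str, int]] = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0   # idle tenants do not bank credit
                continue
            deficit[tenant] += quantum[tenant]
            while q and q[0] <= deficit[tenant]:
                cost = q.popleft()
                deficit[tenant] -= cost
                served.append((tenant, cost))
            if not q:
                deficit[tenant] = 0
    return served
```

A tenant sending large requests waits until it has accumulated enough credit, while a tenant sending small requests is served sooner, yet neither can exceed its token share over time.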

Operational implications

Request count is a poor fairness unit when one long request consumes far more cache and compute.

Measure

Wait by class, service share, active tokens/tenant, starvation, and SLO misses.

Cache-aware routing

Routing can preserve prefix/KV locality while considering queue, memory, health, and failure domain.

Implementation

Use a multi-factor score and expire locality metadata after eviction or restart.
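A multi-factor score might look like the following sketch. The weights, the 30-second locality TTL, the `Worker` fields, and the queue normalization constant are all assumptions chosen for illustration; real values would come from measurement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    name: str
    healthy: bool
    queue_tokens: int          # tokens waiting or running on the worker
    free_kv_fraction: float    # 0.0 (full) to 1.0 (empty)
    cached_prefix_tokens: int  # believed prefix overlap with this request
    locality_age_s: float      # age of the locality metadata


def route(
    workers: List[Worker],
    prompt_tokens: int,
    locality_ttl_s: float = 30.0,
    max_queue_tokens: int = 20_000,
) -> Optional[str]:
    """Return the best worker name, or None if no healthy worker exists."""
    best_name: Optional[str] = None
    best_score = float("-inf")
    for w in workers:
        if not w.healthy:
            continue
        # Stale locality metadata is ignored rather than trusted.
        cached = w.cached_prefix_tokens if w.locality_age_s <= locality_ttl_s else 0
        hit = min(cached, prompt_tokens) / max(prompt_tokens, 1)
        load = min(w.queue_tokens / max_queue_tokens, 1.0)
        score = 1.0 * hit - 1.5 * load + 0.5 * w.free_kv_fraction
        if score > best_score:
            best_name, best_score = w.name, score
    return best_name
```

Because load is weighted more heavily than cache hit ratio here, a worker with a warm prefix loses to a cold one once its queue is nearly full.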

Operational implications

Locality-only routing can overload one worker; load-only routing wastes expensive cache.

Measure

Cache hit length, queue delta, stale locality, load spread, and TTFT.

Cancellation and timeouts

Cancelled queued or active work must leave batches and release cache and reservations.

Implementation

Propagate client disconnect and deadline; make cleanup idempotent and observable.
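Idempotent cleanup can be sketched with a table that maps sequences to their reservations. `SequenceTable` and its fields are hypothetical; a real scheduler would also have to remove the sequence from the in-flight batch and stop token emission.

```python
from typing import Dict, Set


class SequenceTable:
    """Tracks active sequences plus their KV block reservations."""

    def __init__(self, total_blocks: int) -> None:
        self.free_blocks = total_blocks
        self.reserved: Dict[str, int] = {}
        self.active: Set[str] = set()
        self.cancel_count = 0   # telemetry: cancellations that released work

    def activate(self, seq_id: str, blocks: int) -> bool:
        if seq_id in self.reserved or blocks > self.free_blocks:
            return False
        self.free_blocks -= blocks
        self.reserved[seq_id] = blocks
        self.active.add(seq_id)
        return True

    def cancel(self, seq_id: str) -> bool:
        """Remove a sequence and release its blocks. Safe to call repeatedly."""
        blocks = self.reserved.pop(seq_id, None)
        if blocks is None:
            return False            # already cleaned up or never admitted
        self.active.discard(seq_id)
        self.free_blocks += blocks
        self.cancel_count += 1
        return True
```

Because the reservation is popped before anything is released, a duplicate disconnect or a deadline firing after a cancel cannot return the same blocks twice.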

Operational implications

Slow cleanup creates invisible capacity leaks and unfairness.

Measure

Disconnect-to-stop, block release, orphaned requests, late tokens, and cancellation success.

Load testing the scheduler

Scheduler behavior emerges from traffic distributions, not one saturated steady state.

Implementation

Vary arrivals, prompt/output lengths, tenant mix, cache hit, priority, bursts, cancellation, and worker loss.
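A seeded synthetic workload is one way to vary several of these dimensions reproducibly. The sketch below assumes Poisson arrivals, log-normal prompt and output lengths, and a fixed tenant mix; those distributions and every parameter value are placeholders, and bursts, cache hits, priorities, and worker loss are not modeled.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SyntheticRequest:
    arrival_s: float
    tenant: str
    prompt_tokens: int
    max_output_tokens: int
    cancel_after_s: Optional[float]   # None means the client never cancels


def generate_workload(
    n: int,
    rate_per_s: float,
    seed: int = 0,
    cancel_prob: float = 0.05,
) -> List[SyntheticRequest]:
    """Generate n requests with randomized arrivals, lengths, and tenants."""
    rng = random.Random(seed)   # seeded so a run can be reproduced exactly
    now = 0.0
    requests: List[SyntheticRequest] = []
    for _ in range(n):
        now += rng.expovariate(rate_per_s)   # exponential inter-arrival gaps
        tenant = rng.choices(["tenant-a", "tenant-b", "tenant-c"], weights=[6, 3, 1])[0]
        prompt = max(1, int(rng.lognormvariate(6.0, 1.0)))   # heavy-tailed lengths
        output = max(1, int(rng.lognormvariate(5.0, 0.8)))
        cancel = rng.uniform(0.1, 5.0) if rng.random() < cancel_prob else None
        requests.append(SyntheticRequest(now, tenant, prompt, output, cancel))
    return requests
```

Recording the seed alongside the results lets a regression be replayed against the same request sequence.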

Operational implications

Use an external load generator and report tail latencies, errors, queue time, goodput, and fairness.

Measure

Goodput curve; p95/p99 queue time, TTFT, and time per output token (TPOT); reject and error rates; fairness; and recovery time.
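Goodput can be computed from load-test results once the SLOs are fixed. The sketch below assumes one possible definition: output tokens per second counted only for successful requests that met both a TTFT and a TPOT target. The `Result` type and the thresholds are hypothetical, and quality constraints are reduced to a single `ok` flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    ok: bool            # request completed without error and passed quality checks
    ttft_s: float       # time to first token
    tpot_s: float       # mean time per output token
    output_tokens: int


def goodput_tokens_per_s(
    results: List[Result],
    duration_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    """Output tokens per second from requests that met every constraint."""
    good_tokens = sum(
        r.output_tokens
        for r in results
        if r.ok and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good_tokens / duration_s
```

Plotting this value against offered load shows where added traffic stops producing useful work, which raw throughput alone can hide.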

Reference tables

Batching models
Model	Strength	Latency cost	Best fit
Fixed batch	Simple dense execution	Waits for slowest item	Offline and uniform workloads
Dynamic batch	Combines arrivals within a window	Configurable queue delay	Predictive serving/short requests
Microbatch	Controls memory/pipeline granularity	More scheduling overhead	Pipelines and long prefill chunks
Continuous batch	Handles variable sequence completion	Complex fairness and memory policy	LLM token generation

Workload-aware scheduling
Workload	Primary objective	Scheduling emphasis	Risk
Short chat	TTFT and stable ITL	Small queue budget, decode reservation	Underutilization if too strict
Long RAG	Bounded TTFT	Token admission, prefix reuse, chunked prefill	Prefill interference/KV pressure
Batch summary	Cost and aggregate throughput	Large batches/deadlines	Interactive starvation
Tool-heavy agent	Task completion	Release model capacity during tool waits	Retry loops/stale state

Decision checklist

What bounded queue and deadline policy applies to each class?
Which resource—requests, active tokens, KV bytes, or compute—drives admission?
How are prefill and decode shares controlled?
What fairness and starvation protections exist?
How do cache locality and load balance interact?
How are cancelled requests removed from active batches?
What overload response prevents retry storms?

Common mistakes

Using maximum concurrency as the operating target.
Counting requests equally when token and memory costs differ.
Allowing large prefill to block active decode without telemetry.
Implementing priority without starvation protection.
Hiding queue time from latency reports.
Retrying overload immediately and synchronously.
Keeping accelerator reservations while an agent waits on external tools.

Sources and further reading

vLLM documentation
vLLM · Official documentation · accessed 2026-06-21 UTC

Triton dynamic batcher
NVIDIA Triton · Official documentation · accessed 2026-06-21 UTC

Ray Serve production guide
Ray · Official documentation · accessed 2026-06-21 UTC

KServe autoscaling
KServe · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
