Scheduling and Batching

Guide to AI inference scheduling and batching: admission control, continuous batching, dynamic batching, chunked prefill, fairness, priorities, backpressure, cancellation, and Goodput.

Audience: Technical readers · Reading time: 5 minutes · Status: Foundational · Last reviewed: 2026-06-21 UTC

Key takeaways

  • Batching improves utilization only when queueing and memory pressure remain within SLOs.
  • Continuous batching fits variable-length generation better than fixed completion barriers.
  • Prefill and decode compete for device time; one long prompt can degrade many streaming requests.
  • Fairness, priorities, tenant quotas, cancellation, and backpressure require explicit policy.
  • Goodput, meaning throughput that still meets latency and quality constraints, is a safer target than maximum raw throughput.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Admitted requests, prompt/output estimates, priority, deadline, tenant quota, memory requirement, cache locality, and worker capacity.

Owns

Admission, ordering, batch composition, phase mix, fairness, quotas, deferral, and overload signaling.

Emits

Execution batches, queue state, reservations, rejection/routing decisions, cancellation actions, and scheduler telemetry.

Does not own

User authorization, model correctness, or product-level retry semantics.

Failure modes

Unbounded queues, head-of-line blocking, starvation, unfair tenants, prefill interference, OOM admission, stale cancellation, and retry storms.

Evidence and metrics

Queue depth/time, admitted/rejected, active sequences/tokens, batch composition, phase share, Goodput, fairness, and release time.

Fixed and dynamic batching

Fixed batches complete together; dynamic batching waits briefly to combine compatible independent requests.

Implementation

Define maximum queue delay, compatibility, preferred/max size, and per-model instance capacity.

Operational implications

Useful for predictive serving and short uniform work; completion barriers hurt variable-length generation.

Measure

Batch size, queue delay, throughput, p95 latency, and rejected requests.
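
The window-and-size rule above can be sketched as a small offline simulation. This is a minimal illustration, not any particular server's batcher: `Request`, `form_dynamic_batches`, `max_delay_ms`, and `max_batch_size` are hypothetical names, and compatibility checks and preferred sizes are left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival_ms: float


def form_dynamic_batches(requests, max_delay_ms, max_batch_size):
    """Group requests into batches by arrival time.

    A batch is dispatched when it reaches max_batch_size, or when the
    oldest queued request has waited max_delay_ms.
    Returns a list of (dispatch_ms, [req_id, ...]) tuples.
    """
    batches = []
    pending = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # The window opened by the oldest request expired before this arrival.
        if pending and req.arrival_ms > pending[0].arrival_ms + max_delay_ms:
            deadline = pending[0].arrival_ms + max_delay_ms
            batches.append((deadline, [r.req_id for r in pending]))
            pending = []
        pending.append(req)
        # A full batch leaves immediately instead of waiting out the window.
        if len(pending) == max_batch_size:
            batches.append((req.arrival_ms, [r.req_id for r in pending]))
            pending = []
    if pending:
        deadline = pending[0].arrival_ms + max_delay_ms
        batches.append((deadline, [r.req_id for r in pending]))
    return batches


arrivals = [Request("a", 0), Request("b", 2), Request("c", 3),
            Request("d", 20), Request("e", 21)]
batches = form_dynamic_batches(arrivals, max_delay_ms=5, max_batch_size=3)
# The first batch fills at t=3; the second waits until its window closes at t=25.
```

Raising `max_delay_ms` tends to produce larger batches at the cost of added queue delay for the first request in each window, which is the trade-off the metrics above are meant to expose.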

Continuous batching

The active generation batch changes as sequences finish and new requests join.

Implementation

Track active tokens, memory blocks, sequence state, and per-iteration phase work.

Operational implications

Improves utilization for variable output lengths but complicates fairness, cancellation, and memory admission.

Measure

Active sequences/tokens, iteration duration, batch churn, Goodput, and inter-token latency (ITL).
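
A toy simulation can make the iteration-level behavior concrete. It is a sketch under simplifying assumptions: every active sequence produces one token per iteration, prefill and memory limits are ignored, and `run_continuous_batching` and `max_active` are hypothetical names.

```python
from collections import deque


def run_continuous_batching(output_lengths, max_active):
    """Simulate iteration-level scheduling.

    output_lengths maps a request id to the number of tokens it will
    generate (each must be >= 1). Returns {request_id: finishing iteration}.
    """
    waiting = deque(output_lengths.items())
    active = {}        # request id -> tokens still to generate
    finished_at = {}
    iteration = 0
    while waiting or active:
        # Free slots are refilled before every iteration, not per batch.
        while waiting and len(active) < max_active:
            rid, remaining = waiting.popleft()
            active[rid] = remaining
        iteration += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                # A finished sequence leaves at once and frees its slot.
                del active[rid]
                finished_at[rid] = iteration
    return finished_at


finished = run_continuous_batching({"a": 2, "b": 5, "c": 1, "d": 2}, max_active=2)
```

In this run, request `c` takes the slot freed by `a` and finishes at iteration 3. With fixed batches of two, `c` could not start until the `a`/`b` batch completed at iteration 5.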

Admission and backpressure

Before activating a request, the scheduler estimates whether it fits within memory, queue, deadline, and quota limits.

Implementation

Use prompt length, max output, cache state, model/adapter, tenant, and request class; bound queues and signal overload upstream.

Operational implications

Unbounded queues convert capacity shortage into timeout, memory pressure, and retry amplification.

Measure

Queue age/depth, admitted/deferred/rejected counts, estimate error, Retry-After usage, and timeouts.
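
One way to picture admission with a bounded queue is the sketch below. It assumes a single resource (a KV token budget) and a worst-case estimate of prompt plus maximum output; `AdmissionController`, `Req`, and all parameter names are hypothetical, and deadlines and tenant quotas are omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    req_id: str
    prompt_tokens: int
    max_output_tokens: int

    @property
    def worst_case_tokens(self):
        return self.prompt_tokens + self.max_output_tokens


class AdmissionController:
    def __init__(self, kv_token_budget, max_queue_len, retry_after_s=1.0):
        self.kv_token_budget = kv_token_budget
        self.max_queue_len = max_queue_len
        self.retry_after_s = retry_after_s
        self.reserved_tokens = 0
        self.queue = deque()

    def _fits(self, req):
        return self.reserved_tokens + req.worst_case_tokens <= self.kv_token_budget

    def submit(self, req):
        """Return (decision, retry_after_s)."""
        if req.worst_case_tokens > self.kv_token_budget:
            return ("reject", None)          # can never fit; retrying will not help
        if not self.queue and self._fits(req):
            self.reserved_tokens += req.worst_case_tokens
            return ("admit", None)
        if len(self.queue) < self.max_queue_len:
            self.queue.append(req)           # deferred in FIFO order
            return ("defer", None)
        # Queue is full: signal overload upstream instead of growing the queue.
        return ("reject", self.retry_after_s)

    def release(self, req):
        """Free a finished request's reservation and admit queued work that now fits."""
        self.reserved_tokens -= req.worst_case_tokens
        admitted = []
        while self.queue and self._fits(self.queue[0]):
            nxt = self.queue.popleft()
            self.reserved_tokens += nxt.worst_case_tokens
            admitted.append(nxt.req_id)
        return admitted


ctl = AdmissionController(kv_token_budget=1000, max_queue_len=1)
r1, r2, r3 = Req("r1", 600, 200), Req("r2", 300, 100), Req("r3", 100, 100)
decisions = [ctl.submit(r) for r in (r1, r2, r3)]
promoted = ctl.release(r1)
```

Here `r1` is admitted, `r2` is deferred, and `r3` is rejected with a retry hint because the queue is already at its bound. Reserving the worst case is conservative; the estimate error metric above shows how much capacity that conservatism costs.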

Prefill/decode interference

Long prompt prefill can monopolize compute and cause token gaps for active decodes.

Implementation

Cap prefill tokens per iteration, chunk long prompts, reserve decode capacity, or disaggregate phases.

Operational implications

Balance time to first token (TTFT) for new requests against ITL for active streams; do not optimize one invisibly at the other’s expense.

Measure

Prefill/decode time share, chunk count, TTFT, ITL jitter, and active streams.
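
Capping prefill tokens per iteration while reserving decode capacity might look like the sketch below. It assumes a single per-iteration token budget and one token per active decode stream; `chunk_prefill`, `decode_streams`, and `token_budget` are hypothetical names, not a specific engine's settings.

```python
def chunk_prefill(prompt_tokens, decode_streams, token_budget):
    """Split one long prompt into per-iteration prefill chunks.

    Each iteration first reserves one token for every active decode stream;
    only the remainder of token_budget is available for prefill.
    Returns the list of chunk sizes, one per iteration.
    """
    prefill_cap = token_budget - decode_streams
    if prefill_cap <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(remaining, prefill_cap)
        chunks.append(take)
        remaining -= take
    return chunks


chunks = chunk_prefill(prompt_tokens=1000, decode_streams=8, token_budget=264)
# Four iterations: three chunks of 256 tokens and a final chunk of 232.
```

A smaller budget protects ITL for the eight active streams but stretches the new request's TTFT over more iterations, so both metrics need to be reported together.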

Priority, fairness, and quotas

Requests differ by user impact, deadline, token cost, and tenant allocation.

Implementation

Use bounded priority classes, aging or deficit mechanisms, token/memory quotas, and starvation monitoring.

Operational implications

Request count is a poor fairness unit when one long request consumes far more cache and compute.

Measure

Wait by class, service share, active tokens/tenant, starvation, and SLO misses.
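
A deficit mechanism that charges by tokens rather than by request count can be sketched as follows. This is a simplified deficit round robin under stated assumptions: the cost of each request is known up front, and `deficit_round_robin`, `quantum`, and the tenant names are hypothetical.

```python
from collections import deque


def deficit_round_robin(queues, quantum, rounds):
    """Serve per-tenant queues of token costs using deficit round robin.

    queues: {tenant: deque of token costs}. Each round a backlogged tenant
    earns `quantum` tokens of credit and may serve requests while its
    credit covers the next request's cost.
    Returns the service order as [(tenant, cost), ...].
    """
    deficit = {tenant: 0 for tenant in queues}
    served = []
    for _ in range(rounds):
        for tenant, queue in queues.items():
            if not queue:
                deficit[tenant] = 0      # idle tenants do not bank credit
                continue
            deficit[tenant] += quantum
            while queue and queue[0] <= deficit[tenant]:
                cost = queue.popleft()
                deficit[tenant] -= cost
                served.append((tenant, cost))
            if not queue:
                deficit[tenant] = 0
    return served


queues = {
    "heavy": deque([300, 300]),       # few long requests
    "light": deque([100, 100, 100]),  # many short requests
}
served = deficit_round_robin(queues, quantum=100, rounds=3)
```

After three rounds each tenant has received 300 tokens of service, even though `light` completed three requests and `heavy` only one. Counting requests instead would have given `heavy` three times the compute.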

Cache-aware routing

Routing can preserve prefix/KV locality while considering queue, memory, health, and failure domain.

Implementation

Use a multi-factor score and expire locality metadata after eviction or restart.

Operational implications

Locality-only routing can overload one worker; load-only routing wastes expensive cache.

Measure

Cache hit length, queue delta, stale locality, load spread, and TTFT.
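
A multi-factor score could be as simple as the weighted sum below. The weights, field names, and `pick_worker` are illustrative assumptions; real weights would need tuning against measured TTFT and load spread, and `cached_prefix_tokens` should be reset once a worker evicts the prefix or restarts.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int   # locality metadata; expire after eviction/restart
    queue_depth: int
    free_kv_fraction: float     # 0.0 (full) to 1.0 (empty)
    healthy: bool = True


def score(worker, prompt_tokens, w_cache=1.0, w_queue=0.1, w_memory=0.5):
    """Higher is better. prompt_tokens must be positive."""
    if not worker.healthy:
        return float("-inf")
    cache_hit = min(worker.cached_prefix_tokens, prompt_tokens) / prompt_tokens
    return (w_cache * cache_hit
            - w_queue * worker.queue_depth
            + w_memory * worker.free_kv_fraction)


def pick_worker(workers, prompt_tokens):
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    Worker("A", cached_prefix_tokens=800, queue_depth=12, free_kv_fraction=0.1),
    Worker("B", cached_prefix_tokens=0, queue_depth=1, free_kv_fraction=0.8),
    Worker("C", cached_prefix_tokens=600, queue_depth=3, free_kv_fraction=0.5),
]
chosen = pick_worker(workers, prompt_tokens=1000)
```

Worker `A` has the best cache hit but a deep queue, and `B` is idle but cold; the combined score selects `C`, which trades a slightly shorter cached prefix for much less queueing.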

Cancellation and timeouts

Cancelled queued or active work must leave batches and release cache and reservations.

Implementation

Propagate client disconnect and deadline; make cleanup idempotent and observable.

Operational implications

Slow cleanup creates invisible capacity leaks and unfairness.

Measure

Disconnect-to-stop, block release, orphaned requests, late tokens, and cancellation success.
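
Idempotent cleanup can be sketched with a registry that releases blocks exactly once, no matter how many cancellation signals arrive. `SequenceRegistry`, `start`, and `cancel` are hypothetical names; a real scheduler would also remove the sequence from the running batch and emit telemetry.

```python
class SequenceRegistry:
    """Tracks cache blocks held by active sequences."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.active = {}          # request id -> blocks held
        self.cancellations = 0    # observable counter for telemetry

    def start(self, req_id, blocks):
        if blocks > self.free_blocks:
            raise RuntimeError("not enough free blocks")
        self.free_blocks -= blocks
        self.active[req_id] = blocks

    def cancel(self, req_id):
        """Release a request's blocks. Safe to call repeatedly.

        Returns True only for the call that actually released resources,
        so a client disconnect and a deadline firing together do not
        double-free the same blocks.
        """
        blocks = self.active.pop(req_id, None)
        if blocks is None:
            return False
        self.free_blocks += blocks
        self.cancellations += 1
        return True


registry = SequenceRegistry(total_blocks=10)
registry.start("r1", blocks=4)
first = registry.cancel("r1")    # client disconnect
second = registry.cancel("r1")   # deadline fires moments later
```

The second call is a no-op, and `free_blocks` returns to its original value, which is the property the block release and orphaned request metrics above are checking for.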

Load testing the scheduler

Scheduler behavior emerges from traffic distributions, not from a single saturated steady state.

Implementation

Vary arrivals, prompt/output lengths, tenant mix, cache hit, priority, bursts, cancellation, and worker loss.

Operational implications

Use an external load generator and report tails, errors, queue, Goodput, and fairness.

Measure

Goodput curve; p95/p99 queue time, TTFT, and time per output token (TPOT); reject/error rates; fairness; and recovery.
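
Given per-request results from an external load generator, Goodput and tail latency can be computed along these lines. The sketch assumes Goodput means successful requests per second that meet both latency SLOs; quality checks are omitted, and `Result`, `goodput`, and `percentile` are hypothetical helpers.

```python
import math
from dataclasses import dataclass


@dataclass
class Result:
    ttft_ms: float
    tpot_ms: float
    ok: bool       # False for errors, rejections, or timeouts


def goodput(results, duration_s, ttft_slo_ms, tpot_slo_ms):
    """Successful requests per second that also met both latency SLOs."""
    good = sum(
        1 for r in results
        if r.ok and r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / duration_s


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


results = [
    Result(ttft_ms=200, tpot_ms=30, ok=True),
    Result(ttft_ms=900, tpot_ms=30, ok=True),    # misses the TTFT SLO
    Result(ttft_ms=250, tpot_ms=80, ok=True),    # misses the TPOT SLO
    Result(ttft_ms=150, tpot_ms=25, ok=False),   # failed request
    Result(ttft_ms=300, tpot_ms=40, ok=True),
]
raw_throughput = sum(r.ok for r in results) / 2.0
good = goodput(results, duration_s=2.0, ttft_slo_ms=500, tpot_slo_ms=50)
p95_ttft = percentile([r.ttft_ms for r in results], 95)
```

In this sample, raw throughput is 2.0 requests per second but Goodput is only 1.0, because two successful requests missed an SLO. Repeating the calculation at increasing arrival rates gives the Goodput curve.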

Reference tables

Batching models

| Model | Strength | Latency cost | Best fit |
|---|---|---|---|
| Fixed batch | Simple dense execution | Waits for slowest item | Offline and uniform workloads |
| Dynamic batch | Combines arrivals within a window | Configurable queue delay | Predictive serving/short requests |
| Microbatch | Controls memory/pipeline granularity | More scheduling overhead | Pipelines and long prefill chunks |
| Continuous batch | Handles variable sequence completion | Complex fairness and memory policy | LLM token generation |

Workload-aware scheduling

| Workload | Primary objective | Scheduling emphasis | Risk |
|---|---|---|---|
| Short chat | TTFT and stable ITL | Small queue budget, decode reservation | Underutilization if too strict |
| Long RAG | Bounded TTFT | Token admission, prefix reuse, chunked prefill | Prefill interference/KV pressure |
| Batch summary | Cost and aggregate throughput | Large batches/deadlines | Interactive starvation |
| Tool-heavy agent | Task completion | Release model capacity during tool waits | Retry loops/stale state |

Decision checklist

  1. What bounded queue and deadline policy applies to each class?
  2. Which resource—requests, active tokens, KV bytes, or compute—drives admission?
  3. How are prefill and decode shares controlled?
  4. What fairness and starvation protections exist?
  5. How do cache locality and load balance interact?
  6. How are cancelled requests removed from active batches?
  7. What overload response prevents retry storms?

Common mistakes

  • Using maximum concurrency as the operating target.
  • Counting requests equally when token and memory costs differ.
  • Allowing large prefill to block active decode without telemetry.
  • Implementing priority without starvation protection.
  • Hiding queue time from latency reports.
  • Retrying overload immediately and synchronously.
  • Keeping accelerator reservations while an agent waits on external tools.
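
As a counterpart to the immediate-retry mistake, a client can spread its retries with exponential backoff and jitter while honoring a server's Retry-After hint. This is a minimal sketch; `retry_delay`, `base_s`, and `cap_s` are hypothetical names and defaults, not values recommended by any of the systems listed below.

```python
import random


def retry_delay(attempt, retry_after_s=None, base_s=0.5, cap_s=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Uses exponential backoff with full jitter so that clients rejected
    at the same moment do not all return at the same moment. If the
    server sent a Retry-After hint, the jitter is added on top of it.
    """
    ceiling = min(cap_s, base_s * 2 ** attempt)
    jitter = rng.uniform(0, ceiling)
    return (retry_after_s or 0.0) + jitter


rng = random.Random(0)  # seeded only to make the example repeatable
delays = [retry_delay(attempt, rng=rng) for attempt in range(5)]
hinted = retry_delay(3, retry_after_s=2.0, rng=rng)
```

Each delay is drawn from a window that doubles per attempt up to `cap_s`, and a hinted retry never fires before the server's Retry-After interval has passed.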

Sources and further reading


  1. vLLM documentation
     vLLM · Official documentation · accessed 2026-06-21 UTC

  2. Triton dynamic batcher
     NVIDIA Triton · Official documentation · accessed 2026-06-21 UTC

  3. Ray Serve production guide
     Ray · Official documentation · accessed 2026-06-21 UTC

  4. KServe autoscaling
     KServe · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
