Mechanics

LLM Inference

Understand LLM inference mechanics, including prefill, decode, KV cache, continuous batching, prefix reuse, quantization, structured generation, cancellation, tail latency, and production metrics.

Audience: Technical readers · Reading time: 6 minutes · Status: Production guidance · Last reviewed: 2026-06-21 UTC

Key takeaways

  • Prefill and decode have different compute and memory profiles and should be measured separately.
  • Time to first token includes queueing, routing, tokenization, prefill, and delivery, not only accelerator execution.
  • Decode performance depends on memory bandwidth, KV-cache access, batch composition, scheduling, and sequence length.
  • Raw tokens per second is insufficient; use TTFT, TPOT or ITL, tail latency, quality, and Goodput under defined traffic.
  • Cancellation, structured outputs, cache cleanup, and backpressure are part of the inference runtime contract.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Tokenized prompts, model/adapter selection, sampling settings, output constraints, priority, deadline, and cache hints.

Owns

Model execution, attention state, prefill/decode scheduling, sampling, streaming, cancellation cleanup, and token-level telemetry.

Emits

Token events or complete outputs, usage counters, finish reason, cache lifecycle events, timing spans, and errors.

Does not own

User authorization, enterprise context policy, tool side-effect permission, or product workflow.

Failure modes

OOM, queue overload, prefill interference, cache exhaustion, slow decode, invalid structured output, cancellation leaks, and model-load failure.

Evidence and metrics

Queue time, TTFT, prefill, TPOT, ITL, E2E, throughput, Goodput, HBM/KV occupancy, cancellation cleanup, and cost.

Prefill

Prefill processes the prompt tokens and builds the key/value attention state. Because the whole prompt is processed together, prefill runs large matrix operations that parallelize well, and longer prompts increase the work done before the first output token.

Implementation

Measure tokenization, queue, prompt transfer, model prefill, and first-token delivery separately. Use prefix reuse only when token/model/config identity matches.
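The identity rule for prefix reuse can be sketched as a cache key that covers both the exact token IDs and everything that changes the resulting attention state. This is a minimal sketch, not any runtime's actual scheme; the `PrefixIdentity` fields and the `prefix_key` helper are hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PrefixIdentity:
    """Hypothetical set of fields that must match before cached state is reused."""
    model_id: str       # exact weights revision
    adapter_id: str     # adapter name, or "" when none is attached
    kv_dtype: str       # cache precision, e.g. "fp16"
    tokenizer_rev: str  # tokenizer revision that produced the token IDs


def prefix_key(identity: PrefixIdentity, token_ids: list[int]) -> str:
    """Hash the identity together with the exact token IDs of the prefix."""
    payload = json.dumps(
        {"identity": asdict(identity), "tokens": token_ids},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = PrefixIdentity("model-a@rev1", "", "fp16", "tok-v1")
other_model = PrefixIdentity("model-a@rev2", "", "fp16", "tok-v1")

same = prefix_key(base, [1, 2, 3]) == prefix_key(base, [1, 2, 3])
differs_by_model = prefix_key(base, [1, 2, 3]) != prefix_key(other_model, [1, 2, 3])
differs_by_tokens = prefix_key(base, [1, 2, 3]) != prefix_key(base, [1, 2, 4])
```

Keying on token IDs rather than prompt text means a tokenizer change cannot silently match stale state.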

Operational implications

Long prefills can interfere with active decode streams. Chunked prefill or separate prefill and decode workers can bound the effect.

Measure

Prompt tokens, queue, prefill time, TTFT, cache hit, and prefill tokens avoided.

Decode

Decode generates subsequent tokens sequentially for each request and repeatedly reads weights and relevant cache state.

Implementation

Batch many active sequences while controlling memory and fairness. Include sampling and structured-output work in the traced critical path, because both run on every decode step.

Operational implications

Decode can be memory-bandwidth-bound. High aggregate utilization can still produce poor per-request ITL under oversized batches.
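A rough way to see the batch-size trade-off is a bandwidth floor on one decode step. The sketch below assumes each step reads the weights once for the whole batch and reads each sequence's KV state once; it ignores compute, overlap, and kernel overhead. The function name and all numbers are illustrative assumptions, not measurements.

```python
def decode_step_floor_s(
    weight_bytes: float,
    kv_bytes_per_seq: float,
    batch_size: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Lower bound on one decode step if memory traffic were the only cost."""
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_seq
    return bytes_moved / bandwidth_bytes_per_s


# Illustrative values: 16 GB of weights, 0.5 GB of KV state per sequence,
# and 2 TB/s of memory bandwidth.
WEIGHTS = 16e9
KV_PER_SEQ = 0.5e9
BANDWIDTH = 2e12

for batch in (1, 64):
    step_s = decode_step_floor_s(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH)
    aggregate_tokens_per_s = batch / step_s
    print(f"batch={batch:3d} step_floor_ms={step_s * 1e3:.2f} "
          f"aggregate_tok_s={aggregate_tokens_per_s:.0f}")
```

Under these assumptions the larger batch raises aggregate tokens per second while each request waits longer between tokens, which is why per-request ITL has to be tracked alongside throughput.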

Measure

TPOT, ITL distribution, active sequences/tokens, bandwidth, output throughput, and Goodput.

TTFT, TPOT, ITL, and E2E

These metrics describe different portions of user-visible latency. Definitions and aggregation must be explicit.

Implementation

Define the request boundary, whether queue/network/tokenization are included, how token gaps are weighted, and how output length affects E2E.

Operational implications

Use percentiles by workload class rather than one global average.

Measure

p50/p95/p99 queue, TTFT, TPOT/ITL, E2E, output length, and errors.
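Reporting percentiles per workload class can be sketched with a nearest-rank percentile and a grouping step. This is one common percentile definition among several; the class labels and sample values are made up for illustration.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def percentiles_by_class(samples, ps=(50, 95, 99)):
    """samples is an iterable of (workload_class, latency) pairs."""
    grouped = defaultdict(list)
    for workload_class, latency in samples:
        grouped[workload_class].append(latency)
    return {
        workload_class: {f"p{p}": percentile(latencies, p) for p in ps}
        for workload_class, latencies in grouped.items()
    }


samples = [("chat", float(ms)) for ms in range(1, 101)]
samples += [("batch", 10.0), ("batch", 20.0)]
report = percentiles_by_class(samples)
```

A single global percentile over these samples would be dominated by the chat class and would say almost nothing about the batch class.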

Continuous batching and scheduling

The active batch changes as sequences arrive, finish, pause, or cancel.

Implementation

Use token- and memory-aware admission, bounded queues, priority/fairness, prefill/decode policy, and upstream backpressure.
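Token-aware admission with a bounded queue can be sketched as follows. This is a simplified illustration, not a scheduler from any particular runtime: it reserves worst-case tokens per request, admits in FIFO order, and rejects when the queue is full so callers see backpressure. All class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def token_reservation(self) -> int:
        # Worst-case KV footprint in tokens if the request runs to its limit.
        return self.prompt_tokens + self.max_new_tokens


class AdmissionController:
    def __init__(self, kv_token_budget: int, max_queue: int):
        self.kv_token_budget = kv_token_budget
        self.max_queue = max_queue
        self.active: dict[str, int] = {}
        self.queue: deque[Request] = deque()

    def _fits(self, req: Request) -> bool:
        reserved = sum(self.active.values())
        return reserved + req.token_reservation <= self.kv_token_budget

    def submit(self, req: Request) -> str:
        if req.token_reservation > self.kv_token_budget:
            return "rejected"  # can never fit, fail fast
        if not self.queue and self._fits(req):
            self.active[req.request_id] = req.token_reservation
            return "admitted"
        if len(self.queue) < self.max_queue:
            self.queue.append(req)  # FIFO: later requests do not jump ahead
            return "queued"
        return "rejected"  # bounded queue is full: push back upstream

    def release(self, request_id: str) -> list[str]:
        """Free a finished or cancelled request, then admit from the queue."""
        self.active.pop(request_id, None)
        admitted = []
        while self.queue and self._fits(self.queue[0]):
            nxt = self.queue.popleft()
            self.active[nxt.request_id] = nxt.token_reservation
            admitted.append(nxt.request_id)
        return admitted
```

Reserving the worst case is conservative; a real scheduler might reserve less and preempt, at the cost of more complex recovery.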

Operational implications

Maximum concurrency is rarely the best operating point; tail latency and cache pressure rise before hard failure.

Measure

Batch size/composition, admitted/rejected, queue, active tokens, Goodput, and fairness.

KV cache and context

KV state grows with active tokens, layers, attention dimensions, and cache precision. Long contexts reduce concurrency because each sequence claims a larger share of a fixed cache budget.
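The growth can be sketched as simple arithmetic. The estimate below assumes standard attention with separate key and value tensors per layer and no cache compression; the model shape and the 16 GiB budget are hypothetical values chosen for round numbers.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Bytes of KV state one token adds across all layers (K and V, hence the 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(cache_budget_bytes: int, per_token_bytes: int) -> int:
    return cache_budget_bytes // per_token_bytes


# Hypothetical model shape with a 16-bit cache.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2
)
budget_tokens = max_cached_tokens(16 * 2**30, per_token)

for context_len in (8_192, 32_768):
    print(f"context={context_len} max_concurrent_sequences={budget_tokens // context_len}")
```

With these numbers each token costs 128 KiB of cache, so the same budget holds 16 sequences at 8,192 tokens but only 4 at 32,768.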

Implementation

Use paged allocation, exact prefix reuse, eviction, and optional tiered offload under explicit tenant and retention policy.

Operational implications

Track physical cache capacity rather than only the advertised context window. Cleanup on finish or cancel must be prompt.

Measure

Blocks/tokens used, allocation failures, hit rate, evictions, transfer, and cleanup time.

Precision and quantization

Weights, activations, and KV cache can use different precision.

Implementation

Name each format, kernel path, calibration or conversion method, and quality test.

Operational implications

Lower precision can improve fit or bandwidth but may change quality and available kernels.

Measure

Memory, throughput, latency, task quality, numerical errors, and fallback.

Structured generation

Grammar, JSON schema, or finite-state constraints restrict allowed tokens and reduce invalid output.

Implementation

Compile constraints before or during request handling, validate final output, and ensure compatibility with batching and speculation.
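Token masking under a finite-state constraint can be sketched with a toy vocabulary. This is an illustrative sketch only: the vocabulary, the hand-written state machine, and the function names are hypothetical, and real systems compile the constraint from a grammar or schema over the model's actual tokenizer.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

# Hypothetical compiled constraint: state -> {allowed token: next state}.
# It accepts exactly {"ok":true} or {"ok":false}.
FSM = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}


def mask_logits(logits: list[float], state: str) -> list[float]:
    """Set every token the current state does not allow to negative infinity."""
    allowed = FSM[state]
    return [
        logit if token in allowed else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


def constrained_greedy(logits_per_step: list[list[float]]) -> tuple[str, str]:
    """Greedy decoding over masked logits; returns (text, final state)."""
    state, pieces = "start", []
    for logits in logits_per_step:
        if state == "done":
            break
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        token = VOCAB[best]
        pieces.append(token)
        state = FSM[state][token]
    return "".join(pieces), state


# The unconstrained model would pick "hello" at every step.
raw_logits = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0]
text, final_state = constrained_greedy([raw_logits] * 5)
```

The mask is applied on every decode step, so its cost belongs in TPOT, and the final text should still be parsed and validated before it is returned.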

Operational implications

Constraint construction and token masking add overhead; invalid schemas need deterministic errors.

Measure

Schema compile time, token-mask overhead, valid-output rate, retries, and TPOT impact.

Streaming and cancellation

Streaming exposes incremental events, while cancellation must remove queued and active work and release the state that work holds.

Implementation

Define accepted, first-token, delta, usage, completed, cancelled, and failed events. Propagate client disconnect and deadlines.
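The event sequence and the cleanup guarantee can be sketched with `asyncio`. This is a minimal sketch under simplifying assumptions: a Python set stands in for allocated cache blocks, a short sleep stands in for one decode step, and `stream_tokens` and `demo` are hypothetical names.

```python
import asyncio


async def stream_tokens(request_id, tokens, cache, events):
    cache.add(request_id)  # stand-in for allocating KV blocks
    events.append("accepted")
    try:
        for index, _token in enumerate(tokens):
            await asyncio.sleep(0.01)  # stand-in for one decode step
            events.append("first_token" if index == 0 else "delta")
        events.append("usage")
        events.append("completed")
    except asyncio.CancelledError:
        events.append("cancelled")
        raise  # let the caller observe the cancellation
    finally:
        cache.discard(request_id)  # runs on completion, failure, and cancel


async def demo():
    cache, events = set(), []
    task = asyncio.create_task(stream_tokens("r1", ["t"] * 100, cache, events))
    await asyncio.sleep(0.035)  # client disconnects mid-stream
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return cache, events


cache_after, events_seen = asyncio.run(demo())
```

Putting the release in `finally` keeps one cleanup path for every terminal event, which makes the disconnect-to-cancel time and cache release straightforward to measure.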

Operational implications

Slow cleanup turns abandoned clients into hidden capacity leaks.

Measure

Disconnect-to-cancel time, cache release, orphaned requests, stream errors, and completion reason.

Latency diagnosis

Attributing latency to a specific phase turns a symptom, such as a slow first token, into a specific runtime action.

Implementation

Correlate gateway, queue, tokenizer, prefill, decode, cache, tool, and delivery spans.

Operational implications

High TTFT with low queue differs from queue overload; irregular ITL can signal phase interference or host stalls.
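The distinction can be sketched by finding which span dominates a request's time to first token. The span names and millisecond values below are made up for illustration; `dominant_phase` is a hypothetical helper, not a feature of any tracing tool.

```python
def dominant_phase(spans_ms: dict[str, float]) -> tuple[str, float]:
    """Return the largest span and its share of the total."""
    total = sum(spans_ms.values())
    phase = max(spans_ms, key=spans_ms.get)
    return phase, spans_ms[phase] / total


# Two requests with the same 900 ms TTFT and different causes.
overloaded = {"queue": 700.0, "tokenize": 10.0, "prefill": 150.0, "delivery": 40.0}
long_prompt = {"queue": 20.0, "tokenize": 30.0, "prefill": 800.0, "delivery": 50.0}

for name, spans in (("overloaded", overloaded), ("long_prompt", long_prompt)):
    phase, share = dominant_phase(spans)
    print(f"{name}: {phase} accounts for {share:.0%} of TTFT")
```

The first case points at admission and capacity, the second at prompt length or prefix reuse, even though the headline TTFT is identical.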

Measure

Phase percentiles, scheduler timeline, cache state, transfer, host CPU, and kernel timing.

Reference tables

Prefill versus decode

Property               | Prefill                       | Decode
-----------------------|-------------------------------|---------------------------------------
Input per request step | Many prompt tokens            | One token or a speculative token group
Typical sensitivity    | Compute, prompt length, queue | Memory bandwidth, cache, scheduling
Primary user metric    | TTFT                          | TPOT / ITL
State                  | Builds KV cache               | Reads and extends KV cache
Common interference    | Long prompts delay other work | Large active batches pressure memory

Inference metric guide

Metric      | Measures                                | Common misuse
------------|-----------------------------------------|------------------------------------------
Queue time  | Admission-to-execution delay            | Omitted from model latency
TTFT        | Request to first delivered token        | Compared without prompt length/cache state
TPOT        | Average post-first-token gap per request | Formula/weighting undisclosed
ITL         | Individual or token-weighted token gaps | Used interchangeably with TPOT
E2E latency | Request through final result            | Compared with different output lengths
Throughput  | Tokens or requests per interval         | Presented without latency/quality
Goodput     | Work within SLO and quality bounds      | Thresholds left undefined
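The difference between throughput and Goodput can be sketched by counting only requests that meet every stated bound. The thresholds, the `Completed` record, and the sample values below are hypothetical; the point is that the thresholds appear as explicit parameters.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    ttft_s: float
    tpot_s: float
    quality_ok: bool  # result of whatever quality check the workload defines


def throughput_rps(results: list[Completed], window_s: float) -> float:
    return len(results) / window_s


def goodput_rps(
    results: list[Completed],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    """Requests per second that met both latency SLOs and the quality bound."""
    good = sum(
        1
        for r in results
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s and r.quality_ok
    )
    return good / window_s


results = [
    Completed(ttft_s=0.4, tpot_s=0.03, quality_ok=True),   # meets everything
    Completed(ttft_s=0.5, tpot_s=0.05, quality_ok=True),   # meets everything
    Completed(ttft_s=2.5, tpot_s=0.03, quality_ok=True),   # TTFT too slow
    Completed(ttft_s=0.4, tpot_s=0.03, quality_ok=False),  # fails quality
]

raw = throughput_rps(results, window_s=2.0)
good = goodput_rps(results, window_s=2.0, ttft_slo_s=1.0, tpot_slo_s=0.05)
```

In this sample the raw throughput is 2.0 requests per second while Goodput is 1.0, so half of the measured work did not serve the stated objective.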

Decision checklist

  1. What prompt and output distributions define the workload?
  2. Which latency boundaries are included in TTFT and E2E?
  3. What Goodput SLOs apply by request class?
  4. How are prefill, decode, and structured-generation work scheduled?
  5. What is the KV-cache budget and eviction policy?
  6. How do timeout, cancellation, and client disconnect release resources?
  7. Which precision choices are quality-approved?
  8. How is overload propagated upstream?

Common mistakes

  • Publishing aggregate tokens per second without prompt/output distributions and SLOs.
  • Calling accelerator execution time TTFT while excluding queue and delivery.
  • Maximizing concurrency until tail latency becomes unusable.
  • Ignoring cold versus warm prefix-cache state.
  • Treating maximum context window as the practical default.
  • Leaking KV memory after cancellation or disconnect.
  • Enabling structured generation without measuring constraint overhead.

Sources and further reading


  1. vLLM documentation
     vLLM · Official documentation · accessed 2026-06-21 UTC

  2. PagedAttention paper
     ACM SOSP 2023 / vLLM authors · Peer-reviewed paper · accessed 2026-06-21 UTC

  3. TensorRT-LLM documentation
     NVIDIA · Official documentation · accessed 2026-06-21 UTC

  4. SGLang documentation
     SGLang · Official documentation · accessed 2026-06-21 UTC

  5. MLPerf Inference: Datacenter
     MLCommons · Benchmark specification · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
