Key takeaways
- Prefill and decode have different compute and memory profiles and should be measured separately.
- Time to first token (TTFT) includes queueing, routing, tokenization, prefill, and delivery, not only accelerator execution.
- Decode performance depends on memory bandwidth, KV-cache access, batch composition, scheduling, and sequence length.
- Raw tokens per second is insufficient; report TTFT, time per output token (TPOT) or inter-token latency (ITL), tail latency, quality, and Goodput under defined traffic.
- Cancellation, structured outputs, cache cleanup, and backpressure are part of the inference runtime contract.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Tokenized prompts, model/adapter selection, sampling settings, output constraints, priority, deadline, and cache hints.
Owns
Model execution, attention state, prefill/decode scheduling, sampling, streaming, cancellation cleanup, and token-level telemetry.
Emits
Token events or complete outputs, usage counters, finish reason, cache lifecycle events, timing spans, and errors.
Does not own
User authorization, enterprise context policy, tool side-effect permission, or product workflow.
Failure modes
OOM, queue overload, prefill interference, cache exhaustion, slow decode, invalid structured output, cancellation leaks, and model-load failure.
Evidence and metrics
Queue time, TTFT, prefill, TPOT, ITL, E2E, throughput, Goodput, HBM/KV occupancy, cancellation cleanup, and cost.
Prefill
Prefill processes the prompt tokens and builds the key/value attention state. Because the prompt tokens are processed together, the work runs as large matrix operations that parallelize well, but longer prompts mean more work before the first output token.
Implementation
Measure tokenization, queue, prompt transfer, model prefill, and first-token delivery separately. Use prefix reuse only when token/model/config identity matches.
Operational implications
Long prefills can interfere with active decode streams. Chunking or phase-specific workers can bound the effect.
Measure
Prompt tokens, queue, prefill time, TTFT, cache hit, and prefill tokens avoided.
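The identity rule for prefix reuse can be sketched as a cache key that covers the exact token prefix plus the model and configuration identity. This is a minimal illustration, not any particular runtime's scheme; `prefix_cache_key`, `model_id`, and `config_fingerprint` are hypothetical names introduced here.

```python
import hashlib
from typing import Sequence


def prefix_cache_key(
    token_ids: Sequence[int],
    model_id: str,
    config_fingerprint: str,
) -> str:
    """Return a key that matches only when tokens, model, and config all match.

    model_id and config_fingerprint are assumed to be stable strings that
    change whenever weights, adapters, precision, or tokenizer change.
    """
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))
    h.update(b"\x00")  # separator so adjacent fields cannot run together
    h.update(config_fingerprint.encode("utf-8"))
    h.update(b"\x00")
    for t in token_ids:
        # Fixed-width encoding keeps token boundaries unambiguous.
        h.update(int(t).to_bytes(8, "little", signed=False))
    return h.hexdigest()


key = prefix_cache_key([101, 2009, 2003], "model-a", "fp16/tp1")
print(key[:16])
```

Hashing token IDs rather than raw text means two prompts that tokenize differently never share state. A runtime may instead hash fixed-size blocks of tokens so that partial prefixes can also match.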
Decode
Decode generates subsequent tokens sequentially for each request and repeatedly reads weights and relevant cache state.
Implementation
Batch many active sequences while controlling memory and fairness. Include sampling and structured-output processing in the traced critical path, since they run on every decode step.
Operational implications
Decode can be memory-bandwidth-bound. High aggregate utilization can still produce poor per-request ITL under oversized batches.
Measure
TPOT, ITL distribution, active sequences/tokens, bandwidth, output throughput, and Goodput.
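A rough way to see why larger batches can raise aggregate throughput while hurting per-request ITL is a bandwidth-only lower bound on decode step time. The sketch below assumes each step reads all weights once plus every active sequence's KV state, and ignores compute, kernel launch, and scheduling overhead. The function name and the example numbers are illustrative assumptions, not measurements.

```python
def decode_step_floor_ms(
    weight_bytes: float,
    kv_bytes_per_seq: float,
    batch_size: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Bandwidth-only lower bound on one decode step, in milliseconds."""
    bytes_read = weight_bytes + kv_bytes_per_seq * batch_size
    return 1000.0 * bytes_read / bandwidth_bytes_per_s


# Hypothetical device and model: 16 GB of weights, 0.5 GB of KV state per
# sequence, 2 TB/s of memory bandwidth.
for batch in (1, 16, 64):
    step_ms = decode_step_floor_ms(16e9, 0.5e9, batch, 2e12)
    tokens_per_s = batch / (step_ms / 1000.0)
    print(f"batch={batch:3d} step>={step_ms:6.2f} ms throughput<={tokens_per_s:8.1f} tok/s")
```

Under these assumptions, the step-time floor grows with the batch even as the throughput ceiling rises, which is the trade-off between utilization and per-request ITL described above.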
TTFT, TPOT, ITL, and E2E
These metrics describe different portions of user-visible latency. Definitions and aggregation must be explicit.
Implementation
Define the request boundary, whether queue/network/tokenization are included, how token gaps are weighted, and how output length affects E2E.
Operational implications
Use percentiles by workload class rather than one global average.
Measure
p50/p95/p99 queue, TTFT, TPOT/ITL, E2E, output length, and errors.
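One possible set of definitions, written out so the boundaries are explicit, is sketched below. It assumes the request boundary is the moment the server receives the request and that token times are delivery timestamps. `RequestTrace` and the helper names are hypothetical, and the nearest-rank percentile is one of several reasonable choices.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestTrace:
    request_received: float   # seconds; the chosen request boundary
    token_times: List[float]  # delivery time of each output token


def ttft(trace: RequestTrace) -> float:
    """Request boundary to first delivered token (includes queue and prefill)."""
    return trace.token_times[0] - trace.request_received


def itl(trace: RequestTrace) -> List[float]:
    """Individual gaps between consecutive delivered tokens."""
    return [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]


def tpot(trace: RequestTrace) -> Optional[float]:
    """Mean post-first-token gap for one request; None for single-token outputs."""
    gaps = itl(trace)
    return sum(gaps) / len(gaps) if gaps else None


def e2e(trace: RequestTrace) -> float:
    """Request boundary to final delivered token."""
    return trace.token_times[-1] - trace.request_received


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


trace = RequestTrace(request_received=0.0, token_times=[0.5, 0.6, 0.8, 0.9])
print(ttft(trace), tpot(trace), e2e(trace))
```

Averaging per-request TPOT weights every request equally, while pooling all ITL gaps weights long outputs more heavily. The two aggregations can disagree, which is why the weighting needs to be stated.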
Continuous batching and scheduling
The active batch changes as sequences arrive, finish, pause, or cancel.
Implementation
Use token- and memory-aware admission, bounded queues, priority/fairness, prefill/decode policy, and upstream backpressure.
Operational implications
Maximum concurrency is rarely the best operating point; tail latency and cache pressure rise before hard failure.
Measure
Batch size/composition, admitted/rejected, queue, active tokens, Goodput, and fairness.
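Token-aware admission with a bounded queue can be sketched as follows. This is a deliberately simple FIFO policy that reserves worst-case tokens (prompt plus maximum output) per request. A real scheduler would likely add priority, fairness, and less conservative reservation. All class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def reserved_tokens(self) -> int:
        # Worst-case KV footprint in tokens for this request.
        return self.prompt_tokens + self.max_new_tokens


class AdmissionController:
    def __init__(self, kv_token_budget: int, max_queue: int) -> None:
        self.kv_token_budget = kv_token_budget
        self.max_queue = max_queue
        self.active_tokens = 0
        self.queue: Deque[Request] = deque()

    def _fits(self, req: Request) -> bool:
        return self.active_tokens + req.reserved_tokens <= self.kv_token_budget

    def submit(self, req: Request) -> str:
        """Return 'admitted', 'queued', or 'rejected' (the backpressure signal)."""
        if req.reserved_tokens > self.kv_token_budget:
            return "rejected"  # could never fit, so fail fast
        if not self.queue and self._fits(req):
            self.active_tokens += req.reserved_tokens
            return "admitted"
        if len(self.queue) < self.max_queue:
            self.queue.append(req)
            return "queued"
        return "rejected"

    def finish(self, req: Request) -> List[Request]:
        """Release a finished or cancelled request, then admit queued work in FIFO order."""
        self.active_tokens -= req.reserved_tokens
        admitted = []
        while self.queue and self._fits(self.queue[0]):
            nxt = self.queue.popleft()
            self.active_tokens += nxt.reserved_tokens
            admitted.append(nxt)
        return admitted
```

Rejecting at the queue bound, rather than letting the queue grow, is what lets the caller apply backpressure upstream before tail latency degrades.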
KV cache and context
KV state grows with active tokens, layers, attention dimensions, and cache precision. Long contexts reduce concurrency.
Implementation
Use paged allocation, exact prefix reuse, eviction, and optional tiered offload under explicit tenant and retention policy.
Operational implications
Track physical cache capacity rather than only advertised context window. Cleanup on finish/cancel must be prompt.
Measure
Blocks/tokens used, allocation failures, hit rate, evictions, transfer, and cleanup time.
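The growth relationship can be made concrete with a small calculation. The sketch assumes standard attention that stores one key and one value vector per layer, per KV head, per token. The model shape and memory budget below are illustrative, not taken from any specific deployment.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Bytes of KV state per cached token (factor 2 = keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in the physical KV budget."""
    return kv_budget_bytes // per_token_bytes


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(32, 8, 128, 2)
capacity = max_cached_tokens(40 * 2**30, per_token)  # assumed 40 GiB KV budget

for context_len in (8192, 32768):
    print(context_len, "tokens/sequence ->", capacity // context_len, "concurrent sequences")
```

With these assumed numbers, quadrupling the per-sequence context cuts the number of sequences that fit by the same factor, which is why physical capacity matters more than the advertised window.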
Precision and quantization
Weights, activations, and KV cache can use different precision.
Implementation
Name each format, kernel path, calibration or conversion method, and quality test.
Operational implications
Lower precision can improve fit or bandwidth but may change quality and available kernels.
Measure
Memory, throughput, latency, task quality, numerical errors, and fallback.
Structured generation
Grammar, JSON schema, or finite-state constraints restrict allowed tokens and reduce invalid output.
Implementation
Compile constraints before or during request handling, validate final output, and ensure compatibility with batching and speculation.
Operational implications
Constraint construction and token masking add overhead; invalid schemas need deterministic errors.
Measure
Schema compile time, token-mask overhead, valid-output rate, retries, and TPOT impact.
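Token masking under a finite-state constraint can be illustrated with a toy vocabulary. The sketch below is not how any specific library implements constrained decoding. The vocabulary, transition table, and function names are all made up for illustration, and real tokenizers require mapping constraints onto subword tokens.

```python
import math
from typing import List, Tuple

# Toy vocabulary and a hand-written automaton that accepts {"ok":true} or {"ok":false}.
VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]
TRANSITIONS = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT_STATE = 5


def mask_logits(logits: List[float], state: int) -> List[float]:
    """Set logits of tokens the constraint forbids in this state to -inf."""
    allowed = TRANSITIONS.get(state, {})
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy(logits_per_step: List[List[float]]) -> Tuple[str, bool]:
    """Greedy decode under the constraint; returns (text, reached_accept_state)."""
    state, out = 0, []
    for logits in logits_per_step:
        if state == ACCEPT_STATE:
            break
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        token = VOCAB[best]
        out.append(token)
        state = TRANSITIONS[state][token]
    return "".join(out), state == ACCEPT_STATE


# The unconstrained model "prefers" the token "hello" at every step.
biased = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 9.0]
text, valid = constrained_greedy([biased] * 5)
print(text, valid)
```

The mask is computed on every decode step, which is where the per-token overhead comes from. Validating the final output is still worthwhile because generation can stop early, for example at a length limit, before the accept state is reached.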
Streaming and cancellation
Streaming exposes incremental events while cancellation must remove queued/active work and release state.
Implementation
Define accepted, first-token, delta, usage, completed, cancelled, and failed events. Propagate client disconnect and deadlines.
Operational implications
Slow cleanup turns abandoned clients into hidden capacity leaks.
Measure
Disconnect-to-cancel time, cache release, orphaned requests, stream errors, and completion reason.
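The event contract and cleanup-on-cancel behavior can be sketched with `asyncio`. This is a simplified illustration: `KVPool` and `generate` are hypothetical stand-ins, the sleep substitutes for a decode step, and a client disconnect is modeled as task cancellation.

```python
import asyncio
from typing import List, Set, Tuple

Event = Tuple[str, str]  # (event type, request id)


class KVPool:
    """Stand-in for cache state held on behalf of active requests."""

    def __init__(self) -> None:
        self.held: Set[str] = set()

    def acquire(self, request_id: str) -> None:
        self.held.add(request_id)

    def release(self, request_id: str) -> None:
        self.held.discard(request_id)


async def generate(request_id: str, n_tokens: int, pool: KVPool, events: List[Event]) -> None:
    pool.acquire(request_id)
    events.append(("accepted", request_id))
    try:
        for i in range(n_tokens):
            await asyncio.sleep(0.01)  # placeholder for one decode step
            events.append(("first-token" if i == 0 else "delta", request_id))
        events.append(("completed", request_id))
    except asyncio.CancelledError:
        events.append(("cancelled", request_id))
        raise
    finally:
        # Runs on completion, cancellation, and failure alike.
        pool.release(request_id)


async def demo() -> Tuple[KVPool, List[Event]]:
    pool, events = KVPool(), []
    task = asyncio.create_task(generate("r1", 1000, pool, events))
    await asyncio.sleep(0.05)
    task.cancel()  # models a client disconnect or an expired deadline
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pool, events


pool, events = asyncio.run(demo())
print(events[0], events[-1], "held:", pool.held)
```

Putting the release in `finally` ties cleanup to every exit path, so an abandoned stream cannot keep its cache state once cancellation has propagated.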
Latency diagnosis
Attributing latency to a specific phase turns a symptom such as slow first tokens into a concrete runtime action.
Implementation
Correlate gateway, queue, tokenizer, prefill, decode, cache, tool, and delivery spans.
Operational implications
High TTFT with low queue differs from queue overload; irregular ITL can signal phase interference or host stalls.
Measure
Phase percentiles, scheduler timeline, cache state, transfer, host CPU, and kernel timing.
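As a minimal sketch of phase attribution, the snippet below splits TTFT into per-phase shares and names the dominant phase. The phase names and the example durations are assumptions. A real trace would come from correlated spans rather than a hand-written dictionary.

```python
from typing import Dict

TTFT_PHASES = ("queue", "tokenize", "prefill", "delivery")


def ttft_shares(spans_ms: Dict[str, float]) -> Dict[str, float]:
    """Fraction of TTFT spent in each phase."""
    total = sum(spans_ms.get(p, 0.0) for p in TTFT_PHASES)
    return {p: (spans_ms.get(p, 0.0) / total if total else 0.0) for p in TTFT_PHASES}


def dominant_phase(spans_ms: Dict[str, float]) -> str:
    """Phase that contributes the most time to TTFT."""
    return max(TTFT_PHASES, key=lambda p: spans_ms.get(p, 0.0))


# Two requests with similar TTFT but different causes (illustrative numbers).
long_prompt = {"queue": 20.0, "tokenize": 5.0, "prefill": 900.0, "delivery": 10.0}
overloaded = {"queue": 1200.0, "tokenize": 5.0, "prefill": 150.0, "delivery": 10.0}

print(dominant_phase(long_prompt))  # prefill dominates: look at prompt length or cache state
print(dominant_phase(overloaded))   # queue dominates: look at admission and capacity
```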
Reference tables
| Property | Prefill | Decode |
|---|---|---|
| Input per request step | Many prompt tokens | One or speculative token group |
| Typical sensitivity | Compute, prompt length, queue | Memory bandwidth, cache, scheduling |
| Primary user metric | TTFT | TPOT / ITL |
| State | Builds KV cache | Reads and extends KV cache |
| Common interference | Long prompts delay other work | Large active batches pressure memory |

| Metric | Measures | Common misuse |
|---|---|---|
| Queue time | Admission-to-execution delay | Omitted from model latency |
| TTFT | Request to first delivered token | Compared without prompt length/cache state |
| TPOT | Average post-first-token gap per request | Formula/weighting undisclosed |
| ITL | Individual or token-weighted token gaps | Used interchangeably with TPOT |
| E2E latency | Request through final result | Compared with different output lengths |
| Throughput | Tokens or requests per interval | Presented without latency/quality |
| Goodput | Work within SLO and quality bounds | Thresholds left undefined |
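
The difference between throughput and Goodput can be shown with a small calculation. The sketch assumes Goodput counts only requests that meet both latency SLOs and pass a quality check. The thresholds, field names, and sample values are illustrative and would need to be defined per request class.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Completed:
    ttft_s: float
    tpot_s: float
    output_tokens: int
    passed_quality: bool


def throughput_tokens_per_s(requests: List[Completed], window_s: float) -> float:
    """Raw output tokens per second, regardless of latency or quality."""
    return sum(r.output_tokens for r in requests) / window_s


def goodput(
    requests: List[Completed],
    ttft_slo_s: float,
    tpot_slo_s: float,
    window_s: float,
) -> Dict[str, float]:
    """Rate of work that met every stated threshold."""
    good = [
        r for r in requests
        if r.passed_quality and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    ]
    return {
        "requests_per_s": len(good) / window_s,
        "tokens_per_s": sum(r.output_tokens for r in good) / window_s,
    }


sample = [
    Completed(0.3, 0.04, 100, True),   # meets everything
    Completed(1.5, 0.04, 100, True),   # misses the TTFT SLO
    Completed(0.3, 0.09, 100, True),   # misses the TPOT SLO
    Completed(0.3, 0.04, 100, False),  # fails the quality check
]
print(throughput_tokens_per_s(sample, window_s=10.0))
print(goodput(sample, ttft_slo_s=0.5, tpot_slo_s=0.05, window_s=10.0))
```

In this made-up sample the raw throughput is four times the token Goodput, which is the gap that a throughput-only report hides.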
Decision checklist
- What prompt and output distributions define the workload?
- Which latency boundaries are included in TTFT and E2E?
- What Goodput SLOs apply by request class?
- How are prefill, decode, and structured-generation work scheduled?
- What is the KV-cache budget and eviction policy?
- How do timeout, cancellation, and client disconnect release resources?
- Which precision choices are quality-approved?
- How is overload propagated upstream?
Common mistakes
- Publishing aggregate tokens per second without prompt/output distributions and SLOs.
- Calling accelerator execution time TTFT while excluding queue and delivery.
- Maximizing concurrency until tail latency becomes unusable.
- Ignoring cold versus warm prefix-cache state.
- Treating maximum context window as the practical default.
- Leaking KV memory after cancellation or disconnect.
- Enabling structured generation without measuring constraint overhead.
Sources and further reading
- vLLM documentation
- PagedAttention paper
- TensorRT-LLM documentation
- SGLang documentation
- MLPerf Inference: Datacenter
Last reviewed: 2026-06-21 UTC
