LLM Inference - aRuntime.com

Key takeaways

Prefill and decode have different compute and memory profiles and should be measured separately.
Time to first token includes queueing, routing, tokenization, prefill, and delivery, not only accelerator execution.
Decode performance depends on memory bandwidth, KV-cache access, batch composition, scheduling, and sequence length.
Raw tokens per second is insufficient; use TTFT, TPOT or ITL, tail latency, quality, and Goodput under defined traffic.
Cancellation, structured outputs, cache cleanup, and backpressure are part of the inference runtime contract.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Tokenized prompts, model/adapter selection, sampling settings, output constraints, priority, deadline, and cache hints.

Owns

Model execution, attention state, prefill/decode scheduling, sampling, streaming, cancellation cleanup, and token-level telemetry.

Emits

Token events or complete outputs, usage counters, finish reason, cache lifecycle events, timing spans, and errors.

Does not own

User authorization, enterprise context policy, tool side-effect permission, or product workflow.

Failure modes

OOM, queue overload, prefill interference, cache exhaustion, slow decode, invalid structured output, cancellation leaks, and model-load failure.

Evidence and metrics

Queue time, TTFT, prefill, TPOT, ITL, E2E, throughput, Goodput, HBM/KV occupancy, cancellation cleanup, and cost.

Prefill

Prefill processes prompt tokens and builds key/value attention state. Large matrix operations expose parallel compute, while long prompts increase work before the first output.

Implementation

Measure tokenization, queue, prompt transfer, model prefill, and first-token delivery separately. Use prefix reuse only when token/model/config identity matches.
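
The identity rule for prefix reuse can be sketched as a cache key. This is a minimal illustration, not any particular runtime's scheme: the function name `prefix_cache_key` and the fields `model_id`, `adapter_id`, and `kv_dtype` are assumptions chosen to show that a key should change whenever tokens, model, or configuration change.

```python
import hashlib
import json
from typing import Optional, Sequence


def prefix_cache_key(
    token_ids: Sequence[int],
    model_id: str,
    adapter_id: Optional[str],
    kv_dtype: str,
) -> str:
    """Return a key that differs if tokens, model, adapter, or KV precision differ."""
    identity = {
        "model": model_id,        # hypothetical model identifier
        "adapter": adapter_id,    # hypothetical adapter identifier, or None
        "kv_dtype": kv_dtype,     # cache precision is part of the identity
        "tokens": list(token_ids),
    }
    payload = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two requests share cached prefix state only when their keys are equal, so a change of adapter or cache precision produces a miss rather than a silent mismatch.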

Operational implications

Long prefills can interfere with active decode streams. Chunking or phase-specific workers can bound the effect.
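
Chunking can be illustrated with a small helper that splits a prompt into bounded pieces. This is a sketch under the assumption that a scheduler runs one chunk per step and lets decode steps run between chunks; `prefill_chunks` and `max_chunk_tokens` are hypothetical names.

```python
from typing import Iterator, List, Sequence


def prefill_chunks(
    prompt_tokens: Sequence[int], max_chunk_tokens: int
) -> Iterator[List[int]]:
    """Yield consecutive slices of the prompt, each at most max_chunk_tokens long."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    for start in range(0, len(prompt_tokens), max_chunk_tokens):
        yield list(prompt_tokens[start:start + max_chunk_tokens])
```

A smaller chunk size bounds how long any single prefill step can delay active decode streams, at the cost of more scheduler steps per prompt.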

Measure

Prompt tokens, queue, prefill time, TTFT, cache hit, and prefill tokens avoided.

Decode

Decode generates subsequent tokens sequentially for each request and repeatedly reads weights and relevant cache state.

Implementation

Batch many active sequences while controlling memory and fairness. Include sampling and structured-output work in the critical-path trace so their cost shows up in per-token latency.

Operational implications

Decode can be memory-bandwidth-bound. High aggregate utilization can still produce poor per-request ITL under oversized batches.
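
A rough way to reason about the bandwidth bound is a lower-bound estimate of step time. The sketch below assumes a simplified model in which each decode step reads all weights once for the whole batch and reads each sequence's KV state once, with no overlap or on-chip reuse; the function name and all parameters are illustrative.

```python
def decode_step_floor_seconds(
    weight_bytes: float,
    kv_bytes_per_sequence: float,
    batch_size: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Lower bound on one decode step if memory traffic were the only cost."""
    bytes_read = weight_bytes + kv_bytes_per_sequence * batch_size
    return bytes_read / bandwidth_bytes_per_s


# Hypothetical numbers: 14 GB of weights, 0.5 GB of KV per sequence, 1 TB/s.
small = decode_step_floor_seconds(14e9, 0.5e9, 4, 1e12)
large = decode_step_floor_seconds(14e9, 0.5e9, 64, 1e12)
```

Under this model the larger batch moves more bytes per step, so aggregate throughput can rise while the gap each request sees between tokens also grows.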

Measure

TPOT, ITL distribution, active sequences/tokens, bandwidth, output throughput, and Goodput.

TTFT, TPOT, ITL, and E2E

These metrics describe different portions of user-visible latency. Definitions and aggregation must be explicit.

Implementation

Define the request boundary, whether queue/network/tokenization are included, how token gaps are weighted, and how output length affects E2E.
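
One way to make these definitions explicit is to compute them from a per-request timeline. The sketch below assumes the request boundary is a single `received_s` timestamp and that each output token has a delivery timestamp; the class and function names are illustrative, and TPOT is taken as the unweighted mean of post-first-token gaps within one request.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestTimeline:
    received_s: float            # time the request crossed the chosen boundary
    token_times_s: List[float]   # delivery time of each output token, in order


def ttft(t: RequestTimeline) -> float:
    """Time from the request boundary to the first delivered token."""
    return t.token_times_s[0] - t.received_s


def itl(t: RequestTimeline) -> List[float]:
    """Individual gaps between consecutive delivered tokens."""
    return [b - a for a, b in zip(t.token_times_s, t.token_times_s[1:])]


def tpot(t: RequestTimeline) -> Optional[float]:
    """Mean post-first-token gap; None when only one token was produced."""
    gaps = itl(t)
    return sum(gaps) / len(gaps) if gaps else None


def e2e(t: RequestTimeline) -> float:
    """Time from the request boundary to the last delivered token."""
    return t.token_times_s[-1] - t.received_s
```

Moving `received_s` earlier (for example, to gateway arrival instead of scheduler admission) changes TTFT and E2E but leaves TPOT and ITL unchanged, which is why the boundary has to be stated.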

Operational implications

Use percentiles by workload class rather than one global average.

Measure

p50/p95/p99 queue, TTFT, TPOT/ITL, E2E, output length, and errors.
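
Per-class percentiles can be sketched with the nearest-rank method. This is one of several percentile definitions, chosen here for simplicity; the class labels and function names are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def nearest_rank_percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct percent of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentiles_by_class(
    samples: Iterable[Tuple[str, float]], pcts: Sequence[int] = (50, 95, 99)
) -> Dict[str, Dict[int, float]]:
    """Group (workload_class, latency) samples and report percentiles per class."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for workload_class, value in samples:
        grouped[workload_class].append(value)
    return {
        cls: {p: nearest_rank_percentile(vals, p) for p in pcts}
        for cls, vals in grouped.items()
    }
```

Reporting each class separately keeps a slow long-context class from being hidden inside, or distorting, the figures for a short interactive class.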

Continuous batching and scheduling

The active batch changes as sequences arrive, finish, pause, or cancel.

Implementation

Use token- and memory-aware admission, bounded queues, priority/fairness, prefill/decode policy, and upstream backpressure.
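
A token-aware admission policy with a bounded queue might look like the sketch below. It assumes a conservative reservation of prompt tokens plus the maximum new tokens per request and strict first-in, first-out ordering; real schedulers often use finer-grained policies, and every name here is hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def token_need(self) -> int:
        # Conservative assumption: reserve cache for the full possible output.
        return self.prompt_tokens + self.max_new_tokens


class AdmissionController:
    def __init__(self, kv_token_budget: int, max_queue: int) -> None:
        self.kv_token_budget = kv_token_budget
        self.max_queue = max_queue
        self.reserved_tokens = 0
        self.queue: Deque[Request] = deque()

    def _fits(self, req: Request) -> bool:
        return self.reserved_tokens + req.token_need <= self.kv_token_budget

    def submit(self, req: Request) -> str:
        if req.token_need > self.kv_token_budget:
            return "rejected"              # can never fit in the cache budget
        if not self.queue and self._fits(req):
            self.reserved_tokens += req.token_need
            return "admitted"
        if len(self.queue) < self.max_queue:
            self.queue.append(req)
            return "queued"
        return "rejected"                  # full queue: backpressure signal upstream

    def release(self, req: Request) -> List[Request]:
        """Free a finished or cancelled request and admit queued work that now fits."""
        self.reserved_tokens -= req.token_need
        admitted: List[Request] = []
        while self.queue and self._fits(self.queue[0]):
            nxt = self.queue.popleft()
            self.reserved_tokens += nxt.token_need
            admitted.append(nxt)
        return admitted
```

The "rejected" result is what an upstream gateway would translate into a retry-later response, so overload is visible to callers instead of accumulating as unbounded queue time.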

Operational implications

Maximum concurrency is rarely the best operating point; tail latency and cache pressure rise before hard failure.

Measure

Batch size/composition, admitted/rejected, queue, active tokens, Goodput, and fairness.

KV cache and context

KV state grows with active tokens, layers, attention dimensions, and cache precision. Long contexts reduce concurrency.
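
The growth can be made concrete with the usual per-token estimate: one key and one value vector per layer per KV head. The sketch below assumes a standard transformer layout with no cross-layer sharing or compression; the model dimensions in the example are hypothetical.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_element: int
) -> int:
    """Bytes of KV state added per token: key plus value, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_for_tokens(active_tokens: int, per_token_bytes: int) -> int:
    """Total KV bytes for a given number of active tokens across all sequences."""
    return active_tokens * per_token_bytes


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 2-byte elements.
per_token = kv_bytes_per_token(32, 8, 128, 2)      # 131072 bytes (128 KiB)
one_long_context = kv_bytes_for_tokens(4096, per_token)  # 536870912 bytes (512 MiB)
```

Because the total scales with active tokens, one long sequence consumes the same cache as many short ones, which is the mechanism by which long contexts reduce concurrency.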

Implementation

Use paged allocation, exact prefix reuse, eviction, and optional tiered offload under explicit tenant and retention policy.
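
Paged allocation can be sketched as a free list of fixed-size blocks plus a per-sequence block table. This is a simplified illustration of the idea only; it omits prefix sharing, eviction, and offload, and the class and method names are assumptions.

```python
from typing import Dict, List


class BlockAllocator:
    def __init__(self, num_blocks: int, block_tokens: int) -> None:
        self.block_tokens = block_tokens
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}   # sequence id -> block ids
        self.lengths: Dict[str, int] = {}        # sequence id -> tokens stored

    def append_tokens(self, seq_id: str, n: int) -> bool:
        """Reserve room for n more tokens; return False on allocation failure."""
        new_length = self.lengths.get(seq_id, 0) + n
        blocks_needed = -(-new_length // self.block_tokens)  # ceiling division
        extra = blocks_needed - len(self.tables.get(seq_id, []))
        if extra > len(self.free):
            return False                          # caller may queue, evict, or reject
        table = self.tables.setdefault(seq_id, [])
        for _ in range(extra):
            table.append(self.free.pop())
        self.lengths[seq_id] = new_length
        return True

    def release(self, seq_id: str) -> int:
        """Return all blocks of a finished or cancelled sequence to the free list."""
        blocks = self.tables.pop(seq_id, [])
        self.lengths.pop(seq_id, None)
        self.free.extend(blocks)
        return len(blocks)
```

Counting failed `append_tokens` calls and the number of blocks returned by `release` gives two of the metrics listed for this section: allocation failures and blocks used.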

Operational implications

Track physical cache capacity rather than only advertised context window. Cleanup on finish/cancel must be prompt.

Measure

Blocks/tokens used, allocation failures, hit rate, evictions, transfer, and cleanup time.

Precision and quantization

Weights, activations, and KV cache can use different precision.

Implementation

Name each format, kernel path, calibration or conversion method, and quality test.

Operational implications

Lower precision can improve fit or bandwidth but may change quality and available kernels.
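
The quality side of that trade-off can be measured directly. The sketch below uses symmetric per-tensor int8 quantization purely as an example scheme; it is not presented as the method of any specific runtime, and the function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    """Largest absolute difference between original and round-tripped values."""
    return max(abs(a - b) for a, b in zip(original, restored))
```

A round-trip error like this is only a numerical check; it complements, and does not replace, the task-quality tests named above.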

Measure

Memory, throughput, latency, task quality, numerical errors, and fallback.

Structured generation

Grammar, JSON schema, or finite-state constraints restrict allowed tokens and reduce invalid output.

Implementation

Compile constraints before or during request handling, validate final output, and ensure compatibility with batching and speculation.
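
Token masking under a finite-state constraint can be sketched with a toy vocabulary. Everything here is hypothetical: the six-token vocabulary, the hand-written automaton that accepts `{"ok":true}` or `{"ok":false}`, and greedy selection in place of real sampling.

```python
import math
from typing import List, Tuple

# Toy vocabulary and automaton; real systems compile these from a grammar or schema.
VOCAB = ["{", '"ok"', ":", "true", "false", "}", "hello"]
TRANSITIONS = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT = 5


def masked_logits(logits: List[float], state: int) -> List[float]:
    """Set logits of tokens not allowed in this state to negative infinity."""
    allowed = TRANSITIONS.get(state, {})
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy(logits_per_step: List[List[float]]) -> Tuple[str, bool]:
    """Pick the best allowed token at each step; report whether the output is complete."""
    state, out = 0, []
    for logits in logits_per_step:
        if state == ACCEPT:
            break
        masked = masked_logits(logits, state)
        token = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        out.append(token)
        state = TRANSITIONS[state][token]
    return "".join(out), state == ACCEPT
```

Even when the unconstrained model prefers an out-of-grammar token such as `hello`, the mask forces a valid choice; the final acceptance flag is the hook for the output validation step.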

Operational implications

Constraint construction and token masking add overhead; invalid schemas need deterministic errors.

Measure

Schema compile time, token-mask overhead, valid-output rate, retries, and TPOT impact.

Streaming and cancellation

Streaming exposes incremental events, while cancellation must remove queued or active work and release its state.

Implementation

Define accepted, first-token, delta, usage, completed, cancelled, and failed events. Propagate client disconnect and deadlines.
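
The event sequence and cancellation path can be sketched with `asyncio`. In this illustration a sleep stands in for a decode step and appending to a list stands in for releasing cache state; the event names follow the list above, and the function names are assumptions.

```python
import asyncio
from typing import List, Tuple


async def stream_tokens(
    request_id: str, n_tokens: int, events: List[Tuple], released: List[str]
) -> None:
    events.append(("accepted", request_id))
    try:
        for i in range(n_tokens):
            await asyncio.sleep(0.01)          # stands in for one decode step
            events.append(("first-token" if i == 0 else "delta", i))
        events.append(("completed", request_id))
    except asyncio.CancelledError:
        events.append(("cancelled", request_id))
        raise                                   # let the caller observe cancellation
    finally:
        released.append(request_id)             # stands in for releasing KV state


async def demo() -> Tuple[List[Tuple], List[str]]:
    events: List[Tuple] = []
    released: List[str] = []
    task = asyncio.create_task(stream_tokens("r1", 1000, events, released))
    await asyncio.sleep(0.05)
    task.cancel()                               # client disconnect or deadline expiry
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events, released
```

Putting the release in `finally` means completion, failure, and cancellation all free state through the same path, which is what keeps abandoned requests from holding capacity.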

Operational implications

Slow cleanup turns abandoned clients into hidden capacity leaks.

Measure

Disconnect-to-cancel time, cache release, orphaned requests, stream errors, and completion reason.

Latency diagnosis

Phase attribution turns symptoms into runtime actions.

Implementation

Correlate gateway, queue, tokenizer, prefill, decode, cache, tool, and delivery spans.

Operational implications

High TTFT with low queue differs from queue overload; irregular ITL can signal phase interference or host stalls.
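
Phase attribution can start from something as simple as per-phase shares of one request's latency. The sketch below assumes spans have already been correlated into a mapping of phase name to duration in seconds; the phase names and values are hypothetical.

```python
from typing import Dict


def phase_shares(spans: Dict[str, float]) -> Dict[str, float]:
    """Fraction of total measured time spent in each phase."""
    total = sum(spans.values())
    return {phase: duration / total for phase, duration in spans.items()}


def dominant_phase(spans: Dict[str, float]) -> str:
    """Phase with the largest duration."""
    return max(spans, key=spans.get)


# Hypothetical TTFT breakdown for one request, in seconds.
ttft_spans = {"queue": 0.02, "tokenizer": 0.005, "prefill": 0.4, "delivery": 0.01}
```

Here the dominant phase is prefill rather than queue, which points toward prompt length or cache misses instead of admission overload.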

Measure

Phase percentiles, scheduler timeline, cache state, transfer, host CPU, and kernel timing.

Reference tables

Prefill versus decode
Property	Prefill	Decode
Input per request step	Many prompt tokens	One token, or a speculative token group
Typical sensitivity	Compute, prompt length, queue	Memory bandwidth, cache, scheduling
Primary user metric	TTFT	TPOT / ITL
State	Builds KV cache	Reads and extends KV cache
Common interference	Long prompts delay other work	Large active batches pressure memory

Inference metric guide
Metric	Measures	Common misuse
Queue time	Admission-to-execution delay	Omitted from model latency
TTFT	Request to first delivered token	Compared without prompt length/cache state
TPOT	Average post-first-token gap per request	Formula/weighting undisclosed
ITL	Individual or token-weighted token gaps	Used interchangeably with TPOT
E2E latency	Request through final result	Compared with different output lengths
Throughput	Tokens or requests per interval	Presented without latency/quality
Goodput	Work within SLO and quality bounds	Thresholds left undefined
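
Goodput from the table above can be computed once its thresholds are written down. The sketch below assumes a request counts as good when it meets a TTFT bound, a TPOT bound, and a quality check; the field names and SLO parameters are illustrative, and other definitions (for example, counting tokens instead of requests) are equally valid.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outcome:
    ttft_s: float
    tpot_s: float
    quality_ok: bool


def goodput_rps(
    outcomes: Iterable[Outcome],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    """Requests per second that met both latency SLOs and the quality check."""
    good = sum(
        1
        for o in outcomes
        if o.quality_ok and o.ttft_s <= ttft_slo_s and o.tpot_s <= tpot_slo_s
    )
    return good / window_s
```

Raw throughput over the same window would count every completed request, so the gap between the two numbers shows how much served work missed its targets.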

Decision checklist

What prompt and output distributions define the workload?
Which latency boundaries are included in TTFT and E2E?
What Goodput SLOs apply by request class?
How are prefill, decode, and structured-generation work scheduled?
What is the KV-cache budget and eviction policy?
How do timeout, cancellation, and client disconnect release resources?
Which precision choices are quality-approved?
How is overload propagated upstream?

Common mistakes

Publishing aggregate tokens per second without prompt/output distributions and SLOs.
Calling accelerator execution time TTFT while excluding queue and delivery.
Maximizing concurrency until tail latency becomes unusable.
Ignoring cold versus warm prefix-cache state.
Treating maximum context window as the practical default.
Leaking KV memory after cancellation or disconnect.
Enabling structured generation without measuring constraint overhead.

Sources and further reading

vLLM documentation
vLLM · Official documentation · accessed 2026-06-21 UTC

PagedAttention paper
USENIX / vLLM authors · Peer-reviewed paper · accessed 2026-06-21 UTC

TensorRT-LLM documentation
NVIDIA · Official documentation · accessed 2026-06-21 UTC

SGLang documentation
SGLang · Official documentation · accessed 2026-06-21 UTC

MLPerf Inference: Datacenter
MLCommons · Benchmark specification · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
