Key takeaways
- Benchmark the decision you need to make; offline throughput, single-stream latency, server Goodput, and task success are different experiments.
- Publish exact model, tokenizer, precision, hardware, runtime, driver, context/output distributions, concurrency, cache state, and warmup.
- Raw tokens per second, reported without latency, quality, error rate, and traffic shape, is not a production ranking.
- Use Goodput to count work meeting declared latency and quality constraints.
- Compare equivalent layers and configurations; do not rank a serving platform against an engine without a scope statement.
- Retain raw outputs and configuration as evidence, including failures and variance.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Decision question, workload distribution, artifacts, environment, method, metric definitions, SLO/quality gates, and run controls.
Owns
Experimental control, metric definitions, disclosure, repeatability, quality checks, and fair comparison scope.
Emits
Reproducible configuration, raw results, summaries, uncertainty, interpretation, limitations, and source artifacts.
Does not own
Universal conclusions outside the tested workload or vendor claims accepted without reproduction.
Failure modes
Cherry-picking, hidden warm cache, mismatched quantization, client bottleneck, unstable hardware, omitted errors, and unsupported generalization.
Evidence and metrics
TTFT, TPOT/ITL, E2E, throughput, Goodput, errors, queue, memory, power, cost, quality, variance, and confidence intervals.
Define the decision and scenario
Start with the architecture question: interactive latency, maximum SLO-compliant load, local-device fit, cold start, cost, power, or task completion.
Implementation
Choose online/offline, single/multi-stream, streaming/non-streaming, arrival process, duration, and SLOs to match the question.
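One way to keep these choices explicit is to declare them in a single record that is published with the results. The sketch below is illustrative only: the `Scenario` class, its field names, and the example values are assumptions, not part of any benchmark standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record of the scenario choices for one benchmark run."""
    decision: str                 # the architecture question this run informs
    mode: str                     # "online" or "offline"
    streams: str                  # "single" or "multi"
    streaming: bool               # whether tokens are streamed to the client
    arrival: str                  # e.g. "poisson", "constant", "closed-loop"
    duration_s: float             # measured duration, excluding warmup
    slo_ttft_s: Optional[float] = None
    slo_e2e_s: Optional[float] = None

    def __post_init__(self):
        if self.mode not in ("online", "offline"):
            raise ValueError(f"unknown mode: {self.mode}")
        if self.streams not in ("single", "multi"):
            raise ValueError(f"unknown streams: {self.streams}")
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")

    def to_json(self) -> str:
        # Sorted keys keep the published file stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


scenario = Scenario(
    decision="maximum SLO-compliant load",
    mode="online",
    streams="multi",
    streaming=True,
    arrival="poisson",
    duration_s=600.0,
    slo_ttft_s=0.5,
    slo_e2e_s=10.0,
)
```

Writing `scenario.to_json()` next to the raw results makes it harder to swap the question after the numbers are in.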
Operational implications
A benchmark without a decision encourages metric shopping.
Measure
Scenario completeness, SLO definition, workload similarity, and excluded conditions.
Workload disclosure
Prompt/input length, output length, batch/concurrency, request arrival, tenant/model mix, adapters, tools, cache hit, and cancellation affect results.
Implementation
Use production distributions or a published synthetic distribution and preserve seeds/fixtures.
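A published synthetic distribution can be as small as a seeded generator whose parameters are disclosed. The following is a minimal sketch, assuming Poisson arrivals and log-normal token lengths; the function name, the default parameters, and the 8192-token cap are illustrative assumptions, not recommended values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival_s: float         # offset from the start of the run
    input_tokens: int
    max_output_tokens: int


def synthetic_workload(seed, n, rate_per_s,
                       in_mu=6.0, in_sigma=1.0,
                       out_mu=5.0, out_sigma=0.8,
                       max_tokens=8192):
    """Generate n requests reproducibly from a seed (all parameters assumed)."""
    rng = random.Random(seed)  # private RNG so global state cannot leak in
    t = 0.0
    requests = []
    for _ in range(n):
        # Exponential gaps between requests give a Poisson arrival process.
        t += rng.expovariate(rate_per_s)
        inp = min(max_tokens, max(1, round(rng.lognormvariate(in_mu, in_sigma))))
        out = min(max_tokens, max(1, round(rng.lognormvariate(out_mu, out_sigma))))
        requests.append(Request(t, inp, out))
    return requests
```

Publishing the seed and parameters, or the generated fixture itself, lets others replay the same request sequence.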
Operational implications
Constant tiny prompts and fixed output limits can make serving results look unrealistically stable.
Measure
Length distributions, arrivals, concurrency, cache hits, cancellations, and mix.
Artifact and quality equivalence
Model revision, tokenizer, prompt template, precision, quantization, adapter, structured constraints, and generation settings must match or be deliberately compared.
Implementation
Run task/quality evaluation and disclose tolerances, evaluator version, sample size, and failures.
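Before comparing two runs, it can help to diff their artifact settings mechanically so that any mismatch is either fixed or disclosed. This is a sketch under assumed field names; `ARTIFACT_KEYS` and the example configurations are hypothetical.

```python
# Hypothetical field names; adapt them to whatever the run configuration records.
ARTIFACT_KEYS = (
    "model_revision", "tokenizer", "prompt_template", "precision",
    "quantization", "adapter", "temperature", "max_output_tokens",
)


def artifact_differences(a, b, keys=ARTIFACT_KEYS):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


baseline = {"model_revision": "rev-a", "tokenizer": "tok-1",
            "precision": "fp16", "quantization": None}
candidate = {"model_revision": "rev-a", "tokenizer": "tok-1",
             "precision": "fp16", "quantization": "int4"}

diff = artifact_differences(baseline, candidate)
# diff == {"quantization": (None, "int4")}
```

A non-empty result does not invalidate the comparison, but it means the difference is part of what is being compared and should appear in the write-up.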
Operational implications
A smaller or more aggressively quantized model is not a pure runtime speedup.
Measure
Task quality, invalid output, exact/semantic match, numerical tolerance, and refusal/safety rates.
Environment disclosure
Hardware model/count/topology, memory, power mode, firmware, driver, OS, container, runtime, backend, libraries, compiler, and flags define the result.
Implementation
Capture an immutable environment manifest and device telemetry.
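A manifest can start with what the Python standard library can observe and attach a content hash so later edits are detectable. The sketch below covers only the OS, machine type, Python version, and installed package versions; driver, firmware, topology, and device telemetry need vendor tools and are not shown. The function name is an assumption.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_manifest(packages=()):
    """Return (manifest, sha256_hex) for the current Python environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record absence instead of failing
    manifest = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }
    # Canonical JSON (sorted keys) so the same content gives the same hash.
    canonical = json.dumps(manifest, sort_keys=True)
    return manifest, hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the hash alongside the results gives a cheap check that a reported manifest matches the one captured at run time.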
Operational implications
Cloud instance names can hide hardware revisions or shared-resource effects.
Measure
Version manifest, clock/power, thermals, utilization, topology, and drift.
Warmup and cache state
JIT, engine build, kernel selection, model load, file cache, prefix cache, graph capture, and autotuning create cold/warm regimes.
Implementation
Report model load separately; define warmup requests/tokens/time; explicitly reset or preseed caches.
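The separation described above can be sketched as a small harness that times model load, warmup requests, and measured requests as three distinct phases. `load_model` and `send_request` are hypothetical callables supplied by the caller; cache reset or preseeding is assumed to happen inside them or before the call.

```python
import time


def run_phases(load_model, send_request, warmup_requests, measured_requests):
    """Time load, warmup, and measured phases separately.

    load_model() returns a handle; send_request(handle, index) issues one
    request and blocks until it completes. Both are placeholders.
    """
    start = time.perf_counter()
    handle = load_model()
    load_s = time.perf_counter() - start

    def timed(index):
        t0 = time.perf_counter()
        send_request(handle, index)
        return time.perf_counter() - t0

    warmup_s = [timed(i) for i in range(warmup_requests)]
    measured_s = [timed(warmup_requests + i) for i in range(measured_requests)]
    # Warmup timings are kept, not discarded, so they can be disclosed.
    return {"load_s": load_s, "warmup_s": warmup_s, "measured_s": measured_s}
```

Keeping `warmup_s` in the output makes the cold-to-warm transition visible instead of silently dropping it.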
Operational implications
Discarding slow startup without disclosure misleads serverless and rolling-update decisions.
Measure
Cold start, warmup, cache hit, the delta between the first measured request and steady state, and steady state.
Latency metrics
Queue time, TTFT, TPOT/ITL, E2E, tool time, and time-to-task-completion answer different questions.
Implementation
Publish formulas, request boundary, percentiles, weighting, output lengths, and errors.
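As an example of publishing formulas, the sketch below computes per-request metrics from client-side timestamps. It assumes the request boundary is the moment the client sends the request, that TPOT is the post-first-token duration divided by the number of subsequent tokens, and that percentiles use the nearest-rank method. Other definitions are in use, which is why the chosen one needs to be stated.

```python
import math


def latency_metrics(request_sent_s, token_times_s):
    """Per-request TTFT, E2E, ITL gaps, and TPOT from token arrival times."""
    if not token_times_s:
        raise ValueError("no tokens delivered; count this request as an error")
    ttft = token_times_s[0] - request_sent_s
    e2e = token_times_s[-1] - request_sent_s
    # ITL: one gap per pair of successive token events.
    itl = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    n = len(token_times_s)
    # TPOT: average time per token after the first; undefined for one token.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else None
    return {"ttft_s": ttft, "e2e_s": e2e, "itl_s": itl, "tpot_s": tpot}


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100 (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


m = latency_metrics(0.0, [0.5, 0.6, 0.8, 1.1])
# TTFT 0.5 s, E2E 1.1 s, three ITL gaps, TPOT about 0.2 s
```

Note that TPOT here is a per-request average while ITL is a distribution of individual gaps, so their percentiles are not interchangeable.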
Operational implications
Means hide tails; E2E changes mechanically with output length; TPOT and ITL are often used inconsistently.
Measure
p50/p90/p95/p99/max, histogram, errors/timeouts, and output length.
Throughput and Goodput
Throughput counts work per interval; Goodput counts work that also meets SLO and quality criteria.
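The distinction can be made concrete with a small calculation over per-request results. This is a minimal sketch, assuming work is counted in requests and that the SLO is a pair of TTFT and E2E limits; the `Result` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Result:
    ok: bool            # request completed without error or timeout
    ttft_s: float
    e2e_s: float
    quality_pass: bool  # output passed the declared quality check


def throughput_and_goodput(results, window_s, slo_ttft_s, slo_e2e_s):
    """Requests per second, with and without the SLO and quality filters."""
    completed = sum(1 for r in results if r.ok)
    good = sum(
        1 for r in results
        if r.ok and r.quality_pass
        and r.ttft_s <= slo_ttft_s and r.e2e_s <= slo_e2e_s
    )
    return {
        "throughput_rps": completed / window_s,
        "goodput_rps": good / window_s,
        "good_fraction": good / len(results) if results else 0.0,
    }


results = [
    Result(True, 0.2, 1.0, True),    # counts toward Goodput
    Result(True, 0.9, 1.0, True),    # completed, but TTFT misses the SLO
    Result(False, 0.1, 0.5, True),   # failed request
    Result(True, 0.2, 1.0, False),   # completed, but fails the quality check
]
summary = throughput_and_goodput(results, window_s=2.0,
                                 slo_ttft_s=0.5, slo_e2e_s=2.0)
# throughput 1.5 req/s, Goodput 0.5 req/s, good fraction 0.25
```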
Implementation
Use an external load generator, sweep offered load, and report the knee where queue/tail/error rises.
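Once a sweep has been collected, one simple way to summarize it is the highest offered load that still meets the limits. This is a simplification of finding the knee, which in practice is usually read from the full curve. The dictionary keys and thresholds below are assumptions.

```python
def max_compliant_load(sweep, p99_limit_s, max_error_rate):
    """Highest offered load before the first point that violates a limit.

    sweep is a list of dicts with hypothetical keys
    "offered_rps", "p99_s", and "error_rate".
    """
    compliant = None
    for point in sorted(sweep, key=lambda p: p["offered_rps"]):
        if point["p99_s"] > p99_limit_s or point["error_rate"] > max_error_rate:
            break  # stop at the first violation; later points are past the knee
        compliant = point["offered_rps"]
    return compliant


sweep = [
    {"offered_rps": 1, "p99_s": 0.4, "error_rate": 0.0},
    {"offered_rps": 2, "p99_s": 0.6, "error_rate": 0.0},
    {"offered_rps": 4, "p99_s": 2.5, "error_rate": 0.02},
    {"offered_rps": 8, "p99_s": 9.0, "error_rate": 0.30},
]
knee = max_compliant_load(sweep, p99_limit_s=1.0, max_error_rate=0.01)
# knee == 2
```

Reporting the whole sweep alongside this single number lets readers apply their own limits.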
Operational implications
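An open-loop client issues requests on a schedule regardless of completions, so it can overload the system; a closed-loop client waits for responses before sending more, so it can hide queueing. State which client model was used.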
Measure
Offered/achieved load, Goodput, queue, errors, concurrency, active tokens, and saturation.
Memory, cost, and energy
Performance includes peak/resident memory, cache capacity, accelerator/host utilization, power, energy, and total cost.
Implementation
Report allocated memory alongside process and device memory; in cost figures, include idle reserve, storage/network, tools, observability, failures, and the pricing date and region.
Operational implications
Cost per token can hide quality, request success, output length, and idle capacity.
Measure
Peak/resident bytes, energy/task, cost/success, utilization, and unallocated cost.
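Cost per success and energy per task are simple ratios, but writing them out shows what the denominator is. The sketch below assumes cost is charged for all provisioned hours, idle time included, and that energy is mean power multiplied by duration; the function names and example figures are illustrative, not real prices.

```python
def cost_per_success(hourly_rate, provisioned_hours, successes):
    """Total spend, idle time included, divided by successful tasks."""
    if successes == 0:
        return None  # no successes: the ratio is undefined, not zero
    return hourly_rate * provisioned_hours / successes


def energy_per_task_j(mean_power_w, duration_s, tasks):
    """Energy in joules per task, from mean power over the measured window."""
    if tasks == 0:
        return None
    return mean_power_w * duration_s / tasks


cost = cost_per_success(hourly_rate=4.0, provisioned_hours=2.0, successes=1000)
energy = energy_per_task_j(mean_power_w=300.0, duration_s=60.0, tasks=100)
# cost is about 0.008 per success; energy is 180 J per task
```

Dividing by successes instead of tokens keeps failed or low-quality requests from making a configuration look cheaper.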
Repetition and uncertainty
System noise, compilation, thermals, autoscaling, network, and request randomness create variance.
Implementation
Run independent repetitions, stabilize environment, retain per-request raw data, and publish spread/confidence where useful.
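For spread across repetitions, a percentile bootstrap is one common option when the number of runs is small and the distribution is unknown. The sketch below is one possible approach, not a prescribed method. The resample count, confidence level, and sample values are illustrative assumptions.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median,
                 n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for a statistic of per-run results."""
    rng = random.Random(seed)  # seeded so the interval itself is reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + confidence) / 2 * n_resamples))
    return stats[lo_idx], stats[hi_idx]


# Made-up per-run throughput values, for illustration only.
runs = [101.2, 98.7, 103.4, 99.9, 100.5, 97.8, 102.1]
low, high = bootstrap_ci(runs)
```

With only a handful of runs the interval will be wide, which is itself worth reporting.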
Operational implications
A single best run is not a benchmark.
Measure
Run count, median/mean, standard deviation/CI, outliers, and drift.
Interpretation and claims
Conclusions must be bounded to tested versions, hardware, workload, SLO, and quality.
Implementation
State what the result supports, what it does not, and which variables remain uncontrolled.
Operational implications
Avoid “fastest,” “production-ready,” or “industry standard” unless scope and evidence justify them.
Measure
Claim-to-evidence traceability, limitations, reproduction status, and review date.
Reference tables
| Area | Required disclosure |
|---|---|
| Model/artifact | Exact revision, tokenizer, precision/quantization, adapter, hash |
| Hardware | Model/count/topology, memory, power mode, cloud instance details |
| Software | OS/container, driver, runtime/backend, libraries/compiler, flags |
| Workload | Input/output distributions, arrivals, concurrency, cache, tools |
| Method | Warmup, run duration, repetitions, client model, sampling |
| Metrics | Definitions/boundaries, percentiles, errors, Goodput thresholds |
| Quality | Dataset, evaluator, sample size, criteria, regressions |
| Cost/power | Date/region/rates, idle reserve, energy method |
| Evidence | Raw results, config, scripts, logs, review date |

| Metric | Definition | Report with |
|---|---|---|
| TTFT | Request boundary to first delivered token | Prompt length, queue/cache, percentile |
| TPOT | Average post-first-token duration per output token | Formula, output length, percentile |
| ITL | Gap between successive token events | Weighting and distribution |
| E2E | Request to final result | Output length, errors/timeouts |
| Throughput | Work completed per interval | Latency/error/load |
| Goodput | Work meeting SLO and quality | Explicit thresholds |
| Task success | Correct outcome per task | Evaluator, side effects, cost/latency |
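The required-disclosure table can be turned into a mechanical completeness check on a report before it is published. This is a sketch only: the area and field names below are hypothetical keys loosely derived from the table, and cover only part of it.

```python
# Hypothetical schema derived from part of the disclosure table.
REQUIRED_DISCLOSURES = {
    "model": ("revision", "tokenizer", "precision", "hash"),
    "hardware": ("model", "count", "topology", "memory", "power_mode"),
    "software": ("os", "driver", "runtime", "flags"),
    "workload": ("input_lengths", "output_lengths", "arrivals", "concurrency"),
    "method": ("warmup", "duration_s", "repetitions", "client_model"),
}


def missing_disclosures(report, required=REQUIRED_DISCLOSURES):
    """Return "area.field" for each required field that is absent or None."""
    missing = []
    for area, fields in required.items():
        section = report.get(area) or {}
        for field in fields:
            if section.get(field) is None:
                missing.append(f"{area}.{field}")
    return missing


report = {
    "model": {"revision": "rev-a", "tokenizer": "tok-1",
              "precision": "fp16", "hash": "abc123"},
    "hardware": {"model": "gpu-x", "count": 8, "topology": "single node",
                 "memory": "80 GB"},
}
gaps = missing_disclosures(report)
# gaps starts with "hardware.power_mode", then every software,
# workload, and method field
```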
Decision checklist
- What architecture decision will this benchmark inform?
- Is the workload production-shaped and fully disclosed?
- Are model, tokenizer, precision, and quality equivalent?
- Are cold, warm, and cache states explicit?
- Are queue, TTFT, TPOT/ITL, E2E, errors, and Goodput defined?
- Is the load generator external and non-bottlenecking?
- Are memory, power, cost, and failures included?
- Are repeated runs and raw evidence retained?
- Are conclusions limited to the tested scope?
Common mistakes
- Copying vendor benchmark numbers into a cross-vendor ranking.
- Comparing different model revisions or quantization as if only the runtime changed.
- Using average latency without tails or errors.
- Reporting throughput from a closed-loop client with no offered-load context.
- Hiding warmup or prefix-cache state.
- Ignoring quality regression.
- Running a CPU-bound load generator on the same host as the system under test.
- Publishing only the best run.
- Quoting cloud prices without stating the date and region.
Sources and further reading
- MLPerf Inference: Datacenter
- MLPerf Inference: Edge
- vLLM benchmarking
- Triton Performance Analyzer
- Reproducibility and Replicability in Science
Last reviewed: 2026-06-21 UTC
