Benchmarking AI Runtimes

Key takeaways

Benchmark the decision you need to make; offline throughput, single-stream latency, server Goodput, and task success are different experiments.
Publish exact model, tokenizer, precision, hardware, runtime, driver, context/output distributions, concurrency, cache state, and warmup.
Raw tokens per second, reported without latency, quality, error rate, and traffic shape, is not a production ranking.
Use Goodput to count work meeting declared latency and quality constraints.
Compare equivalent layers and configurations; do not rank a serving platform against an engine without a scope statement.
Retain raw outputs and configuration as evidence, including failures and variance.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Decision question, workload distribution, artifacts, environment, method, metric definitions, SLO/quality gates, and run controls.

Owns

Experimental control, metric definitions, disclosure, repeatability, quality checks, and fair comparison scope.

Emits

Reproducible configuration, raw results, summaries, uncertainty, interpretation, limitations, and source artifacts.

Does not own

Universal conclusions beyond the tested workload, and vendor claims accepted without reproduction.

Failure modes

Cherry-picking, hidden warm cache, mismatched quantization, client bottleneck, unstable hardware, omitted errors, and unsupported generalization.

Evidence and metrics

TTFT, TPOT/ITL, E2E, throughput, Goodput, errors, queue, memory, power, cost, quality, variance, and confidence intervals.

Define the decision and scenario

Start with the architecture question: interactive latency, maximum SLO-compliant load, local-device fit, cold start, cost, power, or task completion.

Implementation

Choose online/offline, single/multi-stream, streaming/non-streaming, arrival process, duration, and SLOs to match the question.
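One way to keep these choices explicit is to record them in a small scenario object that is checked before any run starts. The sketch below is illustrative only: the `Scenario` class, its field names, and the allowed values are assumptions made for this example, not part of any benchmark specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record of the decision and the run shape chosen to answer it."""
    question: str                 # the architecture decision being informed
    mode: str                     # "online" or "offline"
    streaming: bool
    arrival: str                  # "poisson", "constant", or "closed_loop"
    duration_s: float
    ttft_slo_s: Optional[float] = None
    tpot_slo_s: Optional[float] = None

    def problems(self):
        """Return a list of gaps that would make the result hard to interpret."""
        issues = []
        if not self.question.strip():
            issues.append("no decision question stated")
        if self.mode not in ("online", "offline"):
            issues.append("mode must be 'online' or 'offline'")
        if self.arrival not in ("poisson", "constant", "closed_loop"):
            issues.append("unknown arrival process")
        if self.duration_s <= 0:
            issues.append("duration must be positive")
        if self.mode == "online" and self.ttft_slo_s is None:
            issues.append("online scenario has no TTFT SLO")
        return issues
```

A run harness could refuse to start while `problems()` returns a non-empty list, which makes metric shopping after the fact harder.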

Operational implications

A benchmark without a decision encourages metric shopping.

Measure

Scenario completeness, SLO definition, workload similarity, and excluded conditions.

Workload disclosure

Prompt/input length, output length, batch/concurrency, request arrival, tenant/model mix, adapters, tools, cache hit, and cancellation affect results.

Implementation

Use production distributions or a published synthetic distribution and preserve seeds/fixtures.
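As a minimal sketch of a published synthetic distribution, the generator below derives arrivals and lengths from a single seed so that the same workload can be regenerated later. The log-normal length parameters, the Poisson arrival process, and the names `SyntheticRequest` and `make_workload` are assumptions chosen for illustration; a real benchmark would publish whichever distribution it actually used.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticRequest:
    arrival_s: float          # offset from run start, in seconds
    prompt_tokens: int
    max_output_tokens: int


def make_workload(seed, n_requests, rate_per_s,
                  prompt_mu=6.0, prompt_sigma=0.8,
                  output_mu=5.0, output_sigma=0.6):
    """Build a reproducible request list from one seed (illustrative parameters)."""
    rng = random.Random(seed)     # private generator: no global state involved
    t = 0.0
    requests = []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_s)   # exponential gaps give Poisson arrivals
        prompt = max(1, int(rng.lognormvariate(prompt_mu, prompt_sigma)))
        output = max(1, int(rng.lognormvariate(output_mu, output_sigma)))
        requests.append(SyntheticRequest(t, prompt, output))
    return requests
```

Publishing the seed together with the generator parameters lets a reader rebuild the exact request list instead of trusting a summary of it.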

Operational implications

Constant tiny prompts and fixed output limits can make serving results look unrealistically stable.

Measure

Length distributions, arrivals, concurrency, cache hits, cancellations, and mix.

Artifact and quality equivalence

Model revision, tokenizer, prompt template, precision, quantization, adapter, structured constraints, and generation settings must match or be deliberately compared.

Implementation

Run task/quality evaluation and disclose tolerances, evaluator version, sample size, and failures.
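Before running quality evaluation, it can help to confirm mechanically that the two systems under comparison differ only where intended. The helper below is a sketch under assumed conventions: each artifact is described by a plain dictionary, and the key names in the usage example are hypothetical.

```python
def artifact_mismatches(a, b, deliberate=()):
    """List descriptor keys whose values differ, excluding declared comparisons."""
    keys = sorted(set(a) | set(b))
    return [k for k in keys if a.get(k) != b.get(k) and k not in deliberate]


# Hypothetical descriptors for two runs of a runtime comparison.
run_a = {"model_revision": "rev-1", "tokenizer": "tok-1", "precision": "fp16", "temperature": 0.0}
run_b = {"model_revision": "rev-1", "tokenizer": "tok-1", "precision": "int8", "temperature": 0.0}

unplanned = artifact_mismatches(run_a, run_b)
planned = artifact_mismatches(run_a, run_b, deliberate=("precision",))
```

Here `unplanned` contains `"precision"`, which signals that the comparison is not a pure runtime comparison unless the precision change is declared as the variable under test.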

Operational implications

A smaller or more aggressively quantized model is not a pure runtime speedup.

Measure

Task quality, invalid output, exact/semantic match, numerical tolerance, and refusal/safety rates.

Environment disclosure

Hardware model/count/topology, memory, power mode, firmware, driver, OS, container, runtime, backend, libraries, compiler, and flags define the result.

Implementation

Capture an immutable environment manifest and device telemetry.
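A manifest becomes effectively immutable when its content hash is stored alongside every result. The following sketch uses only the Python standard library; fields such as driver, runtime, and flags are not discoverable this way and are passed in by the caller, and the values shown in the usage example are placeholders rather than real versions.

```python
import hashlib
import json
import platform
import sys


def build_manifest(extra):
    """Return (manifest, sha256 hex digest) for the current host plus caller-supplied fields."""
    manifest = {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        **extra,   # driver, runtime, backend, flags, and so on, supplied by the caller
    }
    # Canonical JSON (sorted keys, fixed separators) so equal content hashes equally.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return manifest, digest


# Placeholder values for illustration only.
manifest, digest = build_manifest({"runtime": "example-runtime 0.0.0", "flags": ["--example"]})
```

Attaching `digest` to each raw result row makes it possible to detect runs that were silently produced under a different environment.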

Operational implications

Cloud instance names can hide hardware revisions or shared-resource effects.

Measure

Version manifest, clock/power, thermals, utilization, topology, and drift.

Warmup and cache state

JIT, engine build, kernel selection, model load, file cache, prefix cache, graph capture, and autotuning create cold/warm regimes.

Implementation

Report model load separately; define warmup requests/tokens/time; explicitly reset or preseed caches.
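A declared warmup can be applied as an explicit split of the per-request record, so that the discarded requests are still reported rather than dropped silently. This is a minimal sketch: `split_phases` and its output keys are hypothetical names, and it assumes latencies are listed in request order.

```python
import statistics


def split_phases(latencies_s, warmup_requests):
    """Separate declared warmup requests from the measured steady-state window."""
    if not 0 < warmup_requests < len(latencies_s):
        raise ValueError("warmup_requests must leave at least one measured request")
    warmup = latencies_s[:warmup_requests]
    measured = latencies_s[warmup_requests:]
    steady_median = statistics.median(measured)
    return {
        "first_request_s": latencies_s[0],
        "warmup_requests": warmup_requests,
        "warmup_total_s": sum(warmup),
        "steady_median_s": steady_median,
        "cold_minus_steady_s": latencies_s[0] - steady_median,
    }


report = split_phases([4.0, 1.0, 0.5, 0.5, 0.5], warmup_requests=2)
```

Model load time is deliberately not part of this function; it would be recorded as its own figure, as the text above recommends.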

Operational implications

Discarding slow startup without disclosure misleads serverless and rolling-update decisions.

Measure

Cold-start time, warmup duration, cache hit rate, first-request versus steady-state delta, and steady-state latency.

Latency metrics

Queue time, TTFT, TPOT/ITL, E2E, tool time, and time-to-task-completion answer different questions.

Implementation

Publish formulas, request boundary, percentiles, weighting, output lengths, and errors.

Operational implications

Means hide tails; E2E changes mechanically with output length; TPOT and ITL are often used inconsistently.

Measure

p50/p90/p95/p99/max, histogram, errors/timeouts, and output length.
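Because TPOT and ITL are defined differently across tools, publishing the formula as code removes ambiguity. The sketch below shows one possible set of definitions, assuming client-side timestamps, a request boundary at the moment the request is sent, and a nearest-rank percentile; other boundaries and percentile methods are equally legitimate as long as they are stated.

```python
import math


def request_latency(sent_s, token_times_s):
    """Per-request metrics from the send time and the arrival time of each token."""
    if not token_times_s:
        raise ValueError("no tokens received; count this request as an error instead")
    ttft = token_times_s[0] - sent_s
    e2e = token_times_s[-1] - sent_s
    # ITL: one gap per pair of successive tokens.
    itl = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    # TPOT here: mean time per token after the first (undefined for one token).
    n_after_first = len(token_times_s) - 1
    tpot = (e2e - ttft) / n_after_first if n_after_first > 0 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "itl_s": itl, "e2e_s": e2e}


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With these definitions E2E equals TTFT plus the sum of the ITL gaps, which is why E2E grows mechanically with output length.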

Throughput and Goodput

Throughput counts work per interval; Goodput counts work that also meets SLO and quality criteria.

Implementation

Use an external load generator, sweep offered load, and report the knee where queue/tail/error rises.
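Given a load sweep, the knee can be reported as the highest offered load that still meets the declared limits. The function below is a sketch of one such rule; the tuple layout `(offered_rps, p99_s, error_rate)` and the choice to stop at the first violation are assumptions of this example.

```python
def max_compliant_load(sweep, p99_slo_s, max_error_rate):
    """Highest offered load such that it and every lower load meet both limits."""
    best = None
    for offered_rps, p99_s, error_rate in sorted(sweep):
        if p99_s > p99_slo_s or error_rate > max_error_rate:
            break   # stop at the first violation rather than trusting later passes
        best = offered_rps
    return best


# Illustrative numbers only, not measurements.
sweep = [(4, 0.9, 0.0), (1, 0.3, 0.0), (2, 0.4, 0.0), (8, 3.5, 0.02)]
knee = max_compliant_load(sweep, p99_slo_s=1.0, max_error_rate=0.01)
```

Returning `None` when even the lowest load violates the limits keeps a failed sweep from being reported as a small positive capacity.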

Operational implications

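An open-loop client keeps sending at the offered rate regardless of responses and can overload the system; a closed-loop client waits for each response before sending again and can hide queueing. State which client model the benchmark used.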
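The difference between the two client models can be seen in a toy simulation. The sketch below assumes a single FIFO server with a fixed service time, which is far simpler than a real batching runtime; it is meant only to show why the same server can report bounded latency under one client model and growing latency under the other.

```python
from collections import deque


def open_loop_latencies(arrivals_s, service_s):
    """Requests arrive on a fixed schedule whether or not earlier ones have finished."""
    free_at = 0.0
    latencies = []
    for arrived in arrivals_s:
        start = max(arrived, free_at)      # wait if the server is still busy
        free_at = start + service_s
        latencies.append(free_at - arrived)
    return latencies


def closed_loop_latencies(n_requests, clients, service_s):
    """Each client sends its next request only when its previous one completes."""
    pending = deque([0.0] * clients)       # submit times of waiting requests
    free_at = 0.0
    latencies = []
    while len(latencies) < n_requests:
        submitted = pending.popleft()
        start = max(submitted, free_at)
        free_at = start + service_s
        latencies.append(free_at - submitted)
        pending.append(free_at)            # the client resubmits on completion
    return latencies


# Server capacity is 1 request/s; the open-loop schedule offers 2 requests/s.
open_lat = open_loop_latencies([0.0, 0.5, 1.0, 1.5], service_s=1.0)
closed_lat = closed_loop_latencies(4, clients=2, service_s=1.0)
```

In this toy model the open-loop latencies keep growing while the schedule exceeds capacity, whereas the closed-loop latencies settle at the number of clients times the service time, because the clients slow themselves down.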

Measure

Offered/achieved load, Goodput, queue, errors, concurrency, active tokens, and saturation.
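The distinction between throughput and Goodput can be made concrete with a small calculation over per-request results. This is a sketch under stated assumptions: the `RequestResult` fields, the choice of TTFT and TPOT as the latency constraints, and a boolean quality gate are illustrative, and a real benchmark would substitute its own declared thresholds.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ok: bool               # completed without error or timeout
    ttft_s: float
    tpot_s: float
    output_tokens: int
    quality_pass: bool     # outcome of the declared quality check


def throughput_and_goodput(results, duration_s, ttft_slo_s, tpot_slo_s):
    """Throughput counts completed work; Goodput counts completed work within limits."""
    completed = [r for r in results if r.ok]
    good = [r for r in completed
            if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s and r.quality_pass]
    return {
        "throughput_req_s": len(completed) / duration_s,
        "goodput_req_s": len(good) / duration_s,
        "throughput_tok_s": sum(r.output_tokens for r in completed) / duration_s,
        "goodput_tok_s": sum(r.output_tokens for r in good) / duration_s,
        "error_rate": (len(results) - len(completed)) / len(results) if results else 0.0,
    }


# Illustrative results: one good, one too slow, one failing quality, one error.
results = [
    RequestResult(True, 0.2, 0.03, 100, True),
    RequestResult(True, 0.9, 0.03, 100, True),
    RequestResult(True, 0.2, 0.03, 50, False),
    RequestResult(False, 0.0, 0.0, 0, False),
]
summary = throughput_and_goodput(results, duration_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
```

In this example three requests complete but only one meets every constraint, so Goodput is a third of throughput by request count.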

Memory, cost, and energy

Performance includes peak/resident memory, cache capacity, accelerator/host utilization, power, energy, and total cost.

Implementation

Report allocation and process/device memory; include idle reserve, storage/network, tools, observability, failures, and pricing date/region.

Operational implications

Cost per token can hide quality, request success, output length, and idle capacity.

Measure

Peak/resident bytes, energy/task, cost/success, utilization, and unallocated cost.
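Cost per success and energy per task are simple ratios, but the denominators and the inclusion of idle time are where reports tend to differ. The sketch below assumes a flat hourly rate and an average power reading over the run; the function names and the numbers in the usage example are illustrative, not measurements or real prices.

```python
def cost_per_success(hourly_rate, busy_hours, idle_hours, successes):
    """Total spend, including idle reserve, divided by successful tasks only."""
    if successes <= 0:
        raise ValueError("no successful tasks; cost per success is undefined")
    return hourly_rate * (busy_hours + idle_hours) / successes


def energy_per_task_j(avg_power_w, duration_s, tasks):
    """Energy in joules per task: average power (W) times duration (s) over task count."""
    if tasks <= 0:
        raise ValueError("no tasks; energy per task is undefined")
    return avg_power_w * duration_s / tasks


# Illustrative inputs only.
cost = cost_per_success(hourly_rate=4.0, busy_hours=2.0, idle_hours=0.5, successes=1000)
energy = energy_per_task_j(avg_power_w=300.0, duration_s=60.0, tasks=90)
```

Dividing by successes rather than by requests or tokens is what keeps failed or low-quality work from lowering the reported unit cost.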

Repetition and uncertainty

System noise, compilation, thermals, autoscaling, network, and request randomness create variance.

Implementation

Run independent repetitions, stabilize environment, retain per-request raw data, and publish spread/confidence where useful.

Operational implications

A single best run is not a benchmark.

Measure

Run count, median/mean, standard deviation/CI, outliers, and drift.
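When only a handful of independent runs are available, a percentile bootstrap is one way to attach an interval to the summary statistic. The sketch below is a minimal version using the standard library; the resample count, the 95% level, and the example values are assumptions, and the interval is only as trustworthy as the independence of the runs.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile-bootstrap interval for a statistic over independent runs."""
    if not values:
        raise ValueError("no runs")
    rng = random.Random(seed)   # seeded so the reported interval is reproducible
    estimates = sorted(
        stat(rng.choices(values, k=len(values)))   # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Illustrative per-run throughput values, not measurements.
runs = [101.0, 98.5, 103.2, 99.1, 100.4, 97.8, 102.6]
low, high = bootstrap_ci(runs)
```

Reporting the run count next to the interval matters, since a bootstrap over very few runs can look precise while saying little.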

Interpretation and claims

Conclusions must be bounded to tested versions, hardware, workload, SLO, and quality.

Implementation

State what the result supports, what it does not, and which variables remain uncontrolled.

Operational implications

Avoid “fastest,” “production-ready,” or “industry standard” unless scope and evidence justify them.

Measure

Claim-to-evidence traceability, limitations, reproduction status, and review date.

Reference tables

Benchmark disclosure minimum
Area	Required disclosure
Model/artifact	Exact revision, tokenizer, precision/quantization, adapter, hash
Hardware	Model/count/topology, memory, power mode, cloud instance details
Software	OS/container, driver, runtime/backend, libraries/compiler, flags
Workload	Input/output distributions, arrivals, concurrency, cache, tools
Method	Warmup, run duration, repetitions, client model, sampling
Metrics	Definitions/boundaries, percentiles, errors, Goodput thresholds
Quality	Dataset, evaluator, sample size, criteria, regressions
Cost/power	Date/region/rates, idle reserve, energy method
Evidence	Raw results, config, scripts, logs, review date

Metric definitions
Metric	Definition	Report with
TTFT	Request boundary to first delivered token	Prompt length, queue/cache, percentile
TPOT	Average post-first-token duration per output token	Formula, output length, percentile
ITL	Gap between successive token events	Weighting and distribution
E2E	Request to final result	Output length, errors/timeouts
Throughput	Work completed per interval	Latency/error/load
Goodput	Work meeting SLO and quality	Explicit thresholds
Task success	Correct outcome per task	Evaluator, side effects, cost/latency

Decision checklist

What architecture decision will this benchmark inform?
Is the workload production-shaped and fully disclosed?
Are model, tokenizer, precision, and quality equivalent?
Are cold, warm, and cache states explicit?
Are queue, TTFT, TPOT/ITL, E2E, errors, and Goodput defined?
Is the load generator external and non-bottlenecking?
Are memory, power, cost, and failures included?
Are repeated runs and raw evidence retained?
Are conclusions limited to the tested scope?

Common mistakes

Copying vendor benchmark numbers into a cross-vendor ranking.
Comparing different model revisions or quantization as if only the runtime changed.
Using average latency without tails or errors.
Reporting throughput from a closed-loop client with no offered-load context.
Hiding warmup or prefix-cache state.
Ignoring quality regression.
Running a CPU-bound load generator on the same host as the system under test.
Publishing only the best run.
Quoting cloud prices without the date and region they apply to.

Sources and further reading

MLPerf Inference: Datacenter
MLCommons · Benchmark specification · accessed 2026-06-21 UTC
MLPerf Inference: Edge
MLCommons · Benchmark specification · accessed 2026-06-21 UTC
vLLM benchmarking
vLLM · Official documentation · accessed 2026-06-21 UTC
Triton Performance Analyzer
NVIDIA · Official documentation · accessed 2026-06-21 UTC
Reproducibility and Replicability in Science
National Academies · Authoritative report · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
