Benchmarking AI Runtimes

Benchmark AI runtimes correctly by covering time to first token (TTFT), time per output token (TPOT), inter-token latency (ITL), Goodput, throughput, tail latency, quality, concurrency, cache state, hardware, methodology, cost, power, and reproducibility.

Audience: Technical readers · Reading time: 6 minutes · Status: Production guidance · Last reviewed: 2026-06-21 UTC

Key takeaways

  • Design the benchmark around the decision you need to make; offline throughput, single-stream latency, server Goodput, and task success are different experiments.
  • Publish exact model, tokenizer, precision, hardware, runtime, driver, context/output distributions, concurrency, cache state, and warmup.
  • Raw tokens per second without latency, quality, error, and traffic shape is not a production ranking.
  • Use Goodput to count work meeting declared latency and quality constraints.
  • Compare equivalent layers and configurations; do not rank a serving platform against an engine without a scope statement.
  • Retain raw outputs and configuration as evidence, including failures and variance.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Decision question, workload distribution, artifacts, environment, method, metric definitions, SLO/quality gates, and run controls.

Owns

Experimental control, metric definitions, disclosure, repeatability, quality checks, and fair comparison scope.

Emits

Reproducible configuration, raw results, summaries, uncertainty, interpretation, limitations, and source artifacts.

Does not own

Universal conclusions outside the tested workload or vendor claims accepted without reproduction.

Failure modes

Cherry-picking, hidden warm cache, mismatched quantization, client bottleneck, unstable hardware, omitted errors, and unsupported generalization.

Evidence and metrics

TTFT, TPOT/ITL, E2E, throughput, Goodput, errors, queue, memory, power, cost, quality, variance, and confidence intervals.

Define the decision and scenario

Start with the architecture question: interactive latency, maximum SLO-compliant load, local-device fit, cold start, cost, power, or task completion.

Implementation

Choose online/offline, single/multi-stream, streaming/non-streaming, arrival process, duration, and SLOs to match the question.

Operational implications

A benchmark without a decision encourages metric shopping.

Measure

Scenario completeness, SLO definition, workload similarity, and excluded conditions.
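
One way to make the scenario explicit before any run is to write it down as a small record and check it for gaps. The sketch below is illustrative only: the `Scenario` class, its field names, and the completeness rules are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record of one benchmark scenario (field names are illustrative)."""
    decision: str            # the architecture question being answered
    mode: str                # "online" or "offline"
    streaming: bool
    arrival_process: str     # e.g. "poisson", "constant", "closed-loop"
    duration_s: float
    slo_ms: dict = field(default_factory=dict)  # e.g. {"ttft_p95": 500}
    excluded: tuple = ()     # conditions deliberately left untested

    def problems(self) -> list:
        """Return the reasons this scenario is incomplete (empty list if none)."""
        issues = []
        if not self.decision.strip():
            issues.append("no decision question stated")
        if self.mode not in ("online", "offline"):
            issues.append(f"unknown mode: {self.mode!r}")
        if self.mode == "online" and not self.slo_ms:
            issues.append("online scenario without SLOs")
        if self.duration_s <= 0:
            issues.append("duration must be positive")
        return issues
```

Publishing this record next to the results lets a reader see which question the numbers were meant to answer and which conditions were excluded.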

Workload disclosure

Prompt/input length, output length, batch/concurrency, request arrival, tenant/model mix, adapters, tools, cache hit, and cancellation affect results.

Implementation

Use production distributions or a published synthetic distribution and preserve seeds/fixtures.

Operational implications

Constant tiny prompts and fixed output limits can make serving results look unrealistically stable.

Measure

Length distributions, arrivals, concurrency, cache hits, cancellations, and mix.
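
When production traces cannot be shared, a seeded synthetic distribution keeps the workload reproducible. The following is a minimal sketch, assuming log-normal token lengths and Poisson arrivals; the function name and every default parameter are illustrative choices, not recommended values.

```python
import random


def synthetic_workload(n, seed, rate_per_s, in_mu=6.0, in_sigma=0.8,
                       out_mu=5.0, out_sigma=0.6, max_tokens=8192):
    """Generate n requests with log-normal lengths and Poisson arrivals.

    All distribution parameters here are assumptions for illustration.
    """
    rng = random.Random(seed)  # private RNG, so the seed alone fixes the trace
    t = 0.0
    requests = []
    for i in range(n):
        t += rng.expovariate(rate_per_s)  # exponential gaps give Poisson arrivals
        requests.append({
            "id": i,
            "arrival_s": round(t, 6),
            "input_tokens": min(max_tokens, max(1, int(rng.lognormvariate(in_mu, in_sigma)))),
            "output_tokens": min(max_tokens, max(1, int(rng.lognormvariate(out_mu, out_sigma)))),
        })
    return requests
```

Publishing the seed and parameters alongside the generated fixture lets others regenerate the same trace or vary one parameter at a time.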

Artifact and quality equivalence

Model revision, tokenizer, prompt template, precision, quantization, adapter, structured constraints, and generation settings must match or be deliberately compared.

Implementation

Run task/quality evaluation and disclose tolerances, evaluator version, sample size, and failures.

Operational implications

A smaller or more aggressively quantized model is not a pure runtime speedup.

Measure

Task quality, invalid output, exact/semantic match, numerical tolerance, and refusal/safety rates.
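
A simple guard against accidental artifact mismatch is to fingerprint each run's artifact settings and list the fields that differ between two runs. This is a sketch under the assumption that the settings fit in a JSON-serializable dictionary; the helper names and the example keys are hypothetical.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def differing_fields(a: dict, b: dict) -> dict:
    """Map each field that differs to its (a, b) pair of values."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical artifact settings for two runs being compared.
run_a = {"model_revision": "rev-a", "tokenizer": "tok-1", "quantization": "fp16"}
run_b = {"model_revision": "rev-a", "tokenizer": "tok-1", "quantization": "int8"}

if fingerprint(run_a) != fingerprint(run_b):
    # Any difference found here must be a deliberate, disclosed comparison.
    print(differing_fields(run_a, run_b))
```

In this example the only differing field is the quantization, so the comparison measures a precision change as well as any runtime difference.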

Environment disclosure

Hardware model/count/topology, memory, power mode, firmware, driver, OS, container, runtime, backend, libraries, compiler, and flags define the result.

Implementation

Capture an immutable environment manifest and device telemetry.

Operational implications

Cloud instance names can hide hardware revisions or shared-resource effects.

Measure

Version manifest, clock/power, thermals, utilization, topology, and drift.
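
The manifest can be captured in a few lines and hashed so later edits are detectable. The sketch below uses only the Python standard library, which does not report driver, firmware, or accelerator details; those are assumed to be collected separately by the caller and passed in through the hypothetical `extra` argument.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_manifest(packages=(), extra=None):
    """Collect versions into a dict and attach a SHA-256 of its contents."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record absence rather than guess
    manifest = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
        "extra": dict(extra or {}),  # caller-supplied: driver, firmware, flags
    }
    body = json.dumps(manifest, sort_keys=True)
    manifest["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return manifest
```

Storing the manifest with each result set makes drift between runs visible as a changed hash.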

Warmup and cache state

JIT, engine build, kernel selection, model load, file cache, prefix cache, graph capture, and autotuning create cold/warm regimes.

Implementation

Report model load separately; define warmup requests/tokens/time; explicitly reset or preseed caches.

Operational implications

Discarding slow startup without disclosure misleads serverless and rolling-update decisions.

Measure

Cold start, warmup, cache hit, first measured delta, and steady state.
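
Keeping the cold, warmup, and steady-state regimes separate in the harness makes it harder to discard startup cost silently. A minimal sketch follows, assuming `call` is a hypothetical zero-argument function that issues one request; model load would be timed separately, before this function runs.

```python
import time


def measure_regimes(call, warmup=3, measured=10, clock=time.perf_counter):
    """Time the first call, the warmup calls, and the measured calls separately."""
    def timed():
        start = clock()
        call()
        return clock() - start

    cold = timed()                                # first request after startup
    warm = [timed() for _ in range(warmup)]       # disclosed, reported apart
    steady = [timed() for _ in range(measured)]   # the regime usually summarized
    return {"cold_s": cold, "warmup_s": warm, "steady_s": steady}
```

All three fields are returned so a report can show how much time the warmup phase absorbed instead of omitting it.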

Latency metrics

Queue time, TTFT, TPOT/ITL, E2E, tool time, and time-to-task-completion answer different questions.

Implementation

Publish formulas, request boundary, percentiles, weighting, output lengths, and errors.

Operational implications

Means hide tails; E2E changes mechanically with output length; TPOT and ITL are often used inconsistently.

Measure

p50/p90/p95/p99/max, histogram, errors/timeouts, and output length.
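
Because TPOT and ITL are defined differently across tools, it helps to publish the formulas as code. The sketch below assumes client-side timestamps for each delivered token and treats TPOT as the average post-first-token duration per output token; the function names and the nearest-rank percentile method are choices made for this example.

```python
import math


def token_latency(request_start, token_times):
    """Derive TTFT, TPOT, ITL, and E2E from one request's token timestamps."""
    if not token_times:
        raise ValueError("no tokens delivered; record as an error, not a latency")
    ttft = token_times[0] - request_start
    e2e = token_times[-1] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    n = len(token_times)
    # TPOT is undefined for a single-token response.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else None
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Under this definition TPOT equals the mean of the ITL gaps for a request, so the two diverge only in how they are aggregated across requests, which is why the weighting needs to be stated.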

Throughput and Goodput

Throughput counts work per interval; Goodput counts work that also meets SLO and quality criteria.

Implementation

Use an external load generator, sweep offered load, and report the knee where queue/tail/error rises.

Operational implications

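An open-loop client issues requests on a schedule regardless of responses, so it can overload the system; a closed-loop client waits for each response before sending the next, so it can hide queueing. State which client model was used.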

Measure

Offered/achieved load, Goodput, queue, errors, concurrency, active tokens, and saturation.
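
The difference between throughput and Goodput can be shown with a few per-request records. This is a sketch under assumed record fields (`ok`, `ttft_s`, `tpot_s`, `quality_pass`) and assumed SLO thresholds; a real report would state its own thresholds explicitly.

```python
def goodput(records, window_s, ttft_slo_s, tpot_slo_s):
    """Compare completed requests per second with SLO- and quality-compliant ones."""
    completed = [r for r in records if r["ok"]]
    good = [
        r for r in completed
        if r["ttft_s"] <= ttft_slo_s
        and r["tpot_s"] <= tpot_slo_s
        and r["quality_pass"]
    ]
    return {
        "throughput_rps": len(completed) / window_s,
        "goodput_rps": len(good) / window_s,
        "error_rate": 1 - len(completed) / len(records) if records else 0.0,
    }


# Hypothetical records from a 2-second measurement window.
records = [
    {"ok": False},                                                      # error
    {"ok": True, "ttft_s": 0.9, "tpot_s": 0.03, "quality_pass": True},  # TTFT too slow
    {"ok": True, "ttft_s": 0.2, "tpot_s": 0.03, "quality_pass": False}, # failed quality
    {"ok": True, "ttft_s": 0.2, "tpot_s": 0.03, "quality_pass": True},  # counts as good
]
summary = goodput(records, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
```

Here three requests complete but only one meets both the latency and quality constraints, so Goodput is a third of throughput. Repeating the calculation at each offered load in a sweep shows where Goodput stops tracking throughput.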

Memory, cost, and energy

Performance includes peak/resident memory, cache capacity, accelerator/host utilization, power, energy, and total cost.

Implementation

Report allocation and process/device memory. For cost, include idle reserve, storage/network, tools, observability, and failures, and state the pricing date and region.

Operational implications

Cost per token can hide quality, request success, output length, and idle capacity.

Measure

Peak/resident bytes, energy/task, cost/success, utilization, and unallocated cost.
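
Energy per task and cost per success are simple to compute once the inputs are disclosed. The sketch below assumes power is sampled as `(time_s, watts)` pairs and that cost is a single hourly rate applied to both busy and idle hours; both helper names are hypothetical, and real pricing is usually more complicated.

```python
def energy_joules(samples):
    """Integrate (time_s, watts) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)
    return total


def cost_per_success(hourly_rate, wall_hours, idle_hours, successes):
    """Spread busy plus idle-reserve cost over successful tasks only."""
    if successes == 0:
        return float("inf")  # no successful work, so no finite unit cost
    return hourly_rate * (wall_hours + idle_hours) / successes
```

Dividing by successes rather than by tokens or requests keeps failed and low-quality work from lowering the reported unit cost.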

Repetition and uncertainty

System noise, compilation, thermals, autoscaling, network, and request randomness create variance.

Implementation

Run independent repetitions, stabilize environment, retain per-request raw data, and publish spread/confidence where useful.

Operational implications

A single best run is not a benchmark.

Measure

Run count, median/mean, standard deviation/CI, outliers, and drift.
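
When only a handful of independent repetitions are available, a bootstrap interval is one way to express uncertainty without assuming a distribution. This is a minimal sketch of a percentile bootstrap for the median; the function name, resample count, and seed are illustrative, and other interval methods may suit a given data set better.

```python
import random
import statistics


def bootstrap_median_ci(values, confidence=0.95, resamples=2000, seed=0):
    """Return (median, low, high) using a seeded percentile bootstrap."""
    rng = random.Random(seed)  # seeded so the reported interval is reproducible
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    low = medians[int(tail * resamples)]
    high = medians[min(resamples - 1, int((1 - tail) * resamples))]
    return statistics.median(values), low, high
```

Each input value should be the summary of one independent run (for example, that run's p95 TTFT), not an individual request, since requests within a run are not independent.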

Interpretation and claims

Conclusions must be bounded to tested versions, hardware, workload, SLO, and quality.

Implementation

State what the result supports, what it does not, and which variables remain uncontrolled.

Operational implications

Avoid “fastest,” “production-ready,” or “industry standard” unless scope and evidence justify them.

Measure

Claim-to-evidence traceability, limitations, reproduction status, and review date.

Reference tables

Benchmark disclosure minimum

Area           | Required disclosure
Model/artifact | Exact revision, tokenizer, precision/quantization, adapter, hash
Hardware       | Model/count/topology, memory, power mode, cloud instance details
Software       | OS/container, driver, runtime/backend, libraries/compiler, flags
Workload       | Input/output distributions, arrivals, concurrency, cache, tools
Method         | Warmup, run duration, repetitions, client model, sampling
Metrics        | Definitions/boundaries, percentiles, errors, Goodput thresholds
Quality        | Dataset, evaluator, sample size, criteria, regressions
Cost/power     | Date/region/rates, idle reserve, energy method
Evidence       | Raw results, config, scripts, logs, review date

Metric definitions

Metric       | Definition                                          | Report with
TTFT         | Request boundary to first delivered token           | Prompt length, queue/cache, percentile
TPOT         | Average post-first-token duration per output token  | Formula, output length, percentile
ITL          | Gap between successive token events                 | Weighting and distribution
E2E          | Request to final result                             | Output length, errors/timeouts
Throughput   | Work completed per interval                         | Latency/error/load
Goodput      | Work meeting SLO and quality                        | Explicit thresholds
Task success | Correct outcome per task                            | Evaluator, side effects, cost/latency

Decision checklist

  1. What architecture decision will this benchmark inform?
  2. Is the workload production-shaped and fully disclosed?
  3. Are model, tokenizer, precision, and quality equivalent?
  4. Are cold, warm, and cache states explicit?
  5. Are queue, TTFT, TPOT/ITL, E2E, errors, and Goodput defined?
  6. Is the load generator external and non-bottlenecking?
  7. Are memory, power, cost, and failures included?
  8. Are repeated runs and raw evidence retained?
  9. Are conclusions limited to the tested scope?

Common mistakes

  • Copying vendor benchmark numbers into a cross-vendor ranking.
  • Comparing different model revisions or quantization as if only the runtime changed.
  • Using average latency without tails or errors.
  • Reporting throughput from a closed-loop client with no offered-load context.
  • Hiding warmup or prefix-cache state.
  • Ignoring quality regression.
  • Running a CPU-bound load generator on the same host as the runtime under test.
  • Publishing only the best run.
  • Quoting cloud prices without their date and region.

Sources and further reading


  1. MLPerf Inference: Datacenter
    MLCommons · Benchmark specification · accessed 2026-06-21 UTC

  2. MLPerf Inference: Edge
    MLCommons · Benchmark specification · accessed 2026-06-21 UTC

  3. vLLM benchmarking
    vLLM · Official documentation · accessed 2026-06-21 UTC

  4. Triton Performance Analyzer
    NVIDIA · Official documentation · accessed 2026-06-21 UTC

  5. Reproducibility and Replicability in Science
    National Academies · Authoritative report · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
