A benchmark result is meaningful only for a precisely defined workload, configuration, and objective. ARuntime does not assign a universal runtime score because model, hardware, precision, context, concurrency, cache state, and service objectives all change the result.
Key takeaways
- Report the complete workload and environment.
- Use percentile latency and SLO-constrained goodput, not peak throughput alone.
- Separate model quality changes from performance improvements.
Benchmark scope
State which question the benchmark answers: engine efficiency, server admission, distributed scaling, edge latency, agent completion, cost, or energy. Define the excluded dimensions, and avoid comparing products that own different layers as if they were direct substitutes.
Required workload specification
- Model name, exact version/commit, tokenizer, adapters, and model format
- Hardware model/count/topology, host, memory, interconnect, driver, and power mode
- Runtime and backend versions plus configuration
- Precision and quantization scheme, including calibration data and method
- Input and output length distributions
- Concurrency, arrival process, priority, and batching behavior
- Cache state, prefix reuse, warm/cold state, and model residency
- Streaming, stop conditions, and output-quality validation
- Network, storage, region, and client measurement point
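The checklist above can be captured as a structured record so that two results are compared only when their workloads match. The following is a minimal sketch, not an ARuntime schema: the `WorkloadSpec` class, its field names, and the example values are all hypothetical, and only a subset of the checklist is shown.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class WorkloadSpec:
    """Hypothetical record covering part of the workload checklist."""
    model: str
    model_version: str
    tokenizer: str
    hardware: str
    accelerator_count: int
    runtime_version: str
    precision: str
    input_len_distribution: str
    output_len_distribution: str
    concurrency: int
    arrival_process: str
    cache_state: str
    streaming: bool
    client_measurement_point: str


def fingerprint(spec: WorkloadSpec) -> str:
    """Short, stable hash of the spec; equal fingerprints mean equal fields."""
    canonical = json.dumps(asdict(spec), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def unverified_fields(spec: WorkloadSpec) -> list[str]:
    """Names of text fields left empty, so gaps are reported instead of hidden."""
    return sorted(name for name, value in asdict(spec).items() if value == "")


# Illustrative values only; none of these describe a real measurement.
cold = WorkloadSpec(
    model="example-model",
    model_version="commit-abc123",
    tokenizer="",  # not yet verified
    hardware="example-accelerator",
    accelerator_count=8,
    runtime_version="example-runtime 1.0",
    precision="fp16",
    input_len_distribution="fixed:1024",
    output_len_distribution="fixed:256",
    concurrency=32,
    arrival_process="poisson:5/s",
    cache_state="cold",
    streaming=True,
    client_measurement_point="same-region client",
)
warm = replace(cold, cache_state="warm")
```

Here `cold` and `warm` differ only in cache state, yet they produce different fingerprints, which signals that their results should not be reported as the same workload.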
Metrics
| Area | Metrics |
|---|---|
| Latency | queue time, time to first token (TTFT), time per output token (TPOT), end-to-end latency, percentiles, and jitter |
| Throughput | requests, input/output tokens, sequences, successful tasks per time |
| Goodput | completed work meeting defined quality and latency objectives |
| Capacity | concurrency, memory, cache occupancy, model residency |
| Reliability | error, timeout, overload, recovery, duplicate-effect rate |
| Cost/energy | cost or energy per successful, quality-valid outcome |
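As an illustration of the latency and goodput rows, the sketch below derives TTFT, TPOT, a nearest-rank percentile, and SLO-constrained goodput from per-request timestamps. The `RequestRecord` fields and the SLO thresholds are assumptions made for this example, and TPOT is computed with one common definition (decode time divided by output tokens after the first); a benchmark should state whichever definition it uses.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request measurements taken at the client."""
    sent_s: float
    first_token_s: float
    done_s: float
    output_tokens: int
    ok: bool             # completed without error or timeout
    quality_valid: bool  # passed the benchmark's output-quality check


def ttft(r: RequestRecord) -> float:
    """Time to first token, including any queueing before prefill."""
    return r.first_token_s - r.sent_s


def tpot(r: RequestRecord) -> float:
    """Mean time per output token after the first one."""
    if r.output_tokens < 2:
        return 0.0  # no inter-token gap exists; treated as 0 in this sketch
    return (r.done_s - r.first_token_s) / (r.output_tokens - 1)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def goodput(records: list[RequestRecord], window_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that succeed, pass quality checks, and meet both SLOs."""
    good = [
        r for r in records
        if r.ok and r.quality_valid
        and ttft(r) <= ttft_slo_s and tpot(r) <= tpot_slo_s
    ]
    return len(good) / window_s
```

Throughput counts every completed request, while `goodput` drops any request that misses a latency objective or fails the quality check, so the two can diverge sharply under overload.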
Measurement method
Use synchronized clocks, warmup criteria, stable run duration, confidence intervals or repeated trials, and separated client/server measurements. Record discarded runs and anomalies. Load generation should represent production arrivals rather than only closed-loop maximum load. MLPerf provides published methodologies for defined inference scenarios; use it where applicable without implying it covers every agentic runtime question. [ar_cite id="mlperf-inference" label="MLPerf Inference"]
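Two of the points above lend themselves to a short sketch: an open-loop arrival schedule, where send times are fixed in advance and do not wait on responses, and a summary of repeated trials with a confidence interval. Both function names are hypothetical. The interval uses a normal approximation, which understates the width when there are only a few trials; a t-based interval is the more careful choice in that case.

```python
import random
import statistics


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int) -> list[float]:
    """Send times (seconds from start) for an open-loop Poisson arrival process.

    Inter-arrival gaps are exponential with mean 1 / rate_per_s. The seed makes
    the schedule reproducible so it can be recorded with the results.
    """
    rng = random.Random(seed)
    now = 0.0
    times: list[float] = []
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            return times
        times.append(now)


def mean_with_ci(trial_values: list[float],
                 confidence: float = 0.95) -> tuple[float, float, float]:
    """Mean of repeated trials with a normal-approximation confidence interval."""
    if len(trial_values) < 2:
        raise ValueError("need at least two trials to estimate spread")
    mean = statistics.fmean(trial_values)
    sem = statistics.stdev(trial_values) / len(trial_values) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem
```

A closed-loop generator sends the next request only after the previous one returns, so it slows down when the server does and can hide queueing delay. A precomputed schedule like this one keeps the offered load independent of server behavior.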
Agentic and task benchmarks
Measure time to successful outcome, first-attempt success, tool and policy behavior, retries, approval delay, recovery, cost, evidence completeness, and side-effect correctness. Freeze external tool fixtures or record them precisely. Distinguish a model failing to plan from a tool, authorization, or infrastructure failure.
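To keep model failures separate from tool, authorization, and infrastructure failures, each attempt can carry an explicit cause label that is assigned when the run is recorded. The sketch below is illustrative only: the `Cause` categories mirror the distinction drawn above, and the `Attempt` and `TaskRun` types are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    NONE = "none"
    MODEL_PLANNING = "model_planning"
    TOOL = "tool"
    AUTHORIZATION = "authorization"
    INFRASTRUCTURE = "infrastructure"


@dataclass
class Attempt:
    succeeded: bool
    cause: Cause = Cause.NONE  # failure cause; NONE for successful attempts


@dataclass
class TaskRun:
    task_id: str
    attempts: list[Attempt]


def summarize(runs: list[TaskRun]) -> dict:
    """First-attempt success, eventual success, retries, and failures by cause."""
    total = len(runs)
    first = sum(1 for r in runs if r.attempts and r.attempts[0].succeeded)
    eventual = sum(1 for r in runs if any(a.succeeded for a in r.attempts))
    retries = sum(max(0, len(r.attempts) - 1) for r in runs)
    causes = Counter(
        a.cause.value for r in runs for a in r.attempts if not a.succeeded
    )
    return {
        "first_attempt_success_rate": first / total if total else 0.0,
        "eventual_success_rate": eventual / total if total else 0.0,
        "total_retries": retries,
        "failures_by_cause": dict(causes),
    }
```

Reporting `failures_by_cause` next to the success rates shows whether a low score reflects the model's planning or the environment around it.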
Comparison safeguards
- Compare only at equivalent output quality and with equivalent supported features.
- Disclose vendor tuning and excluded requests.
- Do not extrapolate one model/hardware result to all workloads.
- Separate benchmark sponsor, operator, and reviewer.
- Mark unverified fields and record the verification date in UTC.
- Publish scripts/configuration where licensing permits.
Reporting template
A benchmark report records objective, system under test, versions, workload, hardware, method, metrics, quality checks, raw result location, exclusions, anomalies, and limitations. Directory entries link to scoped evidence rather than a universal score.
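One way to enforce the template is a completeness check that runs before a report is published. The sketch below assumes a report is a plain dictionary keyed by the section names listed above; the key spellings and the `missing_sections` helper are hypothetical.

```python
REQUIRED_SECTIONS = (
    "objective",
    "system_under_test",
    "versions",
    "workload",
    "hardware",
    "method",
    "metrics",
    "quality_checks",
    "raw_result_location",
    "exclusions",
    "anomalies",
    "limitations",
)


def missing_sections(report: dict) -> list[str]:
    """Sections that are absent or None, in template order.

    An empty list or empty string counts as present, so a report can state
    explicitly that there were no exclusions or anomalies.
    """
    return [name for name in REQUIRED_SECTIONS if report.get(name) is None]
```

A report with `"anomalies": []` passes the check for that section, while one that omits the key is flagged, which keeps "none observed" distinct from "not recorded".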
