Key takeaways
- Start with workload, trust, deployment, and operating constraints before comparing products.
- Compare like layers and state which components a candidate does not provide.
- Use mandatory gates before weighted scoring to avoid selecting a fast but noncompliant option.
- Run a production-shaped proof with quality, Goodput, failure, recovery, and operability evidence.
- Record a decision with versions, assumptions, alternatives, trade-offs, and review triggers.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Requirements, workload, model/artifacts, SLOs, deployment/hardware, data policy, team capability, budget, and candidate evidence.
Owns
Requirement normalization, comparison scope, evidence quality, decision rationale, and re-evaluation triggers.
Emits
Layered architecture, shortlist, test plan, scorecard, decision record, risks, and review date.
Does not own
A permanent universal ranking or conclusions based only on vendor marketing.
Failure modes
Category mismatch, hidden mandatory constraint, unrealistic benchmark, support/license surprise, lock-in, and operational overload.
Evidence and metrics
Requirement coverage, proof pass/fail, Goodput, quality, recovery, team effort, cost, portability, and unresolved risk.
Classify the runtime layer
Determine whether the need is compiler/graph execution, inference engine, model server, serving platform, local/edge/browser runtime, or agentic execution.
Implementation
Create a layer diagram and list required external components.
Operational implications
A model server is not automatically an agent runtime; an engine is not automatically an autoscaled service.
Measure
Layer coverage, integration count, and unsupported responsibilities.
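As a rough illustration, the classification can be written down as data so that gaps become visible. The sketch below is a minimal example, not a prescribed tool: the layer names mirror the list above, while `Candidate`, `coverage_report`, and the sample candidate are hypothetical, and counting one integration per missing layer is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    COMPILER = "compiler/graph execution"
    ENGINE = "inference engine"
    MODEL_SERVER = "model server"
    SERVING_PLATFORM = "serving platform"
    LOCAL_RUNTIME = "local/edge/browser runtime"
    AGENTIC = "agentic execution"


@dataclass
class Candidate:
    name: str
    provides: set = field(default_factory=set)  # set of Layer values


def coverage_report(required: set, candidate: Candidate) -> dict:
    """Compare the layers the project needs with what one candidate provides."""
    covered = required & candidate.provides
    missing = required - candidate.provides
    return {
        "layer_coverage": len(covered) / len(required) if required else 1.0,
        # Assumption: each missing layer needs at least one external component.
        "integration_count": len(missing),
        "unsupported": sorted(layer.value for layer in missing),
    }


required = {Layer.ENGINE, Layer.MODEL_SERVER, Layer.SERVING_PLATFORM}
example = Candidate("example-engine", {Layer.ENGINE})  # hypothetical candidate
report = coverage_report(required, example)
```

Here the hypothetical engine covers one of three required layers, so the report lists the model server and serving platform as responsibilities that something else must own.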
Model and artifact fit
Check model family, formats, custom operations, adapters, multimodality, dynamic shapes, context, precision, and conversion path.
Implementation
Use exact target artifacts and representative edge cases in a compatibility spike.
Operational implications
“Supports ONNX” or “OpenAI-compatible” does not prove that every model loads or that every behavior matches.
Measure
Load success, operator coverage, parity, structured output, adapter support, and limits.
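A compatibility spike usually ends in a parity comparison between the reference output and the converted artifact's output. The following is a minimal sketch, assuming both outputs are available as flat lists of floats; `parity_passes` and the tolerance values are illustrative assumptions, not thresholds recommended by any runtime.

```python
import math


def parity_passes(reference, candidate, rel_tol=1e-3, abs_tol=1e-3):
    """Return True when both outputs have the same length and every
    element pair is close within the given tolerances."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Hypothetical outputs from the original model and the converted artifact.
reference_out = [0.12, -1.5, 3.0]
converted_out = [0.1201, -1.5004, 3.0009]
drifted_out = [0.12, -1.5, 3.2]

converted_ok = parity_passes(reference_out, converted_out)
drifted_ok = parity_passes(reference_out, drifted_out)
```

Tolerances should be set per precision mode and task; a value that is reasonable for FP16 logits may be too loose for a regression output.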
Workload and performance
Define the arrival pattern, prompt and output lengths, concurrency, streaming, cache behavior, tool use, and SLOs.
Implementation
Benchmark Goodput, p95/p99, errors, memory, cost, and quality using the intended topology.
Operational implications
Do not choose based on a single raw throughput chart.
Measure
Goodput, TTFT/TPOT/E2E, memory, errors, quality, and cost.
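To make the difference between raw throughput and Goodput concrete, here is a small sketch. It assumes Goodput means successful requests per second that meet every latency SLO; the `RequestResult` and `Slo` types, the field names, and the sample numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_s: float  # time to first token
    tpot_s: float  # mean time per output token
    e2e_s: float   # end-to-end latency
    ok: bool       # completed without error


@dataclass
class Slo:
    ttft_s: float
    tpot_s: float
    e2e_s: float


def goodput(results, slo, window_s):
    """Requests per second that succeeded and met every SLO."""
    good = sum(
        1
        for r in results
        if r.ok
        and r.ttft_s <= slo.ttft_s
        and r.tpot_s <= slo.tpot_s
        and r.e2e_s <= slo.e2e_s
    )
    return good / window_s


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


slo = Slo(ttft_s=0.5, tpot_s=0.05, e2e_s=5.0)
results = [
    RequestResult(0.2, 0.03, 2.0, True),
    RequestResult(0.4, 0.04, 4.0, True),
    RequestResult(0.9, 0.03, 3.0, True),   # misses the TTFT SLO
    RequestResult(0.3, 0.03, 2.5, False),  # errored
]
raw_throughput = len(results) / 2.0        # 2-second measurement window
good = goodput(results, slo, window_s=2.0)
p95_ttft = percentile([r.ttft_s for r in results], 95)
```

In this toy run, raw throughput is twice the Goodput, which is the kind of gap a single throughput chart hides.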
Deployment and hardware
Match the candidate to the required targets: cloud, private, local, browser, edge, or air-gapped deployment, along with topology, accelerators, and upgrade process.
Implementation
Verify official target support and fleet compatibility including drivers and delegates.
Operational implications
A runtime that performs well on one vendor's hardware may create unacceptable lock-in or fleet fragmentation.
Measure
Target coverage, compatibility, artifact variants, update effort, and portability.
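Target coverage can be tracked as the share of fleet compatibility tuples that have actually been tested. The sketch below is one possible shape for that check; the `Target` fields and all values are placeholders, and a real tuple would likely include more dimensions such as OS or delegate version.

```python
from typing import NamedTuple


class Target(NamedTuple):
    accelerator: str
    driver: str
    runtime: str


def target_coverage(fleet, tested):
    """Return (coverage fraction, fleet tuples with no tested match)."""
    fleet_set, tested_set = set(fleet), set(tested)
    untested = sorted(fleet_set - tested_set)
    coverage = 1.0 if not fleet_set else 1 - len(untested) / len(fleet_set)
    return coverage, untested


# Placeholder values; real entries would come from the fleet inventory.
fleet = [
    Target("accel-a", "driver-1", "runtime-1.0"),
    Target("accel-a", "driver-2", "runtime-1.0"),
    Target("accel-b", "driver-1", "runtime-1.0"),
]
tested = [Target("accel-a", "driver-1", "runtime-1.0")]
coverage, untested = target_coverage(fleet, tested)
```

The untested list doubles as the backlog for the compatibility proof.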
Security and governance
Assess identity, tenant isolation, data boundaries, egress, artifact integrity, tool policy, approvals, audit, retention, and deletion.
Implementation
Use mandatory gates for regulatory, residency, and privileged actions.
Operational implications
A security gap that fails a mandatory gate cannot be offset by a higher performance score.
Measure
Control coverage, policy decisions, audit completeness, incidents, and exceptions.
Agentic and durable behavior
For tools and agents, assess typed calls, MCP/A2A/OpenAPI/JSON Schema integration, idempotency, checkpointing, approval, memory, and replay.
Implementation
Test long-running failure, ambiguous tool outcomes, resume, and compensation.
Operational implications
Framework convenience is not durable execution evidence.
Measure
Resume success, duplicate prevention, tool errors, approval, memory governance, and replay.
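Duplicate prevention on resume is often tested with idempotency keys. The following is a minimal sketch of that idea, not any framework's API: `ToolLedger` is a hypothetical in-memory stand-in for a durable store, and `send_refund` is an invented tool used only for the example.

```python
import hashlib
import json


class ToolLedger:
    """In-memory stand-in for a durable record of completed tool calls."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    @staticmethod
    def key(run_id, step, tool, args):
        # A stable key: the same run, step, tool, and arguments map to one entry.
        payload = json.dumps([run_id, step, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, run_id, step, tool, args, fn):
        k = self.key(run_id, step, tool, args)
        if k in self._results:
            return self._results[k]  # replay: return the recorded result
        result = fn(**args)
        self.executions += 1
        self._results[k] = result
        return result


def send_refund(order_id, amount):  # hypothetical side-effecting tool
    return {"order_id": order_id, "refunded": amount}


ledger = ToolLedger()
args = {"order_id": "o-42", "amount": 10}
first = ledger.call("run-1", 3, "send_refund", args, send_refund)
# Simulated resume after a crash: the same step is attempted again.
second = ledger.call("run-1", 3, "send_refund", args, send_refund)
```

This sketch does not resolve ambiguous outcomes: if the process dies after the tool runs but before the result is recorded, the call would repeat. A proof should test exactly that window, ideally with the downstream tool also accepting the idempotency key.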
Operations and ecosystem
Evaluate observability, health, scaling, rollout, recovery, documentation, release cadence, security policy, licensing, community/vendor support, and team skills.
Implementation
Review official repository, release/support policy, upgrade history, and operational runbook.
Operational implications
A mature project can still be the wrong layer or require expertise the team lacks.
Measure
Upgrade effort, incident recovery, contribution/support response, staffing, and toil.
Controlled proof and scorecard
Use pass/fail gates first, then apply weighted scoring only to the candidates that pass every gate.
Implementation
Publish benchmark method, integration findings, failure tests, total cost, and risks; avoid fake precision in weights.
Operational implications
Decision quality depends more on the evidence than on the number of scorecard columns.
Measure
Gate pass, score sensitivity, unresolved risk, proof effort, and recommendation confidence.
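The gate-then-score order and a simple score sensitivity check can be sketched as follows. All candidate names, scores, and weights are invented for illustration, and perturbing one weight at a time by a fixed fraction is just one possible way to probe sensitivity.

```python
def qualified(candidates, gates):
    """Keep only candidates that pass every mandatory gate."""
    return {
        name: c
        for name, c in candidates.items()
        if all(c["gates"].get(g, False) for g in gates)
    }


def weighted_score(scores, weights):
    total = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / total


def best(candidates, weights):
    return max(
        candidates,
        key=lambda name: weighted_score(candidates[name]["scores"], weights),
    )


def winner_is_stable(candidates, weights, delta=0.25):
    """True if the top candidate survives moving each weight by +/- delta."""
    baseline = best(candidates, weights)
    for criterion in weights:
        for factor in (1 - delta, 1 + delta):
            perturbed = dict(weights)
            perturbed[criterion] = weights[criterion] * factor
            if best(candidates, perturbed) != baseline:
                return False
    return True


gates = ["model", "security", "license"]
weights = {"performance": 0.4, "portability": 0.3, "operations": 0.3}
candidates = {  # hypothetical 1-5 scores
    "A": {"gates": {"model": True, "security": True, "license": True},
          "scores": {"performance": 3, "portability": 4, "operations": 4}},
    "B": {"gates": {"model": True, "security": False, "license": True},
          "scores": {"performance": 5, "portability": 4, "operations": 4}},
    "C": {"gates": {"model": True, "security": True, "license": True},
          "scores": {"performance": 4, "portability": 3, "operations": 3}},
}
shortlist = qualified(candidates, gates)
winner = best(shortlist, weights)
stable = winner_is_stable(shortlist, weights)
```

Candidate B has the highest raw scores but never reaches scoring because it fails the security gate. If the winner flips under small weight changes, the honest output is a lower recommendation confidence rather than a more precise-looking weight.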
Decision record and review
Document the selected components, exact versions, alternatives, assumptions, reasons for rejection, migration/rollback plan, and review triggers.
Implementation
Review when the model, hardware, workload, policy, license, or support terms change, or when the scheduled review date arrives.
Operational implications
A decision without review triggers becomes accidental lock-in.
Measure
Assumption drift, review age, migration cost, and trigger events.
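A decision record is easier to keep honest when the review triggers are machine-checkable. The sketch below shows one hypothetical shape for such a record; the `DecisionRecord` fields, component names, versions, and dates are all placeholders.

```python
from dataclasses import dataclass, field
from datetime import date


def default_triggers():
    return {"model", "hardware", "workload", "policy", "license", "support"}


@dataclass
class DecisionRecord:
    components: dict    # component name -> exact version
    assumptions: list
    alternatives: dict  # rejected alternative -> reason for rejection
    review_by: date
    triggers: set = field(default_factory=default_triggers)

    def review_due(self, today, events=()):
        """Return (due, reason) given today's date and observed change events."""
        fired = sorted(self.triggers & set(events))
        if fired:
            return True, "trigger: " + ", ".join(fired)
        if today >= self.review_by:
            return True, "scheduled review date reached"
        return False, "no review needed"


record = DecisionRecord(
    components={"example-server": "1.2.3"},  # placeholder name and version
    assumptions=["peak traffic stays under the tested load"],
    alternatives={"example-platform": "failed the residency gate"},
    review_by=date(2027, 1, 15),
)
due_now = record.review_due(date(2026, 7, 1))
due_on_change = record.review_due(date(2026, 7, 1), events=["license"])
due_on_date = record.review_due(date(2027, 1, 15))
```

Review age and trigger events then fall out of the record itself rather than depending on someone remembering the decision.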
Reference tables
| Gate | Example evidence |
|---|---|
| Model compatibility | Exact artifact loads; parity passes |
| Deployment/hardware | Supported target and tested compatibility tuple |
| Data/privacy | Approved residency, egress, retention, deletion |
| Security | Identity, isolation, artifact integrity, tool policy |
| SLO/quality | Goodput and quality under production workload |
| Operations | Readiness, rollout, observability, recovery |
| License/support | Approved license and viable support lifecycle |

| Criterion | Evidence |
|---|---|
| Performance efficiency | Controlled Goodput, memory, power, cost |
| Portability | Formats, backends, APIs, export/migration path |
| Developer fit | Language/API, documentation, testability |
| Operational fit | Scaling, rollout, health, telemetry, upgrade |
| Ecosystem maturity | Releases, security policy, adoption, maintainers |
| Total cost | Infrastructure, service, integration, operations, exit |
Decision checklist
- Which runtime layer or layers are actually being selected?
- Which requirements are mandatory gates?
- What exact model and artifact must run?
- What production traffic and SLOs define success?
- Which deployment and hardware targets are required?
- What data, identity, tool, and retention policies apply?
- Does the workload need durable state or human approval?
- Can the team operate, upgrade, and recover the stack?
- What controlled proof will generate comparable evidence?
- What event will trigger re-evaluation?
Common mistakes
- Selecting a product before classifying the runtime layer.
- Using feature-count scorecards with no mandatory gates.
- Comparing vendor benchmark numbers from different configurations.
- Ignoring conversion, tokenizer, and preprocessing parity.
- Assuming compatibility APIs imply behavioral parity.
- Underestimating operational skill and upgrade cost.
- Choosing a hosted service without an exit/data-residency plan.
- Failing to record versions and review triggers.
Sources and further reading
- ONNX Runtime architecture
- Triton architecture
- KServe ServingRuntime
- ExecuTorch overview
- Web Neural Network API
- NIST AI Risk Management Framework
Last reviewed: 2026-06-21 UTC
