Select an AI runtime architecture by workload, execution unit, deployment boundary, service objective, state, trust, and failure behavior. Do not begin with a universal product ranking.
Key takeaways
- Name the runtime layer before comparing products.
- Prefer the smallest architecture that meets consequence and operating needs.
- Verify capabilities against official documentation and a workload-specific proof.
Start with the job
| Job | Starting point |
|---|---|
| Execute a portable model | Evaluate graph/portable inference runtimes. |
| Maximize hardware-specific performance | Evaluate vendor-optimized compilers and engines. |
| Serve an LLM at concurrency | Evaluate generative engines plus model serving. |
| Scale one model across hosts | Add distributed inference. |
| Run in browser, mobile, or edge | Evaluate packaging, delegates, footprint, offline operation, and updates. |
| Run long-lived tool-using work | Add an agentic application runtime and durable workflow semantics. |
Selection questions
- What is the primary execution unit?
- Which model formats, operations, shapes, precisions, and hardware are required?
- What latency, goodput, availability, deadline, energy, and cost objectives apply?
- What state exists, who owns it, and how long must it survive?
- What data and trust boundaries are crossed?
- What external side effects occur?
- What failure, retry, compensation, and approval behavior is required?
- What evidence must be available?
- Which capabilities are verified for the exact version?
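The questions above can be captured as a structured record so that unanswered items are visible before any product comparison starts. The following is a minimal sketch, not a prescribed schema: the class name `WorkloadProfile`, its field names, and the `unanswered` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WorkloadProfile:
    """Hypothetical record of the selection questions; None means unanswered."""
    execution_unit: Optional[str] = None       # primary execution unit
    model_requirements: Optional[str] = None   # formats, operations, shapes, precisions, hardware
    service_objectives: Optional[str] = None   # latency, goodput, availability, deadline, energy, cost
    state_ownership: Optional[str] = None      # what state exists, who owns it, how long it survives
    trust_boundaries: Optional[str] = None     # data and trust boundaries crossed
    side_effects: Optional[str] = None         # external side effects
    failure_behavior: Optional[str] = None     # failure, retry, compensation, approval
    evidence: Optional[str] = None             # evidence that must be available
    verified_version: Optional[str] = None     # capabilities verified for the exact version

    def unanswered(self) -> list[str]:
        """Return the names of questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


profile = WorkloadProfile(execution_unit="single request", side_effects="none")
print(profile.unanswered())
```

A review could treat a non-empty `unanswered()` list as a reason to pause the shortlist rather than guess.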
Category matrix
| Need | Primary category | Common additions |
|---|---|---|
| Cross-hardware exported model | Portable graph/inference runtime | Target backend, application packaging |
| Lowest latency on one accelerator family | Hardware-optimized engine | Model server, profiler |
| High-concurrency generative API | Generative engine | Serving, gateway, observability |
| Multi-node large-model inference | Distributed runtime | Serving, cache tier, scheduler |
| Device-local inference | Edge/mobile/browser runtime | Update, fallback, telemetry |
| Tool-using durable task | Agentic application runtime | Workflow, policy, sandbox, evidence |
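The matrix can also be kept as data, which makes it easy to see which layers a workload with several needs would combine. This is an illustrative sketch only: the dictionary mirrors the rows above, while the names `CATEGORY_MATRIX`, `recommend`, and `layers_for` are assumptions introduced here, not part of any product or library.

```python
# Need -> (primary category, common additions), transcribed from the matrix rows.
CATEGORY_MATRIX = {
    "cross-hardware exported model": (
        "Portable graph/inference runtime", ["target backend", "application packaging"]),
    "lowest latency on one accelerator family": (
        "Hardware-optimized engine", ["model server", "profiler"]),
    "high-concurrency generative api": (
        "Generative engine", ["serving", "gateway", "observability"]),
    "multi-node large-model inference": (
        "Distributed runtime", ["serving", "cache tier", "scheduler"]),
    "device-local inference": (
        "Edge/mobile/browser runtime", ["update", "fallback", "telemetry"]),
    "tool-using durable task": (
        "Agentic application runtime", ["workflow", "policy", "sandbox", "evidence"]),
}


def recommend(need: str) -> tuple[str, list[str]]:
    """Look up one need, ignoring case and surrounding whitespace."""
    key = need.strip().lower()
    if key not in CATEGORY_MATRIX:
        raise KeyError(f"no matrix row for need: {need!r}")
    return CATEGORY_MATRIX[key]


def layers_for(needs: list[str]) -> list[str]:
    """Combine several needs into one ordered, de-duplicated list of layers."""
    layers: dict[str, None] = {}
    for need in needs:
        primary, additions = recommend(need)
        for layer in [primary, *additions]:
            layers.setdefault(layer, None)  # dict preserves insertion order
    return list(layers)


print(layers_for(["High-concurrency generative API", "Multi-node large-model inference"]))
```

The lookup deliberately returns categories, not products, so the shortlist still has to be built from verified capabilities.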
Deployment decision
Choose among embedded, local-service, cluster-service, managed-cloud, browser, edge, confidential, and hybrid deployment based on latency, privacy, connectivity, operations, hardware, and compliance. Include migration and exit paths; API compatibility alone may not preserve model behavior or operational semantics.
Trust and consequence
As consequences increase, require stronger identity, tool contracts, isolation, approval, idempotency, evidence, and incident response. Low-risk read-only generation can use a simpler path. Controls should scale with consequence, not with marketing labels.
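One way to make this scaling explicit is a cumulative mapping from consequence tier to required controls. The sketch below is an assumption throughout: the three tiers and the assignment of each control to a tier are invented for illustration, and only the control names themselves come from the paragraph above.

```python
from enum import IntEnum


class Consequence(IntEnum):
    """Hypothetical consequence tiers, ordered from lowest to highest."""
    READ_ONLY = 0      # e.g. low-risk read-only generation
    REVERSIBLE = 1     # side effects that can be undone
    IRREVERSIBLE = 2   # side effects that cannot be undone


# Controls each tier adds on top of the tiers below it (assumed split).
CONTROLS_ADDED = {
    Consequence.READ_ONLY: ["identity", "evidence"],
    Consequence.REVERSIBLE: ["tool contracts", "isolation", "idempotency"],
    Consequence.IRREVERSIBLE: ["approval", "incident response"],
}


def required_controls(level: Consequence) -> list[str]:
    """Return every control required at this tier and all lower tiers."""
    controls: list[str] = []
    for tier in Consequence:  # iterates in definition order
        if tier <= level:
            controls.extend(CONTROLS_ADDED[tier])
    return controls


print(required_controls(Consequence.REVERSIBLE))
```

Because the mapping is cumulative, a higher tier can never require fewer controls than a lower one, which is the property the paragraph asks for.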
Evaluation process
- Shortlist by required category and verified features.
- Build a representative conformance and workload suite.
- Measure correctness, quality, latency, goodput, reliability, cost, and operations.
- Test failure, upgrade, rollback, security, and portability.
- Record caveats and unverified dimensions.
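The first and last steps of this process (shortlisting on verified features, then recording what remains unverified) can be sketched as a small filter. Everything named here is hypothetical: `Candidate`, `shortlist`, the candidate names, and the feature strings are placeholders, not references to real products.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate runtime with features verified for its exact version."""
    name: str
    category: str
    verified_features: set[str] = field(default_factory=set)


def shortlist(
    candidates: list[Candidate], category: str, required: set[str]
) -> tuple[list[str], dict[str, list[str]]]:
    """Keep candidates in the category whose verified features cover `required`.

    Returns (kept names, caveats), where caveats maps each rejected
    in-category candidate to the required features it has not verified.
    """
    kept: list[str] = []
    caveats: dict[str, list[str]] = {}
    for c in candidates:
        if c.category != category:
            continue  # wrong layer: not comparable for this need
        missing = required - c.verified_features
        if missing:
            caveats[c.name] = sorted(missing)
        else:
            kept.append(c.name)
    return kept, caveats


candidates = [
    Candidate("engine-a", "generative engine", {"streaming", "fp8"}),
    Candidate("engine-b", "generative engine", {"streaming"}),
    Candidate("runtime-c", "portable runtime", {"streaming", "fp8"}),
]
print(shortlist(candidates, "generative engine", {"streaming", "fp8"}))
```

The measurement and failure-testing steps still need a representative workload suite; this filter only decides which candidates are worth that effort.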
Architecture decision record
Record context, decision, alternatives, layer ownership, versions, sources, workload, assumptions, risks, migration, validation results, and review date. Revisit when workload, model, hardware, or product capability changes.
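A record with these fields can be kept as structured data so the review trigger is checkable. The following is a minimal sketch under stated assumptions: the class `ArchitectureDecisionRecord`, its field types, and the `needs_review` method are illustrative choices, not an established ADR format.

```python
from dataclasses import dataclass
from datetime import date

# Changes that should prompt a revisit, taken from the paragraph above.
REVIEW_TRIGGERS = frozenset({"workload", "model", "hardware", "product capability"})


@dataclass
class ArchitectureDecisionRecord:
    """Hypothetical ADR shape covering the fields listed in the text."""
    context: str
    decision: str
    alternatives: list[str]
    layer_ownership: dict[str, str]   # layer -> owning team
    versions: dict[str, str]          # component -> exact version evaluated
    sources: list[str]
    workload: str
    assumptions: list[str]
    risks: list[str]
    migration: str
    validation_results: str
    review_date: date

    def needs_review(self, today: date, changed: frozenset[str] = frozenset()) -> bool:
        """True once the review date is reached or a trigger has changed."""
        return today >= self.review_date or bool(REVIEW_TRIGGERS & changed)
```

For example, calling `needs_review(today, changed=frozenset({"hardware"}))` returns `True` even before the review date, reflecting that a hardware change alone is enough to reopen the decision.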
