This report maps the current AI runtime landscape across development frameworks, compilers, inference engines, serving systems, edge runtimes, streaming systems, and agent runtimes. ARuntime uses it as a taxonomy input and redistributes its strongest ideas into focused reference pages.
Key takeaways
- Runtime families describe operational specializations; layers describe where responsibilities sit in an end-to-end stack.
- A production system normally composes several runtime families.
- Hardware target, state model, service objective, and failure boundary matter more than a marketing label.
Scope and contribution
The report starts from the observation that “runtime” is overloaded. It distinguishes framework-native execution, distributed training, portable inference, hardware-optimized inference, generative-model engines, model serving, edge/browser execution, agent runtimes, and streaming or real-time systems. The contribution is not a claim that these families are mutually exclusive. It is a practical lens for asking which component owns compilation, model state, admission, batching, placement, durable workflow state, or device constraints.
ARuntime’s seven-layer model is deliberately orthogonal to these families: a family names an operational specialization, while a layer names where a responsibility sits. A generative engine primarily occupies the model-inference layer; KServe primarily occupies the serving layer; IREE spans compiler and execution concerns; an agent framework primarily occupies the agentic/application layer. A product can therefore have one primary family and several secondary categories.
Current runtime families
| Family | Primary job | Typical ARuntime layer | Boundary question |
|---|---|---|---|
| Framework-native | Model authoring, automatic differentiation, eager or compiled execution | Layers 1–3 | Where does flexible development end and deployment compilation begin? |
| Distributed training | Shard parameters, gradients, optimizer state, and pipeline stages | Layers 0–4 | Which system owns collective failure and checkpoint recovery? |
| Portable inference | Execute exported models across heterogeneous providers | Layers 1–3 | Which operators and targets are actually supported for this release? |
| Hardware optimized | Fuse operations and select target-specific kernels and precision | Layers 1–3 | What portability is traded for target performance? |
| Generative model | Schedule prefill/decode and manage KV cache | Layer 3 | How are latency, throughput, memory, and quality balanced? |
| Model serving | Expose engines through APIs, versions, health, batching, and scale | Layer 4 | Where is the service and rollout boundary? |
| Edge/browser/streaming | Meet device, energy, privacy, and real-time constraints | Layers 0–4 | What fallback preserves the product’s core behavior when a constraint cannot be met? |
| Agent/application | Manage task state, tools, authority, recovery, and evidence | Layer 5 | Who owns consequential side effects and durable workflow state? |
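
One way to query the table is to encode each family's typical layer range as data. The sketch below is illustrative only: `FAMILY_LAYERS` and `families_at_layer` are hypothetical names, and the ranges are copied from the table rather than taken from any ARuntime API.

```python
# Typical ARuntime layer range per family, copied from the table above.
# FAMILY_LAYERS and families_at_layer are hypothetical names for illustration.
FAMILY_LAYERS = {
    "Framework-native": (1, 3),
    "Distributed training": (0, 4),
    "Portable inference": (1, 3),
    "Hardware optimized": (1, 3),
    "Generative model": (3, 3),
    "Model serving": (4, 4),
    "Edge/browser/streaming": (0, 4),
    "Agent/application": (5, 5),
}


def families_at_layer(layer: int) -> list[str]:
    """Return the families whose typical range includes the given layer."""
    return [
        family
        for family, (low, high) in FAMILY_LAYERS.items()
        if low <= layer <= high
    ]


if __name__ == "__main__":
    # Print which families typically touch each of the layers named in the table.
    for layer in range(6):
        print(layer, families_at_layer(layer))
```

In this encoding Layer 3 is shared by six families, which is why a family label alone does not say which component owns a given responsibility.
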
Shared internal mechanisms
Across families, runtimes repeatedly solve the same problems: scheduling, memory planning, compilation, precision, parallelism, state persistence, and telemetry. What differs is the unit of work. A compiler schedules operations; a generative engine schedules sequence steps; a server schedules requests; a distributed runtime schedules shards and cache transfers; an agentic runtime schedules model calls, tools, approvals, retries, and waits.
This distinction prevents false comparisons. “Supports batching” does not mean the same thing in a graph compiler, LLM engine, server, or workflow system. Directory fields therefore record responsibility and scope rather than converting every feature into one Boolean score.
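
To make the point about Boolean scores concrete, here is a minimal sketch of a directory field that records the scheduled unit and a scope note alongside the feature name. The `CapabilityRecord` class, its field names, the `directly_comparable` helper, and the scope strings are all assumptions for illustration, not ARuntime's actual directory schema; only the units come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityRecord:
    """One hypothetical directory field: a feature, its unit, and its scope."""
    runtime_kind: str  # e.g. "graph compiler", "LLM engine"
    feature: str       # e.g. "batching"
    unit: str          # what is actually being scheduled or grouped
    scope: str         # free-text note on where the claim applies (placeholder)


def directly_comparable(a: CapabilityRecord, b: CapabilityRecord) -> bool:
    """Treat two records as comparable only if feature and unit both match."""
    return a.feature == b.feature and a.unit == b.unit


# Three records that a single Boolean "supports batching" would collapse.
RECORDS = [
    CapabilityRecord("graph compiler", "batching", "operations", "within one compiled graph"),
    CapabilityRecord("LLM engine", "batching", "sequence steps", "across active sequences"),
    CapabilityRecord("model server", "batching", "requests", "at the API boundary"),
]

if __name__ == "__main__":
    for left in RECORDS:
        for right in RECORDS:
            if left is not right:
                print(left.runtime_kind, "vs", right.runtime_kind,
                      directly_comparable(left, right))
```

All three records share the feature name, yet no pair is comparable because the unit differs.
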
Cross-cutting trade-offs
- Latency versus throughput: batch formation, queue delay, cache locality, and prefill/decode interference change the operating point.
- Portability versus target optimization: portable IR and execution providers improve reach; vendor-specific kernels can improve performance on a narrower target.
- Developer ergonomics versus control: automatic behavior accelerates adoption but can hide scheduling and failure assumptions.
- State durability versus overhead: checkpoints and replay improve recovery while adding storage, serialization, and correctness obligations.
- Privacy versus operational convenience: on-device execution minimizes egress; managed services simplify operations but expand trust boundaries.
- Determinism versus parallel performance: reproducible execution may require slower algorithms or constrained scheduling.
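
The latency-versus-throughput item can be illustrated with a toy cost model. The sketch assumes a fixed arrival rate, a batch that waits until it is full, and a service time that is linear in batch size. None of these assumptions come from the report, and every parameter name and value is hypothetical.

```python
def batch_operating_point(
    batch_size: int,
    arrival_rate_rps: float,
    fixed_cost_s: float,
    per_item_cost_s: float,
) -> dict[str, float]:
    """Toy model: latency of the first request in a batch, and peak throughput."""
    # The first request waits for (batch_size - 1) further arrivals.
    fill_wait_s = (batch_size - 1) / arrival_rate_rps
    # Assumed linear service time: a fixed launch cost plus a per-item cost.
    service_s = fixed_cost_s + per_item_cost_s * batch_size
    return {
        "first_request_latency_s": fill_wait_s + service_s,
        "peak_throughput_rps": batch_size / service_s,
    }


if __name__ == "__main__":
    # Sweep a few batch sizes under the same hypothetical load and costs.
    for size in (1, 4, 16):
        point = batch_operating_point(
            size, arrival_rate_rps=100.0, fixed_cost_s=0.020, per_item_cost_s=0.002
        )
        print(size, point)
```

Under these assumptions a larger batch raises peak throughput by amortizing the fixed cost, while the first request in each batch waits longer. Real engines also face cache locality and prefill/decode interference, which this model ignores.
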
What ARuntime promoted
The report directly informed the taxonomy, runtime selection guide, hardware and compiler pages, generative-inference guidance, serving and distributed pages, edge/browser coverage, directory model, and emerging-architecture research index. Claims were promoted only where the source registry contained an appropriate primary specification, official documentation, or original research record.
What was not promoted
Future dates, universal performance multipliers, unsourced maturity labels, and broad product comparisons were omitted or rewritten as scoped questions. The report’s examples are discovery leads, not automatic evidence. Product capabilities remain version-scoped and must be rechecked against official documentation before material comparison.
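
The promotion rule in this section and the previous one can be read as a filter. A minimal sketch, assuming a hypothetical `Claim` record with a `source_kind` and a set of `flags`; the field names and flag strings are illustrative labels for the categories named in the text, not an ARuntime schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Source kinds the text names as acceptable grounds for promotion.
ACCEPTED_SOURCES = {
    "primary specification",
    "official documentation",
    "original research",
}
# Categories the text says were omitted or rewritten as scoped questions.
BLOCKING_FLAGS = {
    "future date",
    "universal performance multiplier",
    "unsourced maturity label",
    "broad product comparison",
}


@dataclass
class Claim:
    """Hypothetical record for one candidate claim from the report."""
    text: str
    source_kind: Optional[str] = None
    flags: set = field(default_factory=set)


def should_promote(claim: Claim) -> bool:
    """Promote only sourced claims that carry none of the blocking flags."""
    has_source = claim.source_kind in ACCEPTED_SOURCES
    is_blocked = bool(claim.flags & BLOCKING_FLAGS)
    return has_source and not is_blocked
```

A claim that fails the filter is not necessarily discarded; per the text, it may be rewritten as a scoped question instead.
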
Recommended reading path
- Begin with AI Runtime Taxonomy.
- Use Runtime Stack Overview to place responsibilities.
- Open Runtime Selection Guide for workload decisions.
- Use the Runtime Directory only after the category and scope are explicit.
- Review Benchmarking before comparing quantitative results.
