Foundations

AI Runtime Taxonomy

A vendor-neutral classification of the systems that prepare, execute, serve, distribute, govern, and apply artificial intelligence models. The taxonomy separates stack layers from runtime categories, execution models, deployment boundaries, and hardware targets so unlike systems are not compared as though they solve the same problem.

Key takeaways

  • Classify a system by the responsibility it performs, not by the word “runtime” in its product name or marketing description.
  • Use at least six coordinates: stack layer, runtime category, execution model, deployment boundary, hardware target, and operational owner.
  • Model formats, intermediate representations, APIs, and interoperability protocols are important runtime inputs and boundaries, but they are not complete runtimes by themselves.
  • Many production systems span categories. A useful classification states which responsibilities are native, delegated to another component, or supplied by the surrounding platform.
  • Comparisons are valid only within a declared scope. An inference engine, model server, serving control plane, and agentic runtime solve different parts of the execution stack.

Scope

Why taxonomy is necessary

The phrase AI runtime is used for several different kinds of software. A compiler team may mean the graph and code-generation system beneath a framework. An inference engineer may mean the engine that loads weights, schedules batches, and manages device memory. A platform team may mean a model-serving control plane. An application team may mean the governed execution environment around context, tools, memory, policy, and long-running state.

Those meanings are related, but they are not interchangeable. Without an explicit taxonomy, architecture discussions collapse into category errors: an interchange format is compared with an executable runtime, an inference engine is compared with a Kubernetes control plane, or an agent framework is treated as if it already supplies identity, authorization, durability, evaluation, and incident evidence.

The ARuntime.com taxonomy uses responsibility-first classification. It asks what a system receives, what it produces, what it optimizes, where it operates, what failure modes it owns, and which responsibilities remain outside its boundary. This also prevents product names from becoming definitions.

Classification rule

Classify the responsibility first, the implementation second, and the product last. When a product spans responsibilities, list each category and identify the handoff between them.

Method

Use six coordinates, not one label

A single category is rarely enough to describe a production runtime. The same inference engine can run locally, behind a model server, inside a Kubernetes serving platform, or as one component of an agentic application. Record the following axes together.

Where does the system operate?

Stack layer

Hardware, kernels, compiler and graph execution, inference, serving, agentic execution, or product workflow.

What outcome does it own?

Primary responsibility

Code generation, graph execution, token generation, API serving, distributed coordination, tool execution, policy, or user workflow.

What crosses the boundary?

Input and output

IR, model artifact, tensors, prompts, tokens, inference requests, tasks, tool calls, traces, or domain records.

When and how is work prepared?

Execution model

Eager, interpreted graph, JIT, AOT, delegated, dataflow, or a hybrid of several modes.

Where does it run and scale?

Deployment boundary

Process, device, browser, edge node, server, cluster, managed service, or a hybrid topology.

Who controls reliability and trust?

Operating ownership

Application team, platform team, hardware vendor, cloud provider, end user, or several owners with explicit contracts.

Formats and IRs belong in this description as inputs and contracts. ONNX defines an extensible computation graph, data types, and operators; it does not itself define a production service or application workflow.[1] StableHLO similarly serves as a portability layer between frameworks and compilers.[2]
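One way to keep the six coordinates together is a small record type. The sketch below is illustrative only: the `RuntimeClassification` name, its field names, and the example values are assumptions made for this example, not a published schema. Input and output are stored as two fields so each side of the boundary can be checked separately.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RuntimeClassification:
    """Hypothetical record holding the six coordinates for one system."""

    system: str
    stack_layers: tuple = ()            # where the system operates
    primary_responsibility: str = ""    # what outcome it owns
    boundary_inputs: tuple = ()         # what crosses the boundary inward
    boundary_outputs: tuple = ()        # what crosses the boundary outward
    execution_models: tuple = ()        # eager, interpreted graph, JIT, AOT, ...
    deployment_boundaries: tuple = ()   # process, device, server, cluster, ...
    operating_owners: tuple = ()        # who controls reliability and trust

    def missing_coordinates(self):
        """Return the names of coordinates that were left empty."""
        return [
            f.name
            for f in fields(self)
            if f.name != "system" and not getattr(self, f.name)
        ]


# Hypothetical system and values, chosen only to exercise the record.
engine = RuntimeClassification(
    system="example-llm-engine",
    stack_layers=("inference",),
    primary_responsibility="token generation",
    boundary_inputs=("model artifact", "prompts"),
    boundary_outputs=("tokens",),
    execution_models=("JIT",),
    deployment_boundaries=("server",),
)
print(engine.missing_coordinates())  # operating ownership was never recorded
```

A record like this makes an incomplete classification visible: a system described only by its layer and category still has open questions about deployment and ownership.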

Layer model

The seven-layer AI runtime stack

Layers describe where a responsibility sits. Categories describe what kind of runtime performs it. One product may span several layers, and one layer may contain several products.

ARuntime.com layered taxonomy from hardware substrate to product workflow
  1. Layer 0

    Hardware and system substrate

    Provide compute, memory, storage, interconnects, drivers, isolation, and device availability.

    Input
    Executable kernels, buffers, device commands, and communication operations.
    Output
    Completed device work, memory state, faults, and hardware telemetry.
    Boundary
    CPU, GPU, TPU, NPU or DSP, FPGA, custom accelerator, memory hierarchy, driver, and operating system.
  2. Layer 1

    Kernels and hardware libraries

    Implement high-performance tensor, attention, communication, and device-memory primitives.

    Input
    Tensor operations, layouts, shapes, precision, and target-device parameters.
    Output
    Device-specific computation and communication results.
    Boundary
    Vendor libraries, generated kernels, SIMD intrinsics, tensor-core instructions, collectives, and device APIs.
  3. Layer 2

    Compiler and graph runtime

    Import or capture model programs, transform them, partition work, lower operations, plan memory, and prepare executable modules.

    Input
    Framework graph, model format, high-level IR, shapes, target constraints, and backend capabilities.
    Output
    Optimized graph, execution plan, compiled module, delegated subgraph, or target-specific code.
    Boundary
    Frontend, IR, optimization passes, partitioner, code generator, execution provider, delegate, and graph executor.
  4. Layer 3

    Model and LLM inference engine

    Load model state and execute forward inference efficiently for one or more requests.

    Input
    Weights, tokenizer or preprocessing state, tensors or prompts, sampling parameters, and cache state.
    Output
    Predictions, embeddings, tokens, structured output, and engine-level telemetry.
    Boundary
    Model loading, prefill and decode, KV cache, batching, quantization, kernel dispatch, streaming, and device parallelism.
  5. Layer 4

    Serving and distributed runtime

    Expose inference over a service boundary and operate models across replicas, devices, or nodes.

    Input
    Network request, model identity, version, routing metadata, priority, deadline, and tenant context.
    Output
    Service response, stream, health state, scheduling decision, deployment event, and operational telemetry.
    Boundary
    HTTP or RPC APIs, queues, model repositories, schedulers, autoscaling, rollout, traffic policy, multi-node routing, and recovery.
  6. Layer 5

    Agentic and application runtime

    Turn an authorized task into governed, observable work across models, context, tools, memory, and humans.

    Input
    Actor, tenant, task, authority, context policy, model constraints, tool permissions, deadline, and budget.
    Output
    Structured result, evidence, tool outcomes, memory changes, policy decisions, trace, and review state.
    Boundary
    Context assembly, retrieval, routing, typed tools, durable state, policy enforcement, evaluation, replay, approval, and compensation.
  7. Layer 6

    Product and workflow layer

    Deliver the user experience and domain workflow in which runtime capabilities create business or user value.

    Input
    User intent, domain records, application state, workflow rules, and product-specific constraints.
    Output
    User-visible outcome, business transaction, domain update, notification, and product analytics.
    Boundary
    UI, domain services, business process, collaboration, product policy, records, and human interaction.

Compiler-oriented systems illustrate the layer distinction. XLA is documented as an ML compiler that optimizes linear algebra and generates target-specific code.[3] IREE explicitly separates compiler and standalone runtime components, demonstrating that a single project can bridge build-time and execution-time boundaries without making those responsibilities identical.[4]
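The layer model can be sketched as an ordered enumeration plus a check for shared scope. This is a minimal illustration, assuming hypothetical layer assignments for three unnamed systems; the `Layer` names and the `shared_layers` helper are invented for the example.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The seven layers, numbered as in the stack above."""

    HARDWARE = 0
    KERNELS = 1
    COMPILER_GRAPH = 2
    INFERENCE_ENGINE = 3
    SERVING = 4
    AGENTIC = 5
    PRODUCT = 6


def shared_layers(a, b):
    """Layers that both systems claim; an empty list means no common scope."""
    return sorted(a & b)


# Hypothetical layer assignments for three unnamed systems.
compiler_project = {Layer.COMPILER_GRAPH, Layer.INFERENCE_ENGINE}
serving_platform = {Layer.SERVING}
engine_with_api = {Layer.INFERENCE_ENGINE, Layer.SERVING}

print(shared_layers(compiler_project, serving_platform))
print(shared_layers(compiler_project, engine_with_api))
```

An empty intersection is a signal that a side-by-side comparison would mix layers; a non-empty one names the scope in which the comparison can be made.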

Runtime category boundary map

Figure: horizontal axis from model-artifact execution to governed business work; vertical axis from stateless handling to durable state and side effects. Plotted regions: Compiler, Inference engine, Model server, Gateway, Distributed runtime, Workflow engine, Agent framework, Agentic runtime, Product application.
Categories describe responsibilities, not marketing labels; adjacent products can overlap without becoming interchangeable.
Text description

A boundary map places compiler and graph runtimes, inference engines, model servers, gateways and routers, distributed runtimes, workflow engines, agent frameworks, agentic application runtimes, and product applications in overlapping responsibility regions. The horizontal axis moves from model artifact execution to business work. The vertical axis moves from stateless request handling to durable state and governed side effects. Overlaps are labeled secondary categories rather than forced single-category assignments.

Core matrix

Runtime categories by responsibility

The table classifies categories by their primary contract. Representative systems are examples, not rankings, endorsements, or claims that every listed system supports the same models, hardware, execution modes, or operational guarantees.

AI runtime category scope matrix
Framework runtime

Primary responsibility: Execute framework operations and manage tensors, devices, dispatch, and often automatic differentiation.
Typical input: Framework program, tensors, modules, and device selections.
Typical output: Tensor results, model outputs, gradients when training, and framework events.
Optimization scope: Operator dispatch, graph capture, framework compilation, device placement, and library selection.
Deployment scope: Development process, training environment, notebook, service process, or embedded framework runtime.
Representative systems: PyTorch and TensorFlow execution environments; exact behavior depends on eager, graph, and compiler modes.
What it is not: Not automatically a production model-serving or agent-governance platform.

Compiler and graph runtime

Primary responsibility: Transform a model program and execute or package the optimized graph for target hardware.
Typical input: Graph, model format, IR, shapes, backend capabilities, and optimization settings.
Typical output: Optimized graph, delegated subgraphs, compiled module, execution plan, or target code.
Optimization scope: Graph rewriting, fusion, partitioning, lowering, scheduling, code generation, and memory planning.
Deployment scope: Build host plus local, server, mobile, embedded, browser, or accelerator target.
Representative systems: OpenXLA/XLA, IREE, ONNX Runtime graph and execution-provider pipeline.
What it is not: Not necessarily a network service, autoscaler, model registry, or agent workflow engine.

General inference runtime

Primary responsibility: Load trained models and execute predictions across one or more hardware backends.
Typical input: Model artifact and typed tensors or media-derived inputs.
Typical output: Predictions, embeddings, scores, generated tensors, and runtime metrics.
Optimization scope: Graph optimization, device selection, backend compilation, quantization, asynchronous execution, and buffer reuse.
Deployment scope: Application process, service backend, desktop, edge server, mobile device, or embedded system.
Representative systems: ONNX Runtime and OpenVINO Runtime are representative overlapping examples.
What it is not: Not the same thing as a model file, a hosted API, or a complete distributed serving platform.

LLM inference engine

Primary responsibility: Execute autoregressive and related generative-model workloads with specialized scheduling and memory management.
Typical input: Model weights, tokenized prompts, sampling or decoding controls, adapters, and cache state.
Typical output: Token streams, embeddings, structured generation, log probabilities, and engine metrics.
Optimization scope: Prefill and decode scheduling, KV-cache allocation, continuous batching, prefix reuse, quantization, and parallelism.
Deployment scope: Local process, single-node GPU server, multi-GPU host, or distributed inference deployment.
Representative systems: vLLM is a representative inference-and-serving engine; other engines may expose different boundaries.
What it is not: Not by itself a complete enterprise workflow, authorization, governance, or human-review runtime.

Model server

Primary responsibility: Expose one or more models through stable service protocols and manage per-model execution behavior.
Typical input: HTTP, RPC, C API, or internal inference request plus model and version identity.
Typical output: Inference response or stream, health status, model status, and server metrics.
Optimization scope: Request queues, dynamic or sequence batching, instance groups, model loading, repository policy, and backend dispatch.
Deployment scope: Long-running service process, container, VM, edge server, or node inside a larger platform.
Representative systems: NVIDIA Triton Inference Server and OpenVINO Model Server are representative examples.
What it is not: Not identical to the inference backend it invokes, and not automatically a cluster control plane.

Serving platform and control plane

Primary responsibility: Deploy, route, scale, update, and observe model-serving workloads across infrastructure.
Typical input: Deployment specification, model location, resource policy, traffic policy, and service objectives.
Typical output: Running services, routing state, revisions, autoscaling decisions, rollout status, and platform telemetry.
Optimization scope: Replica placement, autoscaling, traffic splitting, model caching, admission, resilience, and infrastructure utilization.
Deployment scope: Kubernetes cluster, distributed compute cluster, managed cloud platform, or private serving environment.
Representative systems: KServe and Ray Serve represent different platform styles and operating abstractions.
What it is not: Not necessarily the component that performs tensor kernels or model-specific cache management.

Edge, mobile, and TinyML runtime

Primary responsibility: Execute models within device limits for memory, power, thermals, binary size, and offline operation.
Typical input: Prepared or converted model program, local sensor or application data, and target-device configuration.
Typical output: Local inference result, device telemetry, and application-visible state.
Optimization scope: AOT preparation, static memory planning, selective kernels, delegates, quantization, and heterogeneous partitioning.
Deployment scope: Phone, wearable, embedded Linux device, automotive system, microcontroller, or bare-metal target.
Representative systems: ExecuTorch and LiteRT are representative on-device stacks with different preparation and backend models.
What it is not: Not a universal substitute for high-capacity data-center serving or centralized fleet operations.

Browser runtime

Primary responsibility: Execute models within the web security and resource boundary while preserving UI responsiveness and compatibility.
Typical input: Downloaded model assets, browser data, user permission state, and selected web compute backend.
Typical output: Client-side prediction or generation result, browser-visible progress, and local cache state.
Optimization scope: WebAssembly, WebGPU, WebNN, workers, model caching, quantization, memory disposal, and fallback.
Deployment scope: Browser tab, worker, progressive web application, extension, or embedded web view.
Representative systems: Browser libraries may use ONNX Runtime Web, WebGPU, WebAssembly, or WebNN-backed execution paths.
What it is not: Not equivalent to a server runtime and not guaranteed to have uniform hardware support across clients.

Agentic and application runtime

Primary responsibility: Coordinate stateful work across models, context, tools, policy, memory, evaluation, and human control.
Typical input: Authorized task envelope, actor and tenant identity, context rules, tool contracts, budgets, and deadlines.
Typical output: Audited task result, evidence, tool side effects, memory changes, trace, and review state.
Optimization scope: Context selection, model routing, durable execution, retries, tool concurrency, cost controls, evaluation, and recovery.
Deployment scope: Application service, workflow engine, desktop application, private platform, or hybrid edge-cloud system.
Representative systems: Usually assembled from application code and frameworks; protocols such as MCP and A2A may connect parts without supplying the full runtime.
What it is not: Not merely a prompt template, chat UI, model API wrapper, or unconstrained autonomous loop.

Managed model API

Primary responsibility: Provide remotely hosted model capabilities behind a provider-owned service boundary.
Typical input: Authenticated API request, model selection, prompt or tensor data, and provider-supported controls.
Typical output: Hosted inference response or stream plus provider usage metadata.
Optimization scope: Provider-controlled routing, capacity, model deployment, caching, safety controls, and infrastructure operation.
Deployment scope: External managed service reached over a network.
Representative systems: Commercial and cloud model endpoints; capabilities and guarantees vary by provider.
What it is not: Not the consuming application’s complete runtime because identity, context, tools, memory, policy, and domain workflow remain outside it.

ONNX Runtime is a useful overlap case: it performs provider-independent graph optimization, queries execution-provider capabilities, partitions the graph, and retains fallback execution for unsupported work.[5] vLLM represents a different category emphasis, documenting LLM-specific model execution, paged cache management, continuous batching, prefix caching, quantization, and serving.[6]

Execution

Execution-engine categories

Execution model is a separate axis from product category. A framework can start in eager mode and add graph capture. A graph runtime can interpret some operations while compiling delegated subgraphs. An edge stack can prepare most work AOT and retain runtime fallback.

Eager or imperative execution

Preparation
Operations dispatch as the program runs.
Strength
Flexible control flow, immediate debugging, and simple experimentation.
Trade-off
Per-operation dispatch overhead and limited whole-program optimization unless graph capture is added.

Interpreted graph execution

Preparation
A graph is loaded and an executor dispatches its nodes or fused subgraphs.
Strength
Portable graph semantics and runtime selection of execution providers or delegates.
Trade-off
Runtime parsing, partitioning, dispatch, and fallback behavior must be managed.
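Interpreted graph execution can be pictured as a loop over graph data. The toy below assumes a graph already listed in topological order and a three-entry operator table; `KERNELS`, `GRAPH`, and `run_graph` are names invented for this sketch, and real graph executors also handle parsing, partitioning, and fallback.

```python
import operator

# Operator table the executor dispatches through (a stand-in for real kernels).
KERNELS = {"add": operator.add, "mul": operator.mul, "neg": operator.neg}

# The graph is data: (output name, op name, input names), in topological order.
GRAPH = [
    ("t0", "add", ("x", "y")),
    ("t1", "mul", ("t0", "x")),
    ("out", "neg", ("t1",)),
]


def run_graph(graph, feeds):
    """Interpret the graph node by node, looking each op up at run time."""
    values = dict(feeds)
    for output, op, inputs in graph:
        kernel = KERNELS[op]
        values[output] = kernel(*(values[name] for name in inputs))
    return values


result = run_graph(GRAPH, {"x": 2, "y": 3})
print(result["out"])  # computes -((x + y) * x)
```

Because the graph is plain data, the same description could be handed to a different executor or operator table, which is the portability the entry above describes.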

Just-in-time compilation

Preparation
The runtime captures or specializes a program for observed inputs and compiles it on demand.
Strength
Shape- and workload-specific optimization without requiring every variant before deployment.
Trade-off
Warmup, compilation cache management, guards, recompilation, and behavior that is harder to reproduce.
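The guard-and-cache pattern behind on-demand compilation can be shown with a toy that specializes on input length. Nothing here compiles real code: `ShapeSpecializingJit` is a hypothetical class, and "compilation" is simulated by building a closure and counting how often that happens.

```python
class ShapeSpecializingJit:
    """Toy JIT: builds one specialized callable per observed input length."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}          # guard key -> specialized callable
        self.compile_count = 0   # how many times "compilation" ran

    def _compile(self, length):
        # Stand-in for real compilation: bind the observed length in a closure.
        self.compile_count += 1
        fn = self.fn

        def specialized(values):
            assert len(values) == length  # the guard this variant relies on
            return fn(values)

        return specialized

    def __call__(self, values):
        key = len(values)             # guard: specialize on input length
        if key not in self.cache:     # a cache miss triggers compilation
            self.cache[key] = self._compile(key)
        return self.cache[key](values)


jit_sum = ShapeSpecializingJit(sum)
jit_sum([1, 2, 3])   # first length-3 call compiles
jit_sum([4, 5, 6])   # same length: reuses the cached variant
jit_sum([1, 2])      # new length: compiles again
print(jit_sum.compile_count)
```

The counter makes the trade-off concrete: every previously unseen guard key costs a compilation, which is why warmup and cache management appear in the list above.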

Ahead-of-time compilation

Preparation
The program is lowered and packaged before it reaches the production target.
Strength
Predictable startup, smaller runtime surface, static memory planning, and constrained-device suitability.
Trade-off
Reduced flexibility for unanticipated shapes, control flow, operators, or target changes.

Delegated or partitioned execution

Preparation
A coordinator assigns supported subgraphs to ordered backends and retains fallback work.
Strength
Heterogeneous acceleration without requiring one backend to support the full model.
Trade-off
Boundary copies, unsupported operations, and backend compilation can dominate performance, and placement decisions need explicit diagnostics.
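A minimal sketch of ordered-backend assignment with fallback follows. It treats the model as a linear sequence of operator names, which is a simplification (real partitioners work on graphs), and the backend names and operator coverage are made up for illustration.

```python
def partition(ops, backends, fallback="cpu"):
    """Assign each op to the first backend, in priority order, that supports it,
    and merge consecutive ops on the same backend into one subgraph."""
    partitions = []
    for op in ops:
        target = next(
            (name for name, supported in backends if op in supported),
            fallback,
        )
        if partitions and partitions[-1][0] == target:
            partitions[-1][1].append(op)
        else:
            partitions.append((target, [op]))
    return partitions


# Hypothetical backends and operator coverage, highest priority first.
BACKENDS = [
    ("npu", {"conv", "relu"}),
    ("gpu", {"conv", "relu", "matmul"}),
]

plan = partition(["conv", "relu", "matmul", "topk", "relu"], BACKENDS)
boundary_crossings = len(plan) - 1  # each crossing may imply a buffer copy

for backend, subgraph in plan:
    print(backend, subgraph)
```

In this example the unsupported `topk` falls back to the CPU and splits the work into four partitions, so three boundary crossings appear for a five-operator model.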

Dataflow and streaming execution

Preparation
Stages activate as inputs become available and may run concurrently or as a pipeline.
Strength
Useful for continuous media, multi-stage inference, asynchronous tools, and distributed workflows.
Trade-off
Backpressure, ordering, checkpointing, partial failure, and end-to-end tracing require explicit design.
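Backpressure in a staged pipeline can be sketched with bounded queues from Python's standard `asyncio` module. The stage names, the queue size of 2, and the `None` end-of-stream sentinel are choices made for this example, not a prescribed design.

```python
import asyncio


async def producer(out_q, items):
    for item in items:
        await out_q.put(item)   # waits while the queue is full (backpressure)
    await out_q.put(None)       # sentinel marks the end of the stream


async def stage(in_q, out_q, fn):
    while (item := await in_q.get()) is not None:
        await out_q.put(fn(item))
    await out_q.put(None)       # forward the sentinel downstream


async def run_pipeline(items):
    q1 = asyncio.Queue(maxsize=2)   # small bounds stop upstream racing ahead
    q2 = asyncio.Queue(maxsize=2)
    results = []

    async def sink():
        while (item := await q2.get()) is not None:
            results.append(item)

    # All three stages run concurrently and activate as inputs arrive.
    await asyncio.gather(
        producer(q1, items),
        stage(q1, q2, lambda x: x * x),
        sink(),
    )
    return results


squares = asyncio.run(run_pipeline(range(5)))
print(squares)
```

Ordering is preserved here only because each queue has a single consumer; adding parallel workers per stage would require the explicit ordering design the trade-off mentions.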

Continue to Eager, Graph, JIT, AOT, Interpreted, and Dataflow Execution.

Service boundary

Serving and orchestration categories

Serving categories are best separated by the unit of control. A model server controls requests and model instances within its service boundary. A serving platform controls deployments and traffic across services. A distributed inference runtime coordinates one model or request across devices or nodes.

Embedded execution

The application calls the runtime in-process.

Typically owns: Model loading, local scheduling, device use, and result handling inside one process boundary.

Usually remains outside: Network admission, fleet rollout, cross-node routing, or external service contracts unless the application adds them.

Model server

A long-running service exposes model inference through an API.

Typically owns: Protocol handling, model and version selection, queues, batching, health, and backend invocation.

Usually remains outside: Cluster-wide deployment policy, enterprise workflow authority, or domain-specific tool governance by default.

Serving platform

A control plane deploys and operates model-server workloads.

Typically owns: Desired state, placement, autoscaling, rollout, traffic, resource policy, and service-level operations.

Usually remains outside: Every model-specific optimization or every application-level business decision.

Distributed inference runtime

One model request or cache state spans multiple devices, processes, or nodes.

Typically owns: Parallelism, communication, cache transfer, placement, routing, synchronization, and partial-failure behavior.

Usually remains outside: The complete user workflow or authorization model unless explicitly integrated.

Managed inference service

A provider owns the serving infrastructure behind an external API.

Typically owns: Provider-side capacity, deployment, reliability, and supported model behavior.

Usually remains outside: The caller’s context provenance, tool permissions, memory, audit requirements, and product workflow.

Triton documents a model repository; an HTTP, gRPC, or C API request boundary; per-model schedulers; configurable batching; and backend dispatch. These are the characteristics of a model server.[7] KServe describes a Kubernetes-based serving platform with distinct control-plane and data-plane responsibilities.[8] Ray Serve adds another style: a scalable, framework-agnostic library that can serve models and arbitrary Python application logic.[9]

Application execution

The agentic runtime category

An agentic runtime begins where model inference becomes governed work. It normalizes a task, binds actor and tenant identity, assembles authorized context, chooses a model route, brokers typed tools, records explicit memory changes, applies policy checkpoints, supports durable or interruptible state, and emits evidence sufficient for evaluation and incident analysis.

Required boundary

Authority and policy

Identity, delegated authority, read and write permissions, tool allowlists, rate limits, approval rules, data boundaries, and deterministic enforcement outside the prompt.

Required execution

State and side effects

Checkpoints, idempotency, retries, cancellation, timeouts, compensation, tool result validation, memory writes, and resumable long-running work.

Required evidence

Trace and evaluation

Context provenance, model route, tool calls, policy decisions, token and cost data, latency, failures, human-review events, evaluation results, and final outcome.
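The three requirements above can be combined in a small broker sketch: an allowlist enforced in code rather than in the prompt, idempotent execution keyed by a caller-supplied token, and an append-only trace. `ToolBroker`, the actor names, and the `send_email` tool are hypothetical; a production runtime would add durable storage, timeouts, and compensation.

```python
class ToolBroker:
    """Toy broker: allowlist check, idempotent execution, decision trace."""

    def __init__(self, allowlist, tools):
        self.allowlist = allowlist   # actor -> set of permitted tool names
        self.tools = tools           # tool name -> callable
        self.results = {}            # idempotency key -> stored result
        self.trace = []              # append-only evidence records

    def _record(self, actor, tool, decision):
        self.trace.append({"actor": actor, "tool": tool, "decision": decision})

    def call(self, actor, tool, args, idempotency_key):
        if tool not in self.allowlist.get(actor, set()):
            self._record(actor, tool, "denied")
            raise PermissionError(f"{actor} may not call {tool}")
        if idempotency_key in self.results:
            # A retry with the same key returns the stored result unchanged.
            self._record(actor, tool, "replayed")
            return self.results[idempotency_key]
        result = self.tools[tool](**args)
        self.results[idempotency_key] = result
        self._record(actor, tool, "executed")
        return result


sent = []  # stands in for an external side effect


def send_email(to):
    sent.append(to)
    return f"sent:{to}"


broker = ToolBroker(
    allowlist={"alice": {"send_email"}},
    tools={"send_email": send_email},
)
first = broker.call("alice", "send_email", {"to": "ops@example.com"}, "task-1/step-3")
retry = broker.call("alice", "send_email", {"to": "ops@example.com"}, "task-1/step-3")
print(len(sent), [t["decision"] for t in broker.trace])
```

The retried call returns the stored result and leaves the side effect performed once, while the trace still records that a replay occurred.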

Protocols participate in the runtime; they do not replace it

MCP defines host, client, and server relationships plus tools, resources, prompts, capability negotiation, and lifecycle behavior for connections to external systems.[13] A2A defines communication and interoperability between independent agent systems.[14] The runtime still needs authorization, durable state, side-effect controls, observability, evaluation, and recovery around those protocol exchanges.

Continue to the Agentic Runtimes guide.

Placement

Deployment categories

Deployment is not a substitute for runtime category. It adds constraints around latency, privacy, scale, updates, observability, availability, and hardware. Record both.

Cloud or data center

Boundary: Server, GPU host, accelerator cluster, or managed platform.

Distinguishing concerns: High capacity, centralized operations, network boundary, and elastic or scheduled infrastructure.

Private cloud or air-gapped

Boundary: Organization-controlled infrastructure with restricted external connectivity.

Distinguishing concerns: Data residency, controlled software supply chain, private model distribution, and constrained update paths.

Local desktop or workstation

Boundary: User-managed CPU, GPU, unified memory, and local storage.

Distinguishing concerns: Interactive single-user operation, local privacy, model-size constraints, and hardware variability.

Edge server

Boundary: On-premises or near-device compute node.

Distinguishing concerns: Low network latency to devices, local data boundary, intermittent cloud connectivity, and fleet operations.

Mobile or embedded device

Boundary: Phone, wearable, vehicle, appliance, robot, or embedded Linux target.

Distinguishing concerns: Power, thermals, binary size, platform APIs, local sensors, and heterogeneous accelerators.

TinyML or bare metal

Boundary: Microcontroller or operating-system-free target.

Distinguishing concerns: Kilobyte-scale memory budgets, static allocation, integer kernels, deterministic timing, and constrained updates.

Browser

Boundary: Web sandbox, worker, browser cache, and user-device compute.

Distinguishing concerns: Download cost, client compatibility, permission model, main-thread isolation, and privacy-preserving local execution.

Serverless or scale-to-zero

Boundary: Ephemeral function, container, or microVM execution.

Distinguishing concerns: Cold start, model loading, execution limits, burst scaling, and per-invocation economics.

Hybrid

Boundary: Coordinated local, edge, and cloud runtime paths.

Distinguishing concerns: Routing by privacy, latency, cost, capability, availability, and model freshness across boundaries.
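Routing in a hybrid topology can be sketched as hard-constraint filtering followed by a cost choice. The example covers only privacy, latency, availability, and cost; capability and model freshness are omitted for brevity. The route names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    keeps_data_local: bool
    typical_latency_ms: int
    cost_per_request: float
    available: bool = True


def choose_route(routes, requires_local_data, latency_budget_ms):
    """Filter by hard constraints first, then pick the cheapest remaining route."""
    candidates = [
        r
        for r in routes
        if r.available
        and (r.keeps_data_local or not requires_local_data)
        and r.typical_latency_ms <= latency_budget_ms
    ]
    return min(candidates, key=lambda r: r.cost_per_request, default=None)


# Hypothetical routes with made-up numbers.
ROUTES = [
    Route("on-device", keeps_data_local=True, typical_latency_ms=120, cost_per_request=0.0),
    Route("edge-server", keeps_data_local=True, typical_latency_ms=40, cost_per_request=0.002),
    Route("cloud", keeps_data_local=False, typical_latency_ms=300, cost_per_request=0.001),
]

choice = choose_route(ROUTES, requires_local_data=True, latency_budget_ms=50)
print(choice.name if choice else "no route satisfies the constraints")
```

Returning `None` when no route qualifies keeps the failure explicit, so the caller must decide whether to relax a constraint or reject the request.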

ExecuTorch illustrates an edge runtime whose core responsibility is loading prepared program files and executing lowered instructions on devices.[10] LiteRT describes a broader on-device stack spanning conversion, runtime, and optimization.[11] In the browser, WebNN is an API abstraction over operating-system and hardware ML capabilities, not a complete application runtime on its own.[12]

Substrate

Hardware-target categories

Hardware is a classification axis because it changes precision support, memory hierarchy, compiler path, operator coverage, scheduling, fallback behavior, power, and cost. It does not determine the complete runtime architecture by itself.

CPU

General-purpose cores, large memory capacity, mature system integration, and strong control-flow flexibility.

Runtime question: Which vector instructions, threading model, memory layout, and kernel library are used?

GPU

High parallel throughput and memory bandwidth with explicit device-memory and kernel scheduling concerns.

Runtime question: How are kernels selected, launches reduced, memory reused, streams coordinated, and multi-GPU communication handled?

TPU or matrix accelerator

Specialized tensor computation with compiler-driven placement, supported operation sets, and device-specific topology.

Runtime question: Which compiler, precision, shape, partitioning, and collective assumptions constrain portability?

NPU or DSP

Power-efficient on-device acceleration with vendor SDKs, delegates, quantized paths, and restricted operator coverage.

Runtime question: Which subgraphs are supported, what falls back, and what copies occur between CPU, GPU, and accelerator memory?

FPGA

Reconfigurable pipelines with predictable latency and a specialized compilation and deployment toolchain.

Runtime question: How are graphs synthesized, bitstreams managed, buffers streamed, and updates validated?

Custom accelerator

Workload-specific architecture and software stack optimized for selected operations, precision, or dataflow.

Runtime question: What compiler/runtime contract, model coverage, fallback path, telemetry, and vendor-lock-in boundary exist?

Continue to CPU, GPU, TPU, NPU, FPGA, and Heterogeneous Hardware.

Overlap

Systems that span categories

Overlapping scope is normal. The problem is not that a product spans categories; the problem is describing it with one broad label that hides which responsibilities it actually owns.
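One way to replace a broad label is an ownership map that states, per responsibility, whether it is native, delegated, or supplied by the platform. The sketch below is an assumption-laden illustration: the `Ownership` values mirror the wording used on this page, while the responsibilities and the example server are hypothetical.

```python
from enum import Enum


class Ownership(Enum):
    NATIVE = "native"        # implemented by the system itself
    DELEGATED = "delegated"  # handed to another named component
    PLATFORM = "platform"    # supplied by the surrounding platform


def unowned(required, ownership_map):
    """Responsibilities that are required but have no declared owner."""
    return sorted(set(required) - set(ownership_map))


# Hypothetical requirements and ownership declarations.
REQUIRED = ["request batching", "model execution", "autoscaling", "authorization"]

example_server = {
    "request batching": (Ownership.NATIVE, None),
    "model execution": (Ownership.DELEGATED, "inference backend"),
    "autoscaling": (Ownership.PLATFORM, "cluster control plane"),
}

gaps = unowned(REQUIRED, example_server)
print(gaps)
```

Any responsibility that appears in the gap list is one the label was hiding: nobody has claimed it yet.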

Compiler/graph runtime + general inference runtime + backend integration

ONNX Runtime

It loads an ONNX graph, applies provider-independent optimization, partitions subgraphs by execution-provider capability, and executes supported work across ordered providers with fallback. The ONNX file is an input format; ONNX Runtime is the executable system.

Compiler + deployable module format + lightweight runtime

IREE

Its project structure deliberately includes compiler and standalone runtime components. Programs move through compiler pipelines and then execute through runtime modules and hardware-abstraction drivers on selected targets.

LLM inference engine + online serving surface

vLLM

It combines model execution, request processing, KV-cache management, batching, prefix reuse, quantization, and online API serving. That does not automatically make it a cluster deployment control plane or an application-level agent runtime.

Model server + schedulers + multi-backend integration

NVIDIA Triton Inference Server

It owns service protocols, model repositories, per-model schedulers, batching, and backend dispatch. The selected backend still owns the underlying model execution and device-specific engine behavior.

Serving control plane + standardized data-plane patterns

KServe

It separates management from inference execution. Controllers reconcile desired serving resources and scaling policy, while data-plane runtimes process requests through model-serving components and protocols.

Distributed application serving + model serving composition

Ray Serve

It can serve framework models and arbitrary Python business logic, compose deployments, queue requests, stream responses, batch dynamically, and scale replicas. The model engine used inside a deployment remains a separate responsibility.

AOT preparation + edge runtime + backend delegation

ExecuTorch

Models are prepared and lowered before deployment, then a small runtime loads the program and executes its instruction sequence. Backends and delegates can own supported partitions on device-specific accelerators.

Model conversion + on-device runtime + accelerator delegates

LiteRT

It provides an on-device deployment stack that includes conversion, runtime execution, optimization, and hardware-acceleration paths. Device, API, and delegate support determine the actual runtime boundary.

OpenVINO documentation also distinguishes its runtime API and compiled-model lifecycle from OpenVINO Model Server, which hosts models behind network protocols.[15][16] That distinction is representative of a broader rule: embedded inference and remote serving are related but separate categories.

A product spanning multiple runtime layers

Figure: stacked responsibilities of an illustrative platform, from top to bottom: Product workflow, Policy and tool broker, Gateway and routing, Serving and scheduling, Inference engine, External compiler and hardware dependencies.
An illustrative platform can implement several layers while each responsibility remains separately testable and replaceable.

An unnamed illustrative platform spans an inference engine, model-serving API, distributed scheduler, gateway, tool broker, policy service, and product workflow. Solid boxes identify primary responsibilities and outlined boxes identify integrated dependencies. Contract boundaries separate model execution from request governance. The example deliberately avoids a vendor name and does not imply that every product spans these layers.
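The idea that each layer stays separately testable and replaceable can be sketched in a few lines of Python. This is a minimal illustration only: `InferenceEngine`, `EchoEngine`, `UpperEngine`, `ModelServer`, `Gateway`, and `allow_known_users` are hypothetical names invented here, and the request shape is an assumption rather than any product's API.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class InferenceEngine(Protocol):
    """Hypothetical contract boundary for the model-execution layer."""

    def generate(self, prompt: str) -> str: ...


class EchoEngine:
    """Stand-in engine; a real engine would load weights and run a model."""

    def generate(self, prompt: str) -> str:
        return f"echo:{prompt}"


class UpperEngine:
    """A second stand-in engine, used to show that the layer is replaceable."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


@dataclass
class ModelServer:
    """Serving layer: owns the request boundary and delegates execution."""

    engine: InferenceEngine

    def handle(self, request: dict) -> dict:
        return {"output": self.engine.generate(request["prompt"])}


@dataclass
class Gateway:
    """Request governance: checks policy before routing to the server."""

    server: ModelServer
    is_allowed: Callable[[dict], bool]

    def route(self, request: dict) -> dict:
        if not self.is_allowed(request):
            return {"error": "denied by policy"}
        return self.server.handle(request)


def allow_known_users(request: dict) -> bool:
    """Illustrative policy: only a fixed set of users may call the model."""
    return request.get("user") in {"alice"}


gateway = Gateway(ModelServer(EchoEngine()), allow_known_users)
print(gateway.route({"user": "alice", "prompt": "hi"}))
print(gateway.route({"user": "bob", "prompt": "hi"}))
```

Swapping `EchoEngine` for `UpperEngine` changes model execution without touching the gateway or the policy, which is the property the contract boundaries are meant to protect.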

Corrections

Common classification mistakes

Calling a model format a runtime

ONNX, StableHLO, GGUF, and similar artifacts describe or package model programs and data. They do nothing until a separate runtime loads, transforms, or executes them.

Treating a compiler and model server as interchangeable

A compiler transforms programs; a model server accepts service requests and manages model availability, scheduling, and responses. A product can contain both.

Equating a hosted model API with the application runtime

The provider owns remote inference. The consuming system still owns identity, context, permissions, tools, memory, policy, workflow, and evidence unless explicitly delegated.

Calling every agent framework a production agent runtime

A framework may help define graphs, tools, or messages. A production runtime must also define durable state, authority, side-effect controls, failure recovery, telemetry, evaluation, and operations.

Comparing an inference engine directly with a serving control plane

The engine optimizes model execution. The control plane deploys, scales, routes, and updates serving workloads. Compare each category within its own scope.

Assuming one product belongs to exactly one layer

Integrated systems often span compiler, runtime, server, and control-plane responsibilities. Document native functions, delegated functions, and surrounding dependencies.

Using deployment location as the only category

“Cloud,” “edge,” and “browser” describe placement. They do not reveal whether the component is a compiler, inference runtime, model server, workflow runtime, or product.

Treating MCP or A2A as a complete runtime

MCP standardizes application connections to tools, resources, and prompts; A2A standardizes communication between agents. Both can participate in a runtime without providing all execution, state, policy, and operational responsibilities.
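Several of the corrections above reduce to writing down who owns each responsibility. The sketch below is one hypothetical way to record that in Python. The `Ownership` values mirror the native, delegated, and surrounding-dependency distinction used in this section, and the responsibility names are copied from the agent-framework correction. The class and method names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Ownership(Enum):
    NATIVE = "native"        # implemented by the system itself
    DELEGATED = "delegated"  # handed to another component
    EXTERNAL = "external"    # supplied by the surrounding platform


# Responsibilities this section expects a production agent runtime to define.
AGENT_RUNTIME_RESPONSIBILITIES = (
    "durable state",
    "authority",
    "side-effect controls",
    "failure recovery",
    "telemetry",
    "evaluation",
    "operations",
)


@dataclass
class ResponsibilityMap:
    """Hypothetical record of which party owns each responsibility."""

    system: str
    owners: dict[str, Ownership] = field(default_factory=dict)

    def unassigned(self, required: tuple[str, ...]) -> list[str]:
        """Return required responsibilities with no declared owner."""
        return [name for name in required if name not in self.owners]

    def by_owner(self, owner: Ownership) -> list[str]:
        """Return responsibilities declared under the given ownership."""
        return [name for name, o in self.owners.items() if o is owner]


framework = ResponsibilityMap(
    system="example-framework",  # illustrative name, not a real product
    owners={
        "telemetry": Ownership.NATIVE,
        "durable state": Ownership.DELEGATED,
    },
)
print(framework.unassigned(AGENT_RUNTIME_RESPONSIBILITIES))
```

An empty result from `unassigned` does not prove that a system is a production runtime. It only shows that every listed responsibility has a named owner that can then be reviewed.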

Decision path

Classify a system with this decision tree

Follow the questions in order and keep every category that applies. The result should be a multi-axis description, not a forced single label.

Responsibility-first AI runtime classification path
  1. Are you transforming, lowering, partitioning, or generating code for a model program?

    Start with the compiler and graph-runtime category.

  2. Are you loading a trained model and producing predictions, embeddings, or tokens?

    Start with a general or model-specific inference-engine category.

  3. Are you exposing models over HTTP, RPC, C API, or another service boundary?

    Add the model-server category.

  4. Are you deploying, scaling, routing, versioning, or rolling out model-serving workloads?

    Add the serving-platform or distributed-runtime category.

  5. Are you operating within browser, mobile, edge, or microcontroller constraints?

    Apply the relevant deployment category and classify its compiler, runtime, delegate, and application responsibilities separately.

  6. Are you coordinating identity, context, models, tools, memory, policy, evaluation, long-running state, or human approval?

    Add the agentic and application-runtime category.

  7. Are you describing the UI, domain workflow, business records, or final user outcome?

    You are at the product and workflow layer, which may use several runtimes underneath.

Example classification statement

“This deployment uses an AOT compiler and delegated edge inference runtime on an NPU-capable mobile device, exposes no network model server, and adds an application-level runtime for identity, context, typed tools, policy, telemetry, and human approval.”
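The decision path can be expressed as a small function that keeps every category whose question is answered "yes". The following Python sketch assumes one boolean per question. The `Traits` field names, the `DECISION_PATH` table, and the category labels are hypothetical shorthand for the seven steps, not an established schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traits:
    """One yes/no answer per question in the decision path (names assumed)."""

    transforms_model_program: bool = False       # question 1
    runs_inference: bool = False                 # question 2
    exposes_service_boundary: bool = False       # question 3
    manages_serving_workloads: bool = False      # question 4
    constrained_deployment: bool = False         # question 5
    coordinates_application_state: bool = False  # question 6
    describes_product_outcome: bool = False      # question 7


# Questions are evaluated in order; every match is kept, none is exclusive.
DECISION_PATH = (
    ("transforms_model_program", "compiler and graph runtime"),
    ("runs_inference", "inference engine"),
    ("exposes_service_boundary", "model server"),
    ("manages_serving_workloads", "serving platform or distributed runtime"),
    ("constrained_deployment", "constrained deployment (browser, mobile, edge, microcontroller)"),
    ("coordinates_application_state", "agentic and application runtime"),
    ("describes_product_outcome", "product and workflow layer"),
)


def classify(traits: Traits) -> list[str]:
    """Return every category that applies, in decision-path order."""
    return [label for attr, label in DECISION_PATH if getattr(traits, attr)]


# The example statement above: AOT compiler, delegated edge inference
# runtime, no network model server, plus an application-level runtime.
example = Traits(
    transforms_model_program=True,
    runs_inference=True,
    constrained_deployment=True,
    coordinates_application_state=True,
)
for category in classify(example):
    print(category)
```

Returning a list rather than a single label keeps the result a multi-axis description. Hardware target and operational owner would still need to be recorded separately.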

FAQ

Frequently asked questions

Why can the same product appear in more than one category?

Products often integrate several components for usability or performance. Classify each concrete responsibility, then state which parts are native, optional, delegated to a backend, or supplied by an external platform.

Is ONNX a runtime?

No. ONNX is an open specification for model graphs, data types, and operators. ONNX Runtime is a separate executable system that optimizes, partitions, and runs ONNX models.

Is a model server an inference engine?

A model server may embed or call an inference engine, but the categories are distinct. The engine performs model execution; the server owns protocols, queues, model availability, scheduling, and responses.

Is an agent framework a runtime?

It can be part of one. The runtime label fits better when the system owns execution state, tool contracts, permissions, retries, durable checkpoints, observability, evaluation, and recovery, rather than only graph or prompt construction.

Where do MCP and A2A fit?

They are interoperability protocols adjacent to the agentic runtime. MCP connects AI applications with tools, resources, and prompts. A2A supports communication and task collaboration between independent agents. Neither replaces runtime policy or execution controls.

What is the minimum information needed for a fair comparison?

Declare the category, layer, workload, model and format, version, hardware, precision, execution mode, deployment boundary, concurrency, metric definitions, cache state, reliability requirements, and trust assumptions.
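The list of required declarations can be turned into a simple completeness check. The Python sketch below is illustrative: the field names are snake_case renderings of the items listed in this answer, and the rule in `same_scope` (both declarations complete, with matching category and layer) is an assumption about what "within a declared scope" could mean in practice, not a rule stated by this taxonomy.

```python
# Field names derived from the list above; "model and format" is kept as one item.
REQUIRED_FIELDS = (
    "category",
    "layer",
    "workload",
    "model_and_format",
    "version",
    "hardware",
    "precision",
    "execution_mode",
    "deployment_boundary",
    "concurrency",
    "metric_definitions",
    "cache_state",
    "reliability_requirements",
    "trust_assumptions",
)


def missing_fields(declaration: dict) -> list[str]:
    """Return required fields that are absent or empty in a declaration."""
    return [name for name in REQUIRED_FIELDS if not declaration.get(name)]


def same_scope(a: dict, b: dict) -> bool:
    """Assumed rule: both declarations complete, same category and layer."""
    if missing_fields(a) or missing_fields(b):
        return False
    return a["category"] == b["category"] and a["layer"] == b["layer"]


# Placeholder values only; a real declaration would state concrete details.
engine_a = {name: "declared" for name in REQUIRED_FIELDS}
engine_a.update(category="inference engine", layer="model execution")

control_plane = {name: "declared" for name in REQUIRED_FIELDS}
control_plane.update(category="serving control plane", layer="serving platform")

print(same_scope(engine_a, dict(engine_a)))  # same category and layer
print(same_scope(engine_a, control_plane))   # different categories
print(missing_fields({"category": "inference engine"}))
```

A stricter check could also require matching workload, hardware, and precision before two results are placed side by side.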

Sources and further reading

This taxonomy is ARuntime.com analysis built from official specifications and project documentation. Representative systems are categorized by documented responsibility and are not ranked or endorsed.

  1. Open Neural Network Exchange Intermediate Representation Specification. ONNX project; Official specification. Defines ONNX as an extensible computation-graph specification with data types and built-in operators. Accessed 2026-06-21 UTC.
  2. StableHLO Specification. OpenXLA project; Official specification. Describes StableHLO as a portable high-level operation set between ML frameworks and compilers. Accessed 2026-06-21 UTC.
  3. XLA Architecture. OpenXLA project; Official project documentation. Explains the ML compiler role, optimization pipeline, and target-specific code generation. Accessed 2026-06-21 UTC.
  4. IREE Developer Overview. IREE project; Official project documentation. Documents the project’s separate compiler, standalone runtime, integrations, and full compiler-to-runtime workflows. Accessed 2026-06-21 UTC.
  5. ONNX Runtime Architecture. ONNX Runtime project; Official project documentation. Explains graph optimization, execution-provider capability discovery, partitioning, and fallback execution. Accessed 2026-06-21 UTC.
  6. vLLM Documentation. vLLM project; Official project documentation. Documents LLM inference and serving features including paged KV-cache management, batching, prefix caching, quantization, and distributed execution. Accessed 2026-06-21 UTC.
  7. Triton Architecture. NVIDIA; Official project documentation. Describes model repositories, service protocols, per-model schedulers, batching, and backend dispatch. Accessed 2026-06-21 UTC.
  8. KServe System Architecture Overview. KServe project; Official project documentation. Separates Kubernetes-based serving management through a control plane from inference execution in the data plane. Accessed 2026-06-21 UTC.
  9. Ray Serve: Scalable and Programmable Serving. Ray project; Official project documentation. Defines Ray Serve as a scalable, framework-agnostic serving library for models and arbitrary application logic. Accessed 2026-06-21 UTC.
  10. ExecuTorch Runtime Overview. PyTorch / ExecuTorch project; Official project documentation. Describes loading prepared program files and executing lowered model instructions on edge devices. Accessed 2026-06-21 UTC.
  11. LiteRT Overview. Google AI Edge; Official project documentation. Describes conversion, runtime, optimization, and on-device deployment across edge platforms. Accessed 2026-06-21 UTC.
  12. Web Neural Network API. W3C Web Machine Learning Working Group; Standards specification. Defines a web-friendly, hardware-agnostic abstraction over operating-system and hardware ML capabilities. Accessed 2026-06-21 UTC.
  13. Model Context Protocol Architecture Overview. Model Context Protocol project; Official protocol documentation. Defines host, client, and server relationships plus tools, resources, prompts, lifecycle, and capability negotiation. Accessed 2026-06-21 UTC.
  14. Agent2Agent Protocol Specification. A2A Protocol project; Official protocol specification. Defines communication and interoperability between independent agent systems. Accessed 2026-06-21 UTC.
  15. Running and Integrating an OpenVINO Inference Pipeline. OpenVINO project; Official project documentation. Documents the runtime API, model loading, compilation, device selection, and inference-request lifecycle. Accessed 2026-06-21 UTC.
  16. OpenVINO Model Server. OpenVINO project; Official project documentation. Separately describes hosting models and exposing remote inference over network protocols. Accessed 2026-06-21 UTC.

Last reviewed: 2026-06-23 UTC