ARuntime Reference

Runtime Stack Overview

How the seven AI runtime layers interact, which concerns cross the stack, and how ownership should be divided.

Audience: Technical readers | Reading time: 3 minutes | Status: Architecture | Last reviewed:

The ARuntime stack is a responsibility map from silicon to product outcome. It is designed to make hidden ownership visible without implying that every deployment needs seven independent services.

Key takeaways

  • Layers are separated by execution unit and operational responsibility.
  • Cross-cutting concerns—identity, policy, observability, evidence, configuration, security, and recovery—must connect several layers.
  • Architecture quality depends on explicit interfaces and failure ownership, not the number of products used.

[Diagram: seven-layer stack]

Reference model

Layer numbers describe dependency direction, not relative value. A product application at Layer 6 depends on the lower layers but remains accountable for the domain outcome. Layer 5 coordinates consequential model-backed work. Layers 3 and 4 make model execution efficient and available. Layers 0 through 2 convert model programs into physical execution.
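
The dependency direction described above can be written down as a small data structure. This is a minimal sketch, not part of the reference model itself: the enum member names and the helpers `depends_on` and `group` are hypothetical, and the groupings simply mirror the section headings used on this page.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The seven layers named in this reference, numbered by dependency direction."""
    HARDWARE = 0          # Hardware and system substrate
    KERNELS = 1           # Kernels and hardware libraries
    COMPILER = 2          # Compiler and graph runtime
    INFERENCE_ENGINE = 3  # Model and LLM inference engine
    SERVING = 4           # Serving and distributed runtime
    AGENTIC = 5           # Agentic and application runtime
    PRODUCT = 6           # Product and workflow layer


def depends_on(layer: Layer) -> list[Layer]:
    """Return every lower layer that `layer` transitively depends on."""
    return [lower for lower in Layer if lower < layer]


def group(layer: Layer) -> str:
    """Map a layer to the section grouping used on this page."""
    if layer <= Layer.COMPILER:
        return "lower execution"
    if layer <= Layer.SERVING:
        return "serving and distributed"
    return "agentic and product"
```

A higher number only means more dependencies: `depends_on(Layer.PRODUCT)` lists all six lower layers, while accountability for the outcome still sits with the product layer.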

Lower execution layers

Layer 0 — Hardware and system substrate

Owns processors, accelerators, memory, storage, network fabric, operating system, drivers, isolation, and resource accounting. Hardware topology shapes every upper-layer decision.

Layer 1 — Kernels and hardware libraries

Provides optimized mathematical operations and communication collectives. Precision, kernel availability, workspace needs, and numerical behavior are part of the runtime contract with the compiler and engine.

Layer 2 — Compiler and graph runtime

Transforms high-level programs into target-specific execution through intermediate representations (IRs), graph rewriting, fusion, partitioning, lowering, scheduling, and memory planning. Portable compiler stacks may retarget several backends; vendor runtimes may deliver deeper specialization for one hardware family.

Serving and distributed layers

Layer 3 — Model and LLM inference engine

Owns model loading and invocation. For generative models it manages prefill, decode, KV cache, continuous batching, structured output, streaming, and token-level telemetry.
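
The split between prefill and decode can be illustrated with a toy loop. This is a sketch under strong assumptions: `KVCache`, `toy_next_token`, `prefill`, and `decode` are hypothetical names, and the "model" is a placeholder arithmetic function, not a real engine API. It only shows the shape of the work: the prompt is processed once to populate cached state, then each decode step reads the cache and appends one entry.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for per-sequence key/value state (hypothetical)."""
    entries: list[int] = field(default_factory=list)


def toy_next_token(cache: KVCache) -> int:
    # Placeholder "model": the next token depends only on cached state.
    return sum(cache.entries) % 100


def prefill(prompt: list[int]) -> KVCache:
    """Process the whole prompt once and populate the cache."""
    return KVCache(entries=list(prompt))


def decode(cache: KVCache, max_new_tokens: int, stop_token: int = 0) -> list[int]:
    """Generate one token per step, growing the cache by one entry each time."""
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(cache)
        if token == stop_token:
            break
        cache.entries.append(token)
        generated.append(token)
    return generated
```

Because decode reuses the cache instead of reprocessing the prompt, the cost of each new token depends on cache reads rather than on a full pass over the input, which is why the engine treats the two phases separately.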

Layer 4 — Serving and distributed runtime

Turns engines into services and coordinated clusters. It owns network APIs, model repositories, health checks, versioning, request queues, routing, batching, scaling, rollouts, parallelism, collectives, remote cache, and node-failure handling. These concerns may be split between a server and a separate distributed scheduler.
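
Routing is one of the simpler Layer 4 responsibilities to sketch. The example below assumes a hypothetical `Replica` record and a least-queue-depth policy; real serving runtimes use richer signals, so treat this as an illustration of how health, version, and queue state combine in one decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    """Hypothetical view of one engine instance as seen by the router."""
    name: str
    model_version: str
    healthy: bool
    queue_depth: int


def route(replicas: list[Replica], model_version: str) -> Optional[Replica]:
    """Pick the healthy replica serving `model_version` with the shortest queue.

    Returns None when no replica qualifies, so the caller can shed or fall back.
    """
    candidates = [
        r for r in replicas
        if r.healthy and r.model_version == model_version
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.queue_depth)
```

Returning `None` instead of raising keeps the shed-or-fallback decision with the caller, which matches the idea that failure ownership should be explicit.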

Agentic and product layers

Layer 5 — Agentic and application runtime

Turns model requests into bounded tasks. It owns actor and tenant context, context assembly, route constraints, tool contracts, credentials, memory policy, approvals, budgets, checkpoints, recovery, evaluation, evidence, and trace correlation.
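
A bounded task can be sketched as a record that carries its own budget and checkpoints. The names `TaskBudget`, `BoundedTask`, and `BudgetExceeded` are hypothetical, and the budget dimensions (steps and tokens) are only two of the concerns listed above; approvals, credentials, and evidence are omitted for brevity.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a step would exceed the task's step or token budget."""


@dataclass
class TaskBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Reserve one step and `tokens` tokens, or refuse before any work runs."""
        if self.steps_used + 1 > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        self.steps_used += 1
        self.tokens_used += tokens


@dataclass
class BoundedTask:
    task_id: str
    tenant_id: str
    budget: TaskBudget
    checkpoints: list[dict] = field(default_factory=list)

    def run_step(self, name: str, tokens: int) -> None:
        """Charge the budget first, then record a checkpoint for recovery."""
        self.budget.charge(tokens)
        self.checkpoints.append(
            {"step": name, "tokens_used": self.budget.tokens_used}
        )
```

Charging before the step runs means an over-budget task stops at a known checkpoint rather than partway through a tool call.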

Layer 6 — Product and workflow layer

Owns the user-facing problem, domain records, accountability, and definition of success. A runtime can enforce a policy but cannot decide an organization’s legitimate business purpose without governance supplied by this layer.

Cross-cutting concerns

Identity and tenancy

Resolve actor, workload, service, tenant, and delegation across requests and tools.
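
One way to keep these identities resolvable across hops is an immutable context object that every request and tool call carries. This is an illustrative sketch: the `RequestContext` fields follow the list above, while the `delegate_to` helper and the shape of the delegation chain are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RequestContext:
    """Typed identity carried with a request instead of prompt text."""
    actor: str      # the human or system the work is done for
    tenant: str     # isolation boundary for data and policy
    service: str    # the service currently handling the request
    workload: str   # the job or task the request belongs to
    delegated_by: tuple[str, ...] = ()  # services that passed the request on

    def delegate_to(self, service: str) -> "RequestContext":
        """Return a new context for the callee, recording who delegated."""
        return replace(
            self,
            service=service,
            delegated_by=self.delegated_by + (self.service,),
        )
```

The actor and tenant never change as the request moves, so a tool at the end of the chain can still check authorization against the original principal.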

Configuration and versioning

Pin model, compiler, engine, policy, prompt, tool, and schema versions.
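
Pinning is easier to audit when the pinned versions collapse into one comparable value. The sketch below assumes a hypothetical `RuntimeManifest` with one field per item listed above and uses a SHA-256 digest of the sorted JSON form as a fingerprint; the version strings shown in usage are invented examples.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeManifest:
    """One pinned version per component that can change behavior."""
    model: str
    compiler: str
    engine: str
    policy: str
    prompt: str
    tool: str
    schema: str

    def fingerprint(self) -> str:
        """Stable digest of all pinned versions, suitable for logs and traces."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Two runs with the same fingerprint used the same pinned set; a changed fingerprint tells a reviewer that at least one component moved, without having to diff every field first.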

Security and policy

Apply isolation, least privilege, data boundaries, egress, approvals, and incident controls.

Observability and evidence

Correlate infrastructure, model, tool, policy, business, and evaluation events without over-collecting sensitive data.
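
Correlation and data minimization can be combined at the point where an event is created. In this sketch the `make_event` function, the event shape, and the `SENSITIVE_FIELDS` set are all assumptions: sensitive values are replaced by a digest and a length so events can still be matched and sized without storing raw content.

```python
import hashlib

# Assumption: which field names count as sensitive is a policy decision.
SENSITIVE_FIELDS = {"prompt", "tool_result"}


def make_event(trace_id: str, layer: int, kind: str, fields: dict) -> dict:
    """Build a correlated event, replacing sensitive values with a digest."""
    safe = {}
    for key, value in fields.items():
        if key in SENSITIVE_FIELDS:
            text = str(value)
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            safe[key + "_sha256"] = digest
            safe[key + "_chars"] = len(text)
        else:
            safe[key] = value
    return {"trace_id": trace_id, "layer": layer, "kind": kind, "fields": safe}
```

Events from different layers that share a `trace_id` can be joined later, and identical prompts produce identical digests, which supports deduplication and evidence checks without retaining the text.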

Cost and capacity

Budget hardware, tokens, context, tools, queue time, retries, and human attention.

Failure recovery

Detect, retry, shed, restore, compensate, escalate, and preserve evidence.
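
A small part of that sequence (detect, retry, escalate, preserve evidence) can be sketched as a wrapper. The names `TransientError` and `run_with_recovery` are hypothetical, the backoff schedule is a simple doubling chosen for illustration, and shedding, restoring, and compensating are out of scope here.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def run_with_recovery(
    operation: Callable[[], object],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry transient failures with doubling delay, then escalate.

    Every failed attempt is kept as evidence, whether or not the
    operation eventually succeeds.
    """
    evidence = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
            return {"status": "ok", "result": result, "evidence": evidence}
        except TransientError as exc:
            evidence.append({"attempt": attempt, "error": str(exc)})
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return {"status": "escalated", "result": None, "evidence": evidence}
```

Only `TransientError` is retried; any other exception propagates immediately, so the wrapper never hides a failure it was not designed to own.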

Composition patterns

  • Embedded inference: one application process embeds a portable engine and runs on CPU, GPU, or NPU.
  • Dedicated model service: a product calls a model server backed by one or more engines.
  • Disaggregated inference: prefill, decode, and cache services scale independently across a cluster.
  • Browser or edge execution: packaging, capability detection, local caching, and fallback logic bridge the on-device and hosted paths.
  • Durable agent execution: an agent framework runs inside a workflow/runtime boundary with tools, checkpoints, approvals, and evidence.
  • Hybrid governed architecture: local policy and sensitive context combine with hosted inference under explicit egress and retention rules.
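
The hybrid governed pattern in the last item depends on an explicit egress decision. A minimal sketch follows, assuming a hypothetical `EgressPolicy` with an allow-list of hosts and a set of data labels that must stay local; the label names and the host used in any example are invented, and retention rules are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressPolicy:
    """Hypothetical policy: where data may go and which labels stay local."""
    allowed_hosts: frozenset[str]
    local_only_labels: frozenset[str]


def choose_path(policy: EgressPolicy, host: str, data_labels: set[str]) -> str:
    """Return "hosted" only when both the data and the destination are allowed."""
    if data_labels & policy.local_only_labels:
        return "local"
    if host not in policy.allowed_hosts:
        return "local"
    return "hosted"
```

Defaulting to `"local"` whenever either check fails keeps the safe path as the fallback, so a missing allow-list entry cannot silently send data out.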

Stack anti-patterns

  • Letting the model prompt carry identity, secrets, authorization, and durable state as untyped prose.
  • Binding product code directly to a provider-specific model identifier without a route and fallback contract.
  • Optimizing token throughput while ignoring queueing, tool I/O, approval latency, or completion rate.
  • Using one global cache or memory namespace across tenants.
  • Retrying non-idempotent tools without a stable operation key.
  • Recording raw prompts and tool results indefinitely under the label “observability.”
  • Calling every component an “AI operating system,” which makes ownership impossible to audit.
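
Two of the anti-patterns above have small, concrete counterparts: a stable operation key for retried tool calls and a tenant-scoped cache key. The sketch below is illustrative only; `IdempotentToolRunner` and `tenant_cache_key` are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real runtime would use.

```python
from typing import Callable


class IdempotentToolRunner:
    """Runs a tool at most once per operation key and replays the result."""

    def __init__(self) -> None:
        # Assumption: a real runtime would persist this mapping durably.
        self._results: dict[str, object] = {}

    def call(self, operation_key: str, tool: Callable[..., object], *args) -> object:
        """Execute `tool` only if `operation_key` has not completed before."""
        if operation_key in self._results:
            return self._results[operation_key]
        result = tool(*args)
        self._results[operation_key] = result
        return result


def tenant_cache_key(tenant_id: str, key: str) -> tuple[str, str]:
    """Scope every cache entry to a tenant so namespaces cannot overlap."""
    return (tenant_id, key)
```

A retry that reuses the same operation key gets the recorded result instead of a second side effect, and the tuple key avoids collisions that string concatenation could introduce when identifiers contain the separator.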
