Reference Architecture

The ARuntime reference architecture connects model execution to product behavior through explicit runtime layers, cross-cutting controls, and versioned contracts. It is a blueprint for reasoning about ownership, not a requirement to deploy a separate service for every box.

Key takeaways

Separate compiler, inference, serving, distributed, agentic, and product responsibilities even when one platform implements several.
Use planes to distinguish management decisions, active execution, domain data, and evidence.
Make identity, authority, budgets, versioning, recovery, and retention explicit across the lifecycle.
Design failure behavior before optimizing the happy path.

[ar_diagram id=”seven-layer-stack”]

Layered reference model

Hardware substrate: processors, accelerators, memory, storage, fabric, operating system, drivers, isolation, and resource accounting.
Kernel and communications layer: optimized operators, numerical libraries, collectives, transfer primitives, and hardware-specific implementations.
Compiler and graph execution: graph capture, IR, rewriting, fusion, partitioning, lowering, code generation, scheduling, and memory planning.
Inference engine: model loading, prefill, decode, cache allocation, batching, structured output, and token-level telemetry.
Serving and distributed execution: APIs, repositories, admission, request scheduling, routing, autoscaling, rollout, parallelism, remote state, and node-failure handling.
Agentic application runtime: task boundaries, actor and tenant context, context assembly, model route, tools, memory, policy, approvals, evaluation, recovery, and evidence.
Product and workflow layer: user experience, domain rules, systems of record, accountable decisions, and definition of success.

The interfaces between layers should be narrow enough to test and substitute. A serving layer should not infer business authority from model text. An application runtime should not pretend to control numerical kernels through prompt instructions. A product should not treat a successful HTTP response as evidence that a task achieved its intended outcome.

Cross-cutting concerns

Identity and tenancy

Human, service, workload, and delegated identities; tenant isolation; session ownership; credential binding.

Configuration and secrets

Versioned configuration, secret references, late credential resolution, rotation, and non-disclosure to models.
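
Late credential resolution and non-disclosure to models can be sketched as follows. This is an illustration under stated assumptions, not a prescribed design: `SecretRef`, `SecretStore`, `render_for_model`, and `execute_tool` are hypothetical names, and the in-memory store stands in for whatever secret manager a deployment actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRef:
    """Opaque pointer to a credential; safe to log and to show a model."""
    name: str
    version: int


class SecretStore:
    """Hypothetical in-memory stand-in for a real secret manager."""

    def __init__(self, values: dict[tuple[str, int], str]) -> None:
        self._values = values

    def resolve(self, ref: SecretRef) -> str:
        # A KeyError here means the reference is unknown or was rotated away.
        return self._values[(ref.name, ref.version)]


def render_for_model(tool_name: str, ref: SecretRef) -> str:
    # Only the reference is disclosed to the model, never the secret value.
    return f"tool={tool_name} credential=ref:{ref.name}@v{ref.version}"


def execute_tool(tool_name: str, ref: SecretRef, store: SecretStore) -> str:
    # Late resolution: the credential is fetched only inside the executor.
    token = store.resolve(ref)
    # A real adapter would call the tool here; this sketch only reports
    # that a credential of some length was bound to the call.
    return f"{tool_name} called with credential of length {len(token)}"


store = SecretStore({("crm-api", 2): "s3cr3t-value"})
ref = SecretRef(name="crm-api", version=2)
prompt_fragment = render_for_model("crm.lookup", ref)
result = execute_tool("crm.lookup", ref, store)
```

Because the reference carries a version, rotating the secret changes what the executor resolves without changing anything the model has seen.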

Policy and security

Data classification, least privilege, egress, tool classes, approvals, rate limits, sandboxing, and incident response.

Observability and evaluation

Infrastructure, model, tool, policy, business, and quality signals correlated without requiring raw sensitive prompts.

Cost and capacity

Tokens, GPU time, queue budgets, tool calls, storage, human review, retries, and energy.
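
A minimal sketch of per-request budget accounting, assuming a simple counter per resource. The `Budget` class, the resource names, and the limits are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a charge would exceed a configured limit."""


@dataclass
class Budget:
    # Limits and units are illustrative assumptions, not recommendations.
    limits: dict[str, float]
    used: dict[str, float] = field(default_factory=dict)

    def charge(self, resource: str, amount: float) -> float:
        """Record usage and return what remains; refuse charges over the limit."""
        if resource not in self.limits:
            raise KeyError(f"no budget configured for {resource!r}")
        new_total = self.used.get(resource, 0.0) + amount
        if new_total > self.limits[resource]:
            raise BudgetExhausted(
                f"{resource}: {new_total} would exceed {self.limits[resource]}"
            )
        self.used[resource] = new_total
        return self.limits[resource] - new_total


budget = Budget(limits={"tokens": 1000, "tool_calls": 2})
remaining_tokens = budget.charge("tokens", 400)
budget.charge("tool_calls", 1)
budget.charge("tool_calls", 1)
try:
    budget.charge("tool_calls", 1)  # third call exceeds the limit of 2
    stopped = False
except BudgetExhausted:
    stopped = True
```

Checking the limit before recording the charge means a refused action leaves the counters unchanged, so the recorded usage reflects only work that was admitted.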

Versioning

Model, engine, compiler, prompt, tool, policy, schema, retrieval index, and application release.

Audit and evidence

Request identity, sources, decisions, side effects, artifacts, failures, recovery, and unresolved uncertainty.

Failure recovery

Timeouts, retries, cancellation, checkpoint, compensation, rollback, escalation, and safe termination.

Control, execution, data, and evidence planes

[ar_diagram id=”control-execution-planes”]

Control plane

Defines desired state and governs the fleet: identity, tenancy, configuration, policies, model and tool registries, provisioning, routing rules, rollout, quotas, and lifecycle.

Execution plane

Runs active work: model processes, sandboxes, workers, caches, tool adapters, workflow activities, and isolated session state.

Data plane

Carries prompts, tensors, tokens, cache transfers, tool payloads, domain reads/writes, and artifacts. Data-classification and minimization rules follow the data, not just the request.

Evidence plane

Persists correlated, minimized records of configuration, decisions, actions, outcomes, and failures. It should remain queryable even after the execution worker that produced a record has disappeared.

The evidence plane is separated because ordinary logs are often ephemeral, operationally scoped, or too sensitive for broad review. Evidence can be implemented using existing telemetry, event, and artifact systems; the architectural requirement is durable correlation and controlled access.
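
Durable correlation with minimization might look like the sketch below, which stores a digest of the prompt instead of the prompt itself and keys every record by a shared request identifier. `EvidenceRecord`, `EvidenceStore`, and the field names are assumptions made for this example, and a Python list stands in for durable storage. A plain digest supports correlation but is not by itself a privacy guarantee for short or guessable inputs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    request_id: str       # correlation key shared across planes
    policy_version: str
    decision: str
    prompt_sha256: str    # digest only; the raw prompt is not persisted
    outcome: str


def minimized_record(request_id: str, policy_version: str, decision: str,
                     prompt: str, outcome: str) -> EvidenceRecord:
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return EvidenceRecord(request_id, policy_version, decision, digest, outcome)


class EvidenceStore:
    """Hypothetical append-only store; a list stands in for durable storage."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, record: EvidenceRecord) -> None:
        self._lines.append(json.dumps(asdict(record), sort_keys=True))

    def by_request(self, request_id: str) -> list[dict]:
        records = [json.loads(line) for line in self._lines]
        return [r for r in records if r["request_id"] == request_id]


store = EvidenceStore()
raw_prompt = "Refund order A17 for customer jane@example.com"
store.append(minimized_record("req-1", "policy-v7", "require_approval", raw_prompt, "paused"))
store.append(minimized_record("req-1", "policy-v7", "allow", raw_prompt, "refunded"))
store.append(minimized_record("req-2", "policy-v7", "deny", "other prompt", "stopped"))
found = store.by_request("req-1")
```

The query depends only on the store, not on any worker that wrote to it, which is the property the evidence plane is meant to provide.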

Request lifecycle

[ar_diagram id=”agentic-request-lifecycle”]

Request admission: validate protocol, identity, tenant, quota, deadline, idempotency key, and contract version.
Authority resolution: determine allowed data, models, tools, side effects, and delegation.
Risk classification: assign consequence, reversibility, data classification, and review requirements.
Context assembly: select sources and memory under provenance, freshness, scope, and token budgets.
Model-route selection: choose deployment, region, precision, capability, cost, and fallback within policy.
Inference: serving and engine layers admit, schedule, execute, and stream model output.
Tool planning: treat model output as a proposal; construct a typed call rather than executing free-form text.
Authorization: validate schema, permission class, credentials, egress, side-effect class, and approval conditions.
Tool execution: apply timeout, idempotency, concurrency, isolation, and observability requirements.
Validation: check tool result, output contract, domain constraints, and unexpected state changes.
Human approval: pause without holding unnecessary compute; present evidence and an expiration time.
Response finalization: produce the user result, citations, status, and known uncertainty.
Memory decision: choose what may be retained, corrected, consolidated, exported, or deleted.
Evidence persistence: correlate request, model, tool, policy, approval, artifact, and recovery records.
Evaluation: score task outcome, safety, cost, latency, and evidence completeness.
Incident or recovery handling: retry, compensate, restore, escalate, or terminate, and preserve the failure record in every case.
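
The tool planning and authorization steps above can be sketched as a two-stage check: parse the model's proposal into a typed call against a registry, then decide whether the caller's authority covers the tool's side-effect class. The registry contents, the `read`/`write` classes, and all function names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    params: dict[str, type]   # parameter name -> required Python type
    side_effect: str          # "read" or "write" (illustrative classes)


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


REGISTRY = {
    "orders.get": ToolSpec("orders.get", {"order_id": str}, "read"),
    "orders.refund": ToolSpec("orders.refund", {"order_id": str, "cents": int}, "write"),
}


def plan(model_output: str) -> ToolCall:
    """Turn a model proposal into a typed call, or raise ValueError."""
    try:
        proposal = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError("proposal is not valid JSON") from exc
    if not isinstance(proposal, dict):
        raise ValueError("proposal must be a JSON object")
    tool = proposal.get("tool")
    spec = REGISTRY.get(tool) if isinstance(tool, str) else None
    if spec is None:
        raise ValueError("unknown tool")
    args = proposal.get("args")
    if not isinstance(args, dict) or set(args) != set(spec.params):
        raise ValueError("arguments do not match the tool schema")
    for key, expected in spec.params.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], expected):
            raise ValueError(f"{key} must be {expected.__name__}")
    return ToolCall(spec.name, dict(args))


def authorize(call: ToolCall, allowed_side_effects: set[str]) -> str:
    """Return 'allow' or 'require_approval'; authority is never widened here."""
    spec = REGISTRY[call.tool]
    return "allow" if spec.side_effect in allowed_side_effects else "require_approval"


read_call = plan('{"tool": "orders.get", "args": {"order_id": "A17"}}')
refund_call = plan('{"tool": "orders.refund", "args": {"order_id": "A17", "cents": 500}}')
decisions = [authorize(read_call, {"read"}), authorize(refund_call, {"read"})]
```

Free-form text such as `"please refund order A17"` fails in `plan` before any authorization or execution code runs, so an unparseable proposal cannot produce a side effect.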

Contract boundaries

Reference contracts and their owners
Contract	Purpose	Required owner
Runtime request	Identity, task, risk, context, route, tools, memory, budget, output, trace, deadline, retention	Application/runtime boundary
Model route	Capability, deployment, region, precision, fallback, provider and budget constraints	Serving/application routing
Tool	Typed I/O, permission and side-effect class, authentication reference, retry, idempotency, compensation	Tool owner plus runtime registry
Policy decision	Allow, deny, transform, require approval, or escalate with reason and expiry	Policy authority
Evidence record	Correlated, minimized proof of decisions, actions, artifacts, failures, and uncertainty	Evidence plane owner
Trace	Infrastructure, model, tool, policy, business, and evaluation correlation	Observability owner

Downloadable schemas are available in the developer reference. They are editorial reference contracts, not an industry standard or a shipped SDK.
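
To make the shape of one contract concrete, the policy decision row might be modeled as below. This is a hypothetical sketch and not the downloadable schema: the field names, the `Effect` enum, and the `is_usable` rule are assumptions derived only from the table's wording.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"
    TRANSFORM = "transform"
    REQUIRE_APPROVAL = "require_approval"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class PolicyDecision:
    effect: Effect
    reason: str
    policy_version: str
    expires_at: datetime

    def is_usable(self, now: datetime) -> bool:
        """Only an unexpired ALLOW may be acted on without further steps."""
        return self.effect is Effect.ALLOW and now < self.expires_at


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
decision = PolicyDecision(
    effect=Effect.ALLOW,
    reason="tenant policy permits read",
    policy_version="policy-v7",
    expires_at=issued + timedelta(minutes=5),
)
usable_soon = decision.is_usable(issued + timedelta(minutes=1))
usable_late = decision.is_usable(issued + timedelta(minutes=10))
```

Carrying the expiry inside the decision lets a worker that resumes after a long pause detect that its earlier authorization no longer applies.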

Failure model

[ar_diagram id=”failure-recovery-state-machine”]

Required failure semantics
Failure	Detection	User-visible behavior	Retry and recovery	Evidence
Model timeout or refusal	Deadline/stop reason	Bounded failure or qualified fallback	Retry only within budget and policy; route fallback may change capability	Route, model version, attempts, final status
Invalid structured output	Schema/domain validation	No tool side effect	Repair or regenerate with bounded attempts	Validation errors and retry strategy
Context retrieval failure	Source timeout, empty set, freshness rule	State missing context or stop	Retry source, use approved fallback, never fabricate	Requested and unavailable sources
Tool timeout	Tool deadline and status	Pending, failed, or partial result	Retry only if idempotent or operation status can be queried	Operation key and observed side effects
Authorization denial	Policy decision	Explain bounded denial	No automatic privilege expansion; approval only through configured path	Policy version, reason codes, authority
Partial side effect	Tool result plus system-of-record check	Show partial state and recovery status	Compensate or reconcile; do not blind retry	Changed resources and compensation
Duplicate execution	Idempotency record	Return prior result or conflict	Never repeat irreversible effect	Original and duplicate request identifiers
Budget exhaustion	Token, time, cost, or action counters	Pause or return partial qualified result	Approval may extend budget; otherwise stop	Budget use and termination reason
Trace persistence failure	Evidence write status	Fail closed for evidence-required work or mark degraded	Buffer within bounded retention; reconcile later	Gap record and recovery status
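
The tool timeout and duplicate execution rows can be illustrated together. In the sketch below, an idempotency record returns the prior result for a repeated key, and retries after a timeout are attempted only when the tool is declared idempotent. `IdempotentExecutor`, `ToolTimeout`, and the evidence tuples are hypothetical, and the in-memory dictionary stands in for a durable idempotency store.

```python
class ToolTimeout(Exception):
    """Raised by a tool adapter when its deadline passes."""


class IdempotentExecutor:
    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._results: dict[str, object] = {}      # idempotency record
        self.evidence: list[tuple[str, str]] = []  # (operation key, event)

    def run(self, key: str, tool, idempotent: bool):
        if key in self._results:
            # Duplicate delivery: return the prior result, do not re-execute.
            self.evidence.append((key, "duplicate"))
            return self._results[key]
        # Non-idempotent tools get exactly one attempt.
        attempts = self.max_attempts if idempotent else 1
        for attempt in range(1, attempts + 1):
            try:
                result = tool()
            except ToolTimeout:
                self.evidence.append((key, f"timeout attempt {attempt}"))
                continue
            self._results[key] = result
            self.evidence.append((key, "completed"))
            return result
        raise ToolTimeout(f"{key}: gave up after {attempts} attempt(s)")


calls = {"n": 0}


def flaky_lookup() -> str:
    # Simulated tool: times out once, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ToolTimeout("simulated timeout")
    return "order A17: shipped"


executor = IdempotentExecutor()
first = executor.run("req-1:lookup", flaky_lookup, idempotent=True)
second = executor.run("req-1:lookup", flaky_lookup, idempotent=True)
```

For a non-idempotent tool this sketch simply stops after one timeout. A fuller version would query the operation status, as the table requires, because a timeout leaves the outcome of a non-idempotent call unknown.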

Deployment profiles

Embedded local: application, engine, and policy run on one device; use OS boundaries and local encrypted storage.
Service-oriented: product calls shared model services; route and data contracts cross network boundaries.
Disaggregated cluster: prefill, decode, cache, and routing scale independently; network and remote state become first-class.
Managed agent runtime: control plane provisions isolated task environments and durable state.
Hybrid: local context, tools, policy, or evidence combine with hosted model execution under explicit egress and fallback rules.
Browser/edge: capability detection, model packaging, update, offline behavior, and device limits shape the architecture.

Architecture review checklist

Name every runtime layer and execution unit.
Identify state owner, lifetime, retention, and deletion for each state class.
Pin contracts and versions crossing team or trust boundaries.
Document actor, tenant, workload, and delegated identity.
Classify tool side effects and prove idempotency or compensation.
Specify overload, deadline, cancellation, fallback, and degraded modes.
Keep system-of-record state outside model memory.
Correlate evidence without storing unnecessary raw sensitive input.
Test permission denial, partial side effects, crash recovery, duplicate delivery, and trace outage.
Assign an accountable owner for every failure boundary.
