Key takeaways
- Layers describe where a responsibility sits in the AI execution stack; planes describe how operational responsibilities cut across those layers.
- The control plane selects and coordinates work, the context plane assembles authorized state and evidence, the execution plane performs model and tool operations, and the trust plane governs every boundary.
- Every component should have an explicit contract covering inputs, outputs, authority, failure semantics, cancellation, idempotency, telemetry, and version compatibility.
- Control-plane failure must not silently corrupt data-plane execution. Preserve known-good configuration, bound retries, and make degraded behavior explicit.
- Replaceability comes from stable contracts, capability negotiation, portable identity and telemetry, owned data schemas, and replayable conformance tests—not from a single universal API.
Definition and scope
Planes and layers answer different architecture questions
A layer identifies where a responsibility sits in the execution stack: hardware, kernels, compiler or graph runtime, inference engine, serving infrastructure, agentic runtime, or product workflow. A plane groups operational responsibilities that cross those layers. The same inference deployment can participate in a control plane for rollout and routing, a context plane for input construction, an execution plane for model work, and a trust plane for identity, policy, telemetry, and audit.
This distinction avoids two common design errors. First, it prevents teams from calling one product “the runtime” even though the application depends on several independently operated systems. Second, it prevents cross-cutting concerns such as identity or observability from being assigned to a single box when they must be enforced and propagated at multiple boundaries.
Kubernetes uses a control plane to make global decisions and reconcile desired state while worker nodes run application workloads.[1][2] KServe applies a similar distinction to model serving by separating lifecycle management from the inference data plane.[3][4][5] ARuntime.com extends that operational idea to context and trust because production AI requests depend on governed data assembly, delegated authority, validation, evaluation, and evidence as well as model execution.
Layers
Where does the responsibility sit?
Use layers to locate hardware, compiler, inference, serving, agentic, and product responsibilities and to identify which system owns each boundary.
Planes
How is the responsibility operated?
Use planes to separate coordination, context, execution, and trust concerns that must remain coherent across multiple layers and deployments.
Contracts
What crosses the boundary?
Version the data, authority, errors, cancellation, side effects, telemetry, and compatibility behavior exchanged between components.
Reference model
The four-plane runtime architecture
The four planes are not four mandatory services. They are an architectural test: each production responsibility must have a clear owner, contract, enforcement point, and failure behavior. A compact deployment may implement several planes in one process. A distributed deployment may divide them among controllers, gateways, workers, stores, policy services, and telemetry pipelines.
The trust plane surrounds the architecture. A control rail coordinates the request. Context and execution planes exchange a versioned context and execution contract through the runtime coordinator. Telemetry and evaluation flow back through the trust plane.
Control plane
Decide what should run, where it should run, and which constraints apply.
- Configuration, capability discovery, model and tool registration, routing, rollout, quotas, deadlines, budgets, admission, scheduling policy, and durable workflow coordination.
- Reconcile desired state with observed state and preserve a known-good execution configuration during partial failures.
- Inputs: Runtime contract, deployment configuration, capabilities, health, capacity, policy metadata, and service objectives.
- Outputs: Execution plan, route, reservation, desired state, retry or fallback decision, and lifecycle event.
- Avoid: Putting high-volume tensor or token data through the control plane, or letting control changes bypass versioning and review.
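One way to picture "preserve a known-good execution configuration" is a minimal Python sketch like the following. The names `ConfigSnapshot`, `ControlState`, `activate`, and `has_routes` are hypothetical and chosen for illustration; a real control plane would add staged rollout, persistence, and review.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ConfigSnapshot:
    """Immutable, versioned configuration (hypothetical shape)."""
    version: int
    routes: tuple[str, ...]


class ControlState:
    """Keeps the active snapshot when a candidate is stale or fails validation."""

    def __init__(self, initial: ConfigSnapshot) -> None:
        self.active = initial
        self.last_rejected: Optional[int] = None

    def activate(self, candidate: ConfigSnapshot,
                 validate: Callable[[ConfigSnapshot], bool]) -> bool:
        # Reject stale or invalid candidates; the known-good snapshot stays active.
        if candidate.version <= self.active.version or not validate(candidate):
            self.last_rejected = candidate.version
            return False
        self.active = candidate
        return True


def has_routes(snapshot: ConfigSnapshot) -> bool:
    # Assumed validation rule for this sketch: a snapshot needs at least one route.
    return len(snapshot.routes) > 0


state = ControlState(ConfigSnapshot(version=1, routes=("model-a",)))
rejected = state.activate(ConfigSnapshot(version=2, routes=()), has_routes)
accepted = state.activate(ConfigSnapshot(version=3, routes=("model-b",)), has_routes)
```

Because the snapshot is frozen and swapped in a single assignment, request-path code that holds a reference to `state.active` never sees a half-applied change.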
Context plane
Assemble the smallest authorized state and evidence set needed for the task.
- Retrieval, domain-data access, working state, long-term memory, provenance, redaction, ranking, compression, caching, and prompt or tensor preparation.
- Keep source ownership and access policy explicit instead of treating the prompt as the system of record.
- Inputs: Actor, tenant, task, context policy, source constraints, session state, query, and freshness requirements.
- Outputs: Versioned context bundle with provenance, citations, redaction state, freshness, and size accounting.
- Avoid: Copying unrestricted data into prompts, merging tenants, or persisting model-generated memory without an explicit write policy.
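The idea of assembling the smallest authorized bundle with size accounting can be sketched as below. This is an illustrative assumption, not a prescribed algorithm: `Record`, `ContextBundle`, and `assemble` are hypothetical names, relevance scores are taken as given, and character count stands in for a token or byte budget.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    source_id: str
    tenant: str
    text: str
    score: float  # relevance from a hypothetical ranker


@dataclass
class ContextBundle:
    items: list[Record] = field(default_factory=list)
    omitted: list[tuple[str, str]] = field(default_factory=list)  # (source_id, reason)
    size: int = 0  # characters used, as a stand-in for tokens or bytes


def assemble(records: list[Record], tenant: str, budget: int) -> ContextBundle:
    bundle = ContextBundle()
    # Consider the highest-ranked records first.
    for rec in sorted(records, key=lambda r: r.score, reverse=True):
        if rec.tenant != tenant:
            bundle.omitted.append((rec.source_id, "tenant-mismatch"))
        elif bundle.size + len(rec.text) > budget:
            bundle.omitted.append((rec.source_id, "budget-exceeded"))
        else:
            bundle.items.append(rec)
            bundle.size += len(rec.text)
    return bundle


records = [
    Record("kb:1", "acme", "a" * 40, 0.9),
    Record("kb:2", "other", "b" * 10, 0.8),
    Record("kb:3", "acme", "c" * 80, 0.7),
    Record("kb:4", "acme", "d" * 20, 0.5),
]
bundle = assemble(records, tenant="acme", budget=64)
```

Recording why a source was omitted keeps the bundle explainable later, which a silent filter would not.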
Execution plane
Perform the authorized work and emit typed results and progress events.
- Model inference, graph execution, request batching, streaming, typed tool invocation, durable activities, local or edge execution, and side-effect reconciliation.
- Honor cancellation, deadlines, resource reservations, idempotency rules, and output contracts.
- Inputs: Approved execution plan, model or tool request, context bundle, credentials reference, deadline, budget, and cancellation signal.
- Outputs: Tokens, predictions, tool results, state transitions, side-effect status, usage records, and completion or continuation events.
- Avoid: Allowing a model proposal to execute a privileged tool directly, or retrying ambiguous side effects without reconciliation.
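Honoring cancellation and deadlines usually means checking both signals between units of work and reporting a typed finish reason. The sketch below assumes a cooperative model with hypothetical names (`ExecutionResult`, `run_steps`); it does not interrupt a step that is already running.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ExecutionResult:
    outputs: tuple[str, ...]
    finish_reason: str  # "completed", "cancelled", or "deadline-exceeded"


def run_steps(steps: Iterable[Callable[[], str]], cancel: threading.Event,
              deadline: float) -> ExecutionResult:
    outputs: list[str] = []
    for step in steps:
        # Check both signals before starting each unit of work.
        if cancel.is_set():
            return ExecutionResult(tuple(outputs), "cancelled")
        if time.monotonic() >= deadline:
            return ExecutionResult(tuple(outputs), "deadline-exceeded")
        outputs.append(step())
    return ExecutionResult(tuple(outputs), "completed")


cancel = threading.Event()


def second_step() -> str:
    cancel.set()  # simulate a caller cancelling while this step runs
    return "b"


result = run_steps([lambda: "a", second_step, lambda: "c"],
                   cancel, deadline=time.monotonic() + 5.0)
```

The partial outputs are returned with the finish reason, so the caller can distinguish a cancelled run from a completed one.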
Trust plane
Authorize, validate, observe, evaluate, and retain evidence across every plane.
- Identity, delegated authority, policy decisions and enforcement, schema validation, secrets, egress, isolation, evaluation, human approval, audit, trace propagation, redaction, and retention.
- Make runtime evidence sufficient to explain what executed, under whose authority, with which data and versions, and with what outcome.
- Inputs: Identity and workload evidence, policy data, proposed action, schemas, runtime events, evaluation criteria, and retention rules.
- Outputs: Allow, deny, constrain, redact, approve, escalate, audit, evaluation, incident, and retention decisions.
- Avoid: Treating a system prompt as the enforcement boundary or collecting sensitive telemetry without redaction and retention controls.
Trust is not a final review step
The trust plane surrounds the other three planes because authorization, validation, isolation, trace propagation, redaction, evaluation, and retention must be enforced where an action or data transfer occurs. A prompt can express policy intent, but it cannot serve as the sole enforcement boundary for a privileged operation.
Cross-cutting responsibilities
Layer and plane cross-matrix
The matrix below shows why planes cannot be assigned only to the agentic layer. Hardware placement is a control concern. Tensor layout is a context concern at the kernel layer. Model scheduling spans execution and control. Identity and evidence remain trust concerns from the product boundary down to workload and device isolation.
| Runtime layer | Control | Context | Execution | Trust |
|---|---|---|---|---|
| 0 — Hardware and system substrate | Placement, reservations, quotas, health, and topology. | Data locality, device-access bounds, and encrypted storage paths. | Device commands, memory transfer, collectives, and faults. | Isolation, attestation, firmware or driver integrity, and workload identity. |
| 1 — Kernels and hardware libraries | Kernel selection, precision, tuning profile, and compatibility. | Tensor layouts, shapes, precision, and buffer ownership. | Tensor, attention, communication, and memory primitives. | Library provenance, signed artifacts, bounds checking, and low-level telemetry. |
| 2 — Compiler and graph runtime | Build profiles, target selection, partitioning, and fallback policy. | Input and output schemas, shape constraints, and constants. | Optimized graph, code generation, delegated subgraphs, and memory plans. | Artifact integrity, pass and compiler versions, validation, and reproducibility evidence. |
| 3 — Inference engine | Admission, route, batch, cache, priority, and model version. | Prompts, token IDs, tensors, KV state, and structured-output constraints. | Prediction, prefill, decode, sampling, streaming, and device parallelism. | Tenant isolation, output validation, usage, model provenance, and safety evidence. |
| 4 — Serving and distributed runtime | Deploy, scale, roll out, load, route, and recover replicas. | Network request envelope, model identity, tenant, deadline, and trace context. | Protocol handling, queueing, scheduling, batching, replica and node dispatch. | Authentication, workload identity, rate limits, policy, audit, and distributed trace propagation. |
| 5 — Agentic and application runtime | Task state, model and tool routing, budgets, retries, durable workflows, and handoffs. | Retrieval, working state, long-term memory, citations, and domain records. | Model calls, typed tools, compensations, streams, and human-review actions. | Delegated authority, policy, schema validation, evaluation, approvals, retention, and replay. |
| 6 — Product and workflow | Business workflow, product configuration, feature policy, and user intent. | Domain records, user state, collaboration, and product-specific history. | User-visible action, transaction, notification, and domain update. | Business authorization, privacy, consent, compliance evidence, and outcome accountability. |
Component model
Components, inputs, outputs, failure modes, and telemetry
A useful architecture diagram names boxes. A production architecture also defines what each box receives, what it is allowed to do, how it reports failure, and what evidence proves the boundary worked. The following matrix treats the runtime as a cooperating set of components rather than a chain of opaque SDK calls.
| Component | Primary plane | Responsibility | Inputs | Outputs | Failure modes | Required telemetry |
|---|---|---|---|---|---|---|
| Request boundary | Control + Trust | Authenticate the caller, normalize the task, validate the contract, establish deadline and budget, and start correlation. | Request envelope, actor or workload identity, tenant, task, output contract, permissions, risk, and trace context. | Accepted runtime request or typed rejection, normalized authority, deadline, budget, and trace identifiers. | Invalid schema, failed authentication, stale credentials, replay, tenant ambiguity, or unsupported contract version. | Admission result, actor and tenant references, contract version, latency, rejection class, risk, and redaction state. |
| Runtime coordinator | Control | Coordinate the request state machine, route work, propagate cancellation, apply budgets, and choose synchronous, streaming, or durable execution. | Normalized request, capabilities, health, capacity, policy constraints, workflow state, and service objectives. | Execution plan, ordered or parallel work, continuation state, fallback, compensation, or handoff decision. | State-machine defect, duplicate dispatch, lost cancellation, stale route, deadlock, or unbounded retry loop. | State transitions, route choice, attempt, deadline remaining, retry reason, fallback, and completion disposition. |
| Capability registry and router | Control | Describe available model, provider, tool, hardware, locality, privacy, and cost capabilities and select a compatible route. | Task requirements, model constraints, provider health, location, budget, latency target, and policy. | Selected route plus compatibility explanation, fallback chain, and reservation request. | Stale capability data, incompatible model or schema, unsafe fallback, region mismatch, or route oscillation. | Candidate set, rejection reasons, selected route, policy version, health snapshot, estimated and actual usage. |
| Context orchestrator | Context + Trust | Plan retrieval and state access, enforce source policy, rank and compress evidence, and build a provenance-aware context bundle. | Task, actor, tenant, query, context policy, source catalog, session state, memory policy, and token or byte budget. | Context bundle, citations, source references, freshness, redaction annotations, and omitted-source reasons. | Cross-tenant retrieval, stale index, source outage, injection-bearing content, budget overflow, or lost provenance. | Queries, source IDs, access decisions, hit counts, ranking, bytes or tokens, cache status, and redactions. |
| Context provider adapter | Context | Expose typed read operations for a source while preserving native identity, freshness, pagination, and provenance semantics. | Authorized source request, filters, cursor, consistency and freshness requirement, and cancellation signal. | Typed records or documents, provenance, version or ETag, completeness, and next cursor. | Permission mismatch, partial results, schema drift, rate limit, stale cursor, or source timeout. | Source latency, result count, version, cache state, retry, rate-limit state, and error class. |
| Model adapter | Execution + Control | Translate a stable runtime model request into a provider, engine, or local-runtime invocation without leaking vendor-specific details into the coordinator. | Model request, context or tensors, route parameters, output schema, generation or inference settings, deadline, and cancellation. | Typed model events, token or prediction stream, structured result, usage, finish status, and provider evidence. | Provider error, unsupported feature, timeout, partial stream, invalid structured output, or capacity rejection. | Provider and model version, queue and execution time, token or tensor counts, cache use, attempts, finish reason, and error class. |
| Inference or graph runtime | Execution | Load prepared artifacts and execute graph, tensor, or token-generation work on supported hardware. | Model artifact, tensors or token IDs, cache state, batching and scheduling parameters, precision, and device assignment. | Predictions, embeddings, logits or tokens, cache updates, device metrics, and engine status. | Artifact incompatibility, device loss, out-of-memory, kernel failure, partition mismatch, or numerical error. | Load and warmup, queue, batch, prefill or compute, decode, device utilization, memory, and backend errors. |
| Tool broker | Execution + Trust | Discover typed tools, validate arguments, resolve identity and credentials, authorize, invoke, rate-limit, audit, and reconcile side effects. | Tool name and version, typed arguments, actor and delegated authority, idempotency key, deadline, and approval state. | Typed result, evidence, side-effect classification, external record identifiers, and reconciliation state. | Unauthorized use, schema failure, credential error, ambiguous commit, unsafe egress, or non-idempotent retry. | Tool version, authorization, argument reference, duration, retry, idempotency key, external IDs, and side-effect state. |
| Memory and state manager | Context + Trust | Manage working state, durable checkpoints, approved long-term memory, retention, deletion, provenance, and optimistic or transactional updates. | Session or workflow key, read or write intent, memory policy, expected version, provenance, and retention class. | State snapshot, applied mutation, version, conflict, checkpoint, or deletion evidence. | Lost update, stale version, poisoning, over-retention, cross-tenant access, or acknowledged-but-uncommitted write. | State key reference, operation, version, conflict, durability, retention, deletion, and provenance. |
| Policy decision and enforcement | Trust | Evaluate structured policy input separately from the component that blocks, constrains, or permits the action. | Actor and workload identity, action, resource, context classification, tool or model proposal, risk, policy and data versions. | Allow, deny, constraints, obligations, approval requirement, redactions, and explanation references. | Policy unavailable, stale bundle, decision/enforcement mismatch, fail-open behavior, or missing evidence. | Decision, rule and policy version, input digest, enforcement result, latency, obligation completion, and override. |
| Durable workflow and queue | Control + Execution | Persist long-running progress, timers, retries, task queues, approval waits, compensations, and resumable state. | Workflow definition and version, task, checkpoint, event, timer, retry policy, and idempotent activity contract. | Scheduled work, durable event history, resumed state, compensation, completion, or human-review task. | Duplicate activity, nondeterministic replay, poison message, stuck workflow, version skew, or queue starvation. | Workflow and run IDs, event history position, task queue, attempts, heartbeat, wait reason, and terminal state. |
| Telemetry and evaluation pipeline | Trust | Correlate traces, metrics, logs, cost, evaluation, policy, and incident evidence without exposing sensitive payloads by default. | Runtime events, propagated trace context, resource attributes, evaluation criteria, redaction and sampling policy. | Spans, metrics, logs, evaluation results, cost records, alerts, replay references, and audit evidence. | Broken correlation, high-cardinality overload, sensitive-data leakage, sampling bias, or mandatory audit loss. | Pipeline health, dropped data, sampling, exporter latency, redaction, schema version, and retention state. |
| Human review surface | Trust + Control | Present enough context, evidence, policy rationale, and proposed effects for an authorized person to approve, edit, reject, or escalate. | Review task, proposed result or action, evidence, policy reasons, risk, deadline, and allowed reviewer actions. | Signed decision, edits, reason, escalation, resumed workflow, or cancellation. | Wrong reviewer, missing context, stale proposal, bypass, timeout, or ambiguous approval scope. | Reviewer identity reference, decision, reason, timestamp, affected action, policy, and resumed state. |
ONNX Runtime illustrates one replaceable execution boundary: providers report supported graph capabilities, the runtime partitions compatible subgraphs, and unsupported work can fall through to later providers.[6] Triton places protocol handling and per-model scheduling in front of backend execution.[7] Ray Serve separates proxies, controller state, deployments, and replicas, which makes request flow and component recovery independently visible.[8] These products do not implement the whole reference architecture; they demonstrate why responsibilities and boundaries should be named precisely.
Replaceable interfaces
Five contracts that prevent the coordinator from becoming a monolith
Interfaces should preserve meaningful differences rather than pretending every engine, provider, data source, or tool is identical. OpenAPI can describe HTTP interactions and JSON Schema can describe and validate structured instance data, but operational contracts must also define semantics such as cancellation, streaming, idempotency, authority, compatibility, and side-effect status.[10][11]
Execution with control metadata
Model adapter interface
Normalize provider, engine, or local-runtime differences behind a versioned request and event contract.
describeCapabilities() → ModelCapabilities
execute(ModelRequest request, CancellationSignal cancellation) → stream<ModelEvent>
health() → HealthStatus
cancel(ExecutionId executionId) → CancellationResult
- Contract must carry: Model and provider identifiers, capability and contract versions, deadline, output schema, usage accounting, finish reason, and trace context.
- Invariant: The coordinator must not depend on provider-specific response shapes, mutable model aliases, or secret material.
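The adapter boundary above could be expressed in Python as a structural `Protocol`. This sketch covers only capability description and streamed execution, omitting `health` and `cancel` for brevity. `EchoAdapter`, `collect`, and the `model.request.v1` label are hypothetical stand-ins, not a real provider integration.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class ModelRequest:
    model_id: str
    prompt: str
    contract_version: str = "model.request.v1"  # hypothetical version label


@dataclass(frozen=True)
class ModelEvent:
    kind: str   # "token" or "finish"
    value: str


class ModelAdapter(Protocol):
    def describe_capabilities(self) -> dict[str, bool]: ...
    def execute(self, request: ModelRequest) -> Iterator[ModelEvent]: ...


class EchoAdapter:
    """Stand-in adapter that streams the prompt back word by word."""

    def describe_capabilities(self) -> dict[str, bool]:
        return {"streaming": True, "structured_output": False}

    def execute(self, request: ModelRequest) -> Iterator[ModelEvent]:
        for word in request.prompt.split():
            yield ModelEvent("token", word)
        yield ModelEvent("finish", "stop")


def collect(adapter: ModelAdapter, request: ModelRequest) -> tuple[str, str]:
    """Coordinator-side code: depends only on the event contract."""
    tokens: list[str] = []
    finish = ""
    for event in adapter.execute(request):
        if event.kind == "token":
            tokens.append(event.value)
        else:
            finish = event.value
    return " ".join(tokens), finish


text, finish_reason = collect(EchoAdapter(), ModelRequest("echo", "hello runtime"))
```

Swapping `EchoAdapter` for another implementation leaves `collect` unchanged, which is the property the invariant asks for.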
Context with trust enforcement
Context provider interface
Make source access typed, attributable, cancelable, pageable, and explicit about freshness and completeness.
describeSource() → ContextSourceCapabilities
query(ContextQuery query, AccessContext access, CancellationSignal cancellation) → ContextPage
fetch(ContextReference reference, AccessContext access) → ContextItem
- Contract must carry: Source and record references, version or ETag, access decision, timestamp, classification, provenance, redaction, and next-page cursor.
- Invariant: Retrieved content is untrusted input until policy, provenance, and injection defenses have been applied.
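Being "pageable and explicit about completeness" can be illustrated with a bounded cursor loop. The sketch below uses an in-memory dictionary as a stand-in source; `ContextPage`, `read_all`, and the cursor values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ContextPage:
    items: tuple[str, ...]
    next_cursor: Optional[str]  # None means the source has no further pages


def read_all(query: Callable[[Optional[str]], ContextPage],
             max_pages: int) -> tuple[list[str], bool]:
    """Return (items, complete); complete is False if the page bound was hit first."""
    items: list[str] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = query(cursor)
        items.extend(page.items)
        cursor = page.next_cursor
        if cursor is None:
            return items, True
    return items, False


# Stand-in source: three pages keyed by cursor.
PAGES = {
    None: ContextPage(("doc-1", "doc-2"), "c1"),
    "c1": ContextPage(("doc-3",), "c2"),
    "c2": ContextPage(("doc-4",), None),
}

partial_items, partial_complete = read_all(PAGES.__getitem__, max_pages=2)
all_items, all_complete = read_all(PAGES.__getitem__, max_pages=5)
```

Returning the completeness flag alongside the items lets the orchestrator apply a documented degraded-source policy instead of treating a truncated read as the full result.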
Execution bounded by trust
Tool broker interface
Separate model-proposed actions from deterministic validation, authorization, credential resolution, execution, and reconciliation.
listTools(AccessContext access) → ToolDescriptor[]
authorize(ToolCall call, AccessContext access) → ToolDecision
invoke(AuthorizedToolCall call, CancellationSignal cancellation) → ToolResult
reconcile(ToolExecutionId executionId) → SideEffectStatus
- Contract must carry: Tool name and version, JSON-compatible input and output schemas, required permission, side-effect class, timeout, idempotency behavior, and approval requirement.
- Invariant: No high-impact tool executes only because a model emitted a syntactically valid call.
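A minimal sketch of the authorize-then-invoke split follows. It assumes a simple per-actor grant table and hypothetical names (`ToolBroker`, `ToolCall`, `AuthorizedToolCall`). Python cannot stop other code from constructing `AuthorizedToolCall` directly, so the type check here is a convention rather than a security boundary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument_ref: str  # reference to validated arguments, not the raw payload


@dataclass(frozen=True)
class AuthorizedToolCall:
    call: ToolCall
    actor: str
    decision_ref: str


class ToolBroker:
    def __init__(self, grants: dict[str, set[str]]) -> None:
        self._grants = grants  # actor -> tools the actor may invoke
        self.audit: list[tuple[str, str, str]] = []

    def authorize(self, call: ToolCall, actor: str) -> Optional[AuthorizedToolCall]:
        allowed = call.tool in self._grants.get(actor, set())
        # Every decision is recorded, including denials.
        self.audit.append((actor, call.tool, "allow" if allowed else "deny"))
        if not allowed:
            return None
        return AuthorizedToolCall(call, actor, f"decision:{len(self.audit)}")

    def invoke(self, authorized: AuthorizedToolCall) -> str:
        # A bare model proposal (ToolCall) is rejected here.
        if not isinstance(authorized, AuthorizedToolCall):
            raise TypeError("invoke requires an AuthorizedToolCall")
        return f"invoked {authorized.call.tool} for {authorized.actor}"


broker = ToolBroker({"actor:agent-1": {"ticket.read"}})
proposal = ToolCall("ticket.delete", "args:demo")
denied = broker.authorize(proposal, "actor:agent-1")
granted = broker.authorize(ToolCall("ticket.read", "args:demo"), "actor:agent-1")
```

A syntactically valid `proposal` never reaches execution on its own; only the object returned by `authorize` is accepted by `invoke`.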
Context with trust and durable control
Memory manager interface
Distinguish ephemeral working state, durable workflow checkpoints, and reviewed long-term memory.
read(MemoryQuery query, AccessContext access) → MemorySnapshot
write(MemoryMutation mutation, AccessContext access, ExpectedVersion version) → MemoryWriteResult
checkpoint(WorkflowState state) → CheckpointReference
delete(MemoryReference reference, DeletionPolicy policy) → DeletionEvidence
- Contract must carry: Tenant and subject scope, memory type, provenance, author, expected version, retention class, sensitivity, and deletion semantics.
- Invariant: A generated statement is not durable memory until an explicit, policy-approved mutation succeeds.
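The `ExpectedVersion` parameter suggests optimistic concurrency. A minimal in-memory sketch, with hypothetical names (`MemoryStore`, `VersionConflict`) and no persistence or access checks, might look like this:

```python
from typing import Optional


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class MemoryStore:
    def __init__(self) -> None:
        self._records: dict[str, tuple[int, str]] = {}  # key -> (version, value)

    def read(self, key: str) -> tuple[int, Optional[str]]:
        # Version 0 means the key has never been written.
        return self._records.get(key, (0, None))

    def write(self, key: str, value: str, expected_version: int) -> int:
        current, _ = self.read(key)
        if current != expected_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current}")
        self._records[key] = (current + 1, value)
        return current + 1


store = MemoryStore()
v1 = store.write("tenant:example/prefs", "language=en", expected_version=0)
try:
    # A second writer still holding version 0 must not overwrite version 1.
    store.write("tenant:example/prefs", "language=fr", expected_version=0)
    conflict = False
except VersionConflict:
    conflict = True
```

The conflicting writer receives an explicit error rather than a false acknowledgement, and can re-read before retrying.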
Trust
Policy decision and enforcement interfaces
Keep policy evaluation independent from the code that actually blocks, constrains, or allows an operation.
decide(PolicyInput input) → PolicyDecision
enforce(PolicyDecision decision, ProposedOperation operation) → EnforcementResult
explain(DecisionReference reference) → DecisionExplanation
- Contract must carry: Policy and data versions, actor and workload identity, resource, action, risk, decision, obligations, expiration, and enforcement evidence.
- Invariant: A policy decision is incomplete until the enforcement point records whether each obligation was applied.
Policy decision and enforcement are separate responsibilities
Open Policy Agent explicitly separates policy decision-making from enforcement, while NIST zero-trust architecture distinguishes policy decision components from the policy enforcement point.[13][14] The runtime should therefore record both the decision and whether each obligation—redaction, approval, rate limit, restricted credential, or retention rule—was actually applied at the action boundary.
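The split between deciding and enforcing, with per-obligation evidence, can be sketched as follows. This is a toy illustration that does not use OPA or any real policy engine: `decide`, `enforce`, `redact`, and `demo-policy.v1` are hypothetical, and failing closed on an unapplied obligation is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PolicyDecision:
    allow: bool
    obligations: tuple[str, ...]
    policy_version: str


def decide(action: str) -> PolicyDecision:
    """Toy decision point: one action is allowed, with a redaction obligation."""
    if action == "customer-record.update":
        return PolicyDecision(True, ("redact-before-log",), "demo-policy.v1")
    return PolicyDecision(False, (), "demo-policy.v1")


def enforce(decision: PolicyDecision, payload: dict,
            appliers: dict[str, Callable[[dict], dict]]
            ) -> tuple[Optional[dict], dict[str, bool]]:
    """Enforcement point: apply each obligation and record whether it was applied."""
    evidence: dict[str, bool] = {}
    if not decision.allow:
        return None, evidence
    result = dict(payload)
    for obligation in decision.obligations:
        applier = appliers.get(obligation)
        evidence[obligation] = applier is not None
        if applier is not None:
            result = applier(result)
    if not all(evidence.values()):
        return None, evidence  # fail closed when any obligation was not applied
    return result, evidence


def redact(payload: dict) -> dict:
    return {**payload, "email": "[redacted]"}


decision = decide("customer-record.update")
record = {"id": 7, "email": "a@example.com"}
applied, evidence = enforce(decision, record, {"redact-before-log": redact})
blocked, missing = enforce(decision, record, {})
```

The second call shows the mismatch case: the decision says allow, but the enforcement point cannot apply the obligation, so the operation is blocked and the evidence records why.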
Operational separation
Control-plane decisions should not carry high-volume execution data
The control plane owns desired state, capability metadata, rollout, policy, quotas, routing rules, and recovery. The request or data plane applies a known version of those decisions to tokens, tensors, tool calls, context records, and user-visible results. The separation reduces blast radius and prevents every request from depending on a central management call.
| Operation | Control-plane responsibility | Request/data-plane responsibility | Architecture rule |
|---|---|---|---|
| Configuration and version rollout | Owns desired state, validation, staged rollout, rollback, and compatibility. | Consumes an activated, immutable or versioned configuration snapshot. | A live request should not observe a half-applied configuration. |
| Model and tool registration | Registers capabilities, versions, schemas, health policies, and deployment intent. | Invokes only activated versions and reports actual behavior and health. | Discovery metadata and execution behavior must be version-correlated. |
| Routing and admission | Defines policy, quotas, priorities, budgets, and fallback chains. | Applies the decision to each request and enforces queue or rejection behavior. | Route decisions require observed capacity and policy evidence, not static names alone. |
| Context access | Defines source catalog, access policy, freshness, retention, and budget. | Performs authorized reads, ranking, redaction, and bundle construction. | Policy metadata is control state; retrieved content is request data. |
| Inference execution | Selects model, engine, precision, region, deployment, and scheduler policy. | Queues, batches, executes, streams, and returns usage and status. | Do not put token or tensor payloads through a central management service. |
| Tool execution | Defines tool catalog, permissions, approval, egress, timeout, and retry policy. | Validates each call, resolves credentials, invokes, and reconciles effects. | The enforcement point belongs on the execution path, close to the side effect. |
| Autoscaling and recovery | Reconciles desired capacity, health, rollout, and replacement of failed workers. | Reports queue, latency, utilization, errors, and readiness; drains safely. | Recovery decisions should be based on workload and service objectives, not one utilization metric. |
| Telemetry policy | Defines schemas, sampling, redaction, retention, exporters, and audit requirements. | Emits correlated events and applies required local redaction before export. | Sensitive prompts and tool payloads should not be collected by default. |
Distributed inference adds another control boundary: an engine may divide tensor or pipeline work across devices or nodes, while serving infrastructure handles routing and lifecycle. vLLM documents single-node and multi-node parallel configurations, illustrating why placement and scaling policy are distinct from token execution.[9]
Execution timing
Synchronous gates, streams, queues, and durable workflows
“Asynchronous” is not a single architecture. Parallel source reads, streamed model output, queued work, durable long-running workflows, and eventual secondary indexing have different correctness and recovery requirements. The runtime must state which response acknowledges admission, which acknowledges a committed state change, and which merely reports progress.
| Responsibility | Execution mode | Why | Recovery contract |
|---|---|---|---|
| Authentication, authorization, and request validation | Synchronous gate | Execution cannot begin safely without a valid caller, tenant, contract, and authority. | Reject with a typed error; do not queue an invalid request. |
| Context retrieval from independent sources | Parallel asynchronous reads within a bounded synchronous phase | Parallelism reduces latency, but the coordinator still needs a complete or explicitly partial bundle before model execution. | Per-source timeout, cancellation, completeness flag, and documented degraded-source policy. |
| Interactive model generation | Synchronous admission with asynchronous streaming | The caller needs immediate acceptance and progressive output while the engine continues work. | Propagate cancellation, report partial completion, release reservations, and preserve finish reason. |
| Read-only low-latency tool call | Usually synchronous and cancelable | The result is needed to continue the current response and has no external side effect. | Bounded retry for classified transient errors; return explicit unavailable or partial state. |
| Privileged or irreversible tool action | Durable asynchronous activity with approval and idempotency | External effects may outlive the request and require reconciliation or human review. | Persist intent, idempotency key, approval, external IDs, attempt history, and compensation state. |
| Long-running agent task | Durable workflow | The task must survive process, node, network, and approval delays without holding an HTTP request open. | Checkpoint state, version workflow definitions, replay deterministic decisions, and isolate side effects in idempotent activities. |
| Memory write | Synchronous acknowledgement for required state; durable outbox for derived indexes | The caller must know whether canonical state committed, while secondary indexing can lag. | Optimistic version check, conflict response, durable outbox, and no false acknowledgement. |
| Telemetry and evaluation | Asynchronous by default; synchronous only for mandatory audit or policy gates | Observability should not dominate latency, but required evidence may be part of the transaction. | Buffer or durable outbox; fail closed only for explicitly mandatory records. |
A durable workflow system records enough history to recover long-running execution after process or infrastructure failures.[19] That does not make every activity execute exactly once. External side effects still require idempotency, fencing, reconciliation, or compensation because retries and ambiguous outcomes remain possible.
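As a rough illustration of why an idempotency key matters on retry, the sketch below caches results by key so a repeated attempt does not repeat the side effect. `IdempotentActivityRunner` and `send_refund` are hypothetical, and the store is in memory. A production system would need a durable store, and would still have to reconcile the ambiguous case where the process fails after the side effect but before the result is recorded.

```python
from typing import Callable


class IdempotentActivityRunner:
    def __init__(self) -> None:
        self._completed: dict[str, str] = {}  # idempotency key -> recorded result

    def run(self, idempotency_key: str, activity: Callable[[], str]) -> str:
        # A retry with a known key returns the recorded result without re-executing.
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]
        result = activity()
        self._completed[idempotency_key] = result
        return result


calls: list[str] = []


def send_refund() -> str:
    calls.append("refund")  # stands in for an external side effect
    return f"refund-{len(calls)}"


runner = IdempotentActivityRunner()
first = runner.run("idem:order-17", send_refund)
retry = runner.run("idem:order-17", send_refund)
other = runner.run("idem:order-18", send_refund)
```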
Evidence model
Trace events should explain authority, versions, decisions, and outcomes
OpenTelemetry defines traces as collections of spans with parent-child relationships, span context, attributes, events, links, and status.[16] W3C Trace Context standardizes propagation fields across service boundaries, and CloudEvents defines a portable event envelope.[17][18] A runtime-specific schema can build on those standards without storing raw prompts, credentials, personal data, or tool payloads by default.
The example below records a tool-authorization event. It references the request, actor, tenant, contract, tool version, side-effect class, policy version, obligations, duration, and redaction state. Production implementations should use opaque references or controlled evidence stores for sensitive payloads.
Minimum correlation fields
- Trace, span, parent, event, request, workflow, and attempt identifiers
- Actor, tenant, workload, delegated authority, and contract versions
- Plane, component, model route, tool version, policy decision, and state transition
- Queue, execution, tool, retrieval, evaluation, cost, and outcome timing
- Redaction, retention, sampling, error classification, and evidence references
Conceptual trace event
{
  "schemaVersion": "runtime.trace-event.v1",
  "traceId": "8f13d9c2f90e4be5a7e37e52ca3d314b",
  "spanId": "20af4c4becc041e2",
  "parentSpanId": "8d2ee7cc1bc14cb0",
  "eventId": "evt_01JARCHITECTURE",
  "timestampUtc": "2026-06-20T18:42:13.418Z",
  "eventType": "tool.authorization.completed",
  "plane": "trust",
  "component": "tool-broker",
  "attempt": 1,
  "requestRef": "req_01JREFERENCE",
  "actorRef": "actor:internal-user-42",
  "tenantRef": "tenant:example",
  "contractVersion": "runtime.request.v2",
  "operation": {
    "tool": "customer-record.update@3",
    "sideEffectClass": "reversible-write",
    "idempotencyKeyRef": "idem:5f22…"
  },
  "policy": {
    "decision": "allow-with-obligations",
    "policyVersion": "tool-access.2026-06-18",
    "obligations": [
      "redact-before-log",
      "retain-audit-365d"
    ]
  },
  "durationMs": 7,
  "status": "ok",
  "redactionStatus": "payload-references-only"
}
Synthetic identifiers and values. Payloads are represented by references rather than embedded sensitive content.
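The trace, span, and parent identifiers in such an event can be derived from a propagated W3C `traceparent` header. The sketch below parses only version `00` of that header and builds the correlation fields of a child event. `parse_traceparent`, `child_event`, and `TraceContext` are hypothetical helper names, and the header value is synthetic.

```python
import re
from dataclasses import dataclass
from typing import Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version != "00":
        return None  # this sketch handles only version 00
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero identifiers are invalid
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_event(ctx: TraceContext, event_type: str, new_span_id: str) -> dict:
    """Build the correlation fields only; payloads stay in referenced stores."""
    return {
        "schemaVersion": "runtime.trace-event.v1",
        "traceId": ctx.trace_id,
        "spanId": new_span_id,
        "parentSpanId": ctx.span_id,
        "eventType": event_type,
    }


ctx = parse_traceparent("00-8f13d9c2f90e4be5a7e37e52ca3d314b-8d2ee7cc1bc14cb0-01")
event = child_event(ctx, "tool.authorization.completed", "20af4c4becc041e2")
```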
Workload identity should also be portable across service boundaries. SPIFFE defines workload identities and trust domains that can support short-lived, verifiable service identity instead of distributing long-lived shared secrets.[15]
Topology
Single-node, distributed, edge/cloud, and agentic variants
The logical planes remain useful even when physical deployment changes. The objective is not to maximize service count. It is to preserve contracts, observability, policy enforcement, and recoverability when components move between processes, hosts, regions, or devices.
Single-node modular runtime
All planes share one host but remain separated by modules, contracts, queues, and local policy boundaries.
Typical components
- API and request boundary
- Coordinator and router
- Local context adapters and state store
- Model adapter plus local or remote inference endpoint
- Tool broker and policy engine
- Local trace exporter and evaluation queue
Strengths
Low network complexity, simple debugging, and a strong fit for development, appliances, desktops, regulated local deployments, and moderate workloads.
Risks
Shared process or host blast radius, resource contention, limited horizontal scaling, and a temptation to bypass internal contracts.
Distributed runtime
Planes become independently deployable services or worker pools connected by versioned APIs, queues, events, and trace context.
Typical components
- Redundant ingress and runtime coordinators
- Capability, configuration, and policy control services
- Context, retrieval, and memory services
- Inference gateways, schedulers, replicas, and multi-node engines
- Tool workers and durable workflow workers
- Telemetry collectors, evaluators, and review services
Strengths
Independent scaling, fault isolation, regional placement, specialized hardware pools, rolling upgrades, and stronger ownership boundaries.
Risks
Network partitions, version skew, duplicate delivery, distributed cancellation, higher observability demands, and larger identity and secret surface.
Edge/cloud runtime
A constrained edge execution plane combines with cloud control, context, model distribution, evaluation, and optional fallback.
Typical components
- Signed and versioned model or program packages
- Device capability and health reporting
- Local inference, preprocessing, cache, and policy subset
- Cloud model registry, rollout, telemetry, and fleet coordination
- Explicit data-residency and fallback rules
Strengths
Low latency, offline operation, local privacy, bandwidth reduction, and use of device accelerators.
Risks
Intermittent connectivity, delayed policy or model rollout, hardware fragmentation, constrained telemetry, thermal limits, and recovery of partially applied updates.
Agentic runtime deployment
The agentic layer coordinates long-running tasks across model serving, context, tools, memory, policy, evaluation, and human review.
Typical components
- Versioned task and response contracts
- Durable coordinator and checkpointed workflow state
- Context and memory managers with provenance
- Model and tool adapters behind deterministic policy gates
- Evaluation, replay, cost, and review records
Strengths
Explicit authority, recoverable state, controlled tools, model portability, reviewable decisions, and end-to-end evidence.
Risks
Compounding error, excessive agency, state poisoning, ambiguous side effects, runaway cost, stale authority, and non-reproducible context.
Single-node does not mean unstructured
A desktop, appliance, or private host can preserve adapter interfaces, local queues, policy boundaries, versioned state, and trace events. Keeping logical seams inside one process makes later distribution possible without requiring premature network services.
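One way to picture a logical seam inside a single process is sketched below: a coordinator that depends only on an adapter interface, takes work from a local queue, and emits trace records. All class and field names (`ModelAdapter`, `EchoAdapter`, `Coordinator`, the trace keys) are hypothetical and exist only for this example.

```python
import queue
from typing import Protocol


class ModelAdapter(Protocol):
    """The contract the coordinator depends on, not a concrete engine."""

    def generate(self, prompt: str) -> str: ...


class EchoAdapter:
    """Stand-in for a local or remote inference endpoint."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class Coordinator:
    def __init__(self, adapter: ModelAdapter) -> None:
        self._adapter = adapter
        self._inbox: queue.Queue = queue.Queue()  # local queue boundary
        self.trace: list[dict[str, str]] = []      # local trace events

    def submit(self, request_ref: str, prompt: str) -> None:
        self._inbox.put((request_ref, prompt))

    def drain(self) -> dict[str, str]:
        """Process every queued request and return results by request reference."""
        results: dict[str, str] = {}
        while not self._inbox.empty():
            request_ref, prompt = self._inbox.get()
            self.trace.append(
                {"requestRef": request_ref, "component": "coordinator", "eventType": "model.call"}
            )
            results[request_ref] = self._adapter.generate(prompt)
        return results
```

Because the coordinator sees only the `ModelAdapter` protocol, the echo adapter could later be swapped for a client that calls a remote inference service without changing coordinator code.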
Edge execution changes where trust is enforced
ExecuTorch separates ahead-of-time preparation from a small on-device runtime and supports delegation to hardware-specific backends.[20] WebNN defines a hardware-agnostic graph API for browser execution.[21] In both cases, package integrity, capability detection, local data policy, offline behavior, and delayed telemetry become part of the architecture.
Resilience
Failure domains and containment rules
Reliability depends less on whether a component can retry and more on whether the runtime understands what failed, what may have already happened, and what evidence is available. A provider timeout before any output differs from a partial stream. A read differs from an irreversible write. A stale control snapshot differs from a corrupt canonical record.
| Failure domain | Potential blast radius | Detection | Containment and recovery | Rule |
|---|---|---|---|---|
| Control configuration and routing | Wrong model, region, tool, quota, or policy may affect many requests. | Config validation, staged rollout, route-diff tests, policy simulation, and outcome monitoring. | Immutable versions, canary, automatic rollback, known-good snapshot, and bounded blast radius. | Never edit shared live configuration without versioning and activation state. |
| Context source or index | Missing, stale, cross-tenant, or poisoned context can degrade correctness and privacy. | Source freshness, provenance checks, access tests, retrieval evaluation, and anomaly detection. | Per-source isolation, safe partial mode, source disable switch, cache invalidation, and reindex workflow. | Keep canonical data ownership outside the prompt and record every source reference. |
| Inference engine or provider | Requests fail, slow down, return partial streams, or consume excess memory and capacity. | Readiness, queue and latency metrics, engine errors, device health, output validation, and synthetic probes. | Circuit breaker, drain, replica replacement, compatible fallback, admission reduction, and cache cleanup. | A fallback must preserve the output contract and trust constraints, not only availability. |
| Tool and external side effect | Money, access, messages, or records may be changed twice or left in an unknown state. | Idempotency records, external transaction IDs, reconciliation reads, timeout classification, and audit events. | Pause workflow, reconcile external state, compensate when possible, and require review for ambiguity. | Never retry an ambiguous write blindly. |
| Workflow and queue | Tasks may stall, duplicate, replay incorrectly, or miss approvals and deadlines. | Queue age, event-history progress, heartbeats, attempt counts, poison-message detection, and version checks. | Dead-letter path, workflow reset or migration, idempotent activities, bounded retries, and operator tooling. | Persist decision state; do not infer progress from logs alone. |
| Identity, policy, or secrets | Unauthorized access, confused deputy behavior, data exfiltration, or unavailable execution. | Decision and enforcement correlation, workload attestation, secret-access audit, policy tests, and expiry alarms. | Fail closed for privileged actions, revoke credentials, isolate workload, freeze high-risk tools, and preserve evidence. | Treat actor identity, workload identity, and delegated authority as separate facts. |
| Telemetry and evaluation | Incidents become invisible, audit evidence is lost, or sensitive content leaks into observability systems. | Exporter health, dropped-span metrics, schema validation, redaction tests, retention checks, and cardinality alarms. | Durable buffer, local minimal audit, exporter failover, payload suppression, and incident notification. | Define which evidence is mandatory before production traffic. |
| Network partition and version skew | Components disagree about capabilities, policy, workflow state, or request ownership. | Compatibility handshake, lease and epoch checks, trace gaps, stale-config alarms, and duplicate-owner detection. | Fencing tokens, idempotency, quorum or single-writer rules, backward-compatible contracts, and controlled degradation. | Assume retries and duplicate delivery across network boundaries. |
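The rule about ambiguous writes can be expressed as a small decision function. The sketch below is one possible policy, not a prescribed one: the enum names, the attempt limit, and the assumption that a reversible write may be retried when the downstream system honors an idempotency key are all choices made for this example.

```python
from enum import Enum


class SideEffect(Enum):
    READ = "read"
    REVERSIBLE_WRITE = "reversible-write"
    IRREVERSIBLE_WRITE = "irreversible-write"


class Failure(Enum):
    REJECTED_BEFORE_EXECUTION = "rejected-before-execution"  # known not to have run
    TIMEOUT_UNKNOWN_OUTCOME = "timeout-unknown-outcome"      # may or may not have run


class Action(Enum):
    RETRY = "retry"
    RETRY_WITH_IDEMPOTENCY_KEY = "retry-with-idempotency-key"
    RECONCILE_THEN_DECIDE = "reconcile-then-decide"
    ESCALATE_TO_REVIEW = "escalate-to-review"


def next_action(
    effect: SideEffect,
    failure: Failure,
    attempt: int,
    max_attempts: int = 3,
    has_idempotency_key: bool = False,
) -> Action:
    """Pick the next step after a failed tool call (illustrative policy)."""
    if attempt >= max_attempts:
        return Action.ESCALATE_TO_REVIEW  # bounded retries
    if effect is SideEffect.READ or failure is Failure.REJECTED_BEFORE_EXECUTION:
        return Action.RETRY  # nothing can have changed externally
    # From here on the call is a write with an unknown outcome.
    if effect is SideEffect.IRREVERSIBLE_WRITE:
        return Action.RECONCILE_THEN_DECIDE
    if has_idempotency_key:
        # Assumes the downstream system deduplicates on the key.
        return Action.RETRY_WITH_IDEMPOTENCY_KEY
    return Action.RECONCILE_THEN_DECIDE
```

The important property is that the function never returns a plain retry for a write whose outcome is unknown.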
Fallback is a trust decision, not only an availability decision
A fallback route must preserve the output contract, data boundary, identity, tool permissions, policy obligations, and evidence requirements. Sending a sensitive request to a healthy but unauthorized region or weaker execution path is not graceful degradation.
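A minimal sketch of this rule follows, assuming a simplified route model with only region, output contract, and tool permissions. The `Route` and `RequestConstraints` types and their fields are hypothetical; a real check would also cover identity, policy obligations, and evidence requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    region: str
    output_contract: str
    tools: frozenset[str]
    healthy: bool


@dataclass(frozen=True)
class RequestConstraints:
    allowed_regions: frozenset[str]
    output_contract: str
    required_tools: frozenset[str]


def eligible_fallbacks(constraints: RequestConstraints, candidates: list[Route]) -> list[Route]:
    """Keep only routes that are healthy AND satisfy every trust constraint."""
    return [
        route
        for route in candidates
        if route.healthy
        and route.region in constraints.allowed_regions
        and route.output_contract == constraints.output_contract
        and constraints.required_tools <= route.tools  # subset check
    ]
```

If the returned list is empty, the request should fail or degrade explicitly rather than be sent to a healthy route that violates a constraint.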
Portability
Replaceability requires owned semantics and migration evidence
A thin wrapper around a vendor SDK provides useful isolation, but it does not provide complete portability. The owned contract must cover behavior: capability discovery, default settings, output ordering, streaming, cancellation, usage, error classes, side effects, version compatibility, and audit evidence. Data and workflow state must remain exportable, and a replacement must pass replay and conformance fixtures.
Own the canonical contracts
Keep runtime request, response, tool, context, memory, policy, and trace schemas under product or platform governance rather than adopting one vendor response as the internal domain model.
Version behavior, not only endpoints
Contract versions must cover semantics, defaults, error classes, cancellation, ordering, side effects, and compatibility—not just a URL or method signature.
Negotiate capabilities
Adapters should declare supported modalities, schemas, limits, streaming, tools, precision, locality, and cancellation instead of forcing every provider into the same least-common-denominator behavior.
Isolate vendor extensions
Provider-specific optimizations belong behind the adapter and may be exposed as optional capabilities, not spread through coordinator, workflow, and product code.
Keep data portable
Retain source records, provenance, memory, checkpoints, evaluations, and audit data in documented schemas with export, retention, and deletion paths.
Use portable identity and correlation
Propagate stable actor, tenant, workload, request, trace, and idempotency identifiers across adapters and infrastructure boundaries.
Test with replay and conformance fixtures
A replacement adapter should pass recorded contract fixtures, failure cases, policy expectations, numerical or semantic tolerances, and side-effect reconciliation tests.
Plan migration and coexistence
Support shadow traffic, canaries, dual writes only when safe, backward-compatible reads, and explicit rollback while old and new components coexist.
The Model Context Protocol demonstrates capability negotiation and explicit host, client, and server roles for tools, resources, and prompts.[12] It can be one integration boundary inside the context or tool architecture, but it does not replace product identity, policy, workflow, memory, evaluation, or model-serving responsibilities.
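Capability negotiation in an owned adapter contract might look like the sketch below. It is not the Model Context Protocol's API or any vendor's API; the `Capabilities` and `Requirements` types, their fields, and the adapter names in the usage note are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    """What an adapter declares it supports."""
    modalities: frozenset[str]
    streaming: bool
    cancellation: bool
    max_input_tokens: int


@dataclass(frozen=True)
class Requirements:
    """What a specific request needs."""
    modalities: frozenset[str]
    needs_streaming: bool
    needs_cancellation: bool
    input_tokens: int


def unmet_requirements(req: Requirements, caps: Capabilities) -> list[str]:
    """List every requirement the adapter cannot satisfy."""
    unmet: list[str] = []
    missing = req.modalities - caps.modalities
    if missing:
        unmet.append("modalities: " + ", ".join(sorted(missing)))
    if req.needs_streaming and not caps.streaming:
        unmet.append("streaming")
    if req.needs_cancellation and not caps.cancellation:
        unmet.append("cancellation")
    if req.input_tokens > caps.max_input_tokens:
        unmet.append("input size")
    return unmet


def select_adapter(req: Requirements, adapters: dict[str, Capabilities]) -> Optional[str]:
    """Return the first adapter (in insertion order) that meets every requirement."""
    for name, caps in adapters.items():
        if not unmet_requirements(req, caps):
            return name
    return None
```

Returning the list of unmet requirements, rather than a bare boolean, gives the control plane something to log when no adapter qualifies.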
Implementation review
Reference architecture checklist
Boundaries and contracts
- Identify the layer and plane owned by each component and document any delegated responsibilities.
- Version request, response, context, model, tool, memory, policy, trace, and error contracts.
- Define capability discovery, compatibility, deprecation, and fallback behavior.
- Carry actor, tenant, workload, request, trace, deadline, budget, and idempotency identifiers explicitly.
Control and execution
- Separate desired-state management from high-volume request and tensor or token paths.
- Make admission, routing, queueing, batching, cancellation, retry, and load-shedding policies observable.
- Use known-good snapshots, staged rollout, rollback, and compatibility checks for control changes.
- Document which work is synchronous, streaming, queued, durable, interruptible, or human-gated.
Context and state
- Keep canonical records in owned systems of record and emit provenance-aware context references.
- Enforce source-level access, freshness, redaction, tenant isolation, and token or byte budgets.
- Distinguish working state, durable checkpoints, and long-term memory with separate retention rules.
- Use optimistic versions or transactions for state changes and never acknowledge an uncommitted write.
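The optimistic-version item in the list above can be illustrated with an in-memory compare-and-set store. This is a single-process sketch with hypothetical names (`StateStore`, `VersionConflict`); a production store would need durable, atomic writes.

```python
from typing import Any


class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class StateStore:
    def __init__(self) -> None:
        # key -> (version, value); a missing key is treated as version 0.
        self._records: dict[str, tuple[int, dict[str, Any]]] = {}

    def read(self, key: str) -> tuple[int, dict[str, Any]]:
        version, value = self._records.get(key, (0, {}))
        return version, dict(value)  # copy so callers cannot mutate stored state

    def write(self, key: str, expected_version: int, value: dict[str, Any]) -> int:
        """Commit only if nobody else wrote since `expected_version` was read."""
        current_version, _ = self._records.get(key, (0, {}))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._records[key] = (new_version, dict(value))
        # The new version is returned (acknowledged) only after the commit above.
        return new_version
```

A caller reads the version, computes the change, and writes with that version; on `VersionConflict` it re-reads and decides whether the change still applies.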
Trust and side effects
- Enforce controls on privileged tool calls outside the prompt, using typed schemas, identity, policy, egress controls, approval, and rate limits.
- Separate policy decision from enforcement and record whether obligations were applied.
- Use workload identity, short-lived credentials, secret references, and least privilege for service-to-service calls.
- Classify read, reversible write, irreversible write, and externally ambiguous operations before defining retries.
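The separation of policy decision from enforcement, including a record of whether obligations were applied, could be sketched as follows. The `Decision` and `EnforcementRecord` types and the handler registry are hypothetical; the decision values mirror the conceptual trace event, and failing closed on an unknown obligation is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Decision:
    """Output of a policy decision point; it does not act on anything itself."""
    effect: str  # "allow", "deny", or "allow-with-obligations"
    policy_version: str
    obligations: tuple[str, ...] = ()


@dataclass
class EnforcementRecord:
    """What the enforcement point actually did, kept as audit evidence."""
    allowed: bool
    policy_version: str
    obligations_applied: list[str] = field(default_factory=list)
    obligations_unmet: list[str] = field(default_factory=list)


def enforce(decision: Decision, handlers: dict[str, Callable[[], None]]) -> EnforcementRecord:
    """Apply a decision at the enforcement point; fail closed on unmet obligations."""
    if decision.effect == "deny":
        return EnforcementRecord(False, decision.policy_version)
    unmet = [name for name in decision.obligations if name not in handlers]
    if unmet:
        # No handler is run when any obligation cannot be honored.
        return EnforcementRecord(False, decision.policy_version, [], unmet)
    for name in decision.obligations:
        handlers[name]()
    return EnforcementRecord(True, decision.policy_version, list(decision.obligations), [])
```

The enforcement record, not the decision alone, is what should be written to the audit trail, because it shows what was actually applied.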
Observability and recovery
- Propagate standard trace context and emit plane, component, contract, route, model, tool, policy, state, evaluation, and outcome attributes.
- Redact sensitive prompts, credentials, personal data, and tool payloads before export; define retention and access.
- Map every failure domain to detection, containment, safe retry, reconciliation, rollback, and operator evidence.
- Maintain replay fixtures, conformance suites, chaos and failure-injection tests, and post-deployment runbooks.
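For the trace-context item in the list above, the sketch below formats and parses a W3C `traceparent` header value for version `00` (version, trace ID, parent span ID, and flags, separated by hyphens). It is a simplified helper written for this article, not a replacement for a tracing SDK, and the function names and returned dictionary keys are assumptions.

```python
import re
from typing import Optional

# version "00": 2 hex, 32 hex trace-id, 16 hex parent-id, 2 hex flags
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent value for an outgoing call."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed fields, or None when the header is not usable."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "traceId": trace_id,
        "parentSpanId": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

When a header is missing or invalid, the receiving component would normally start a new trace instead of failing the request.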
NIST’s AI Risk Management Framework treats trustworthiness as a lifecycle concern spanning design, development, deployment, use, and evaluation.[22] The checklist therefore includes operating evidence, recovery, retention, and review—not only component diagrams.
FAQ
Frequently asked questions
What is the difference between a layer and a plane?
A layer locates a responsibility in the execution stack, such as inference or serving. A plane groups operational responsibilities that cut across layers, such as control, context, execution, or trust. One component can sit primarily in a layer while participating in several planes.
Is the control plane outside the request path?
Management operations such as configuration, rollout, and autoscaling should be separated from the high-volume data path. Per-request admission, routing, deadline, and workflow decisions still apply control-plane policy on the request path, preferably from a local or versioned snapshot rather than a remote management dependency for every decision.
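The idea of deciding from a local, versioned snapshot can be sketched as a holder that activates a new snapshot only when it is newer and valid, and otherwise keeps the known-good one. The snapshot shape (`version`, `routes`) and the names `SnapshotHolder` and `has_default_route` are assumptions for this example.

```python
from typing import Any, Callable

Snapshot = dict[str, Any]


def has_default_route(snapshot: Snapshot) -> bool:
    """Illustrative validation: a snapshot must define a default route."""
    return "default" in snapshot.get("routes", {})


class SnapshotHolder:
    def __init__(self, initial: Snapshot, validate: Callable[[Snapshot], bool]) -> None:
        if not validate(initial):
            raise ValueError("initial snapshot is invalid")
        self._active = initial
        self._validate = validate

    @property
    def active(self) -> Snapshot:
        """The snapshot used for per-request decisions."""
        return self._active

    def offer(self, candidate: Snapshot) -> bool:
        """Activate a candidate only if it is newer and valid; otherwise keep known-good."""
        if candidate.get("version", 0) <= self._active["version"]:
            return False
        if not self._validate(candidate):
            return False
        self._active = candidate
        return True
```

Per-request routing reads `holder.active` locally, so an unreachable or misbehaving management service delays updates but does not block or corrupt request handling.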
Can the trust plane be implemented as middleware?
Some controls can be middleware, but trust is broader than one interceptor. Identity, policy, schema validation, secrets, tool authorization, workload isolation, evaluation, audit, redaction, retention, and human approval occur at different boundaries. Each enforcement point must sit close enough to the action it controls that the action cannot bypass it.
Is a model adapter the same as a model server?
No. A model adapter is an internal contract boundary that normalizes one or more providers, engines, or servers. A model server is a deployable component that exposes inference through a protocol and owns queueing, scheduling, model availability, and responses.
Where should context and memory live?
Canonical domain records stay in systems of record. The context plane reads and packages authorized evidence for the current task. Working state and durable workflow checkpoints have explicit schemas. A write to long-term memory is an approved, provenance-aware operation with retention and deletion semantics, not an automatic copy of model output.
When should an agent task become asynchronous?
Use durable asynchronous execution when work outlives an interactive deadline, waits for humans or external events, performs privileged side effects, needs reliable retry or compensation, or must survive process and node failures. Return a continuation or review reference instead of holding an HTTP request open.
How does a distributed runtime avoid duplicate work?
It cannot assume exactly-once transport. Use stable request and activity identifiers, idempotency keys, fencing or ownership epochs, durable state transitions, deduplication, reconciliation, and side-effect-specific retry rules.
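A compact way to see idempotency keys and ownership epochs working together is sketched below. It is an in-memory, single-process illustration with hypothetical names (`ActivityLedger`, `StaleOwner`); a real runtime would keep the ledger in a durable store and make the check-and-record step atomic.

```python
from typing import Any, Callable


class StaleOwner(Exception):
    """Raised when a caller holds an older ownership epoch than one already seen."""


class ActivityLedger:
    def __init__(self) -> None:
        self._results: dict[str, Any] = {}  # idempotency key -> recorded result
        self._highest_epoch = 0             # highest ownership epoch observed

    def run_once(self, idempotency_key: str, owner_epoch: int, action: Callable[[], Any]) -> Any:
        """Run `action` at most once per key, and only for the current owner."""
        if owner_epoch < self._highest_epoch:
            # Fencing: a newer owner has taken over, so the old owner must stop.
            raise StaleOwner(f"epoch {owner_epoch} is older than {self._highest_epoch}")
        self._highest_epoch = owner_epoch
        if idempotency_key in self._results:
            # Duplicate delivery: return the recorded result without re-running.
            return self._results[idempotency_key]
        result = action()
        self._results[idempotency_key] = result
        return result
```

In this sketch a crash between running the action and recording the result would still leave the outcome ambiguous, which is why reconciliation and side-effect-specific retry rules remain necessary.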
What makes a runtime component replaceable?
A stable owned contract, explicit capabilities, isolated vendor extensions, portable identity and telemetry, documented data ownership, deterministic error semantics, replay and conformance fixtures, and a migration plan. Merely wrapping an SDK does not guarantee portability.
Sources and further reading
This reference architecture synthesizes official project documentation, specifications, and government guidance. Product examples clarify responsibility boundaries; inclusion is not an endorsement or performance ranking.
- Kubernetes Components. Kubernetes project; Official project documentation. Defines the cluster control plane and worker-node components. Accessed 2026-06-21 UTC.
- Controllers. Kubernetes project; Official project documentation. Explains control loops that reconcile current state toward desired state. Accessed 2026-06-21 UTC.
- KServe System Architecture Overview. KServe project; Official project documentation. Separates management in the control plane from inference execution in the data plane. Accessed 2026-06-21 UTC.
- KServe Control Plane. KServe project; Official project documentation. Describes lifecycle and management responsibilities independent from the inference data plane. Accessed 2026-06-21 UTC.
- KServe Data Plane. KServe project; Official project documentation. Describes high-performance execution of prediction, generation, transformation, and explanation requests. Accessed 2026-06-21 UTC.
- ONNX Runtime Architecture. ONNX Runtime project; Official project documentation. Execution-provider abstraction, provider-independent optimization, graph partitioning, fallback, and heterogeneous execution. Accessed 2026-06-21 UTC.
- Triton Architecture. NVIDIA; Official project documentation. Model repository, protocols, per-model schedulers, batching, backends, and response flow. Accessed 2026-06-21 UTC.
- Ray Serve Architecture. Ray project; Official project documentation. Proxy, controller, replicas, request flow, deployment state, and component fault tolerance. Accessed 2026-06-21 UTC.
- Parallelism and Scaling. vLLM project; Official project documentation. Single-node and multi-node tensor and pipeline parallel inference and serving. Accessed 2026-06-21 UTC.
- OpenAPI Specification 3.2.0. OpenAPI Initiative; Standards specification. Language-neutral description of HTTP API capabilities and interaction contracts. Accessed 2026-06-21 UTC.
- JSON Schema Specification. JSON Schema project; Standards specification. Specifications and vocabularies for describing and validating JSON instance data. Accessed 2026-06-21 UTC.
- Model Context Protocol Architecture Overview. Model Context Protocol project; Official protocol documentation. Host, client, and server roles, lifecycle, capabilities, tools, resources, and prompts. Accessed 2026-06-21 UTC.
- Open Policy Agent. Open Policy Agent project; Official project documentation. Decouples policy decision-making from policy enforcement using structured input and decisions. Accessed 2026-06-21 UTC.
- Zero Trust Architecture, NIST SP 800-207. National Institute of Standards and Technology; Government standard. Defines policy engine, policy administrator, policy enforcement point, identity, and continuous access decisions in zero-trust architecture. Accessed 2026-06-21 UTC.
- SPIFFE Concepts. SPIFFE project; Official specification guidance. Portable workload identities, trust domains, and verifiable identity documents. Accessed 2026-06-21 UTC.
- Traces. OpenTelemetry project; Official project documentation. Trace, span, parent-child relationship, span context, attributes, events, links, and status concepts. Accessed 2026-06-21 UTC.
- Trace Context. W3C Distributed Tracing Working Group; W3C Recommendation. Vendor-neutral traceparent and tracestate propagation across service boundaries. Accessed 2026-06-21 UTC.
- CloudEvents Specification. CloudEvents project / CNCF; Official specification. Common event-description envelope for portable event declaration and delivery. Accessed 2026-06-21 UTC.
- What is Temporal? Temporal Technologies; Official project documentation. Durable workflow execution, event history, workers, recovery, and long-running application state. Accessed 2026-06-21 UTC.
- ExecuTorch Concepts. PyTorch / ExecuTorch project; Official project documentation. Ahead-of-time edge preparation, delegation, memory planning, runtime execution, and portable kernels. Accessed 2026-06-21 UTC.
- Web Neural Network API. W3C Web Machine Learning Working Group; W3C specification. Hardware-agnostic neural-network graph execution for web applications through platform capabilities. Accessed 2026-06-21 UTC.
- NIST AI Risk Management Framework. National Institute of Standards and Technology; Government framework. Voluntary framework for incorporating trustworthiness considerations into AI design, development, use, and evaluation. Accessed 2026-06-21 UTC.
Last reviewed: 2026-06-23 UTC
