Key takeaways
- An agent framework expresses flows; a production runtime must also own authority, state, durability, policy, audit, and recovery.
- Tool descriptions are not permissions. High-impact actions require deterministic authorization and approval outside prompt text.
- Working, session, and long-term memory have different lifetimes, trust, retention, and review requirements.
- Durable execution must survive process and network failure without duplicating irreversible side effects.
- MCP standardizes tool/resource exchange, but product identity, policy, security, and approval remain runtime responsibilities.
- Replayable traces and evaluation gates are required to investigate and improve long-running behavior.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Actor, tenant, normalized task, purpose/risk, permitted context classes, model constraints, tool allowlist, memory policy, budget, deadline, and approvals.
Owns
Task state, context assembly, routing, tool brokerage, policy checkpoints, approvals, durable workflow, memory lifecycle, evaluation, and traceability.
Emits
Structured result, evidence, tool outcomes, policy decisions, memory changes, human-review state, replay handle, cost/timing, and compensation status.
Does not own
Authority not explicitly delegated, unrestricted database access, or permission inferred solely from model output.
Failure modes
Prompt injection, confused deputy, unauthorized tool use, duplicate side effects, memory poisoning, infinite loops, context leakage, and stale state.
Evidence and metrics
Task success, tool success, approvals, policy denies, retries, steps, context provenance, cost, latency, escalations, memory writes, and replay completeness.
Framework versus runtime
Frameworks compose prompts, nodes, or model-directed graphs. A production runtime establishes identity, authority, durable state, policy, tools, evaluation, and recovery.
Implementation
Map every framework component to a runtime responsibility and identify external services such as workflow, policy, secrets, and observability.
Operational implications
Do not claim production durability or security from a graph API alone.
Measure
Durable completion, recovery, policy coverage, tool errors, and audit completeness.
Request boundary and authority
Every run needs an actor, tenant, purpose, task, risk, data policy, tool permissions, budget, deadline, and approval rules.
Implementation
Use a versioned request envelope. Verify identity and resolve delegated authority before context or tools are exposed.
Operational implications
Downstream services must not trust prompt claims or user-controlled tenant fields.
Measure
Boundary validation failures, permission scope, policy decisions, and rejected requests.
Context assembly and provenance
Context providers return content with source, classification, freshness, tenant scope, and retrieval rationale.
Implementation
Assemble the smallest useful context, label untrusted content as data, and record included/rejected sources.
Operational implications
Avoid raw database access when business definitions require a semantic layer or typed domain API.
Measure
Context bytes/tokens, source count, freshness, classification blocks, citations, and retrieval latency.
MCP tools and resources
MCP defines client/server exchange for tools, resources, prompts, and capabilities.
Implementation
Expose only permitted servers/capabilities; validate schemas, URI/resource scope, lifecycle, and server identity.
Operational implications
Protocol interoperability does not grant business authorization or prove server safety.
Measure
Capabilities negotiated, tool/resource errors, server identity, and protocol version.
Tool brokerage
The broker converts a model proposal into a deterministic, authorized, idempotent execution.
Implementation
Validate schema and target, authorize, apply budgets/rate limits, require approval, execute with timeout, validate output, redact, and audit.
Operational implications
Do not let the model choose credentials or infer write authority.
Measure
Validation/authorization/approval latency, tool success, retries, side-effect class, and denials.
Memory scopes
Working memory serves one run; session memory spans a conversation/case; long-term memory crosses sessions; systems of record remain authoritative.
Implementation
Use explicit schemas, provenance, confidence, tenant, owner, expiry, review, deletion, and conflict policy.
Operational implications
Never write arbitrary model text directly into durable memory.
Measure
Reads/writes by scope, approvals, expiry/deletion, conflicts, poisoning detections, and hit value.
Durable execution
Long tasks persist versioned state after meaningful transitions and resume after process or dependency failure.
Implementation
Use workflow checkpoints, timers, idempotency keys, activity heartbeats, and explicit compensation.
Operational implications
A model retry may be non-deterministic; a tool write may have succeeded despite timeout. Query authoritative state before replay.
Measure
Resume success, duplicate prevention, ambiguous outcomes, compensation, and task age.
Human approval
Privileged or irreversible actions pause with a clear proposal, target, evidence, side effects, risk, and expiry.
Implementation
Bind approval to exact normalized arguments and a single-use or scoped token; authenticate reviewer authority.
Operational implications
A vague “approve agent” button delegates too much.
Measure
Approval rate/time, expiry, changes after review, unauthorized approvals, and post-action verification.
Evaluation and replay
Evaluation can gate model output, tool proposals, or final task success. Replay reconstructs control decisions from versions and protected references.
Implementation
Record evaluator/version/criteria/evidence and store trace/state references without exposing hidden chain-of-thought.
Operational implications
Replay may reproduce workflow decisions without identical stochastic text.
Measure
Evaluation coverage/score, blocked actions, replay completeness, trace gaps, and incident resolution.
Semantic layer integration
Governed metrics and business joins belong behind typed domain interfaces rather than arbitrary model-generated SQL.
Implementation
Expose approved semantic queries or APIs with identity, row/column policy, result limits, and provenance.
Operational implications
This improves consistency, security, observability, and change management.
Measure
Query validity, denied fields, result limits, metric version, and citation/provenance.
Reference tables
| Component | Primary responsibility | What it does not prove |
|---|---|---|
| Agent framework | Express model-driven flow | Production durability or tool authority |
| Agentic runtime | Governed execution and state | Business authority beyond policy |
| Tool protocol | Discovery and typed exchange | Tool safety or user approval |
| Workflow engine | Durable steps, timers, retries | AI-specific context/evaluation |
| Observability layer | Traces, metrics, logs, evaluations | Permission to act |
| Product application | UX and business workflow | Low-level execution efficiency |
| Stage | Runtime action | Evidence |
|---|---|---|
| Discover | Expose permitted capabilities | Catalog/server version and scope |
| Propose | Model returns typed call | Structured arguments |
| Validate | Schema, target, business rules | Validation result |
| Authorize | Policy/delegated authority | Decision ID/reason |
| Approve | Human/independent gate | Approver and expiry |
| Execute | Timeout, rate, idempotency, sandbox | Invocation and side-effect class |
| Validate result | Schema/safety checks | Status and redaction |
| Commit state | Workflow/memory update | Versioned state change |
| Trace | Link all events | Trace/replay handle |
| Scope | Lifetime | Typical content | Primary risk |
|---|---|---|---|
| Working | One task/run | Plan, intermediate results, counters | Context overflow/stale branch |
| Session/thread | Conversation or case | Preferences and unresolved state | Cross-user leakage |
| Long-term user | Across sessions | Approved stable facts/preferences | Poisoning/unwanted retention |
| Organizational | Shared durable knowledge | Policies and reviewed facts | Broad blast radius |
| System of record | Business-defined | Authoritative records | Irreversible side effects |
Decision checklist
- What identity and tenant scope enter every run?
- Which authority is delegated, for how long, and over which resources?
- How is context classified, minimized, and traced?
- Which tools are visible and which actions require approval?
- How are retries idempotent across model and tool steps?
- What memory scopes exist and who may write/delete them?
- How can a run resume after process or dependency failure?
- Which evaluation or policy gate can halt execution?
- What evidence is retained for replay without leaking secrets?
Common mistakes
- Calling prompt templates and tool calling a production runtime.
- Treating tool descriptions as authorization.
- Giving agents raw database access instead of governed domain interfaces.
- Writing model output directly into long-term memory.
- Retrying irreversible tools after ambiguous timeouts.
- Keeping accelerator reservations while tools run.
- Logging secrets or full sensitive prompts.
- Assuming MCP supplies product-specific policy.
- Exposing hidden chain-of-thought instead of evidence and decisions.
Sources and further reading
-
Model Context Protocol specification
(opens in a new tab)
-
MCP tools
(opens in a new tab)
-
MCP resources
(opens in a new tab)
-
Temporal durable execution
(opens in a new tab)
-
LangGraph persistence
(opens in a new tab)
-
OpenTelemetry concepts
(opens in a new tab)
-
NIST AI Risk Management Framework
(opens in a new tab)
Last reviewed: 2026-06-21 UTC
