Start Here on ARuntime.com
Read these internal pages first. They provide the foundation for comparing external frameworks, protocols, model servers, observability systems, and security guidance.
AI Runtime Overview
Clarifies what “AI runtime” means across compiler, inference, serving, edge, browser, and agentic layers.
How AI Runtimes Work
Follows a request from boundary checks through context assembly, model execution, tools, validation, and response packaging.
Runtime Architecture
Explains the control, context, execution, and trust planes that make a runtime replaceable and observable.
Deployment Patterns
Compares local, cloud, edge, browser, Kubernetes, serverless, and hybrid deployment models.
Security and Governance
Maps prompt injection, delegated authority, tool permissions, memory, audit, and human review to runtime controls.
Developer Guide
Defines contract-driven implementation patterns for typed tools, routing, traces, memory writes, and fallback behavior.
AI Runtime Glossary
Provides reviewed terms for compiler runtimes, model serving, KV cache, observability, governance, and agentic execution.
External Resource Directory
The directory separates authoring frameworks, orchestration runtimes, protocols, serving layers, telemetry systems, and governance resources. That separation matters because no single product necessarily supplies every layer of an AI runtime.
Agent framework roles are not interchangeable
- Agent framework: Defines agent logic, tools, prompts, handoffs, and application-level behavior.
- Orchestration runtime: Coordinates durable state, checkpoints, streaming, retries, interruption, and resumption.
- User-interface runtime adapter: Connects chat or assistant UI state to backend execution and streaming events.
- Application middleware: Connects application code to model providers, tools, schemas, and product workflows.
- Hosted runtime service: Runs or operates part of the execution loop as a managed platform boundary.
Agent Frameworks and Orchestration
These resources sit above raw model calls. Some define agent loops, some coordinate durable state, and some connect user interfaces to existing backend runtimes. Treat them as distinct roles inside the application runtime layer.
Agent framework
OpenAI Agents SDK
OpenAI
OpenAI’s Agents SDK is a Python framework for building agentic workflows with agents, tools, handoffs, guardrails, sessions, tracing, and human-in-the-loop patterns. It is useful when an application wants managed agent execution above direct model calls, not merely a single completion endpoint.
Orchestration runtime
LangGraph
LangChain
LangGraph focuses on low-level orchestration for long-running, stateful agent and workflow execution. It is relevant when the runtime must persist state, resume after failures, stream intermediate events, support human interruptions, and keep application logic separate from one-shot model calls.
Agent development kit
Google Agent Development Kit
Google ADK provides documentation and tooling for building agent applications around models, tools, and structured execution flows. On a runtime resources page, it belongs in the agent-development category rather than the low-level inference-engine category, because it addresses application behavior above model serving.
Application middleware
Microsoft Semantic Kernel
Microsoft
Semantic Kernel is application middleware for composing AI services, prompts, plugins, memory-oriented patterns, and orchestration logic in enterprise applications. It can help standardize how product code invokes model and tool capabilities, but production runtime boundaries still require external policy, identity, logging, and deployment controls.
Application and UI middleware
Vercel AI SDK
Vercel
The Vercel AI SDK is aimed at web application developers who need streaming model responses, provider abstraction, UI integration, and server-side helpers. It is not a complete runtime by itself; it is best viewed as middleware that connects product interfaces to model and tool execution paths.
User-interface runtime adapter
Assistant UI Runtime Guide
assistant-ui
Assistant UI’s runtime guide helps developers choose how a chat or assistant interface should connect to backend execution. It is useful for understanding front-end runtime adapters, message state, streaming UI behavior, and integration boundaries between interface components and server-side agent logic.
Agent deployment runtime
AgentScope Runtime for Java
AgentScope
AgentScope Runtime for Java is positioned around agent deployment and tool sandboxing for Java-oriented ecosystems. It is relevant when runtime design must isolate tools, manage sessions, expose observable execution, and integrate with Java agent frameworks rather than relying solely on Python-first orchestration stacks.
Protocols and Contracts
Protocols and schema systems help define execution boundaries: what tools exist, how agents discover capabilities, how requests are typed, and how structured outputs can be validated before downstream action.
Tool and context protocol
Model Context Protocol
Model Context Protocol
MCP standardizes how AI applications connect to external tools, data sources, prompts, and workflows. For runtime architecture, it is most useful at the tool and context boundary: discovery, permissioning, connector design, and controlled access to systems outside the model provider.
Agent interoperability protocol
Agent2Agent Protocol
A2A Protocol project
A2A addresses communication and collaboration between agents rather than connecting one AI application to tools. It belongs in runtime planning when multiple agents, specialized services, or organizations need typed handoffs, task status exchange, and clearer boundaries for cross-agent collaboration.
API contract specification
OpenAPI Specification
OpenAPI Initiative
OpenAPI defines machine-readable HTTP API contracts. In AI runtimes, it is useful for turning APIs into typed tools, documenting authentication and parameters, validating request and response shapes, and separating tool execution contracts from prompt text or ad hoc natural-language instructions.
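As a rough illustration of turning an API contract into typed tools, the sketch below walks a small OpenAPI 3.x fragment and emits one tool definition per operation. The spec fragment, the `operations_to_tools` helper, and the output shape (`name`, `description`, `http`, `input_schema`) are all assumptions made for this example. It ignores request bodies, `$ref` resolution, and path-level parameters.

```python
# Hypothetical fragment of an OpenAPI 3.x document.
SPEC = {
    "paths": {
        "/orders/{order_id}": {
            "get": {
                "operationId": "getOrder",
                "summary": "Fetch one order",
                "parameters": [
                    {"name": "order_id", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                    {"name": "expand", "in": "query", "required": False,
                     "schema": {"type": "boolean"}},
                ],
            }
        }
    }
}

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations_to_tools(spec: dict) -> list:
    """Turn each operation into a tool definition with a JSON Schema for inputs."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            params = op.get("parameters", [])
            tools.append({
                "name": op["operationId"],
                "description": op.get("summary", ""),
                "http": {"method": method.upper(), "path": path},
                "input_schema": {
                    "type": "object",
                    "properties": {p["name"]: p["schema"] for p in params},
                    "required": [p["name"] for p in params if p.get("required")],
                },
            })
    return tools
```

The tool contract here comes from the API document, so it can be reviewed and versioned separately from any prompt text.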
Validation specification
JSON Schema
JSON Schema project
JSON Schema defines how JSON documents can be validated against a formal schema. Runtime designers use it for tool inputs, structured model outputs, response envelopes, trace events, configuration, and policy checks that should fail deterministically before side effects occur.
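To make "fail deterministically before side effects" concrete, here is a minimal sketch that checks a tool call against a small schema before anything runs. It hand-rolls only four keywords (`type`, `required`, `properties`, and `additionalProperties: false`), so it is not a JSON Schema implementation; a production runtime would more likely use a full validator library. The `send_invoice` tool and its fields are hypothetical.

```python
from typing import Any

# Hypothetical tool input schema, written in JSON Schema vocabulary.
SEND_INVOICE_SCHEMA = {
    "type": "object",
    "required": ["customer_id", "amount_cents"],
    "properties": {
        "customer_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
    },
    "additionalProperties": False,
}

_TYPES = {"string": str, "integer": int, "object": dict}


def validate_subset(instance: Any, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the input passed.

    Only handles type, required, properties, and additionalProperties=False.
    """
    errors: list[str] = []
    expected = _TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for integers.
    if not isinstance(instance, expected) or (expected is int and isinstance(instance, bool)):
        return [f"expected {schema['type']}"]
    if schema["type"] != "object":
        return errors
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required field: {name}")
    for name, value in instance.items():
        if name in props:
            errors.extend(f"{name}: {e}" for e in validate_subset(value, props[name]))
        elif schema.get("additionalProperties") is False:
            errors.append(f"unexpected field: {name}")
    return errors


def call_tool(args: dict) -> str:
    """Refuse to run the (hypothetical) side effect unless validation passes."""
    errors = validate_subset(args, SEND_INVOICE_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    return f"invoice queued for {args['customer_id']}"
```

The same input always produces the same list of violations, which keeps rejected tool calls easy to log and replay.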
Inference and Serving
Inference engines and model servers usually belong to the execution plane. They load models, schedule requests, batch traffic, expose APIs, and use hardware efficiently, but they do not automatically provide complete agent memory, permission, or governance systems.
High-throughput LLM serving
vLLM
vLLM project
vLLM is an LLM serving engine focused on high-throughput generation, online serving, batching, prefix caching, structured outputs, and distributed deployment patterns. It is a model execution and serving component; it still needs surrounding runtime controls for identity, tool authorization, memory policy, and product workflow.
Local model execution
Ollama
Ollama
Ollama helps developers run and manage local models through a convenient local service and CLI-oriented workflow. It is useful for desktop, prototype, and private local inference patterns, but production runtime design still needs routing policy, observability, authorization, and deployment controls around it.
Portable model inference
ONNX Runtime
Microsoft / ONNX Runtime project
ONNX Runtime executes ONNX model graphs across CPU, GPU, mobile, browser, and accelerator backends through execution-provider abstractions. It is relevant when portability, graph optimization, hardware delegation, and model-format interoperability matter more than a single provider-specific serving stack.
Multi-framework inference serving
NVIDIA Triton Inference Server
NVIDIA
Triton Inference Server provides multi-framework model serving with HTTP and gRPC APIs, model repositories, scheduling, dynamic batching, and observability integrations. It is a serving platform for the execution layer, not a full agent runtime or policy system by itself.
Kubernetes-native model serving
KServe
KServe project
KServe provides Kubernetes-native abstractions for deploying, scaling, and operating model inference services. It is most relevant when runtime teams need platform-level serving resources, traffic management, standardized inference endpoints, and integration with Kubernetes operations rather than a standalone local inference engine.
Distributed application serving
Ray Serve
Ray project
Ray Serve is a scalable serving library for composing Python applications, model deployments, and distributed services on Ray. It fits runtime architectures that need programmable request handling, autoscaling, composition of multiple model or preprocessing steps, and deployment across a distributed application substrate.
Observability and Evaluation
AI runtime observability is broader than log collection. A useful runtime captures traces, metrics, logs, evaluation results, prompt and model comparisons, replay material, and incident context while applying redaction and retention controls.
Telemetry semantic conventions
OpenTelemetry Generative AI Semantic Conventions
OpenTelemetry
OpenTelemetry’s Generative AI semantic conventions define shared telemetry vocabulary for model requests, providers, usage, and related signals. The formerly linked documentation page now points to the maintained semantic-conventions repository, so ARuntime uses the current repository location for link accuracy.
LLM observability and evaluation
Langfuse
Langfuse
Langfuse provides observability, tracing, prompt management, evaluations, datasets, and production monitoring for LLM applications. It is useful when teams need to analyze runtime behavior across model calls, tool steps, latency, cost, quality, and prompt or model changes.
AI observability and evaluation
Arize Phoenix
Arize AI
Arize Phoenix is an observability and evaluation system for AI applications, including tracing, datasets, experiments, and LLM evaluation workflows. It is relevant for runtime teams that need to debug requests, compare outputs, inspect retrieval behavior, and connect offline evaluations with production traces.
Security and Governance
Security resources help translate runtime risk into controls around identity, delegated authority, prompt injection, memory poisoning, supply chain, tool approval, auditability, incident modeling, and risk management.
Security guidance project
OWASP GenAI Security Project
OWASP
The OWASP GenAI Security Project publishes open guidance, threat lists, governance resources, and educational material for securing generative and agentic AI systems. It is useful for mapping runtime controls to prompt injection, excessive agency, data exposure, memory risks, and operational governance.
Risk management framework
NIST AI Risk Management Framework
NIST
The NIST AI Risk Management Framework provides a structured vocabulary for identifying, measuring, managing, and governing AI risks. Runtime teams can use it to connect engineering controls with organizational risk processes, documentation, monitoring, incident handling, and accountability expectations.
Adversary knowledge base
MITRE ATLAS
MITRE
MITRE ATLAS catalogs adversary tactics and techniques against AI-enabled systems. It helps runtime and security teams think about threat modeling, incident response, supply-chain exposure, model behavior manipulation, data attacks, and controls that should exist outside prompt-level instructions.
Protocol security guidance
MCP Security Best Practices
Model Context Protocol
The MCP security guidance covers authorization risks, confused-deputy problems, token handling, server-side request forgery, session hijacking, local-server compromise, and scope minimization. It is directly relevant to runtime designs that expose external tools or data sources through MCP connectors.
How to Choose an AI Runtime
Start from requirements. A serving engine, protocol, orchestration framework, and observability system may all be necessary, but they are different layers of the runtime architecture.
| Requirement | Runtime capability | Resource category | Questions to ask |
|---|---|---|---|
| Stateful, long-running tasks | Durable execution and checkpointing | Agent frameworks and orchestration | Can the runtime persist state, resume after failure, expose checkpoints, and support human interruption without corrupting the task? |
| Tool use | Schemas, authentication, permissions, retries, and output validation | Protocols and contracts | Are tool inputs typed, authenticated, authorized, rate limited, idempotent where needed, and validated before side effects? |
| Multiple models | Routing, fallback, privacy, budget, and capability policies | Developer runtime architecture | Can routing policy select models by task, risk, latency, context window, cost, data boundary, and provider availability? |
| Local models | Local inference and model-server adapters | Inference and serving | Does the stack support local model management, hardware constraints, privacy requirements, and adapter-level failure isolation? |
| Human approval | Interruptible execution and resumable state | Agent orchestration and governance | Can privileged actions pause for approval, preserve state, record the decision, and continue or cancel predictably? |
| Interoperability | MCP, A2A, OpenAPI, and JSON Schema support | Protocols and contracts | Which boundaries are standardized: tool discovery, agent-to-agent work, HTTP APIs, request schemas, or structured outputs? |
| Production debugging | Distributed traces, replay, metrics, and evaluation | Observability and evaluation | Can engineers reconstruct request timelines, tool decisions, fallbacks, token usage, cost, and evaluation results without leaking sensitive data? |
| Regulated workloads | Identity, audit records, retention controls, and policy enforcement | Security and governance | Are actors, tenants, permissions, retention, redaction, approvals, and audit trails enforced outside prompt text? |
| High throughput | Batching, caching, concurrency, autoscaling, and backpressure | Inference and serving | Does the serving layer support batching, scheduling, cache reuse, autoscaling, admission control, and clear SLO trade-offs? |
Because no single product necessarily covers every requirement, treat the runtime as a composition of contracts, execution services, policies, telemetry, and product workflows.
Reference Architecture Flow
- 01 Request boundary: Identify actor, tenant, task, risk, deadline, and allowed authority.
- 02 Context assembly: Retrieve only the context required for the task and data boundary.
- 03 Model routing: Select a model and provider based on capability, privacy, budget, latency, and fallback policy.
- 04 Model execution: Run inference through an engine or provider and capture token usage, cost, and timing.
- 05 Tool authorization: Check permissions, scopes, schemas, risk level, and approval requirements.
- 06 Tool execution: Execute typed tools with retries, timeouts, idempotency rules, and audit fields.
- 07 Memory update: Write memory only when policy allows it and the update is explicit.
- 08 Policy validation: Validate output, citations, constraints, and side-effect boundaries.
- 09 Telemetry and evaluation: Record redacted traces, metrics, evaluations, and incident replay material.
- 10 Response or human handoff: Return structured output or pause for review with state preserved.
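The flow above can be sketched as an ordered list of stage functions over shared request state. This is only an illustration under simplifying assumptions: it covers four of the ten stages, and `RequestState`, the stage functions, and the rule that the task name doubles as the requested tool name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """Hypothetical per-request state carried through the stages."""
    actor: str
    task: str
    allowed_tools: set
    trace: list = field(default_factory=list)
    status: str = "running"  # "running", "needs_review", or "done"
    output: str = ""


def request_boundary(state):
    if not state.actor:
        raise PermissionError("unknown actor")


def tool_authorization(state):
    # Assumption: the task name doubles as the requested tool name.
    if state.task not in state.allowed_tools:
        state.status = "needs_review"


def tool_execution(state):
    state.output = f"ran {state.task} for {state.actor}"


def response(state):
    state.status = "done"


STAGES = [request_boundary, tool_authorization, tool_execution, response]


def run(state):
    """Run stages in order, recording each one; stop early on human handoff."""
    for stage in STAGES:
        stage(state)
        state.trace.append(stage.__name__)
        if state.status == "needs_review":
            break  # state is preserved so a reviewer can resume or cancel
    return state
```

Keeping the stages as separate functions means each one can be traced, tested, or replaced without touching the others.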
Resource Selection Policy
- Direct relevance to AI runtime architecture.
- Availability of authoritative documentation.
- Clear technical purpose.
- Active maintenance or standards relevance.
- Practical production use.
- Educational value.
- No payment for placement.
Last reviewed: 2026-06-21 UTC
Frequently Asked Questions
What is the difference between a model API and an AI runtime?
A model API exposes model capability. An AI runtime governs the surrounding execution: identity, context, routing, tools, memory, policy checks, telemetry, evaluation, fallbacks, and handoff behavior. A runtime may call one or more model APIs, but it should not be reduced to them.
Is an agent framework the same as an AI runtime?
No. An agent framework helps build agent behavior, but a production runtime must also define execution boundaries, deployment, authorization, logging, incident recovery, cost controls, evaluation, and operational ownership. Some frameworks include runtime features, but they do not automatically supply every production control.
Is a model server a complete runtime?
Usually not. A model server handles model loading, APIs, scheduling, batching, health, and sometimes model repositories or rollouts. It is normally one execution-plane component inside a larger runtime architecture that also governs tools, context, memory, policy, and workflow state.
What is the difference between MCP and A2A?
MCP primarily standardizes how AI applications connect to tools, data sources, prompts, and workflows. A2A addresses communication and collaboration between agents. They can complement each other, but they solve different boundary problems and should not be treated as direct substitutes.
Can an AI runtime use multiple model providers?
Yes. A runtime can route across providers by task type, risk, context size, latency, cost, privacy, data residency, availability, and quality requirements. The important part is making routing policy explicit and observable rather than hiding it in application code.
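One way to keep routing policy explicit is to express it as data plus a small pure function. The sketch below is an assumption-laden example: the model names, prices, and the `select_route` helper are invented, and it considers only availability, context size, cost ceiling, and a privacy flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRoute:
    name: str               # hypothetical model identifiers
    local: bool             # True if data never leaves owned infrastructure
    context_window: int     # tokens
    cost_per_1k: float      # arbitrary currency units
    available: bool = True


ROUTES = [
    ModelRoute("local-small", local=True, context_window=8_000, cost_per_1k=0.0),
    ModelRoute("hosted-fast", local=False, context_window=32_000, cost_per_1k=0.2),
    ModelRoute("hosted-large", local=False, context_window=128_000, cost_per_1k=1.0),
]


def select_route(routes, prompt_tokens: int, private_data: bool,
                 max_cost_per_1k: float) -> Optional[ModelRoute]:
    """Return the cheapest route that satisfies every policy constraint, else None."""
    candidates = [
        r for r in routes
        if r.available
        and r.context_window >= prompt_tokens
        and r.cost_per_1k <= max_cost_per_1k
        and (r.local or not private_data)
    ]
    return min(candidates, key=lambda r: r.cost_per_1k, default=None)
```

A `None` result is a policy outcome in its own right: the caller can log it and fall back or hand off rather than silently picking a non-compliant model.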
Can an AI runtime run entirely on local infrastructure?
Yes. Local runtimes can use local model servers, desktop GPUs, private Kubernetes clusters, edge devices, or air-gapped deployments. The architecture still needs policy, observability, update controls, model provenance, and recovery behavior even when no public cloud provider is involved.
What runtime events should be logged?
Log request boundaries, context retrieval, model route, provider/model identity, tool calls, permission decisions, retries, fallbacks, memory reads/writes, token usage, cost, latency, evaluation results, human-review events, and final outcome. Define redaction and retention policy for sensitive data before any of these events are written.
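A minimal sketch of a structured trace event with redaction applied before serialization follows. The sensitive key names, the email pattern, and the `trace_event` helper are assumptions for illustration; a real deployment would derive them from its own data classification policy.

```python
import json
import re
from datetime import datetime, timezone

# Assumption: these field names are treated as sensitive in this sketch.
SENSITIVE_KEYS = {"prompt", "tool_args", "user_email"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(value):
    """Recursively mask sensitive keys and email-like strings."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


def trace_event(kind: str, request_id: str, **fields) -> str:
    """Build one JSON log line; redaction runs before serialization."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "request_id": request_id,
        **redact(fields),
    }
    return json.dumps(event, sort_keys=True)
```

For example, `trace_event("tool_call", "req-1", tool="lookup", tool_args={"q": "x"})` yields a line in which `tool_args` is masked while `tool` and `request_id` stay readable.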
Where should tool permissions be enforced?
Tool permissions should be enforced outside the prompt at deterministic runtime boundaries. Use authentication, authorization, schema validation, rate limits, policy checks, approval requirements, and audit logging before executing a high-impact action.
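As a sketch of enforcement outside the prompt, the gate below decides each tool call from a policy table and the caller's scopes, then records the decision. The tool names, scope strings, and the `authorize` helper are hypothetical, and the in-memory `AUDIT_LOG` stands in for a real audit sink.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    required_scope: str
    needs_approval: bool = False


@dataclass
class Caller:
    actor: str
    scopes: frozenset
    approvals: set = field(default_factory=set)  # tool names approved by a human


# Hypothetical policy table, kept in runtime configuration rather than prompt text.
POLICIES = {
    "read_ticket": ToolPolicy("tickets:read"),
    "refund_order": ToolPolicy("orders:refund", needs_approval=True),
}

AUDIT_LOG = []


def authorize(caller: Caller, tool: str) -> str:
    """Return "allow", "deny", or "pending_approval" and record the decision."""
    policy = POLICIES.get(tool)
    if policy is None or policy.required_scope not in caller.scopes:
        decision = "deny"  # unknown tools are denied by default
    elif policy.needs_approval and tool not in caller.approvals:
        decision = "pending_approval"
    else:
        decision = "allow"
    AUDIT_LOG.append({"actor": caller.actor, "tool": tool, "decision": decision})
    return decision
```

Nothing the model writes can change the outcome here, because the decision depends only on the policy table and the caller's recorded scopes and approvals.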
How should long-running agent tasks recover from failures?
They need durable state, checkpoints, idempotent tool operations, cancellation support, retry budgets, compensating actions, trace replay, and a clear human handoff path. Recovery should resume from known state rather than rerunning the entire task blindly.
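The resume-from-known-state idea can be illustrated with a file-based checkpoint and a per-step retry budget. This is a simplified sketch: `run_task`, the JSON checkpoint layout, and the use of `RuntimeError` as the retryable error are assumptions, and it omits cancellation and compensating actions.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "results": []}


def save_checkpoint(path, state):
    """Write atomically: a crash mid-write leaves the previous checkpoint intact."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_task(steps, path, max_retries=2):
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    state = load_checkpoint(path)
    while state["next_step"] < len(steps):
        step = steps[state["next_step"]]
        for attempt in range(max_retries + 1):
            try:
                result = step()
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # retry budget exhausted; checkpoint remains for handoff
        state["results"].append(result)
        state["next_step"] += 1
        save_checkpoint(path, state)
    return state["results"]
```

A crash between a step finishing and its checkpoint being saved causes that step to run again on resume, which is why the steps themselves need to be idempotent.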
How frequently should resource links be reviewed?
Review high-change runtime resources at least quarterly and after major releases, security advisories, protocol revisions, or project governance changes. Link pages should show a UTC last-reviewed date and avoid evergreen claims about version support or project status.
