AI Runtime Resources - aRuntime.com

Start Here on ARuntime.com

These internal pages lay the site's own groundwork before you compare external frameworks, protocols, model servers, observability systems, and security guidance.

AI Runtime Overview

Clarifies what “AI runtime” means across compiler, inference, serving, edge, browser, and agentic layers.

How AI Runtimes Work

Follows a request from boundary checks through context assembly, model execution, tools, validation, and response packaging.

Runtime Architecture

Explains the control, context, execution, and trust planes that make a runtime replaceable and observable.

Deployment Patterns

Compares local, cloud, edge, browser, Kubernetes, serverless, and hybrid deployment models.

Security and Governance

Maps prompt injection, delegated authority, tool permissions, memory, audit, and human review to runtime controls.

Developer Guide

Defines contract-driven implementation patterns for typed tools, routing, traces, memory writes, and fallback behavior.

AI Runtime Glossary

Provides reviewed terms for compiler runtimes, model serving, KV cache, observability, governance, and agentic execution.

External Resource Directory

The directory separates authoring frameworks, orchestration runtimes, protocols, serving layers, telemetry systems, and governance resources. That separation matters because no single product necessarily supplies every layer of an AI runtime.

Agent framework roles are not interchangeable

Agent framework: Defines agent logic, tools, prompts, handoffs, and application-level behavior.
Orchestration runtime: Coordinates durable state, checkpoints, streaming, retries, interruption, and resumption.
User-interface runtime adapter: Connects chat or assistant UI state to backend execution and streaming events.
Application middleware: Connects application code to model providers, tools, schemas, and product workflows.
Hosted runtime service: Runs or operates part of the execution loop as a managed platform boundary.

Agent Frameworks and Orchestration

These resources sit above raw model calls. Some define agent loops, some coordinate durable state, and some connect user interfaces to existing backend runtimes. Treat them as distinct roles inside the application runtime layer.

Agent framework

OpenAI Agents SDK

OpenAI

OpenAI’s Agents SDK is a framework, available for Python and TypeScript, for building agentic workflows with agents, tools, handoffs, guardrails, sessions, tracing, and human-in-the-loop patterns. It is useful when an application wants managed agent execution above direct model calls, not merely a single completion endpoint.

Tools
Guardrails
Tracing

Visit official documentation

Orchestration runtime

LangGraph

LangChain

LangGraph focuses on low-level orchestration for long-running, stateful agent and workflow execution. It is relevant when the runtime must persist state, resume after failures, stream intermediate events, support human interruptions, and keep application logic separate from one-shot model calls.

Durable execution
State
Human review

Visit official documentation

Agent development kit

Google Agent Development Kit

Google

Google’s Agent Development Kit (ADK) is a toolkit for building agent applications around models, tools, and structured execution flows. It belongs in the agent-development category rather than the low-level inference-engine category because it addresses application behavior above model serving.

Agents
Tools
Development

Visit official documentation

Application middleware

Microsoft Semantic Kernel

Microsoft

Semantic Kernel is application middleware for composing AI services, prompts, plugins, memory-oriented patterns, and orchestration logic in enterprise applications. It can help standardize how product code invokes model and tool capabilities, but production runtime boundaries still require external policy, identity, logging, and deployment controls.

Planning
Plugins
Enterprise apps

Visit official documentation

Application and UI middleware

Vercel AI SDK

Vercel

The Vercel AI SDK is aimed at web application developers who need streaming model responses, provider abstraction, UI integration, and server-side helpers. It is not a complete runtime by itself; it is best viewed as middleware that connects product interfaces to model and tool execution paths.

Streaming
Web apps
Model providers

Visit official documentation

User-interface runtime adapter

Assistant UI Runtime Guide

assistant-ui

Assistant UI’s runtime guide helps developers choose how a chat or assistant interface should connect to backend execution. It is useful for understanding front-end runtime adapters, message state, streaming UI behavior, and integration boundaries between interface components and server-side agent logic.

UI state
Frontend
Adapters

Visit official documentation

Agent deployment runtime

AgentScope Runtime for Java

AgentScope

AgentScope Runtime for Java is positioned around agent deployment and tool sandboxing for Java-oriented ecosystems. It is relevant when runtime design must isolate tools, manage sessions, expose observable execution, and integrate with Java agent frameworks rather than relying solely on Python-first orchestration stacks.

Java
Sandboxing
Tools

Visit official documentation

Protocols and Contracts

Protocols and schema systems help define execution boundaries: what tools exist, how agents discover capabilities, how requests are typed, and how structured outputs can be validated before downstream action.

Tool and context protocol

Model Context Protocol

MCP standardizes how AI applications connect to external tools, data sources, prompts, and workflows. For runtime architecture, it is most useful at the tool and context boundary: discovery, permissioning, connector design, and controlled access to systems outside the model provider.

Tools
Data sources
Discovery

Visit official documentation
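To make the MCP tool boundary concrete, here is a minimal sketch of the JSON-RPC 2.0 messages a client sends for tool discovery ("tools/list") and tool invocation ("tools/call"). The allow-list, the tool name search_docs, and the helper function names are hypothetical and exist only for illustration; transport, initialization, and authorization are deliberately left out.

```python
from itertools import count

# Monotonic request IDs, as JSON-RPC requests need a unique id per call.
_ids = count(1)

# Hypothetical allow-list: the runtime, not the model, decides which tools
# may be called through the connector.
ALLOWED_TOOLS = {"search_docs"}


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request():
    """Discovery: ask the server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    """Invocation: refuse tools outside the allow-list before building the call."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not permitted: {name}")
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    print(list_tools_request())
    print(call_tool_request("search_docs", {"query": "kv cache"}))
```

Keeping the permission check in front of message construction means a disallowed call never reaches the server, whatever the model requested.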

Agent interoperability protocol

Agent2Agent Protocol

A2A Protocol project

A2A addresses communication and collaboration between agents rather than connecting one AI application to tools. It belongs in runtime planning when multiple agents, specialized services, or organizations need typed handoffs, task status exchange, and clearer boundaries for cross-agent collaboration.

Agent collaboration
Interoperability
Handoffs

Visit official documentation

API contract specification

OpenAPI Specification

OpenAPI Initiative

OpenAPI defines machine-readable HTTP API contracts. In AI runtimes, it is useful for turning APIs into typed tools, documenting authentication and parameters, validating request and response shapes, and separating tool execution contracts from prompt text or ad hoc natural-language instructions.

HTTP APIs
Schemas
Tool contracts

Visit official documentation
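As a rough illustration of turning an API contract into a typed tool, the sketch below maps OpenAPI operations to a generic tool definition. The output shape (name, description, http, input_schema) is an assumption made for this example, not a format defined by OpenAPI or by any particular agent framework, and the sample spec is invented. Request bodies, $ref resolution, and security schemes are ignored.

```python
# HTTP methods that may appear as operation keys inside a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operation_to_tool(path, method, operation):
    """Map one OpenAPI operation object to a hypothetical tool definition."""
    properties = {}
    required = []
    for param in operation.get("parameters", []):
        properties[param["name"]] = param.get("schema", {})
        if param.get("required", False):
            required.append(param["name"])
    return {
        "name": operation["operationId"],
        "description": operation.get("summary", ""),
        "http": {"method": method.upper(), "path": path},
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def spec_to_tools(spec):
    """Collect a tool definition for every operation in the spec."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                tools.append(operation_to_tool(path, method, operation))
    return tools


# Invented example spec, for illustration only.
EXAMPLE_SPEC = {
    "openapi": "3.1.0",
    "paths": {
        "/orders/{orderId}": {
            "get": {
                "operationId": "getOrder",
                "summary": "Fetch one order by ID",
                "parameters": [
                    {"name": "orderId", "in": "path", "required": True,
                     "schema": {"type": "string"}}
                ],
            }
        }
    },
}

if __name__ == "__main__":
    print(spec_to_tools(EXAMPLE_SPEC))
```

The tool contract is derived from the specification rather than written into prompt text, so parameter names and types stay in one place.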

Validation specification

JSON Schema

JSON Schema project

JSON Schema defines how JSON documents can be validated against a formal schema. Runtime designers use it for tool inputs, structured model outputs, response envelopes, trace events, configuration, and policy checks that should fail deterministically before side effects occur.

Validation
Structured output
Schemas

Visit official documentation
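The idea of failing deterministically before side effects can be sketched with a tiny hand-written checker. This is not a JSON Schema implementation: it handles only the "type", "required", "properties", and "enum" keywords, and a real runtime would use a maintained validator library. The refund schema and the run_tool helper are hypothetical.

```python
# Mapping from a few JSON Schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance passed."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(instance, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(instance, TYPE_MAP[expected]):
            return [f"{path}: expected {expected}"]
    errors = []
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: value not in enum")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: required property missing")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    return errors


# Hypothetical tool input schema.
REFUND_SCHEMA = {
    "type": "object",
    "required": ["order_id", "amount_cents"],
    "properties": {
        "order_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
        "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
    },
}


def run_tool(args, side_effect):
    """Validate first; only call the side effect when validation passes."""
    errors = validate(args, REFUND_SCHEMA)
    if errors:
        return {"ok": False, "errors": errors}
    side_effect(args)
    return {"ok": True, "errors": []}
```

For example, run_tool({"order_id": "A1", "amount_cents": "500"}, issue_refund) returns an error for $.amount_cents and never calls issue_refund, where issue_refund stands for whatever function performs the side effect.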

Inference and Serving

Inference engines and model servers usually belong to the execution plane. They load models, schedule requests, batch traffic, expose APIs, and use hardware efficiently, but they do not automatically provide complete agent memory, permission, or governance systems.

High-throughput LLM serving

vLLM

vLLM project

vLLM is an LLM serving engine focused on high-throughput generation, online serving, batching, prefix caching, structured outputs, and distributed deployment patterns. It is a model execution and serving component; it still needs surrounding runtime controls for identity, tool authorization, memory policy, and product workflow.

LLM serving
Batching
KV cache

Visit official documentation

Local model execution

Ollama

Ollama helps developers run and manage local models through a convenient local service and CLI-oriented workflow. It is useful for desktop, prototype, and private local inference patterns, but production runtime design still needs routing policy, observability, authorization, and deployment controls around it.

Local models
Developer workstation
Model management

Visit official documentation
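As a small illustration of wrapping a local model service, the sketch below builds a non-streaming request for Ollama's /api/generate endpoint on its default local port and applies a timeout when sending it. The model name passed by the caller and the 30-second default are assumptions; routing, authorization, and tracing would still have to live around this call.

```python
import json
import urllib.request

# Default local endpoint for a locally running Ollama service.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a non-streaming generate request without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout_s=30.0):
    """Send the request and return the generated text.

    Requires a running local service; the timeout keeps a stuck model
    from blocking the caller indefinitely.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Separating request construction from sending makes the adapter easy to test without a model server and gives the runtime one place to add logging or policy checks.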

Portable model inference

ONNX Runtime

Microsoft / ONNX Runtime project

ONNX Runtime executes ONNX model graphs across CPU, GPU, mobile, browser, and accelerator backends through execution-provider abstractions. It is relevant when portability, graph optimization, hardware delegation, and model-format interoperability matter more than a single provider-specific serving stack.

ONNX
Execution providers
Portability

Visit official documentation

Multi-framework inference serving

NVIDIA Triton Inference Server

NVIDIA

Triton Inference Server provides multi-framework model serving with HTTP and gRPC APIs, model repositories, scheduling, dynamic batching, and observability integrations. It is a serving platform for the execution layer, not a full agent runtime or policy system by itself.

Model server
gRPC
Dynamic batching

Visit official documentation

Kubernetes-native model serving

KServe

KServe project

KServe provides Kubernetes-native abstractions for deploying, scaling, and operating model inference services. It is most relevant when runtime teams need platform-level serving resources, traffic management, standardized inference endpoints, and integration with Kubernetes operations rather than a standalone local inference engine.

Kubernetes
Autoscaling
Rollouts

Visit official documentation

Distributed application serving

Ray Serve

Ray project

Ray Serve is a scalable serving library for composing Python applications, model deployments, and distributed services on Ray. It fits runtime architectures that need programmable request handling, autoscaling, composition of multiple model or preprocessing steps, and deployment across a distributed application substrate.

Python services
Autoscaling
Composition

Visit official documentation

Observability and Evaluation

AI runtime observability is broader than log collection. A useful runtime captures traces, metrics, logs, evaluation results, prompt and model comparisons, replay material, and incident context while applying redaction and retention controls.

Telemetry semantic conventions

OpenTelemetry Generative AI Semantic Conventions

OpenTelemetry

OpenTelemetry’s Generative AI semantic conventions define shared telemetry vocabulary for model requests, providers, usage, and related signals. The formerly linked documentation page now points to the maintained semantic-conventions repository, so ARuntime uses the current repository location for link accuracy.

Tracing
Metrics
Semantic conventions

Visit official documentation
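A short sketch of what the shared vocabulary looks like in practice: the function below assembles a span name and attribute dictionary using gen_ai.* attribute names from the Generative AI semantic conventions. It does not call the OpenTelemetry SDK; the dictionary shape is an assumption for illustration, and because the conventions are still evolving, attribute names should be checked against the current repository before use.

```python
def genai_span(operation, request_model, input_tokens, output_tokens):
    """Describe one model call as a span name plus gen_ai.* attributes."""
    return {
        # Span name follows the "{operation} {model}" pattern.
        "name": f"{operation} {request_model}",
        "attributes": {
            "gen_ai.operation.name": operation,
            "gen_ai.request.model": request_model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }


if __name__ == "__main__":
    # "example-model" is a placeholder, not a real model identifier.
    print(genai_span("chat", "example-model", 120, 45))
```

With a real SDK, these key-value pairs would be set as attributes on a span so that different backends can interpret token usage and model identity consistently.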

LLM observability and evaluation

Langfuse

Langfuse provides observability, tracing, prompt management, evaluations, datasets, and production monitoring for LLM applications. It is useful when teams need to analyze runtime behavior across model calls, tool steps, latency, cost, quality, and prompt or model changes.

Tracing
Evaluation
Prompt management

Visit official documentation

AI observability and evaluation

Arize Phoenix

Arize AI

Arize Phoenix is an observability and evaluation system for AI applications, including tracing, datasets, experiments, and LLM evaluation workflows. It is relevant for runtime teams that need to debug requests, compare outputs, inspect retrieval behavior, and connect offline evaluations with production traces.

Tracing
Evaluation
Datasets

Visit official documentation

Security and Governance

Security resources help translate runtime risk into controls around identity, delegated authority, prompt injection, memory poisoning, supply chain, tool approval, auditability, incident modeling, and risk management.

Security guidance project

OWASP GenAI Security Project

OWASP

The OWASP GenAI Security Project publishes open guidance, threat lists, governance resources, and educational material for securing generative and agentic AI systems. It is useful for mapping runtime controls to prompt injection, excessive agency, data exposure, memory risks, and operational governance.

Prompt injection
Agent security
Controls

Visit official documentation

Risk management framework

NIST AI Risk Management Framework

NIST

The NIST AI Risk Management Framework provides a structured vocabulary for identifying, measuring, managing, and governing AI risks. Runtime teams can use it to connect engineering controls with organizational risk processes, documentation, monitoring, incident handling, and accountability expectations.

Risk
Governance
Assurance

Visit official documentation

Adversary knowledge base

MITRE ATLAS

MITRE

MITRE ATLAS catalogs adversary tactics and techniques against AI-enabled systems. It helps runtime and security teams think about threat modeling, incident response, supply-chain exposure, model behavior manipulation, data attacks, and controls that should exist outside prompt-level instructions.

Threat modeling
Adversary behavior
Incident analysis

Visit official documentation

Protocol security guidance

MCP Security Best Practices

Model Context Protocol

The MCP security guidance covers authorization risks, confused-deputy problems, token handling, server-side request forgery, session hijacking, local-server compromise, and scope minimization. It is directly relevant to runtime designs that expose external tools or data sources through MCP connectors.

MCP
Authorization
Tool access

Visit official documentation

How to Choose an AI Runtime

Start from requirements. A serving engine, protocol, orchestration framework, and observability system may all be necessary, but they are different layers of the runtime architecture.

Requirements mapped to runtime capabilities and resource categories.

Requirement	Runtime capability	Resource category	Questions to ask
Stateful, long-running tasks	Durable execution and checkpointing	Agent frameworks and orchestration	Can the runtime persist state, resume after failure, expose checkpoints, and support human interruption without corrupting the task?
Tool use	Schemas, authentication, permissions, retries, and output validation	Protocols and contracts	Are tool inputs typed, authenticated, authorized, rate limited, idempotent where needed, and validated before side effects?
Multiple models	Routing, fallback, privacy, budget, and capability policies	Developer runtime architecture	Can routing policy select models by task, risk, latency, context window, cost, data boundary, and provider availability?
Local models	Local inference and model-server adapters	Inference and serving	Does the stack support local model management, hardware constraints, privacy requirements, and adapter-level failure isolation?
Human approval	Interruptible execution and resumable state	Agent orchestration and governance	Can privileged actions pause for approval, preserve state, record the decision, and continue or cancel predictably?
Interoperability	MCP, A2A, OpenAPI, and JSON Schema support	Protocols and contracts	Which boundaries are standardized: tool discovery, agent-to-agent work, HTTP APIs, request schemas, or structured outputs?
Production debugging	Distributed traces, replay, metrics, and evaluation	Observability and evaluation	Can engineers reconstruct request timelines, tool decisions, fallbacks, token usage, cost, and evaluation results without leaking sensitive data?
Regulated workloads	Identity, audit records, retention controls, and policy enforcement	Security and governance	Are actors, tenants, permissions, retention, redaction, approvals, and audit trails enforced outside prompt text?
High throughput	Batching, caching, concurrency, autoscaling, and backpressure	Inference and serving	Does the serving layer support batching, scheduling, cache reuse, autoscaling, admission control, and clear SLO trade-offs?

No single product necessarily supplies every layer of an AI runtime. Treat the runtime as a composition of contracts, execution services, policies, telemetry, and product workflows.

Reference Architecture Flow

01 Request boundary: Identify actor, tenant, task, risk, deadline, and allowed authority.
02 Context assembly: Retrieve only the context required for the task and data boundary.
03 Model routing: Select model/provider based on capability, privacy, budget, latency, and fallback policy.
04 Model execution: Run inference through an engine or provider and capture token usage, cost, and timing.
05 Tool authorization: Check permissions, scopes, schemas, risk level, and approval requirements.
06 Tool execution: Execute typed tools with retries, timeouts, idempotency rules, and audit fields.
07 Memory update: Write memory only when policy allows it and the update is explicit.
08 Policy validation: Validate output, citations, constraints, and side-effect boundaries.
09 Telemetry and evaluation: Record redacted traces, metrics, evaluations, and incident replay material.
10 Response or human handoff: Return structured output or pause for review with state preserved.
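The ten stages of the reference flow can be sketched as an ordered pipeline over a shared state object. This is a minimal illustration, not a prescribed design: the stage identifiers, the RunState fields, and the rule that a pause skips tool execution and memory writes are all assumptions chosen to show how a human handoff can preserve state.

```python
from dataclasses import dataclass, field
from typing import Optional

# Stage names mirror the ten steps of the reference flow.
STAGES = [
    "request_boundary", "context_assembly", "model_routing", "model_execution",
    "tool_authorization", "tool_execution", "memory_update", "policy_validation",
    "telemetry_and_evaluation", "response_or_handoff",
]

# Assumption: once a run is paused for review, side-effecting stages are skipped.
SKIPPED_WHEN_PAUSED = {"tool_execution", "memory_update"}


@dataclass
class RunState:
    actor: str
    task: str
    risk: str = "low"
    trace: list = field(default_factory=list)
    output: Optional[str] = None
    paused_for_review: bool = False


def authorize_tools(state):
    # High-risk tasks pause for human approval instead of executing tools.
    if state.risk == "high":
        state.paused_for_review = True


def respond(state):
    state.output = "PENDING_REVIEW" if state.paused_for_review else "done"


HANDLERS = {"tool_authorization": authorize_tools, "response_or_handoff": respond}


def run(state, handlers=HANDLERS):
    """Run every stage in order, recording each executed stage in the trace."""
    for name in STAGES:
        if state.paused_for_review and name in SKIPPED_WHEN_PAUSED:
            continue
        handler = handlers.get(name)
        if handler is not None:
            handler(state)
        state.trace.append(name)
    return state


if __name__ == "__main__":
    print(run(RunState(actor="alice", task="refund", risk="high")).trace)
```

Telemetry still runs for a paused request, so the approval path is as observable as the normal path.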

Resource Selection Policy

Direct relevance to AI runtime architecture.
Availability of authoritative documentation.
Clear technical purpose.
Active maintenance or standards relevance.
Practical production use.
Educational value.
No payment for placement.

Last reviewed: 2026-06-21 UTC

Review workflow and deprecation policy

Frequently Asked Questions

What is the difference between a model API and an AI runtime?

A model API exposes model capability. An AI runtime governs the surrounding execution: identity, context, routing, tools, memory, policy checks, telemetry, evaluation, fallbacks, and handoff behavior. A runtime may call one or more model APIs, but it should not be reduced to them.

Is an agent framework the same as an AI runtime?

No. An agent framework helps build agent behavior, but a production runtime must also define execution boundaries, deployment, authorization, logging, incident recovery, cost controls, evaluation, and operational ownership. Some frameworks include runtime features, but they do not automatically supply every production control.

Is a model server a complete runtime?

Usually not. A model server handles model loading, APIs, scheduling, batching, health, and sometimes model repositories or rollouts. It is normally one execution-plane component inside a larger runtime architecture that also governs tools, context, memory, policy, and workflow state.

What is the difference between MCP and A2A?

MCP primarily standardizes how AI applications connect to tools, data sources, prompts, and workflows. A2A addresses communication and collaboration between agents. They can complement each other, but they solve different boundary problems and should not be treated as direct substitutes.

Can an AI runtime use multiple model providers?

Yes. A runtime can route across providers by task type, risk, context size, latency, cost, privacy, data residency, availability, and quality requirements. The important part is making routing policy explicit and observable rather than hiding it in application code.
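One way to make routing policy explicit and observable is to express routes as data and return the reasons each route was rejected. The sketch below assumes two invented routes and four criteria (availability, context size, data boundary, budget); real policies would add latency, quality, and residency rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    max_context_tokens: int
    cost_per_1k_tokens: float
    allows_sensitive_data: bool
    available: bool = True


# Invented routes for illustration only.
ROUTES = [
    Route("local-small", 8_000, 0.0, True),
    Route("hosted-large", 128_000, 0.01, False),
]


def choose_route(routes, context_tokens, sensitive, budget_per_1k):
    """Return (chosen route or None, per-route rejection reasons)."""
    decisions = []
    eligible = []
    for route in routes:
        reasons = []
        if not route.available:
            reasons.append("unavailable")
        if context_tokens > route.max_context_tokens:
            reasons.append("context too large")
        if sensitive and not route.allows_sensitive_data:
            reasons.append("data boundary")
        if route.cost_per_1k_tokens > budget_per_1k:
            reasons.append("over budget")
        decisions.append((route.name, reasons))
        if not reasons:
            eligible.append(route)
    # Among eligible routes, prefer the cheapest.
    chosen = min(eligible, key=lambda r: r.cost_per_1k_tokens) if eligible else None
    return chosen, decisions
```

Returning the decision list alongside the chosen route lets the runtime log why a request went where it did, or why no route was acceptable.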

Can an AI runtime run entirely on local infrastructure?

Yes. Local runtimes can use local model servers, desktop GPUs, private Kubernetes clusters, edge devices, or air-gapped deployments. The architecture still needs policy, observability, update controls, model provenance, and recovery behavior even when no public cloud provider is involved.

What runtime events should be logged?

Log request boundaries, context retrieval, model route, provider/model identity, tool calls, permission decisions, retries, fallbacks, memory reads/writes, token usage, cost, latency, evaluation results, human-review events, and final outcome. Define redaction and retention policies for sensitive data before any of these events are recorded.
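A minimal sketch of a structured runtime event with redaction applied before serialization. The event fields and the set of sensitive keys are assumptions for illustration; a real policy would be driven by data classification rather than a fixed key list.

```python
import json
import time

# Hypothetical keys whose values must never reach the log.
SENSITIVE_KEYS = {"prompt", "user_email", "tool_args"}


def redact(fields):
    """Replace the values of sensitive keys with a fixed marker."""
    return {
        key: ("[REDACTED]" if key in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


def runtime_event(kind, request_id, **fields):
    """Serialize one runtime event as a JSON line, redacting first."""
    event = {"ts": time.time(), "kind": kind, "request_id": request_id}
    event.update(redact(fields))
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(runtime_event("tool_call", "req-1", tool="search",
                        tool_args={"q": "secret"}, latency_ms=42))
```

Because redaction happens inside the only function that produces log lines, callers cannot accidentally bypass it.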

Where should tool permissions be enforced?

Tool permissions should be enforced outside the prompt at deterministic runtime boundaries. Use authentication, authorization, schema validation, rate limits, policy checks, approval requirements, and audit logging before executing a high-impact action.
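Enforcement at a deterministic boundary can be sketched as a gate that every tool call passes through before execution. The tool names, scopes, and three-way decision ("allow", "deny", "needs_approval") are hypothetical; the point is that the decision depends on policy data and actor scopes, never on prompt text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    required_scope: str
    needs_approval: bool = False


# Hypothetical policy table.
POLICIES = {
    "read_invoice": ToolPolicy("billing:read"),
    "issue_refund": ToolPolicy("billing:write", needs_approval=True),
}

audit_log = []


def authorize(tool, actor_scopes, approved=False):
    """Return 'allow', 'deny', or 'needs_approval' for a requested tool call."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny"  # unknown tools are denied by default
    if policy.required_scope not in actor_scopes:
        return "deny"
    if policy.needs_approval and not approved:
        return "needs_approval"
    return "allow"


def execute(tool, actor_scopes, run, approved=False):
    """Record the decision, then run the tool only when it is allowed."""
    decision = authorize(tool, actor_scopes, approved)
    audit_log.append({"tool": tool, "decision": decision})
    if decision != "allow":
        return decision, None
    return decision, run()
```

For example, execute("issue_refund", {"billing:write"}, do_refund) returns ("needs_approval", None) without calling do_refund, and the same call with approved=True runs it; do_refund is a placeholder for the real tool function.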

How should long-running agent tasks recover from failures?

They need durable state, checkpoints, idempotent tool operations, cancellation support, retry budgets, compensating actions, trace replay, and a clear human handoff path. Recovery should resume from known state rather than rerunning the entire task blindly.
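Resuming from known state can be sketched with a checkpoint file that records each completed step. This is an illustrative assumption, not a production design: steps are named functions returning JSON-serializable results, the checkpoint is a local JSON file, and a step that fails midway is rerun from its start, which is why its side effects need to be idempotent.

```python
import json
import os


def load_checkpoint(path):
    """Return previously completed step results, or an empty dict."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_checkpoint(path, done):
    """Write atomically so a crash never leaves a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(done, handle)
    os.replace(tmp_path, path)


def run_steps(steps, checkpoint_path):
    """Run (name, function) steps in order, skipping any already completed."""
    done = load_checkpoint(checkpoint_path)
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier attempt; do not rerun
        done[name] = step()
        save_checkpoint(checkpoint_path, done)
    return done
```

If the second of two steps raises, calling run_steps again with the same checkpoint path skips the first step and retries only the failed one.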

How frequently should resource links be reviewed?

Review high-change runtime resources at least quarterly and after major releases, security advisories, protocol revisions, or project governance changes. Link pages should show a UTC last-reviewed date and avoid evergreen claims about version support or project status.
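A quarterly review rule is easy to check mechanically. The sketch below treats "quarterly" as 92 days, which is an assumption, and compares a last-reviewed date against the current date; event-driven reviews (releases, advisories) would need separate triggers.

```python
from datetime import date, timedelta

# Assumption: "at least quarterly" is approximated as 92 days.
REVIEW_INTERVAL = timedelta(days=92)


def is_review_overdue(last_reviewed, today):
    """Return True when more than one review interval has elapsed."""
    return today - last_reviewed > REVIEW_INTERVAL


if __name__ == "__main__":
    print(is_review_overdue(date(2026, 6, 21), date.today()))
```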