AI Runtime Examples

Complete AI runtime architecture examples for local assistants, enterprise RAG, browser inference, mobile vision, high-throughput LLM serving, durable agents, and hybrid routing.

Audience: Technical readers · Reading time: 5 minutes · Status: Foundational · Last reviewed: 2026-06-21 UTC

Key takeaways

  • Each example begins with data, authority, SLO, deployment, and failure constraints.
  • The same model can require a different runtime architecture in a browser, edge device, private cluster, or durable agent workflow.
  • Tools and systems-of-record writes are governed side effects, not ordinary model output.
  • Observability and evaluation are part of every example.
  • Examples use synthetic identifiers and omit production secrets.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Scenario requirements and component choices.

Owns

Educational architecture patterns and reusable contract ideas.

Emits

A runtime topology, execution path, controls, metrics, and failure/recovery plan.

Does not own

A universal template or vendor endorsement.

Failure modes

Copying an example without adapting identity, data, scale, policy, and failure assumptions.

Evidence and metrics

Scenario-specific task success, latency, quality, cost, policy, and recovery.

Local private research assistant

A desktop application runs a quantized local model, local embeddings, and a read-only indexed document store.

Implementation

The runtime contract disables remote fallback, exposes file sources through a typed context provider, and records citations without uploading document content.

Operational implications

If the model cannot fit, the application offers a smaller approved model or fails explicitly. No network tool is available.
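The fit-or-fail rule above can be sketched as a small selection function. This is a minimal illustration, not the application's actual logic: the model names, the memory estimates, and the `ModelDoesNotFit` error are hypothetical, and a real application would present the smaller model as an offer to the user rather than pick it silently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedModel:
    name: str            # hypothetical, synthetic identifier
    required_bytes: int  # assumed estimate: weights + KV cache + overhead


class ModelDoesNotFit(RuntimeError):
    """Raised when no approved model fits; there is no remote fallback."""


def select_local_model(approved, available_bytes):
    """Return the largest approved model that fits in available memory."""
    candidates = [m for m in approved if m.required_bytes <= available_bytes]
    if not candidates:
        # Fail explicitly instead of routing to a remote provider.
        raise ModelDoesNotFit(
            f"no approved model fits in {available_bytes} bytes"
        )
    return max(candidates, key=lambda m: m.required_bytes)


GIB = 1024 ** 3
APPROVED = [
    ApprovedModel("assistant-8b-q4", 6 * GIB),
    ApprovedModel("assistant-3b-q4", 3 * GIB),
]
```

With 4 GiB available, `select_local_model(APPROVED, 4 * GIB)` returns the smaller entry; with 1 GiB it raises, which the application can surface as an explicit failure.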

Measure

Model load time, time to first token (TTFT) and time per output token (TPOT), RAM/VRAM use, citation validity, index freshness, and outbound bytes.
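As one way to derive the two token-latency metrics, the sketch below computes them from timestamps recorded while streaming. It assumes a common definition (TTFT is the delay to the first output token, TPOT is the mean gap between later tokens); the function name and input shape are illustrative.

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds from monotonic timestamps.

    TTFT is the time from request start to the first output token.
    TPOT is the mean gap between subsequent output tokens, or None when
    fewer than two tokens were produced.
    """
    if not token_times:
        raise ValueError("no output tokens were produced")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

Timestamps should come from a monotonic clock such as `time.monotonic()` so that wall-clock adjustments do not distort the measurements.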

Enterprise RAG with semantic layer

An internal assistant answers governed business questions through typed semantic metrics and approved documents.

Implementation

Identity and tenant are established at the boundary; row- and field-level policy filters the context; the router selects a private model; the output includes evidence and the metric version.

Operational implications

The model cannot run arbitrary SQL. Unsupported metric questions return a typed limitation.
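A minimal sketch of the "typed metric or typed limitation" idea follows. The catalog, metric name, version string, and limitation codes are all hypothetical; the point is only that the model proposes a metric and dimensions, and the runtime either maps them onto a governed definition or returns a structured refusal. No SQL is accepted from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    version: str
    allowed_dimensions: frozenset


@dataclass(frozen=True)
class MetricQuery:
    metric: str
    version: str
    dimensions: tuple
    tenant_id: str


@dataclass(frozen=True)
class Limitation:
    code: str
    detail: str


# Hypothetical semantic-layer catalog; names and versions are synthetic.
CATALOG = {
    "net_revenue": MetricDefinition(
        "net_revenue", "v3", frozenset({"region", "quarter"})
    ),
}


def plan_metric_query(metric, dimensions, tenant_id):
    """Map a model-proposed request onto a governed metric, or refuse."""
    definition = CATALOG.get(metric)
    if definition is None:
        return Limitation(
            "unsupported_metric", f"{metric!r} is not a governed metric"
        )
    extra = sorted(set(dimensions) - definition.allowed_dimensions)
    if extra:
        return Limitation(
            "unsupported_dimension", f"not allowed: {', '.join(extra)}"
        )
    # The metric version travels with the query so the answer can cite it.
    return MetricQuery(metric, definition.version, tuple(dimensions), tenant_id)
```

Returning `Limitation` as a value rather than raising keeps the refusal visible in the response contract, so the assistant can explain what is unsupported instead of improvising an answer.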

Measure

Context provenance, policy denies, metric version, answer evaluation, latency, and cost.

Browser document classifier

A web app downloads a small, signed, content-addressed ONNX model and runs it on a WebNN, WebGPU, or Wasm backend inside a Web Worker.

Implementation

Capability routing is local, and remote fallback is opt-in; assets are cached by content hash; GPU buffers are disposed of after each batch.
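Caching by hash can be illustrated with a small content-addressed store. The sketch uses Python for consistency with the other examples on this page; a browser implementation would more likely use the platform's cache and crypto APIs. The class and error names are hypothetical, and the check covers content addressing only, not signature verification.

```python
import hashlib


class HashMismatch(ValueError):
    """Raised when downloaded bytes do not match the expected digest."""


class ContentAddressedCache:
    """In-memory stand-in for a cache keyed by SHA-256 digest."""

    def __init__(self):
        self._entries = {}

    def get_or_fetch(self, expected_sha256, fetch):
        """Return cached bytes, or fetch, verify, and cache them."""
        cached = self._entries.get(expected_sha256)
        if cached is not None:
            return cached
        data = fetch()
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            # Never cache or execute an asset whose digest does not match.
            raise HashMismatch(f"expected {expected_sha256}, got {actual}")
        self._entries[expected_sha256] = data
        return data
```

Because the key is the digest of the content, a new model version gets a new key and never collides with a stale cached copy.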

Operational implications

On unsupported or memory-constrained browsers, a non-AI form remains usable.

Measure

Download/cache, initialization, classification latency, memory, fallback, and UI responsiveness.

Mobile camera inference

A prepared ExecuTorch program partitions supported operations to an NPU delegate and keeps fallback bounded.

Implementation

The app runs camera preprocessing, inference, and postprocessing within a sustained thermal budget and stores no raw image by default.

Operational implications

A signed staged update retains the last-good artifact. Unsupported devices use a smaller CPU model.
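The staged-update behaviour can be sketched as three slots: active, staged, and last-good. This is an illustrative model only; the version strings are synthetic, signature verification is reduced to a boolean supplied by the caller, and the health check is a hypothetical callback (for example, a smoke-test inference on the device).

```python
class ArtifactSlots:
    """Tracks the active, staged, and last-good model artifact versions."""

    def __init__(self, active):
        self.active = active
        self.last_good = active
        self.staged = None

    def stage(self, version, signature_ok):
        """Accept a candidate only if its signature check passed."""
        if not signature_ok:
            raise ValueError(f"signature check failed for {version}")
        self.staged = version

    def activate_staged(self, health_check):
        """Switch to the staged artifact; roll back if the check fails."""
        if self.staged is None:
            raise RuntimeError("nothing is staged")
        candidate, self.staged = self.staged, None
        self.active = candidate
        if health_check(candidate):
            self.last_good = candidate
            return True
        # Roll back: the last-good artifact was never deleted.
        self.active = self.last_good
        return False
```

The last-good version is only overwritten after the candidate passes its health check, so a failed update always has something to return to.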

Measure

Delegate coverage, p99 latency, energy, thermals, peak RAM, update success, and quality.

High-throughput LLM service

A private GPU cluster runs an LLM engine behind a model server and Kubernetes serving platform.

Implementation

Paged KV cache, continuous batching, bounded admission, prefix reuse, readiness checks, and goodput-based autoscaling are enabled; a gateway owns authentication and quotas.

Operational implications

Overload returns a stable retry-after error rather than unbounded queueing.
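Bounded admission with a retry hint might look like the sketch below. The class names, the in-flight limit, and the choice of status 429 are assumptions for illustration (a service could equally use 503); the retry hint here is a fixed estimate rather than a measured one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rejected:
    status: int
    retry_after_seconds: int


class AdmissionController:
    """Admits requests up to a fixed number in flight, then rejects."""

    def __init__(self, max_in_flight, est_service_seconds):
        self.max_in_flight = max_in_flight
        self.est_service_seconds = est_service_seconds
        self.in_flight = 0

    def try_admit(self):
        """Return None when admitted, or a Rejected response when full."""
        if self.in_flight >= self.max_in_flight:
            # Reject immediately instead of queueing without bound.
            retry_after = max(1, math.ceil(self.est_service_seconds))
            return Rejected(status=429, retry_after_seconds=retry_after)
        self.in_flight += 1
        return None

    def release(self):
        """Call when an admitted request finishes or is cancelled."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

A caller that receives `Rejected` can translate it into an HTTP response with a `Retry-After` header; the sketch is single-threaded and would need a lock or an async primitive in a real server.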

Measure

Queue depth, TTFT, TPOT, goodput, prefix-cache hit rate and prefill avoided, HBM use, errors, and cost.
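Goodput is not defined on this page; the sketch below assumes one common reading, namely successful requests per second that meet both a TTFT and a TPOT target. The function name, input shape, and thresholds are illustrative.

```python
def goodput(requests, window_seconds, ttft_slo, tpot_slo):
    """Requests per second that completed within both latency targets.

    `requests` is an iterable of (ttft_seconds, tpot_seconds, succeeded).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    good = sum(
        1
        for ttft, tpot, ok in requests
        if ok and ttft <= ttft_slo and tpot <= tpot_slo
    )
    return good / window_seconds
```

Unlike raw throughput, this number drops when the service is busy but slow, which is why it is a more useful autoscaling signal under that assumption.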

Durable case-resolution agent

A workflow coordinates context, model calls, tools, human approval, and resumable state over hours.

Implementation

Typed tools carry idempotency keys; status changes require permission and conditional approval; memory writes are explicit; ambiguous timeouts trigger a check of the authoritative outcome before any retry.

Operational implications

The model server may restart without losing task state. Human review sees exact action arguments and evidence.
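The combination of idempotency keys and outcome checks can be sketched as follows. `CaseSystem` is a hypothetical in-memory stand-in for a system of record, and the key format is invented for illustration; permission and approval checks are omitted to keep the example short.

```python
class AmbiguousTimeout(Exception):
    """The call timed out; the write may or may not have been applied."""


class CaseSystem:
    """In-memory stand-in for a system of record with idempotent writes."""

    def __init__(self):
        self.applied = {}              # idempotency_key -> result
        self.fail_after_apply = False  # test hook: simulate a lost reply

    def set_status(self, idempotency_key, case_id, status):
        # A repeated key returns the stored result instead of writing again.
        if idempotency_key not in self.applied:
            self.applied[idempotency_key] = {
                "case_id": case_id,
                "status": status,
            }
        if self.fail_after_apply:
            self.fail_after_apply = False
            raise AmbiguousTimeout(idempotency_key)
        return self.applied[idempotency_key]

    def lookup(self, idempotency_key):
        """Authoritative outcome check: was this key applied?"""
        return self.applied.get(idempotency_key)


def set_status_once(system, idempotency_key, case_id, status):
    """Apply a status change at most once, resolving ambiguous timeouts."""
    try:
        return system.set_status(idempotency_key, case_id, status)
    except AmbiguousTimeout:
        outcome = system.lookup(idempotency_key)
        if outcome is not None:
            return outcome  # the write landed; do not repeat it
        return system.set_status(idempotency_key, case_id, status)
```

The workflow asks the system of record what happened rather than trusting its own memory of the call, which is what lets it resume safely after a restart.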

Measure

Task success/time, steps, tool retries, duplicate prevention, approvals, policy, cost, and replay.

Hybrid field assistant

A field device uses a local model offline and routes complex approved tasks to a private cloud when connected.

Implementation

The route policy considers data class, connectivity, model capability, deadline, and consent; state sync uses versions and idempotent commands.

Operational implications

Sensitive cases fail closed if the private route is unavailable; queued writes are reconciled before replay.
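A route policy along these lines could be sketched as a pure function. All field names, the round-trip estimate, and the `LOCAL_DEGRADED` route for non-sensitive work are assumptions added for illustration; the source only states that sensitive cases fail closed when the private route is unavailable.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    PRIVATE_CLOUD = "private_cloud"
    LOCAL_DEGRADED = "local_degraded"  # assumed best-effort local answer
    REFUSE = "refuse"


@dataclass(frozen=True)
class RouteRequest:
    sensitive: bool          # data class, reduced to one flag here
    connected: bool
    local_capable: bool      # can the on-device model handle the task?
    deadline_seconds: float
    consent_to_cloud: bool


# Hypothetical estimate of a private-cloud round trip.
CLOUD_ROUND_TRIP_SECONDS = 2.0


def choose_route(req):
    """Prefer local; use the private cloud only when it is permitted."""
    if req.local_capable:
        return Route.LOCAL
    cloud_allowed = (
        req.connected
        and req.consent_to_cloud
        and req.deadline_seconds >= CLOUD_ROUND_TRIP_SECONDS
    )
    if cloud_allowed:
        return Route.PRIVATE_CLOUD
    if req.sensitive:
        # Fail closed: no degraded attempt and no alternative route.
        return Route.REFUSE
    return Route.LOCAL_DEGRADED
```

Keeping the policy free of I/O makes it easy to test every combination of inputs and to log the exact reason a route was chosen.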

Measure

Route/fallback, offline success, sync conflicts, duplicate prevention, latency, and model/version parity.

Reference tables

Example map
| Scenario | Primary runtime layers | Highest-risk boundary |
|---|---|---|
| Local research assistant | Local inference, context, product | Private documents/outbound data |
| Enterprise RAG | Context, agentic, private serving | Tenant/semantic data access |
| Browser classifier | Browser graph runtime/product | Client storage/fallback |
| Mobile vision | Edge compiler/runtime/product | Device fleet/model update |
| LLM service | Engine/server/platform | Capacity/tenant isolation |
| Durable agent | Agentic/workflow/tools | Irreversible side effects |
| Hybrid field assistant | Edge/private cloud/agentic | Data movement/state reconciliation |

Decision checklist

  1. Which example most closely matches the deployment and data boundary?
  2. What authority, side effects, and memory must be added or removed?
  3. Which SLO and workload distributions differ?
  4. What fallback is permitted?
  5. Which component is authoritative for business state?
  6. What failure injection will prove recovery?

Common mistakes

  • Copying model/provider choices without compatibility testing.
  • Adding tools to a read-only example without authorization.
  • Using central telemetry that violates a local privacy requirement.
  • Treating local cache as durable product memory.
  • Removing approval to improve demo speed.
  • Skipping workload and failure tests because the happy path works.

Sources and further reading


  1. ExecuTorch overview. PyTorch · Official documentation · accessed 2026-06-21 UTC
  2. ONNX Runtime Web. ONNX Runtime · Official documentation · accessed 2026-06-21 UTC
  3. vLLM documentation. vLLM · Official documentation · accessed 2026-06-21 UTC
  4. Temporal documentation. Temporal · Official documentation · accessed 2026-06-21 UTC
  5. Model Context Protocol specification. MCP · Protocol specification · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
