Key takeaways
- Each example begins with data, authority, SLO, deployment, and failure constraints.
- The same model can require a different runtime architecture in a browser, edge device, private cluster, or durable agent workflow.
- Tools and systems-of-record writes are governed side effects, not ordinary model output.
- Observability and evaluation are part of every example.
- Examples use synthetic identifiers and omit production secrets.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Scenario requirements and component choices.
Owns
Educational architecture patterns and reusable contract ideas.
Emits
A runtime topology, execution path, controls, metrics, and failure/recovery plan.
Does not own
A universal template or vendor endorsement.
Failure modes
Copying an example without adapting identity, data, scale, policy, and failure assumptions.
Evidence and metrics
Scenario-specific task success, latency, quality, cost, policy, and recovery.
Local private research assistant
A desktop application runs a quantized local model, local embeddings, and a read-only indexed document store.
Implementation
The runtime contract disables remote fallback, exposes file sources through a typed context provider, and records citations without uploading document content.
Operational implications
If the model cannot fit, the application offers a smaller approved model or fails explicitly. No network tool is available.
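The fit-or-fail rule above can be sketched as a small selection function. This is a minimal illustration, not a definitive implementation: `ModelOption`, `ModelDoesNotFit`, `choose_local_model`, the model names, and the byte estimates are all hypothetical, and a real application would measure memory rather than trust a static estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    required_bytes: int  # assumed estimate of RAM/VRAM needed to load and run


class ModelDoesNotFit(RuntimeError):
    """Raised when no approved model fits; there is no remote fallback."""


def choose_local_model(approved: list[ModelOption], available_bytes: int) -> ModelOption:
    # Keep only approved models that fit in the memory currently available.
    fitting = [m for m in approved if m.required_bytes <= available_bytes]
    if not fitting:
        # Fail explicitly instead of silently routing to a network model.
        raise ModelDoesNotFit(
            f"no approved model fits in {available_bytes} bytes; remote fallback is disabled"
        )
    # Offer the largest model that fits.
    return max(fitting, key=lambda m: m.required_bytes)


# Synthetic catalog for illustration only.
APPROVED = [
    ModelOption("assistant-8b-q4", 6 * 1024**3),
    ModelOption("assistant-3b-q4", 3 * 1024**3),
]
```

With 4 GiB available the sketch offers the smaller model; with 1 GiB it raises `ModelDoesNotFit`, which the UI can surface as an explicit failure.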
Measure
Model load time, time to first token (TTFT), time per output token (TPOT), RAM/VRAM, citation validity, index freshness, and outbound bytes.
Enterprise RAG with semantic layer
An internal assistant answers governed business questions through typed semantic metrics and approved documents.
Implementation
Identity and tenant are established at the boundary; row- and field-level policy filters the context; the router selects a private model; the output includes evidence and the metric version.
Operational implications
The model cannot run arbitrary SQL. Unsupported metric questions return a typed limitation.
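One way to picture the "typed limitation" is a planner that only accepts metrics from a governed catalog and returns a structured refusal for anything else. The sketch below is an assumption-laden illustration: `MetricDefinition`, `MetricQuery`, `Limitation`, `CATALOG`, and `plan_metric_query` are hypothetical names, and the catalog entry is synthetic.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    version: str
    allowed_dimensions: frozenset[str]


@dataclass(frozen=True)
class MetricQuery:
    metric: str
    version: str  # carried through so the answer can cite the metric version
    dimensions: tuple[str, ...]


@dataclass(frozen=True)
class Limitation:
    code: str
    detail: str


# Synthetic governed catalog; a real semantic layer would own this.
CATALOG = {
    "net_revenue": MetricDefinition("net_revenue", "v3", frozenset({"region", "quarter"})),
}


def plan_metric_query(metric: str, dimensions: tuple[str, ...]) -> Union[MetricQuery, Limitation]:
    definition = CATALOG.get(metric)
    if definition is None:
        # Unknown metrics never become free-form SQL; they become a typed refusal.
        return Limitation("unsupported_metric", f"no governed metric named {metric!r}")
    unknown = sorted(set(dimensions) - definition.allowed_dimensions)
    if unknown:
        return Limitation("unsupported_dimension", f"not allowed for {metric}: {unknown}")
    return MetricQuery(metric, definition.version, dimensions)
```

Because the model can only propose a metric name and dimensions, the query surface stays as narrow as the catalog, and the caller can branch on `Limitation.code` rather than parsing prose.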
Measure
Context provenance, policy denies, metric version, answer evaluation, latency, and cost.
Browser document classifier
A web app downloads a small signed/content-addressed ONNX model and runs WebNN, WebGPU, or Wasm in a Worker.
Implementation
Capability routing is local and remote fallback is opt-in; assets cache by hash; GPU buffers dispose after each batch.
Operational implications
On unsupported or memory-constrained browsers, a non-AI form remains usable.
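The routing decision can be expressed as a pure function. The sketch below is written in Python purely as decision logic; a real web app would implement it in JavaScript or TypeScript inside the Worker. The preference order (WebNN, then WebGPU, then Wasm), the `pick_backend` name, and the `free_memory_bytes` estimate are assumptions, and browsers do not generally expose a reliable free-memory figure.

```python
# Assumed preference order, following the order the backends are listed in the text.
BACKEND_PREFERENCE = ("webnn", "webgpu", "wasm")


def pick_backend(
    supported: set[str],
    free_memory_bytes: int,
    model_bytes: int,
    remote_opt_in: bool = False,
) -> str:
    # Only consider local execution when the model is expected to fit.
    if model_bytes <= free_memory_bytes:
        for backend in BACKEND_PREFERENCE:
            if backend in supported:
                return backend
    # Remote inference is used only when the user has explicitly opted in.
    if remote_opt_in:
        return "remote"
    # Otherwise keep the product usable without AI.
    return "manual_form"
```

For example, `pick_backend({"wasm"}, 50, 100)` falls through to `"manual_form"` because the model does not fit and remote fallback was not opted into.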
Measure
Download/cache, initialization, classification latency, memory, fallback, and UI responsiveness.
Mobile camera inference
A prepared ExecuTorch program partitions supported operations to an NPU delegate and keeps fallback bounded.
Implementation
The app runs camera preprocessing, inference, and postprocessing within a sustained thermal budget and stores no raw image by default.
Operational implications
A signed staged update retains the last-good artifact. Unsupported devices use a smaller CPU model.
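A rough sketch of "retain the last-good artifact" follows. It assumes a SHA-256 digest check as a stand-in for real signature verification, which it does not replace, and a caller-supplied `smoke_test` callable; `ModelSlot` and `apply_staged_update` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelSlot:
    active: bytes     # artifact currently used for inference
    last_good: bytes  # artifact to roll back to


def apply_staged_update(
    slot: ModelSlot,
    candidate: bytes,
    expected_sha256: str,
    smoke_test: Callable[[bytes], bool],
) -> bool:
    # Integrity check (a placeholder for verifying a real signature).
    if hashlib.sha256(candidate).hexdigest() != expected_sha256:
        return False
    # Run a short on-device check before promoting the candidate.
    if not smoke_test(candidate):
        return False
    # Promote only after both checks pass, keeping the previous artifact.
    slot.last_good = slot.active
    slot.active = candidate
    return True
```

A rejected candidate leaves `slot.active` untouched, so inference continues on the artifact that was already working.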
Measure
Delegate coverage, p99 latency, energy, thermals, peak RAM, update success, and quality.
High-throughput LLM service
A private GPU cluster runs an LLM engine behind a model server and Kubernetes serving platform.
Implementation
Paged KV caching, continuous batching, bounded admission, prefix reuse, readiness probes, and goodput-based autoscaling are enabled; a gateway owns authentication and quotas.
Operational implications
Overload returns a stable retry-after error rather than unbounded queueing.
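Bounded admission with a stable retry-after response might look like the sketch below. It is a simplified, single-threaded illustration: `BoundedAdmission` and `Admission` are hypothetical names, the capacity covers in-flight plus queued requests together, and the choice of HTTP 429 with a fixed retry interval is an assumption (503 with `Retry-After` is another common choice).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Admission:
    accepted: bool
    status: int
    retry_after_s: int = 0


class BoundedAdmission:
    def __init__(self, capacity: int, retry_after_s: int = 2) -> None:
        self.capacity = capacity          # max requests admitted at once
        self.retry_after_s = retry_after_s
        self.admitted = 0

    def try_admit(self) -> Admission:
        if self.admitted < self.capacity:
            self.admitted += 1
            return Admission(True, 200)
        # Reject immediately with a stable hint instead of queueing without bound.
        return Admission(False, 429, self.retry_after_s)

    def release(self) -> None:
        # Called when an admitted request finishes or is cancelled.
        if self.admitted > 0:
            self.admitted -= 1
```

A production gate would need locking or an async-safe counter, and would usually derive capacity from measured KV-cache and batch limits rather than a constant.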
Measure
Queue depth, TTFT, TPOT, goodput, prefix-cache hit rate and prefill avoided, HBM usage, errors, and cost.
Durable case-resolution agent
A workflow coordinates context, model calls, tools, human approval, and resumable state over hours.
Implementation
Typed tools carry idempotency keys; status changes require permission and conditional approval; memory writes are explicit; ambiguous timeouts trigger authoritative outcome checks.
Operational implications
The model server may restart without losing task state. Human review sees exact action arguments and evidence.
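The interaction between approval, idempotency keys, and ambiguous timeouts can be sketched with an in-memory stand-in for the system of record. Everything here is hypothetical (`CaseSystem`, `change_status`, `resolve_ambiguous_timeout`, and the key format), and a real workflow engine would persist this state durably.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CaseSystem:
    """In-memory stand-in for the authoritative system of record."""
    applied: dict[str, str] = field(default_factory=dict)  # idempotency key -> status
    write_count: int = 0

    def set_status(self, key: str, status: str) -> str:
        # Replaying a key returns the recorded outcome without writing again.
        if key not in self.applied:
            self.applied[key] = status
            self.write_count += 1
        return self.applied[key]

    def lookup(self, key: str) -> Optional[str]:
        return self.applied.get(key)


def change_status(system: CaseSystem, key: str, status: str, approved: bool) -> str:
    # No side effect happens until a human has approved the exact action.
    if not approved:
        return "pending_approval"
    system.set_status(key, status)
    return "applied"


def resolve_ambiguous_timeout(system: CaseSystem, key: str, status: str) -> str:
    # After a timeout, ask the system of record before retrying.
    if system.lookup(key) is not None:
        return "already_applied"
    system.set_status(key, status)
    return "applied_on_retry"
```

The key point of the design is that a retry after an unclear timeout never produces a second write: either the lookup finds the earlier outcome, or the retry reuses the same key.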
Measure
Task success/time, steps, tool retries, duplicate prevention, approvals, policy, cost, and replay.
Hybrid field assistant
A field device uses a local model offline and routes complex approved tasks to a private cloud when connected.
Implementation
The route policy considers data class, connectivity, model capability, deadline, and consent; state sync uses versions and idempotent commands.
Operational implications
Sensitive cases fail closed if the private route is unavailable; queued writes are reconciled before replay.
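A route policy of this kind can be sketched as a small decision function. The sketch below makes several assumptions not stated in the text: simple tasks always run locally, non-sensitive complex tasks degrade to the local model when the private cloud is unreachable, and the deadline input is omitted for brevity. `Route`, `Task`, and `choose_route` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    PRIVATE_CLOUD = "private_cloud"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Task:
    data_class: str          # assumed labels: "public", "internal", "sensitive"
    needs_large_model: bool  # True when the local model is not capable enough
    consent_to_upload: bool


def choose_route(task: Task, connected: bool) -> Route:
    # Tasks the local model can handle never leave the device.
    if not task.needs_large_model:
        return Route.LOCAL
    # Complex tasks use the private cloud only with connectivity and consent.
    if connected and task.consent_to_upload:
        return Route.PRIVATE_CLOUD
    # Sensitive cases fail closed rather than degrade.
    if task.data_class == "sensitive":
        return Route.REFUSE
    # Assumed behavior: non-sensitive work falls back to a best-effort local answer.
    return Route.LOCAL
```

Keeping the policy as a pure function makes it easy to test every combination of data class, connectivity, and consent before it ships to devices.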
Measure
Route/fallback, offline success, sync conflicts, duplicate prevention, latency, and model/version parity.
Reference tables
| Scenario | Primary runtime layers | Highest-risk boundary |
|---|---|---|
| Local research assistant | Local inference, context, product | Private documents/outbound data |
| Enterprise RAG | Context, agentic, private serving | Tenant/semantic data access |
| Browser classifier | Browser graph runtime/product | Client storage/fallback |
| Mobile vision | Edge compiler/runtime/product | Device fleet/model update |
| LLM service | Engine/server/platform | Capacity/tenant isolation |
| Durable agent | Agentic/workflow/tools | Irreversible side effects |
| Hybrid field assistant | Edge/private cloud/agentic | Data movement/state reconciliation |
Decision checklist
- Which example most closely matches the deployment and data boundary?
- What authority, side effects, and memory must be added or removed?
- Which SLO and workload distributions differ?
- What fallback is permitted?
- Which component is authoritative for business state?
- What failure injection will prove recovery?
Common mistakes
- Copying model/provider choices without compatibility testing.
- Adding tools to a read-only example without authorization.
- Using central telemetry that violates a local privacy requirement.
- Treating local cache as durable product memory.
- Removing approval to improve demo speed.
- Skipping workload and failure tests because the happy path works.
Sources and further reading
- ExecuTorch overview
- ONNX Runtime Web
- vLLM documentation
- Temporal documentation
- Model Context Protocol specification
Last reviewed: 2026-06-21 UTC
