Key takeaways
- Deployment location changes latency, privacy, capacity, update, and observability boundaries.
- The same product may route among local, edge, private, and managed paths according to capability, data policy, connectivity, cost, and SLO.
- Serverless and scale-to-zero deployments make model download, compilation, and warmup part of cold-start design.
- Air-gapped and regulated deployments require artifact provenance, offline update, and local evidence collection.
- Choose a pattern with an explicit operating model, not only a model benchmark.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Workload requirements, model size, data classification, connectivity, hardware, update cadence, SLOs, cost, and ownership constraints.
Owns
Placement boundaries, network paths, update/scaling mechanisms, isolation model, and operational responsibility.
Emits
A topology, routing policy, artifact distribution plan, scaling model, telemetry path, ownership map, and rollback design.
Does not own
Model quality, or any blanket assumption that local execution is private or that cloud capacity is infinitely scalable.
Failure modes
Cold start, unavailable accelerator, network partition, stale model, update drift, insufficient telemetry, and data-boundary violation.
Evidence and metrics
Latency by segment, Goodput (throughput that meets the SLO), startup/readiness time, transfer bytes, offline success, fleet update rate, cost, power, and recovery.
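As a rough illustration of the Goodput metric listed above, the sketch below counts only requests that succeeded and met their latency targets. It assumes Goodput means SLO-compliant requests per second; the `RequestRecord` fields and the two thresholds are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ttft_s: float   # time to first token, in seconds
    total_s: float  # end-to-end latency, in seconds
    ok: bool        # True if the request completed without error


def goodput(records, window_s, ttft_slo_s, total_slo_s):
    """Requests per second that succeeded and met both latency targets."""
    good = sum(
        1
        for r in records
        if r.ok and r.ttft_s <= ttft_slo_s and r.total_s <= total_slo_s
    )
    return good / window_s


records = [
    RequestRecord(0.2, 1.5, True),
    RequestRecord(0.9, 4.0, True),   # completed, but misses both targets
    RequestRecord(0.3, 1.8, False),  # fast, but failed
    RequestRecord(0.4, 2.0, True),
]
print(goodput(records, window_s=2.0, ttft_slo_s=0.5, total_slo_s=2.5))  # prints 1.0
```

Raw throughput for the same window would be 2.0 requests per second, so the gap between the two numbers shows how much served traffic missed its targets.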
Browser deployment
Models execute in the user agent through Wasm, WebGPU, WebNN, or a local JS runtime.
Implementation
Use capability discovery, versioned model assets, Web Workers, caching, explicit resource disposal, and explicit server fallback.
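The selection logic behind capability discovery and server fallback can be sketched independently of any browser API. The following is a minimal policy sketch in Python, assuming the client has already reported which backends it detected; the preference order, the `caps` dictionary shape, and the byte budget are assumptions made for illustration, and real detection would run in the page itself.

```python
def choose_backend(caps, model_bytes, budget_bytes):
    """Pick the first usable client backend, or fall back to the server.

    caps: booleans reported by the client, e.g. {"webgpu": True, "wasm": True}
    model_bytes: size of the versioned model asset
    budget_bytes: how much the client is willing to download and hold
    """
    if model_bytes > budget_bytes:
        return "server"  # explicit fallback: the model does not fit the budget
    for backend in ("webnn", "webgpu", "wasm"):  # assumed preference order
        if caps.get(backend, False):
            return backend
    return "server"  # explicit fallback: no local backend available


print(choose_backend({"webgpu": True, "wasm": True}, 100, 200))  # prints webgpu
```

Returning "server" as an ordinary value, rather than raising, keeps the fallback visible to callers so it can be logged and counted.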
Operational implications
The client fleet supplies capacity but hardware/browser variation and storage eviction complicate support.
Measure
Download/cache, initialization, local latency, memory, fallback, and offline success.
Local desktop deployment
Models execute on user-controlled CPU/GPU hardware, often through a local engine or API.
Implementation
Manage quantized packages, memory fit, offload, local API security, updates, and hardware diagnostics.
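A back-of-the-envelope memory-fit check can catch packages that will not load before they ship. The sketch below assumes a decoder-style transformer, an effective bits-per-weight figure that already includes quantization metadata, and a flat 20% overhead factor for activations and runtime buffers; the model shape in the example is hypothetical.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights at a given effective bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits(available_bytes, n_params, bits_per_weight,
         n_layers, n_kv_heads, head_dim, context_len, overhead=1.2):
    """Return (fits, estimated_bytes). The overhead factor is an assumption."""
    need = (
        weight_bytes(n_params, bits_per_weight)
        + kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
    ) * overhead
    return need <= available_bytes, need


# Hypothetical 7B-parameter model at 4 bits with an 8192-token context.
ok, need = fits(8 * GIB, 7_000_000_000, 4, 32, 8, 128, 8192)
print(ok, round(need / GIB, 2))
```

An estimate like this is only a first filter; measured RAM/VRAM at load and under sustained use remains the evidence that matters.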
Operational implications
Local execution avoids WAN dependence but still has telemetry, supply-chain, and device-security boundaries.
Measure
Load, TTFT/TPOT, RAM/VRAM, sustained power, package update, and API errors.
Mobile and embedded deployment
Models execute close to sensors and users under power, thermal, memory, and OS lifecycle limits.
Implementation
Use AOT artifacts, static planning, delegate partitioning, signed staged updates, and offline policy.
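One common way to stage an update across a fleet is to assign each device a stable bucket and widen the eligible range over time. The sketch below shows only that cohort assignment, assuming string device IDs and a release label; signature verification of the update itself is out of scope here, and both function names are hypothetical.

```python
import hashlib


def rollout_bucket(device_id, release):
    """Map a device to a stable bucket in [0, 100) for a given release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_stage(device_id, release, percent):
    """True if the device belongs to the first `percent` of the rollout."""
    return rollout_bucket(device_id, release) < percent


eligible = [d for d in ("dev-001", "dev-002", "dev-003") if in_stage(d, "model-v2", 50)]
print(eligible)
```

Because the bucket depends only on the release label and device ID, a device that is eligible at 10% stays eligible at 50%, and including the release label means a different subset of devices goes first for each release.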
Operational implications
Short developer benchmarks do not prove sustained behavior under camera, UI, and battery load.
Measure
Peak RAM, delegate coverage, latency, energy, thermals, update success, and offline tasks.
Edge server deployment
Site-local servers provide stronger accelerators than end devices and aggregate traffic from nearby devices.
Implementation
Deploy containerized or appliance runtimes with local registry/cache, disconnected operation, site failover, and buffered telemetry.
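Buffered telemetry for disconnected operation can be sketched as a bounded queue that records how much it had to drop. This is a minimal in-memory illustration; the class name and the `send` callback are assumptions, and a real site would likely persist the buffer to disk.

```python
from collections import deque


class TelemetryBuffer:
    """Bounded buffer that keeps the newest events while the uplink is down."""

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)
        self.dropped = 0  # count of events evicted because the buffer was full

    def record(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the oldest event is about to be evicted
        self._events.append(event)

    def flush(self, send):
        """Send events oldest-first; stop at the first failure and keep the rest."""
        sent = 0
        while self._events:
            if not send(self._events[0]):
                break
            self._events.popleft()
            sent += 1
        return sent


buf = TelemetryBuffer(max_events=3)
for i in range(5):
    buf.record({"seq": i})
print(buf.dropped)  # prints 2
```

Reporting the `dropped` count alongside the flushed events lets the central side distinguish a quiet site from one that lost evidence during a long disconnection.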
Operational implications
Many small sites multiply operations, version drift, and hardware variation.
Measure
Site capacity, LAN latency, model version, disconnected duration, update adoption, and recovery.
Private cloud and Kubernetes
Organization-controlled clusters host model servers and agent services.
Implementation
Use runtime definitions, GPU/device operators, network policy, model repositories, autoscaling, and tested upgrade rings.
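For the autoscaling piece, the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler is a useful mental model: desired replicas equal the current count scaled by the ratio of observed to target metric, rounded up. The sketch below reproduces that arithmetic only; the tolerance and the replica bounds are illustrative parameters, and the choice of metric (queue depth per replica here) is an assumption.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling: ceil(current * observed / target), then clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target; avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=30, target_metric=10))  # prints 10
print(desired_replicas(4, current_metric=5, target_metric=10))   # prints 2
```

For model servers, the computed replica count is only half the story: each new replica still has to schedule onto an accelerator and load the model, which is why scale-up time appears among the measurements.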
Operational implications
Control increases responsibility for drivers, scheduling, storage, security, and capacity.
Measure
Replica Goodput, queue, scale-up, utilization, rollout, cost, and incident recovery.
Managed cloud inference
A provider supplies runtime, accelerator, endpoint, or platform capabilities.
Implementation
Pin the provider/model release where possible, and configure region/residency, quotas, fallback, observability, and cost controls.
Operational implications
Managed operation reduces platform burden but adds exposure to service limits, provider roadmap changes, pricing, and lock-in.
Measure
Provider latency, throttles, errors, cost, region route, version, and fallback.
Serverless and microVM patterns
Request-driven functions or containers scale to zero and isolate invocations.
Implementation
Model the cold-start phases explicitly, and account for package/storage limits, concurrency fan-out, idempotency, and durable external state.
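Breaking cold start into named phases makes it clear which ones a cache or pre-warmed image can remove. The sketch below is a simple additive model; the phase names and the durations in the example are hypothetical placeholders to be replaced with measured values.

```python
COLD_PHASES = (
    "sandbox_boot",
    "runtime_import",
    "model_download",
    "model_load",
    "warmup",
)


def cold_start_seconds(phase_s, cached=frozenset()):
    """Sum phase durations, skipping phases already satisfied by a cache."""
    return sum(phase_s[p] for p in COLD_PHASES if p not in cached)


def expected_startup_penalty(phase_s, cold_fraction, cached=frozenset()):
    """Average startup delay per request if `cold_fraction` of requests are cold."""
    return cold_fraction * cold_start_seconds(phase_s, cached)


# Hypothetical measurements, in seconds.
phases = {
    "sandbox_boot": 0.5,
    "runtime_import": 2.0,
    "model_download": 20.0,
    "model_load": 6.0,
    "warmup": 1.5,
}
print(cold_start_seconds(phases))                              # prints 30.0
print(cold_start_seconds(phases, cached={"model_download"}))   # prints 10.0
```

An average hides the tail, so a model like this is best used to decide which phase to attack first, with measured cold and warm latency distributions as the actual acceptance evidence.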
Operational implications
Large resident models and long-lived KV state usually fit persistent servers better.
Measure
Cold/warm start, model load, invocation, throttle, retry, downstream Goodput, and cost.
Hybrid routing
A policy router chooses local, edge, private, or managed execution.
Implementation
Evaluate data classification, device capability, model quality, connectivity, SLO, budget, and state location before moving data.
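The ordering of those checks matters: the data policy is evaluated first, so a request that cannot be served on an allowed path is refused rather than silently moved. The sketch below illustrates that ordering under simplified assumptions; the data classes, the `ALLOWED` table, and the single `online` flag (covering any network reachability) are hypothetical, and cost, SLO, and state location are omitted for brevity.

```python
from dataclasses import dataclass

# Ordered from least to most data movement.
PATHS = ("local", "edge", "private", "managed")

# Hypothetical policy: which paths each data class may reach.
ALLOWED = {
    "public": {"local", "edge", "private", "managed"},
    "internal": {"local", "edge", "private"},
    "restricted": {"local", "edge"},
}


@dataclass(frozen=True)
class Request:
    data_class: str          # "public", "internal", or "restricted"
    needs_large_model: bool  # True if the on-device model is not good enough
    online: bool             # True if any network path is reachable
    device_capable: bool     # True if the device can run the local model


def route(req):
    """Return (path, reason); path is None when no allowed path can serve."""
    for path in PATHS:
        if path not in ALLOWED[req.data_class]:
            continue  # data policy is checked before anything else
        if path == "local" and (not req.device_capable or req.needs_large_model):
            continue
        if path != "local" and not req.online:
            continue
        return path, f"first allowed path for {req.data_class} data"
    return None, "no path satisfies data policy and connectivity"


print(route(Request("restricted", True, True, True)))
print(route(Request("internal", True, False, True)))
```

Returning a reason with every decision gives the route distribution and fallback-reason metrics something concrete to aggregate.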
Operational implications
Fallback must disclose changed residency/capability and reconcile state without duplicate side effects.
Measure
Route distribution, fallback reason, latency/cost by path, data transfer, and conflicts.
Air-gapped and regulated operation
Disconnected or tightly controlled zones require local artifacts, identity, policy, telemetry, and updates.
Implementation
Use signed offline packages, local registries, controlled evidence export, known-good rollback, and retention governance.
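Part of importing an offline package is checking every file against a manifest of expected digests. The sketch below covers only that digest comparison, assuming the manifest itself has already been signature-verified by a separate step that is not shown; the file names and the manifest layout are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_package(files, manifest):
    """Return sorted names that are missing, altered, or not in the manifest.

    files: mapping of file name to bytes, as found in the imported package
    manifest: mapping of file name to expected SHA-256 hex digest
    """
    failures = []
    for name, expected in manifest.items():
        data = files.get(name)
        if data is None or not hmac.compare_digest(sha256_hex(data), expected):
            failures.append(name)
    # Files present in the package but absent from the manifest are also rejected.
    failures.extend(name for name in files if name not in manifest)
    return sorted(failures)


files = {"model.bin": b"weights", "runtime.cfg": b"threads=4"}
manifest = {name: sha256_hex(data) for name, data in files.items()}
print(verify_package(files, manifest))  # prints []
```

A non-empty result is an integrity failure: it should block the import, leave the known-good version in place, and be recorded as local evidence.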
Operational implications
Keeping software current and patched is harder; plan for review and for import/export procedures.
Measure
Version age, update success, integrity failures, local Goodput, evidence export, and recovery.
Research and benchmark environments
Controlled environments isolate runtime variables for measurement and experimentation.
Implementation
Pin software/hardware, publish methodology, separate test credentials/data, and avoid treating benchmark topology as production-ready.
Operational implications
A benchmark result does not define availability, security, rollout, or operating cost.
Measure
Reproducibility, variance, configuration drift, and experiment integrity.
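Variance and configuration drift can both be tracked with very little code. The sketch below summarizes repeated runs with a coefficient of variation and fingerprints the pinned configuration so that two results can be compared only when their fingerprints match; the configuration keys and latency values are hypothetical.

```python
import hashlib
import json
import statistics


def summarize_runs(latencies_ms):
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(latencies_ms)
    stdev = statistics.stdev(latencies_ms)  # requires at least two runs
    return {"mean_ms": mean, "stdev_ms": stdev, "cv": stdev / mean}


def config_fingerprint(config):
    """Stable hash of a JSON-serializable configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


runs = [100, 102, 98, 101, 99]
config = {"runtime": "example-runtime 1.2.3", "driver": "555.10", "batch_size": 8}
print(summarize_runs(runs)["cv"])
print(config_fingerprint(config)[:12])
```

Publishing the fingerprint next to each result makes drift detectable: a changed driver or batch size produces a different hash even when the headline number looks similar.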
Reference tables
| Pattern | Privacy boundary | Latency | Model capacity | Scaling | Update complexity |
|---|---|---|---|---|---|
| Browser | Client origin/device | Local after startup | Device constrained | By client fleet | Web assets and cache invalidation |
| Local desktop | User-controlled host | Local | RAM/GPU constrained | Per host | App/runtime/model updates |
| Mobile/edge device | Device/site | Very low | Power/thermal constrained | Fleet distribution | App/firmware/fleet rollout |
| Edge server | Local site/LAN | Low | Moderate-large | Node pool/site | Distributed infrastructure |
| Private cloud/K8s | Organization | Network plus queue | Large | Cluster autoscaling | Platform/model rollout |
| Managed cloud | Provider/region | Network plus service | Large/elastic | Managed quotas/capacity | Provider API/releases |
| Serverless | Provider invocation boundary | Cold/warm dependent | Often limited/specialized | Scale-to-zero/burst | Startup critical |
| Air-gapped | Disconnected zone | Local network | Installed hardware | Planned/manual | Signed offline packages |
Decision checklist
- What data and artifacts cross each physical or organizational boundary?
- Which hardware, memory, connectivity, and power assumptions are guaranteed?
- What startup and warmup behavior applies after restart or cache loss?
- How are model and runtime versions distributed, verified, and rolled back?
- What route and fallback policy applies?
- How are telemetry and incident evidence retained while offline?
- Who owns every failure domain and cost center?
Common mistakes
- Calling browser or local execution private while exporting prompts or telemetry.
- Assuming managed capacity is infinite or immediately available.
- Designing serverless inference without measuring model cold start.
- Using hybrid fallback that silently changes data residency.
- Updating models independently of runtime compatibility.
- Treating a benchmark topology as a production operating model.
Sources and further reading
- KServe architecture
- ExecuTorch overview
- ONNX Runtime Web
- Web Neural Network API
- Knative Serving
Last reviewed: 2026-06-21 UTC
