Key takeaways
- Browser AI is a runtime-selection problem among Wasm, WebGPU, WebNN, and remote fallback, not a single capability.
- WebGPU enables general GPU compute but availability, features, memory, and browser policy vary.
- WebNN provides a graph API that can map to platform accelerators where supported, so it requires progressive enhancement.
- Model download, tokenizer/assets, storage, initialization, and disposal often dominate user experience.
- Local execution can improve privacy, but telemetry, fallback, model URLs, and browser storage remain data paths.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Web application, model assets, browser/device capabilities, user input, privacy/fallback policy, and storage quota.
Owns
Backend selection, model download/cache, Worker isolation, GPU buffer lifecycle, UI responsiveness, and progressive fallback.
Emits
Client-local results, progress/token events, cache state, capability telemetry, and optional fallback requests.
Does not own
Universal API support, unlimited memory, or privacy when the application exports data.
Failure modes
Unsupported API, feature mismatch, OOM/device loss, tab suspension, cache eviction, UI blocking, download failure, and unsafe fallback.
Evidence and metrics
Download/cache, initialization, time to first token (TTFT) and latency, main-thread blocking, GPU memory, disposal, fallback, and offline success.
WebAssembly path
Wasm provides broad CPU execution with optional SIMD and threads where browser isolation/features allow.
Implementation
Use Worker execution, feature detection, bounded memory, and a portable model/backend.
Operational implications
It is a strong baseline for small models but may be too slow or memory-limited for large generative workloads.
Measure
Initialization, CPU utilization, memory, main-thread blocking, and latency.
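The feature-detection step can be sketched as a small planning function. This is a minimal illustration, not a definitive implementation: `WasmEnv`, `WasmPlan`, `planWasm`, and the default cap of four threads are hypothetical names and values. It assumes the caller copies the probe values from the page, and it relies on the fact that Wasm threads need shared memory, which browsers expose only on cross-origin isolated pages.

```typescript
// Hypothetical probe shape; the caller copies these values from the browser.
interface WasmEnv {
  WebAssembly?: unknown;
  SharedArrayBuffer?: unknown;
  crossOriginIsolated?: boolean;
  hardwareConcurrency?: number;
}

interface WasmPlan {
  supported: boolean;
  threads: number; // 1 = single-threaded, 0 = Wasm unavailable
}

// Decide whether Wasm can run and how many threads to request.
function planWasm(env: WasmEnv, maxThreads: number = 4): WasmPlan {
  if (typeof env.WebAssembly !== "object" || env.WebAssembly === null) {
    return { supported: false, threads: 0 };
  }
  // Threads require SharedArrayBuffer, which needs cross-origin isolation.
  const canShareMemory =
    typeof env.SharedArrayBuffer === "function" && env.crossOriginIsolated === true;
  const cores = env.hardwareConcurrency ?? 1;
  const threads = canShareMemory ? Math.max(1, Math.min(maxThreads, cores)) : 1;
  return { supported: true, threads };
}
```

In a page, the probe could be filled from `globalThis.WebAssembly`, `globalThis.SharedArrayBuffer`, `globalThis.crossOriginIsolated`, and `navigator.hardwareConcurrency`. SIMD detection is left out of the sketch on purpose.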
WebGPU path
WebGPU exposes modern GPU compute and device-local resources.
Implementation
Detect adapter/device features and limits, keep tensors on GPU where possible, handle device loss, and dispose buffers.
Operational implications
Availability, memory limits, shader compilation, and driver/browser variation require fallback.
Measure
Adapter/device, compile, GPU memory, transfers, kernel time, device loss, and TTFT.
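Adapter and device acquisition with a limit check and a device-loss hook might look like the sketch below. The structural types (`GpuLike`, `AdapterLike`, `DeviceLike`), the `acquireGpu` function, and the reason strings are hypothetical; they mirror only the parts of the WebGPU API used here (`requestAdapter`, `features.has`, `limits.maxStorageBufferBindingSize`, `requestDevice`, `device.lost`) so the example compiles without the WebGPU type package. Passing the real `navigator.gpu` may need a type cast.

```typescript
// Minimal structural stand-ins for the WebGPU objects used below.
interface DeviceLike {
  lost: Promise<{ reason?: string; message: string }>;
  destroy(): void;
}
interface AdapterLike {
  features: { has(name: string): boolean };
  limits: { maxBufferSize: number; maxStorageBufferBindingSize: number };
  requestDevice(desc?: {
    requiredFeatures?: string[];
    requiredLimits?: Record<string, number>;
  }): Promise<DeviceLike>;
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

type GpuResult =
  | { ok: true; device: DeviceLike; f16: boolean }
  | { ok: false; reason: string };

async function acquireGpu(
  gpu: GpuLike | undefined,
  minBindingBytes: number,
  onLost: (reason: string) => void,
): Promise<GpuResult> {
  if (!gpu) return { ok: false, reason: "webgpu-unavailable" };
  const adapter = await gpu.requestAdapter();
  if (!adapter) return { ok: false, reason: "no-adapter" };
  // Refuse early when the adapter cannot bind buffers of the required size.
  if (adapter.limits.maxStorageBufferBindingSize < minBindingBytes) {
    return { ok: false, reason: "limit-too-small" };
  }
  const f16 = adapter.features.has("shader-f16");
  let device: DeviceLike;
  try {
    device = await adapter.requestDevice({
      requiredFeatures: f16 ? ["shader-f16"] : [],
      requiredLimits: { maxStorageBufferBindingSize: minBindingBytes },
    });
  } catch {
    return { ok: false, reason: "device-request-failed" };
  }
  // Device loss is reported asynchronously; the caller decides how to fall back.
  void device.lost.then((info) => onLost(info.reason ?? info.message));
  return { ok: true, device, f16 };
}
```

Returning a reason string instead of throwing keeps the fallback decision and the telemetry value in one place.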
WebNN path
WebNN represents neural-network graphs and lets the platform select CPU/GPU/NPU acceleration.
Implementation
Query support, compile a graph, manage operands/tensors, and fall back when operators or implementation are unavailable.
Operational implications
It is a standards path, not a guarantee that every browser/device accelerates every model.
Measure
API/operator coverage, graph build, backend choice, latency, and fallback.
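The operator-coverage check that gates the WebNN path can be illustrated with a pure function. `checkOperatorCoverage` and `WebnnDecision` are hypothetical names, and how the supported set is obtained from a given WebNN implementation is deliberately left as an assumption; the sketch only shows the decision and the value worth logging.

```typescript
interface WebnnDecision {
  useWebnn: boolean;
  missing: string[]; // operators the model needs but the backend lacks
}

// Compare the operators a model needs against what the backend reports.
function checkOperatorCoverage(
  requiredOps: readonly string[],
  supportedOps: ReadonlySet<string>,
): WebnnDecision {
  const missing = Array.from(new Set(requiredOps))
    .filter((op) => !supportedOps.has(op))
    .sort();
  return { useWebnn: missing.length === 0, missing };
}
```

The `missing` list doubles as the fallback reason, which makes coverage gaps visible per browser and device.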
Model delivery and storage
Weights, graph, tokenizer, adapters, and runtime code have distinct versions and caching behavior.
Implementation
Use content-addressed assets, integrity where practical, progress, quota checks, resumable/chunked delivery, and a complete manifest.
Operational implications
Browser storage is quota-controlled and can be evicted; partial caches must not be treated as ready.
Measure
Download bytes/time, cache hit, quota failure, manifest integrity, and eviction.
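A manifest-driven readiness check can be sketched as follows. The `Manifest`, `AssetEntry`, `CachedAsset`, and `CacheReport` shapes and the `checkCache` function are hypothetical; the sketch assumes the application records a hash and byte count for each cached asset and treats anything that does not match the manifest as not ready.

```typescript
interface AssetEntry { path: string; sha256: string; bytes: number }
interface Manifest { version: string; assets: AssetEntry[] }
interface CachedAsset { sha256: string; bytes: number }

interface CacheReport {
  ready: boolean;      // true only when every asset matches the manifest
  missing: string[];   // not cached at all
  invalid: string[];   // cached but partial, stale, or corrupt
  bytesToFetch: number;
  fitsQuota: boolean;
}

function checkCache(
  manifest: Manifest,
  cached: ReadonlyMap<string, CachedAsset>,
  quotaRemaining: number,
): CacheReport {
  const missing: string[] = [];
  const invalid: string[] = [];
  let bytesToFetch = 0;
  for (const asset of manifest.assets) {
    const hit = cached.get(asset.path);
    if (hit && hit.sha256 === asset.sha256 && hit.bytes === asset.bytes) continue;
    (hit ? invalid : missing).push(asset.path);
    bytesToFetch += asset.bytes; // a mismatched asset is fetched again in full
  }
  return {
    ready: missing.length === 0 && invalid.length === 0,
    missing,
    invalid,
    bytesToFetch,
    fitsQuota: bytesToFetch <= quotaRemaining,
  };
}
```

In a browser, `quotaRemaining` could be derived from `navigator.storage.estimate()` as quota minus usage. Resumable delivery would refine `bytesToFetch` for partial assets; the sketch takes the conservative full re-fetch.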
Workers and UI responsiveness
Initialization, tokenization, and inference can block rendering if run on the main thread.
Implementation
Use dedicated Workers, a stable message protocol, cancellation, and careful transfer/shared-buffer strategy.
Operational implications
Copying large tensors across threads can erase gains; transferable objects avoid the copy, and SharedArrayBuffer requires a cross-origin isolated page (COOP and COEP headers).
Measure
Long tasks, input delay, message bytes/time, cancel, Worker restart, and UI responsiveness.
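A stable message protocol with cancellation can be sketched as a typed handler. The message shapes, `createHandler`, and the injected `generate` function are hypothetical; the sketch assumes generation is exposed as an async iterable, so the Worker's event loop can process a cancel message between tokens.

```typescript
type WorkerRequest =
  | { type: "run"; id: number; prompt: string }
  | { type: "cancel"; id: number };

type WorkerResponse =
  | { type: "token"; id: number; text: string }
  | { type: "done"; id: number }
  | { type: "cancelled"; id: number }
  | { type: "error"; id: number; message: string };

function createHandler(
  generate: (prompt: string) => AsyncIterable<string>,
  post: (msg: WorkerResponse) => void,
): (msg: WorkerRequest) => Promise<void> {
  const active = new Set<number>();
  const cancelled = new Set<number>();

  async function run(id: number, prompt: string): Promise<void> {
    active.add(id);
    try {
      for await (const text of generate(prompt)) {
        // Check between tokens so a cancel takes effect promptly.
        if (cancelled.has(id)) {
          post({ type: "cancelled", id });
          return;
        }
        post({ type: "token", id, text });
      }
      post({ type: "done", id });
    } catch (err) {
      post({ type: "error", id, message: err instanceof Error ? err.message : String(err) });
    } finally {
      active.delete(id);
      cancelled.delete(id);
    }
  }

  return (msg: WorkerRequest): Promise<void> => {
    if (msg.type === "cancel") {
      if (active.has(msg.id)) cancelled.add(msg.id); // ignore unknown ids
      return Promise.resolve();
    }
    return run(msg.id, msg.prompt);
  };
}
```

Inside a dedicated Worker, the returned function would be wired to `self.onmessage` and `post` to `postMessage`. Every response carries the request `id`, so the page can discard events from a cancelled or superseded request.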
GPU resource lifecycle
GPU tensors and buffers can persist beyond one request.
Implementation
Use runtime-specific dispose/release, handle cancellation/navigation/device loss, and bound reusable pools.
Operational implications
Leaks produce gradual tab crashes and are hard to see through JavaScript heap metrics alone.
Measure
GPU memory estimates, buffers created/destroyed, request cleanup, and device loss.
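A bounded reusable pool with explicit disposal can be sketched like this. `BoundedPool` and `Destroyable` are hypothetical names, and the sketch assumes every pooled resource is interchangeable (same size and usage) and exposes a `destroy()` method, as WebGPU buffers do. The counters give the created/destroyed numbers listed above.

```typescript
interface Destroyable { destroy(): void }

class BoundedPool<T extends Destroyable> {
  created = 0;
  destroyed = 0;
  private readonly free: T[] = [];
  private readonly inUse = new Set<T>();
  private readonly create: () => T;
  private readonly maxFree: number;

  constructor(create: () => T, maxFree: number) {
    this.create = create;
    this.maxFree = maxFree;
  }

  // Reuse an idle resource when one exists, otherwise create a new one.
  acquire(): T {
    let res = this.free.pop();
    if (res === undefined) {
      res = this.create();
      this.created++;
    }
    this.inUse.add(res);
    return res;
  }

  // Keep at most maxFree idle resources; destroy the rest immediately.
  release(res: T): void {
    if (!this.inUse.delete(res)) return; // ignore double release
    if (this.free.length < this.maxFree) this.free.push(res);
    else this.destroyOne(res);
  }

  // Call on cancellation, navigation, or device loss.
  disposeAll(): void {
    for (const res of this.inUse) this.destroyOne(res);
    for (const res of this.free) this.destroyOne(res);
    this.inUse.clear();
    this.free.length = 0;
  }

  private destroyOne(res: T): void {
    res.destroy();
    this.destroyed++;
  }
}
```

A growing gap between `created` and `destroyed` across repeated runs is the leak signal that JavaScript heap metrics tend to miss.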
Graph capture and I/O binding
Stable graphs and shapes can reduce dispatch and copies; I/O binding keeps data on device.
Implementation
Use capture and binding only for supported shapes and well-defined resource lifetimes; disclose capture state when reporting benchmarks.
Operational implications
Dynamic input shapes or frequent output readback can reduce or cancel the benefit.
Measure
Capture hit, copies avoided, binding failures, shape fallback, and latency.
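Shape-keyed capture with an explicit fallback can be sketched as a small cache. `CaptureCache` and the injected `capture` function are hypothetical and do not correspond to any specific runtime's API; the sketch assumes capture returns `null` for unsupported shapes and that the caller then runs the ordinary uncaptured path.

```typescript
class CaptureCache<G> {
  hits = 0;
  misses = 0;    // shapes captured on first use
  fallbacks = 0; // shapes that run uncaptured
  private readonly graphs = new Map<string, G>();
  private readonly capture: (shape: readonly number[]) => G | null;
  private readonly maxEntries: number;

  constructor(capture: (shape: readonly number[]) => G | null, maxEntries: number) {
    this.capture = capture;
    this.maxEntries = maxEntries;
  }

  // Returns a captured graph for the shape, or null to signal shape fallback.
  lookup(shape: readonly number[]): G | null {
    const key = shape.join("x");
    const cached = this.graphs.get(key);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    // Bound the cache so dynamic shapes cannot grow device memory without limit.
    const graph = this.graphs.size < this.maxEntries ? this.capture(shape) : null;
    if (graph === null) {
      this.fallbacks++;
      return null;
    }
    this.misses++;
    this.graphs.set(key, graph);
    return graph;
  }
}
```

Reporting `hits`, `misses`, and `fallbacks` alongside latency is one way to include capture state in benchmark disclosure.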
Privacy and fallback
Local inference keeps data off a model server only if analytics, crash reporting, plugins, and fallback paths also keep it on the device.
Implementation
Document outbound paths and require policy/user choice when remote fallback changes residency. Return route evidence.
Operational implications
Extensions and untrusted assets remain part of the threat model.
Measure
Outbound bytes, remote fallback, consent, telemetry mode, and route distribution.
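The policy and consent gate on remote fallback, together with the route evidence it returns, can be sketched as a pure function. `FallbackPolicy`, `RouteEvidence`, `decideRoute`, and the reason strings are hypothetical. Note that `dataLeavesDevice` here describes only the inference route; telemetry and other outbound paths need their own accounting.

```typescript
type Route = "local" | "remote" | "refused";

interface FallbackPolicy {
  remoteAllowed: boolean;
  consentRequired: boolean;
}

interface RouteEvidence {
  route: Route;
  reason: string;
  dataLeavesDevice: boolean; // inference route only, not telemetry
}

function decideRoute(
  localAvailable: boolean,
  policy: FallbackPolicy,
  userConsented: boolean,
): RouteEvidence {
  if (localAvailable) {
    return { route: "local", reason: "local-backend-ready", dataLeavesDevice: false };
  }
  if (!policy.remoteAllowed) {
    return { route: "refused", reason: "remote-disallowed-by-policy", dataLeavesDevice: false };
  }
  if (policy.consentRequired && !userConsented) {
    return { route: "refused", reason: "consent-missing", dataLeavesDevice: false };
  }
  return { route: "remote", reason: "local-unavailable", dataLeavesDevice: true };
}
```

Because the function never returns `remote` without passing both checks, a silent fallback would have to bypass it explicitly, which is easier to catch in review.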
Progressive enhancement
A usable product needs capability tiers and a non-AI or remote alternative.
Implementation
Select WebNN, WebGPU, Wasm, smaller local model, explicit server, or non-AI UX according to policy and capability.
Operational implications
Avoid blank or broken experiences on unsupported devices.
Measure
Capability tier, fallback reason, conversion, task success, and support incidents.
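Tier selection by policy and capability can be sketched as an ordered walk that records why each tier was skipped. The `Tier` names, `Capabilities` shape, and `selectTier` function are hypothetical; the sketch assumes capability flags were filled in by earlier detection and that a non-AI experience is always available as the last resort.

```typescript
type Tier = "webnn" | "webgpu" | "wasm" | "remote" | "non-ai";

interface Capabilities {
  webnn: boolean;
  webgpu: boolean;
  wasm: boolean;
  remoteAllowed: boolean; // policy decision, not a device capability
}

interface TierChoice {
  tier: Tier;
  skipped: { tier: Tier; reason: string }[];
}

function isAvailable(tier: Tier, caps: Capabilities): boolean {
  if (tier === "non-ai") return true;
  if (tier === "remote") return caps.remoteAllowed;
  return caps[tier];
}

// Walk the preferred order and return the first usable tier.
function selectTier(order: readonly Tier[], caps: Capabilities): TierChoice {
  const skipped: { tier: Tier; reason: string }[] = [];
  for (const tier of order) {
    if (isAvailable(tier, caps)) return { tier, skipped };
    skipped.push({ tier, reason: tier === "remote" ? "policy" : "unsupported" });
  }
  // Never leave the user with a blank experience.
  return { tier: "non-ai", skipped };
}
```

The `skipped` list maps directly onto the fallback-reason metric, and the order array keeps the selection policy explicit and reviewable.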
Reference tables
| Path | Strength | Constraint | Fallback role |
|---|---|---|---|
| WebAssembly | Broad CPU reach | CPU throughput/memory | Baseline local path |
| WebGPU | General GPU compute | Availability/device limits | High-performance local path |
| WebNN | Platform graph acceleration | Implementation/operator coverage | Preferred path where verified |
| Remote inference | Large models/central capacity | Network/cost/residency | Explicit policy fallback |

| Use case | Preferred path | Fallback | Primary risk |
|---|---|---|---|
| Small classifier | Wasm or WebNN | Remote | Startup cost exceeding task cost |
| Embeddings/search | WebGPU/WebNN | Wasm or remote | Memory/cache |
| Local LLM | WebGPU | Smaller local or explicit remote | Download/VRAM/compatibility |
| Offline private tool | Verified local | No or approved private fallback | Telemetry leakage |
| Broad public feature | Progressive enhancement | Remote or non-AI UX | Uneven support |
Decision checklist
- Which browsers, devices, and API features are in scope?
- What model download and storage budget is acceptable?
- Is the backend selection and fallback order explicit?
- Can work stay off the main thread?
- How are GPU resources disposed on cancellation or route change?
- What data leaves the browser?
- How does the app behave offline or after cache eviction?
- What usable fallback exists for unsupported devices?
Common mistakes
- Assuming WebGPU or WebNN is available everywhere.
- Downloading large assets without consent/quota checks.
- Blocking the main thread during initialization.
- Leaking GPU buffers across repeated runs.
- Calling local execution private while analytics sends prompts.
- Silently falling back to a remote model.
- Treating cached files as complete without a manifest.
Sources and further reading
- WebGPU API
- Web Neural Network API
- ONNX Runtime Web
- WebGPU execution provider
- Web Workers API
- Storage quotas and eviction
Last reviewed: 2026-06-21 UTC
