Key takeaways
- Edge deployment is a packaging and lifecycle problem as much as an inference problem.
- Ahead-of-time (AOT) preparation and static memory planning reduce the on-device footprint and startup work.
- Delegation partitions supported operations to GPU/NPU backends and requires an explicit fallback policy.
- Thermal and sustained-power limits can dominate short benchmark results.
- Offline operation requires local model, policy, telemetry buffering, update, and rollback design.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Prepared model program, device capabilities, sensors/local inputs, power policy, update package, and offline rules.
Owns
AOT package, delegate partitioning, memory, device lifecycle, offline execution, and fleet compatibility.
Emits
Local results, device metrics, buffered telemetry, model-update state, and fallback/degradation status.
Does not own
Cloud availability, unrestricted sensor collection, or proof that a delegate supports every operation.
Failure modes
Unsupported delegate op, CPU fallback, memory exhaustion, thermal throttling, battery drain, interrupted update, stale fleet, and lost telemetry.
Evidence and metrics
Binary/model size, peak RAM, startup, latency, energy, thermals, delegate coverage, update success, and offline success.
Edge constraints
Devices have bounded RAM, storage, sustained power, cooling, background execution, and connectivity.
Implementation
Define worst-case input, concurrent system load, duty cycle, operating mode, and offline duration.
Operational implications
Foreground developer tests can pass while background, thermal, or battery behavior fails in production.
Measure
Peak/sustained latency, RAM, storage, power, temperature, and task completion.
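One way to make the envelope testable is to write the budget down as data and compare measurements against it. The sketch below is illustrative only: the `Envelope` fields, units, and numbers are assumptions chosen for the example, not values from any particular device.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Envelope:
    """Hypothetical operating envelope; every field is an upper bound."""
    peak_ram_mb: float
    storage_mb: float
    sustained_power_mw: float
    max_temp_c: float
    p99_latency_ms: float


def over_budget(budget: Envelope, measured: Envelope) -> dict:
    """Return {field: (measured, limit)} for every field that exceeds its limit."""
    result = {}
    for f in fields(budget):
        limit = getattr(budget, f.name)
        value = getattr(measured, f.name)
        if value > limit:
            result[f.name] = (value, limit)
    return result


# Example numbers are invented for illustration.
budget = Envelope(256.0, 64.0, 900.0, 41.0, 50.0)
measured = Envelope(240.0, 70.0, 850.0, 43.0, 48.0)
over = over_budget(budget, measured)
```

Measurements should come from the worst-case scenario defined above (full input size, concurrent load, sustained duty cycle), not from a short foreground run.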
ExecuTorch and AOT preparation
ExecuTorch exports PyTorch programs ahead of time into a portable on-device program (a .pte file) that a compact C++ runtime loads and executes.
Implementation
Retain export constraints, backend versions, delegated partitions, quantization recipe, memory plan, and target matrix.
Operational implications
The export pipeline is part of the release and supply-chain evidence.
Measure
Export coverage, PTE/package size, load, startup, delegate partitions, and parity.
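The items to retain can be captured in a small release manifest stored next to the exported program. This is a sketch under assumptions: the field names and the `build_manifest` helper are hypothetical, and the bytes below are a placeholder rather than a real exported program. It does not call any ExecuTorch API.

```python
import hashlib
import json


def build_manifest(pte_bytes: bytes, *, export_constraints: dict,
                   backend_versions: dict, delegated_partitions: int,
                   quant_recipe: str, memory_plan: str, targets: list) -> dict:
    """Collect export evidence for one artifact (hypothetical schema)."""
    return {
        "pte_sha256": hashlib.sha256(pte_bytes).hexdigest(),
        "pte_size_bytes": len(pte_bytes),
        "export_constraints": export_constraints,
        "backend_versions": backend_versions,
        "delegated_partitions": delegated_partitions,
        "quant_recipe": quant_recipe,
        "memory_plan": memory_plan,
        "targets": targets,
    }


# Placeholder bytes stand in for the exported program file.
pte_bytes = b"\x00placeholder-pte"
manifest = build_manifest(
    pte_bytes,
    export_constraints={"max_batch": 1, "input_hw": [224, 224]},
    backend_versions={"example_backend": "0.0.0"},  # invented version string
    delegated_partitions=2,
    quant_recipe="int8-symmetric-per-tensor",
    memory_plan="static-greedy",
    targets=["device-class-a", "device-class-b"],
)
# Sorted keys keep the serialized manifest stable across builds.
serialized = json.dumps(manifest, sort_keys=True, indent=2)
```

Hashing the artifact ties the manifest to exactly one package, which is what makes it usable as release evidence.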
Delegates and partitioning
Delegates (called execution providers in ONNX Runtime) claim the graph regions they support and run them on a GPU, NPU, DSP, or vendor runtime.
Implementation
Record final partition, transfers, layouts, fallback, and fail-closed rules.
Operational implications
The percentage of delegated nodes matters less than whether the delegated regions are contiguous and large enough to amortize the cost of transferring tensors between backends.
Measure
Delegate coverage, partition count, bytes transferred, fallback, latency, and power.
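To illustrate why contiguity matters, the toy partitioner below groups a linear list of operations into delegate and CPU regions, demotes delegate regions that are too short to justify a transfer, and can fail closed instead of falling back. It is a simplified sketch: real partitioners work on graphs, and the op names, `min_region` threshold, and `partition` function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    backend: str  # "delegate" or "cpu"
    ops: list


def _merge(regions: list) -> list:
    """Merge neighbouring regions that share a backend."""
    merged = []
    for r in regions:
        if merged and merged[-1].backend == r.backend:
            merged[-1].ops.extend(r.ops)
        else:
            merged.append(Region(r.backend, list(r.ops)))
    return merged


def partition(ops: list, supported: set, min_region: int = 2,
              allow_cpu_fallback: bool = True) -> list:
    # 1. Tag each op, then group consecutive ops with the same backend.
    tagged = [Region("delegate" if op in supported else "cpu", [op]) for op in ops]
    regions = _merge(tagged)
    # 2. Demote delegate regions too short to amortize a transfer.
    for r in regions:
        if r.backend == "delegate" and len(r.ops) < min_region:
            r.backend = "cpu"
    regions = _merge(regions)
    # 3. Fail closed when the route must not run on the CPU.
    if not allow_cpu_fallback and any(r.backend == "cpu" for r in regions):
        raise RuntimeError("CPU fallback required but not allowed")
    return regions


ops = ["conv", "relu", "conv", "custom_nms", "softmax"]
supported = {"conv", "relu", "softmax"}
regions = partition(ops, supported)
delegated = sum(len(r.ops) for r in regions if r.backend == "delegate")
coverage = delegated / len(ops)
transitions = len(regions) - 1  # each boundary implies a tensor hand-off
```

In this example `softmax` is supported but isolated, so delegating it would add a transfer for a single op. Recording the region count and transitions alongside coverage makes that trade-off visible.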
Static and dynamic memory
Static plans reuse known tensor lifetimes; dynamic buffers handle variable input and backend workspaces.
Implementation
Measure peak memory on the actual target device, including camera/audio buffers, UI, OS, and concurrent components.
Operational implications
A model that fits alone can fail inside the full application.
Measure
Peak RSS, allocator calls, fragmentation, OOM, and input-shape headroom.
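The idea behind static planning can be shown with a small greedy planner: tensors whose lifetimes do not overlap may share the same arena offset. This is a minimal sketch of one common heuristic (largest tensor first, lowest free offset), not the algorithm any specific runtime uses. Tensor names, sizes, and lifetimes are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int   # bytes
    first: int  # first execution step that uses the tensor (inclusive)
    last: int   # last execution step that uses the tensor (inclusive)


def plan(tensors: list) -> tuple:
    """Assign arena offsets; returns ({name: offset}, arena_size)."""
    placed = []  # (offset, tensor) pairs already assigned
    offsets = {}
    for t in sorted(tensors, key=lambda x: -x.size):
        # Byte ranges held by tensors that are live at the same time as t.
        busy = sorted(
            (off, off + p.size)
            for off, p in placed
            if not (p.last < t.first or t.last < p.first)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # t fits in the gap before this range
            offset = max(offset, end)
        placed.append((offset, t))
        offsets[t.name] = offset
    arena = max((off + t.size for off, t in placed), default=0)
    return offsets, arena


tensors = [
    Tensor("a", 4, 0, 1),
    Tensor("b", 4, 1, 2),
    Tensor("c", 4, 2, 3),
    Tensor("d", 2, 3, 4),
]
offsets, arena = plan(tensors)
naive = sum(t.size for t in tensors)  # no reuse at all
```

The planned arena is smaller than the naive sum because `a`/`c` and `b`/`d` never coexist. The plan covers only tensors with known lifetimes; dynamic buffers and backend workspaces still need separate headroom.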
Quantization and hardware kernels
Low precision reduces storage and can use efficient NPU/DSP paths.
Implementation
Validate the conversion, the calibration data or quantization-aware training, supported operations, accumulator precision, and task quality.
Operational implications
If unsupported quantized operations fall back to the CPU, power and latency can get worse rather than better.
Measure
Model size, quality, delegate kernel use, fallback, energy, and latency.
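As a reminder of what conversion actually changes, here is a minimal symmetric per-tensor int8 quantizer in plain Python. It is a sketch of one simple scheme, assumed for illustration; production toolchains typically add per-channel scales, zero points, and calibration. The weight values are invented.

```python
def quantize_int8(values: list) -> tuple:
    """Symmetric per-tensor quantization to [-127, 127]; values must be non-empty."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list, scale: float) -> list:
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Numerical error like `max_error` is only a proxy. The validation that matters is task quality on representative data, plus confirmation that the quantized kernels actually run on the intended accelerator.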
Real-time and thermal behavior
Some edge systems require bounded deadlines and stable performance.
Implementation
Test worst-case scheduling, input rate, competing workloads, priority, thermal steady state, and power mode.
Operational implications
Average latency does not prove deadline compliance.
Measure
p99/max latency, deadline miss, jitter, temperature, frequency, power, and battery.
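The gap between average and tail latency is easy to show numerically. The sketch below computes a nearest-rank p99, the maximum, the deadline miss rate, and peak-to-peak jitter. The sample data and the 33 ms deadline are invented, and peak-to-peak is only one of several possible jitter definitions.

```python
def latency_report(samples_ms: list, deadline_ms: float) -> dict:
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    # Nearest-rank percentile: rank = ceil(0.99 * n), using integer math.
    rank = (99 * n + 99) // 100
    misses = sum(1 for s in ordered if s > deadline_ms)
    return {
        "mean_ms": sum(ordered) / n,
        "p99_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
        "deadline_miss_rate": misses / n,
        "jitter_ms": ordered[-1] - ordered[0],  # peak-to-peak
    }


# 98 fast frames and 2 slow ones, e.g. after thermal throttling kicks in.
samples = [10] * 98 + [40] * 2
report = latency_report(samples, deadline_ms=33)
```

Here the mean sits comfortably under the deadline while 2% of frames miss it, which is exactly the case an average hides. Samples should be collected at thermal steady state under competing load.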
Offline operation
The device must continue with local models, policy, identity cache, state, and error handling when disconnected.
Implementation
Define what can execute offline, expiry of credentials/data, queued sync, conflict resolution, and telemetry buffering.
Operational implications
Do not queue privileged actions that require current central authority.
Measure
Offline task success, queue age, sync conflicts, stale policy/model, and buffered telemetry.
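The queueing rules above can be sketched as a small in-memory queue that refuses privileged actions and separates expired items at sync time. The `Action` fields, the action names, and the TTL values are hypothetical. A real device would persist the queue and handle conflicts on the server side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    created_at: float  # seconds on a device clock
    ttl_s: float       # how long the action stays valid offline
    privileged: bool = False


class OfflineQueue:
    def __init__(self):
        self._items = []

    def enqueue(self, action: Action) -> bool:
        """Queue an action for later sync; privileged actions are refused."""
        if action.privileged:
            return False
        self._items.append(action)
        return True

    def oldest_age(self, now: float) -> float:
        """Age of the oldest queued action, or 0.0 when the queue is empty."""
        if not self._items:
            return 0.0
        return now - min(a.created_at for a in self._items)

    def drain(self, now: float) -> tuple:
        """Return (to_sync, expired) and empty the queue."""
        to_sync = [a for a in self._items if now - a.created_at <= a.ttl_s]
        expired = [a for a in self._items if now - a.created_at > a.ttl_s]
        self._items.clear()
        return to_sync, expired


queue = OfflineQueue()
accepted = queue.enqueue(Action("upload_result", 0.0, 3600.0))
rejected = queue.enqueue(Action("unlock_door", 10.0, 3600.0, privileged=True))
queue.enqueue(Action("stale_reading", 0.0, 60.0))
age = queue.oldest_age(now=600.0)
to_sync, expired = queue.drain(now=600.0)
```

Refusing at enqueue time, rather than at sync time, means the caller learns immediately that the action needs connectivity and can report that to the user.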
Fleet updates
Large fleets require signed staged rollout and rollback across hardware classes.
Implementation
Use device capability checks, resumable download, integrity, rings, health gates, and retained known-good model.
Operational implications
Interrupted updates must not brick inference or remove last-good artifacts.
Measure
Adoption, update failures, rollback, version age, storage pressure, and compatibility.
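A minimal sketch of the stage, verify, activate pattern with a retained last-good model follows. The file names and the `install_model` and `rollback` helpers are hypothetical. A SHA-256 check covers integrity only; verifying a signature against a trusted key is assumed to happen separately and is not shown.

```python
import hashlib
import os
import shutil
import tempfile


def _atomic_copy(src: str, dst: str) -> None:
    """Copy via a temp file and rename, so dst is never half-written."""
    tmp = dst + ".tmp"
    shutil.copyfile(src, tmp)
    os.replace(tmp, dst)


def install_model(store: str, payload: bytes, expected_sha256: str) -> bool:
    active = os.path.join(store, "active.pte")
    last_good = os.path.join(store, "last_good.pte")
    staged = os.path.join(store, "staged.pte")
    with open(staged, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # Verify what is actually on disk, not the in-memory payload.
    with open(staged, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        os.remove(staged)
        return False  # active model is untouched
    if os.path.exists(active):
        _atomic_copy(active, last_good)
    os.replace(staged, active)  # atomic switch to the new model
    return True


def rollback(store: str) -> bool:
    last_good = os.path.join(store, "last_good.pte")
    if not os.path.exists(last_good):
        return False
    _atomic_copy(last_good, os.path.join(store, "active.pte"))
    return True


with tempfile.TemporaryDirectory() as store:
    v1, v2 = b"model-v1", b"model-v2"
    ok_v1 = install_model(store, v1, hashlib.sha256(v1).hexdigest())
    bad = install_model(store, b"truncated", hashlib.sha256(v2).hexdigest())
    ok_v2 = install_model(store, v2, hashlib.sha256(v2).hexdigest())
    rolled_back = rollback(store)
    with open(os.path.join(store, "active.pte"), "rb") as f:
        active_after_rollback = f.read()
```

Because the active file is only ever replaced by rename, an interruption at any step leaves either the old or the new model in place. The cost is storage for up to three copies, which belongs in the storage budget.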
Privacy and telemetry
Local sensors and personal data can be sensitive even when inference stays on device.
Implementation
Minimize collection, classify/buffer securely, provide consent, aggregate where possible, and include model/runtime version.
Operational implications
Delayed telemetry can be misinterpreted without device state and UTC event time.
Measure
Raw-data collection, upload bytes, consent, retention, redaction, and delayed event age.
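To show how event time and device state travel with a buffered record, the sketch below builds a telemetry event with a UTC timestamp, model and runtime versions, and an allow-listed subset of device state, then computes how late it arrived. The field names and the `ALLOWED_STATE` allow-list are assumptions for illustration.

```python
from datetime import datetime, timezone

# Hypothetical allow-list: only these state keys may leave the device.
ALLOWED_STATE = {"battery_pct", "thermal_state", "online"}


def make_event(name: str, model_version: str, runtime_version: str,
               device_state: dict, event_time: datetime = None) -> dict:
    event_time = event_time or datetime.now(timezone.utc)
    return {
        "name": name,
        "event_time_utc": event_time.astimezone(timezone.utc).isoformat(),
        "model_version": model_version,
        "runtime_version": runtime_version,
        # Drop anything not explicitly allowed, such as raw sensor payloads.
        "device_state": {k: v for k, v in device_state.items()
                         if k in ALLOWED_STATE},
    }


def delayed_age_s(event: dict, received_at: datetime) -> float:
    """Seconds between when the event happened and when it was received."""
    happened = datetime.fromisoformat(event["event_time_utc"])
    return (received_at - happened).total_seconds()


event = make_event(
    "inference_fallback",
    model_version="1.4.0",    # invented version strings
    runtime_version="0.9.2",
    device_state={"battery_pct": 18, "thermal_state": "throttled",
                  "online": False, "raw_audio": b"\x00\x01"},
    event_time=datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc),
)
age_s = delayed_age_s(event, datetime(2026, 1, 1, 18, 0, tzinfo=timezone.utc))
```

An allow-list fails safe: a new state key added elsewhere in the app is dropped from telemetry until someone reviews it and adds it deliberately.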
Reference tables
| Constraint | Runtime response | Evidence |
|---|---|---|
| Small binary/storage | Selective kernels/compressed assets | Package size/dependency map |
| Limited RAM | Static planning/buffer reuse/quantization | Peak RSS/allocation trace |
| Power/thermal | NPU delegation/duty cycle | Sustained power/temperature |
| Intermittent connectivity | Offline models/queued telemetry | Offline scenario tests |
| Hardware diversity | Capability discovery/fallback | Coverage by device class |
| Fleet updates | Signed staged rollout/rollback | Version adoption/failure |
Decision checklist
- Is offline operation required and for how long?
- Which accelerator is guaranteed for each device class?
- Is CPU fallback acceptable for latency, power, and privacy?
- What peak RAM and persistent storage budget exists?
- How will models/runtimes update and roll back?
- What sustained thermal and battery targets apply?
- How are offline traces buffered and protected?
Common mistakes
- Benchmarking a cool device for only a few seconds.
- Assuming delegate support from a marketing device label.
- Allowing silent CPU fallback in a real-time route.
- Shipping without enough RAM for activations and input buffers.
- Using unsigned or non-rollbackable model updates.
- Collecting raw sensor data in telemetry without policy.
Sources and further reading
- ExecuTorch overview
- ExecuTorch delegation
- LiteRT
- ONNX Runtime mobile
Last reviewed: 2026-06-21 UTC
