Edge, Mobile, and TinyML

Key takeaways

Edge deployment is a packaging and lifecycle problem as much as an inference problem.
AOT preparation and static memory planning reduce target footprint and startup work.
Delegation partitions supported operations to GPU/NPU backends and requires an explicit fallback policy.
Thermal and sustained-power limits can matter more than short benchmark results.
Offline operation requires local model, policy, telemetry buffering, update, and rollback design.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Prepared model program, device capabilities, sensors/local inputs, power policy, update package, and offline rules.

Owns

AOT package, delegate partitioning, memory, device lifecycle, offline execution, and fleet compatibility.

Emits

Local results, device metrics, buffered telemetry, model-update state, and fallback/degradation status.

Does not own

Cloud availability, unrestricted sensor collection, or proof that a delegate supports every operation.

Failure modes

Unsupported delegate op, CPU fallback, memory exhaustion, thermal throttling, battery drain, interrupted update, stale fleet, and lost telemetry.

Evidence and metrics

Binary/model size, peak RAM, startup, latency, energy, thermals, delegate coverage, update success, and offline success.

Edge constraints

Devices have bounded RAM, storage, sustained power, cooling, background execution, and connectivity.

Implementation

Define worst-case input, concurrent system load, duty cycle, operating mode, and offline duration.

Operational implications

Foreground developer tests can pass while background, thermal, or battery behavior fails in production.

Measure

Peak/sustained latency, RAM, storage, power, temperature, and task completion.

ExecuTorch and AOT preparation

ExecuTorch exports PyTorch programs into a portable on-device representation with a compact C++ runtime.

Implementation

Retain export constraints, backend versions, delegated partitions, quantization recipe, memory plan, and target matrix.
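
One way to retain these items is a small manifest written next to each exported package. The sketch below is illustrative only: the `ExportManifest` fields, the backend name, and the version strings are hypothetical and are not part of ExecuTorch or any other toolchain.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExportManifest:
    """Hypothetical release record for one exported model package."""
    model_name: str
    package_sha256: str        # hash of the exported package bytes
    exporter_version: str      # version of the export toolchain
    backend_versions: dict     # backend name -> version string
    quantization_recipe: str   # name of the recipe that was applied
    planned_memory_bytes: int  # size of the static memory plan
    delegated_partitions: int  # partitions claimed by delegates
    target_devices: tuple      # device classes the package was validated on


def build_manifest(package_bytes: bytes, **fields) -> ExportManifest:
    """Hash the package and attach the remaining release evidence."""
    digest = hashlib.sha256(package_bytes).hexdigest()
    return ExportManifest(package_sha256=digest, **fields)


def manifest_json(manifest: ExportManifest) -> str:
    # sort_keys gives a stable serialization that can be diffed or signed.
    return json.dumps(asdict(manifest), sort_keys=True)


# All values below are made up for illustration.
manifest = build_manifest(
    b"fake-package-bytes",
    model_name="keyword_spotter",
    exporter_version="0.0.0-example",
    backend_versions={"example_npu_backend": "1.4.0"},
    quantization_recipe="int8_per_channel_example",
    planned_memory_bytes=196_608,
    delegated_partitions=2,
    target_devices=("phone_class_a", "phone_class_b"),
)
print(manifest_json(manifest))
```

Storing the package hash alongside the toolchain versions lets a release reviewer tie a device-side artifact back to the export run that produced it.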

Operational implications

The export pipeline is part of the release and supply-chain evidence.

Measure

Export coverage, PTE/package size, load time, startup time, delegate partitions, and numerical parity with the source model.

Delegates and partitioning

Delegates or execution providers claim supported graph regions for GPU, NPU, DSP, or vendor runtime.

Implementation

Record final partition, transfers, layouts, fallback, and fail-closed rules.

Operational implications

The percentage of delegated nodes matters less than whether the delegated regions are contiguous and useful, and what each transfer across a partition boundary costs.

Measure

Delegate coverage, partition count, bytes transferred, fallback, latency, and power.
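
The difference between node coverage and useful partitioning can be made concrete with a small calculation. This is a minimal sketch that assumes a simplified linear graph, where each node is labeled with the backend that runs it; the function name and the `"npu"`/`"cpu"` labels are hypothetical and not taken from any delegate API.

```python
from itertools import groupby


def partition_stats(assignments, delegate="npu"):
    """Summarize a per-node backend assignment listed in execution order."""
    if not assignments:
        raise ValueError("assignments must not be empty")
    # Collapse consecutive nodes on the same backend into runs.
    runs = [(backend, len(list(group))) for backend, group in groupby(assignments)]
    delegated = [length for backend, length in runs if backend == delegate]
    return {
        "coverage": sum(delegated) / len(assignments),
        "partitions": len(delegated),
        "largest_partition": max(delegated, default=0),
        # Every change of backend implies a hand-off of intermediate tensors.
        "boundary_crossings": len(runs) - 1,
    }


# Two hypothetical graphs with the same node coverage (6 of 10 nodes).
fragmented = ["npu", "cpu"] * 4 + ["npu", "npu"]
contiguous = ["cpu", "cpu"] + ["npu"] * 6 + ["cpu", "cpu"]

print(partition_stats(fragmented))
print(partition_stats(contiguous))
```

Both graphs report 60% coverage, but the fragmented one has five delegated partitions and eight boundary crossings, while the contiguous one has a single partition and two crossings. Real graphs are not linear, so a production metric would count tensor edges that cross partitions and the bytes they carry.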

Static and dynamic memory

Static plans reuse known tensor lifetimes; dynamic buffers handle variable input and backend workspaces.
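
Lifetime-based reuse can be sketched with a simple greedy planner. This is one common heuristic (place the largest tensors first at the lowest non-conflicting offset), not the algorithm of any particular runtime; the `Tensor` record and the example sizes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int       # bytes
    first_use: int  # index of the first step that touches the tensor
    last_use: int   # index of the last step that touches it (inclusive)


def plan_offsets(tensors):
    """Assign each tensor an arena offset so live tensors never overlap."""
    placed = []   # (offset, tensor) pairs already assigned
    offsets = {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        # Address ranges held by tensors whose lifetime overlaps this one.
        busy = sorted(
            (off, off + p.size)
            for off, p in placed
            if p.first_use <= t.last_use and t.first_use <= p.last_use
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # the gap before this range is large enough
            offset = max(offset, end)
        placed.append((offset, t))
        offsets[t.name] = offset
    arena = max((off + t.size for off, t in placed), default=0)
    return offsets, arena


# A hypothetical four-tensor chain of intermediate activations.
tensors = [
    Tensor("a", 400, 0, 1),
    Tensor("b", 300, 1, 2),
    Tensor("c", 400, 2, 3),
    Tensor("d", 100, 3, 4),
]
offsets, arena = plan_offsets(tensors)
naive = sum(t.size for t in tensors)
print(offsets, arena, naive)
```

For this chain the plan needs a 700-byte arena instead of the 1200 bytes that separate allocations would use, because `a` and `c` are never live at the same time and can share offset 0.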

Implementation

Measure peak memory on the exact target device, including camera/audio buffers, UI, OS, and concurrent components.

Operational implications

A model that fits alone can fail inside the full application.

Measure

Peak RSS, allocator calls, fragmentation, OOM, and input-shape headroom.

Quantization and hardware kernels

Low precision reduces storage and can use efficient NPU/DSP paths.

Implementation

Validate conversion, calibration/training, supported operations, accumulator precision, and task quality.
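
Two of these checks, conversion error and accumulator precision, can be illustrated in plain Python. The sketch below assumes asymmetric (affine) int8 quantization with a single scale and zero point per tensor; the function names are hypothetical, and real toolchains may use per-channel scales or different rounding.

```python
def quantize_params(lo, hi, qmin=-128, qmax=127):
    """Derive scale and zero point for an affine mapping of [lo, hi]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def accumulator_bits(dot_length, operand_bits=8):
    """Signed bits needed for the worst-case sum of dot_length products."""
    max_product = (1 << (operand_bits - 1)) ** 2  # (-128) * (-128) for int8
    return (dot_length * max_product).bit_length() + 1  # +1 for the sign


values = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = quantize_params(min(values), max(values))
errors = [
    abs(v - dequantize(quantize(v, scale, zero_point), scale, zero_point))
    for v in values
]
print(scale, zero_point, max(errors))
print(accumulator_bits(1024), accumulator_bits(131072))
```

Round-trip error stays within half a quantization step for in-range values, which is a property worth asserting before measuring task quality. The accumulator check shows why int8 products are usually summed in 32 bits: a 1024-element dot product fits comfortably, while a worst-case 131072-element one would need 33 bits.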

Operational implications

If unsupported operations fall back to the CPU, power and latency can get worse instead of better.

Measure

Model size, quality, delegate kernel use, fallback, energy, and latency.

Real-time and thermal behavior

Some edge systems require bounded deadlines and stable performance.

Implementation

Test worst-case scheduling, input rate, competing workloads, priority, thermal steady state, and power mode.

Operational implications

Average latency does not prove deadline compliance.

Measure

p99/max latency, deadline miss, jitter, temperature, frequency, power, and battery.
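
The gap between average latency and deadline compliance shows up as soon as tail metrics are computed from a trace. The following is a minimal sketch with a made-up trace and a hypothetical 16 ms deadline; it uses a nearest-rank percentile and reports jitter as the population standard deviation, which is one of several reasonable definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def deadline_report(latencies_ms, deadline_ms):
    misses = sum(1 for x in latencies_ms if x > deadline_ms)
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
        "jitter_ms": statistics.pstdev(latencies_ms),
        "deadline_misses": misses,
        "miss_rate": misses / len(latencies_ms),
    }


# Hypothetical trace: 97 fast frames and 3 slow ones (for example, throttled).
trace = [8.0] * 97 + [40.0] * 3
report = deadline_report(trace, deadline_ms=16.0)
print(report)
```

Here the mean is under 9 ms and looks safe against a 16 ms deadline, yet p99 is 40 ms and three frames miss. A trace collected at thermal steady state, rather than on a cool device, makes this kind of report more representative.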

Offline operation

The device must continue with local models, policy, identity cache, state, and error handling when disconnected.

Implementation

Define what can execute offline, expiry of credentials/data, queued sync, conflict resolution, and telemetry buffering.

Operational implications

Do not queue privileged actions that require current central authority.
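
A queue that enforces this rule can be sketched as follows. The `Action` fields, the status strings, and the example action names are hypothetical; the sketch only shows refusing privileged actions at submit time and expiring stale ones at sync time, and it leaves out persistence and conflict resolution.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    created_at: float  # seconds on the device clock
    ttl_s: float       # how long the action remains valid while offline
    requires_central_authority: bool = False


class OfflineQueue:
    def __init__(self, max_items=1000):
        self._items = deque()
        self._max_items = max_items

    def submit(self, action):
        """Queue an action for later sync, or refuse it immediately."""
        if action.requires_central_authority:
            # Fail closed: never defer something that needs a current decision.
            return "refused_privileged"
        if len(self._items) >= self._max_items:
            return "refused_full"
        self._items.append(action)
        return "queued"

    def drain(self, now):
        """On reconnect, split queued actions into still-valid and expired."""
        ready, expired = [], []
        while self._items:
            action = self._items.popleft()
            is_stale = now - action.created_at > action.ttl_s
            (expired if is_stale else ready).append(action)
        return ready, expired


queue = OfflineQueue()
statuses = [
    queue.submit(Action("log_reading", created_at=0, ttl_s=600)),
    queue.submit(Action("sync_note", created_at=0, ttl_s=60)),
    queue.submit(Action("unlock_door", 0, 600, requires_central_authority=True)),
]
ready, expired = queue.drain(now=300)
print(statuses, [a.name for a in ready], [a.name for a in expired])
```

Refusing at submit time gives the caller an immediate, explicit failure to surface to the user, instead of an action that silently executes later under outdated authority.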

Measure

Offline task success, queue age, sync conflicts, stale policy/model, and buffered telemetry.

Fleet updates

Large fleets require signed, staged rollouts and rollback across hardware classes.

Implementation

Use device capability checks, resumable downloads, integrity verification, rollout rings, health gates, and a retained known-good model.

Operational implications

Interrupted updates must not brick inference or remove last-good artifacts.
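
One way to meet this requirement is to stage the new artifact, verify it, and only then swap it in with an atomic rename while keeping the previous version. The sketch below is an assumption-laden illustration: the file names are hypothetical, it checks only a SHA-256 digest (a real pipeline would also verify a signature and flush writes to storage), and it relies on `os.replace` being atomic when source and destination are on the same filesystem.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_model(model_dir: Path, payload: bytes, expected_sha256: str) -> bool:
    """Install payload as the current model, keeping the old one as known-good."""
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        return False  # reject before touching any installed artifact
    current = model_dir / "current.bin"
    staged = model_dir / "staged.tmp"
    staged.write_bytes(payload)
    if current.exists():
        backup = model_dir / "known_good.tmp"
        backup.write_bytes(current.read_bytes())
        os.replace(backup, model_dir / "known_good.bin")
    # An interruption before this line leaves current.bin untouched.
    os.replace(staged, current)
    return True


def rollback(model_dir: Path) -> bool:
    """Restore the retained known-good model, if there is one."""
    known_good = model_dir / "known_good.bin"
    if not known_good.exists():
        return False
    restored = model_dir / "restore.tmp"
    restored.write_bytes(known_good.read_bytes())
    os.replace(restored, model_dir / "current.bin")
    return True


def demo():
    v1, v2 = b"model-v1", b"model-v2"
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp)
        results = {
            "install_v1": install_model(model_dir, v1, hashlib.sha256(v1).hexdigest()),
            "install_v2": install_model(model_dir, v2, hashlib.sha256(v2).hexdigest()),
            "install_corrupt": install_model(model_dir, b"garbage", "0" * 64),
        }
        results["current_after_corrupt"] = (model_dir / "current.bin").read_bytes()
        results["rollback"] = rollback(model_dir)
        results["current_after_rollback"] = (model_dir / "current.bin").read_bytes()
        return results


print(demo())
```

The corrupt payload is rejected without changing the installed model, and rollback copies the known-good file rather than moving it, so the last-good artifact is still present afterwards.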

Measure

Adoption, update failures, rollback, version age, storage pressure, and compatibility.

Privacy and telemetry

Local sensors and personal data can be sensitive even when inference stays on device.

Implementation

Minimize collection, classify and buffer data securely, obtain consent, aggregate where possible, and include the model/runtime version.

Operational implications

Delayed telemetry can be misinterpreted without device state and UTC event time.
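
A buffered event can carry the context needed to interpret it later. This is a minimal in-memory sketch: the class, field names, and version strings are hypothetical, and a real buffer would also persist, encrypt, and redact according to policy. It stamps each event with its UTC event time and device state, bounds the buffer, and reports event age at upload.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class TelemetryBuffer:
    def __init__(self, max_events, model_version, runtime_version):
        self._events = deque(maxlen=max_events)  # oldest events are evicted first
        self._versions = {
            "model_version": model_version,
            "runtime_version": runtime_version,
        }
        self.dropped = 0

    def record(self, name, device_state, event_time=None):
        event_time = event_time or datetime.now(timezone.utc)
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # count evictions so loss is visible, not silent
        self._events.append(
            {"name": name, "event_time": event_time, "device_state": dict(device_state)}
        )

    def flush(self, upload_time=None):
        """Return buffered events annotated with their age at upload time."""
        upload_time = upload_time or datetime.now(timezone.utc)
        batch = [
            {
                "name": e["name"],
                "event_time_utc": e["event_time"].isoformat(),
                "delayed_event_age_s": (upload_time - e["event_time"]).total_seconds(),
                "device_state": e["device_state"],
                **self._versions,
            }
            for e in self._events
        ]
        self._events.clear()
        return batch


t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
state = {"thermal": "nominal", "network": "offline"}
buffer = TelemetryBuffer(max_events=2, model_version="kws-1.2.0", runtime_version="rt-0.9")
buffer.record("inference_ok", state, t0)
buffer.record("fallback_cpu", state, t0 + timedelta(seconds=60))
buffer.record("inference_ok", state, t0 + timedelta(seconds=120))
batch = buffer.flush(upload_time=t0 + timedelta(hours=1))
print(buffer.dropped, batch)
```

Keeping the event time separate from the upload time lets the backend distinguish a problem that happened an hour ago from one happening now, and the dropped counter makes telemetry loss measurable.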

Measure

Raw-data collection, upload bytes, consent, retention, redaction, and delayed event age.

Reference tables

Edge constraint and response
Constraint	Runtime response	Evidence
Small binary/storage	Selective kernels/compressed assets	Package size/dependency map
Limited RAM	Static planning/buffer reuse/quantization	Peak RSS/allocation trace
Power/thermal	NPU delegation/duty cycle	Sustained power/temperature
Intermittent connectivity	Offline models/queued telemetry	Offline scenario tests
Hardware diversity	Capability discovery/fallback	Coverage by device class
Fleet updates	Signed staged rollout/rollback	Version adoption/failure

Decision checklist

Is offline operation required and for how long?
Which accelerator is guaranteed for each device class?
Is CPU fallback acceptable for latency, power, and privacy?
What peak RAM and persistent storage budget exists?
How will models/runtimes update and roll back?
What sustained thermal and battery targets apply?
How are offline traces buffered and protected?

Common mistakes

Benchmarking a cool device for only a few seconds.
Assuming delegate support from a marketing device label.
Allowing silent CPU fallback in a real-time route.
Shipping without enough RAM for activations and input buffers.
Using unsigned or non-rollbackable model updates.
Collecting raw sensor data in telemetry without policy.

Sources and further reading

ExecuTorch overview · PyTorch · Official documentation · accessed 2026-06-21 UTC
ExecuTorch delegation · PyTorch · Official documentation · accessed 2026-06-21 UTC
LiteRT · Google AI Edge · Official documentation · accessed 2026-06-21 UTC
ONNX Runtime mobile · ONNX Runtime · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
