Edge, Mobile, and TinyML

Design edge, mobile, and TinyML runtimes with AOT artifacts, ExecuTorch, delegates, quantization, CPU/GPU/NPU partitioning, offline operation, fleet updates, and real-time constraints.

Audience: Technical readers · Reading time: 5 minutes · Status: Production guidance · Last reviewed: 2026-06-21 UTC

Key takeaways

  • Edge deployment is a packaging and lifecycle problem as much as an inference problem.
  • AOT preparation and static memory planning reduce target footprint and startup work.
  • Delegation partitions supported operations to GPU/NPU backends and requires an explicit fallback policy.
  • Thermal and sustained-power limits can dominate short benchmark results.
  • Offline operation requires local model, policy, telemetry buffering, update, and rollback design.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Prepared model program, device capabilities, sensors/local inputs, power policy, update package, and offline rules.

Owns

AOT package, delegate partitioning, memory, device lifecycle, offline execution, and fleet compatibility.

Emits

Local results, device metrics, buffered telemetry, model-update state, and fallback/degradation status.

Does not own

Cloud availability, unrestricted sensor collection, or proof that a delegate supports every operation.

Failure modes

Unsupported delegate op, CPU fallback, memory exhaustion, thermal throttling, battery drain, interrupted update, stale fleet, and lost telemetry.

Evidence and metrics

Binary/model size, peak RAM, startup, latency, energy, thermals, delegate coverage, update success, and offline success.

Edge constraints

Devices have bounded RAM, storage, sustained power, cooling, background execution, and connectivity.

Implementation

Define worst-case input, concurrent system load, duty cycle, operating mode, and offline duration.

Operational implications

Foreground developer tests can pass while background, thermal, or battery behavior fails in production.

Measure

Peak/sustained latency, RAM, storage, power, temperature, and task completion.
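
The constraints and measurements above can be written down as an explicit budget that every test run is compared against. The following is a minimal sketch, not a prescribed format: the `Budget` fields, the limits, and the measured values are all hypothetical and only illustrate how a short foreground run can pass while a sustained run fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Upper limits for one device class. Field names and values are hypothetical."""
    peak_ram_mb: float
    storage_mb: float
    sustained_latency_ms: float
    sustained_power_mw: float
    max_temperature_c: float


def violations(measured: dict, budget: Budget) -> list:
    """Return the budget fields that the measurement exceeds or does not report."""
    failed = []
    for name, limit in vars(budget).items():
        value = measured.get(name)
        # A missing measurement is treated as a failure: unmeasured is not passing.
        if value is None or value > limit:
            failed.append(name)
    return failed


budget = Budget(peak_ram_mb=512, storage_mb=200, sustained_latency_ms=40,
                sustained_power_mw=1500, max_temperature_c=45)

# Short foreground run on a cool device (made-up numbers).
cool_run = {"peak_ram_mb": 430, "storage_mb": 180, "sustained_latency_ms": 28,
            "sustained_power_mw": 1200, "max_temperature_c": 38}

# Same build after a long run with other components active (made-up numbers).
sustained_run = {"peak_ram_mb": 540, "storage_mb": 180, "sustained_latency_ms": 55,
                 "sustained_power_mw": 1400, "max_temperature_c": 47}

print(violations(cool_run, budget))
print(violations(sustained_run, budget))
```

Treating an absent metric as a violation is a design choice in this sketch; it keeps a test from passing simply because a counter was never collected.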

ExecuTorch and AOT preparation

ExecuTorch exports PyTorch programs ahead of time into a portable on-device representation (a .pte file) that a compact C++ runtime executes.

Implementation

Retain export constraints, backend versions, delegated partitions, quantization recipe, memory plan, and target matrix.

Operational implications

The export pipeline is part of the release and supply-chain evidence.

Measure

Export coverage, PTE/package size, load, startup, delegate partitions, and parity.
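
One way to retain the export evidence listed above is a small manifest stored next to each artifact. The sketch below is an assumption about what such a record could look like; it does not call ExecuTorch, and `ExportManifest`, `build_manifest`, and every field value are hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ExportManifest:
    """Hypothetical release record for one exported artifact."""
    model_name: str
    artifact_sha256: str
    artifact_bytes: int
    backend_versions: dict      # e.g. {"runtime": "x.y.z"}; values are placeholders
    quantization_recipe: str
    delegated_partitions: int
    target_devices: list


def build_manifest(artifact: bytes, **fields) -> ExportManifest:
    """Derive size and digest from the artifact bytes; other fields come from the pipeline."""
    return ExportManifest(
        artifact_sha256=hashlib.sha256(artifact).hexdigest(),
        artifact_bytes=len(artifact),
        **fields,
    )


def to_json(manifest: ExportManifest) -> str:
    """Stable serialization so manifests can be diffed between releases."""
    return json.dumps(asdict(manifest), sort_keys=True, indent=2)


# Placeholder bytes stand in for the real exported file.
manifest = build_manifest(
    b"fake-pte",
    model_name="keyword-spotter",
    backend_versions={"runtime": "0.0.0"},
    quantization_recipe="int8-static",
    delegated_partitions=2,
    target_devices=["device-class-a", "device-class-b"],
)
print(to_json(manifest))
```

Sorting the keys keeps the serialized manifest deterministic, which makes it easier to review what changed between two exports.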

Delegates and partitioning

Delegates or execution providers claim the graph regions they support and run them on a GPU, NPU, DSP, or vendor runtime.

Implementation

Record final partition, transfers, layouts, fallback, and fail-closed rules.

Operational implications

The percentage of delegated nodes matters less than whether the delegated regions are large and contiguous and how much data crosses partition boundaries.

Measure

Delegate coverage, partition count, bytes transferred, fallback, latency, and power.
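
To make these metrics concrete, the sketch below summarizes a graph that has already been assigned to backends. It is a simplification under stated assumptions: the graph is a linear chain, each node passes one tensor to the next, and the `Node` type, backend labels, and sizes are hypothetical. Here "partitions" counts contiguous regions on any backend, which may differ from how a specific delegate reports it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    backend: str       # illustrative labels such as "npu" or "cpu"
    output_bytes: int  # size of the tensor handed to the next node


def summarize(nodes: list, accelerator: str = "npu") -> dict:
    """Summarize partitioning for a linear chain of backend-assigned nodes."""
    partitions = 0
    previous = None
    for node in nodes:
        if node.backend != previous:
            partitions += 1  # a new contiguous region starts here
        previous = node.backend
    transferred = sum(
        a.output_bytes
        for a, b in zip(nodes, nodes[1:])
        if a.backend != b.backend  # tensor crosses a backend boundary
    )
    delegated = sum(1 for n in nodes if n.backend == accelerator)
    return {
        "partitions": partitions,
        "delegate_coverage": delegated / len(nodes) if nodes else 0.0,
        "bytes_transferred": transferred,
        "fallback_nodes": [n.name for n in nodes if n.backend != accelerator],
    }


def check_fail_closed(summary: dict, allow_fallback: bool) -> None:
    """Refuse to run, rather than silently fall back, when the policy forbids it."""
    if summary["fallback_nodes"] and not allow_fallback:
        raise RuntimeError(f"fallback not permitted: {summary['fallback_nodes']}")


graph = [
    Node("conv", "npu", 4096),
    Node("custom_op", "cpu", 4096),  # unsupported by the hypothetical delegate
    Node("relu", "npu", 4096),
    Node("softmax", "npu", 40),
]
report = summarize(graph)
print(report)
```

In this made-up graph three of four nodes are delegated, yet the single CPU node splits the accelerator work into two regions and forces two boundary transfers, which is the effect the coverage percentage alone would hide.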

Static and dynamic memory

Static plans reuse known tensor lifetimes; dynamic buffers handle variable input and backend workspaces.

Implementation

Measure peak memory on the exact target device, including camera/audio buffers, UI, OS, and concurrent components.

Operational implications

A model that fits alone can fail inside the full application.

Measure

Peak RSS, allocator calls, fragmentation, OOM, and input-shape headroom.
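
The idea of reusing memory across known tensor lifetimes can be illustrated with a simple greedy planner. This is a sketch of one common heuristic (place largest tensors first at the lowest non-conflicting offset), not the algorithm any particular runtime uses; the `Tensor` type, sizes, and lifetimes are hypothetical, and alignment is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int       # bytes
    first_use: int  # index of the op that produces the tensor
    last_use: int   # index of the last op that reads it


def plan(tensors: list) -> tuple:
    """Greedy offset assignment. Returns (offsets by name, arena size in bytes)."""
    placed = []   # (tensor, offset) pairs already assigned
    offsets = {}
    for t in sorted(tensors, key=lambda x: x.size, reverse=True):
        # Byte ranges held by tensors whose lifetime overlaps with t.
        busy = sorted(
            (off, off + p.size)
            for p, off in placed
            if not (p.last_use < t.first_use or t.last_use < p.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # t fits in the gap before this range
            offset = max(offset, end)
        offsets[t.name] = offset
        placed.append((t, offset))
    arena = max((off + t.size for t, off in placed), default=0)
    return offsets, arena


tensors = [
    Tensor("a", 100, first_use=0, last_use=1),
    Tensor("b", 50, first_use=1, last_use=2),
    Tensor("c", 100, first_use=2, last_use=3),
]
offsets, arena = plan(tensors)
print(offsets, arena)
```

Because "a" and "c" are never live at the same time they share one region, so the planned arena is smaller than the 250 bytes a one-buffer-per-tensor layout would need. The plan covers only the model's own tensors; input buffers and the rest of the application still have to be measured on the device.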

Quantization and hardware kernels

Low-precision formats reduce model storage and can use efficient NPU/DSP kernel paths.

Implementation

Validate conversion, calibration/training, supported operations, accumulator precision, and task quality.

Operational implications

If unsupported quantized operations fall back to the CPU, power and latency can get worse rather than better.

Measure

Model size, quality, delegate kernel use, fallback, energy, and latency.
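
As a worked illustration of what conversion does to values, the sketch below implements plain affine int8 quantization of a single number. It is a generic textbook scheme chosen for illustration, not the recipe of any specific toolchain; the function names and the example range are assumptions.

```python
def quant_params(x_min: float, x_max: float, qmin: int = -128, qmax: int = 127):
    """Affine (asymmetric) parameters that map [x_min, x_max] onto [qmin, qmax]."""
    # The representable range must contain 0 so that zero maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """Round to the nearest step, shift by the zero point, then clamp to int8."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximation of the original real value."""
    return (q - zero_point) * scale


# Example range, as might be observed during calibration (made-up values).
scale, zero_point = quant_params(0.0, 2.55)
q = quantize(1.0, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))

# Values outside the calibrated range saturate instead of wrapping around.
print(quantize(5.0, scale, zero_point))
```

Saturation of out-of-range values is one reason calibration data and task-quality checks matter: a range that is too narrow clips real inputs, while a range that is too wide wastes the available integer steps.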

Real-time and thermal behavior

Some edge systems require bounded deadlines and stable performance.

Implementation

Test worst-case scheduling, input rate, competing workloads, priority, thermal steady state, and power mode.

Operational implications

Average latency does not prove deadline compliance.

Measure

p99/max latency, deadline-miss rate, jitter, temperature, clock frequency, power, and battery level.
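
The gap between average latency and deadline compliance is easy to show with a small report over recorded samples. The sketch below assumes latencies were already collected on the device under steady-state conditions; the sample values and the 33 ms deadline are made up, and jitter is taken here as the population standard deviation, which is only one of several possible definitions.

```python
import math
import statistics


def nearest_rank(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def deadline_report(latencies_ms: list, deadline_ms: float) -> dict:
    """Summarize tail latency and deadline misses for a non-empty sample list."""
    ordered = sorted(latencies_ms)
    misses = sum(1 for v in ordered if v > deadline_ms)
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": nearest_rank(ordered, 99),
        "max_ms": ordered[-1],
        "jitter_ms": statistics.pstdev(ordered),
        "deadline_miss_rate": misses / len(ordered),
    }


# 98 fast frames and 2 slow ones, e.g. during a throttling episode (made-up data).
samples = [10.0] * 98 + [50.0, 60.0]
report = deadline_report(samples, deadline_ms=33.0)
print(report)
```

In this made-up data the mean is well under the deadline while two frames in every hundred miss it, so a pass/fail decision based on the mean alone would be wrong for a hard deadline.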

Offline operation

The device must continue with local models, policy, identity cache, state, and error handling when disconnected.

Implementation

Define what can execute offline, expiry of credentials/data, queued sync, conflict resolution, and telemetry buffering.

Operational implications

Do not queue privileged actions that require current central authority.

Measure

Offline task success, queue age, sync conflicts, stale policy/model, and buffered telemetry.
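
The offline rules above can be sketched as a small admission check in front of the sync queue. This is an illustrative assumption about structure, not a reference design: `Action`, `OfflineQueue`, the status strings, and the timestamps are hypothetical, and times are plain seconds since the epoch in UTC.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    name: str
    privileged: bool   # needs a current decision from the central authority
    created_at: float  # seconds since epoch, UTC


@dataclass
class OfflineQueue:
    credential_expires_at: float
    pending: list = field(default_factory=list)

    def submit(self, action: Action, now: float, online: bool) -> str:
        """Decide whether an action is sent, queued for later sync, or refused."""
        if online:
            return "sent"
        if action.privileged:
            # Never queue work whose authorization could change while disconnected.
            return "rejected: requires central authority"
        if now >= self.credential_expires_at:
            return "rejected: credentials expired"
        self.pending.append(action)
        return "queued"

    def oldest_age(self, now: float) -> float:
        """Queue age metric: seconds since the oldest pending action was created."""
        return max((now - a.created_at for a in self.pending), default=0.0)


queue = OfflineQueue(credential_expires_at=1_000.0)
print(queue.submit(Action("log_reading", False, created_at=100.0), now=100.0, online=False))
print(queue.submit(Action("unlock_door", True, created_at=120.0), now=120.0, online=False))
print(queue.submit(Action("log_reading", False, created_at=1_200.0), now=1_200.0, online=False))
print(queue.oldest_age(now=700.0))
```

Refusing privileged actions outright, instead of queuing them, avoids replaying a decision after reconnect that the central authority would no longer approve.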

Fleet updates

Large fleets require signed staged rollout and rollback across hardware classes.

Implementation

Use device capability checks, resumable downloads, integrity verification, rollout rings, health gates, and a retained known-good model.

Operational implications

Interrupted updates must not brick inference or remove last-good artifacts.

Measure

Adoption, update failures, rollback, version age, storage pressure, and compatibility.
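
The requirement that a failed update must never remove the last-good artifact can be sketched as a three-outcome install routine. The code below is a simplified in-memory model under stated assumptions: `ModelStore` and `install` are hypothetical names, the expected digest is assumed to come from an already verified signed manifest (signature checking itself is not shown), and `health_check` stands in for whatever on-device validation the fleet uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelStore:
    active: bytes      # artifact currently used for inference
    known_good: bytes  # last artifact that passed the health gate


def install(store: ModelStore, package: bytes, expected_sha256: str,
            health_check: Callable[[bytes], bool]) -> str:
    """Verify, activate, and either promote or roll back an update package."""
    if hashlib.sha256(package).hexdigest() != expected_sha256:
        # Corrupt or truncated download: the store is left untouched.
        return "rejected: integrity check failed"
    store.active = package
    if health_check(package):
        store.known_good = package  # promote only after the health gate passes
        return "installed"
    store.active = store.known_good  # roll back to the retained artifact
    return "rolled back"


store = ModelStore(active=b"model-v1", known_good=b"model-v1")
update = b"model-v2"
digest = hashlib.sha256(update).hexdigest()

print(install(store, update[:4], digest, lambda p: True))  # interrupted download
print(install(store, update, digest, lambda p: False))     # fails health gate
print(install(store, update, digest, lambda p: True))      # healthy update
```

On a real device the two slots would live in persistent storage and the switch would need to be atomic across power loss; this sketch only shows the ordering of verification, activation, promotion, and rollback.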

Privacy and telemetry

Local sensors and personal data can be sensitive even when inference stays on device.

Implementation

Minimize collection, classify/buffer securely, provide consent, aggregate where possible, and include model/runtime version.

Operational implications

Delayed telemetry can be misinterpreted without device state and UTC event time.

Measure

Raw-data collection, upload bytes, consent, retention, redaction, and delayed event age.
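
To keep delayed telemetry interpretable, each event can carry its own UTC event time, version fields, and device state, so the backend can compute how stale it is on arrival. The following is a minimal sketch; the `TelemetryEvent` fields and example values are hypothetical, and no raw sensor data is included in the record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TelemetryEvent:
    name: str
    event_time_utc: datetime  # when the event happened on the device
    model_version: str
    runtime_version: str
    device_state: str         # illustrative values: "offline", "throttled", "normal"


def delayed_age(event: TelemetryEvent, received_at_utc: datetime) -> timedelta:
    """Time between the event occurring on the device and its arrival upstream."""
    if event.event_time_utc.tzinfo is None or received_at_utc.tzinfo is None:
        # Naive timestamps are ambiguous across time zones, so refuse them.
        raise ValueError("timestamps must be timezone-aware")
    return received_at_utc - event.event_time_utc


event = TelemetryEvent(
    name="deadline_miss",
    event_time_utc=datetime(2026, 1, 1, 0, 0, tzinfo=timezone.utc),
    model_version="model-1.4.0",
    runtime_version="runtime-0.9.2",
    device_state="offline",
)
received = datetime(2026, 1, 3, 6, 0, tzinfo=timezone.utc)
print(delayed_age(event, received))
```

Attributing the event to its own timestamp and versions, rather than to the upload time, prevents a batch of buffered events from looking like a fresh regression in whatever model version is current when they arrive.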

Reference tables

Edge constraint and response
Constraint                | Runtime response                           | Evidence
Small binary/storage      | Selective kernels/compressed assets        | Package size/dependency map
Limited RAM               | Static planning/buffer reuse/quantization  | Peak RSS/allocation trace
Power/thermal             | NPU delegation/duty cycle                  | Sustained power/temperature
Intermittent connectivity | Offline models/queued telemetry            | Offline scenario tests
Hardware diversity        | Capability discovery/fallback              | Coverage by device class
Fleet updates             | Signed staged rollout/rollback             | Version adoption/failure

Decision checklist

  1. Is offline operation required and for how long?
  2. Which accelerator is guaranteed for each device class?
  3. Is CPU fallback acceptable for latency, power, and privacy?
  4. What peak RAM and persistent storage budget exists?
  5. How will models/runtimes update and roll back?
  6. What sustained thermal and battery targets apply?
  7. How are offline traces buffered and protected?

Common mistakes

  • Benchmarking a cool device for only a few seconds.
  • Assuming delegate support from a marketing device label.
  • Allowing silent CPU fallback in a real-time route.
  • Shipping without enough RAM for activations and input buffers.
  • Using unsigned or non-rollbackable model updates.
  • Collecting raw sensor data in telemetry without policy.

Sources and further reading


  1. ExecuTorch overview
    PyTorch · Official documentation · accessed 2026-06-21 UTC

  2. ExecuTorch delegation
    PyTorch · Official documentation · accessed 2026-06-21 UTC

  3. LiteRT
    Google AI Edge · Official documentation · accessed 2026-06-21 UTC

  4. ONNX Runtime mobile
    ONNX Runtime · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
