Run inference on constrained devices with quantization, delegate selection, thermal and battery controls, offline operation, and signed updates.
Key takeaways
- Primary risks: unsupported operators, hidden CPU fallback, resource exhaustion, stale models, unsafe actions, and unobservable field failures.
- Keep authoritative domain state outside model memory.
- Measure task outcome, safe failure, and evidence, not output fluency alone.
Problem
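On-device inference must fit fixed memory, latency, power, and thermal budgets on hardware the team does not control, keep working offline, and accept model updates in the field without leaving the device in a broken state.
The failures that matter most are hard to see from the outside: a delegate silently rejects an operator and the model falls back to the CPU, resources run out, a stale model keeps serving, the model output drives an unsafe action, or the device fails in the field without reporting it.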
Why runtime layers are needed
A single model invocation cannot reliably own identity, authorization, durable state, external side effects, recovery, or evidence. The runtime therefore pairs the compile, inference, and serving path with application-level controls suited to on-device use: resource budgets, safety rules, telemetry, and updates.
Reference architecture
- Signed model package and compatibility manifest
- Portable/target-specific runtime with CPU/GPU/NPU delegates
- Static or bounded memory allocator
- Device scheduler aware of power, thermal, and foreground state
- Local policy and safe fallback
- Telemetry queue with privacy and connectivity limits
- Atomic update and rollback
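As one illustration of the device scheduler component, the sketch below gates inference on battery, temperature, and foreground state. The `DeviceState` fields, the `Decision` values, and every threshold are hypothetical; real limits depend on the platform's thermal and power APIs and on the product's requirements.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RUN = "run"
    RUN_REDUCED = "run_reduced"  # e.g. smaller model or lower invocation rate
    DEFER = "defer"


@dataclass(frozen=True)
class DeviceState:
    battery_pct: float    # 0 to 100
    charging: bool
    temperature_c: float  # whichever temperature the platform exposes
    foreground: bool


def schedule(state: DeviceState,
             hot_c: float = 45.0,
             warm_c: float = 40.0,
             low_battery_pct: float = 15.0) -> Decision:
    """Return a scheduling decision; all thresholds are illustrative assumptions."""
    # Never add load to a device that is already hot.
    if state.temperature_c >= hot_c:
        return Decision.DEFER
    # On low battery, serve the foreground user at reduced cost, defer the rest.
    if not state.charging and state.battery_pct <= low_battery_pct:
        return Decision.RUN_REDUCED if state.foreground else Decision.DEFER
    # Warm devices and background work get the reduced path.
    if state.temperature_c >= warm_c or not state.foreground:
        return Decision.RUN_REDUCED
    return Decision.RUN
```

Keeping the decision a pure function of an explicit state snapshot makes it straightforward to unit-test against recorded device conditions.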
Request flow
- Identify exact device, OS, runtime, delegate, and available resources.
- Select a compatible model/precision package.
- Validate signature, storage, and memory before activation.
- Warm critical paths without blocking device startup.
- Execute under bounded thread, memory, time, and power budgets.
- Validate output and apply deterministic safety rules before action.
- Queue minimized telemetry and synchronize when allowed.
- Activate updates atomically and roll back on health failure.
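The execute and validate steps above can be sketched as a deadline-bounded call followed by a deterministic safety envelope. This is a minimal sketch, not a production design: `run_with_deadline` and `apply_envelope` are hypothetical names, the model is assumed to return a single number, and a Python worker thread cannot be forcibly stopped, so a late result is discarded rather than cancelled.

```python
import concurrent.futures
import math

# A small shared pool bounds the number of concurrent inference threads.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def run_with_deadline(infer, inputs, deadline_s, safe_default):
    """Run infer(inputs); return (value, met_deadline).

    On timeout or error the safe default is returned instead of a late
    or missing result.
    """
    future = _POOL.submit(infer, inputs)
    try:
        return future.result(timeout=deadline_s), True
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started; a running task
        # keeps its thread until it finishes, and its result is ignored.
        future.cancel()
        return safe_default, False
    except Exception:
        return safe_default, False


def apply_envelope(value, lo, hi, safe_default):
    """Clamp a numeric output to [lo, hi]; reject non-numeric, NaN, and infinities."""
    if not isinstance(value, (int, float)) or not math.isfinite(value):
        return safe_default
    return max(lo, min(hi, value))
```

The envelope runs after the model and does not depend on it, so the bound on the action holds even when the model output is wrong.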
Contracts
- Deployment manifest declares model hash, runtime version, operator set, delegate, precision, memory bound, and device compatibility.
- Request contract defines latency/deadline, offline behavior, privacy, and hosted fallback.
- Update contract defines signature, rollout cohort, health criteria, and rollback.
Treat the runtime request, tool, policy and approval, evidence, and trace schemas as versioned interfaces: each component depends on a declared schema version, not on another component's internals.
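A deployment manifest with the fields listed above might be checked against a device before activation as follows. The class names, field names, and the exact-match rule for runtime versions are assumptions made for illustration; a real manifest format would be defined by the runtime in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentManifest:
    model_sha256: str
    runtime_version: str
    operators: frozenset      # operator names the model needs
    delegate: str             # e.g. "cpu", "gpu", "npu"
    precision: str            # e.g. "int8", "fp16"
    peak_memory_bytes: int
    device_models: frozenset  # device identifiers this package was validated on


@dataclass(frozen=True)
class DeviceProfile:
    device_model: str
    runtime_version: str
    delegates: dict           # delegate name -> frozenset of supported operators
    free_memory_bytes: int


def incompatibilities(m: DeploymentManifest, d: DeviceProfile) -> list:
    """Return human-readable reasons the package must not be activated."""
    problems = []
    if d.device_model not in m.device_models:
        problems.append(f"device {d.device_model} not in compatibility list")
    if d.runtime_version != m.runtime_version:
        problems.append(
            f"runtime {d.runtime_version} != required {m.runtime_version}")
    supported = d.delegates.get(m.delegate)
    if supported is None:
        problems.append(f"delegate {m.delegate} unavailable")
    else:
        missing = sorted(m.operators - supported)
        if missing:
            problems.append("unsupported operators: " + ", ".join(missing))
    if m.peak_memory_bytes > d.free_memory_bytes:
        problems.append("declared peak memory exceeds free memory")
    return problems
```

An empty list means the package may proceed to signature and storage checks; a non-empty list doubles as a diagnostic message.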
Failure modes
- Delegate rejects an operator
- Unexpected CPU fallback violates the deadline
- Memory pressure kills the process
- Thermal throttling degrades the control loop
- Battery policy suspends inference
- Model update is partial or incompatible
- Offline queue grows beyond retention
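The last failure mode, an offline queue that outgrows its retention limit, can be prevented by bounding the queue in both size and age and by counting what is dropped. The sketch below is in-memory only and uses hypothetical names. A persistent queue would need a wall-clock timestamp, since a monotonic clock does not survive a reboot.

```python
import time
from collections import deque


class BoundedTelemetryQueue:
    """FIFO queue that drops the oldest records when over size or age limits."""

    def __init__(self, max_items, max_age_s, clock=time.monotonic):
        self._items = deque()  # (enqueue_time, record) pairs, oldest first
        self._max_items = max_items
        self._max_age_s = max_age_s
        self._clock = clock
        self.dropped = 0       # exposed so drops are themselves observable

    def _expire(self):
        cutoff = self._clock() - self._max_age_s
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
            self.dropped += 1

    def put(self, record):
        self._expire()
        if len(self._items) >= self._max_items:
            self._items.popleft()  # make room by discarding the oldest record
            self.dropped += 1
        self._items.append((self._clock(), record))

    def drain(self, limit):
        """Remove and return up to `limit` unexpired records, oldest first."""
        self._expire()
        out = []
        while self._items and len(out) < limit:
            out.append(self._items.popleft()[1])
        return out
```

Dropping the oldest record first is one policy among several; a deployment that values early crash reports over recent ones would drop the newest instead.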
Security considerations
- Verify artifacts and updates.
- Use OS secure storage for keys and protected data.
- Limit local tools and sensors to declared purpose.
- Apply deterministic safety envelopes outside model output.
- Minimize and encrypt queued telemetry.
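As a partial illustration of artifact verification, the sketch below checks a downloaded model file against the SHA-256 digest declared in its manifest. It covers integrity only: it assumes the manifest itself has already been authenticated with an asymmetric signature, which is not shown here. The function names are hypothetical.

```python
import hashlib
import hmac


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches_manifest(path, expected_sha256_hex):
    """Compare the file digest with the manifest value in constant time."""
    actual = sha256_file(path)
    return hmac.compare_digest(actual, expected_sha256_hex.lower())
```

A mismatch should block activation and leave the currently active model in place.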
Observability
Correlate request, model route, context sources, tool operations, policy decisions, approvals, artifacts, failures, recovery, and domain outcome. Apply redaction and retention before exporting traces.
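Redaction before export can be as simple as an allowlist: only fields that are explicitly approved leave the device, and the device identifier is replaced with a salted pseudonym. The field names and the 16-character pseudonym length below are illustrative assumptions, not a prescribed trace schema.

```python
import hashlib

# Hypothetical set of fields approved for export; everything else is dropped.
EXPORT_ALLOWLIST = frozenset({
    "request_id", "model_hash", "delegate", "fallback",
    "latency_ms", "policy_decision", "outcome",
})


def redact_for_export(record: dict, salt: bytes) -> dict:
    """Keep allowlisted fields and swap device_id for a salted pseudonym."""
    exported = {k: v for k, v in record.items() if k in EXPORT_ALLOWLIST}
    if "device_id" in record:
        raw = salt + str(record["device_id"]).encode("utf-8")
        exported["device_pseudonym"] = hashlib.sha256(raw).hexdigest()[:16]
    return exported
```

An allowlist fails closed: a newly added trace field stays on the device until someone decides it is safe to export.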
Evaluation and metrics
- Deadline and worst-case latency
- Delegate/fallback rate
- Peak memory
- Battery and thermal impact
- Offline availability
- Update/rollback success
- Field crash and recovery
- Quality by device/precision
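Several of these metrics fall out of per-run records directly. The sketch below assumes each run is a dictionary with hypothetical keys `latency_ms`, `requested` (the delegate the manifest asked for), and `ran_on` (the delegate that actually executed), and uses the nearest-rank definition of a percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs, deadline_ms):
    """Summarize latency, deadline misses, and delegate fallback for one device tier."""
    latencies = [r["latency_ms"] for r in runs]
    misses = sum(1 for latency in latencies if latency > deadline_ms)
    fallbacks = sum(1 for r in runs if r["ran_on"] != r["requested"])
    return {
        "worst_ms": max(latencies),
        "p99_ms": percentile(latencies, 99),
        "deadline_miss_rate": misses / len(runs),
        "fallback_rate": fallbacks / len(runs),
    }
```

Reporting the worst case alongside the percentile matters for control loops, where a single missed deadline can be the failure.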
Implementation checklist
- Benchmark every supported device tier.
- Expose actual delegate and fallback in diagnostics.
- Test low storage, memory pressure, thermal throttling, and airplane mode.
- Define safe behavior when inference misses its deadline.
- Roll out by cohort with rollback.
- Prefer smaller deterministic models when they meet the task.
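For the cohort rollout item, one common approach is to assign each device a stable bucket by hashing its identifier with a rollout identifier, then widen the rollout percentage while a health rule decides whether to roll back. The sketch below assumes that approach. The function names and the crash-rate rule are hypothetical, and real health criteria would come from the update contract.

```python
import hashlib


def cohort_bucket(device_id: str, rollout_id: str) -> int:
    """Map a device to a stable bucket in [0, 100) for this rollout."""
    digest = hashlib.sha256(f"{rollout_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id: str, rollout_id: str, percent: int) -> bool:
    """True if the device falls inside the first `percent` buckets."""
    return cohort_bucket(device_id, rollout_id) < percent


def should_roll_back(crashes: int, sessions: int,
                     max_crash_rate: float, min_sessions: int) -> bool:
    """Roll back once there is enough data and the crash rate exceeds the limit."""
    if sessions < min_sessions:
        return False  # too little evidence to judge either way
    return crashes / sessions > max_crash_rate
```

Including the rollout identifier in the hash gives each rollout a different set of early devices, so the same users are not always first to receive a new model.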
