Heterogeneous Runtime Compilers

MLIR-style multi-level representations and portable runtimes separate high-level model semantics from target-specific lowering, enabling one model program to reach CPUs, GPUs, mobile devices, and accelerators.

Audience: Technical readers
Reading time: 2 minutes
Status: Research synthesis
Last reviewed: 2026-06-23 UTC

Heterogeneous compiler runtimes retarget model programs across CPUs, GPUs, NPUs, DSPs, and other accelerators through layered intermediate representations and hardware abstraction. The goal is portable deployment that still leaves room for target-specific optimization.

Key takeaways

Portability depends on coverage of operators, shapes, precisions, memory behavior, and synchronization, not only on file-format import.
Multi-level IR allows optimization at model, tensor, loop, and target levels.
Partitioning across devices can lose more to transfer and synchronization than it gains from acceleration.

The heterogeneity problem

AI applications increasingly run on devices that contain several processors and must be deployed across several hardware families. Hand-maintaining a separate graph and kernel set for each target lets the copies drift apart in behavior and performance. A retargetable compiler keeps a common source representation while lowering through target-aware passes.

Multi-level IR

MLIR provides an extensible infrastructure of dialects and conversions rather than one fixed IR. [ar_cite id=”mlir” label=”MLIR”] High-level tensor semantics can coexist with lower-level loop, memory, vector, and target operations. This lets optimizations happen before information is discarded.
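The value of optimizing before information is discarded can be shown with a toy model. The sketch below is not MLIR and uses none of its APIs: the `Op` class, the two level names, and both passes are hypothetical stand-ins. It assumes a rewrite that recognizes a matmul followed by an add, and shows that the rewrite fires at the tensor level but finds nothing to match once the ops have been lowered to anonymous loop nests.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    level: str   # "tensor" or "loop" (illustrative level names, not MLIR dialects)
    name: str
    attrs: dict = field(default_factory=dict)


def fuse_matmul_add(ops):
    """Tensor-level rewrite: a matmul followed directly by an add becomes one op."""
    out, i = [], 0
    while i < len(ops):
        is_pair = (
            i + 1 < len(ops)
            and ops[i].name == "matmul"
            and ops[i + 1].name == "add"
        )
        if is_pair:
            out.append(Op("tensor", "matmul_add", dict(ops[i].attrs)))
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out


def lower_to_loops(ops):
    """Lower every op to a generic loop nest; the op name survives only as a tag."""
    return [
        Op("loop", "loop_nest", {"bounds": op.attrs["bounds"], "body": op.name})
        for op in ops
    ]


program = [
    Op("tensor", "matmul", {"bounds": (64, 64, 32)}),
    Op("tensor", "add", {"bounds": (64, 64)}),
]

# Fuse while the ops are still recognizable, then lower.
fused_first = lower_to_loops(fuse_matmul_add(program))
# Lower first, then try the same rewrite: the pattern no longer matches.
lowered_first = fuse_matmul_add(lower_to_loops(program))

print(len(fused_first), len(lowered_first))
```

A real compiler can often recover fusion opportunities at the loop level, but only with extra analysis; the toy pass here simply makes the ordering dependence visible.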

Retargetable execution

IREE combines MLIR-based compilation with a portable runtime built around a hardware abstraction layer (HAL), targeting deployments from servers to constrained devices. [ar_cite id=”iree” label=”IREE”] Ahead-of-time artifacts reduce the need for compilation at run time, while target backends and dispatch metadata map work to devices.

Cross-device partitioning

Partitioning assigns operations to devices based on support, cost, memory, and locality. The compiler must insert transfers and synchronization and preserve semantics. A fallback to CPU may be correct but too slow; a split across CPU and NPU may contend for shared memory. Profile end-to-end execution and expose the actual partition plan.
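The trade-off between acceleration and transfer cost can be made concrete with a small latency model. This is a minimal sketch under stated assumptions: ops run sequentially, every device change costs a fixed transfer plus synchronization penalty, and all numbers are invented for illustration rather than measured. The function name `plan_latency_ms` and the op names are hypothetical.

```python
def plan_latency_ms(ops, assignment, transfer_ms, sync_ms):
    """Estimate sequential latency for a partition plan.

    ops:        list of (op_name, {device: compute_ms}) in execution order
    assignment: list of device names, one per op
    A fixed transfer + sync penalty is charged whenever the device changes.
    """
    total = 0.0
    previous = None
    for (name, cost_by_device), device in zip(ops, assignment):
        if device not in cost_by_device:
            raise ValueError(f"{name} is not supported on {device}")
        if previous is not None and device != previous:
            total += transfer_ms + sync_ms
        total += cost_by_device[device]
        previous = device
    return total


# Invented costs: the NPU is 4x faster on convolutions but lacks custom_norm.
ops = [
    ("conv_a", {"cpu": 4.0, "npu": 1.0}),
    ("custom_norm", {"cpu": 0.5}),
    ("conv_b", {"cpu": 4.0, "npu": 1.0}),
]

cpu_only = plan_latency_ms(ops, ["cpu", "cpu", "cpu"], transfer_ms=3.0, sync_ms=0.5)
split = plan_latency_ms(ops, ["npu", "cpu", "npu"], transfer_ms=3.0, sync_ms=0.5)

print(f"cpu only: {cpu_only} ms, split: {split} ms")
```

With these made-up numbers the split plan is slower than staying on the CPU, because two device crossings cost more than the NPU saves. The model ignores overlap, memory contention, and power, which is why it complements rather than replaces end-to-end profiling.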

Edge and embedded lowering

Constrained targets favor ahead-of-time compilation, static memory planning, bounded shapes, quantized operators, and small runtime dependencies. ExecuTorch and OpenVINO are examples of deployment-focused runtimes aimed at particular device and hardware families. [ar_cite id=”executorch” label=”ExecuTorch”] [ar_cite id=”openvino” label=”OpenVINO”]
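Static memory planning can be illustrated with a greedy first-fit arena planner. The sketch below is one simple heuristic, not the planner used by any runtime named here; it assumes each tensor's size and lifetime (first and last step that uses it) are known ahead of time, and the function name `plan_arena` is hypothetical.

```python
def plan_arena(tensors):
    """Assign each tensor a fixed offset in one preallocated arena.

    tensors: list of (name, size_bytes, first_use, last_use), where the
             use indices are inclusive execution steps.
    Returns ({name: offset}, arena_size_bytes).
    """
    placed = []   # (offset, size, first_use, last_use) of tensors already placed
    offsets = {}
    arena_size = 0
    # Placing larger tensors first is a common heuristic, not an optimal rule.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors whose lifetimes overlap this one can conflict with it.
        busy = sorted(
            (o, s) for o, s, f, l in placed if not (last < f or l < first)
        )
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break                      # the gap before this block is big enough
            offset = max(offset, o + s)    # otherwise skip past the block
        placed.append((offset, size, first, last))
        offsets[name] = offset
        arena_size = max(arena_size, offset + size)
    return offsets, arena_size


# Three activations in a chain: "a" is dead before "c" is created.
tensors = [
    ("a", 1024, 0, 1),
    ("b", 2048, 1, 2),
    ("c", 1024, 2, 3),
]
offsets, arena_size = plan_arena(tensors)
print(offsets, arena_size)
```

Because "a" and "c" never live at the same time, they share an offset, and the arena needs 3072 bytes instead of the 4096 bytes that separate allocations would take. Bounded shapes matter here: the plan is only valid if the sizes are known before deployment.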

Limits and failure modes

Frontend cannot represent a model operation or dynamic control flow.
Target backend lacks a kernel or precision.
Graph partition introduces expensive transfers.
Shape specialization creates too many artifacts.
Compiler and runtime versions produce incompatible binaries.
Numerical differences exceed application tolerance.
Fallback violates latency, power, or privacy requirements.
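Of the failure modes above, numerical drift is one of the easiest to check mechanically. The following is a minimal sketch that compares a target's outputs against reference outputs using Python's standard `math.isclose`; the function name `tolerance_violations` and the tolerance values are assumptions, and real thresholds should come from the application's accuracy requirements.

```python
import math


def tolerance_violations(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return the indices where candidate differs from reference beyond tolerance."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return [
        i
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Invented values: reference from a float32 CPU run, candidate from a quantized target.
reference = [1.0, 2.0, 0.0]
candidate = [1.0005, 2.1, 0.000001]

bad = tolerance_violations(reference, candidate)
print(bad)
```

The absolute tolerance matters for values near zero, where a purely relative check would reject any nonzero difference.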

Selection guidance

Choose based on supported model path, targets, artifact lifecycle, debugging, quantization, dynamic-shape behavior, binary size, licensing, and target-specific performance. Maintain a representative conformance suite across devices and block deployment when an unexpected fallback or partition appears.
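A deployment gate of this kind can be sketched as a comparison between the expected placement of each operation and the placement the compiler actually reports. This is an illustrative sketch only: it assumes the toolchain can export its partition plan as a mapping from operation name to device, and the function names and op names are hypothetical.

```python
def unexpected_placements(expected, actual):
    """Compare two {op_name: device} maps.

    Returns a sorted list of (op_name, expected_device, actual_device) for
    every op whose placement differs, including ops missing from either map.
    """
    diffs = []
    for op in sorted(set(expected) | set(actual)):
        want, got = expected.get(op), actual.get(op)
        if want != got:
            diffs.append((op, want, got))
    return diffs


def gate_deployment(expected, actual):
    """Raise if the reported partition plan deviates from the approved one."""
    diffs = unexpected_placements(expected, actual)
    if diffs:
        detail = ", ".join(f"{op}: {want} -> {got}" for op, want, got in diffs)
        raise RuntimeError(f"unexpected partition change: {detail}")


approved = {"conv_a": "npu", "softmax": "npu", "custom_norm": "cpu"}
reported = {"conv_a": "npu", "softmax": "cpu", "custom_norm": "cpu"}

print(unexpected_placements(approved, reported))
```

Running `gate_deployment(approved, reported)` in a release pipeline would raise here, because `softmax` silently fell back from the NPU to the CPU. Treating the approved plan as a checked-in artifact makes placement changes reviewable alongside code changes.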
