Key takeaways
- Compilation is a chain of explicit representations and contracts, not one optimization switch.
- Partitioning determines which backend owns each subgraph and where data crosses runtime boundaries.
- Shape analysis, precision, memory planning, and unsupported-operator policy must be recorded as deployment evidence.
- AOT artifacts improve predictability but embed compatibility assumptions; JIT improves specialization but adds warmup cost and cache-dependent behavior.
- Silent fallback can preserve correctness while destroying latency, capacity, or power objectives.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Framework programs or graphs, weights, input constraints, optimization settings, target descriptions, and backend capability reports.
Owns
Representation transitions, rewrite legality, partitioning, lowering, scheduling, code generation, and compile/cache diagnostics.
Emits
Rewritten graphs, backend partitions, lowered IR, generated kernels or library calls, memory plans, serialized modules, and compatibility metadata.
Does not own
Request admission, application authorization, model-quality acceptance, or deployment rollout.
Failure modes
Unsupported operations, incorrect shape assumptions, graph-break leakage, partition thrashing, precision drift, register pressure, ABI mismatch, and silent fallback.
Evidence and metrics
Compile time, graph coverage, partition count, generated code size, peak memory, kernel count, fallback share, warmup time, and numerical parity.
Frontend capture and import
The frontend converts framework behavior or a portable graph into an internal program representation. It must expose parameters, constants, control flow, side effects, shapes, types, and aliasing.
Implementation
Version the exporter and source model. Record graph breaks, unsupported host-language behavior, custom operations, and input constraints.
Operational implications
Treat capture coverage as a release gate. Test rarely used branches and dynamic behavior, not only the happy path.
Measure
Captured node/branch coverage, graph-break count, export duration, and import validation failures.
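One way to treat capture coverage as a release gate is sketched below. It is a minimal illustration, not any exporter's API: `CaptureReport`, `release_gate`, the field names, and the thresholds are all hypothetical, and the counts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class CaptureReport:
    """Hypothetical summary of one export run; field names are illustrative."""
    exporter_version: str
    total_branches: int
    captured_branches: int
    graph_breaks: list[str] = field(default_factory=list)  # one reason per break

    @property
    def branch_coverage(self) -> float:
        if self.total_branches == 0:
            return 1.0
        return self.captured_branches / self.total_branches


def release_gate(report: CaptureReport, min_coverage: float = 1.0,
                 max_graph_breaks: int = 0) -> list[str]:
    """Return the gate violations; an empty list means the gate passes."""
    violations = []
    if report.branch_coverage < min_coverage:
        violations.append(
            f"branch coverage {report.branch_coverage:.2f} < {min_coverage:.2f}")
    if len(report.graph_breaks) > max_graph_breaks:
        violations.append(
            f"{len(report.graph_breaks)} graph breaks > {max_graph_breaks}")
    return violations


report = CaptureReport(
    exporter_version="exporter-1.4.0",  # made-up version string
    total_branches=8,
    captured_branches=6,
    graph_breaks=["data-dependent Python branch"],
)
violations = release_gate(report)
for v in violations:
    print("GATE FAIL:", v)
```

Returning every violation, rather than stopping at the first, keeps the full set of reasons available as release evidence.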
Shape, type, and alias analysis
Analysis determines tensor ranks, symbolic dimensions, dtypes, layouts, lifetimes, and possible memory aliasing.
Implementation
Represent bounded dynamic dimensions and guards explicitly. Use conservative assumptions where writes or views can alias.
Operational implications
A wrong symbolic assumption can compile successfully and fail only for a rare production shape.
Measure
Shape-profile coverage, guard failures, inferred versus runtime shape mismatches, and peak memory by profile.
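Bounded dynamic dimensions and their guards can be represented explicitly, as in the following sketch. `SymDim` and `check_guards` are hypothetical names, and the bounds are illustrative; a real compiler would derive them from shape profiles or export constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymDim:
    """A symbolic dimension with an inclusive bound, e.g. batch in [1, 64]."""
    name: str
    lo: int
    hi: int

    def admits(self, value: int) -> bool:
        return self.lo <= value <= self.hi


def check_guards(spec, runtime_shape):
    """Compare a runtime shape against the compiled spec.

    `spec` mixes ints (static dims) and SymDim (bounded dynamic dims).
    Returns a list of readable guard failures; empty means accepted.
    """
    if len(spec) != len(runtime_shape):
        return [f"rank mismatch: expected {len(spec)}, got {len(runtime_shape)}"]
    failures = []
    for axis, (expected, actual) in enumerate(zip(spec, runtime_shape)):
        if isinstance(expected, SymDim):
            if not expected.admits(actual):
                failures.append(
                    f"axis {axis}: {expected.name}={actual} outside "
                    f"[{expected.lo}, {expected.hi}]")
        elif expected != actual:
            failures.append(f"axis {axis}: expected {expected}, got {actual}")
    return failures


# Assumed spec: batch and sequence length are dynamic, hidden size is static.
spec = (SymDim("batch", 1, 64), SymDim("seq", 1, 2048), 768)
ok = check_guards(spec, (8, 512, 768))      # within bounds
rare = check_guards(spec, (8, 4096, 768))   # rare production shape
print(ok, rare)
```

Counting the non-empty results per shape profile gives the guard-failure metric listed above.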
Middle-end graph transformations
Backend-independent passes canonicalize operations, fold constants, eliminate dead work, simplify algebra, fuse patterns, insert quantization transforms, and choose layouts.
Implementation
Record pass order and relevant flags. Run numerical and task-level parity after transformations that change precision or operation order.
Operational implications
More fusion is not always better: large kernels can spill registers, lower occupancy, or increase compilation time.
Measure
Node and kernel count, bytes moved, fusion groups, compilation time, numerical parity, and spills.
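Two of the passes named above, constant folding and dead-code elimination, can be sketched on a toy graph and followed by a parity check. The IR here (a dict of `name -> (op, args)` in topological order) is an assumption made for illustration and does not correspond to any real compiler's representation.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace ops whose operands are all constants with a single const node."""
    folded = {}
    for name, (op, args) in graph.items():  # graph is in topological order
        if op in OPS and all(folded[a][0] == "const" for a in args):
            values = [folded[a][1][0] for a in args]
            folded[name] = ("const", (OPS[op](*values),))
        else:
            folded[name] = (op, args)
    return folded


def eliminate_dead(graph, outputs):
    """Keep only nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op in OPS:
            stack.extend(args)
    return {n: node for n, node in graph.items() if n in live}


def evaluate(graph, output, inputs):
    """Reference interpreter used for the parity check."""
    values = {}
    for name, (op, args) in graph.items():
        if op == "const":
            values[name] = args[0]
        elif op == "input":
            values[name] = inputs[name]
        else:
            values[name] = OPS[op](*(values[a] for a in args))
    return values[output]


graph = {
    "x": ("input", ()),
    "two": ("const", (2.0,)),
    "three": ("const", (3.0,)),
    "six": ("mul", ("two", "three")),   # foldable: both operands are constants
    "unused": ("add", ("x", "two")),    # dead: does not feed the output
    "y": ("mul", ("x", "six")),
}
optimized = eliminate_dead(fold_constants(graph), outputs=["y"])

# Parity: the optimized graph must agree with the original on the same input.
before = evaluate(graph, "y", {"x": 5.0})
after = evaluate(optimized, "y", {"x": 5.0})
print(len(graph), "->", len(optimized), "nodes; parity:", before == after)
```

Exact equality is a reasonable check here because these passes do not reorder floating-point operations; passes that change precision or operation order would need a tolerance and task-level parity instead.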
Backend partitioning
The runtime or compiler assigns supported subgraphs to execution providers, delegates, vendor libraries, or custom code generators.
Implementation
Query capabilities, produce a final partition map, and define whether unsupported nodes fail compilation, route to another backend, or execute on CPU.
Operational implications
Minimize alternating partitions that force copies or synchronization. Placement must be visible in production telemetry.
Measure
Partition count, provider coverage, transfer bytes/time, unsupported nodes, and fallback share.
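The partition map and its metrics can be illustrated with a deliberately simplified sketch. It assumes a linear chain of nodes (real graphs are DAGs), and `partition`, the backend name `accel`, and the operator names are all hypothetical.

```python
def partition(nodes, capabilities, priority, on_unsupported="fail"):
    """Assign each node (in topological order) to the first capable backend.

    nodes: list of (name, op) pairs forming a linear chain.
    capabilities: backend name -> set of supported ops.
    priority: backend names, most preferred first.
    on_unsupported: "fail" raises; "cpu" routes the node to a CPU fallback.
    Returns a list of (backend, [node names]) partitions.
    """
    partitions = []
    for name, op in nodes:
        backend = next((b for b in priority if op in capabilities[b]), None)
        if backend is None:
            if on_unsupported == "fail":
                raise ValueError(f"no backend supports {op!r} (node {name})")
            backend = "cpu_fallback"
        if partitions and partitions[-1][0] == backend:
            partitions[-1][1].append(name)        # extend the current partition
        else:
            partitions.append((backend, [name]))  # new partition, new boundary
    return partitions


nodes = [("n0", "conv"), ("n1", "relu"), ("n2", "custom_nms"),
         ("n3", "matmul"), ("n4", "softmax")]
capabilities = {"accel": {"conv", "relu", "matmul", "softmax"}}
priority = ["accel"]

plan = partition(nodes, capabilities, priority, on_unsupported="cpu")
boundary_crossings = len(plan) - 1
fallback_nodes = sum(len(names) for backend, names in plan
                     if backend == "cpu_fallback")
fallback_share = fallback_nodes / len(nodes)
print(plan, boundary_crossings, fallback_share)
```

In this example a single unsupported node splits the graph into three partitions and introduces two boundary crossings, which is the alternating pattern the operational guidance warns about.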
Lowering and scheduling
High-level operations become target-oriented loops, tensor programs, library calls, or kernel IR. Schedules choose tiling, vectorization, layouts, parallel mapping, and memory stages.
Implementation
Bind the target architecture and resource limits. Use cost models or autotuning where the search cost is justified.
Operational implications
Schedules can be highly shape- and device-specific. Preserve the selected schedule and tuning evidence.
Measure
Kernel latency, occupancy, memory traffic, code size, tuning time, and run-to-run variance.
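Tiling, one of the schedule choices mentioned above, can be shown with a pure-Python matrix multiply whose loops are split into blocks. This is a sketch for intuition only: `tile_fits` and its three-tile budget rule are a hypothetical stand-in for a real resource model, and the 48 KiB scratch size is an assumed figure.

```python
def matmul_naive(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile):
    """Same computation, with the i/j/p loops split into tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # The inner loops touch one tile of a, b, and c at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def tile_fits(tile, bytes_per_elem, scratch_bytes):
    """Hypothetical rule: tiles of a, b, and c must fit in on-chip scratch."""
    return 3 * tile * tile * bytes_per_elem <= scratch_bytes


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
reference = matmul_naive(a, b)
tiled = matmul_tiled(a, b, tile=2)

# Candidate tile sizes under an assumed 48 KiB scratch budget with 4-byte elements.
candidates = [t for t in (16, 32, 64, 128) if tile_fits(t, 4, 48 * 1024)]
print(reference == tiled, candidates)
```

The integer inputs make the comparison exact. With floating-point data, tiling changes the accumulation order, so the comparison would need a tolerance.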
Code generation and linking
The backend emits native code, GPU kernels, vendor-engine plans, or another target representation and packages weights and metadata.
Implementation
Store compiler/toolchain versions, target capability, flags, libraries, ABI assumptions, and hashes in the artifact manifest.
Operational implications
Compiled artifacts should be reproducible or at least traceable to an immutable build environment.
Measure
Build duration, artifact size, deterministic hash, load compatibility, and link/runtime errors.
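A manifest of this kind can be sketched with content hashes and a canonical JSON encoding. The field names, the compiler string `examplec 3.2.1`, and the target and ABI labels are invented for illustration; only `hashlib` and `json` from the standard library are used.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifact: bytes, weights: bytes, build_info: dict) -> dict:
    """Combine content hashes with a description of the build environment."""
    manifest = {
        "artifact_sha256": sha256_hex(artifact),
        "weights_sha256": sha256_hex(weights),
        "build": build_info,
    }
    # Canonical JSON (sorted keys, fixed separators) so the manifest itself
    # hashes identically across runs.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_sha256"] = sha256_hex(canonical.encode("utf-8"))
    return manifest


def load_compatible(manifest: dict, runtime: dict) -> list[str]:
    """Return reasons the runtime cannot load the artifact; empty = compatible."""
    problems = []
    for key in ("target", "abi"):
        if manifest["build"].get(key) != runtime.get(key):
            problems.append(
                f"{key}: built for {manifest['build'].get(key)!r}, "
                f"runtime has {runtime.get(key)!r}")
    return problems


build_info = {            # all values below are made up
    "compiler": "examplec 3.2.1",
    "target": "gpu-arch-90",
    "abi": "v7",
    "flags": ["-O3"],
}
manifest = build_manifest(b"kernel-bytes", b"weight-bytes", build_info)
problems = load_compatible(manifest, {"target": "gpu-arch-90", "abi": "v8"})
print(manifest["manifest_sha256"][:12], problems)
```

Checking compatibility at load time turns an ABI mismatch into an explicit load error instead of a runtime crash.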
Memory planning
The compiler can assign tensors with non-overlapping lifetimes to shared buffers, choose layouts, and preallocate static regions.
Implementation
Model dynamic dimensions and backend boundaries. Include workspace, activation, communication, and alignment overhead.
Operational implications
A static plan reduces allocations but can fail when actual shapes or concurrency exceed assumptions.
Measure
Peak planned versus observed memory, allocation count, fragmentation, buffer reuse, and OOM rate.
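Lifetime-based buffer reuse can be sketched with a greedy planner. This is one simple strategy among many, assuming static sizes and a fixed execution order; `plan_buffers`, the tensor names, and the sizes are hypothetical.

```python
def plan_buffers(tensors, alignment=64):
    """Greedy lifetime-based buffer reuse.

    tensors: list of (name, size_bytes, first_use, last_use), with use indices
    taken from a fixed execution order. Two tensors may share a buffer only if
    their [first_use, last_use] intervals do not overlap.
    Returns (assignment, buffer_sizes).
    """
    def align(n):
        return (n + alignment - 1) // alignment * alignment

    buffers = []      # each entry: [size_bytes, last_use_of_current_tenant]
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        size = align(size)
        # Candidates: buffers whose tenant is already dead and that are big enough.
        free = [i for i, (bsize, busy_until) in enumerate(buffers)
                if busy_until < first and bsize >= size]
        if free:
            idx = min(free, key=lambda i: buffers[i][0])  # smallest that fits
        else:
            idx = len(buffers)
            buffers.append([size, -1])
        buffers[idx][1] = last
        assignment[name] = idx
    return assignment, [b[0] for b in buffers]


tensors = [
    ("act0", 1000, 0, 1),
    ("act1", 1000, 1, 2),
    ("act2", 900, 2, 3),
    ("act3", 500, 3, 4),
]
assignment, buffer_sizes = plan_buffers(tensors)
planned_peak = sum(buffer_sizes)
print(assignment, planned_peak)
```

Here four activations fit in two buffers because each tensor dies before the one that reuses its buffer is first needed. The plan is only valid for the sizes it was given, which is why a larger runtime shape or added concurrency can break it.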
Load, warmup, and readiness
The runtime loads artifacts, allocates weights/buffers, restores caches, registers kernels, and may trigger JIT, graph capture, or autotuning.
Implementation
Use versioned warmup fixtures, and do not mark an instance ready until the required models, instances, and dependencies pass their checks.
Operational implications
Never hide compilation and warmup inside the first customer request.
Measure
Load duration, warmup duration, ready time, first-request delta, cache state, and load failures.
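A readiness gate along these lines might look like the sketch below. The check names, `fake_model`, and the fixture version label `fixtures-v3` are hypothetical stand-ins; a real runtime would call its own load and warmup hooks.

```python
import time


def run_readiness(checks):
    """Run every named check; ready only if all of them pass.

    checks: list of (name, callable returning bool).
    Returns (ready, results), where results maps name -> (passed, seconds).
    """
    results = {}
    for name, check in checks:
        start = time.perf_counter()
        try:
            passed = bool(check())
        except Exception:
            passed = False   # a crashing check counts as not ready
        results[name] = (passed, time.perf_counter() - start)
    ready = all(passed for passed, _ in results.values())
    return ready, results


def fake_model(x):           # stand-in for a loaded, compiled model
    return [2 * v for v in x]


# Versioned fixtures: (input, expected output) pairs.
WARMUP_FIXTURES = {"fixtures-v3": [([1, 2], [2, 4]), ([0], [0])]}


def warmup():
    return all(fake_model(inp) == expected
               for inp, expected in WARMUP_FIXTURES["fixtures-v3"])


ready, results = run_readiness([
    ("artifact_loaded", lambda: True),
    ("warmup_fixtures", warmup),
    ("kernel_cache_restored", lambda: False),   # simulate a missing cache
])
print("ready:", ready)
for name, (passed, seconds) in results.items():
    print(f"  {name}: passed={passed} in {seconds:.6f}s")
```

All checks run even after one fails, so the recorded durations and failures cover the whole load sequence rather than only the first problem.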
Fallback and failure policy
Compilation and loading can encounter unsupported operators, shapes, precisions, or devices, as well as invalid artifacts.
Implementation
Classify each failure and define the response for it: fail closed, fall back to CPU, use an alternate backend, use an alternate model, or reject the route.
Operational implications
Silent fallback is an operational failure when it violates SLO, power, privacy, or cost requirements.
Measure
Fallback count/reason, rejected requests, CPU share, alternate-route success, and incident frequency.
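An explicit policy table makes the chosen response visible and countable. In the sketch below, the failure classes, request classes, and the table contents are hypothetical examples; the point is that anything unclassified fails closed instead of falling back silently.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    FAIL_CLOSED = "fail_closed"
    CPU_FALLBACK = "cpu_fallback"
    ALTERNATE_BACKEND = "alternate_backend"
    ALTERNATE_MODEL = "alternate_model"
    REJECT_ROUTE = "reject_route"


# Hypothetical policy table: (failure class, request class) -> action.
POLICY = {
    ("unsupported_operator", "batch"): Action.CPU_FALLBACK,
    ("unsupported_operator", "interactive"): Action.ALTERNATE_BACKEND,
    ("unsupported_shape", "interactive"): Action.REJECT_ROUTE,
    ("invalid_artifact", "batch"): Action.FAIL_CLOSED,
}

fallback_counts = Counter()   # telemetry: (failure class, action) -> count


def decide(failure: str, request_class: str) -> Action:
    """Look up the action; unclassified combinations fail closed."""
    action = POLICY.get((failure, request_class), Action.FAIL_CLOSED)
    fallback_counts[(failure, action.value)] += 1
    return action


a = decide("unsupported_operator", "batch")
b = decide("precision_drift", "interactive")   # not in the table
print(a, b, dict(fallback_counts))
```

Keying the table on request class is what lets CPU fallback be acceptable for batch work while remaining forbidden for latency-sensitive traffic.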
Reference tables
| Stage | Input | Output | Common failure | Evidence |
|---|---|---|---|---|
| Capture/import | Framework program or graph | Frontend IR | Unsupported control flow | Capture coverage and graph breaks |
| Analysis | Typed graph | Shapes, types, aliases, constraints | Wrong symbolic assumption | Profiles and guards |
| Rewrite/optimize | High-level IR | Equivalent optimized IR | Semantic or numerical drift | Pass list and parity tests |
| Partition | Graph plus capabilities | Backend subgraphs | Fragmentation/transfer overhead | Partition map and fallback nodes |
| Lower/schedule | Backend subgraph | Target IR/schedule | Resource pressure | Target, precision, schedule |
| Codegen/package | Target IR and weights | Module/artifact | ABI mismatch | Toolchain, hashes, manifest |
| Load/warmup | Artifact and runtime | Ready instance | Allocation/JIT failure | Load/warmup/ready evidence |

| Boundary | Primary responsibility | Typical lifetime | Key output |
|---|---|---|---|
| Compiler | Transform and specialize programs | Build, export, or first run | IR, generated code, artifact |
| Graph runtime | Load, optimize/partition, dispatch kernels | Model load and requests | Executed graph and telemetry |
| Inference engine | Efficient forward/token execution | Serving process | Predictions or generated tokens |
Decision checklist
- Which frontend and IR preserve the required control flow and side effects?
- What shapes and dtypes are static, symbolic, profiled, or unsupported?
- Which backend owns each partition and what transfers are introduced?
- Is fallback permitted for this request class?
- Can the artifact be reproduced from a versioned build manifest?
- What warmup work must complete before readiness?
- Which parity and capacity tests block promotion?
Common mistakes
- Describing compilation without showing representation and partition transitions.
- Assuming every unsupported operation fails loudly.
- Benchmarking generated kernels without including copies and layout conversions.
- Treating one successful shape as proof of dynamic-shape support.
- Shipping artifacts without toolchain, runtime, driver, and target metadata.
- Sending user traffic before warmup and allocation complete.
Sources and further reading
- ONNX Runtime architecture
- Execution Providers
- XLA architecture
- StableHLO specification
- ExecuTorch overview
- Bring Your Own Codegen
Last reviewed: 2026-06-21 UTC
