Key takeaways
- Eager execution maximizes flexibility and debuggability but exposes host-language and per-operation dispatch costs.
- Graph execution enables whole-program optimization, partitioning, memory planning, and reproducible artifacts.
- JIT compilation specializes against observed inputs and hardware; AOT compilation moves work into a controlled build pipeline.
- Dynamic shapes are implemented through symbolic dimensions, guards, profiles, padding, recompilation, or fallback—not one universal capability.
- Most production systems are hybrid and must expose graph breaks, cache behavior, and fallback as operational evidence.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Framework code or an exported graph, tensor shapes and dtypes, control-flow assumptions, target capabilities, and optimization policy.
Owns
Capture semantics, specialization boundaries, compilation timing, cache keys, graph-break behavior, and fallback policy.
Emits
Immediate operator calls, an optimized graph, compiled modules, guard sets, cache entries, or an executable artifact.
Does not own
Business workflow, model-quality acceptance, request authorization, or deployment rollout.
Failure modes
Graph breaks, recompilation storms, unsupported control flow, stale caches, silent fallback, and shape-dependent correctness drift.
Evidence and metrics
Capture coverage, compile time, warmup latency, cache hit rate, recompilation count, fallback share, peak memory, and steady-state latency.
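These metrics are easiest to audit when every request emits a small evidence record that can be aggregated later. The sketch below assumes a hypothetical `RequestEvidence` record with three fields. The field names, the `summarize` helper, and the sample values are invented for illustration and do not come from any particular runtime.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RequestEvidence:
    """Hypothetical per-request record emitted by the execution layer."""
    cache: str        # "hit", "miss" (cold compile), or "recompile"
    fallback: bool    # True if any part of the request ran on a fallback path
    latency_ms: float


def summarize(records):
    """Aggregate per-request evidence into the boundary's operational metrics."""
    n = len(records)
    cache_counts = Counter(r.cache for r in records)
    latencies_by_phase = {}
    for r in records:
        latencies_by_phase.setdefault(r.cache, []).append(r.latency_ms)
    return {
        "cache_hit_rate": cache_counts["hit"] / n,
        "recompilations": cache_counts["recompile"],
        "fallback_share": sum(r.fallback for r in records) / n,
        # Cold, warm, and recompilation latency are reported separately, not blended.
        "median_latency_ms": {phase: median(v) for phase, v in latencies_by_phase.items()},
    }


# Made-up sample traffic: one cold compile, one recompilation, three warm hits.
records = [
    RequestEvidence("miss", False, 900.0),
    RequestEvidence("hit", False, 10.0),
    RequestEvidence("hit", True, 12.0),
    RequestEvidence("recompile", False, 400.0),
    RequestEvidence("hit", False, 11.0),
]
print(summarize(records))
```

Keeping latency keyed by cache outcome prevents a single cold compile from distorting the steady-state number, and the fallback share stays visible even though every request returned a valid result.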
Eager and imperative execution
Operations execute as the host program reaches them. Ordinary language control flow and side effects are easy to express, but the runtime sees less of the whole program.
Implementation
Keep preprocessing, model calls, and postprocessing explicit. Profile host dispatch and synchronization rather than attributing all time to kernels.
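A minimal sketch of phase-level host timing using only the standard library. `PhaseTimer` and `eager_request` are hypothetical names. The list arithmetic stands in for device operators, and the `sync` phase stands in for a blocking device-to-host read. On a real accelerator, kernel launches are typically asynchronous, so host dispatch and synchronization wait appear as separate costs, which is why each gets its own phase here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock host time and call counts per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1


def eager_request(timer, text):
    with timer.phase("preprocess"):
        tokens = [ord(c) for c in text]
    with timer.phase("model"):
        # In eager mode the host dispatches each small op separately.
        # "model" time therefore includes the nested "dispatch" spans.
        for _ in range(3):
            with timer.phase("dispatch"):
                tokens = [t * 2 for t in tokens]
    with timer.phase("sync"):
        # Stand-in for a device-to-host copy that blocks until queued work finishes.
        total = sum(tokens)
    with timer.phase("postprocess"):
        return str(total)


timer = PhaseTimer()
for text in ["hello", "eager mode"]:
    eager_request(timer, text)
for name, secs in timer.seconds.items():
    print(f"{name:12s} calls={timer.calls[name]:3d} total={secs * 1e6:8.1f} us")
```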
Operational implications
Use eager mode for exploration, debugging, unsupported dynamic behavior, or controlled fallbacks. Avoid letting an accidental eager path become an invisible production default.
Measure
Per-op dispatch time, host CPU, synchronization count, eager fallback rate, and end-to-end latency.
Static graph execution
A captured graph makes operations, data dependencies, constants, types, and often shape constraints available before repeated execution.
Implementation
Validate graph semantics, run rewrites, partition supported subgraphs, plan buffers, and serialize a versioned artifact or execution plan.
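The rewrite and partitioning steps can be sketched on a toy graph. This assumes a straight-line graph already in topological order, and the node names, op names, and provider labels are hypothetical. Real runtimes partition general DAGs and use their own placement rules, so treat this only as an illustration of why unsupported nodes split the plan.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Node:
    name: str
    op: str


def remove_identities(nodes):
    """Toy rewrite pass: drop no-op nodes before partitioning."""
    return [n for n in nodes if n.op != "identity"]


def partition(nodes, supported_ops, accelerator="accelerator", fallback="cpu"):
    """Split a straight-line, topologically ordered graph into contiguous
    partitions, each placed on a single provider."""
    placed = [(accelerator if n.op in supported_ops else fallback, n.name) for n in nodes]
    return [
        (provider, [name for _, name in group])
        for provider, group in groupby(placed, key=lambda item: item[0])
    ]


graph = [
    Node("embed", "gather"),
    Node("id0", "identity"),
    Node("mm1", "matmul"),
    Node("act", "gelu"),
    Node("topk", "custom_topk"),
    Node("mm2", "matmul"),
]
plan = partition(remove_identities(graph), supported_ops={"gather", "matmul", "gelu"})
for provider, names in plan:
    print(provider, names)
```

One unsupported operator turns a single accelerator partition into three, adding two provider boundaries. The outputs can stay correct while latency suffers, which is why partition count and provider placement appear among the measurements.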
Operational implications
Treat graph capture coverage and unsupported nodes as release evidence. A graph that executes partly on an unintended backend can remain numerically correct while missing SLOs.
Measure
Graph node count, optimized node count, partition count, provider placement, load time, and parity results.
Tracing, scripting, and export
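Tracing records the operations executed for representative example inputs, so a data-dependent branch that those inputs never take is absent from the captured program. Scripting and export mechanisms attempt to preserve more explicit control flow and shape constraints.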
Implementation
Version representative fixtures, record which branches were captured, and include dynamic-shape constraints or guards in the artifact manifest.
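The branch-capture risk can be reproduced with a toy tracer. This is a sketch with hypothetical classes and helpers (`Tracer`, `Eager`, `trace`, `replay`), not the mechanism of any specific framework. The tracer records operations and branch labels as they execute for one fixture, so the artifact can state which branches it contains, and the branch the fixture did not take never enters the captured program.

```python
class Tracer:
    """Records operations (and branch labels) as a model runs on one example input."""

    def __init__(self):
        self.ops = []
        self.branches = []

    def branch(self, label):
        self.branches.append(label)

    def add(self, x, c):
        self.ops.append(("add", c))
        return x + c

    def mul(self, x, c):
        self.ops.append(("mul", c))
        return x * c


class Eager:
    """Executes operations immediately; nothing is recorded."""

    def branch(self, label):
        pass

    def add(self, x, c):
        return x + c

    def mul(self, x, c):
        return x * c


def model(rt, x):
    # Data-dependent control flow: the condition is evaluated on a concrete value.
    if x > 0:
        rt.branch("x > 0")
        return rt.mul(x, 2.0)
    rt.branch("x <= 0")
    return rt.add(x, -1.0)


def trace(fn, example):
    """Capture a straight-line program plus a manifest of the branches it covers."""
    tracer = Tracer()
    fn(tracer, example)
    return {"fixture": example, "branches": tracer.branches, "ops": tracer.ops}


def replay(ops, x):
    """Run the captured program; no branch survives capture."""
    for name, c in ops:
        x = x * c if name == "mul" else x + c
    return x


artifact = trace(model, 3.0)
print(artifact)  # only the "x > 0" branch was captured
print(replay(artifact["ops"], -4.0), model(Eager(), -4.0))  # -8.0 -5.0
```

The replayed artifact returns a plausible number for the negative input, which is exactly why an unrepresented branch fails silently unless the manifest and independent branch tests expose it.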
Operational implications
Re-run export when framework, model, tokenizer, or shape assumptions change. Test rarely used branches independently.
Measure
Captured branch coverage, export failures, guard failures, and unrepresented branch incidents.
Just-in-time compilation
JIT compiles at first use or when an existing specialization does not match. It can optimize for the actual device and shapes.
Implementation
Build strong cache keys from model hash, compiler/runtime version, device capability, precision, and shape constraints. Bound the number of variants.
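A minimal sketch of a composite cache key and a variant bound, assuming a hypothetical `compile_fn` that produces one compiled variant per key. The key fields mirror the list above. The version, device, and precision strings are placeholders, and least-recently-used eviction is one possible policy, not the behavior of any particular JIT.

```python
import hashlib
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class CacheKey:
    model_hash: str
    compiler_version: str
    device_capability: str
    precision: str
    shape: Tuple[int, ...]


class BoundedJitCache:
    """LRU cache of compiled variants with a hard cap on the number of variants."""

    def __init__(self, compile_fn: Callable[[CacheKey], object], max_variants: int):
        self.compile_fn = compile_fn
        self.max_variants = max_variants
        self._variants = OrderedDict()
        self.hits = 0
        self.compiles = 0
        self.evictions = 0

    def get(self, key: CacheKey):
        if key in self._variants:
            self._variants.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._variants[key]
        self.compiles += 1  # cold compile or recompilation
        compiled = self.compile_fn(key)
        self._variants[key] = compiled
        if len(self._variants) > self.max_variants:
            self._variants.popitem(last=False)  # evict least recently used variant
            self.evictions += 1
        return compiled


model_hash = hashlib.sha256(b"model weights and config").hexdigest()[:16]
cache = BoundedJitCache(compile_fn=lambda key: f"kernel{key.shape}", max_variants=2)
for shape in [(1, 128), (1, 128), (8, 128), (1, 256), (1, 128)]:
    key = CacheKey(model_hash, "compiler-1.0", "device-cap-X", "fp16", shape)
    cache.get(key)
print(cache.compiles, cache.hits, cache.evictions)  # 4 1 2
```

The last request recompiles a shape that was already seen because the cap evicted it. The bound protects memory, and the cost shows up as recompilation churn, which is why variants per model and recompilations are measured together.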
Operational implications
Separate cold, warm, and recompilation latency. Production nodes that compile code require security, storage, and cache-eviction controls.
Measure
Cold compile time, JIT cache hit rate, variants per model, recompilations per request class, and warm Goodput.
Ahead-of-time compilation
AOT creates target-ready artifacts before deployment. It can reduce target footprint and make startup, signing, and reproducibility more predictable.
Implementation
Store toolchain, target, ABI, precision, shape profiles, and artifact hashes. Build a compatibility matrix for runtime, driver, and hardware.
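The manifest and compatibility check can be sketched as plain data plus one validation function. All names here (`ArtifactManifest`, `HostEnvironment`, `compatibility_problems`) and every field value are hypothetical placeholders. A real compatibility matrix would come from the toolchain and hardware vendors rather than from hand-written strings.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class ArtifactManifest:
    toolchain: str
    target: str
    abi: str
    precision: str
    shape_profiles: Tuple[Tuple[int, int], ...]  # (min_len, max_len) per profile
    supported_drivers: FrozenSet[str]
    artifact_sha256: str


@dataclass(frozen=True)
class HostEnvironment:
    target: str
    abi: str
    driver: str
    precisions: FrozenSet[str]


def compatibility_problems(m: ArtifactManifest, host: HostEnvironment, blob: bytes) -> List[str]:
    """Return every reason the artifact should not load on this host (empty list = OK)."""
    problems = []
    if hashlib.sha256(blob).hexdigest() != m.artifact_sha256:
        problems.append("artifact hash mismatch")
    if m.target != host.target:
        problems.append(f"target {m.target} != host {host.target}")
    if m.abi != host.abi:
        problems.append(f"ABI {m.abi} != host {host.abi}")
    if host.driver not in m.supported_drivers:
        problems.append(f"driver {host.driver} not in compatibility matrix")
    if m.precision not in host.precisions:
        problems.append(f"precision {m.precision} unsupported on host")
    return problems


blob = b"compiled artifact bytes"
manifest = ArtifactManifest(
    toolchain="example-compiler 2.3",
    target="example-target",
    abi="abi-v1",
    precision="int8",
    shape_profiles=((1, 128), (129, 512)),
    supported_drivers=frozenset({"driver-A", "driver-B"}),
    artifact_sha256=hashlib.sha256(blob).hexdigest(),
)
ok_host = HostEnvironment("example-target", "abi-v1", "driver-B", frozenset({"fp16", "int8"}))
old_host = HostEnvironment("example-target", "abi-v1", "driver-C", frozenset({"fp16"}))
print(compatibility_problems(manifest, ok_host, blob))   # []
print(compatibility_problems(manifest, old_host, blob))
```

Returning every problem rather than stopping at the first one makes compatibility failures easier to count and triage across a fleet.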
Operational implications
Use AOT where cold starts, edge footprint, regulated build provenance, or offline deployment matter more than late specialization.
Measure
Build duration, artifact size, load/warmup time, compatibility failures, and reproducibility hashes.
Dynamic shapes, guards, and profiles
Dynamic-shape support ranges from bounded symbolic dimensions to multiple static profiles or generalized kernels.
Implementation
Model the production shape distribution, define accepted ranges, and decide whether out-of-profile inputs pad, recompile, route, or fail.
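A minimal sketch of profile selection with an explicit out-of-profile policy. It assumes sequence length is the only dynamic dimension and that each hypothetical profile is compiled for a fixed upper bound, so in-profile inputs are padded up to that bound. The profile names and bounds are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    PAD = "pad"              # run the precompiled profile after padding up to its bound
    RECOMPILE = "recompile"  # build a new specialization for this shape
    ROUTE = "route"          # send to another pool or an eager fallback
    FAIL = "fail"            # reject the request


@dataclass(frozen=True)
class Profile:
    name: str
    max_len: int  # inputs up to this length are padded to exactly max_len


PROFILES = (Profile("short", 128), Profile("medium", 512), Profile("long", 2048))


def dispatch(seq_len: int, profiles=PROFILES,
             out_of_profile: Action = Action.ROUTE) -> Tuple[Action, Optional[str], int]:
    """Pick the smallest profile that fits; return (action, profile name, padding tokens)."""
    if seq_len < 1:
        return Action.FAIL, None, 0
    for prof in sorted(profiles, key=lambda pr: pr.max_len):
        if seq_len <= prof.max_len:
            return Action.PAD, prof.name, prof.max_len - seq_len
    return out_of_profile, None, 0


for n in (100, 513, 4096, 0):
    print(n, dispatch(n))
```

A 513-token input lands in the 2048 profile with 1,535 padding positions, roughly three quarters of the compute spent on padding. That is the kind of overhead the padding and peak-memory-by-profile measurements should surface.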
Operational implications
Alert on guard failures and compilation churn. Test worst-case memory, not only common shapes.
Measure
Shape-profile coverage, guard failure rate, padding overhead, compilation churn, and peak memory by profile.
Hybrid execution
Production systems commonly compile a stable model region while retaining dynamic routing, retrieval, tools, or unsupported operators outside it.
Implementation
Define every boundary, data transfer, synchronization point, and fallback. Keep product workflow separate from model-execution optimization.
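Declaring each boundary and recording the path per request can be sketched as a small pipeline runner. The `Stage`, `RequestTrace`, and `Unsupported` names are hypothetical, and the "compiled" region here is ordinary Python that refuses long inputs. The sketch only shows how boundary crossings and fallbacks become per-request evidence.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class Unsupported(Exception):
    """Raised by a (hypothetical) compiled region for inputs outside its supported set."""


@dataclass
class Stage:
    name: str
    fn: Callable
    compiled: bool
    fallback: Optional[Callable] = None  # eager path used if the compiled region refuses


@dataclass
class RequestTrace:
    path: List[Tuple[str, str]] = field(default_factory=list)  # (stage, "compiled" | "eager" | "fallback")
    boundary_crossings: int = 0  # compiled <-> eager transitions, each implying transfer or sync


def run(stages: List[Stage], x):
    trace = RequestTrace()
    prev_region = None
    for stage in stages:
        region = "compiled" if stage.compiled else "eager"
        label = region
        try:
            y = stage.fn(x)
        except Unsupported:
            if stage.fallback is None:
                raise  # no declared fallback: fail loudly instead of silently
            y = stage.fallback(x)
            region, label = "eager", "fallback"
        if prev_region is not None and region != prev_region:
            trace.boundary_crossings += 1
        prev_region = region
        trace.path.append((stage.name, label))
        x = y
    return x, trace


def compiled_sum(xs):
    # Toy compiled region: only handles inputs of length <= 4.
    if len(xs) > 4:
        raise Unsupported(f"length {len(xs)}")
    return sum(xs)


pipeline = [
    Stage("preprocess", lambda s: [ord(c) % 7 for c in s], compiled=False),
    Stage("model", compiled_sum, compiled=True, fallback=sum),
    Stage("postprocess", lambda v: v * 2, compiled=False),
]
for text in ("abc", "abcdef"):
    result, request_trace = run(pipeline, text)
    print(text, result, request_trace.path, request_trace.boundary_crossings)
```

The long input reports zero boundary crossings yet has silently lost the compiled hot path. Only the recorded `("model", "fallback")` entry reveals it, which is the per-request tracing the operational guidance asks for.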
Operational implications
Prefer explicit hybrid architecture over claims that an entire application is compiled. Trace which path each request used.
Measure
Compiled-region coverage, boundary transfer time, synchronization, fallback, and end-to-end Goodput.
Reference tables
| Model | Compilation point | Flexibility | Primary strength | Primary risk |
|---|---|---|---|---|
| Eager / imperative | None or per operation | Highest | Debugging and dynamic behavior | Dispatch overhead and weak global optimization |
| Static graph | Export or load | Moderate | Whole-graph rewrites and reproducibility | Unsupported dynamic behavior |
| JIT | First use or specialization | High behind guards | Device- and shape-specific code | Warmup and recompilation storms |
| AOT | Build/package time | Lowest | Predictable startup and compact targets | Compatibility matrix and reduced flexibility |
| Dataflow / streaming | Varies | Pipeline-oriented | Continuous asynchronous inputs | Backpressure and state coordination |
| Hybrid | Multiple stages | Balanced | Compiled hot path with dynamic boundaries | Boundary and fallback complexity |
Decision checklist
- Which inputs, shapes, dtypes, and branches occur in production?
- Where may graph breaks occur and what synchronization do they introduce?
- Can production nodes compile code or must artifacts be built and signed elsewhere?
- How many compiled variants can be cached safely?
- What happens when an operator, shape, or device is unsupported?
- How are warmup, cache misses, and recompilation represented in SLOs?
Common mistakes
- Calling a framework “compiled” without identifying capture coverage and fallback behavior.
- Benchmarking only one warmed shape while production traffic has broad variability.
- Treating dynamic-shape support as unlimited instead of documenting ranges and guards.
- Shipping AOT artifacts without runtime, driver, precision, and target metadata.
- Hiding CPU or eager fallback because outputs remain numerically valid.
Sources and further reading
- torch.compile
- torch.export
- ONNX Runtime high-level design
- ExecuTorch overview
- StableHLO specification
Last reviewed: 2026-06-21 UTC
