Key takeaways
- A training checkpoint is not necessarily a deployable model, and an interchange graph is not necessarily an executable artifact.
- Compiler IRs preserve operations and constraints for transformation; target artifacts embed backend and hardware assumptions.
- Tokenizers, preprocessing, chat templates, quantization metadata, licenses, and provenance are part of the deployment package.
- Conversion can change operator semantics, dynamic-shape behavior, precision, and control flow.
- Artifacts require immutable identity, integrity checks, compatibility metadata, parity tests, and rollback.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Framework modules, weights, tokenizers, preprocessors, shape constraints, quantization metadata, and target requirements.
Owns
Serialization schema, operator set, metadata conventions, compatibility versioning, and artifact packaging.
Emits
Portable graphs, compiler IR, packaged weights, backend engines, edge programs, manifests, and hashes.
Does not own
Serving APIs, rollout, request policy, or proof of output quality after conversion.
Failure modes
Missing operators, semantic mismatch, stale tokenizer, incompatible opset, corrupt weights, silent precision changes, and unreproducible builds.
Evidence and metrics
Artifact size, load time, conversion coverage, unsupported nodes, parity error, reproducibility hash, and compatibility results.
Interchange formats
Interchange formats move model graphs and tensors among frameworks, runtimes, and vendor tools.
Implementation
Record format/opset version, model source revision, dynamic-shape constraints, external data files, and converter logs.
Operational implications
Treat successful serialization as the start of compatibility testing, not proof of deployability.
Measure
Converter coverage, unsupported nodes, artifact size, load success, and parity.
Compiler intermediate representations
IRs represent programs while analysis, rewriting, partitioning, lowering, and code generation occur.
Implementation
Identify the dialect or level, type and shape semantics, custom operations, and version compatibility.
Operational implications
MLIR is an infrastructure with dialects, not one universal model file. StableHLO is a versioned operation set for compiler interchange.
Measure
Pass success, verification errors, IR size, shape constraints, and lowering coverage.
Target-specific executable artifacts
Engines and compiled modules embed backend, precision, shape, and hardware assumptions.
Implementation
Store toolchain/runtime/driver versions, target capability, profiles, build flags, hashes, and model lineage.
Operational implications
Regenerate artifacts through a reproducible pipeline and preserve known-good packages for rollback.
Measure
Build time, load time, compatibility pass rate, artifact size, and warmup.
Weight-oriented local packages
Local formats such as GGUF package model tensors and metadata for compatible engines and quantized execution.
Implementation
Bind exact model architecture, tokenizer, quantization method, source checkpoint, license, and hash.
Operational implications
Nominal bit width does not define quality or kernel behavior. Test the exact package in the target engine.
Measure
Disk/RAM use, mmap behavior, load time, quality, and token performance.
Auxiliary assets
Tokenizers, vocabulary, image/audio preprocessors, special tokens, templates, and adapters affect runtime behavior.
Implementation
Version them with the model and include them in artifact integrity and rollback.
Operational implications
A stale tokenizer or chat template can change output while the weights remain identical.
Measure
Asset hash mismatches, tokenization parity, preprocessing parity, and invalid structured output.
Conversion and parity
Conversion maps operations, constants, layouts, shapes, and precision into another representation.
Implementation
Compare representative outputs, structured behavior, tokenizer/preprocessing, and edge cases within precision-appropriate tolerances.
Operational implications
Preserve conversion commands and logs. Reject silent unsupported-node fallback unless the policy explicitly permits it.
Measure
Numerical/task parity, unsupported-operation count, inserted casts, and fallback share.
Provenance and integrity
Model artifacts are supply-chain inputs with potentially executable behavior.
Implementation
Use immutable URIs, content hashes or signatures, source/derived lineage, build metadata, license, and promotion state.
Operational implications
Fail closed on unexpected hashes or schema versions. Limit who can publish to production registries.
Measure
Signature/checksum failures, provenance completeness, promotion history, and deletion/rollback success.
Reference tables
| Format or artifact | Primary role | Typical consumer | Runtime implication |
|---|---|---|---|
| ONNX | Portable model graph | ONNX Runtime, compilers, vendor tools | Opset and provider coverage matter |
| StableHLO | Portable compiler operation set | OpenXLA-compatible compilers | Versioned compiler interchange |
| MLIR dialects | Multi-level compiler representation | Compiler transformations/backends | Not one deployable model format |
| torch.export program | Captured PyTorch graph | AOT/edge/backend pipelines | Explicit constraints and graph semantics |
| TensorRT engine | Target-specific plan | TensorRT runtime | Compatibility tied to target/runtime |
| ExecuTorch PTE | On-device program | ExecuTorch runtime | Portable and delegated partitions |
| GGUF | Local model package | llama.cpp-compatible engines | Weights plus model metadata and quantization |
Decision checklist
- Is this artifact for interchange, optimization, or direct execution?
- Which operator, shape, control-flow, and precision semantics must be preserved?
- Are tokenizer and preprocessing versions bound to the model?
- How is numerical and behavioral parity tested?
- Which runtime, backend, driver, and hardware versions are compatible?
- How are artifacts signed, licensed, rolled back, and deleted?
Common mistakes
- Calling every serialized model an IR.
- Shipping weights without tokenizer or preprocessing provenance.
- Treating conversion success as proof of equivalent behavior.
- Rebuilding target engines without preserving build parameters.
- Loading mutable or unsigned artifact URLs in production.
Sources and further reading
-
ONNX introduction and specification
(opens in a new tab)
-
StableHLO specification
(opens in a new tab)
-
MLIR documentation
(opens in a new tab)
-
torch.export
(opens in a new tab)
-
ExecuTorch getting started
(opens in a new tab)
-
GGUF format
(opens in a new tab)
Last reviewed: 2026-06-21 UTC
