Model Formats and Intermediate Representations

Key takeaways

A training checkpoint is not necessarily a deployable model, and an interchange graph is not necessarily an executable artifact.
Compiler IRs preserve operations and constraints for transformation; target artifacts embed backend and hardware assumptions.
Tokenizers, preprocessing, chat templates, quantization metadata, licenses, and provenance are part of the deployment package.
Conversion can change operator semantics, dynamic-shape behavior, precision, and control flow.
Artifacts require immutable identity, integrity checks, compatibility metadata, parity tests, and rollback.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Framework modules, weights, tokenizers, preprocessors, shape constraints, quantization metadata, and target requirements.

Owns

Serialization schema, operator set, metadata conventions, compatibility versioning, and artifact packaging.

Emits

Portable graphs, compiler IR, packaged weights, backend engines, edge programs, manifests, and hashes.

Does not own

Serving APIs, rollout, request policy, or proof of output quality after conversion.

Failure modes

Missing operators, semantic mismatch, stale tokenizer, incompatible opset, corrupt weights, silent precision changes, and unreproducible builds.

Evidence and metrics

Artifact size, load time, conversion coverage, unsupported nodes, parity error, reproducibility hash, and compatibility results.

Interchange formats

Interchange formats move model graphs and tensors among frameworks, runtimes, and vendor tools.

Implementation

Record format/opset version, model source revision, dynamic-shape constraints, external data files, and converter logs.
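One way to keep these fields together is a small manifest written next to the exported file. The sketch below is illustrative only: the `InterchangeRecord` class, its field names, and the example values are assumptions rather than part of any interchange standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class InterchangeRecord:
    """Hypothetical manifest entry for one exported interchange artifact."""
    format_name: str
    opset_version: int
    source_revision: str
    dynamic_shape_constraints: dict = field(default_factory=dict)
    external_data_files: tuple = ()
    converter_log: str = ""


def to_manifest_json(record: InterchangeRecord) -> str:
    # sort_keys gives a stable layout, so the manifest text can itself be hashed.
    return json.dumps(asdict(record), sort_keys=True, indent=2)


# Example values are made up for illustration.
record = InterchangeRecord(
    format_name="onnx",
    opset_version=17,
    source_revision="example-revision-id",
    dynamic_shape_constraints={"input_ids": {"batch": [1, 8], "sequence": [1, 2048]}},
    external_data_files=("model.onnx.data",),
    converter_log="logs/export.log",
)
print(to_manifest_json(record))
```

Listing external data files explicitly matters because a graph file can load successfully in one location and fail in another when its side files are not copied along with it.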

Operational implications

Treat successful serialization as the start of compatibility testing, not proof of deployability.

Measure

Converter coverage, unsupported nodes, artifact size, load success, and parity.

Compiler intermediate representations

Compiler IRs represent a program at each stage of analysis, rewriting, partitioning, lowering, and code generation.

Implementation

Identify the dialect or level, type and shape semantics, custom operations, and version compatibility.

Operational implications

MLIR is compiler infrastructure organized around dialects, not a single universal model file format. StableHLO is a versioned operation set for compiler interchange.

Measure

Pass success, verification errors, IR size, shape constraints, and lowering coverage.

Target-specific executable artifacts

Engines and compiled modules embed backend, precision, shape, and hardware assumptions.

Implementation

Store toolchain/runtime/driver versions, target capability, profiles, build flags, hashes, and model lineage.
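Stored build metadata becomes useful when a loader compares it with the host before loading the artifact. The following is a minimal sketch under the assumption that exact equality is the compatibility rule; real backends may allow version ranges. The function name, key names, and values are hypothetical.

```python
def find_mismatches(built_for: dict, host: dict, keys: tuple) -> list:
    """Return human-readable mismatches between build-time and host values."""
    problems = []
    for key in keys:
        expected = built_for.get(key)
        actual = host.get(key)
        if expected is None or actual is None:
            # Missing metadata is treated as a failure, not as a wildcard.
            problems.append(f"{key}: missing (built_for={expected!r}, host={actual!r})")
        elif expected != actual:
            problems.append(f"{key}: built for {expected!r}, host has {actual!r}")
    return problems


# Made-up metadata for illustration.
built_for = {"runtime_version": "1.4.0", "driver_version": "12.2", "target_capability": "arch-a"}
host = {"runtime_version": "1.4.0", "driver_version": "12.4", "target_capability": "arch-a"}

problems = find_mismatches(built_for, host, ("runtime_version", "driver_version", "target_capability"))
for line in problems:
    print(line)
```

An empty list means the host matches every recorded assumption; anything else is a reason to rebuild for the host or fall back to a known-good package.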

Operational implications

Regenerate artifacts through a reproducible pipeline and preserve known-good packages for rollback.

Measure

Build time, load time, compatibility pass rate, artifact size, and warmup time.

Weight-oriented local packages

Local formats such as GGUF package model tensors and metadata for compatible engines and quantized execution.

Implementation

Bind exact model architecture, tokenizer, quantization method, source checkpoint, license, and hash.
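Before binding metadata to a package, it can help to confirm that the file's container header is what it claims to be. The sketch below reads only the fixed-size GGUF header and assumes the little-endian layout used by format versions 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count). It does not parse the metadata itself, and the sample bytes are synthetic.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed layout for GGUF v2+: magic, version, tensor count, metadata KV count.
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> dict:
    if len(data) < HEADER.size:
        raise ValueError("too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    if version < 2:
        # Version 1 used narrower count fields, so this layout does not apply.
        raise ValueError("unsupported GGUF version for this sketch")
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Synthetic header bytes; a real check would read the first 24 bytes of the file.
sample = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(sample)
print(header)
```

A header check only shows that the container is readable. Architecture, tokenizer, and quantization details live in the metadata entries and still need to be compared against the recorded source checkpoint.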

Operational implications

Nominal bit width does not define quality or kernel behavior. Test the exact package in the target engine.

Measure

Disk/RAM use, mmap behavior, load time, output quality, and token throughput.

Auxiliary assets

Tokenizers, vocabulary, image/audio preprocessors, special tokens, templates, and adapters affect runtime behavior.

Implementation

Version them with the model and include them in artifact integrity and rollback.
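A simple way to include auxiliary assets in artifact integrity is to derive one digest over every file in the package, so that a change to any asset changes the package identity. This is a sketch, not a standard scheme: the `bundle_digest` helper, the asset names, and the byte contents are hypothetical.

```python
import hashlib


def bundle_digest(assets: dict) -> str:
    """Combine per-asset SHA-256 digests into one order-independent digest."""
    outer = hashlib.sha256()
    for name in sorted(assets):  # sorting makes the result independent of dict order
        inner = hashlib.sha256(assets[name]).hexdigest()
        outer.update(f"{name}\0{inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Placeholder contents; in practice these would be the bytes of the real files.
package = {
    "weights.bin": b"\x00\x01\x02\x03",
    "tokenizer.json": b'{"vocab": ["a", "b"]}',
    "chat_template.txt": b"<user>{prompt}</user>",
}
original = bundle_digest(package)

# Changing only the chat template yields a different package identity.
changed = dict(package, **{"chat_template.txt": b"[user]{prompt}[/user]"})
print(original != bundle_digest(changed))
```

Rolling back then means restoring the whole package that matches a recorded digest, not the weights alone.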

Operational implications

A stale tokenizer or chat template can change output while the weights remain identical.

Measure

Asset hash mismatches, tokenization parity, preprocessing parity, and invalid structured output.

Conversion and parity

Conversion maps operations, constants, layouts, shapes, and precision into another representation.

Implementation

Compare representative outputs, structured behavior, tokenizer/preprocessing, and edge cases within precision-appropriate tolerances.
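A numerical parity check can be sketched as an elementwise comparison with a combined relative and absolute tolerance chosen per precision. The tolerance values below are assumptions for illustration, not recommendations, and the output vectors are made up. Task-level and structured-output parity need separate checks.

```python
# Assumed (rtol, atol) pairs per precision; tune these against your own model.
TOLERANCES = {"fp32": (1e-5, 1e-6), "fp16": (1e-2, 1e-3)}


def max_abs_error(reference, candidate):
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, precision):
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    rtol, atol = TOLERANCES[precision]
    # Each element must satisfy |ref - cand| <= atol + rtol * |ref|.
    return all(abs(r - c) <= atol + rtol * abs(r) for r, c in zip(reference, candidate))


# Made-up outputs from the source model and the converted artifact.
reference = [1.0, -2.0, 0.5]
candidate = [1.0005, -2.001, 0.5]

print(max_abs_error(reference, candidate))
print(within_tolerance(reference, candidate, "fp16"))
print(within_tolerance(reference, candidate, "fp32"))
```

The same pair of outputs can pass under a reduced-precision tolerance and fail under a full-precision one, which is why the tolerance should follow the precision the artifact was actually built with.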

Operational implications

Preserve conversion commands and logs. Reject silent unsupported-node fallback unless the policy explicitly permits it.
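An explicit fallback policy can be expressed as a gate over a per-node conversion report. The sketch below assumes a hypothetical report that maps node names to one of three statuses ("native", "fallback", "unsupported"); real converters expose this information in their own formats, if at all.

```python
from collections import Counter


def conversion_report(node_status: dict) -> dict:
    counts = Counter(node_status.values())
    total = len(node_status)
    return {
        "unsupported": counts["unsupported"],
        "fallback_share": counts["fallback"] / total if total else 0.0,
    }


def accept(report: dict, max_fallback_share: float = 0.0) -> bool:
    # The default policy permits no fallback; a caller must opt in explicitly.
    return report["unsupported"] == 0 and report["fallback_share"] <= max_fallback_share


# Made-up node names and statuses.
status = {
    "matmul_0": "native",
    "softmax_0": "native",
    "custom_rope": "fallback",
    "layernorm_0": "native",
}
report = conversion_report(status)
print(report)
print(accept(report))
print(accept(report, max_fallback_share=0.5))
```

Making zero the default means fallback is accepted only when someone has written the permitted share into the policy.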

Measure

Numerical/task parity, unsupported-operation count, inserted casts, and fallback share.

Provenance and integrity

Model artifacts are supply-chain inputs with potentially executable behavior.

Implementation

Use immutable URIs, content hashes or signatures, source/derived lineage, build metadata, license, and promotion state.

Operational implications

Fail closed on unexpected hashes or schema versions. Limit who can publish to production registries.
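Failing closed can be sketched as a loader step that raises before anything is deserialized. The example below uses SHA-256 from the Python standard library; the `IntegrityError` type, the `verify_artifact` function, and the set of supported schema versions are assumptions for illustration. Signature verification would be an additional step.

```python
import hashlib
import hmac


class IntegrityError(Exception):
    """Raised when an artifact must not be loaded."""


def verify_artifact(payload: bytes, expected_sha256: str, schema_version: int,
                    supported_schemas=frozenset({1})) -> str:
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise IntegrityError("content hash does not match the registry record")
    if schema_version not in supported_schemas:
        raise IntegrityError(f"unsupported schema version: {schema_version}")
    return actual


# Placeholder bytes standing in for a packaged artifact.
payload = b"example artifact bytes"
recorded = hashlib.sha256(payload).hexdigest()

print(verify_artifact(payload, recorded, schema_version=1) == recorded)
try:
    verify_artifact(payload + b"tampered", recorded, schema_version=1)
except IntegrityError as err:
    print("refused:", err)
```

The function returns only on success, so a caller cannot proceed to load an artifact by ignoring a boolean result.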

Measure

Signature/checksum failures, provenance completeness, promotion history, and deletion/rollback success.

Reference tables

Format and artifact roles
Format or artifact	Primary role	Typical consumer	Runtime implication
ONNX	Portable model graph	ONNX Runtime, compilers, vendor tools	Opset and provider coverage matter
StableHLO	Portable compiler operation set	OpenXLA-compatible compilers	Versioned compiler interchange
MLIR dialects	Multi-level compiler representation	Compiler transformations/backends	Not one deployable model format
torch.export program	Captured PyTorch graph	AOT/edge/backend pipelines	Explicit constraints and graph semantics
TensorRT engine	Target-specific plan	TensorRT runtime	Compatibility tied to target/runtime
ExecuTorch PTE	On-device program	ExecuTorch runtime	Portable and delegated partitions
GGUF	Local model package	llama.cpp-compatible engines	Weights plus model metadata and quantization

Decision checklist

Is this artifact for interchange, optimization, or direct execution?
Which operator, shape, control-flow, and precision semantics must be preserved?
Are tokenizer and preprocessing versions bound to the model?
How is numerical and behavioral parity tested?
Which runtime, backend, driver, and hardware versions are compatible?
How are artifacts signed, licensed, rolled back, and deleted?

Common mistakes

Calling every serialized model an IR.
Shipping weights without tokenizer or preprocessing provenance.
Treating conversion success as proof of equivalent behavior.
Rebuilding target engines without preserving build parameters.
Loading mutable or unsigned artifact URLs in production.

Sources and further reading

ONNX introduction and specification
ONNX · Official specification · accessed 2026-06-21 UTC

StableHLO specification
OpenXLA · Official specification · accessed 2026-06-21 UTC

MLIR documentation
LLVM Project · Official documentation · accessed 2026-06-21 UTC

torch.export
PyTorch · Official documentation · accessed 2026-06-21 UTC

ExecuTorch getting started
PyTorch · Official documentation · accessed 2026-06-21 UTC

GGUF format
ggml-org · Official repository documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
