Foundations

Model Formats and Intermediate Representations

Compare ONNX, StableHLO, MLIR dialects, exported PyTorch programs, GGUF, TensorRT engines, ExecuTorch PTE files, and auxiliary model assets by role and runtime implications.

Audience: Technical readers · Reading time: 5 minutes · Status: Foundational · Last reviewed: 2026-06-21 UTC

Key takeaways

  • A training checkpoint is not necessarily a deployable model, and an interchange graph is not necessarily an executable artifact.
  • Compiler IRs preserve operations and constraints for transformation; target artifacts embed backend and hardware assumptions.
  • Tokenizers, preprocessing, chat templates, quantization metadata, licenses, and provenance are part of the deployment package.
  • Conversion can change operator semantics, dynamic-shape behavior, precision, and control flow.
  • Artifacts require immutable identity, integrity checks, compatibility metadata, parity tests, and rollback.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Framework modules, weights, tokenizers, preprocessors, shape constraints, quantization metadata, and target requirements.

Owns

Serialization schema, operator set, metadata conventions, compatibility versioning, and artifact packaging.

Emits

Portable graphs, compiler IR, packaged weights, backend engines, edge programs, manifests, and hashes.

Does not own

Serving APIs, rollout, request policy, or proof of output quality after conversion.

Failure modes

Missing operators, semantic mismatch, stale tokenizer, incompatible opset, corrupt weights, silent precision changes, and unreproducible builds.

Evidence and metrics

Artifact size, load time, conversion coverage, unsupported nodes, parity error, reproducibility hash, and compatibility results.

Interchange formats

Interchange formats move model graphs and tensors among frameworks, runtimes, and vendor tools.

Implementation

Record format/opset version, model source revision, dynamic-shape constraints, external data files, and converter logs.
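
The fields listed above could be captured in a small sidecar record stored next to the exported graph. The following is a minimal sketch only: the class name `InterchangeRecord`, its field names, and every example value are hypothetical and not part of any format's specification.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class InterchangeRecord:
    """Hypothetical sidecar record for one exported interchange artifact."""

    format_name: str                  # e.g. "onnx"
    opset_version: int                # operator-set version the graph targets
    source_revision: str              # revision of the model source that was exported
    dynamic_axes: dict = field(default_factory=dict)  # input name -> symbolic dims
    external_data_files: tuple = ()   # weight files stored outside the graph file
    converter_log: str = ""           # path to the saved converter log

    def to_json(self) -> str:
        # sort_keys gives a stable text form that can be diffed or hashed.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Illustrative values only.
record = InterchangeRecord(
    format_name="onnx",
    opset_version=17,
    source_revision="example-revision",
    dynamic_axes={"input_ids": ["batch", "sequence"]},
    external_data_files=("model.onnx.data",),
    converter_log="logs/export.log",
)
print(record.to_json())
```

Keeping the record separate from the graph file means it can be reviewed and versioned even when the artifact itself is a large binary.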

Operational implications

Treat successful serialization as the start of compatibility testing, not proof of deployability.

Measure

Converter coverage, unsupported nodes, artifact size, load success, and parity.

Compiler intermediate representations

IRs represent programs while analysis, rewriting, partitioning, lowering, and code generation occur.

Implementation

Identify the dialect or level, type and shape semantics, custom operations, and version compatibility.

Operational implications

MLIR is an infrastructure with dialects, not one universal model file. StableHLO is a versioned operation set for compiler interchange.

Measure

Pass success, verification errors, IR size, shape constraints, and lowering coverage.

Target-specific executable artifacts

Engines and compiled modules embed backend, precision, shape, and hardware assumptions.

Implementation

Store toolchain/runtime/driver versions, target capability, profiles, build flags, hashes, and model lineage.
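
One way to use the stored versions is to compare them against the host before loading an engine. The sketch below assumes a hypothetical `BuildRecord` type and a deliberately conservative exact-match rule; real runtimes may document looser compatibility guarantees, so the rule itself is an assumption to replace with the vendor's stated policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    """Hypothetical description of the environment an engine was built for."""

    toolchain_version: str
    runtime_version: str
    driver_version: str
    target_capability: str  # e.g. a GPU capability level or CPU feature set
    precision: str
    model_hash: str


def incompatibilities(build: BuildRecord, host: dict) -> list:
    """Return the names of checked fields where the host differs from the build."""
    # Assumption: exact equality is required for these three fields.
    checked = ("runtime_version", "driver_version", "target_capability")
    return [name for name in checked if host.get(name) != getattr(build, name)]


# Illustrative values only.
build = BuildRecord(
    toolchain_version="toolchain-1.0",
    runtime_version="runtime-1.0",
    driver_version="driver-1.0",
    target_capability="cap-a",
    precision="fp16",
    model_hash="example-hash",
)
host = {
    "runtime_version": "runtime-1.0",
    "driver_version": "driver-2.0",
    "target_capability": "cap-a",
}
print(incompatibilities(build, host))  # ['driver_version']
```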

Operational implications

Regenerate artifacts through a reproducible pipeline and preserve known-good packages for rollback.

Measure

Build time, load time, compatibility pass rate, artifact size, and warmup time.

Weight-oriented local packages

Local formats such as GGUF package model tensors and metadata for compatible engines and quantized execution.
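
As an illustration of how such a package announces itself, the sketch below reads the fixed-size fields at the start of a GGUF file. It assumes the little-endian layout with 64-bit counts described in the GGUF documentation (magic, version, tensor count, metadata key-value count); older or big-endian files would need different handling, and the function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed layout: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata key-value count, all little-endian.
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> dict:
    """Parse the leading header fields from the first bytes of a GGUF file."""
    if len(data) < HEADER.size:
        raise ValueError("input too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header bytes for demonstration; a real file would be opened in
# binary mode and its first HEADER.size bytes passed in instead.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(sample))
```

The header only confirms the container. Architecture, tokenizer, and quantization details live in the metadata entries that follow it.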

Implementation

Bind exact model architecture, tokenizer, quantization method, source checkpoint, license, and hash.

Operational implications

Nominal bit width does not define quality or kernel behavior. Test the exact package in the target engine.

Measure

Disk/RAM use, mmap behavior, load time, output quality, and token throughput.

Auxiliary assets

Tokenizers, vocabulary, image/audio preprocessors, special tokens, templates, and adapters affect runtime behavior.

Implementation

Version them with the model and include them in artifact integrity and rollback.

Operational implications

A stale tokenizer or chat template can change output while the weights remain identical.
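
A package-level digest makes this kind of drift visible. The sketch below is one possible approach, not a standard scheme: the function name `bundle_digest` and the asset names are hypothetical, and the byte strings stand in for real file contents.

```python
import hashlib


def bundle_digest(assets: dict) -> str:
    """Combine per-asset SHA-256 digests into one digest for the whole package."""
    outer = hashlib.sha256()
    # Sort by name so the result does not depend on insertion order.
    for name in sorted(assets):
        inner = hashlib.sha256(assets[name]).hexdigest()
        outer.update(f"{name}:{inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Placeholder contents standing in for real files.
released = {
    "weights.bin": b"weights-v1",
    "tokenizer.json": b"tokenizer-v1",
    "chat_template.txt": b"template-v1",
}
# Same weights, different chat template.
drifted = dict(released, **{"chat_template.txt": b"template-v2"})

print(bundle_digest(released) == bundle_digest(drifted))  # False
```

Because the weights hash alone would still match, the comparison has to cover every asset that influences output.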

Measure

Asset hash mismatches, tokenization parity, preprocessing parity, and invalid structured output.

Conversion and parity

Conversion maps operations, constants, layouts, shapes, and precision into another representation.

Implementation

Compare representative outputs, structured behavior, tokenizer/preprocessing, and edge cases within precision-appropriate tolerances.
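
A numerical comparison of this kind can be sketched in a few lines. The example below uses the common combined test `|ref - cand| <= atol + rtol * |ref|`; the function name `parity_report` is hypothetical, and the tolerance values are placeholders that would need to be chosen for the precision actually in use.

```python
import math


def parity_report(reference, candidate, rtol, atol):
    """Compare two flat sequences of floats element by element."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    max_abs_error = 0.0
    failures = 0
    for ref, cand in zip(reference, candidate):
        # NaN or infinity in the converted output always counts as a failure.
        if not math.isfinite(cand):
            failures += 1
            continue
        err = abs(ref - cand)
        max_abs_error = max(max_abs_error, err)
        if err > atol + rtol * abs(ref):
            failures += 1
    return {
        "max_abs_error": max_abs_error,
        "failures": failures,
        "passed": failures == 0,
    }


# Illustrative values: outputs of the source model and the converted model.
reference = [0.5, -1.25, 3.0]
candidate = [0.5004, -1.2496, 3.002]

loose = parity_report(reference, candidate, rtol=1e-3, atol=1e-3)
strict = parity_report(reference, candidate, rtol=0.0, atol=1e-3)
print(loose["passed"], strict["passed"])  # True False
```

Numerical closeness is only one part of parity. Structured behavior and tokenizer or preprocessing output still need their own checks.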

Operational implications

Preserve conversion commands and logs. Reject silent unsupported-node fallback unless the policy explicitly permits it.
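
Such a policy can be enforced mechanically once the converter's node placement is known. The sketch below assumes a hypothetical mapping from node name to one of three labels (`"accelerated"`, `"fallback"`, `"unsupported"`); real converters report placement in their own formats, so the input shape here is an assumption.

```python
from collections import Counter


class ConversionPolicyError(Exception):
    """Raised when a conversion result violates the configured policy."""


def enforce_fallback_policy(placements, allow_fallback=False, max_fallback_share=1.0):
    """Return the fallback share, or raise if the policy is violated."""
    counts = Counter(placements.values())
    total = len(placements)
    fallback_share = counts["fallback"] / total if total else 0.0
    if counts["unsupported"]:
        raise ConversionPolicyError(f"{counts['unsupported']} unsupported node(s)")
    if counts["fallback"] and not allow_fallback:
        raise ConversionPolicyError("fallback nodes present but policy forbids fallback")
    if fallback_share > max_fallback_share:
        raise ConversionPolicyError(f"fallback share {fallback_share:.2f} exceeds limit")
    return fallback_share


# Illustrative placement report with hypothetical node names.
placements = {
    "conv1": "accelerated",
    "attention": "accelerated",
    "custom_norm": "fallback",
    "softmax": "accelerated",
}
print(enforce_fallback_policy(placements, allow_fallback=True, max_fallback_share=0.5))  # 0.25
```

With the default arguments the same report is rejected, which keeps fallback an explicit opt-in rather than a silent outcome.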

Measure

Numerical/task parity, unsupported-operation count, inserted casts, and fallback share.

Provenance and integrity

Model artifacts are supply-chain inputs with potentially executable behavior.

Implementation

Use immutable URIs, content hashes or signatures, source/derived lineage, build metadata, license, and promotion state.

Operational implications

Fail closed on unexpected hashes or schema versions. Limit who can publish to production registries.
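
Failing closed means that anything unexpected stops the load rather than producing a warning. The sketch below illustrates the idea with a hypothetical manifest layout (`schema_version` and `sha256` keys) and a hypothetical `ArtifactRejected` exception. It covers checksums only; signature verification would be a separate step.

```python
import hashlib

# Assumption: the loader knows exactly which manifest schema versions it accepts.
SUPPORTED_SCHEMA_VERSIONS = {1}


class ArtifactRejected(Exception):
    """Raised whenever an artifact cannot be positively verified."""


def verify_artifact(payload: bytes, manifest: dict) -> None:
    """Return normally only if the payload matches a manifest we understand."""
    version = manifest.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ArtifactRejected(f"unsupported schema version: {version!r}")
    expected = manifest.get("sha256")
    if not isinstance(expected, str):
        raise ArtifactRejected("manifest has no sha256 entry")
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected.lower():
        raise ArtifactRejected("content hash does not match manifest")


# Placeholder bytes standing in for a real artifact.
payload = b"example artifact bytes"
manifest = {"schema_version": 1, "sha256": hashlib.sha256(payload).hexdigest()}

verify_artifact(payload, manifest)  # passes silently
try:
    verify_artifact(payload + b"tampered", manifest)
except ArtifactRejected as exc:
    print("rejected:", exc)
```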

Measure

Signature/checksum failures, provenance completeness, promotion history, and deletion/rollback success.

Reference tables

Format and artifact roles
| Format or artifact | Primary role | Typical consumer | Runtime implication |
|---|---|---|---|
| ONNX | Portable model graph | ONNX Runtime, compilers, vendor tools | Opset and provider coverage matter |
| StableHLO | Portable compiler operation set | OpenXLA-compatible compilers | Versioned compiler interchange |
| MLIR dialects | Multi-level compiler representation | Compiler transformations/backends | Not one deployable model format |
| torch.export program | Captured PyTorch graph | AOT/edge/backend pipelines | Explicit constraints and graph semantics |
| TensorRT engine | Target-specific plan | TensorRT runtime | Compatibility tied to target/runtime |
| ExecuTorch PTE | On-device program | ExecuTorch runtime | Portable and delegated partitions |
| GGUF | Local model package | llama.cpp-compatible engines | Weights plus model metadata and quantization |

Decision checklist

  1. Is this artifact for interchange, optimization, or direct execution?
  2. Which operator, shape, control-flow, and precision semantics must be preserved?
  3. Are tokenizer and preprocessing versions bound to the model?
  4. How is numerical and behavioral parity tested?
  5. Which runtime, backend, driver, and hardware versions are compatible?
  6. How are artifacts signed, licensed, rolled back, and deleted?

Common mistakes

  • Calling every serialized model an IR.
  • Shipping weights without tokenizer or preprocessing provenance.
  • Treating conversion success as proof of equivalent behavior.
  • Rebuilding target engines without preserving build parameters.
  • Loading mutable or unsigned artifact URLs in production.

Sources and further reading


  1. ONNX introduction and specification
    ONNX · Official specification · accessed 2026-06-21 UTC

  2. StableHLO specification
    OpenXLA · Official specification · accessed 2026-06-21 UTC

  3. MLIR documentation
    LLVM Project · Official documentation · accessed 2026-06-21 UTC

  4. torch.export
    PyTorch · Official documentation · accessed 2026-06-21 UTC

  5. ExecuTorch getting started
    PyTorch · Official documentation · accessed 2026-06-21 UTC

  6. GGUF format
    ggml-org · Official repository documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
