ARuntime Reference

AI Model Formats and Packaging

Understand model artifacts, serialization, quantization, packaging, provenance, and runtime compatibility across GGUF, Safetensors, ONNX, and compiled engines.

Audience: Technical readers
Reading time: 5 minutes
Status: Foundational
Last reviewed:

An AI model format is a contract for representing model artifacts so tools can exchange, inspect, load, transform, or execute them. Formats may store tensors only, a computational graph, typed metadata, tokenizers, quantization parameters, auxiliary modules, or a complete deployment package.

Key takeaways

  • A model architecture, a model format, a quantization method, and an inference runtime are different categories.
  • Format selection is constrained by the runtime, hardware, precision, packaging, provenance, and update path.
  • No single format optimizes equally for training interchange, local inference, high-concurrency serving, browser deployment, and supply-chain evidence.

Definition and scope

A model is the learned function and its parameters. A model artifact is the concrete collection of weights, configuration, tokenizer assets, preprocessing rules, adapters, and metadata needed to reproduce or deploy that model. A model format defines how some or all of those assets are represented.

The format sits at the boundary between model creation and execution. It is consumed by compilers, graph runtimes, or inference engines. It does not by itself schedule requests, allocate a cluster, authorize tools, preserve user memory, or produce an accountable business outcome.

Category rule: ask four separate questions. What model is this? How are its artifacts encoded? Which engine can load them? Which serving or application runtime owns the request?

Artifact layers

Model identity

Architecture, parameter set, tokenizer behavior, post-training method, task specialization, and version lineage.

Serialization

Tensor layout, graph representation, metadata types, offsets, shards, alignment, and optional resources.

Optimization representation

Precision, quantization scales, sparsity, compiled kernels, target-specific plans, or adapter deltas.

Distribution package

Manifest, hashes, signatures, license, model card, provenance, configuration, tokenizer, and deployment policy.

One file may span several layers, but combining them does not erase the distinctions. A single-file artifact can still require a specific engine and external trust policy. A multi-file repository can still represent one coherent model release.
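As a rough illustration, the four layers can be kept as separate records inside one release description, even when a single file spans several of them. Every class name, field name, and value below is hypothetical and chosen for this sketch; none of them comes from a format specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ModelIdentity:
    architecture: str
    version: str
    tokenizer: str


@dataclass(frozen=True)
class Serialization:
    format_name: str          # e.g. "gguf", "safetensors", "onnx"
    files: tuple[str, ...]    # one file or several shards


@dataclass(frozen=True)
class Optimization:
    precision: str                      # e.g. "fp16", "int8"
    quantization: Optional[str] = None  # method name, if any


@dataclass(frozen=True)
class DistributionPackage:
    license: str
    hashes: dict[str, str] = field(default_factory=dict)  # file name -> sha256 hex


@dataclass(frozen=True)
class ModelRelease:
    identity: ModelIdentity
    serialization: Serialization
    optimization: Optimization
    package: DistributionPackage

    def missing_hashes(self) -> list[str]:
        """Files named in the serialization layer that the package layer does not hash."""
        return [f for f in self.serialization.files if f not in self.package.hashes]


# Placeholder values only: a two-shard release where one shard lacks a recorded hash.
release = ModelRelease(
    identity=ModelIdentity("example-arch", "1.0", "example-tokenizer"),
    serialization=Serialization(
        "safetensors", ("model-00001.safetensors", "model-00002.safetensors")
    ),
    optimization=Optimization("fp16"),
    package=DistributionPackage("example-license", {"model-00001.safetensors": "ab" * 32}),
)
print(release.missing_hashes())  # ['model-00002.safetensors']
```

Keeping the layers as separate records makes a gap in one layer, such as an unhashed shard, visible without inspecting the others.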

Common format roles

Model format responsibilities and typical use

| Format or representation | Primary role | Strength | Important boundary |
| --- | --- | --- | --- |
| GGUF | Inference-oriented model artifact for GGML-based executors | Typed metadata, tensor data, efficient loading, broad local-inference ecosystem | Not a model family or universal runtime |
| Safetensors | Safe and fast tensor serialization | Simple tensor storage, zero-copy access, sharding-friendly repositories | Configuration and tokenizer assets are commonly separate |
| ONNX | Portable model graph and operator interchange | Graph semantics, tooling, execution providers, compiler/runtime portability | Operator and target coverage still determine executability |
| Framework checkpoint | Training and adaptation state | Optimizer, scheduler, gradients, and framework-native flexibility | May be unsafe or unsuitable as a production deployment artifact |
| Compiled engine or executable | Target-specific execution plan | Specialized kernels, memory plan, fast startup on a known target | Usually less portable across hardware, runtime, or shape changes |

The Safetensors documentation describes a simple tensor format designed for safe and fast zero-copy loading. [Safetensors] ONNX defines a graph-oriented exchange format. [ONNX] GGUF combines tensors with typed metadata for GGML-based inference. [GGUF specification]
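The differences between containers show up in their leading bytes. The sketch below relies on two layout facts: a GGUF file starts with the ASCII magic `GGUF`, and a Safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The function names and the `MAX_HEADER_BYTES` cap are assumptions made for this example. ONNX is protobuf-encoded and has no fixed magic, so the sketch does not try to detect it.

```python
import json
import struct

MAX_HEADER_BYTES = 100 * 1024 * 1024  # assumed cap for this sketch, not a spec value


def sniff_format(prefix: bytes) -> str:
    """Guess the container from leading bytes; a guess is not proof of loadability."""
    if prefix[:4] == b"GGUF":
        return "gguf"
    if len(prefix) >= 9:
        (header_len,) = struct.unpack("<Q", prefix[:8])
        if 0 < header_len <= MAX_HEADER_BYTES and prefix[8:9] == b"{":
            return "safetensors"
    return "unknown"


def read_safetensors_header(data: bytes) -> dict:
    """Parse only the JSON header; the tensor bytes are never interpreted."""
    if len(data) < 8:
        raise ValueError("too short to contain a header length")
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len > MAX_HEADER_BYTES or 8 + header_len > len(data):
        raise ValueError("implausible header length")
    return json.loads(data[8 : 8 + header_len])


# Build a tiny in-memory Safetensors-style blob: one float32 tensor with two elements.
header = json.dumps(
    {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
).encode("utf-8")
blob = struct.pack("<Q", len(header)) + header + struct.pack("<2f", 1.0, 2.0)

print(sniff_format(blob))                           # safetensors
print(read_safetensors_header(blob)["w"]["shape"])  # [2]
print(sniff_format(b"GGUF\x03\x00\x00\x00"))        # gguf
```

Recognizing a container this way says nothing about whether a given engine has the kernels or operators to run what is inside it.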

Formats and quantization are not the same

Quantization changes how numerical values are represented or approximated. A file format determines how the resulting tensors and metadata are stored. Some formats are strongly associated with particular quantization schemes, but the concepts remain separate.

  • Algorithm: decides how full-precision weights or activations are mapped to lower precision.
  • Tensor type: records the resulting representation, scale, block layout, or precision.
  • Container: stores tensor bytes and the metadata required to decode them.
  • Kernel: performs native or dequantizing computation on the target hardware.
  • Runtime policy: decides which representation is acceptable for quality, latency, memory, and security.
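The separation between algorithm, container, and kernel can be made concrete with a toy example. The sketch below uses symmetric absmax int8 quantization of a single block, chosen only because it is short. It is not the scheme used by GGUF, GPTQ, or any particular format, and the byte layout (one float32 scale followed by int8 values) is invented for illustration.

```python
import struct


def quantize_block(values: list[float]) -> tuple[float, list[int]]:
    """Algorithm: map one block of floats to int8 using a shared absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax else 1.0
    return scale, [round(v / scale) for v in values]


def pack_block(scale: float, q: list[int]) -> bytes:
    """Container: store a float32 scale followed by the int8 payload."""
    return struct.pack(f"<f{len(q)}b", scale, *q)


def unpack_block(raw: bytes, n: int) -> tuple[float, list[int]]:
    """Container: recover the scale and the n int8 values from the bytes."""
    scale, *q = struct.unpack(f"<f{n}b", raw)
    return scale, q


def dequantize_block(scale: float, q: list[int]) -> list[float]:
    """Kernel stand-in: a real kernel would fuse this step into the computation."""
    return [scale * x for x in q]


weights = [1.27, -0.64, 0.32, 0.0]
scale, q = quantize_block(weights)          # algorithm
raw = pack_block(scale, q)                  # container
stored_scale, stored_q = unpack_block(raw, len(weights))
restored = dequantize_block(stored_scale, stored_q)  # kernel stand-in
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Swapping `pack_block` for a different byte layout changes the container without touching the algorithm, which is the distinction the list above draws.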

GPTQ, for example, is a post-training quantization method rather than a universal model container. [GPTQ paper]

Compatibility is a matrix

“Supports the format” is not enough. A deployment must align the model architecture, tokenizer, tensor types, quantization variant, auxiliary components, engine version, hardware backend, and generation features. A parser may recognize a file while the engine lacks an operator, kernel, projector, adapter, or cache layout required to run it.

Record compatibility as a tested tuple:

model release + artifact hash + format version + tensor types
+ engine version + backend + hardware + execution configuration

That tuple should be part of the deployment manifest and evidence record rather than reconstructed from a filename after an incident.
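One possible way to hold the tuple is as an immutable, hashable record, so that "tested" becomes a set-membership check rather than a judgment call. The class name, field names, and all values below are placeholders invented for this sketch.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CompatibilityTuple:
    model_release: str
    artifact_sha256: str
    format_version: str
    tensor_types: tuple[str, ...]
    engine_version: str
    backend: str
    hardware: str
    execution_config: str


def is_tested(candidate: CompatibilityTuple, tested: set) -> bool:
    """Exact match only: changing any single field yields an untested combination."""
    return candidate in tested


# Placeholder values; a real record would come from an actual test run.
known = CompatibilityTuple(
    model_release="example-model-1.0",
    artifact_sha256="0" * 64,
    format_version="example-format-v3",
    tensor_types=("F16", "Q8"),
    engine_version="1.0.0",
    backend="example-backend",
    hardware="example-gpu",
    execution_config="ctx=4096",
)
tested = {known}

# Same artifact, newer engine: not covered by the existing evidence.
candidate = replace(known, engine_version="2.0.0")
print(is_tested(known, tested), is_tested(candidate, tested))  # True False

# Serialized form suitable for embedding in a deployment manifest.
manifest_entry = json.dumps(asdict(known), sort_keys=True)
```

Exact matching is deliberately strict. Whether an engine patch release should inherit earlier test evidence is a policy decision that this sketch leaves open.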

Provenance, integrity, and trust

A model file is input to a privileged parser and inference process, and pickle-based framework checkpoints can even execute code when loaded. Treat downloaded artifacts as untrusted until provenance, expected hashes, license, model identity, and compatibility have been checked. Keep parser and runtime versions patched, apply size and allocation limits, and perform validation before mapping large data regions.

A format may carry metadata without proving that the metadata is true. Supply-chain evidence therefore belongs in a signed manifest or transparency workflow that binds hashes for every required artifact, configuration file, tokenizer, adapter, and policy. Model-transparency tooling is one emerging implementation direction. [Model Transparency]
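The hash-binding half of that workflow can be sketched with the standard library alone. The example below assumes a manifest that is a plain mapping from file name to expected SHA-256 hex digest. The function names and file names are hypothetical. Verifying the signature on the manifest itself is out of scope here and would have to happen before these hashes are trusted.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every listed file matched."""
    problems = []
    for name, expected in sorted(manifest.items()):
        path = root / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif sha256_file(path) != expected:
            problems.append(f"hash mismatch: {name}")
    return problems


# Demo in a temporary directory: the tokenizer is present, the weights file is not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tokenizer.json").write_bytes(b"{}")
    manifest = {
        "tokenizer.json": hashlib.sha256(b"{}").hexdigest(),
        "model.safetensors": "0" * 64,
    }
    problems = verify_manifest(root, manifest)

print(problems)  # ['missing: model.safetensors']
```

The check covers only the files the manifest lists, so it is only as complete as the manifest. A tokenizer or adapter left out of the manifest is not bound by this evidence.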

Selection questions

  1. What stage? Training checkpoint, interchange graph, local inference artifact, high-concurrency serving package, or compiled target binary?
  2. What runtime? Which exact engine and version will load the artifact?
  3. What hardware? CPU, GPU, NPU, browser, mobile, or heterogeneous placement?
  4. What precision? Full precision, low-precision floating point, integer quantization, sparsity, or adapters?
  5. What packaging? Single file, shards, sidecars, repository, or signed deployment bundle?
  6. What trust evidence? Hashes, signer, provenance, model card, license, review date, and vulnerability process?
  7. What update path? Can the system roll out, validate, and roll back artifacts without changing application code?

Changelog

2026-06-22 UTC: Published the first model-format reference, separating model identity, serialization, quantization, runtime compatibility, and supply-chain evidence.
