An AI model format is a contract for representing model artifacts so tools can exchange, inspect, load, transform, or execute them. Formats may store tensors only, a computational graph, typed metadata, tokenizers, quantization parameters, auxiliary modules, or a complete deployment package.
Key takeaways
- A model architecture, a model format, a quantization method, and an inference runtime are different categories.
- Format selection is constrained by the runtime, hardware, precision, packaging, provenance, and update model.
- No single format optimizes equally for training interchange, local inference, high-concurrency serving, browser deployment, and supply-chain evidence.
Definition and scope
A model is the learned function and its parameters. A model artifact is the concrete collection of weights, configuration, tokenizer assets, preprocessing rules, adapters, and metadata needed to reproduce or deploy that model. A model format defines how some or all of those assets are represented.
The format sits at the boundary between model creation and execution. It is consumed by compilers, graph runtimes, or inference engines. It does not by itself schedule requests, allocate a cluster, authorize tools, preserve user memory, or produce an accountable business outcome.
Artifact layers
Model identity
Architecture, parameter set, tokenizer behavior, post-training method, task specialization, and version lineage.
Serialization
Tensor layout, graph representation, metadata types, offsets, shards, alignment, and optional resources.
Optimization representation
Precision, quantization scales, sparsity, compiled kernels, target-specific plans, or adapter deltas.
Distribution package
Manifest, hashes, signatures, license, model card, provenance, configuration, tokenizer, and deployment policy.
One file may span several layers, but combining them does not erase the distinctions. A single-file artifact can still require a specific engine and external trust policy. A multi-file repository can still represent one coherent model release.
Common format roles
| Format or representation | Primary role | Strength | Important boundary |
|---|---|---|---|
| GGUF | Inference-oriented model artifact for GGML-based executors | Typed metadata, tensor data, efficient loading, broad local-inference ecosystem | Not a model family or universal runtime |
| Safetensors | Safe and fast tensor serialization | Simple tensor storage, zero-copy access, sharding-friendly repositories | Configuration and tokenizer assets are commonly separate |
| ONNX | Portable model graph and operator interchange | Graph semantics, tooling, execution providers, compiler/runtime portability | Operator and target coverage still determine executability |
| Framework checkpoint | Training and adaptation state | Optimizer, scheduler, gradients, and framework-native flexibility | May be unsafe or unsuitable as a production deployment artifact |
| Compiled engine or executable | Target-specific execution plan | Specialized kernels, memory plan, fast startup on a known target | Usually less portable across hardware, runtime, or shape changes |
Safetensors documents a simple tensor format designed for safe and fast zero-copy loading. [ar_cite id=”safetensors-docs” label=”Safetensors”] ONNX defines a graph-oriented exchange format. [ar_cite id=”onnx-format” label=”ONNX”] GGUF combines tensors with typed metadata for GGML-based inference. [ar_cite id=”gguf-spec” label=”GGUF specification”]
Formats and quantization are not the same
Quantization changes how numerical values are represented or approximated. A file format determines how the resulting tensors and metadata are stored. Some formats are strongly associated with particular quantization schemes, but the concepts remain separate.
- Algorithm: decides how full-precision weights or activations are mapped to lower precision.
- Tensor type: records the resulting representation, scale, block layout, or precision.
- Container: stores tensor bytes and the metadata required to decode them.
- Kernel: performs native or dequantizing computation on the target hardware.
- Runtime policy: decides which representation is acceptable for quality, latency, memory, and security.
GPTQ, for example, is a post-training quantization method rather than a universal model container. [ar_cite id=”gptq-paper” label=”GPTQ paper”]
Compatibility is a matrix
“Supports the format” is not enough. A deployment must align the model architecture, tokenizer, tensor types, quantization variant, auxiliary components, engine version, hardware backend, and generation features. A parser may recognize a file while the engine lacks an operator, kernel, projector, adapter, or cache layout required to run it.
Record compatibility as a tested tuple:
model release + artifact hash + format version + tensor types
+ engine version + backend + hardware + execution configuration
That tuple should be part of the deployment manifest and evidence record rather than reconstructed from a filename after an incident.
Provenance, integrity, and trust
A model file is executable input to a privileged parser and inference process. Treat downloaded artifacts as untrusted until provenance, expected hashes, license, model identity, and compatibility have been checked. Keep parser and runtime versions patched, apply size and allocation limits, and perform validation before mapping large data regions.
A format may carry metadata without proving that the metadata is true. Supply-chain evidence therefore belongs in a signed manifest or transparency workflow that binds hashes for every required artifact, configuration file, tokenizer, adapter, and policy. Model-transparency tooling is one emerging implementation direction. [ar_cite id=”model-transparency” label=”Model Transparency”]
Selection questions
- What stage? Training checkpoint, interchange graph, local inference artifact, high-concurrency serving package, or compiled target binary?
- What runtime? Which exact engine and version will load the artifact?
- What hardware? CPU, GPU, NPU, browser, mobile, or heterogeneous placement?
- What precision? Full precision, low-precision floating point, integer quantization, sparsity, or adapters?
- What packaging? Single file, shards, sidecars, repository, or signed deployment bundle?
- What trust evidence? Hashes, signer, provenance, model card, license, review date, and vulnerability process?
- What update path? Can the system roll out, validate, and roll back artifacts without changing application code?
Changelog
2026-06-22 UTC: Published the first model-format reference, separating model identity, serialization, quantization, runtime compatibility, and supply-chain evidence.
