Memory-centric AI runtimes treat model weights, KV state, and intermediate artifacts as objects in a distributed memory hierarchy that can be placed independently of compute. The objective is to avoid expensive recomputation and to make capacity beyond a single accelerator usable.
Key takeaways
- Memory capacity, bandwidth, latency, coherence, and ownership are separate design variables.
- CXL and shared-memory research may reduce copies, but software must still provide directories, synchronization, isolation, and recovery.
- Many advanced designs remain research or specialized deployments rather than default production practice.
Definition
A conventional engine treats device memory as the primary execution state and spills outward when necessary. A memory-centric runtime exposes several tiers (accelerator memory, host memory, coherent expansion, local storage, and remote cache) and schedules computation around state placement.
Memory hierarchy
| Tier | Strength | Constraint |
|---|---|---|
| Accelerator memory | Highest local bandwidth and direct kernel access | Scarce capacity and high cost |
| Host memory | Larger capacity and simpler management | Transfer and synchronization overhead |
| Coherent attached memory | Byte-addressable expansion and shared-pool potential | Topology, coherence, and software maturity |
| Local NVMe | Large, inexpensive persistence | Much higher latency |
| Remote cache/storage | Cluster-wide reuse and elasticity | Network bandwidth, consistency, and security |
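To make the idea of scheduling around state placement concrete, here is a minimal sketch of a placement policy over the tiers in the table. Everything in it is an assumption for illustration: the `Tier` class, the `place` function, and the capacities are hypothetical placeholders, and a real runtime would also weigh access frequency, transfer cost, and eviction rather than simply taking the first tier with room.

```python
from dataclasses import dataclass

GiB = 1 << 30


@dataclass
class Tier:
    """One level of the hierarchy (hypothetical structure, not a real API)."""
    name: str
    capacity_bytes: int
    used_bytes: int = 0

    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes


def default_tiers():
    # Ordered fastest to slowest, mirroring the table above.
    # Capacities are illustrative placeholders, not measurements.
    return [
        Tier("accelerator", 80 * GiB),
        Tier("host", 512 * GiB),
        Tier("coherent_attached", 1024 * GiB),
        Tier("local_nvme", 4096 * GiB),
        Tier("remote", 65536 * GiB),
    ]


def place(tiers, size_bytes):
    """Reserve space in the fastest tier that has room and return its name."""
    for tier in tiers:
        if tier.free_bytes() >= size_bytes:
            tier.used_bytes += size_bytes
            return tier.name
    raise MemoryError("no tier can hold the object")
```

With these placeholder capacities, a second 60 GiB object no longer fits in accelerator memory and falls through to host memory, which is the spill behavior the Definition section describes.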
Coherent memory fabrics
Compute Express Link (CXL) specifies cache-coherent and memory-expansion capabilities over compatible interconnects. [ar_cite id="cxl" label="CXL"] For AI runtimes, the architectural opportunity is a rack-scale pool that several hosts or accelerators can address through a more memory-like interface than ordinary object or network storage.
Coherence at the protocol or device level does not remove the need for application ownership, leases, versioning, and failure handling. The runtime must know which worker may publish, mutate, consume, or evict a cache object.
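One way to picture application-level ownership is a lease table that answers "may this worker mutate this object right now?". The sketch below is a single-process illustration under stated assumptions: `LeaseTable`, `acquire`, and `may_mutate` are hypothetical names, and a real deployment would need a replicated or fabric-resident directory plus fencing against delayed writes, which this code does not attempt.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """Single-writer leases keyed by object id (hypothetical, in-process only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._leases = {}

    def acquire(self, object_id, worker, ttl_s):
        """Grant or renew the lease unless another worker holds a live one."""
        now = self._clock()
        lease = self._leases.get(object_id)
        if lease is not None and lease.expires_at > now and lease.owner != worker:
            return False
        # Either no lease, an expired (stale-owner) lease, or a renewal.
        self._leases[object_id] = Lease(worker, now + ttl_s)
        return True

    def may_mutate(self, object_id, worker):
        """Only the holder of an unexpired lease may publish, mutate, or evict."""
        lease = self._leases.get(object_id)
        return (
            lease is not None
            and lease.owner == worker
            and lease.expires_at > self._clock()
        )
```

The expiry check is what allows recovery from a stale owner: once the time-to-live passes, another worker can take the lease without the original holder's cooperation.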
Processing near memory
For extremely large or sparse-attention state, research explores performing filtering or attention work closer to memory so that the system moves only selected data. The potential benefit is reduced pressure on the bus and on device memory; the cost is specialized hardware, new programming models, and new correctness boundaries. ARuntime treats these as emerging architectures, not assumed platform features.
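The "move only selected data" idea can be illustrated with ordinary host code that stands in for the near-memory step. In this sketch the scoring and top-k selection are assumed to run next to the memory holding the KV blocks, so only the chosen blocks would cross the bus. The function names and the per-block relevance scores are hypothetical, and no specific near-memory hardware or prototype is being modeled.

```python
import heapq


def select_blocks(block_scores, k):
    """Return the indices of the k highest-scoring blocks, in ascending order."""
    top = heapq.nlargest(k, range(len(block_scores)), key=block_scores.__getitem__)
    return sorted(top)


def bytes_moved(num_blocks, block_bytes, selected=None):
    """Bytes transferred when moving every block, or only the selected ones."""
    count = num_blocks if selected is None else len(selected)
    return count * block_bytes
```

Comparing `bytes_moved` with and without a selection gives the transfer saving for a given `k`; whether that saving outweighs the cost of the selection step itself depends on the hardware and workload.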
Consistency and ownership
- Stable object identifiers independent of virtual addresses
- Model, layer, layout, precision, sequence, and token-position versioning
- Single-writer or explicit multi-writer semantics
- Lease, lock, and stale-owner recovery
- Tenant and workload isolation
- Integrity checks and secure deletion
- Fallback when the shared tier is unavailable
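Several of the requirements in this list can be sketched together: an identifier derived from version fields rather than from an address, an integrity check on read, and a fallback path when the shared tier cannot serve the object. The code below is a minimal illustration, assuming a dict-like `store` and a caller-supplied `recompute` function. `CacheKey`, `publish`, `fetch_or_recompute`, and `SharedTierUnavailable` are hypothetical names, and isolation and secure deletion are out of scope here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheKey:
    """Version fields that together identify one cached KV object."""
    model: str
    layer: int
    layout: str
    precision: str
    sequence_id: str
    token_start: int
    token_end: int

    def object_id(self) -> str:
        # Stable across processes and hosts; independent of virtual addresses.
        fields = (self.model, self.layer, self.layout, self.precision,
                  self.sequence_id, self.token_start, self.token_end)
        return hashlib.sha256(repr(fields).encode()).hexdigest()


class SharedTierUnavailable(Exception):
    """Raised by a (hypothetical) store when the shared tier cannot be reached."""


def publish(store, key, payload: bytes):
    """Store the payload together with its digest for later verification."""
    store[key.object_id()] = (hashlib.sha256(payload).hexdigest(), payload)


def fetch_or_recompute(store, key, recompute):
    """Return a verified payload, or recompute on miss, corruption, or outage."""
    try:
        entry = store.get(key.object_id())
    except SharedTierUnavailable:
        return recompute()
    if entry is None:
        return recompute()
    digest, payload = entry
    if hashlib.sha256(payload).hexdigest() != digest:
        return recompute()
    return payload
```

Because precision and layout are part of the key, an object written at one precision is never returned to a reader expecting another; the reader simply misses and recomputes.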
Trade-offs
Memory pooling can improve utilization while making latency less predictable. Shared state can improve reuse while enlarging the trust boundary. Fine-grained access can reduce copies while increasing synchronization overhead. Benchmark any such design against simpler local-memory and RDMA baselines under the exact context, model, hardware, and load of the target deployment.
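Because the first trade-off concerns predictability rather than average speed, a comparison against a baseline is more informative when it reports tail percentiles alongside the median. The helper below is a small sketch of that comparison using nearest-rank percentiles. The function names are hypothetical, and the latency samples are assumed to come from your own measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def compare(baseline_ms, candidate_ms, percentiles=(50, 99)):
    """Return {p: candidate / baseline}; a ratio above 1.0 means slower."""
    return {
        p: percentile(candidate_ms, p) / percentile(baseline_ms, p)
        for p in percentiles
    }
```

A design can show a ratio below 1.0 at the median and well above 1.0 at the 99th percentile, which is the "less predictable latency" pattern described above.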
Research status
Hierarchical cache and remote offload are present in production systems. Rack-scale coherent KV pools and processing-near-memory designs remain active research and specialized engineering. Quantitative claims should be interpreted within the cited prototype’s hardware, model, workload, and comparison baseline.
