Memory-Centric AI Runtimes

Memory-centric AI runtimes treat model weights, KV state, and intermediate artifacts as a distributed hierarchy that can be placed independently from compute. The objective is to reduce expensive recomputation and make capacity available beyond one accelerator.

Key takeaways

Memory capacity, bandwidth, latency, coherence, and ownership are separate design variables.
CXL and shared-memory research may reduce copies, but software must still provide directories, synchronization, isolation, and recovery.
Many advanced designs remain research or specialized deployments rather than default production practice.

Definition

A conventional engine treats device memory as the primary execution state and spills outward when necessary. A memory-centric runtime exposes several tiers—accelerator memory, host memory, coherent expansion, local storage, remote cache—and schedules computation around state placement.
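The tier list above can be sketched as a small data model with a placement function. This is a minimal illustration, not a reference design: the names `Tier`, `StateObject`, and `place` are hypothetical, and the greedy "closest tier with room" policy is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Optional


class Tier(IntEnum):
    """Tiers ordered from closest to compute (0) to farthest (4)."""
    ACCELERATOR = 0
    HOST = 1
    COHERENT_EXPANSION = 2
    LOCAL_STORAGE = 3
    REMOTE_CACHE = 4


@dataclass
class StateObject:
    object_id: str
    size_bytes: int
    tier: Tier


def place(size_bytes: int, free_bytes: Dict[Tier, int]) -> Optional[Tier]:
    """Return the closest tier with enough free capacity, or None if none fits."""
    for tier in sorted(Tier):
        if free_bytes.get(tier, 0) >= size_bytes:
            return tier
    return None
```

With 1 GiB free on the accelerator and 64 GiB free on the host, `place(2 << 30, {Tier.ACCELERATOR: 1 << 30, Tier.HOST: 64 << 30})` returns `Tier.HOST`. A real scheduler would also weigh access frequency and transfer cost rather than capacity alone.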

Memory hierarchy

AI runtime memory tiers
Tier	Strength	Constraint
Accelerator memory	Highest local bandwidth and direct kernel access	Scarce capacity and high cost
Host memory	Larger capacity and simpler management	Transfer and synchronization overhead
Coherent attached memory	Byte-addressable expansion and shared-pool potential	Topology, coherence, and software maturity
Local NVMe	Large, inexpensive persistence	Much higher latency
Remote cache/storage	Cluster-wide reuse and elasticity	Network bandwidth, consistency, and security
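
One way to reason about the table is to compare the time to fetch state from a tier with the time to recompute it. The sketch below assumes a simple model, fetch time = fixed latency + size / bandwidth. `TierCost`, `fetch_time`, and `should_fetch` are hypothetical names, and any numbers passed in are placeholders to be replaced with measurements from the target hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierCost:
    latency_s: float       # fixed per-fetch latency, in seconds
    bandwidth_bps: float   # sustained throughput, in bytes per second


def fetch_time(size_bytes: int, cost: TierCost) -> float:
    """Estimated seconds to bring size_bytes in from a tier."""
    return cost.latency_s + size_bytes / cost.bandwidth_bps


def should_fetch(size_bytes: int, cost: TierCost, recompute_s: float) -> bool:
    """True when fetching cached state is estimated to beat recomputing it."""
    return fetch_time(size_bytes, cost) < recompute_s
```

For example, with a placeholder tier of 1 ms latency and 1 GB/s bandwidth, a 1 GB object has an estimated fetch time of about 1.001 s, so fetching wins only if recomputation would take longer than that.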

Coherent memory fabrics

Compute Express Link (CXL) specifies cache-coherent device attachment and memory expansion over PCIe-based links. [CXL] For AI runtimes, the architectural opportunity is a rack-scale pool that several hosts or accelerators can address through an interface closer to memory than to object or network storage.

Coherence at the protocol or device level does not remove the need for application ownership, leases, versioning, and failure handling. The runtime must know which worker may publish, mutate, consume, or evict a cache object.
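
Application-level ownership can be sketched as a lease table with a fencing epoch. This is a single-process illustration under stated assumptions: `Lease` and `LeaseTable` are hypothetical names, the table is not thread-safe, and a real deployment would keep it in a coordination service reachable by every worker.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    owner: str
    expires_at: float
    epoch: int  # fencing token; increases on every ownership change


class LeaseTable:
    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def acquire(self, object_id: str, worker: str) -> Optional[int]:
        """Grant or renew a lease; return its epoch, or None if another worker holds it."""
        now = self._clock()
        cur = self._leases.get(object_id)
        live = cur is not None and cur.expires_at > now
        if live and cur.owner != worker:
            return None                      # someone else is the live owner
        if live:
            cur.expires_at = now + self._ttl  # same owner: renew, keep epoch
            return cur.epoch
        epoch = cur.epoch + 1 if cur is not None else 1
        self._leases[object_id] = Lease(worker, now + self._ttl, epoch)
        return epoch

    def may_mutate(self, object_id: str, worker: str, epoch: int) -> bool:
        """True only for the current, unexpired owner presenting the current epoch."""
        cur = self._leases.get(object_id)
        return (cur is not None and cur.owner == worker
                and cur.epoch == epoch and cur.expires_at > self._clock())
```

The epoch is what makes stale-owner recovery safe: a worker that stalls past its lease and later wakes up still holds its old epoch, so `may_mutate` rejects its writes after another worker has taken over.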

Shared cache architecture

TraCT explores a rack-scale shared-memory KV-cache design over CXL and reconstructs coordination in software where cross-host atomics or full coherence are unavailable. [TraCT] This illustrates the central challenge: faster access is useful only when directories, locks, object identities, and recovery are correct.
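
To make the directory requirement concrete, here is a minimal sketch of a software-coordinated directory with versioned compare-and-set publishing. It is not TraCT's design: `Entry` and `Directory` are hypothetical names, and `threading.Lock` stands in for whatever cross-host lock a real system would have to build.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    offset: int    # location of the object inside the shared region
    length: int
    version: int


class Directory:
    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for a cross-host lock
        self._entries: Dict[str, Entry] = {}

    def publish(self, object_id: str, offset: int, length: int,
                expected_version: int) -> bool:
        """Compare-and-set: succeed only if the caller saw the latest version."""
        with self._lock:
            cur = self._entries.get(object_id)
            cur_version = cur.version if cur is not None else 0
            if cur_version != expected_version:
                return False  # another writer published first
            self._entries[object_id] = Entry(offset, length, cur_version + 1)
            return True

    def lookup(self, object_id: str) -> Optional[Entry]:
        with self._lock:
            return self._entries.get(object_id)
```

A writer that loses the race gets `False` back and must look up the new entry before retrying, so two workers never silently overwrite each other's placement.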

Processing near memory

For extremely large or sparse-attention state, research considers filtering or attention work closer to memory so the system moves only selected data. The potential benefit is reduced bus and device-memory pressure; the cost is specialized hardware, programming models, and new correctness boundaries. ARuntime treats these as emerging architectures, not assumed platform features.
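
The data-movement argument can be illustrated with a toy selection step. The sketch assumes relevance scores are already available next to the stored blocks. `select_near_memory` is a hypothetical name, and the function models only the accounting (which blocks move, how many bytes), not any real near-memory hardware.

```python
from typing import List, Sequence, Tuple


def select_near_memory(scores: Sequence[float], blocks: Sequence[bytes],
                       k: int) -> Tuple[List[int], int]:
    """Pick the k highest-scoring blocks; return their indices and total bytes moved."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    moved = sum(len(blocks[i]) for i in chosen)
    return chosen, moved
```

With four 1 KiB blocks and `k=2`, only 2 KiB cross the bus instead of 4 KiB. Whether that saving survives the cost of scoring near memory depends on the hardware.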

Consistency and ownership

Stable object identifiers independent of virtual addresses
Model, layer, layout, precision, sequence, and token-position versioning
Single-writer or explicit multi-writer semantics
Lease, lock, and stale-owner recovery
Tenant and workload isolation
Integrity checks and secure deletion
Fallback when the shared tier is unavailable
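
Several of these requirements can be combined in a short sketch: an identifier derived from version fields rather than addresses, a checksum on read, and local recomputation as the fallback. All names here (`CacheKey`, `checksum`, `read_with_fallback`) are hypothetical, and `shared` is assumed to be any mapping from object ID to a `(data, digest)` pair. SHA-256 is used to detect accidental corruption; protecting against a malicious tenant would additionally need a keyed MAC or signature.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable, Mapping, Tuple


@dataclass(frozen=True)
class CacheKey:
    model: str
    layer: int
    layout: str
    precision: str
    sequence_id: str
    token_start: int
    token_end: int

    def object_id(self) -> str:
        """Stable ID derived from version fields, independent of any address."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_with_fallback(key: CacheKey,
                       shared: Mapping[str, Tuple[bytes, str]],
                       recompute: Callable[[CacheKey], bytes]) -> bytes:
    """Return cached bytes if present and intact; otherwise recompute locally."""
    try:
        entry = shared.get(key.object_id())
    except ConnectionError:
        entry = None  # shared tier unavailable
    if entry is not None:
        data, digest = entry
        if checksum(data) == digest:
            return data
    return recompute(key)
```

Because precision and layout are part of the key, an FP16 entry can never be served to a worker expecting a different format. A mismatch simply misses and falls through to recomputation.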

Trade-offs

Memory pooling can improve utilization while making latency less predictable. Shared state can improve reuse while enlarging the trust boundary. Fine-grained access can reduce copies while increasing synchronization overhead. Benchmark any pooled design against simpler local-memory and RDMA baselines, using the same context length, model, hardware, and load as the target deployment.
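
Because the main risk is less predictable latency, a comparison should look at the tail as well as the median. The helper below is a minimal sketch using Python's standard `statistics.quantiles`. `summarize` and `compare` are hypothetical names, and the latency samples are assumed to come from runs under identical model, context, hardware, and load.

```python
import statistics
from typing import Dict, Sequence


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Median and 99th-percentile latency for one configuration."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def compare(candidate_ms: Sequence[float],
            baseline_ms: Sequence[float]) -> Dict[str, float]:
    """Ratio of candidate to baseline at each percentile (above 1 means slower)."""
    cand, base = summarize(candidate_ms), summarize(baseline_ms)
    return {name: cand[name] / base[name] for name in cand}
```

A pooled design whose p50 ratio is below 1 but whose p99 ratio is well above 1 is trading predictability for average-case speed, which is the trade-off described above.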

Research status

Hierarchical caching and remote offload are already used in production systems. Rack-scale coherent KV pools and processing-near-memory designs remain active research and specialized engineering. Interpret quantitative claims within the cited prototype’s hardware, model, workload, and comparison baseline.
