KV Cache and Runtime Memory

KV cache is generation state that lets an autoregressive model reuse attention keys and values from prior tokens. It is one of the dominant capacity and scheduling resources in LLM inference, but it is not durable application memory.

Key takeaways

Cache demand scales with layer count, KV-head count and head dimension, cache precision, sequence length, and concurrency.
Paging, reuse, offload, and remote tiers trade memory capacity against lookup, transfer, and fragmentation costs.
Cache contents may contain sensitive derived state and need tenant isolation and lifecycle controls.

Role of KV state

During prefill and decode, each attention layer produces key and value tensors for every position it processes, and later tokens attend over the tensors from earlier positions. Retaining them avoids recomputing the full prefix at every new token. The engine associates cache blocks with a model/deployment version, sequence, layer, position, and precision.

Conversation history stored in a database can reconstruct a prompt but cannot replace the engine’s active KV state without re-prefill. Conversely, KV state cannot serve as an auditable long-term memory because it is opaque, engine-specific, and normally ephemeral.

Capacity drivers

Cache demand grows with the number of layers, KV heads, head dimension, bytes per element, and retained tokens. Concurrency multiplies the total. Sliding-window or sparse attention can bound or slow growth with sequence length, and quantized cache formats reduce bytes per element at a possible cost in quality or kernel support. Capacity planning should use the exact model configuration rather than a generic parameter-count estimate.
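The drivers above multiply together, which a short calculation makes concrete. The sketch below is an estimate under stated assumptions, not an engine's accounting: `KVConfig`, `kv_bytes_per_token`, and `kv_bytes_total` are hypothetical names, the example configuration is illustrative rather than a specific model, and the estimate ignores quantization metadata, block rounding, and allocator overhead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVConfig:
    """Hypothetical cache parameters; read real values from the model config."""
    num_layers: int
    num_kv_heads: int          # KV heads, which may be fewer than query heads
    head_dim: int
    bytes_per_element: float   # e.g. 2 for fp16/bf16, 1 for an 8-bit cache


def kv_bytes_per_token(cfg: KVConfig) -> float:
    # The factor 2 covers one key vector and one value vector
    # per layer and per KV head.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_element


def kv_bytes_total(cfg: KVConfig, tokens_per_sequence: int,
                   concurrent_sequences: int,
                   window: Optional[int] = None) -> float:
    # With a sliding window, at most `window` tokens are retained per sequence.
    retained = tokens_per_sequence if window is None else min(tokens_per_sequence, window)
    return kv_bytes_per_token(cfg) * retained * concurrent_sequences


if __name__ == "__main__":
    # Illustrative configuration, not taken from any particular model.
    cfg = KVConfig(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
    print(kv_bytes_per_token(cfg) / 1024, "KiB per token")
    print(kv_bytes_total(cfg, 4096, 16) / 2**30, "GiB for 16 sequences of 4096 tokens")
```

Under these assumed values, halving `bytes_per_element` or the number of KV heads halves the total, while doubling concurrency doubles it.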

Allocation and paging

Block or page allocators reduce the need for contiguous per-sequence regions and allow sequences with different lengths to share the cache pool. The runtime still must manage fragmentation, reference counts, eviction, and out-of-memory behavior. An admission policy should reserve enough headroom for decode progress so already accepted sequences are not stranded.
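A minimal sketch of a block pool with reference counts and a decode reserve may help make the allocation and admission ideas concrete. Everything here is an assumption for illustration: `BlockPool`, its methods, and the fixed `decode_reserve_blocks` policy are hypothetical and do not mirror any particular engine, and eviction is left out.

```python
class BlockPool:
    """Toy fixed-size block allocator; illustrative only."""

    def __init__(self, num_blocks: int, block_tokens: int, decode_reserve_blocks: int):
        self.block_tokens = block_tokens
        self.decode_reserve = decode_reserve_blocks
        self.free = list(range(num_blocks))   # ids of unallocated blocks
        self.refcount = {}                    # block id -> number of holders

    def blocks_needed(self, tokens: int) -> int:
        # Ceiling division: a partially filled block still occupies a whole block.
        return -(-tokens // self.block_tokens)

    def can_admit(self, prompt_tokens: int) -> bool:
        # Admit only if the reserve for already accepted sequences stays intact.
        return len(self.free) - self.blocks_needed(prompt_tokens) >= self.decode_reserve

    def allocate(self, n: int, for_decode: bool = False) -> list:
        # New admissions may not dip into the reserve; decode steps may.
        floor = 0 if for_decode else self.decode_reserve
        if len(self.free) - n < floor:
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop() for _ in range(n)]
        for block in blocks:
            self.refcount[block] = 1
        return blocks

    def share(self, block: int) -> None:
        # A second sequence references the same block, e.g. a shared prefix.
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)
```

A block returns to the free list only when its last holder releases it, which is what lets sequences of different lengths share one pool without contiguous regions.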

Prefix reuse

When requests share a verified prefix, the engine or serving layer can reuse cached state and avoid repeated prefill. Reuse keys must incorporate model, tokenizer, prompt bytes or canonical representation, adapter, decoding-relevant configuration, and privacy scope. Cross-tenant reuse may leak information through timing or access patterns and should be disabled unless a deliberate security design allows it.
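One way to build such a reuse key is to hash every scoping field together with a canonical form of the prefix. The sketch below assumes token IDs as the canonical representation and uses hypothetical field names (`tenant_id`, `model_version`, `cache_config`, and so on); a real serving layer would choose its own fields and serialization.

```python
import hashlib
import json


def prefix_reuse_key(*, tenant_id: str, model_version: str, tokenizer_version: str,
                     adapter_id: str, cache_config: dict, token_ids: list) -> str:
    """Derive a lookup key that changes when any scoping field changes."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    # across equivalent dict orderings.
    scope = json.dumps(
        {
            "tenant": tenant_id,            # privacy scope: no cross-tenant hits
            "model": model_version,
            "tokenizer": tokenizer_version,
            "adapter": adapter_id,
            "cache_config": cache_config,   # e.g. cache precision, block size
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256()
    digest.update(scope.encode("utf-8"))
    digest.update(b"\x00")                  # separator between scope and prefix
    digest.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return digest.hexdigest()
```

Because the tenant identifier is part of the hashed scope, two tenants sending the same prompt get different keys, which matches the default of disabling cross-tenant reuse.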

Offload and remote tiers

Cold blocks may move from device memory to host memory, storage, or a remote cache service. Tiering expands capacity but adds transfer latency and unpredictability. Prefetching should be based on expected next use, while deadlines and queue policy determine whether to wait, recompute, or reject.
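The wait, recompute, or reject choice can be framed as a comparison of estimated times against the remaining deadline. The following is a simplified sketch under assumed inputs: `restore_or_recompute` and all of its parameters are hypothetical, the cost model is linear, and real schedulers would also account for queueing and contention.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # wait for the offloaded blocks to be restored
    RECOMPUTE = "recompute"  # discard the offloaded copy and re-run prefill
    REJECT = "reject"        # neither path can meet the deadline


def restore_or_recompute(*, bytes_to_restore: float, link_bytes_per_s: float,
                         fixed_latency_s: float, prefix_tokens: int,
                         prefill_tokens_per_s: float, deadline_s: float) -> Action:
    # Estimated time to pull the blocks back from the slower tier.
    restore_s = fixed_latency_s + bytes_to_restore / link_bytes_per_s
    # Estimated time to rebuild the same state by re-running prefill.
    recompute_s = prefix_tokens / prefill_tokens_per_s
    if min(restore_s, recompute_s) > deadline_s:
        return Action.REJECT
    return Action.WAIT if restore_s <= recompute_s else Action.RECOMPUTE
```

With a fast link the restore path usually wins for long prefixes; as the link slows or the prefix shortens, recomputation becomes the cheaper option.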

Compression and precision

Cache quantization or compression can reduce memory footprint and transport cost. Evaluate quality, encode/decode overhead, kernel support, error accumulation, and interaction with attention implementations. Research results should not be generalized across models or hardware without reproduction.
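As a starting point for that evaluation, the round-trip error of a candidate format can be measured directly. The sketch below uses symmetric per-vector 8-bit quantization in plain Python as a stand-in technique; it is not any engine's cache format, the function names are hypothetical, and element-wise error alone does not predict end-to-end output quality.

```python
def quantize_int8(values):
    """Symmetric per-vector quantization to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 for an all-zero vector to avoid dividing by zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_abs_error(values):
    """Largest element-wise difference after a quantize/dequantize round trip."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For this scheme the round-trip error per element is bounded by about half the scale, so a single outlier in a vector raises the error for every other element in it. That sensitivity is one reason results should be reproduced on the target model's actual key and value distributions.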

Security and retention

Partition cache namespaces by tenant and model deployment.
Do not expose cache handles as authorization tokens.
Encrypt or protect remote cache transport and storage.
Invalidate state on model, tokenizer, adapter, or policy incompatibility.
Define secure cleanup for ended, expired, or revoked sessions.
Exclude cache payloads from routine logs and evidence exports.
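Several of these controls reduce to a check performed before any cached state is reused, plus a periodic sweep. The sketch below assumes hypothetical `CacheTag` and `CacheEntry` records; the field names are illustrative, and the secure release of dropped blocks is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheTag:
    """Identity that must match exactly before cached state is reused."""
    tenant_id: str
    deployment_id: str
    tokenizer_version: str
    adapter_id: str
    policy_version: str


@dataclass
class CacheEntry:
    tag: CacheTag
    expires_at: float      # absolute time in seconds, same clock as `now`
    revoked: bool = False


def is_usable(entry: CacheEntry, request_tag: CacheTag, now: float) -> bool:
    # Any mismatch in tenant, deployment, tokenizer, adapter, or policy
    # makes the entry unusable, as does expiry or revocation.
    return (not entry.revoked) and now < entry.expires_at and entry.tag == request_tag


def sweep(entries, now: float):
    """Split entries into (kept, dropped); the caller releases dropped state."""
    kept, dropped = [], []
    for entry in entries:
        (dropped if entry.revoked or now >= entry.expires_at else kept).append(entry)
    return kept, dropped
```

Comparing the whole tag rather than a caller-supplied handle keeps authorization separate from cache lookup: holding a handle alone never makes an entry usable.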

Metrics

Track allocated and free blocks, per-sequence occupancy, prefix-hit rate, eviction, spill and restore latency, transfer volume, allocation failure, fragmentation, recomputation, and cache-attributed TTFT/TPOT. Correlate with model version, sequence lengths, concurrency, and scheduler policy.
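Two of these metrics can be derived from counters most runtimes already hold. The definitions below are one reasonable choice rather than a standard: `prefix_hit_rate` is token-weighted, and `internal_fragmentation` counts only unused slots inside allocated blocks, assuming a fixed block size.

```python
def prefix_hit_rate(reused_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from cached prefix state."""
    return reused_tokens / prompt_tokens if prompt_tokens else 0.0


def internal_fragmentation(seq_token_counts, block_tokens: int) -> float:
    """Fraction of allocated token slots that hold no token."""
    # Each sequence occupies ceil(tokens / block_tokens) whole blocks.
    allocated_blocks = sum(-(-n // block_tokens) for n in seq_token_counts)
    allocated_slots = allocated_blocks * block_tokens
    used_slots = sum(seq_token_counts)
    return 1 - used_slots / allocated_slots if allocated_slots else 0.0
```

For example, sequences of 17, 32, and 1 tokens with 16-token blocks occupy 5 blocks (80 slots) for 50 tokens, so 37.5% of the allocated slots are empty.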
