KV Cache and Runtime Memory

Key takeaways

KV-cache demand grows with active tokens and model architecture, so context length and concurrency compete for one memory pool.
Paged allocation reduces fragmentation but does not by itself create cross-request prefix reuse.
Prefix caching reuses exact token prefixes; radix-tree organization makes branching reuse explicit.
Tiered offload adds useful capacity only when lookup, transfer, and restore together cost less than recomputation.
Cache isolation, retention, provenance, and eviction are governance concerns as well as performance choices.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Per-layer key/value tensors, positions, block tables, prefix identifiers, tenant/session scope, and memory policy.

Owns

Block allocation, mapping, sharing, reference counts, eviction, offload, restore, isolation, and lifecycle.

Emits

Attention-ready cache blocks, allocation/eviction events, reuse hits, transfers, and cleanup evidence.

Does not own

Permission to retain sensitive prompts indefinitely or share state across business contexts.

Failure modes

OOM, fragmentation, stale/incorrect reuse, cross-tenant leakage, eviction thrash, slow offload, orphaned blocks, and cache poisoning.

Evidence and metrics

Allocated/free blocks, active tokens, hit rate, prefill avoided, evictions, transfer time, fragmentation, cleanup latency, and cache age.

Why KV state exists

Self-attention at each decode step reads key and value tensors derived from all prior tokens. Retaining them means each step computes keys and values only for the new token instead of recomputing them for the entire prefix.

Implementation

Size with the exact model: active tokens × layers × two tensors × KV heads × head dimension × bytes per element, plus layout and metadata.
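The sizing product above can be sketched as a small calculator. This is a minimal illustration, not a measurement: the `KVShape` class, its field names, and the example shape are assumptions made for this sketch, and the result excludes block padding, layout overhead, and allocator metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Hypothetical model description; field names are illustrative."""
    num_layers: int
    num_kv_heads: int       # KV heads, not query heads (matters for GQA/MQA)
    head_dim: int
    bytes_per_element: int  # e.g. 2 for fp16/bf16, 1 for an 8-bit cache


def kv_bytes_per_token(shape: KVShape) -> int:
    # Two tensors (key and value) per layer, per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def kv_bytes(shape: KVShape, active_tokens: int) -> int:
    # Logical payload only: no padding, layout, or metadata overhead.
    return active_tokens * kv_bytes_per_token(shape)


# Illustrative shape only; substitute the values of the exact model served.
example = KVShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
print(kv_bytes_per_token(example))       # 131072 bytes per token
print(kv_bytes(example, 8192) / 2**30)   # 1.0 (GiB) for 8192 active tokens
```

Comparing this logical figure with the physical allocation reported by the runtime is one way to see how much layout and metadata add on top.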

Operational implications

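Grouped-query and multi-query attention use fewer KV heads than query heads, so cache size follows the KV-head count rather than the query-head count. Do not publish a universal bytes-per-token number; the figure depends on the model architecture and cache precision.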

Measure

Measured bytes per token, active tokens, layer and KV-head counts, cache precision, and physical allocation.

Paged allocation

Fixed-size physical blocks remove the requirement for one contiguous buffer per sequence and reduce fragmentation: unused space is confined to each sequence's partially filled final block.

Implementation

Map logical token blocks through per-sequence block tables and maintain reference counts for shared blocks.
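The block-table and reference-count idea can be sketched in a few dozen lines. This is a toy model under stated assumptions: `BlockPool` and `Sequence` are hypothetical names, blocks are plain integers rather than tensors, and the copy-on-write step only swaps block ids where a real engine would also copy tensor contents.

```python
class BlockPool:
    """Toy pool of fixed-size physical blocks with reference counts."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per block
        self.free = list(range(num_blocks))     # free physical block ids
        self.refcount = {}                      # physical block id -> references

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.pool.block_size == 0:
            # Last block is full (or there is none yet): take a new block.
            self.block_table.append(self.pool.allocate())
        elif self.pool.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared.
            old = self.block_table[-1]
            self.block_table[-1] = self.pool.allocate()
            self.pool.release(old)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        child = Sequence(self.pool)
        child.block_table = [self.pool.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self) -> None:
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8, block_size=4)
parent = Sequence(pool)
for _ in range(6):
    parent.append_token()      # 6 tokens occupy 2 blocks
child = parent.fork()          # shares both blocks, no new allocation
child.append_token()           # copy-on-write of the shared last block
print(len(pool.free))          # 5: two original blocks plus one copy in use
```

Sharing here happens only through an explicit `fork`; nothing in the allocator discovers that two unrelated requests carry the same prefix, which is why paged allocation alone does not provide prefix reuse.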

Operational implications

Block size trades final-block waste against metadata and lookup overhead.
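A short calculation makes the trade-off concrete. The sketch below assumes one block-table entry per allocated block and uses made-up sequence lengths; `block_stats` is a hypothetical helper, not part of any runtime.

```python
import math


def block_stats(seq_lens, block_size):
    """Return (blocks allocated, unused token slots in final blocks)."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    wasted_slots = blocks * block_size - sum(seq_lens)
    return blocks, wasted_slots


seq_lens = [5, 130, 1000]  # hypothetical active sequence lengths
for block_size in (8, 16, 64):
    print(block_size, block_stats(seq_lens, block_size))
# 8  -> (143, 9)
# 16 -> (73, 33)
# 64 -> (20, 145)
```

Smaller blocks waste fewer slots but need more table entries to track and look up; larger blocks do the reverse.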

Measure

Block occupancy, final-block waste, allocation latency, free-list pressure, and OOM.

Prefix caching

Exact token prefixes such as shared system prompts or repeated documents can reuse previously computed cache state.

Implementation

Key by model, adapter, tokenizer, relevant configuration, tokens, and security scope. Validate integrity and invalidate on version changes.
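One way to build such a key is to hash every field that affects the cached tensors together with the security scope. The sketch below is illustrative: `prefix_cache_key` and its parameter names are assumptions, and real engines may hash per block or chain block hashes instead of hashing the whole prefix at once.

```python
import hashlib
import json


def prefix_cache_key(*, model_id, adapter_id, tokenizer_id, config, token_ids, scope):
    """Derive a key that changes whenever any listed input changes."""
    header = json.dumps(
        {
            "model": model_id,
            "adapter": adapter_id,
            "tokenizer": tokenizer_id,
            "config": config,   # only settings that affect KV contents
            "scope": scope,     # tenant or data-classification boundary
        },
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")

    digest = hashlib.sha256()
    # Length-prefix the header so header bytes cannot blend into token bytes.
    digest.update(len(header).to_bytes(8, "big"))
    digest.update(header)
    for token in token_ids:
        digest.update(int(token).to_bytes(8, "big"))
    return digest.hexdigest()


base = dict(
    model_id="model-x@v1",            # hypothetical identifiers
    adapter_id=None,
    tokenizer_id="tok@v1",
    config={"kv_dtype": "fp16"},
    token_ids=[101, 7592, 2088],
)
key_a = prefix_cache_key(**base, scope="tenant-a")
key_b = prefix_cache_key(**base, scope="tenant-b")
print(key_a == key_b)  # False: identical tokens, different security scope
```

Because the model and tokenizer versions are part of the key, a version change yields new keys and old entries simply stop matching; explicit invalidation is still needed to reclaim their memory.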

Operational implications

A hit is valuable in proportion to prefill work avoided, not just request count.
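The difference between counting hits and counting avoided work shows up in a small example. The function name and the sample numbers below are hypothetical.

```python
def reuse_summary(requests):
    """requests: list of (prompt_tokens, matched_prefix_tokens) pairs."""
    if not requests:
        raise ValueError("no requests to summarize")
    hits = sum(1 for _, matched in requests if matched > 0)
    total_prompt = sum(prompt for prompt, _ in requests)
    total_matched = sum(matched for _, matched in requests)
    return {
        "request_hit_rate": hits / len(requests),
        "token_hit_rate": total_matched / total_prompt,
        "prefill_tokens_avoided": total_matched,
    }


# Three short prompts hit the cache; one long prompt misses entirely.
sample = [(4000, 0), (200, 100), (200, 100), (200, 100)]
summary = reuse_summary(sample)
print(summary["request_hit_rate"])        # 0.75
print(round(summary["token_hit_rate"], 3))  # 0.065
```

Here three of four requests hit, yet only about 6.5% of prompt tokens skipped prefill, so the request-level hit rate overstates the benefit.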

Measure

Hit rate, matched tokens, prefill tokens/time avoided, age, and evictions.

RadixAttention and branching reuse

A radix tree stores each shared token prefix once, on the path from the root, and lets every conversation or agent branch diverge into its own subtree.

Implementation

Use reference counts, longest-prefix lookup, and leaf-aware eviction. Preserve tenant and model boundaries.
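These pieces can be sketched with a simplified structure. For brevity the example uses a plain token-level trie rather than a compressed radix tree, keeps one root per (tenant, model) scope, and tracks recency with a logical clock; all class and method names are assumptions for illustration and do not reflect any specific engine.

```python
import itertools


class Node:
    def __init__(self, parent=None, token=None):
        self.parent = parent
        self.token = token
        self.children = {}
        self.refcount = 0    # active sequences pinning this node
        self.last_used = 0   # logical clock value of the last touch


class PrefixTrie:
    def __init__(self):
        self.roots = {}                  # (tenant, model) -> root Node
        self.clock = itertools.count(1)

    def insert(self, scope, tokens):
        node = self.roots.setdefault(scope, Node())
        now = next(self.clock)
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                child = Node(parent=node, token=token)
                node.children[token] = child
            child.last_used = now
            node = child
        return node

    def match(self, scope, tokens):
        """Return how many leading tokens are cached within this scope."""
        node = self.roots.get(scope)
        if node is None:
            return 0
        now = next(self.clock)
        matched = 0
        for token in tokens:
            node = node.children.get(token)
            if node is None:
                break
            node.last_used = now
            matched += 1
        return matched

    def evict_one_leaf(self, scope):
        """Remove the least recently used unpinned leaf, if any."""
        root = self.roots.get(scope)
        if root is None:
            return False
        leaves, stack = [], [root]
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(node.children.values())
            elif node is not root and node.refcount == 0:
                leaves.append(node)
        if not leaves:
            return False
        victim = min(leaves, key=lambda n: n.last_used)
        del victim.parent.children[victim.token]
        return True


trie = PrefixTrie()
scope_a = ("tenant-a", "model-x")   # hypothetical scope values
trie.insert(scope_a, [1, 2, 3, 4])
trie.insert(scope_a, [1, 2, 9])
print(trie.match(scope_a, [1, 2, 3, 7]))                   # 3
print(trie.match(("tenant-b", "model-x"), [1, 2, 3, 4]))   # 0
```

Evicting only leaves keeps shared interior nodes alive while any branch still depends on them, and separate roots mean a lookup never crosses a tenant or model boundary.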

Operational implications

Branch-heavy multi-turn and agent workflows can benefit more than unrelated prompts.

Measure

Tree nodes, shared tokens, branch depth, hit length, eviction/re-miss, and lookup time.

Tiered cache offload

KV state may move from GPU memory to host RAM, NVMe, or remote cache tiers.

Implementation

Compare lookup, serialization, transfer, restore, and synchronization with recomputation. Use topology-aware transport and checksums.
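The comparison can be written as a simple cost model. Everything numeric below is a placeholder assumption: the bandwidth, fixed overhead, and prefill throughput must come from measurement, and modelling prefill as linear in token count is a simplification because attention cost grows faster than linearly with context length.

```python
def restore_seconds(kv_bytes, link_bytes_per_s, fixed_overhead_s):
    # Fixed overhead stands in for lookup, serialization, and synchronization.
    return fixed_overhead_s + kv_bytes / link_bytes_per_s


def recompute_seconds(prefix_tokens, prefill_tokens_per_s):
    # Simplified linear model of prefill time.
    return prefix_tokens / prefill_tokens_per_s


def prefer_restore(prefix_tokens, bytes_per_token, link_bytes_per_s,
                   fixed_overhead_s, prefill_tokens_per_s):
    restore = restore_seconds(prefix_tokens * bytes_per_token,
                              link_bytes_per_s, fixed_overhead_s)
    recompute = recompute_seconds(prefix_tokens, prefill_tokens_per_s)
    return restore < recompute


# Hypothetical figures for illustration only.
common = dict(bytes_per_token=131072, fixed_overhead_s=0.005,
              prefill_tokens_per_s=10_000)

print(prefer_restore(8192, link_bytes_per_s=20e9, **common))   # True: long prefix, fast link
print(prefer_restore(32, link_bytes_per_s=20e9, **common))     # False: short prompt
print(prefer_restore(8192, link_bytes_per_s=0.1e9, **common))  # False: slow link
```

The three calls correspond to the cases where offload helps (long reusable prefix over a fast link) and where it hurts (short prompts or slow links).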

Operational implications

Offload can hurt short prompts, low reuse, or slow links. Lower tiers expand the security and retention boundary.

Measure

Bytes transferred, tier hit, restore latency, failures, and recompute comparison.

Cache-aware routing

A router can send a request to a worker that already holds the longest useful prefix.

Implementation

Combine locality with queue, memory headroom, health, and tenant policy; expire metadata on eviction or restart.
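A scoring function along these lines might look as follows. The `Worker` fields, the weights, and the linear score are assumptions chosen for illustration; a production router would tune them against measured queueing and prefill costs.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    healthy: bool
    queue_depth: int
    free_blocks: int
    cached_prefix_tokens: int   # from location metadata, which may be stale
    tenants_allowed: frozenset


def pick_worker(workers, tenant, blocks_needed,
                locality_weight=1.0, queue_penalty=200.0):
    """Filter on hard constraints first, then trade locality against queueing."""
    candidates = [
        w for w in workers
        if w.healthy
        and tenant in w.tenants_allowed
        and w.free_blocks >= blocks_needed
    ]
    if not candidates:
        return None

    def score(w):
        return locality_weight * w.cached_prefix_tokens - queue_penalty * w.queue_depth

    return max(candidates, key=score)


tenants = frozenset({"tenant-a"})
busy_hit = Worker("w1", True, queue_depth=10, free_blocks=64,
                  cached_prefix_tokens=1000, tenants_allowed=tenants)
idle_miss = Worker("w2", True, queue_depth=0, free_blocks=64,
                   cached_prefix_tokens=0, tenants_allowed=tenants)

print(pick_worker([busy_hit, idle_miss], "tenant-a", blocks_needed=8).name)  # w2
```

With these weights the idle worker wins despite holding no prefix, because ten queued requests outweigh 1000 cached tokens; health, tenant policy, and memory headroom act as filters rather than score terms.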

Operational implications

Locality-only routing can overload one worker or rely on stale state.

Measure

Locality score, routed hit length, queue delta, stale-location misses, and load balance.

Isolation and retention

KV state can contain representations of sensitive input and must be scoped and deleted according to policy.

Implementation

Include tenant/data classification in cache keys, prohibit unauthorized sharing, audit reuse, and define cleanup or zeroization requirements.
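The ownership check, audit trail, and zeroization requirement can be illustrated with a host-memory stand-in. This sketch uses a `bytearray` in place of device memory and a hypothetical `ScopedBlockPool` class; clearing GPU memory would need device-side operations that are not shown here.

```python
class ScopedBlockPool:
    """Toy pool that enforces tenant ownership and zeroizes blocks on release."""

    def __init__(self, num_blocks, block_bytes):
        self.block_bytes = block_bytes
        self.storage = bytearray(num_blocks * block_bytes)
        self.free = list(range(num_blocks))
        self.owner = {}    # block id -> tenant
        self.audit = []    # (event, tenant, block) tuples

    def _span(self, block):
        start = block * self.block_bytes
        return slice(start, start + self.block_bytes)

    def _check(self, block, tenant):
        if self.owner.get(block) != tenant:
            self.audit.append(("denied", tenant, block))
            raise PermissionError(f"tenant {tenant!r} does not own block {block}")

    def allocate(self, tenant):
        if not self.free:
            raise MemoryError("no free blocks")
        block = self.free.pop()
        self.owner[block] = tenant
        self.audit.append(("allocate", tenant, block))
        return block

    def write(self, block, tenant, data):
        self._check(block, tenant)
        if len(data) != self.block_bytes:
            raise ValueError("data must fill exactly one block")
        self.storage[self._span(block)] = data

    def read(self, block, tenant):
        self._check(block, tenant)
        return bytes(self.storage[self._span(block)])

    def release(self, block, tenant):
        self._check(block, tenant)
        # Overwrite before the block returns to the free list.
        self.storage[self._span(block)] = bytes(self.block_bytes)
        del self.owner[block]
        self.free.append(block)
        self.audit.append(("release_zeroized", tenant, block))


pool = ScopedBlockPool(num_blocks=2, block_bytes=4)
block = pool.allocate("tenant-a")
pool.write(block, "tenant-a", b"\x07\x07\x07\x07")
pool.release(block, "tenant-a")
reused = pool.allocate("tenant-b")
print(reused == block, pool.read(reused, "tenant-b"))  # True b'\x00\x00\x00\x00'
```

Without the overwrite in `release`, the second tenant would read the first tenant's bytes from the recycled block, which is the residue problem that logical free alone does not solve.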

Operational implications

Memory pools may retain bytes after logical free. Stronger isolation can require process boundaries or memory clearing.

Measure

Cross-tenant share attempts, retention age, cleanup/zeroization, deletion coverage, and audit events.

Lifecycle and failure recovery

Completion, cancellation, timeout, model unload, and worker failure must release or reconcile cache references.

Implementation

Make release idempotent. On worker failure, invalidate location metadata and decide whether to restore, recompute, or fail.
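Idempotent release can be as simple as removing the ownership record atomically and treating a missing record as already released. The `SequenceRegistry` class below is a hypothetical sketch that tracks block ids only; it also shows an orphan scan that reconciles ownership against the scheduler's set of live sequences.

```python
class SequenceRegistry:
    """Tracks which sequence owns which blocks; release is safe to repeat."""

    def __init__(self, num_blocks):
        self.free_blocks = set(range(num_blocks))
        self.owned = {}   # seq_id -> list of block ids

    def allocate(self, seq_id, count):
        if len(self.free_blocks) < count:
            raise MemoryError("not enough free blocks")
        blocks = [self.free_blocks.pop() for _ in range(count)]
        self.owned.setdefault(seq_id, []).extend(blocks)
        return blocks

    def release(self, seq_id):
        """Return the number of blocks freed; 0 if already released."""
        blocks = self.owned.pop(seq_id, [])
        self.free_blocks.update(blocks)
        return len(blocks)

    def orphan_scan(self, live_seq_ids):
        """Free blocks whose owner is no longer known to the scheduler."""
        orphans = [s for s in self.owned if s not in live_seq_ids]
        return sum(self.release(s) for s in orphans)


registry = SequenceRegistry(num_blocks=8)
registry.allocate("req-1", 3)
registry.allocate("req-2", 2)

print(registry.release("req-1"))              # 3: completion path
print(registry.release("req-1"))              # 0: late cancellation is harmless
print(registry.orphan_scan(live_seq_ids=set()))  # 2: req-2 was never released
print(len(registry.free_blocks))              # 8
```

Because completion, cancellation, and timeout can race, each path can call `release` without coordinating with the others, and the periodic scan catches sequences that no path released.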

Operational implications

Orphaned blocks create slow capacity leaks that appear as declining concurrency.

Measure

Cancel-to-release time, orphan scans, recovered blocks, restore success, and memory drift.

Reference tables

KV-cache mechanisms
Mechanism	Primary benefit	Best-fit workload	Primary risk
Paged allocation	Reduce fragmentation and increase active capacity	Variable-length concurrent generation	Metadata/kernel complexity
Prefix caching	Avoid repeated exact-prefix prefill	Shared system prompts and repeated RAG context	Retention and invalidation
Radix-tree reuse	Share branching prefixes	Multi-turn and agent workflows	Tree eviction and routing complexity
Host-memory offload	Expand effective cache capacity	Reusable long prefixes with fast links	Transfer latency and host pressure
NVMe/remote tier	Persist/share very large caches	Repeated long-document analysis	Storage/network latency and privacy
Cache-aware routing	Preserve locality across replicas	Multi-replica serving	Load imbalance and stale metadata

Storage tier trade-offs
Tier	Relative latency	Capacity	Typical use	Risk
GPU HBM/VRAM	Low	Limited	Active KV blocks	OOM and weight contention
CPU RAM	Moderate	Larger	Warm offload	NUMA and transfer overhead
Local NVMe	High	Large	Cold reusable prefixes	I/O contention
Remote storage/cache	Highest/variable	Very large	Fleet sharing/long retention	Network, consistency, privacy

Decision checklist

What exact architecture and cache precision determine bytes per token?
What active-token and block budget exists per replica and tenant?
Which prefixes are safe and valuable to reuse?
What fields make up the cache key and invalidation boundary?
When is offload faster than recomputation?
How are location and ownership coordinated across workers?
What cleanup or zeroization is required?

Common mistakes

Using a universal KV bytes-per-token number.
Equating paged allocation with prefix caching.
Reusing across tenants because token hashes match.
Reporting hit rate without prefill work avoided.
Offloading without comparing restore to recomputation.
Leaving cancelled sequences referenced by schedulers or block tables.
Reusing cache after model, adapter, tokenizer, or policy changes.

Sources and further reading

PagedAttention paper
USENIX / vLLM authors · Peer-reviewed paper · accessed 2026-06-21 UTC

Automatic prefix caching
vLLM · Official documentation · accessed 2026-06-21 UTC

RadixAttention
SGLang · Official documentation · accessed 2026-06-21 UTC

LMCache documentation
LMCache · Official documentation · accessed 2026-06-21 UTC

NVIDIA Dynamo documentation
NVIDIA · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
