Key takeaways
- KV-cache demand grows with active tokens and model architecture, so context length and concurrency compete for one memory pool.
- Paged allocation reduces fragmentation but does not by itself create cross-request prefix reuse.
- Prefix caching reuses exact token prefixes; radix-tree organization makes branching reuse explicit.
- Tiered offload adds capacity only when transfer plus restore is cheaper than recomputation.
- Cache isolation, retention, provenance, and eviction are governance concerns as well as performance choices.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Per-layer key/value tensors, positions, block tables, prefix identifiers, tenant/session scope, and memory policy.
Owns
Block allocation, mapping, sharing, reference counts, eviction, offload, restore, isolation, and lifecycle.
Emits
Attention-ready cache blocks, allocation/eviction events, reuse hits, transfers, and cleanup evidence.
Does not own
Permission to retain sensitive prompts indefinitely or share state across business contexts.
Failure modes
OOM, fragmentation, stale/incorrect reuse, cross-tenant leakage, eviction thrash, slow offload, orphaned blocks, and cache poisoning.
Evidence and metrics
Allocated/free blocks, active tokens, hit rate, prefill avoided, evictions, transfer time, fragmentation, cleanup latency, and cache age.
Why KV state exists
Self-attention uses key and value tensors derived from prior tokens. Retaining them avoids recomputing the entire prefix for every decode step.
Implementation
Size with the exact model: active tokens × layers × two tensors × KV heads × head dimension × bytes per element, plus layout and metadata.
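The sizing rule above can be written as a small calculation. This is a minimal sketch: the `KVShape` class and both example shapes are hypothetical and describe no particular published model, and the result excludes layout padding and block metadata, which have to be measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Hypothetical model description; fill in from the real model config."""
    num_layers: int
    num_kv_heads: int       # may be fewer than query heads under GQA/MQA
    head_dim: int
    bytes_per_element: int  # e.g. 2 for fp16/bf16, 1 for an 8-bit cache


def kv_bytes_per_token(shape: KVShape) -> int:
    # Two tensors (key and value) per layer, per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def kv_bytes(shape: KVShape, active_tokens: int) -> int:
    # Tensor payload only; padding and metadata are not included.
    return active_tokens * kv_bytes_per_token(shape)


# Illustrative shapes only: same depth and head size, different KV head counts.
mha = KVShape(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2)
gqa = KVShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
```

With these assumed shapes the per-token payload differs by a factor of four, which is why the number has to be derived per model rather than quoted once.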
Operational implications
Grouped-query or multi-query attention can use fewer KV heads. Do not publish a universal bytes-per-token number.
Measure
Measured bytes per token, active tokens, layer and KV-head counts, cache precision, and physical allocation.
Paged allocation
Fixed-size physical blocks remove the need for one contiguous buffer per sequence. External fragmentation largely disappears, and internal waste is confined to each sequence's final, partially filled block.
Implementation
Map logical token blocks through per-sequence block tables and maintain reference counts for shared blocks.
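The block-table and reference-count mechanics can be sketched as follows. `BlockPool` and `Sequence` are hypothetical names for illustration, the pool tracks block IDs only (no tensor data), and the copy-on-write step stands in for the tensor copy a real runtime would perform.

```python
from typing import List


class BlockPool:
    """Toy fixed-size block pool with reference counts."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.refcount: List[int] = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Maps logical block index -> physical block ID via a block table."""

    def __init__(self, pool: BlockPool) -> None:
        self.pool = pool
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.pool.block_size == 0:
            # Previous block is full (or there is none): take a new one.
            self.block_table.append(self.pool.allocate())
        elif self.pool.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled tail block is shared,
            # so take a private block (a real system copies the data here).
            old = self.block_table[-1]
            self.block_table[-1] = self.pool.allocate()
            self.pool.release(old)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        child = Sequence(self.pool)
        child.block_table = [self.pool.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self) -> None:
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0
```

A forked sequence shares every block with its parent until it writes into a shared tail block, at which point only that block is duplicated.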
Operational implications
Block size trades final-block waste against metadata and lookup overhead.
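The trade-off can be made concrete with a small calculation. This is an illustrative sketch: `block_stats` is a hypothetical helper, and the sequence lengths in the usage note are made up.

```python
import math
from typing import Dict, Sequence


def block_stats(seq_lengths: Sequence[int], block_size: int) -> Dict[str, float]:
    """Count blocks and unused token slots for a set of sequence lengths."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lengths)
    capacity = blocks * block_size
    used = sum(seq_lengths)
    return {
        "blocks": blocks,                 # proxy for metadata/lookup overhead
        "wasted_slots": capacity - used,  # final-block waste
        "occupancy": used / capacity if capacity else 0.0,
    }
```

For assumed lengths of 5, 17, and 33 tokens, a block size of 16 needs 6 blocks and leaves 41 slots unused, while a block size of 4 needs 16 blocks and leaves 9 unused: less waste, more table entries to manage.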
Measure
Block occupancy, final-block waste, allocation latency, free-list pressure, and OOM.
Prefix caching
Exact token prefixes such as shared system prompts or repeated documents can reuse previously computed cache state.
Implementation
Key by model, adapter, tokenizer, relevant configuration, tokens, and security scope. Validate integrity and invalidate on version changes.
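One way to build such a key is a chained hash per block, where each block's key covers its own tokens plus the key of the preceding block. The sketch below is an assumption about how the fields could be combined, not the scheme of any particular serving engine; `block_key` and all of its parameter names are hypothetical.

```python
import hashlib
from typing import Optional, Sequence


def block_key(
    parent_key: Optional[str],
    token_ids: Sequence[int],
    *,
    model_id: str,
    adapter_id: str,
    tokenizer_id: str,
    config_tag: str,
    security_scope: str,
) -> str:
    """Hash one block of tokens together with everything that scopes reuse."""
    h = hashlib.sha256()
    fields = (model_id, adapter_id, tokenizer_id, config_tag,
              security_scope, parent_key or "")
    for value in fields:
        data = value.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    for token in token_ids:
        h.update(int(token).to_bytes(8, "big"))
    return h.hexdigest()
```

Because the scope fields are part of the digest, changing the model version, adapter, tokenizer, or tenant scope yields a different key, so stale or out-of-scope entries simply miss.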
Operational implications
A hit is valuable in proportion to prefill work avoided, not just request count.
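The difference between counting hits and counting avoided work can be shown with two ratios. This is a minimal sketch with a hypothetical `hit_metrics` helper and made-up request sizes.

```python
from typing import Dict, Iterable, Tuple


def hit_metrics(requests: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """requests: (prompt_tokens, matched_prefix_tokens) per request."""
    reqs = list(requests)
    if not reqs:
        return {"request_hit_rate": 0.0, "token_hit_rate": 0.0}
    prompt_total = sum(p for p, _ in reqs)
    matched_total = sum(m for _, m in reqs)
    hits = sum(1 for _, m in reqs if m > 0)
    return {
        "request_hit_rate": hits / len(reqs),
        "token_hit_rate": matched_total / prompt_total if prompt_total else 0.0,
    }
```

With assumed requests of `(100, 16)`, `(100, 16)`, and `(8000, 0)`, two of three requests hit, yet fewer than 1% of prompt tokens skip prefill.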
Measure
Hit rate, matched tokens, prefill tokens/time avoided, age, and evictions.
RadixAttention and branching reuse
A radix tree stores common token prefixes as shared nodes near the root, with conversation-specific or agent-specific continuations branching toward the leaves.
Implementation
Use reference counts, longest-prefix lookup, and leaf-aware eviction. Preserve tenant and model boundaries.
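The lookup and eviction rules can be sketched with a block-granular prefix tree. This is a simplification, not RadixAttention itself: edges are fixed-size token blocks rather than compressed variable-length runs, all names are hypothetical, and one tree per tenant and model scope is assumed so that boundaries are never crossed inside a tree.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Block = Tuple[int, ...]


@dataclass(eq=False)
class Node:
    parent: Optional["Node"] = None
    edge: Optional[Block] = None
    children: Dict[Block, "Node"] = field(default_factory=dict)
    refcount: int = 0   # > 0 while an active request pins this node
    last_used: int = 0


class PrefixTree:
    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.root = Node()
        self._clock = itertools.count(1)

    def _blocks(self, tokens: Sequence[int]) -> List[Block]:
        # Only full blocks are cacheable; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, full, self.block_size)]

    def match(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens are covered by cached blocks."""
        node, matched, now = self.root, 0, next(self._clock)
        for block in self._blocks(tokens):
            child = node.children.get(block)
            if child is None:
                break
            child.last_used = now
            node, matched = child, matched + self.block_size
        return matched

    def insert(self, tokens: Sequence[int]) -> Node:
        """Add all full blocks of tokens; return the deepest node."""
        node, now = self.root, next(self._clock)
        for block in self._blocks(tokens):
            child = node.children.get(block)
            if child is None:
                child = Node(parent=node, edge=block)
                node.children[block] = child
            child.last_used = now
            node = child
        return node

    def _walk(self, node: Node) -> Iterator[Node]:
        yield node
        for child in node.children.values():
            yield from self._walk(child)

    def evict_one(self) -> bool:
        """Evict the least recently used leaf that no request references."""
        leaves = [n for n in self._walk(self.root)
                  if n.parent is not None and not n.children and n.refcount == 0]
        if not leaves:
            return False
        victim = min(leaves, key=lambda n: n.last_used)
        del victim.parent.children[victim.edge]
        return True
```

Evicting only unreferenced leaves keeps shared prefixes alive for as long as any branch below them survives.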
Operational implications
Branch-heavy multi-turn and agent workflows can benefit more than unrelated prompts.
Measure
Tree nodes, shared tokens, branch depth, hit length, eviction/re-miss, and lookup time.
Tiered cache offload
KV state may move from GPU memory to host RAM, NVMe, or remote cache tiers.
Implementation
Compare lookup, serialization, transfer, restore, and synchronization with recomputation. Use topology-aware transport and checksums.
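The comparison can be reduced to two estimates and a safety margin. This is a rough sketch under stated assumptions: all function names are hypothetical, every input is a value you measure on your own hardware, and prefill cost is treated as linear in tokens even though attention cost grows faster for long prefixes.

```python
def restore_seconds(kv_bytes: int, link_bytes_per_s: float,
                    lookup_s: float, overhead_s: float) -> float:
    """Lookup + transfer + (deserialization and synchronization) overhead."""
    return lookup_s + kv_bytes / link_bytes_per_s + overhead_s


def recompute_seconds(prefix_tokens: int, prefill_tokens_per_s: float) -> float:
    """Prefill estimate from a measured throughput (a linear approximation)."""
    return prefix_tokens / prefill_tokens_per_s


def should_restore(restore_s: float, recompute_s: float,
                   margin: float = 1.2) -> bool:
    # Require restore to win by a margin, since estimates are noisy.
    return restore_s * margin < recompute_s
```

With assumed numbers (1 GiB of KV state for 8192 tokens, 20,000 prefill tokens/s), a 16 GB/s link favors restore while a 1 GB/s link favors recomputation; for a 64-token prefix the fixed lookup and overhead cost alone can exceed the prefill time.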
Operational implications
Offload can hurt short prompts, low reuse, or slow links. Lower tiers expand the security and retention boundary.
Measure
Bytes transferred, per-tier hit rate, restore latency, restore failures, and restore time compared with measured recomputation time.
Cache-aware routing
A router can send a request to a worker that already holds the longest useful prefix.
Implementation
Combine locality with queue, memory headroom, health, and tenant policy; expire metadata on eviction or restart.
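A scoring function along these lines might look like the sketch below. `WorkerView`, `pick_worker`, and the `queue_penalty_tokens` weight are hypothetical; the weight in particular is an assumption that would need tuning against real queue and prefill measurements.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class WorkerView:
    """Router's (possibly stale) view of one worker."""
    name: str
    healthy: bool
    matched_tokens: int       # longest cached prefix for this request
    queue_depth: int
    free_blocks: int
    allowed_tenants: FrozenSet[str]


def pick_worker(workers: Iterable[WorkerView], tenant: str, blocks_needed: int,
                *, queue_penalty_tokens: int = 256) -> Optional[WorkerView]:
    # Hard filters first: health, tenant policy, and memory headroom.
    eligible = [w for w in workers
                if w.healthy
                and tenant in w.allowed_tenants
                and w.free_blocks >= blocks_needed]
    if not eligible:
        return None
    # Locality is discounted by queue depth; free blocks break ties.
    return max(eligible, key=lambda w: (
        w.matched_tokens - queue_penalty_tokens * w.queue_depth,
        w.free_blocks,
    ))
```

Treating health, tenant policy, and headroom as filters rather than score terms means a long cached prefix can never pull a request onto a worker that is not allowed or not able to serve it.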
Operational implications
Locality-only routing can overload one worker or rely on stale state.
Measure
Locality score, routed hit length, queue delta, stale-location misses, and load balance.
Isolation and retention
KV state can contain representations of sensitive input and must be scoped and deleted according to policy.
Implementation
Include tenant/data classification in cache keys, prohibit unauthorized sharing, audit reuse, and define cleanup or zeroization requirements.
Operational implications
Memory pools may retain bytes after logical free. Stronger isolation can require process boundaries or memory clearing.
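The retention problem is easy to reproduce in miniature. The sketch below uses a plain `bytearray` as a stand-in for a device memory pool; `ScopedSlab` and the `(tenant, classification)` scope tuple are hypothetical, and real zeroization on accelerators needs device-side clearing rather than host code.

```python
from typing import List, Optional, Tuple

Scope = Tuple[str, str]  # (tenant, data classification)


class ScopedSlab:
    """Toy block pool that tags each block with a scope and audits reads."""

    def __init__(self, num_blocks: int, block_bytes: int) -> None:
        self.block_bytes = block_bytes
        self.memory = bytearray(num_blocks * block_bytes)
        self.owner: List[Optional[Scope]] = [None] * num_blocks
        self.audit: List[Tuple[str, int, Scope]] = []

    def _view(self, block: int) -> memoryview:
        start = block * self.block_bytes
        return memoryview(self.memory)[start:start + self.block_bytes]

    def write(self, block: int, scope: Scope, data: bytes) -> None:
        if len(data) > self.block_bytes:
            raise ValueError("data larger than block")
        self.owner[block] = scope
        self._view(block)[:len(data)] = data

    def read(self, block: int, scope: Scope) -> bytes:
        if self.owner[block] != scope:
            self.audit.append(("denied", block, scope))
            raise PermissionError("block belongs to a different scope")
        self.audit.append(("read", block, scope))
        return bytes(self._view(block))

    def free(self, block: int, zeroize: bool = True) -> None:
        if zeroize:
            # Overwrite the payload before the block can be reallocated.
            self._view(block)[:] = bytes(self.block_bytes)
        self.owner[block] = None
```

Calling `free(block, zeroize=False)` clears ownership but leaves the bytes in the pool, which is the gap between logical free and actual deletion.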
Measure
Cross-tenant share attempts, retention age, cleanup/zeroization, deletion coverage, and audit events.
Lifecycle and failure recovery
Completion, cancellation, timeout, model unload, and worker failure must release or reconcile cache references.
Implementation
Make release idempotent. On worker failure, invalidate location metadata and decide whether to restore, recompute, or fail.
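Idempotent release and location invalidation can be sketched with a small ledger. `CacheLedger` and its fields are hypothetical, and the sketch covers bookkeeping only; the restore, recompute, or fail decision after a worker failure is left to policy.

```python
from typing import Dict, Iterable, List, Set


class CacheLedger:
    """Tracks which sequences reference which blocks, and where prefixes live."""

    def __init__(self) -> None:
        self.block_refs: Dict[int, int] = {}
        self.seq_blocks: Dict[str, List[int]] = {}
        self.locations: Dict[str, Set[str]] = {}  # prefix key -> worker IDs

    def attach(self, seq_id: str, blocks: Iterable[int]) -> None:
        if seq_id in self.seq_blocks:
            raise ValueError(f"sequence {seq_id!r} already attached")
        owned = list(blocks)
        self.seq_blocks[seq_id] = owned
        for block in owned:
            self.block_refs[block] = self.block_refs.get(block, 0) + 1

    def release(self, seq_id: str) -> List[int]:
        """Drop a sequence's references; safe to call more than once.

        Returns the blocks whose reference count reached zero.
        """
        owned = self.seq_blocks.pop(seq_id, None)
        if owned is None:
            return []  # already released (completion and cancel may both fire)
        freed: List[int] = []
        for block in owned:
            self.block_refs[block] -= 1
            if self.block_refs[block] == 0:
                del self.block_refs[block]
                freed.append(block)
        return freed

    def on_worker_failure(self, worker_id: str) -> None:
        # Forget every location claim that pointed at the failed worker.
        for key in list(self.locations):
            self.locations[key].discard(worker_id)
            if not self.locations[key]:
                del self.locations[key]
```

Because `release` removes the sequence entry before decrementing, a duplicate call from a timeout handler or cancellation path cannot double-free a shared block.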
Operational implications
Orphaned blocks create slow capacity leaks that appear as declining concurrency.
Measure
Cancel-to-release time, orphan scans, recovered blocks, restore success, and memory drift.
Reference tables
| Mechanism | Primary benefit | Best-fit workload | Primary risk |
|---|---|---|---|
| Paged allocation | Reduce fragmentation and increase active capacity | Variable-length concurrent generation | Metadata/kernel complexity |
| Prefix caching | Avoid repeated exact-prefix prefill | Shared system prompts and repeated RAG context | Retention and invalidation |
| Radix-tree reuse | Share branching prefixes | Multi-turn and agent workflows | Tree eviction and routing complexity |
| Host-memory offload | Expand effective cache capacity | Reusable long prefixes with fast links | Transfer latency and host pressure |
| NVMe/remote tier | Persist/share very large caches | Repeated long-document analysis | Storage/network latency and privacy |
| Cache-aware routing | Preserve locality across replicas | Multi-replica serving | Load imbalance and stale metadata |

| Tier | Relative latency | Capacity | Typical use | Risk |
|---|---|---|---|---|
| GPU HBM/VRAM | Low | Limited | Active KV blocks | OOM and weight contention |
| CPU RAM | Moderate | Larger | Warm offload | NUMA and transfer overhead |
| Local NVMe | High | Large | Cold reusable prefixes | I/O contention |
| Remote storage/cache | Highest/variable | Very large | Fleet sharing/long retention | Network, consistency, privacy |
Decision checklist
- What exact architecture and cache precision determine bytes per token?
- What active-token and block budget exists per replica and tenant?
- Which prefixes are safe and valuable to reuse?
- What fields make up the cache key and invalidation boundary?
- When is offload faster than recomputation?
- How are location and ownership coordinated across workers?
- What cleanup or zeroization is required?
Common mistakes
- Using a universal KV bytes-per-token number.
- Equating paged allocation with prefix caching.
- Reusing across tenants because token hashes match.
- Reporting hit rate without prefill work avoided.
- Offloading without comparing restore to recomputation.
- Leaving cancelled sequences referenced by schedulers or block tables.
- Reusing cache after model, adapter, tokenizer, or policy changes.
Sources and further reading
- PagedAttention paper
- Automatic prefix caching
- RadixAttention
- LMCache documentation
- NVIDIA Dynamo documentation
Last reviewed: 2026-06-21 UTC
