Mechanics

KV Cache and Runtime Memory

A deep guide to KV-cache architecture, PagedAttention, prefix caching, RadixAttention, tiered offload, cache-aware routing, isolation, memory sizing, and observability.

Audience: Technical readers · Reading time: 6 minutes · Status: Foundational · Last reviewed: 2026-06-21 UTC

Key takeaways

  • KV-cache demand grows with active tokens and model architecture, so context length and concurrency compete for one memory pool.
  • Paged allocation reduces fragmentation but does not by itself create cross-request prefix reuse.
  • Prefix caching reuses exact token prefixes; radix-tree organization makes branching reuse explicit.
  • Tiered offload adds capacity only when transfer plus restore is cheaper than recomputation.
  • Cache isolation, retention, provenance, and eviction are governance concerns as well as performance choices.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Per-layer key/value tensors, positions, block tables, prefix identifiers, tenant/session scope, and memory policy.

Owns

Block allocation, mapping, sharing, reference counts, eviction, offload, restore, isolation, and lifecycle.

Emits

Attention-ready cache blocks, allocation/eviction events, reuse hits, transfers, and cleanup evidence.

Does not own

Permission to retain sensitive prompts indefinitely or share state across business contexts.

Failure modes

OOM, fragmentation, stale/incorrect reuse, cross-tenant leakage, eviction thrash, slow offload, orphaned blocks, and cache poisoning.

Evidence and metrics

Allocated/free blocks, active tokens, hit rate, prefill avoided, evictions, transfer time, fragmentation, cleanup latency, and cache age.

Why KV state exists

Self-attention uses key and value tensors derived from prior tokens. Retaining them means each decode step computes keys and values only for the new token instead of recomputing them for the entire prefix.

Implementation

Size from the exact model configuration: active tokens × layers × 2 (one key tensor and one value tensor) × KV heads × head dimension × bytes per element, plus layout and metadata overhead.
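The sizing product above can be sketched as a small function. The configuration values below (32 layers, 8 or 32 KV heads, head dimension 128, 2-byte elements) are hypothetical and chosen only for illustration; layout and metadata overhead are deliberately left out, so treat the result as a lower bound rather than a measured figure.

```python
def kv_cache_bytes(active_tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Raw KV payload: tokens x layers x 2 (key and value) x KV heads x head dim x bytes."""
    return active_tokens * num_layers * 2 * num_kv_heads * head_dim * bytes_per_element


# Hypothetical grouped-query configuration (assumed values, not a real model card).
gqa_per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
print(gqa_per_token)  # 131072 bytes (128 KiB) per token under these assumed values

# Same hypothetical model with one KV head per query head (32 heads).
mha_per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2)
print(mha_per_token // gqa_per_token)  # 4: the cache scales with the KV head count
```

Because the KV head count and element size enter as plain multipliers, two models with the same parameter count can differ several-fold in bytes per token.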

Operational implications

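Grouped-query or multi-query attention can use fewer KV heads than query heads, which shrinks the cache proportionally. Bytes per token therefore depends on the model and cache precision, so do not publish a universal bytes-per-token number.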

Measure

Measured bytes per token, active tokens, layer and KV-head counts, precision, and physical allocation.

Paged allocation

Fixed-size physical blocks remove the requirement for one contiguous sequence buffer and reduce fragmentation.

Implementation

Map logical token blocks through per-sequence block tables and maintain reference counts for shared blocks.
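A minimal sketch of the block-table idea, assuming a single pool of integer block IDs. The class and function names (`BlockAllocator`, `physical_slot`) are hypothetical and do not correspond to any particular runtime's API.

```python
class BlockAllocator:
    """Fixed-size physical blocks with reference counts for shared blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}  # block id -> number of sequences referencing it

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Let another sequence reference an already-filled block."""
        self.refcount[block] += 1
        return block

    def release(self, block):
        """Drop one reference; the block returns to the free list at zero."""
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def physical_slot(block_table, token_index, block_size):
    """Translate a logical token position to (physical block id, offset in block)."""
    return block_table[token_index // block_size], token_index % block_size


alloc = BlockAllocator(num_blocks=4)
seq_a = [alloc.allocate(), alloc.allocate()]        # per-sequence block table
seq_b = [alloc.share(seq_a[0]), alloc.allocate()]   # shares its first block with seq_a
print(physical_slot(seq_a, token_index=17, block_size=16))  # second block, offset 1
```

The sequences need not occupy contiguous memory: each block table is just a list of block IDs, and a shared block is freed only after every referencing sequence has released it.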

Operational implications

Block size trades final-block waste against metadata and lookup overhead.
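The trade-off can be made concrete with a little arithmetic. The sequence length and block sizes below are arbitrary illustrative values, and `block_usage` is a hypothetical helper.

```python
def block_usage(seq_len, block_size):
    """Return (blocks needed, unused token slots in the final block)."""
    blocks = -(-seq_len // block_size)  # ceiling division
    wasted = blocks * block_size - seq_len
    return blocks, wasted


for block_size in (16, 64, 256):
    blocks, wasted = block_usage(1030, block_size)
    print(block_size, blocks, wasted)
# 16  -> 65 block-table entries, 10 wasted slots
# 64  -> 17 block-table entries, 58 wasted slots
# 256 ->  5 block-table entries, 250 wasted slots
```

Smaller blocks waste fewer slots per sequence but lengthen every block table; larger blocks do the opposite.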

Measure

Block occupancy, final-block waste, allocation latency, free-list pressure, and OOM.

Prefix caching

Exact token prefixes such as shared system prompts or repeated documents can reuse previously computed cache state.

Implementation

Key by model, adapter, tokenizer, relevant configuration, tokens, and security scope. Validate integrity and invalidate on version changes.
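One way to build such a key is a chained hash over full token blocks, seeded with the scope fields. This is a sketch under assumptions: the `CacheScope` fields and the `prefix_block_keys` helper are hypothetical, and real runtimes differ in which fields they include and how they hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheScope:
    # Hypothetical scope fields; changing any one yields different keys.
    model_id: str
    adapter_id: str
    tokenizer_id: str
    config_fingerprint: str
    tenant_id: str


def prefix_block_keys(scope, token_ids, block_size):
    """One key per full block; each key commits to the scope and all earlier tokens."""
    keys = []
    parent = hashlib.sha256(repr(scope).encode()).digest()
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial final block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256(parent)
        h.update(b"|" + ",".join(map(str, block)).encode())
        parent = h.digest()
        keys.append(parent.hex())
    return keys


scope_a = CacheScope("model-x", "none", "tok-v1", "cfg-1", "tenant-a")
scope_b = CacheScope("model-x", "none", "tok-v1", "cfg-1", "tenant-b")
shared = list(range(8))
keys_1 = prefix_block_keys(scope_a, shared + [100, 101], block_size=4)
keys_2 = prefix_block_keys(scope_a, shared + [200, 201, 202, 203], block_size=4)
print(keys_1 == keys_2[:2])  # True: the shared 8-token prefix maps to the same keys
print(keys_1[0] == prefix_block_keys(scope_b, shared, 4)[0])  # False: different tenant
```

Because the scope seeds the chain, a version or tenant change invalidates every derived key without a separate invalidation pass.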

Operational implications

A hit is valuable in proportion to prefill work avoided, not just request count.
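To illustrate why request-level hit rate alone can mislead, the sketch below reports it next to a token-weighted view. The `reuse_summary` helper and both workloads are hypothetical.

```python
def reuse_summary(requests):
    """requests: iterable of (prompt_tokens, matched_prefix_tokens) pairs."""
    requests = list(requests)
    hits = sum(1 for _, matched in requests if matched > 0)
    total_prompt = sum(prompt for prompt, _ in requests)
    total_matched = sum(matched for _, matched in requests)
    return {
        "request_hit_rate": hits / len(requests),
        "prefill_tokens_avoided": total_matched,
        "token_hit_rate": total_matched / total_prompt,
    }


# Both hypothetical workloads hit on every request, but avoid very different work.
short_matches = reuse_summary([(2000, 16), (2000, 16)])
long_matches = reuse_summary([(2000, 1920), (2000, 1920)])
print(short_matches["request_hit_rate"], short_matches["prefill_tokens_avoided"])
print(long_matches["request_hit_rate"], long_matches["prefill_tokens_avoided"])
```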

Measure

Hit rate, matched tokens, prefill tokens/time avoided, age, and evictions.

RadixAttention and branching reuse

A radix tree stores common token prefixes as shared paths from the root, with conversation-specific or agent-specific continuations as branches and leaves.

Implementation

Use reference counts, longest-prefix lookup, and leaf-aware eviction that removes unreferenced leaves before the shared ancestors they depend on. Preserve tenant and model boundaries.
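A simplified sketch of these operations follows. It uses one node per token for brevity, whereas a real radix tree compresses runs of tokens into single edges. The `PrefixTree` class, its methods, and the `(tenant, model)` scope tuple are hypothetical.

```python
class Node:
    def __init__(self, parent=None, token=None):
        self.parent, self.token = parent, token
        self.children = {}
        self.ref_count = 0  # active sequences pinning this node


class PrefixTree:
    def __init__(self):
        self.roots = {}  # one root per scope, so scopes never share nodes

    def insert(self, scope, tokens):
        node = self.roots.setdefault(scope, Node())
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(node, tok)
            node = node.children[tok]
        return node

    def longest_prefix(self, scope, tokens):
        """Number of leading tokens already present under this scope."""
        node, matched = self.roots.get(scope), 0
        while node is not None and matched < len(tokens):
            node = node.children.get(tokens[matched])
            if node is not None:
                matched += 1
        return matched

    def evict_one_leaf(self, scope):
        """Remove one unpinned leaf; ancestors survive while they have children."""
        stack = [self.roots[scope]]
        while stack:
            node = stack.pop()
            if not node.children and node.parent is not None and node.ref_count == 0:
                del node.parent.children[node.token]
                return True
            stack.extend(node.children.values())
        return False


tree = PrefixTree()
scope = ("tenant-a", "model-x")
system_prompt = [1, 2, 3]
tree.insert(scope, system_prompt + [10, 11])           # branch 1
pinned = tree.insert(scope, system_prompt + [20])      # branch 2
pinned.ref_count += 1                                   # in use, so not evictable
print(tree.longest_prefix(scope, system_prompt + [20, 21]))          # 4
print(tree.longest_prefix(("tenant-b", "model-x"), system_prompt))   # 0
```

A production policy would typically also order eviction candidates by recency; this sketch only shows that pinned leaves and shared ancestors are skipped.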

Operational implications

Branch-heavy multi-turn and agent workflows can benefit more than unrelated prompts.

Measure

Tree nodes, shared tokens, branch depth, hit length, eviction/re-miss, and lookup time.

Tiered cache offload

KV state may move from GPU memory to host RAM, NVMe, or remote cache tiers.

Implementation

Compare lookup, serialization, transfer, restore, and synchronization with recomputation. Use topology-aware transport and checksums.
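The comparison can be sketched as a simple cost model. All numbers below (bytes per token, link bandwidth, fixed overhead, prefill throughput) are assumed values for illustration; in practice each should come from measurement on the target hardware.

```python
def restore_seconds(kv_bytes, link_bytes_per_s, fixed_overhead_s):
    """Lookup, serialization, and synchronization folded into one fixed overhead."""
    return fixed_overhead_s + kv_bytes / link_bytes_per_s


def recompute_seconds(prefix_tokens, prefill_tokens_per_s):
    return prefix_tokens / prefill_tokens_per_s


def should_restore(prefix_tokens, bytes_per_token, link_bytes_per_s,
                   fixed_overhead_s, prefill_tokens_per_s):
    restore = restore_seconds(prefix_tokens * bytes_per_token,
                              link_bytes_per_s, fixed_overhead_s)
    return restore < recompute_seconds(prefix_tokens, prefill_tokens_per_s)


# Assumed figures: 128 KiB/token, 20 GB/s link, 5 ms overhead, 10k tokens/s prefill.
common = dict(bytes_per_token=131072, fixed_overhead_s=0.005, prefill_tokens_per_s=10_000)
print(should_restore(8000, link_bytes_per_s=20e9, **common))  # True: long prefix, fast link
print(should_restore(32, link_bytes_per_s=20e9, **common))    # False: overhead dominates
print(should_restore(8000, link_bytes_per_s=1e8, **common))   # False: slow link
```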

Operational implications

Offload can hurt short prompts, low reuse, or slow links. Lower tiers expand the security and retention boundary.

Measure

Bytes transferred, tier hit, restore latency, failures, and recompute comparison.

Cache-aware routing

A router can send a request to a worker that already holds the longest useful prefix.

Implementation

Combine locality with queue, memory headroom, health, and tenant policy; expire metadata on eviction or restart.
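A minimal scoring sketch, assuming the router already holds per-worker metadata. The `Worker` fields, the weights, and the `pick_worker` function are hypothetical; the queue penalty expresses, in tokens, how much prefix reuse one queued request is assumed to be worth.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    matched_prefix_tokens: int   # from location metadata, which may be stale
    queue_depth: int
    free_blocks: int
    healthy: bool = True
    allowed_tenants: frozenset = frozenset()


def pick_worker(workers, tenant, blocks_needed, locality_weight=1.0, queue_penalty=256.0):
    """Filter on hard constraints first, then trade locality against queueing."""
    candidates = [
        w for w in workers
        if w.healthy and tenant in w.allowed_tenants and w.free_blocks >= blocks_needed
    ]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda w: locality_weight * w.matched_prefix_tokens - queue_penalty * w.queue_depth,
    )


tenants = frozenset({"tenant-a"})
warm_busy = Worker("w1", matched_prefix_tokens=2000, queue_depth=12, free_blocks=100,
                   allowed_tenants=tenants)
cold_idle = Worker("w2", matched_prefix_tokens=0, queue_depth=1, free_blocks=100,
                   allowed_tenants=tenants)
print(pick_worker([warm_busy, cold_idle], "tenant-a", blocks_needed=10).name)  # w2
```

With these assumed weights the warm worker loses once its queue outweighs the prefill it would save, which is the overload case that locality-only routing misses.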

Operational implications

Locality-only routing can overload one worker or rely on stale state.

Measure

Locality score, routed hit length, queue delta, stale-location misses, and load balance.

Isolation and retention

KV state can contain representations of sensitive input and must be scoped and deleted according to policy.

Implementation

Include tenant/data classification in cache keys, prohibit unauthorized sharing, audit reuse, and define cleanup or zeroization requirements.
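The sketch below illustrates scope-qualified lookup, audit events, and overwrite-on-delete in host memory. The `ScopedKVCache` class is hypothetical, and zeroing a Python `bytearray` only stands in for the idea: clearing GPU or pooled memory requires runtime-specific and device-specific mechanisms that are not shown here.

```python
class ScopedKVCache:
    def __init__(self):
        self._entries = {}  # (tenant, classification, prefix_key) -> bytearray
        self.audit = []     # append-only event list, a stand-in for real audit logging

    def put(self, tenant, classification, prefix_key, payload):
        self._entries[(tenant, classification, prefix_key)] = bytearray(payload)

    def get(self, tenant, classification, prefix_key):
        entry = self._entries.get((tenant, classification, prefix_key))
        if entry is not None:
            self.audit.append(("reuse", tenant, prefix_key))
            return entry
        # Same tokens cached under another scope: record the attempt, never share.
        if any(key[2] == prefix_key for key in self._entries):
            self.audit.append(("cross_scope_denied", tenant, prefix_key))
        return None

    def delete_tenant(self, tenant):
        doomed = [key for key in self._entries if key[0] == tenant]
        for key in doomed:
            buf = self._entries.pop(key)
            buf[:] = bytes(len(buf))  # overwrite contents before dropping the reference
        self.audit.append(("deleted", tenant, len(doomed)))
        return len(doomed)


cache = ScopedKVCache()
cache.put("tenant-a", "confidential", "prefix-123", b"\x01\x02\x03")
print(cache.get("tenant-b", "confidential", "prefix-123"))  # None: matching key, wrong scope
print(cache.audit[-1][0])                                   # cross_scope_denied
```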

Operational implications

Memory pools may retain bytes after logical free. Stronger isolation can require process boundaries or memory clearing.

Measure

Cross-tenant share attempts, retention age, cleanup/zeroization, deletion coverage, and audit events.

Lifecycle and failure recovery

Completion, cancellation, timeout, model unload, and worker failure must release or reconcile cache references.

Implementation

Make release idempotent. On worker failure, invalidate location metadata and decide whether to restore, recompute, or fail.
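A minimal sketch of idempotent release and location invalidation, assuming integer block IDs and a simple prefix-to-worker location map. The `CacheLifecycle` class and its method names are hypothetical.

```python
class CacheLifecycle:
    def __init__(self):
        self.refcount = {}   # block id -> reference count
        self.sequences = {}  # sequence id -> block ids it references
        self.locations = {}  # prefix key -> worker believed to hold it

    def register(self, seq_id, blocks):
        self.sequences[seq_id] = list(blocks)
        for block in blocks:
            self.refcount[block] = self.refcount.get(block, 0) + 1

    def release(self, seq_id):
        """Safe to call from completion, cancellation, and timeout paths alike."""
        blocks = self.sequences.pop(seq_id, None)
        if blocks is None:
            return []  # already released: repeated calls are a no-op
        freed = []
        for block in blocks:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                freed.append(block)
        return freed

    def on_worker_failure(self, worker):
        """Drop location metadata for the failed worker; the caller then decides
        whether to restore from another tier, recompute, or fail the request."""
        stale = [key for key, holder in self.locations.items() if holder == worker]
        for key in stale:
            del self.locations[key]
        return stale


life = CacheLifecycle()
life.register("seq-1", [1, 2])
life.register("seq-2", [1, 3])
print(life.release("seq-1"))  # [2]: block 1 is still referenced by seq-2
print(life.release("seq-1"))  # []: second release changes nothing
```

Popping the sequence entry before touching reference counts is what makes a duplicate cancel or timeout signal harmless.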

Operational implications

Orphaned blocks create slow capacity leaks that appear as declining concurrency.

Measure

Cancel-to-release time, orphan scans, recovered blocks, restore success, and memory drift.

Reference tables

KV-cache mechanisms

| Mechanism | Primary benefit | Best-fit workload | Primary risk |
| --- | --- | --- | --- |
| Paged allocation | Reduce fragmentation and increase active capacity | Variable-length concurrent generation | Metadata/kernel complexity |
| Prefix caching | Avoid repeated exact-prefix prefill | Shared system prompts and repeated RAG context | Retention and invalidation |
| Radix-tree reuse | Share branching prefixes | Multi-turn and agent workflows | Tree eviction and routing complexity |
| Host-memory offload | Expand effective cache capacity | Reusable long prefixes with fast links | Transfer latency and host pressure |
| NVMe/remote tier | Persist/share very large caches | Repeated long-document analysis | Storage/network latency and privacy |
| Cache-aware routing | Preserve locality across replicas | Multi-replica serving | Load imbalance and stale metadata |

Storage tier trade-offs

| Tier | Relative latency | Capacity | Typical use | Risk |
| --- | --- | --- | --- | --- |
| GPU HBM/VRAM | Low | Limited | Active KV blocks | OOM and weight contention |
| CPU RAM | Moderate | Larger | Warm offload | NUMA and transfer overhead |
| Local NVMe | High | Large | Cold reusable prefixes | I/O contention |
| Remote storage/cache | Highest/variable | Very large | Fleet sharing/long retention | Network, consistency, privacy |

Decision checklist

  1. What exact architecture and cache precision determine bytes per token?
  2. What active-token and block budget exists per replica and tenant?
  3. Which prefixes are safe and valuable to reuse?
  4. What fields make up the cache key and invalidation boundary?
  5. When is offload faster than recomputation?
  6. How are location and ownership coordinated across workers?
  7. What cleanup or zeroization is required?

Common mistakes

  • Using a universal KV bytes-per-token number.
  • Equating paged allocation with prefix caching.
  • Reusing across tenants because token hashes match.
  • Reporting hit rate without prefill work avoided.
  • Offloading without comparing restore to recomputation.
  • Leaving cancelled sequences referenced by schedulers or block tables.
  • Reusing cache after model, adapter, tokenizer, or policy changes.

Sources and further reading


  1. PagedAttention paper
     USENIX / vLLM authors · Peer-reviewed paper · accessed 2026-06-21 UTC

  2. Automatic prefix caching
     vLLM · Official documentation · accessed 2026-06-21 UTC

  3. RadixAttention
     SGLang · Official documentation · accessed 2026-06-21 UTC

  4. LMCache documentation
     LMCache · Official documentation · accessed 2026-06-21 UTC

  5. NVIDIA Dynamo documentation
     NVIDIA · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
