KV Cache Offloading: Unlocking AI Inference Efficiency with NVMe SSDs
As large language models continue to scale and agentic AI applications become more common, AI inference systems are facing growing pressure on both compute and storage. In Transformer-based architectures, KV Cache stores previously computed attention states, improving inference efficiency by eliminating redundant computation.
However, as long-context workloads increase, the cost and capacity limits of GPU memory are becoming a physical bottleneck for scaling inference clusters. Offloading large KV Cache datasets to a more cost-effective external storage tier is emerging as a practical approach.
This article explains how KV Cache works in LLM inference and discusses how DapuStor NVMe SSDs can help build a cost-efficient storage layer for AI infrastructure.
01 The Role of KV Cache in AI Inference
When a large language model (LLM) generates text, it relies on matrix operations within the multi-head attention (MHA) mechanism. A key part of this process is computing the K and V matrices, which is where KV Cache becomes important.
Without caching, the model would need to recompute the full sequence context every time a new request arrives. With KV Cache, previously computed K/V states are stored and reused directly.

When a new request is received, the model only needs to compute the new K/V entries and concatenate them with the cached historical K/V data for subsequent attention computation. By eliminating large volumes of redundant computation, KV Cache becomes a foundational mechanism for improving LLM inference efficiency.
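The following minimal sketch illustrates this reuse for a single attention head. It is written in PyTorch; the shapes, names, and the attend helper are illustrative rather than taken from any particular inference framework.

```python
# Minimal sketch of KV Cache reuse in a single attention head (PyTorch).
# Names and shapes are illustrative, not taken from any specific framework.
import torch

def attend(q_new, k_new, v_new, kv_cache):
    """Compute attention for new tokens, reusing cached K/V states."""
    if kv_cache is not None:
        k_hist, v_hist = kv_cache
        # Only the new K/V entries are computed; history is reused as-is.
        k = torch.cat([k_hist, k_new], dim=0)   # (S_total, D)
        v = torch.cat([v_hist, v_new], dim=0)
    else:
        k, v = k_new, v_new
    scores = q_new @ k.T / k.shape[-1] ** 0.5    # (S_new, S_total)
    out = torch.softmax(scores, dim=-1) @ v      # (S_new, D)
    return out, (k, v)                           # updated cache for next step

# One decode step: a single new token attends over the full cached context.
D = 64
cache = (torch.randn(128, D), torch.randn(128, D))  # 128 cached tokens
q = k = v = torch.randn(1, D)                       # 1 new token
out, cache = attend(q, k, v, cache)
print(out.shape, cache[0].shape)  # torch.Size([1, 64]) torch.Size([129, 64])
```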
02 KV Cache Reuse Reduces GPU Resource Pressure
Storing historical K/V states and retrieving them on demand can reduce GPU workload from both compute and memory perspectives.
1. Improving Effective GPU Throughput
During AI inference system planning, teams often estimate the required number of GPUs by starting with a small-scale benchmark, measuring performance under latency and throughput constraints, and then extrapolating to the full target deployment.
A simplified formula is:
Required GPUs = Target throughput / (Throughput of small-scale setup × Utilization) × Number of GPUs in small-scale setup
Throughput is typically measured in tokens per second.
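As a quick illustration, the formula can be applied directly in code. All figures below are hypothetical:

```python
import math

def required_gpus(target_tps, small_tps, small_gpus, utilization=0.7):
    """Extrapolate GPU count from a small-scale benchmark (tokens/s)."""
    return target_tps / (small_tps * utilization) * small_gpus

# E.g., an 8-GPU pilot measuring 20,000 tokens/s, targeting 500,000 tokens/s:
print(math.ceil(required_gpus(500_000, 20_000, 8)))  # 286 GPUs at 70% utilization
```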
KV Cache reuse eliminates redundant recomputation and can reduce response latency, including time-to-first-token (TTFT), when cache hits are high and data movement is well managed. This accelerates request processing and leaves more GPU resources available for additional requests.
One common concern is that user requests may not share enough context for KV Cache reuse to be effective. In real workloads, however, shared context is becoming increasingly common. Agentic AI is a typical example.
For instance, a user may send a single request to OpenClaw, asking it to open a browser, visit example.com, and click the “Learn More” link. Although the user sends only one message from the front end, backend traces show that OpenClaw sends multiple model requests with shared historical context to complete the task.



As agentic applications become more widely adopted, workloads with long historical context are expected to become increasingly common.
For this reason, benchmarks with shared context are more representative than those that only simulate fully random requests. The vLLM community’s multi-round QA benchmark is one example of this direction [1]. In such test environments, a storage layer should be considered as part of the inference system design.
2. Offloading KV Cache from GPU Memory
To improve throughput, system architects often increase batch size, or the number of concurrent requests processed by the GPU cluster. However, GPU memory capacity becomes a limiting factor as concurrency grows, and KV Cache can account for a significant portion of that memory footprint.
Assuming the KV Cache for all concurrent requests is stored in GPU memory, the memory footprint of classic MHA can be roughly estimated as:
2 × B × S × L × H × D
where B is batch size, S is the total sequence length (prompt plus generated tokens), L is the number of model layers, H is the number of attention heads, and D is the dimension of each head vector. The factor 2 represents K and V. Multiplying this element count by the bytes per element (for example, 2 for FP16) gives the footprint in bytes.
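A worked example makes the scale concrete. The model dimensions below are hypothetical but typical of a mid-size Transformer, assuming FP16 storage:

```python
# Worked example of the MHA KV Cache estimate above; all dimensions hypothetical.
def kv_cache_bytes(B, S, L, H, D, bytes_per_elem=2):
    # Factor 2 covers K and V; multiply element count by element size (FP16 = 2).
    return 2 * B * S * L * H * D * bytes_per_elem

# 64 concurrent requests, 8K-token contexts, 32 layers, 32 heads, head dim 128:
size = kv_cache_bytes(B=64, S=8192, L=32, H=32, D=128)
print(f"{size / 2**30:.0f} GiB")  # 256 GiB -- far beyond a single GPU's memory
```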
As request lengths increase and new requests continue to arrive, KV Cache size can eventually exceed available GPU memory. Although techniques such as DeepSeek's MLA (Multi-head Latent Attention) can reduce the per-token dimensionality of KV Cache, these improvements may also encourage longer sequences and larger-scale deployments.
As a result, offloading KV Cache from GPU memory to external storage, such as NVMe SSDs, is gaining attention as a way to reduce GPU memory pressure.
03 Addressing the Latency Challenge of SSD-Based KV Cache
NVMe SSDs offer strong capacity advantages for storing large volumes of KV data. The main challenge is read latency. As the AI infrastructure software ecosystem evolves, mainstream frameworks such as vLLM and LMCache are introducing mechanisms to make SSD-based KV Cache more practical.
1. Overlapping I/O and Model Computation
The vLLM engine includes four core components: input processing, scheduling, model execution, and output processing. According to the vLLM community [2]:
(1) Input processing handles tokenization with the specified tokenizer.
(2) Scheduling selects which requests are processed at each step.
(3) Model execution manages LLM execution, including distributed execution across multiple GPUs.
(4) Output processing decodes model-generated token IDs into human-readable text.
vLLM exposes a KVConnector interface that allows KV Cache offloading modules to connect to the inference engine. LMCache sits between vLLM and external storage systems, such as host memory, local drives, or distributed storage. This decouples the inference backend from third-party storage systems and provides a standard interface for external KV Cache storage.
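To make the separation of concerns concrete, the sketch below shows what a connector-style interface can look like. The class and method names are illustrative assumptions and do not reproduce vLLM's actual KVConnector API:

```python
# Simplified sketch of a connector-style offloading interface. Method names
# are illustrative and do not reproduce vLLM's exact KVConnector API.
from abc import ABC, abstractmethod

class KVCacheConnector(ABC):
    """Decouples the inference engine from the external KV Cache store."""

    @abstractmethod
    def lookup(self, token_ids: list[int]) -> int:
        """Return how many leading tokens already have cached K/V externally."""

    @abstractmethod
    def load(self, token_ids: list[int], dst_gpu_buffer) -> None:
        """Fetch cached K/V for the matched prefix into GPU memory."""

    @abstractmethod
    def save(self, token_ids: list[int], src_gpu_buffer) -> None:
        """Persist newly computed K/V to the external store."""

class NVMeConnector(KVCacheConnector):
    """Hypothetical backend writing chunked KV data to a local NVMe drive."""
    def __init__(self, path: str = "/mnt/nvme/kvcache"):
        self.path = path
    def lookup(self, token_ids):
        return 0  # real code would hash token prefixes and probe an index
    def load(self, token_ids, dst_gpu_buffer):
        pass      # read chunk files and copy host -> GPU
    def save(self, token_ids, src_gpu_buffer):
        pass      # copy GPU -> host, then issue large sequential writes
```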

The overlap between I/O and compute is enabled by the coordination between vLLM’s scheduler and model runner.
During each scheduling step, the scheduler selects the requests to be served and sets KVConnector metadata describing the KV data to be transferred later. This is implemented in the Scheduler.schedule function.

When the model runner executes the requests, it can initiate background KV data transfer at the same time.
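The sketch below illustrates this overlap pattern, using a worker thread as a stand-in for the engine's asynchronous transfer path. The function names are hypothetical; real engines coordinate this through scheduler metadata and CUDA streams rather than plain Python threads:

```python
# Minimal sketch of overlapping KV I/O with model execution. Function names
# are hypothetical stand-ins for the engine's asynchronous transfer path.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=2)

def load_kv_from_ssd(request_id: str):
    ...  # read cached KV chunks for this request from NVMe into host buffers

def run_model_step(batch):
    ...  # launch the forward pass on the GPU

def execute_step(scheduled_batch, prefetch_requests):
    # Kick off background KV loads for requests scheduled in a later step...
    futures = [io_pool.submit(load_kv_from_ssd, r) for r in prefetch_requests]
    # ...while the GPU stays busy with the current batch.
    run_model_step(scheduled_batch)
    for f in futures:
        f.result()  # transfers must complete before those requests run
```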

2. Packing Multi-Token KV Cache into Larger I/O Transfers
Both vLLM and LMCache support packing KV Cache data from multiple tokens for transfer, improving memory efficiency and I/O efficiency.
vLLM uses PagedAttention to manage KV Cache in fixed-size pages. A typical page holds 16 tokens, and the size can be adjusted through command-line parameters such as vLLM's --block-size flag. LMCache can further group KV data into larger chunks, such as 256-token chunks, with configuration options available in lmcache/v1/config.py.
External storage systems can use these larger chunks to generate flash-friendly, sequential large-block I/O patterns, improving bandwidth utilization on NVMe SSDs.
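A minimal sketch of this packing step follows; the sizes, file layout, and paths are illustrative assumptions, not LMCache's actual on-disk format:

```python
# Sketch of packing page-sized KV data (e.g., 16-token pages) into larger
# chunks (e.g., 256 tokens) before writing, so the SSD sees one large
# sequential write instead of many small random writes. Paths are illustrative.
import os

PAGE_TOKENS = 16     # vLLM-style KV page granularity
CHUNK_TOKENS = 256   # LMCache-style chunk granularity (16 pages per chunk)

def write_chunk(pages: list[bytes], chunk_id: int, root: str = "/mnt/nvme/kv"):
    assert len(pages) == CHUNK_TOKENS // PAGE_TOKENS
    buf = b"".join(pages)  # one contiguous buffer -> one large write
    with open(os.path.join(root, f"chunk_{chunk_id:08d}.bin"), "wb") as f:
        f.write(buf)       # a single sequential write per 256-token chunk
```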
04 Advantages of NVMe SSDs for KV Cache
Once latency and I/O patterns are properly managed, NVMe SSDs can deliver strong benefits for KV Cache offloading. The capacity, bandwidth, and cost requirements of this workload align well with the characteristics of DapuStor R6 Series SSDs.
1. Reducing KV Cache Eviction Probability
If a system can store more historical KV Cache data, it can reduce recomputation overhead. Over time, losing KV Cache may also affect model performance in long-sequence tasks [3].
With single-drive capacities of up to 245TB, DapuStor R6 Series SSDs make it possible to build a massive KV Cache storage backend.
2. Optimizing Storage Cost
KV Cache reuse is not always critical to the functional correctness of LLM output. If cached K/V states are lost, they can be recomputed, although at additional compute cost. This makes KV Cache a suitable target for a lower-cost, capacity-optimized storage tier.
As software such as vLLM and LMCache reshapes I/O patterns to be more SSD-friendly, I/O latency can be hidden behind the compute pipeline. KV Cache data also often follows a write-once, read-many access pattern in reuse scenarios. Combined with the transition from PCIe Gen4 to PCIe Gen5, DapuStor R6 Series SSDs can help data centers reduce the total cost of AI inference while maintaining system reliability.
05 Building a Storage Foundation for AI Inference
KV Cache has become an essential component of AI inference system architecture. From the perspectives of GPU memory pressure and deployment cost, offloading KV Cache to external storage is becoming increasingly practical.
As AI infrastructure software continues to evolve, NVMe SSDs are expected to play a growing role in this emerging workload. With high capacity, PCIe Gen5 performance, and strong cost efficiency, DapuStor R6 Series SSDs are well aligned with the requirements of KV Cache offloading and can provide a scalable storage foundation for next-generation AI inference systems.
References
[1] https://github.com/vllm-project/production-stack/tree/main/benchmarks/multi-round-qa
[2] https://github.com/vllm-project/vllm/blob/main/docs/design/arch_overview.md
[3] https://arxiv.org/abs/2412.19442
