SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
Quick Answer
SeKV introduces a resolution-adaptive KV cache for long-context LLMs, enhancing semantic memory without information loss.
Quick Take
SeKV introduces a resolution-adaptive KV cache for long-context LLMs, enhancing semantic memory without information loss. It achieves a 5.9% performance improvement over existing methods while reducing GPU memory usage by 53.3% at 128K context, with minimal additional parameters.
Key Points
- SeKV organizes context into entropy-guided semantic spans for efficient memory use.
- It retains a lightweight summary vector on GPU for coarse routing during inference.
- The method reduces GPU memory requirements by 53.3% compared to full KV caching.
- SeKV improves semantic compression performance by an average of 5.9% across benchmarks.
- Less than 0.05% additional trainable parameters are needed to implement SeKV.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2606. 31145v1 Announce Type: new Abstract: Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching prohibitively expensive without compression. Existing KV cache compression methods struggle to balance efficiency with faithful context preservation.
Token eviction discards information, while semantic grouping fixes compression decisions at prefill time; neither can recover token-level detail from a compressed span once it becomes relevant during generation. As a solution, we propose SeKV, a resolution-adaptive semantic KV cache that organizes context into entropy-guided semantic spans and stores them across a GPU-CPU memory hierarchy without discarding information.
Each span keeps a lightweight summary vector on GPU for coarse routing and a low-rank SVD basis on CPU for on-demand token-level reconstruction. A trained zoom-in mechanism selectively expands query-relevant spans during decoding, enabling precise retrieval without materializing the full KV cache on GPU. SeKV enables adaptive token-level reconstruction while keeping the base LLM fully frozen and adding fewer than 0. 05% trainable parameters.
Across four benchmarks, SeKV improves over the strongest semantic compression baseline by 5. 9% on average while reducing GPU memory by 53. 3% versus full KV caching at 128K context. Code is available on https://github. com/AmirAbaskohi/SeKV.
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Quantifying Prior Dominance in Systems
The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.


