Probing the Prompt KV Cache: Where It Becomes Dispensable
Quick Take
The prompt KV cache is partially redundant during decoding in the Qwen3, Gemma 3, and Llama 3 model families. Replacing the upper-layer prompt-span KV cache with the cache from a chat-template scaffold whose user content is a neutral filler recovers near-clean accuracy, while zeroing the same slots collapses accuracy. The redundancy is therefore about form (chat-template scaffolding) rather than content.
Key Points
- Prompt KV cache redundancy observed in Qwen3, Gemma 3, and Llama 3 models.
- Replacing the upper-layer prompt-span KV cache with the cache from a neutral-filler chat-template scaffold recovers near-clean accuracy.
- Zeroing the same KV cache slots collapses accuracy.
- Redundancy is about form (chat-template scaffolding) rather than content.
- Findings replicated across multiple datasets.
Article Excerpt
From source RSS / original summary
arXiv:2605.30574v1 Announce Type: new
Abstract: Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which layers, after how many decoding steps, and in what form can the prompt span KV cache be replaced without breaking the task.
A controlled splice intervention swept over layer cutoff and decoding steps shows this redundancy is about form (chat template scaffolding) rather than content. Replacing the upper layer prompt span KV cache with KV cache from a chat template scaffold whose user content is a neutral filler recovers near clean accuracy, while zeroing the same slots collapses accuracy. The dissociation replicates across the Qwen3, Gemma 3, and Llama 3 families on multiple datasets.
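The splice described in the abstract can be pictured with a toy sketch. This is not the authors' code: the function names, the nested-list cache layout, and the assumption that the donor (neutral-filler scaffold) cache is position-aligned with the clean cache are all hypothetical choices made for illustration. The decoding-step dimension of the sweep is not modelled; only the layer cutoff and the two replacement modes (scaffold vs. zero) are shown.

```python
from copy import deepcopy


def make_cache(n_layers, n_positions, dim, fill):
    """Build a toy KV cache: layers -> positions -> (key, value) float lists.

    Hypothetical layout for illustration only; real caches are tensors.
    """
    return [
        [([fill] * dim, [fill] * dim) for _ in range(n_positions)]
        for _ in range(n_layers)
    ]


def splice_prompt_kv(clean_cache, donor_cache, layer_cutoff, prompt_span,
                     mode="scaffold"):
    """Return a copy of clean_cache with upper-layer prompt slots replaced.

    layer_cutoff: first 0-based layer index whose prompt slots are replaced.
    prompt_span:  (start, end) half-open range of prompt token positions.
    mode:         "scaffold" copies the donor slots; "zero" writes zeros.
    Assumes donor_cache has the same shape as clean_cache (an assumption
    of this sketch, not something stated in the abstract).
    """
    if mode not in ("scaffold", "zero"):
        raise ValueError("mode must be 'scaffold' or 'zero'")
    start, end = prompt_span
    spliced = deepcopy(clean_cache)  # leave the clean cache untouched
    for layer in range(layer_cutoff, len(spliced)):
        for pos in range(start, end):
            key, value = spliced[layer][pos]
            if mode == "scaffold":
                spliced[layer][pos] = deepcopy(donor_cache[layer][pos])
            else:
                spliced[layer][pos] = ([0.0] * len(key), [0.0] * len(value))
    return spliced


if __name__ == "__main__":
    clean = make_cache(n_layers=4, n_positions=5, dim=2, fill=1.0)
    donor = make_cache(n_layers=4, n_positions=5, dim=2, fill=9.0)
    # Sweep the layer cutoff for both replacement modes.
    for cutoff in range(4):
        for mode in ("scaffold", "zero"):
            out = splice_prompt_kv(clean, donor, cutoff, (1, 4), mode)
            replaced = sum(
                out[layer][pos] != clean[layer][pos]
                for layer in range(4) for pos in range(5)
            )
            print(f"cutoff={cutoff} mode={mode} replaced_slots={replaced}")
```

In an actual experiment, each spliced cache would be handed back to the model for decoding and task accuracy compared against the clean run; the comparison between the two modes is what separates "form" from "content" in the abstract's framing.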