TGV-KV: Text-Grounded KV Eviction for Vision-Language Models
Quick Take
TGV-KV is a text-grounded key-value (KV) cache eviction method for Vision-Language Models (VLMs). On the VizWiz-VQA task with LLaVA-NeXT, it preserves 99.2% of full-KV accuracy and boosts end-to-end throughput by 52.6% at a 5% retention budget. The method targets the memory cost of caching redundant visual tokens by grounding the importance of visual information in textual guidance.
Key Points
- TGV-KV includes three submodules: Text-Vision Budgeting (TVB), Text-Weighted Ranking (TWR), and Text-Prioritised Retention (TPR).
- Preserves 99.2% of full-KV accuracy on VizWiz-VQA with LLaVA-NeXT.
- Boosts end-to-end throughput by 52.6% using a 5% retention budget.
- Targets KV cache memory, which scales linearly with context length and is inflated in VLMs by redundant visual tokens.
- Code is publicly available on GitHub for further exploration.
Article Content
From source RSS / original summary
arXiv:2606.03075v1 Announce Type: new
Abstract: Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality.
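The linear scaling mentioned above can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, not taken from the paper: it assumes a standard decoder-only transformer that stores one key vector and one value vector per token, per layer, per KV head. The function name and the model dimensions in the usage lines are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Assumption: every layer caches one key and one value vector of size
    num_kv_heads * head_dim for each token (hence the factor of 2).
    bytes_per_elem=2 corresponds to 16-bit floats.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical dimensions (32 layers, 32 KV heads, head size 128), 16-bit cache.
full = kv_cache_bytes(32, 32, 128, seq_len=2048)
doubled = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(full / 2**30, "GiB for 2048 tokens")
print(doubled / full, "x when the context length doubles")
```

Under these assumed dimensions each cached token costs 0.5 MiB, so the many visual tokens that a single image contributes to the context account for a large share of the cache.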
Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV).
TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at https://github.com/Danielement321/TGV-KV.
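To illustrate the general idea of ranking vision tokens by weighted text-image attention while preserving text KV, here is a minimal sketch for a single layer. It is not the authors' implementation: the abstract does not give the scoring formula, how text priorities are computed, or how TVB sets per-layer budgets, so the function name, the weighted-sum score, the externally supplied `text_weights`, and the rule that every text token is always kept are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_kv_to_keep(
    text_to_vision_attn: Sequence[Sequence[float]],  # rows: text tokens, cols: vision tokens
    text_weights: Sequence[float],                   # assumed priority of each text token
    vision_positions: Sequence[int],                 # cache positions of vision tokens
    text_positions: Sequence[int],                   # cache positions of text tokens
    retention_ratio: float,                          # fraction of the cache to retain
) -> List[int]:
    """Return the sorted cache positions to keep for one layer (illustrative only)."""
    total = len(vision_positions) + len(text_positions)
    budget = int(total * retention_ratio)

    # Assumed retention rule: keep every text token, even if that alone
    # exceeds the budget.
    kept = list(text_positions)
    vision_slots = max(0, budget - len(kept))

    # Assumed ranking rule: a vision token's score is the weighted sum of the
    # attention it receives from each text token.
    scores = [
        sum(w * row[j] for w, row in zip(text_weights, text_to_vision_attn))
        for j in range(len(vision_positions))
    ]
    ranked = sorted(range(len(vision_positions)), key=lambda j: scores[j], reverse=True)

    kept.extend(vision_positions[j] for j in ranked[:vision_slots])
    return sorted(kept)


# Toy example: 6 vision tokens at positions 0-5, 2 text tokens at positions 6-7.
attn = [
    [0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.6, 0.1, 0.1],
]
keep = select_kv_to_keep(attn, [0.5, 0.5], list(range(6)), [6, 7], retention_ratio=0.5)
print(keep)  # [1, 3, 6, 7]
```

In the toy example the budget is 4 of 8 positions: both text tokens are retained first, and the remaining two slots go to the vision tokens that receive the most weighted text attention. The released code at the repository linked above is the reference for how TGV-KV actually computes budgets and scores.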