Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

arXiv cs.CV·Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

5/20/2026

Quick Take

RotateK eases KV cache pressure in vision-language model inference by pruning Key channels in a structured, head-wise way after a PCA-based rotation aligns channel importance across tokens.

Key Points

Addresses KV cache pressure in vision-language models.
Introduces rotation-based structured key channel pruning.
Outperforms prior Key channel pruning methods in both accuracy and decoding latency on two VLM backbones.

[Submitted on 19 May 2026]

Abstract: Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
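
The rotate-then-prune idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes an offline, uncentered PCA over the Keys of a single head (the paper describes an online variant), uses synthetic data, and does not reproduce the fused Triton kernel. All function names are hypothetical. It relies on two facts: an orthogonal rotation applied to both Queries and Keys leaves the attention logits unchanged, and after a PCA rotation the leading channels carry most of the Key energy, so one shared head-wise mask (keep the first `keep` channels) discards less than a mask chosen in the original basis.

```python
import numpy as np

def pca_rotation(keys):
    """Orthogonal matrix whose columns are principal directions of `keys`.

    keys: (num_tokens, head_dim). Assumption: uncentered PCA, i.e. the
    eigen-decomposition of K^T K, with columns sorted by decreasing eigenvalue.
    """
    second_moment = keys.T @ keys
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues come back ascending
    return eigvecs[:, ::-1]                     # reverse columns: descending

def energy_sorted_permutation(keys):
    """Baseline without rotation: reorder original channels by total energy."""
    order = np.argsort((keys ** 2).sum(axis=0))[::-1]
    return np.eye(keys.shape[1])[:, order]

def pruned_logits(queries, keys, rotation, keep):
    """Attention logits using only the first `keep` channels after rotation.

    The same slice is applied to every token, which is what makes the mask
    head-wise (structured) rather than token-wise.
    """
    q_rot = queries @ rotation
    k_rot = keys @ rotation
    return q_rot[:, :keep] @ k_rot[:, :keep].T

def dropped_energy(keys, rotation, keep):
    """Squared Frobenius norm of the Key channels removed by the mask."""
    k_rot = keys @ rotation
    return float((k_rot[:, keep:] ** 2).sum())

def make_demo(seed=0, num_tokens=256, head_dim=64, latent_dim=8):
    """Synthetic Keys lying close to a low-dimensional subspace (illustrative only)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(num_tokens, latent_dim))
    mixing = rng.normal(size=(latent_dim, head_dim))
    keys = latent @ mixing + 0.01 * rng.normal(size=(num_tokens, head_dim))
    queries = rng.normal(size=(4, head_dim))
    return queries, keys

if __name__ == "__main__":
    queries, keys = make_demo()
    keep = 16
    rotation = pca_rotation(keys)
    baseline = energy_sorted_permutation(keys)
    print("dropped energy, PCA rotation:", dropped_energy(keys, rotation, keep))
    print("dropped energy, no rotation: ", dropped_energy(keys, baseline, keep))
    exact = queries @ keys.T
    approx = pruned_logits(queries, keys, rotation, keep)
    print("max logit error after pruning:", np.abs(exact - approx).max())
```

Because the truncated PCA basis is the best rank-`keep` subspace for the Keys in the Frobenius sense, the rotated mask never drops more Key energy than the unrotated channel selection. How closely this toy setting mirrors real VLM Key statistics is not something the sketch establishes.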

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.19218 [cs.CV]
	(or arXiv:2605.19218v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.19218 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Beomseok Kang
[v1] Tue, 19 May 2026 00:45:00 UTC (1,605 KB)

— Originally published at arxiv.org
