AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers
Quick Take
AdaMerge is a training-free token-merging framework for Vision Transformers that drops the assumption that all tokens are equally important. On ImageNet-1k with ViT-B/16 it outperforms ToMe, PiToMe, and DSM: at 13.4G FLOPs its Top-1 accuracy drops by only 1.06%, compared with 1.45% for PiToMe and 4.62% for DSM.
Key Points
- AdaMerge combines salience-weighted similarity and adaptive per-layer reduction.
- Outperforms ToMe, PiToMe, and DSM on ImageNet-1k with ViT-B/16 across all FLOPs-matched regimes.
- At 13.4G FLOPs, AdaMerge limits the Top-1 accuracy drop to 1.06%.
- Uses column-wise feature-affinity centrality as a proxy for token importance.
- According to the authors, the first training-free token-merging framework to combine both mechanisms.
Article Content
From source RSS / original summary
arXiv:2605.27465v1 Announce Type: new
Abstract: The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction.
Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms.
First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy.
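As a rough illustration of the first mechanism, the NumPy sketch below computes a salience score per token from the column sums of a cosine feature-affinity matrix, folds that score into a ToMe-style bipartite matching score, and merges matched tokens with a salience-weighted average. The abstract does not give the exact formulas, so the softmax normalization, the subtractive `alpha` penalty, and the names `salience_scores` and `salience_weighted_merge` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def salience_scores(x):
    """Salience per token from column-wise affinity centrality (assumed form)."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    affinity = xn @ xn.T                  # (N, N) cosine feature affinity
    np.fill_diagonal(affinity, 0.0)       # ignore self-affinity
    centrality = affinity.sum(axis=0)     # column-wise sum per token
    # Assumption: a softmax turns centrality into positive weights (mean 1).
    e = np.exp(centrality - centrality.max())
    return x.shape[0] * e / e.sum()


def salience_weighted_merge(x, r, alpha=1.0):
    """Merge r tokens of x (N, D) via salience-aware bipartite matching."""
    s = salience_scores(x)
    a_idx = np.arange(0, x.shape[0], 2)   # set A: even positions
    b_idx = np.arange(1, x.shape[0], 2)   # set B: odd positions
    r = min(r, len(a_idx))

    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = xn[a_idx] @ xn[b_idx].T         # (|A|, |B|) cosine similarity
    # Assumption: salient A tokens are penalized so they are merged away last.
    score = sim - alpha * s[a_idx][:, None]

    best_b = score.argmax(axis=1)         # best partner in B for each A token
    order = np.argsort(-score.max(axis=1))
    merged_a, kept_a = order[:r], order[r:]

    # Salience-weighted average: salient tokens dominate the merged token.
    num = x[b_idx] * s[b_idx][:, None]
    den = s[b_idx].copy()
    src = a_idx[merged_a]
    np.add.at(num, best_b[merged_a], x[src] * s[src][:, None])
    np.add.at(den, best_b[merged_a], s[src])

    # Unmerged A tokens first, then the (possibly merged) B tokens.
    return np.concatenate([x[a_idx[kept_a]], num / den[:, None]], axis=0)
```

Calling `salience_weighted_merge(x, r=4)` on a `(16, 8)` token matrix returns 12 tokens. Note that this sketch does not preserve the original token order.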
On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM.
To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
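The adaptive per-layer reduction named in the abstract could be sketched along the following lines. The abstract only says that pre-computed layer-wise similarity statistics modulate the reduction count, so everything specific here is an assumption: the redundancy measure (mean best-match cosine similarity), the z-score scaling with `beta`, and the hypothetical names `redundancy` and `adaptive_r`. `layer_mean` and `layer_std` stand in for statistics that would be collected offline for one layer.

```python
import numpy as np


def redundancy(x):
    """Mean best-match cosine similarity between alternating token sets."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = xn[0::2] @ xn[1::2].T
    return float(sim.max(axis=1).mean())


def adaptive_r(x, base_r, layer_mean, layer_std, beta=0.5):
    """Scale the per-layer merge count by how redundant this input is.

    Assumed rule: inputs more redundant than the layer's pre-computed
    average merge more tokens, less redundant inputs merge fewer.
    """
    z = (redundancy(x) - layer_mean) / max(layer_std, 1e-6)
    r = int(round(base_r * (1.0 + beta * z)))
    max_r = x.shape[0] // 2               # bipartite matching merges at most half
    return int(np.clip(r, 0, max_r))
```

Under these assumptions, a layer fed near-duplicate tokens would be allowed to merge up to half of them, while a layer fed mutually dissimilar tokens would merge none.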
