AVIS: Adaptive Test-Time Scaling for Vision-Language Models
Quick Answer
AVIS (Adaptive Visual Inference Scaling) is a lightweight policy for Vision-Language Models that adapts both Visual Context Scaling and Visual Reasoning Scaling per query.
Quick Take
By adapting both axes per query, AVIS improves the accuracy-compute trade-off relative to baselines that scale only visual context or only visual reasoning, across image and video reasoning benchmarks, while keeping compute and latency low.
Key Points
- AVIS adapts Visual Context Scaling and Visual Reasoning Scaling per query.
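- Key Diversity Visual (KDV) pruning is a training-free O(N) key-based rule that removes redundant visual tokens before prefilling.
- Adaptive self-consistency uses a learned difficulty predictor to choose the number of reasoning rollouts.
- All rollouts can reuse a single prefilling pass and KV cache, which keeps compute and latency low.
- AVIS remains effective on top of RL post-trained Vision-Language Models across diverse image and video reasoning benchmarks.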
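As a rough illustration of the adaptive self-consistency idea in the points above, here is a minimal sketch: a difficulty score sets the rollout budget, and the final answer is the majority vote over the sampled answers. The names `rollouts_for`, `predict_difficulty`, and `sample_answer`, the linear mapping, and the default budget of 1 to 8 rollouts are all hypothetical stand-ins, not details from the paper.

```python
from collections import Counter


def rollouts_for(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a difficulty score in [0, 1] to a rollout count.

    Hypothetical linear rule; the paper's learned predictor and its
    mapping to a rollout budget are not described in this summary.
    """
    d = min(1.0, max(0.0, difficulty))  # clamp out-of-range scores
    return min_rollouts + round(d * (max_rollouts - min_rollouts))


def adaptive_self_consistency(query, predict_difficulty, sample_answer, **budget):
    """Sample a query-dependent number of answers and majority-vote them.

    predict_difficulty(query) -> float and sample_answer(query, i) -> str
    are placeholders for a difficulty predictor and one reasoning rollout.
    Returns the winning answer and the number of rollouts used.
    """
    n = rollouts_for(predict_difficulty(query), **budget)
    answers = [sample_answer(query, i) for i in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n


if __name__ == "__main__":
    # Toy stand-ins: a "hard" query whose rollouts mostly agree on "A".
    answer, used = adaptive_self_consistency(
        "What is shown in the image?",
        predict_difficulty=lambda q: 1.0,
        sample_answer=lambda q, i: "A" if i % 3 else "B",
    )
    print(answer, used)
```

Easy queries get a single rollout under this rule, so extra decoding is spent only where the predictor expects disagreement between rollouts.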
Article Content
From source RSS / original summary (arXiv:2606.11576v1). Abstract: Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed.
Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query.
AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache.
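The abstract does not spell out the KDV rule, so the following is only a sketch of what a training-free, key-based pruning step could look like. It assumes that a visual token is redundant when its key vector points in nearly the same direction as the mean key, and it keeps the least redundant fraction of tokens. The function name `prune_by_key_diversity`, the cosine-to-mean score, and the `keep_ratio` parameter are all hypothetical and should not be read as the paper's method.

```python
import math


def prune_by_key_diversity(keys, keep_ratio=0.5):
    """Return the indices of visual tokens to keep, in original order.

    keys is a list of equal-length key vectors, one per visual token.
    Hypothetical scoring rule: cosine similarity to the mean key, where
    a higher score means a more redundant token.
    """
    if not keys:
        return []
    n, dim = len(keys), len(keys[0])
    # One pass over the tokens to build the mean key.
    mean = [sum(k[d] for k in keys) / n for d in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean))

    def redundancy(key):
        norm = math.sqrt(sum(x * x for x in key))
        if norm == 0.0 or mean_norm == 0.0:
            return 0.0
        return sum(a * b for a, b in zip(key, mean)) / (norm * mean_norm)

    n_keep = max(1, math.ceil(keep_ratio * n))
    # Scoring is linear in the number of tokens; the sort used here for
    # selection adds O(N log N), which a threshold rule would avoid.
    ranked = sorted(range(n), key=lambda i: redundancy(keys[i]))
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Three near-identical tokens and one distinct token.
    toy_keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(prune_by_key_diversity(toy_keys, keep_ratio=0.5))
```

Because the kept indices are chosen before prefilling, a single pruned prefill pass could in principle be shared by every rollout, which is the shared-prefill setting the abstract describes.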
Across diverse image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.