Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
Quick Take
SceneDiver is a coarse-to-fine focus plan generation method for Vision-Language Models (VLMs), paired with a lightweight adapter that distills the same focus ability into Vision-Language-Action Models (VLAs). According to the abstract, evaluations on standard embodied AI benchmarks show substantially fewer visual hallucinations for both model types in robotic manipulation and navigation, while preserving computational efficiency in tasks that require fast execution.
Key Points
- SceneDiver constructs a holistic scene graph for initial scene comprehension.
- It progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis.
- The method reduces visual hallucinations for both VLMs and VLAs.
- A lightweight adapter distills focus ability for reactive control in VLAs.
- Code and data are released on the project page (https://future-item.github.io/SceneDiver).
Article Content
From source RSS / original summary (arXiv:2606.04046v1). Abstract: In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors.
In principle, accurately identifying and focusing on critical objects while filtering out irrelevant ones is the key to breaking this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding.
To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs.
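To make the coarse-to-fine idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify data structures or prompts, so the `SceneGraph` class, the `focus_plan` function, and the rule-based lookups that stand in for VLM calls are all hypothetical. It only illustrates the general shape described above: build a scene graph first, then, for each sub-problem, recognize the target, gather its related objects, and record what to focus on.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): object -> set of (relation, other object)."""

    edges: dict[str, set[tuple[str, str]]] = field(default_factory=dict)

    def add_relation(self, subj: str, relation: str, obj: str) -> None:
        # Register both endpoints so every object appears as a node.
        self.edges.setdefault(subj, set()).add((relation, obj))
        self.edges.setdefault(obj, set())

    def objects(self) -> set[str]:
        return set(self.edges)

    def neighbors(self, name: str) -> set[str]:
        # Objects connected to `name` by an outgoing or incoming relation.
        outgoing = {o for _, o in self.edges.get(name, set())}
        incoming = {
            s for s, rels in self.edges.items() if any(o == name for _, o in rels)
        }
        return outgoing | incoming


def focus_plan(graph: SceneGraph, subgoal_targets: list[str]) -> list[tuple[str, list[str]]]:
    """Assumed stand-in for the iterative cycle: one pass per sub-problem.

    In the paper a VLM would perform these steps; here they are simple lookups.
    """
    plan = []
    for target in subgoal_targets:
        if target not in graph.objects():      # recognition: is the target in the scene?
            continue
        context = graph.neighbors(target)      # understanding: what is it related to?
        plan.append((target, sorted(context)))  # analysis: record the focus for this step
    return plan


# Example scene: two task-relevant objects and two distractors.
graph = SceneGraph()
graph.add_relation("apple", "on", "table")
graph.add_relation("bowl", "on", "table")
graph.add_relation("mug", "next_to", "laptop")

plan = focus_plan(graph, ["apple", "bowl"])
focused = {name for target, ctx in plan for name in (target, *ctx)}
distractors = graph.objects() - focused
```

In this toy scene the plan keeps the apple, the bowl, and the table they sit on, and leaves the mug and laptop out of focus. A one-step focus baseline would skip the graph and pick objects directly, which is the approach the abstract reports as ineffective.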
Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.
