ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models
Quick Take
ReGuLaR introduces a relation-grounded latent reasoning framework for large vision-language models, enhancing reasoning by focusing on relevant objects and their relations. It outperforms existing methods on diverse benchmarks, achieving state-of-the-art results. The framework is supported by a new dataset, RGROUNDING-351K, annotated with object bounding boxes and relations.
Key Points
- ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations.
- The model operates independently of ReGFormer during inference.
- RGROUNDING-351K is a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations.
- Extensive experiments demonstrate consistent performance improvements over existing methods.
- Code and training data will be publicly released upon acceptance.
Article Content
From source RSS / original summary: arXiv:2605.30587v1. Announce Type: new. Abstract: Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space.
Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence.
ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations.
Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.
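The abstract describes RGROUNDING-351K only as annotated with key object bounding boxes and inter-object relations; the release format is not given. As a rough illustration of what one such record could look like, here is a minimal Python sketch. Every class and field name (GroundedObject, Relation, GroundedSample, subject, predicate, target) and the pixel-coordinate box convention are assumptions made for this example, not the authors' schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: all names here are illustrative assumptions,
# not the released RGROUNDING-351K format.

@dataclass
class GroundedObject:
    label: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
    bbox: Tuple[float, float, float, float]

@dataclass
class Relation:
    subject: int    # index into GroundedSample.objects
    predicate: str  # e.g. "on", "left of"
    target: int     # index into GroundedSample.objects

@dataclass
class GroundedSample:
    question: str
    answer: str
    objects: List[GroundedObject] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

def validate(sample: GroundedSample) -> List[str]:
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    for i, obj in enumerate(sample.objects):
        x0, y0, x1, y1 = obj.bbox
        if not (x0 < x1 and y0 < y1):
            problems.append(f"object {i} ({obj.label}) has a degenerate box")
    n = len(sample.objects)
    for j, rel in enumerate(sample.relations):
        if not (0 <= rel.subject < n and 0 <= rel.target < n):
            problems.append(f"relation {j} ({rel.predicate}) references a missing object")
    return problems

# A made-up record: two key objects and one relation between them.
sample = GroundedSample(
    question="What is the cup resting on?",
    answer="the table",
    objects=[
        GroundedObject("cup", (120.0, 80.0, 180.0, 150.0)),
        GroundedObject("table", (40.0, 140.0, 400.0, 300.0)),
    ],
    relations=[Relation(subject=0, predicate="on", target=1)],
)

# A deliberately broken record: inverted box and a dangling relation index.
broken = GroundedSample(
    question="q",
    answer="a",
    objects=[GroundedObject("cup", (180.0, 80.0, 120.0, 150.0))],
    relations=[Relation(subject=0, predicate="on", target=3)],
)
```

Storing relations as index pairs into the object list keeps each relation tied to boxes that actually exist in the record, which is what a check like validate relies on.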