Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
Quick Take
The EAGLE framework improves multi-agent visual question answering (VQA) by aligning the visual evidence that agents rely on, achieving the best average performance across six benchmarks. Unlike text-centric methods, which check only whether agents agree on an answer, EAGLE has agents mutually verify each other's visual grounding, leading to more trustworthy consensus.
Key Points
- EAGLE stands for Evidence-Aligned Grounded Multi-agent Reasoning.
- The framework is training-free and centers on visual evidence, meaning the image regions each agent relies on.
- EAGLE achieves the best average performance across six VQA benchmarks.
- It remains lightweight, interpretable, and practical for deployment.
- Multi-agent collaboration is used to mitigate the hallucinations and blind spots of individual agents.
Article Content
From source RSS / original summary (arXiv:2605.30698v1). Abstract: Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored.
Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus.
To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making.
Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
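To make the abstract's central idea concrete, here is a minimal Python sketch of evidence-weighted voting. It is not the EAGLE algorithm, whose details the abstract does not give. It assumes each agent returns an answer plus one bounding box, measures evidence consistency as mean pairwise IoU, and sums that consistency per answer. The names AgentOutput, iou, and evidence_weighted_vote are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: visual evidence is a single axis-aligned box (x1, y1, x2, y2).
Box = tuple[float, float, float, float]


@dataclass
class AgentOutput:
    answer: str  # the agent's textual answer
    box: Box     # the image region the agent says it relied on


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_weighted_vote(outputs: list[AgentOutput]) -> str:
    """Pick the answer whose supporters' evidence overlaps most with other agents.

    Each agent's vote is weighted by the mean IoU between its box and every
    other agent's box (an illustrative consistency score, not the paper's).
    """
    scores: dict[str, float] = defaultdict(float)
    for i, out in enumerate(outputs):
        overlaps = [iou(out.box, o.box) for j, o in enumerate(outputs) if j != i]
        consistency = sum(overlaps) / len(overlaps) if overlaps else 1.0
        scores[out.answer] += consistency
    return max(scores, key=scores.get)


# Three agents agree on "red" but point at disjoint regions;
# two agents answer "blue" and point at nearly the same region.
outputs = [
    AgentOutput("red", (0, 0, 10, 10)),
    AgentOutput("red", (50, 50, 60, 60)),
    AgentOutput("red", (80, 0, 90, 10)),
    AgentOutput("blue", (20, 20, 40, 40)),
    AgentOutput("blue", (22, 22, 42, 42)),
]
winner = evidence_weighted_vote(outputs)
```

In this toy case a plain majority vote would pick "red", while the evidence-weighted vote picks "blue" because only the "blue" agents share overlapping evidence. A real system would likely need multiple regions per agent and an explicit verification step between agents, which this sketch omits.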