Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

arXiv cs.CL·Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma

·~2 min·6/24/2026·en·0

Quick Take

The paper introduces a training-free framework for Knowledge-Based Visual Question Answering (KB-VQA) that separates entity identification from evidence ranking. On Encyclopedic-VQA and InfoSeek it consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. The gains come from both better entity identification and more informative evidence selection.

Key Points

Proposes an Identify-Before-Answer (IBA) framework for KB-VQA tasks.
Decouples entity identification from section-level evidence re-ranking instead of coupling both in a single re-ranking stage.
Outperforms fine-tuned multi-modal re-ranking baselines on Encyclopedic-VQA and InfoSeek.
Needs no training of its own, reducing both training and inference complexity.
Improvements stem from better entity identification and informative evidence selection.
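
The decoupling described in the key points can be sketched as a two-stage function. This is a minimal illustration, not the authors' implementation: `identify_then_rank`, `Section`, `overlap_score`, and `fake_selector` are hypothetical names, the selector is a stub standing in for an MLLM call, and the word-overlap scorer is a toy substitute for an off-the-shelf textual re-ranker.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    entity: str  # name of the knowledge-base entry this section belongs to
    text: str    # section text used as candidate evidence


def identify_then_rank(
    question: str,
    candidates: Dict[str, List[str]],
    select_entities: Callable[[str, List[str]], List[str]],
    score: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Section]:
    """Stage 1 keeps only the selected entities; stage 2 ranks their sections."""
    names = list(candidates)
    # Stage 1: the selector sees candidate names only, never the article text.
    kept = [n for n in select_entities(question, names) if n in candidates]
    # Stage 2: a text-only scorer ranks sections of the kept entities.
    pool = [Section(n, s) for n in kept for s in candidates[n]]
    pool.sort(key=lambda sec: score(question, sec.text), reverse=True)
    return pool[:top_k]


def overlap_score(question: str, text: str) -> float:
    """Toy re-ranker stand-in: fraction of question words found in the text."""
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)


def fake_selector(question: str, names: List[str]) -> List[str]:
    """Stub for the MLLM: pretends the image shows the Eiffel Tower."""
    return [n for n in names if n == "Eiffel Tower"]


kb = {
    "Eiffel Tower": ["The tower was completed in 1889.", "It is located in Paris."],
    "Tokyo Tower": ["The tower was completed in 1958."],
}
result = identify_then_rank(
    "When was the tower completed?", kb, fake_selector, overlap_score, top_k=1
)
```

Because the entity is fixed before any section is scored, the near-identical Tokyo Tower section never competes with the correct one, which is the kind of confusion a single coupled re-ranking stage has to resolve on its own.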

[Submitted on 22 Jun 2026]

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi-modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks that require grounding at both the fine-grained entity level and the evidence level. Most existing multi-modal retrieval-augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is made public to ease reproducibility.
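
The abstract's observation that MLLMs do better when choosing from a short list than when naming an entity freely suggests a closed-set prompt built from candidate names alone. The sketch below shows one way such a prompt and its reply parser might look. The prompt wording, the numbered-reply format, and the function names `build_selection_prompt` and `parse_selection` are assumptions made for illustration and are not taken from the paper. The image itself would be passed to the MLLM separately and is not shown.

```python
from typing import List


def build_selection_prompt(question: str, candidate_names: List[str]) -> str:
    """Format a closed-set prompt listing only candidate names (no article text)."""
    lines = [
        "Look at the image and the question.",
        f"Question: {question}",
        "Which of these entities is shown? Reply with the matching numbers, or 'none'.",
    ]
    # Number the candidates so the reply can be parsed without string matching.
    lines += [f"{i}. {name}" for i, name in enumerate(candidate_names, start=1)]
    return "\n".join(lines)


def parse_selection(reply: str, candidate_names: List[str]) -> List[str]:
    """Map a reply such as '2, 3' back to names, ignoring invalid tokens."""
    picked: List[str] = []
    for token in reply.replace(",", " ").split():
        # Accept only in-range integers; anything else (e.g. 'none') is skipped.
        if token.isdecimal() and 1 <= int(token) <= len(candidate_names):
            name = candidate_names[int(token) - 1]
            if name not in picked:
                picked.append(name)
    return picked


names = ["Eiffel Tower", "Tokyo Tower", "Blackpool Tower"]
prompt = build_selection_prompt("When was this tower completed?", names)
selected = parse_selection("1", names)
```

Returning an empty list for replies such as 'none' leaves room for the caller to fall back to the full candidate set when the model is not confident about any name. That fallback is a design choice of this sketch, not something the abstract specifies.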

Comments:	Accepted by ACL 2026 Findings. Project page this https URL
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Cite as:	arXiv:2606.23881 [cs.CL]
	(or arXiv:2606.23881v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.23881 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Qian Ma
[v1] Mon, 22 Jun 2026 19:27:00 UTC (5,836 KB)

Originally published at arxiv.org
