Self-Evolving Visual Questioner
Quick Answer
The proposed self-evolving framework allows vision-language models (VLMs) to autonomously generate diverse and challenging visual questions, enhancing their performance as both questioners and answerers.
Quick Take
Under the same budget, the self-evolving framework is more effective than training on the static source data, improving question quality and expanding the difficulty boundary of generated questions without external supervision.
Key Points
- VLMs can now self-improve as visual questioners without external supervision.
- The framework generates harder, more informative questions while maintaining diversity.
- Experiments across various backbone VLMs show substantially higher question quality and an expanded difficulty boundary for autonomous question generation.
- Self-supervision is more effective than training on static data under the same budget.
- The self-evolving questioner remains competitive or better as an answerer.
Article Content
From source RSS / original summary: arXiv:2606.13929v1. Abstract: Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision.
We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions.
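The propose-filter-train loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Question` record, the `propose`, `score`, and `train` callables, the difficulty threshold, and the token-overlap diversity check are all hypothetical stand-ins chosen for illustration, since the abstract does not specify how proposing, filtering, or diversity control are carried out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Question:
    image_id: str
    text: str
    difficulty: float  # hypothetical filter score in [0, 1]; higher means harder


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets (an assumed, crude diversity proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_questions(
    candidates: Sequence[Question], min_difficulty: float, max_overlap: float
) -> List[Question]:
    """Keep hard questions first, skipping any that nearly duplicate a kept one."""
    kept: List[Question] = []
    for q in sorted(candidates, key=lambda c: c.difficulty, reverse=True):
        if q.difficulty < min_difficulty:
            continue  # too easy: the filter rejects it
        if any(token_overlap(q.text, k.text) > max_overlap for k in kept):
            continue  # too similar to an already kept question
        kept.append(q)
    return kept


def evolve(
    propose: Callable[[str, int], List[str]],   # assumed: model in proposer mode
    score: Callable[[str, str], float],         # assumed: same model in filter mode
    train: Callable[[List[Question]], None],    # assumed: one training update
    image_ids: Sequence[str],
    rounds: int,
    min_difficulty: float = 0.5,
    max_overlap: float = 0.6,
) -> List[List[Question]]:
    """Run several propose -> filter -> train rounds and return the kept sets."""
    history: List[List[Question]] = []
    for round_idx in range(rounds):
        candidates = [
            Question(image_id, text, score(image_id, text))
            for image_id in image_ids
            for text in propose(image_id, round_idx)
        ]
        kept = select_questions(candidates, min_difficulty, max_overlap)
        train(kept)  # the kept questions become this round's training data
        history.append(kept)
    return history
```

In a real system the three callables would wrap the same VLM in different roles, and the overlap check would likely be replaced by whatever diversity measure the method actually uses; the sketch only shows how filtering for difficulty and filtering for diversity can be combined in one selection step.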
Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.