GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models
Quick Answer
The GRIP framework enhances Multimodal In-Context Learning (M-ICL) by using feedback from Large Multimodal Models (LMMs) to improve prompt retrieval, outperforming similarity-based methods on tasks like classification and visual question answering.
Quick Take
GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including the closed-source GPT-4o and Gemini, which enables scalable and cost-efficient deployment.
Key Points
- GRIP uses feedback from LMMs to enhance prompt retrieval effectiveness.
- It consistently outperforms similarity-based retrieval across classification, captioning, and VQA tasks.
- Gains are consistent on Qwen2.5-VL-7B, and the strongest gains appear in classification on Idefics2-8B.
- Retrievers trained on one model can be transferred to others without retraining.
- Code will be available upon acceptance of the paper.
Article Content
From source RSS / original summary
arXiv:2606.12744v1. Announce Type: new.
Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA).
Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance.
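The similarity-based baseline described above can be sketched as a top-k nearest-neighbor lookup. This is a minimal illustration, not code from the paper: the feature vectors are toy values, and the names `cosine` and `retrieve_top_k` are hypothetical. It assumes cosine similarity as the feature-space metric, which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_top_k(query, pool, k):
    """Return indices of the k pool items most similar to the query."""
    scored = [(cosine(query, feat), idx) for idx, feat in enumerate(pool)]
    # Sort by descending similarity; break ties by lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

# Toy image features (assumed values for illustration only).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.05]
top = retrieve_top_k(query, pool, k=2)
```

A retriever of this kind ranks candidates purely by closeness in feature space, which is the assumption the abstract questions.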
To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B.
Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
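One way to picture the feedback-guided contrastive training mentioned in the abstract is sketched below. The abstract does not give the labeling rule or the loss, so everything here is an assumption: candidates are labeled by whether they raise or lower a hypothetical LMM score relative to a no-context baseline, and the loss is a generic InfoNCE-style objective rather than GRIP's actual one. The names `label_candidates` and `contrastive_loss` are illustrative.

```python
import math

def label_candidates(base_score, scores_with_candidate):
    """Split candidate indices by whether adding them helped or hurt the LMM.

    base_score: hypothetical LMM score on the query with no in-context example.
    scores_with_candidate: hypothetical score when each candidate is prepended.
    Candidates that leave the score unchanged are ignored.
    """
    beneficial = [i for i, s in enumerate(scores_with_candidate) if s > base_score]
    detrimental = [i for i, s in enumerate(scores_with_candidate) if s < base_score]
    return beneficial, detrimental

def contrastive_loss(similarities, beneficial, detrimental, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    The loss is low when the retriever assigns higher query-candidate
    similarity to beneficial examples than to detrimental ones.
    """
    if not beneficial:
        raise ValueError("need at least one beneficial example")
    pos = sum(math.exp(similarities[i] / temperature) for i in beneficial)
    neg = sum(math.exp(similarities[i] / temperature) for i in detrimental)
    return -math.log(pos / (pos + neg))

# Toy feedback: candidates 0 and 3 help, 1 hurts, 2 is neutral.
beneficial, detrimental = label_candidates(0.4, [0.7, 0.2, 0.4, 0.9])

# Retriever similarities that agree vs. disagree with the feedback.
loss_aligned = contrastive_loss([0.9, 0.1, 0.5, 0.8], beneficial, detrimental)
loss_misaligned = contrastive_loss([0.1, 0.9, 0.5, 0.2], beneficial, detrimental)
```

In this sketch, minimizing the loss pushes the retriever to rank feedback-approved examples above harmful ones, even when the harmful ones are closer in raw feature space.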