GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

arXiv cs.CV·Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra

1d ago

~2 min read · 6/12/2026 · en

Quick Take

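GRIP (Guided Retrieval of In-context Prompts) is a learnable, vision-only retrieval framework for Multimodal In-Context Learning (M-ICL). It uses feedback from Large Multimodal Models (LMMs) to select in-context examples that actually improve predictions, rather than relying on feature-space similarity alone. On Qwen2.5-VL-7B it improves consistently over similarity-based retrieval across classification, captioning, and visual question answering (VQA), and its strongest gains come in classification on Idefics2-8B. Retrievers trained with feedback from one open LMM transfer to other models without retraining, including closed-source GPT-4o and Gemini, which supports scalable, cost-efficient deployment.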

Key Points

GRIP uses feedback from LMMs to enhance prompt retrieval effectiveness.
It improves consistently over similarity-based retrieval across classification, captioning, and VQA on Qwen2.5-VL-7B.
Its strongest gains are in classification on Idefics2-8B.
Retrievers trained with feedback from one open LMM transfer to other models without retraining, including closed-source GPT-4o and Gemini.
Code will be available upon acceptance of the paper.

Article Content

From source RSS / original summary

arXiv:2606.12744v1 (Announce Type: new)

Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA).

Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance.
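The similarity-based baseline described above can be sketched as a top-k nearest-neighbor lookup over feature vectors. This is a minimal illustration only: the function names, the toy two-dimensional features, and the choice of cosine similarity are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(query, pool, k):
    """Return ids of the k pool examples most similar to the query.

    pool is a list of (example_id, feature_vector) pairs (hypothetical layout).
    """
    scored = [(cosine(query, feat), ex_id) for ex_id, feat in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex_id for _, ex_id in scored[:k]]

# Toy pool of candidate in-context examples (made-up features).
pool = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
selected = top_k_similar([1.0, 0.05], pool, k=2)
print(selected)
```

Note that this ranking depends only on feature geometry. Nothing in it checks whether a selected example actually helps the LMM answer the query, which is the gap the abstract points to.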

To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
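The abstract does not specify how feedback is collected or which contrastive objective is used, so the following is only a rough sketch of the general idea. It assumes a candidate counts as beneficial when the LMM answers the query correctly with that candidate as context, and it swaps in a generic InfoNCE-style loss. The names `label_by_feedback`, `contrastive_loss`, and `stub_predict`, and the temperature value, are hypothetical.

```python
import math

def label_by_feedback(candidates, query_label, predict_with_example):
    """Split candidates by whether the model is correct when given each one.

    predict_with_example is a stand-in for an LMM call (assumption).
    """
    beneficial, detrimental = [], []
    for example in candidates:
        if predict_with_example(example) == query_label:
            beneficial.append(example)
        else:
            detrimental.append(example)
    return beneficial, detrimental

def contrastive_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the query embedding scores the
    positive higher than every negative (dot-product scores)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]

# Stub in place of a real LMM: "predicts" the label embedded in the id.
stub_predict = lambda example: example.split("_")[1]
good, bad = label_by_feedback(["ex_cat", "ex_dog"], "cat", stub_predict)

aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good, bad, aligned < misaligned)
```

In a trainable retriever, the embeddings would come from a vision encoder and the loss would be minimized by gradient descent. Here the vectors are fixed, so the sketch only shows how the loss ranks a well-aligned pairing below a poorly aligned one.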
