A Dataset for Dynamic Human Preferences for Vision Language Models
Quick Answer
This paper presents a new benchmark that evaluates how well Vision Language Models (VLMs) adapt to dynamic human preferences, meaning preferences supplied in-context at inference time, rather than measuring static capabilities.
Quick Take
The authors provide an automated pipeline that generates a multi-modal dataset and use it to assess state-of-the-art models, addressing the need for VLMs to follow context-specific user preferences supplied at inference time.
Key Points
- Introduces a benchmark for dynamic human preferences in VLMs.
- Focuses on real-time user preferences rather than static evaluations.
- Provides an automated pipeline for generating a multi-modal dataset.
- Evaluates state-of-the-art models on this novel benchmark.
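- Addresses a gap in existing vision-language benchmarks, which largely evaluate static capabilities and generally held preferences learned from training data.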
Article Excerpt
From source RSS / original summary
arXiv:2606.07653v1 (announce type: new). Abstract: Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to real-time preferences for different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data.
This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.
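To make "preferences that are passed in-context at inference time" concrete, here is a minimal Python sketch of how such an evaluation could be set up. It is not the paper's pipeline: the summary does not describe the dataset schema, prompt format, or metric, so `PreferenceItem`, `build_prompt`, `accuracy`, and the stand-in `always_first` model are all hypothetical names chosen for illustration, and the multiple-choice framing is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferenceItem:
    """One hypothetical benchmark item (schema assumed, not from the paper)."""
    image_path: str        # placeholder path; this sketch never loads the image
    preference: str        # the user's preference, stated in-context
    question: str
    options: List[str]
    preferred_index: int   # index of the option that matches the preference


def build_prompt(item: PreferenceItem) -> str:
    """State the preference in the prompt itself instead of relying on training."""
    lines = [
        f"User preference: {item.preference}",
        f"Question: {item.question}",
    ]
    lines += [f"({i}) {option}" for i, option in enumerate(item.options)]
    lines.append("Answer with the number of the best option for this user.")
    return "\n".join(lines)


def accuracy(items: List[PreferenceItem],
             model: Callable[[str, str], int]) -> float:
    """Fraction of items where the model picks the preference-matching option."""
    if not items:
        return 0.0
    correct = sum(
        model(item.image_path, build_prompt(item)) == item.preferred_index
        for item in items
    )
    return correct / len(items)


def always_first(image_path: str, prompt: str) -> int:
    """Stand-in for a real VLM call: ignores its inputs and picks option 0."""
    return 0


# Toy items for illustration only; they are not drawn from the paper's dataset.
items = [
    PreferenceItem("kitchen.jpg", "I am vegetarian.",
                   "Which dish should I cook?",
                   ["Vegetable curry", "Beef stew"], 0),
    PreferenceItem("kitchen.jpg", "I want a meal with meat.",
                   "Which dish should I cook?",
                   ["Vegetable curry", "Beef stew"], 1),
]

print(accuracy(items, always_first))
```

The two toy items share an image and question but carry opposite preferences, so a model that ignores the in-context preference cannot get both right. Swapping `always_first` for a function that calls an actual VLM would be the natural next step under these assumptions.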