Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs
Quick Answer
This study introduces a fine-grained preference optimization method for Large Vision-Language Models (LVLMs) in medical imaging, addressing limitations like sequence-level reward signals and static supervised fine-tuning.
Quick Take
The method combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective, aiming to improve clinical correctness and visual grounding. According to the abstract, it is validated on medical imaging tasks and clinical text generation benchmarks.
Key Points
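- Targets three limitations of DPO-style alignment in medical LVLMs: sequence-level reward signals, off-policy static SFT references, and missing visual grounding constraints.
- Introduces a bidirectional token-wise KL regularizer for token-level alignment.
- Adds a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses lacking visual evidence.
- Reports validation on medical imaging tasks and clinical text generation benchmarks.
- Builds on-policy preference pairs by minimally editing model outputs, correcting only clinically erroneous spans while preserving linguistic style.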
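The minimal-editing idea in the last point can be illustrated with a small sketch. It is not the authors' pipeline: the function name `edited_spans` and the example sentences are hypothetical, and the source does not say how the corrected response is produced. The sketch only shows how the differing spans between a model output and its minimally edited correction could be located with Python's standard `difflib`.

```python
import difflib


def edited_spans(rejected_tokens, chosen_tokens):
    """Return (rejected_span, chosen_span) pairs where the two token lists differ.

    Hypothetical helper: the model's own output is the rejected response and
    a minimally edited copy of it is the chosen response.
    """
    matcher = difflib.SequenceMatcher(
        a=rejected_tokens, b=chosen_tokens, autojunk=False
    )
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted spans
            spans.append((rejected_tokens[i1:i2], chosen_tokens[j1:j2]))
    return spans


# Invented example sentences, whitespace-tokenized for simplicity.
rejected = "There is a large pleural effusion in the left lung .".split()
chosen = "There is a small pleural effusion in the right lung .".split()

for wrong, fixed in edited_spans(rejected, chosen):
    print(wrong, "->", fixed)
```

Because both responses share every token outside the edited spans, a token-level objective could in principle concentrate its signal on those spans instead of on style differences.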
Article Content
Original summary from the source RSS feed (arXiv:2606.12590v1, announce type: new).
Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback.
Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features.
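To make limitation (1) concrete, here is a minimal sketch of the standard sequence-level DPO loss in plain Python. The function names and the per-token log-probability values are illustrative assumptions, not taken from the paper. The point is that the loss only sees the summed log-probability of each response, so a gain on a filler token and the same gain on a clinically critical token are indistinguishable.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss from per-token log-probs of policy and reference.

    Each argument is a list of per-token log-probabilities; they are summed
    into one sequence-level score, so token identity is lost.
    """
    chosen_margin = sum(pol_chosen) - sum(ref_chosen)
    rejected_margin = sum(pol_rejected) - sum(ref_rejected)
    z = beta * (chosen_margin - rejected_margin)
    return softplus(-z)  # equals -log(sigmoid(z))


# Invented numbers: the same 0.5 gain lands on a different token in each
# case, but the sequence sums are identical, so the loss is identical.
ref = [-1.5, -2.0, -0.5]
gain_on_filler_token = [-1.0, -2.0, -0.5]
gain_on_critical_token = [-1.5, -2.0, 0.0]
rejected = [-1.0, -1.0, -1.0]

loss_a = dpo_loss(gain_on_filler_token, ref, rejected, rejected)
loss_b = dpo_loss(gain_on_critical_token, ref, rejected, rejected)
print(math.isclose(loss_a, loss_b))
```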
Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style.
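The abstract does not give formulas for either component, so the following is only one plausible reading, sketched under stated assumptions. It assumes "bidirectional" means summing forward and reverse KL between policy and reference next-token distributions at each position, and it assumes the grounding term is a hinge on the gap between a response's log-probability given the clean image and given the lesion-corrupted image. The function names, the `margin` parameter, and the averaging over positions are all hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def bidirectional_tokenwise_kl(policy_dists, ref_dists):
    """Assumed form: mean over positions of KL(p||q) + KL(q||p).

    policy_dists and ref_dists are lists (one entry per token position) of
    next-token distributions over the same vocabulary.
    """
    per_token = [kl(p, q) + kl(q, p) for p, q in zip(policy_dists, ref_dists)]
    return sum(per_token) / len(per_token)


def visual_contrastive_penalty(logp_clean, logp_corrupted, margin=1.0):
    """Assumed hinge: penalize a response whose log-probability does not
    drop by at least `margin` when the lesion region is corrupted."""
    return max(0.0, margin - (logp_clean - logp_corrupted))


# Toy two-position, three-token-vocabulary example (invented numbers).
policy = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
reference = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
print(bidirectional_tokenwise_kl(policy, reference))
print(visual_contrastive_penalty(logp_clean=-2.0, logp_corrupted=-2.0))
```

Under this reading, a response that is equally likely with or without the lesion visible receives the full margin as penalty, while a response whose likelihood depends strongly on the lesion receives none.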
Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.