Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

arXiv cs.CV·Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

6/12/2026 · en

Quick Take

This study introduces a fine-grained, on-policy preference optimization method for medical Large Vision-Language Models (LVLMs). It targets three weaknesses of DPO-style alignment: sequence-level reward signals that weight clinically critical tokens the same as filler text, static supervised fine-tuning references that cause an off-policy distribution shift, and the absence of explicit visual grounding constraints. The method combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective, and builds preference pairs by minimally editing model-generated outputs. The authors report that experiments across medical imaging tasks and clinical text generation benchmarks validate the approach.

Key Points

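Targets three limitations of DPO-style alignment in medical LVLMs: sequence-level rewards, off-policy supervised fine-tuning references, and missing visual grounding constraints.
Introduces a bidirectional token-wise KL regularizer for fine-grained alignment.
Adds a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses lacking visual evidence.
Builds on-policy preference pairs by minimally editing model outputs, correcting only clinically erroneous spans while preserving linguistic style.
Validated through experiments on medical imaging tasks and clinical text generation benchmarks.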

Article Content

From source RSS / original summary

arXiv:2606.12590v1. Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback.

Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features.
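The first limitation above can be illustrated with the standard DPO loss, which only ever sees summed sequence log-probabilities. The sketch below is a minimal pure-Python illustration, not code from the paper: the token log-probabilities are made-up numbers, and the labels "filler" and "finding" for token positions are assumptions chosen for the example.

```python
import math

def sequence_logprob(token_logprobs):
    """Collapse per-token log-probabilities into one sequence-level score."""
    return sum(token_logprobs)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given per-token log-probs."""
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical three-token responses; position 2 stands in for a clinical finding.
ref = [-1.0, -1.0, -1.0]
gain_on_filler = [-0.5, -1.0, -1.0]   # policy improved on a filler token
gain_on_finding = [-1.0, -1.0, -0.5]  # policy improved on the finding token

loss_filler = dpo_loss(gain_on_filler, ref, ref, ref)
loss_finding = dpo_loss(gain_on_finding, ref, ref, ref)
print(loss_filler == loss_finding)  # the loss cannot tell the two cases apart
```

Because both responses have the same summed log-probability, the loss is identical regardless of which token improved, which is the sense in which a sequence-level signal treats critical and filler tokens alike.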

Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style.
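The abstract does not give the exact form of either objective, so the following is only a sketch under stated assumptions. It reads "bidirectional token-wise KL" as the sum of forward and reverse KL divergence at each token position, and it reads the visual-contrastive objective as a hinge penalty on the gap between a response's log-probability given the clean image and given the lesion-corrupted image. The function names, the optional per-token weights, and the margin value are all hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_token_kl(policy_dists, ref_dists, weights=None):
    """Assumed form: forward plus reverse KL per token position.

    Each element of policy_dists / ref_dists is one token's distribution
    over the vocabulary. Both must be strictly positive (e.g. softmax output).
    """
    if weights is None:
        weights = [1.0] * len(policy_dists)
    return [w * (kl(p, q) + kl(q, p))
            for w, p, q in zip(weights, policy_dists, ref_dists)]

def visual_contrastive_penalty(logp_clean, logp_corrupted, margin=1.0):
    """Assumed hinge form: penalize a response whose log-probability does not
    drop by at least `margin` when the lesion is corrupted in the image."""
    return max(0.0, margin - (logp_clean - logp_corrupted))

# Toy vocabulary of size 3, two token positions.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
per_token = bidirectional_token_kl([p, q], [q, q])

ignores_image = visual_contrastive_penalty(-2.0, -2.0)  # same score either way
uses_image = visual_contrastive_penalty(-1.0, -5.0)     # score drops without lesion
```

In this reading, the per-token list lets a regularizer act differently at each position instead of on the whole sequence, and the hinge term is nonzero only when the response is about as likely without the visual evidence as with it.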

Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.
