CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
Quick Take
CRAFT enhances medical image synthesis by aligning generated images with clinical criteria using a novel scoring system.
Key Points
- Introduces Clinical Alignment Score (CAS) for evaluating medical images.
- Implements reward-based adaptation using multimodal models.
- Demonstrates significant reduction in hallucination-like generations.
Abstract: Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5–34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by out-of-family evaluator analysis, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.
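The abstract's core mechanism, differentiable reward optimization against an alignment score, can be illustrated with a minimal sketch. Everything below is illustrative and not from the paper: the generator, the per-dimension scores, and the weights are toy stand-ins for a diffusion model and the four CAS dimensions, and finite differences replace autograd to keep the sketch dependency-free.

```python
import math

def generate(theta):
    # Hypothetical "generator": maps parameters to image features in (-1, 1).
    return [math.tanh(t) for t in theta]

def reward(features, weights):
    # Stand-in alignment score: weighted sum of per-dimension scores,
    # each rescaled from (-1, 1) to (0, 1). Mimics a multi-criteria reward.
    return sum(w * (0.5 * (f + 1.0)) for w, f in zip(weights, features))

def finetune(theta, weights, lr=0.1, steps=200, eps=1e-4):
    # Gradient ascent on the reward via central finite differences
    # (a real implementation would backpropagate through the reward model).
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up, down = list(theta), list(theta)
            up[i] += eps
            down[i] -= eps
            g = (reward(generate(up), weights)
                 - reward(generate(down), weights)) / (2 * eps)
            grad.append(g)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

weights = [0.25, 0.25, 0.25, 0.25]  # four equally weighted dimensions
theta0 = [0.0, -1.0, 0.5, 2.0]
theta1 = finetune(theta0, weights)
print(reward(generate(theta1), weights) > reward(generate(theta0), weights))
```

The design point the sketch makes is that because the reward is differentiable in the generator's parameters, alignment can be improved by direct optimization rather than by filtering samples after generation.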
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.12650 [cs.CV] (or arXiv:2605.12650v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12650 (arXiv-issued via DataCite, pending registration) |
Submission history
From: Yunsung Chung
[v1] Tue, 12 May 2026 18:56:34 UTC (8,802 KB)
— Originally published at arxiv.org
