CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
Quick Take
CRAFT enhances medical image synthesis by aligning generated images with clinical criteria using a novel scoring system.
Key Points
- Introduces Clinical Alignment Score (CAS) for evaluating medical images.
- Implements reward-based adaptation using multimodal models.
- Demonstrates significant reduction in hallucination-like generations.
Abstract: Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5–34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by out-of-family evaluator analysis, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.
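The abstract's core mechanism, differentiable reward optimization against an alignment score, can be illustrated with a minimal sketch. Everything below is illustrative and not from the paper: the generator, the per-dimension scores, and the weights are toy stand-ins for a diffusion model and the four CAS dimensions, and finite differences replace autograd to keep the sketch dependency-free.

```python
import math

def generate(theta):
    # Hypothetical "generator": maps parameters to image features in (-1, 1).
    return [math.tanh(t) for t in theta]

def reward(features, weights):
    # Stand-in alignment score: weighted sum of per-dimension scores,
    # each rescaled from (-1, 1) to (0, 1). Mimics a multi-criteria reward.
    return sum(w * (0.5 * (f + 1.0)) for w, f in zip(weights, features))

def finetune(theta, weights, lr=0.1, steps=200, eps=1e-4):
    # Gradient ascent on the reward via central finite differences
    # (a real implementation would backpropagate through the reward model).
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up, down = list(theta), list(theta)
            up[i] += eps
            down[i] -= eps
            g = (reward(generate(up), weights)
                 - reward(generate(down), weights)) / (2 * eps)
            grad.append(g)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

weights = [0.25, 0.25, 0.25, 0.25]  # four equally weighted dimensions
theta0 = [0.0, -1.0, 0.5, 2.0]
theta1 = finetune(theta0, weights)
print(reward(generate(theta1), weights) > reward(generate(theta0), weights))
```

The design point the sketch makes is that because the reward is differentiable in the generator's parameters, alignment can be improved by direct optimization rather than by filtering samples after generation.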
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.12650 [cs.CV] (or arXiv:2605.12650v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12650 (arXiv-issued via DataCite, pending registration) |
Submission history
From: Yunsung Chung
[v1] Tue, 12 May 2026 18:56:34 UTC (8,802 KB)
— Originally published at arxiv.org
