Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL
Quick Take
Diff-Instruct with Diffused Reward (DIDR) aligns one-step text-to-image generators with a reward by propagating the reward-tilted target distribution across all diffusion noise levels, rather than optimizing the reward only on the final image. In the reported experiments it Pareto-dominates existing one-step SDXL baselines, and on a 6B DiT backbone (Z-Image) it surpasses its 50-step teacher in preference alignment while using a single generation step.
Key Points
- DIDR is a data-free, trajectory-level alignment framework derived from Integral KL minimization.
- It propagates an RLHF-optimal reward-tilted clean-image distribution across all noise levels.
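- The objective induces the Diffused Reward Score (DRS), a reward-driven correction to the reference score function; the Diffused Reward Proxy (DRP) estimates it efficiently through differentiable short-step denoising.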
- In the reported experiments, DIDR consistently Pareto-dominates existing one-step SDXL baselines.
- On a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while using a single generation step.
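The first two points can be written out as a short sketch. The notation here is assumed for illustration and is not taken from the paper: p_ref is the reference clean-image distribution, r the reward, \beta a KL-regularization strength, q_t the forward noising kernel, p_{\theta,t} the noised marginal of the one-step generator, and w(t) a positive weighting. The tilted form is the standard closed-form solution of KL-regularized reward maximization; the paper's exact objective may differ in detail.

```latex
% Reward-tilted (RLHF-optimal) target on clean images.
p^{*}(x_0) \;\propto\; p_{\mathrm{ref}}(x_0)\,\exp\!\big(r(x_0)/\beta\big)

% The same forward noising kernel q_t applied to the tilted target
% gives a target marginal at every noise level t.
p^{*}_t(x_t) \;=\; \int q_t(x_t \mid x_0)\, p^{*}(x_0)\, \mathrm{d}x_0

% Integral KL: match the generator's noised marginals to the noised
% target across the whole trajectory, not only at t = 0.
\mathcal{L}_{\mathrm{IKL}}(\theta) \;=\; \int_0^T w(t)\,
  D_{\mathrm{KL}}\!\big(p_{\theta,t} \,\|\, p^{*}_t\big)\, \mathrm{d}t
```

Because every p^{*}_t is obtained from the same p^{*} by noising, driving each KL term to zero is consistent with matching p^{*} on clean images, which is in line with the abstract's statement that the objective shares its minimizer with clean-image RLHF.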
Article Content
arXiv:2605.24001v1. Abstract: Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics.
As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory.
We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines.
Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
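To make the "reward-driven correction to the reference score" concrete, here is a minimal one-dimensional sketch. It is a toy illustration, not the paper's DRS or DRP: it assumes a reference distribution N(0, 1), a linear reward r(x0) = a * x0, a temperature beta, and additive noising x_t = x0 + sigma * eps, all chosen only so that every quantity has a closed form. Under those assumptions it checks numerically that the score of the noised reward-tilted distribution equals the reference score plus the gradient of log E[exp(r(x0) / beta) | x_t], with the expectation taken under the reference posterior. All function names are hypothetical.

```python
import numpy as np


def ref_score(x_t, sigma):
    """Score of the noised reference N(0, 1 + sigma^2)."""
    return -x_t / (1.0 + sigma**2)


def tilted_score(x_t, sigma, a, beta):
    """Score of the noised tilted target.

    Tilting N(0, 1) by exp(a * x0 / beta) gives N(a / beta, 1);
    adding noise of variance sigma^2 gives N(a / beta, 1 + sigma^2).
    """
    return -(x_t - a / beta) / (1.0 + sigma**2)


def log_expected_tilt(x_t, sigma, a, beta, n_grid=4001):
    """log E[exp(r(x0) / beta) | x_t] under the reference posterior.

    For the Gaussian toy model the posterior p_ref(x0 | x_t) is
    N(x_t / (1 + sigma^2), sigma^2 / (1 + sigma^2)); the expectation
    is evaluated by a simple Riemann sum on a grid.
    """
    post_mean = x_t / (1.0 + sigma**2)
    post_std = np.sqrt(sigma**2 / (1.0 + sigma**2))
    x0 = np.linspace(post_mean - 10 * post_std, post_mean + 10 * post_std, n_grid)
    dx = x0[1] - x0[0]
    post_pdf = np.exp(-0.5 * ((x0 - post_mean) / post_std) ** 2)
    post_pdf /= post_std * np.sqrt(2.0 * np.pi)
    return np.log(np.sum(post_pdf * np.exp(a * x0 / beta)) * dx)


def reward_score_correction(x_t, sigma, a, beta, eps=1e-4):
    """Central finite difference of log_expected_tilt with respect to x_t."""
    upper = log_expected_tilt(x_t + eps, sigma, a, beta)
    lower = log_expected_tilt(x_t - eps, sigma, a, beta)
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    x_t, sigma, a, beta = 0.7, 1.5, 0.8, 0.5
    combined = ref_score(x_t, sigma) + reward_score_correction(x_t, sigma, a, beta)
    print("mismatch:", abs(combined - tilted_score(x_t, sigma, a, beta)))
```

In this toy setting the correction reduces to the constant (a / beta) / (1 + sigma^2), so it shrinks as the noise level grows. For image models no such closed form exists, which is presumably why the abstract introduces an estimator based on short-step denoising.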
