Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
Quick Take
Hierarchical Token GRPO enhances reinforcement learning for diffusion multi-modal large language models in image generation.
Key Points
- Proposes a Sketch-Then-Paint training scheme.
- Implements a Hierarchical Credit Assignment mechanism.
- Demonstrates significant improvements in image quality and human preference.
Authors: Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu
Abstract: Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
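To make the credit-assignment idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The group normalization is the usual GRPO-style step of centering and scaling rewards within one prompt's group of samples. Everything about the hierarchy is an assumption made for illustration: the stage boundaries (`boundaries`), the per-stage weights (`STAGE_WEIGHTS`), and the rule that a token's stage is decided by the step at which it was unmasked. The abstract only states that there are three stages (global, structure, refinement) and that structural tokens are prioritized; it does not give these values.

```python
from statistics import mean, pstdev

# Hypothetical per-stage weights (illustration only, not taken from the paper):
# tokens unmasked early get more credit than late detail tokens.
STAGE_WEIGHTS = {"global": 1.5, "structure": 1.0, "refinement": 0.5}


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one group.

    `rewards` holds one scalar reward per image sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def stage_of(step, total_steps, boundaries=(0.2, 0.6)):
    """Map an unmasking step to a stage name.

    `boundaries` are assumed fractions of the unmasking schedule that
    separate the three stages; the source does not specify them.
    """
    frac = step / total_steps
    if frac < boundaries[0]:
        return "global"
    if frac < boundaries[1]:
        return "structure"
    return "refinement"


def token_advantages(sample_advantage, unmask_steps, total_steps):
    """Spread one sample-level advantage over its tokens, scaled by stage.

    `unmask_steps[i]` is the step at which token i was revealed.
    """
    return [
        sample_advantage * STAGE_WEIGHTS[stage_of(step, total_steps)]
        for step in unmask_steps
    ]


if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    print(advantages)
    print(token_advantages(advantages[0], unmask_steps=[0, 3, 9], total_steps=10))
```

A uniform-reward baseline corresponds to setting every entry of `STAGE_WEIGHTS` to 1.0, which is the behavior the abstract argues against. In this sketch the weights are fixed constants, whereas a real method could just as well derive them from the model or the reward.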
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.16842 [cs.AI] (or arXiv:2605.16842v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16842 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Jianghan Shen
[v1] Sat, 16 May 2026 06:59:54 UTC (4,879 KB)
Originally published at arxiv.org