Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

arXiv cs.AI·Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

1d ago

·~2 min·5/19/2026·en·2

Quick Take

Hierarchical Token GRPO enhances reinforcement learning for diffusion multi-modal large language models in image generation.

Key Points

Proposes a Sketch-Then-Paint training scheme.
Implements a Hierarchical Credit Assignment mechanism.
Demonstrates significant improvements in image quality and human preference.

📖 Reader Mode

~2 min read

[Submitted on 16 May 2026]

Authors:Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

View PDF HTML (experimental)

Abstract:Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16842 [cs.AI]
	(or arXiv:2605.16842v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.16842 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jianghan Shen [view email]
[v1] Sat, 16 May 2026 06:59:54 UTC (4,879 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Quick Take

Key Points

📖 Reader Mode

Submission history

More from arXiv cs.AI

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Related in this space

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?