Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
Quick Take
CAPR (Cached-Amortized Path Refinement) improves reinforcement learning for diffusion language models (dLLMs) by summarizing denoising traces into compact path states. It sets a new state of the art for RL-tuned dLLMs and, on Sudoku, matches the strongest tree-structured baseline at less than one third of the per-step compute. Rollout generation costs roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts.
Key Points
- CAPR reduces rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings).
- Achieves new state of the art for RL-tuned dLLMs on benchmarks like Sudoku and Math500.
- Utilizes cached trajectory states for efficient sibling continuation generation.
- Records path-state and block-progress features for improved local supervision.
- On Sudoku, matches the strongest tree-structured baseline at less than one third of the per-step compute.
Article Content
From source RSS / original summary: arXiv:2606.04396v1. Abstract: Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory.
Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute.
We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block.
This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets.
On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
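The abstract says the final outcome reward is redistributed across blocks according to the tokens revealed in each block, but it does not give the exact rule. The sketch below assumes the simplest reading, a split proportional to each block's revealed-token count. The function name `redistribute_reward`, its arguments, and the even-split fallback are hypothetical illustrations, not the paper's implementation.

```python
from typing import List


def redistribute_reward(final_reward: float, tokens_revealed: List[int]) -> List[float]:
    """Split one outcome reward across blocks by revealed-token count.

    Assumption: shares are proportional to the number of tokens each
    block reveals. The source only says "according to the tokens
    revealed in each block", so the actual rule may differ.
    """
    if any(t < 0 for t in tokens_revealed):
        raise ValueError("token counts must be non-negative")
    if not tokens_revealed:
        return []
    total = sum(tokens_revealed)
    if total == 0:
        # Assumed fallback: no tokens revealed anywhere, so split evenly.
        return [final_reward / len(tokens_revealed)] * len(tokens_revealed)
    return [final_reward * t / total for t in tokens_revealed]


# Three blocks revealing 8, 16, and 8 tokens share a reward of 1.0.
print(redistribute_reward(1.0, [8, 16, 8]))  # [0.25, 0.5, 0.25]
```

Under this assumption the per-block shares always sum to the original reward. In a setup like the one described, such shares could serve as regression targets for a block-level value head.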