Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
Quick Take
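Diffusion LLMs can decode in parallel with fewer steps and no loss in quality when decoding is revokable: tokens are drafted aggressively, verified against richer context, and re-masked when unreliable.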
Key Points
- WINO enables revokable parallel generation for improved quality.
- WINO+ integrates verified denoising trajectories into model training.
- On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction; WINO+ reaches 76.58% with a 6.83x reduction.
Abstract: Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at this https URL.
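The draft, verify, re-mask loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how verification is performed, so everything here is an assumption made for illustration. In particular, `toy_predict`, `TARGET`, the thresholds `tau_draft` and `tau_verify`, and the `revoke` flag are hypothetical names, and the "model" is a hand-written rule in which even positions are easy and odd positions are only predicted correctly once their left neighbour has been revealed.

```python
MASK = -1
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical "correct" sequence for the toy model


def toy_predict(seq, i):
    """Hypothetical stand-in for a DLLM forward pass at position i.

    Returns (token, confidence). Even positions are "easy". Odd positions are
    "context-dependent": they are only right if the left neighbour is correct.
    """
    if i % 2 == 0 or seq[i - 1] == TARGET[i - 1]:
        return TARGET[i], 0.95
    # Without the needed context the toy model guesses wrong, with moderate confidence.
    return (TARGET[i] + 1) % 10, 0.6


def wino_decode(length, tau_draft=0.5, tau_verify=0.8, revoke=True, max_steps=50):
    """Toy revokable parallel decoding: lenient drafting, strict verification."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        steps += 1
        # Wide-in: draft every masked position in parallel with a lenient threshold.
        snapshot = list(seq)
        for i in range(length):
            if snapshot[i] == MASK:
                tok, conf = toy_predict(snapshot, i)
                if conf >= tau_draft:
                    seq[i] = tok
        if not revoke:
            continue  # irreversible decoding: drafted tokens are never revisited
        # Narrow-out: re-score revealed tokens against the enriched context
        # and re-mask any that no longer pass the stricter threshold.
        enriched = list(seq)
        for i in range(length):
            if enriched[i] != MASK:
                tok, conf = toy_predict(enriched, i)
                if tok != enriched[i] or conf < tau_verify:
                    seq[i] = MASK
    return seq, steps


if __name__ == "__main__":
    print("revokable:   ", wino_decode(len(TARGET)))
    print("irreversible:", wino_decode(len(TARGET), revoke=False))
```

In this toy setting the irreversible variant finishes in a single step but commits the wrong tokens at the odd positions, while the revokable variant re-masks them and recovers `TARGET` in a second step. The sketch only illustrates the trade-off the abstract describes; the thresholds and the toy rule carry no relation to the reported results.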
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.16941 [cs.CL] |
| | (or arXiv:2605.16941v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.16941 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Fanqin Zeng
[v1] Sat, 16 May 2026 11:27:40 UTC (849 KB)
— Originally published at arxiv.org