Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
Quick Answer
This paper shows that Diffusion Large Language Models (DLLMs) such as LLaDA and MMaDA can decode both faster and more accurately with two methods, WINO and WINO+. On GSM8K, WINO raises accuracy from 73.24% to 75.82% with a 6.10x reduction in decoding steps, and WINO+ reaches 76.58% with a 6.83x reduction.
Quick Take
WINO makes parallel decoding revokable at inference time, and WINO+ trains the model on the denoising orders that WINO has verified. Together they let a DLLM learn which tokens to reveal early and which to defer, so parallel generation speeds up without the usual loss in quality.
Key Points
- WINO is a training-free decoding algorithm that makes parallel generation revokable: it drafts multiple tokens, verifies them, and re-masks unreliable ones.
- WINO+ injects the denoising trajectories verified by WINO into the model parameters, aligning training with efficient inference.
- On GSM8K, WINO+ improved accuracy from 73.24% to 76.58% with a 6.83x reduction in decoding steps.
- On Flickr30K, WINO+ achieved a 16.22x step reduction with improved CIDEr scores.
- DLLMs can self-optimize by discovering and following reliable denoising orders.
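To put the GSM8K figures side by side, the short Python sketch below computes the accuracy gain in percentage points and the step count implied by each reported reduction factor. The accuracy and reduction numbers are the ones reported for the paper; the baseline of 256 decoding steps is a hypothetical value chosen only for illustration, since the summary does not state the baseline step count.

```python
# GSM8K figures as reported for the paper: (accuracy %, step reduction factor).
BASELINE_ACC = 73.24
REPORTED = {"WINO": (75.82, 6.10), "WINO+": (76.58, 6.83)}


def summarize(baseline_steps: float) -> dict:
    """Accuracy gain (points) and implied step count for each method."""
    rows = {}
    for name, (acc, reduction) in REPORTED.items():
        rows[name] = {
            "acc_gain_points": round(acc - BASELINE_ACC, 2),
            "implied_steps": round(baseline_steps / reduction, 1),
        }
    return rows


# 256 baseline steps is an assumption for illustration, not a reported value.
summary = summarize(baseline_steps=256)
for name, row in summary.items():
    print(name, row)
```

Under that assumed baseline, the two reduction factors correspond to roughly 42 and 37.5 steps, while the accuracy gains are 2.58 and 3.34 points over the 73.24% baseline.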
Abstract: Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at this https URL.
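The draft, verify, and re-mask cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model`, the `EASY` positions, the two thresholds `draft_tau` and `verify_tau`, and the choice to verify a token by masking it out and re-predicting it from the other drafted tokens are all assumptions made for illustration. The toy model peeks at a fixed target sentence so that the behaviour is deterministic.

```python
from typing import List, Tuple

MASK = "[MASK]"
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # toy ground truth
EASY = {0, 3}  # positions the toy model gets right with no context


def toy_model(seq: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for one DLLM forward pass.

    Returns a (token, confidence) guess for every position. A position is
    predicted correctly only if it is "easy" or its left neighbour is
    already correct; otherwise the guess is wrong but moderately confident.
    """
    preds = []
    for i in range(len(seq)):
        left_ok = i > 0 and seq[i - 1] == TARGET[i - 1]
        if i in EASY or left_ok:
            preds.append((TARGET[i], 0.95))
        else:
            preds.append(("<wrong>", 0.60))
    return preds


def decode(length: int, draft_tau: float = 0.5, verify_tau: float = 0.9,
           revoke: bool = True, max_steps: int = 50) -> Tuple[List[str], int]:
    """Parallel decoding; with revoke=True, unreliable drafts are re-masked."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        steps += 1
        # Wide-in: draft every masked position that clears the loose threshold.
        for i, (tok, conf) in enumerate(toy_model(seq)):
            if seq[i] == MASK and conf >= draft_tau:
                seq[i] = tok
        if not revoke:
            continue  # irreversible decoding: drafts are final
        # Narrow-out: re-predict each revealed token from the other drafted
        # tokens and re-mask it unless the strict threshold confirms it.
        drafted = list(seq)
        for i in range(length):
            if drafted[i] == MASK:
                continue
            context = drafted[:i] + [MASK] + drafted[i + 1:]
            tok, conf = toy_model(context)[i]
            if tok != drafted[i] or conf < verify_tau:
                seq[i] = MASK
    return seq, steps


revokable, revokable_steps = decode(len(TARGET), revoke=True)
greedy, greedy_steps = decode(len(TARGET), revoke=False)
print(revokable, revokable_steps)  # recovers TARGET in 3 steps
print(greedy, greedy_steps)        # finishes in 1 step but keeps wrong tokens
```

In this toy setting, irreversible decoding with the loose threshold finishes in a single step but commits four wrong tokens, while the revokable variant recovers the full target in 3 steps instead of the 6 that one-token-per-step decoding would need. The order in which tokens survive verification (easy positions first, context-dependent ones later) is the kind of trajectory the abstract says WINO+ distills into the model.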
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.16941 [cs.CL] (or arXiv:2605.16941v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16941 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Fanqin Zeng
[v1] Sat, 16 May 2026 11:27:40 UTC (849 KB)
Originally published at arxiv.org