Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

arXiv cs.CL·Fanqin Zeng, Feng Hong, Geng Yu, Huangjie Zheng, Xiaofeng Cao, Ya Zhang, Bo Han, Yanfeng Wang, Jiangchao Yao

5/19/2026

Quick Answer

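This paper introduces WINO, a training-free decoding algorithm that makes parallel generation in Diffusion Large Language Models (DLLMs) revokable, and WINO+, a training-time extension that distills WINO's verified denoising trajectories into the model. In experiments on LLaDA and MMaDA, both methods improve quality and efficiency: on GSM8K, WINO raises accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ reaches 76.58% with a 6.83x reduction.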

Quick Take

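Open-source DLLMs lose quality when they reveal many tokens per decoding step. The authors attribute this to a train-inference mismatch amplified by irreversible decoding: training reconstructs tokens from randomly corrupted states, while efficient inference needs an adaptive order that reveals easy tokens first and defers context-dependent ones. WINO addresses the inference side by drafting tokens aggressively, verifying them, and re-masking unreliable ones. WINO+ addresses the training side by teaching the model the denoising order that WINO discovers.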

Key Points

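WINO enables revokable parallel generation by drafting multiple tokens, verifying them with enriched global context, and re-masking unreliable ones for later refinement.
WINO+ injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference.
On GSM8K, WINO+ raises accuracy from 73.24% to 76.58% with a 6.83x reduction in decoding steps (WINO alone: 75.82% with a 6.10x reduction).
On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr.
DLLMs can act as their own efficiency teachers: they first discover reliable denoising orders through revokable decoding, then learn to follow them.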
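
To put the reported GSM8K numbers side by side, the short calculation below derives the accuracy gain in percentage points and the step count implied by each reduction factor. The accuracies and reduction factors come from the abstract; the baseline budget of 256 decoding steps is a purely hypothetical figure chosen for illustration and is not stated in the source.

```python
# Reported GSM8K figures from the abstract: (accuracy %, step-reduction factor).
BASELINE_ACC = 73.24
RESULTS = {"WINO": (75.82, 6.10), "WINO+": (76.58, 6.83)}


def summarize(baseline_steps: int = 256) -> dict:
    """Return accuracy gain (percentage points) and implied steps per method.

    baseline_steps is a hypothetical budget, not a number from the paper.
    """
    rows = {}
    for name, (acc, reduction) in RESULTS.items():
        rows[name] = {
            "gain_pp": round(acc - BASELINE_ACC, 2),
            "implied_steps": round(baseline_steps / reduction, 1),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(name, row)
```

Under that assumed budget, a 6.10x reduction corresponds to roughly 42 steps and a 6.83x reduction to roughly 37.5, while the accuracy gains over the baseline are 2.58 and 3.34 percentage points.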

[Submitted on 16 May 2026]

Abstract: Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at this https URL.
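
The draft, verify, and re-mask cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predict`, `TARGET`, `EASY`, the two thresholds, and the rule "re-mask when the re-scored token differs or falls below the verify threshold" are all hypothetical stand-ins for a real DLLM and for WINO's actual verification procedure.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # stands in for the [MASK] token
Tokens = List[Optional[str]]
Predictor = Callable[[Tokens, int], Tuple[str, float]]

TARGET = ["the", "cat", "sat", "on", "the", "mat"]
EASY = {0, 3, 4}  # hypothetical positions the toy model gets right without context


def toy_predict(seq: Tokens, i: int) -> Tuple[str, float]:
    """Hypothetical stand-in for a DLLM: (token, confidence) for position i."""
    if i in EASY:
        return TARGET[i], 0.95
    if i > 0 and seq[i - 1] is not MASK:
        return TARGET[i], 0.90  # left context revealed: correct and confident
    return "<guess>", 0.65  # no context: wrong, but moderately confident


def decode(length: int, predict: Predictor, draft_tau: float = 0.6,
           verify_tau: float = 0.8, revoke: bool = True,
           max_steps: int = 50) -> Tuple[Tokens, int]:
    """Parallel decoding; with revoke=True, unreliable tokens are re-masked."""
    seq: Tokens = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:  # max_steps guards against cycling
        steps += 1
        # Draft: propose every masked position, keep those above a lenient threshold.
        proposals = {i: predict(seq, i) for i in range(length) if seq[i] is MASK}
        drafted = [i for i, (_, conf) in proposals.items() if conf >= draft_tau]
        if not drafted:  # always reveal at least the most confident proposal
            drafted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in drafted:
            seq[i] = proposals[i][0]
        if not revoke:
            continue
        # Verify: re-score each revealed token against a snapshot of the richer
        # context, and re-mask it if the prediction changed or confidence is low.
        snapshot = list(seq)
        for i, tok in enumerate(snapshot):
            if tok is MASK:
                continue
            context = snapshot[:i] + [MASK] + snapshot[i + 1:]
            new_tok, conf = predict(context, i)
            if new_tok != tok or conf < verify_tau:
                seq[i] = MASK
    return seq, steps


if __name__ == "__main__":
    print(decode(len(TARGET), toy_predict, revoke=False))
    print(decode(len(TARGET), toy_predict, revoke=True))
```

In this toy setting, irreversible drafting finishes in one step but leaves three `<guess>` tokens in place, whereas the revokable variant recovers the full target in three steps, compared with six steps for one-token-per-step decoding. The context-dependent positions end up revealed after the easy ones, which is the kind of adaptive order the abstract describes.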

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.16941 [cs.CL]
	(or arXiv:2605.16941v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16941 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Fanqin Zeng
[v1] Sat, 16 May 2026 11:27:40 UTC (849 KB)

— Originally published at arxiv.org


