Improved Large Language Diffusion Models

arXiv cs.CL·Shen Nie, Qiyang Min, Shaoxuan Xu, Zihao Huang, Yuxuan Song, Yong Shan, Yankai Lin, Wayne Xin Zhao, Chongxuan Li, Ji-Rong Wen

·~1 min·6/25/2026·en·0

Quick Answer

iLLaDA, an 8B masked diffusion language model, outperforms its predecessor LLaDA across general, mathematical, and code benchmarks; for example, iLLaDA-Base gains 21.6 points on BBH and iLLaDA-Instruct gains 14.5 points on MATH.

Quick Take

iLLaDA is an 8B masked diffusion language model trained from scratch with fully bidirectional attention. It improves on its predecessor LLaDA by 21.6 points on BBH (iLLaDA-Base) and 14.5 points on MATH (iLLaDA-Instruct), and it remains competitive with Qwen2.5 7B on several benchmarks despite its non-autoregressive training.

Key Points

iLLaDA is pre-trained from scratch on 12T tokens, then fine-tuned on a 25B-token instruction corpus for 12 epochs.
Compared with LLaDA, iLLaDA-Base improves by 14.9 points on ARC-Challenge and iLLaDA-Instruct improves by 16.5 points on HumanEval.
The model uses variable-length generation for efficiency and confidence-based scoring for multiple-choice evaluation.
iLLaDA remains competitive with Qwen2.5 7B on several benchmarks despite its non-autoregressive training.

Article Excerpt

From source RSS / original summary

arXiv:2606.25331v1 Announce Type: new Abstract: Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs.
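The abstract names the masked diffusion objective but gives no formula. As a minimal sketch, assuming iLLaDA follows the formulation published for the original LLaDA (sample a mask ratio t, mask each token independently with probability t, and weight the cross-entropy on masked positions by 1/t), the loss for one sequence could be estimated as below. The names `MASK`, `mask_sequence`, `masked_diffusion_loss`, and `uniform_log_prob` are hypothetical, the per-token normalization is a choice made for this sketch, and the uniform predictor is only a stand-in for a real bidirectional model.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token used only in this sketch


def mask_sequence(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, log_prob_fn):
    """Monte Carlo estimate of the loss for one sequence at mask ratio t.

    log_prob_fn(noisy, i, token) stands in for the model: it returns
    log p(token at position i | noisy), with access to the whole
    (bidirectional) noisy sequence.
    """
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:  # only masked positions contribute
            total -= log_prob_fn(noisy, i, clean)
    # 1/t reweighting; dividing by the length is an assumption of this sketch
    return total / (t * len(tokens))


def uniform_log_prob(noisy, i, token, vocab_size=4):
    """Toy predictor: every token in a 4-word vocabulary is equally likely."""
    return -math.log(vocab_size)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["a", "b", "c", "d"]
    t = rng.uniform(0.1, 1.0)  # mask ratio, kept away from 0 to avoid dividing by ~0
    noisy = mask_sequence(tokens, t, rng)
    print(noisy, masked_diffusion_loss(tokens, noisy, t, uniform_log_prob))
```

Unlike a causal language-model loss, nothing here depends on token order: any position can be masked, and the predictor sees context on both sides of it.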

We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks.
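The abstract does not say how confidence-based scoring is computed. Purely as a hypothetical illustration, and not the paper's method, one generic way to score multiple-choice options with a masked model is to mask every token of a candidate answer, read off the probability the model assigns to each true token, and choose the option with the highest mean log-probability. The functions `score_option`, `pick_answer`, and `toy_token_prob` below are assumptions made for this sketch.

```python
import math


def score_option(prompt, option_tokens, token_prob_fn):
    """Mean log-probability of the option's tokens when all are masked.

    token_prob_fn(prompt, n_masked, i, token) stands in for the model: it
    returns p(token at masked slot i | prompt followed by n_masked masks).
    """
    n = len(option_tokens)
    log_probs = [
        math.log(token_prob_fn(prompt, n, i, tok))
        for i, tok in enumerate(option_tokens)
    ]
    return sum(log_probs) / n  # length-normalized so long options are not penalized


def pick_answer(prompt, options, token_prob_fn):
    """Return the label of the highest-scoring option and all scores."""
    scores = {
        label: score_option(prompt, toks, token_prob_fn)
        for label, toks in options.items()
    }
    return max(scores, key=scores.get), scores


def toy_token_prob(prompt, n_masked, i, token):
    """Toy model that is confident about exactly one token."""
    return 0.9 if token == "paris" else 0.1


if __name__ == "__main__":
    options = {"A": ["paris"], "B": ["rome"], "C": ["new", "york"]}
    best, scores = pick_answer(["capital", "of", "france", "?"], options, toy_token_prob)
    print(best, scores)
```

Length normalization is one design choice among several; summing log-probabilities instead would favor shorter options.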

These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: https://github.com/ML-GSAI/LLaDA
