Improved Large Language Diffusion Models
Quick Answer
This paper shows that iLLaDA, an 8B masked diffusion language model, outperforms its predecessor LLaDA across general, mathematical, and code benchmarks, with gains of 21.6 points on BBH (iLLaDA-Base) and 14.5 points on MATH (iLLaDA-Instruct).
Quick Take
iLLaDA is trained from scratch with fully bidirectional attention and keeps the masked diffusion objective through both pre-training and supervised fine-tuning. It improves on LLaDA by 21.6 points on BBH and 14.5 points on MATH, and it remains competitive with Qwen2.5 7B on several benchmarks despite its non-autoregressive training.
Key Points
- iLLaDA is pre-trained from scratch on 12T tokens and fine-tuned on a 25B-token instruction corpus for 12 epochs.
- Compared with LLaDA, iLLaDA-Base gains 14.9 points on ARC-Challenge and iLLaDA-Instruct gains 16.5 points on HumanEval.
- The model uses variable-length generation for efficiency and confidence-based scoring for multiple-choice evaluation.
- iLLaDA remains competitive with Qwen2.5 7B on several benchmarks despite its non-autoregressive training.
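The summary does not spell out how confidence-based scoring is computed. One plausible reading, offered here only as a sketch, is that the model assigns a probability to each candidate answer's tokens and the option with the highest length-normalized log-probability is selected. The function names and the input layout below (`option_token_probs`, a mapping from option label to per-token probabilities) are hypothetical and are not taken from the paper or its repository.

```python
import math


def score_options(option_token_probs):
    """Return the mean log-probability of each option's tokens.

    `option_token_probs` is a hypothetical input: for every answer label,
    the probabilities a model assigned to that answer's tokens.
    Averaging (rather than summing) avoids penalizing longer answers.
    """
    scores = {}
    for label, probs in option_token_probs.items():
        scores[label] = sum(math.log(p) for p in probs) / len(probs)
    return scores


def pick_answer(option_token_probs):
    """Pick the option whose tokens the model is most confident about."""
    scores = score_options(option_token_probs)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    example = {"A": [0.2, 0.5], "B": [0.6, 0.7], "C": [0.1, 0.9]}
    print(pick_answer(example))
```

Length normalization is a design choice in this sketch; a sum of log-probabilities would instead favor shorter options.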
Article Excerpt
From source RSS / original summary
arXiv:2606.25331v1
Abstract: Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs.
We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks.
These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: https://github.com/ML-GSAI/LLaDA.
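The masked diffusion objective mentioned in the abstract can be illustrated with a toy sketch. It assumes the formulation from the original LLaDA work: draw a masking ratio t, mask each token independently with probability t, and take cross-entropy only on the masked positions, weighted by 1/t. The `MASK` symbol, the `predict_proba` callback standing in for the bidirectional model, and the normalization by sequence length are assumptions made for illustration; the excerpt does not give these details for iLLaDA.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol


def forward_mask(tokens, t, rng):
    """Replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, predict_proba, t):
    """One Monte Carlo term of the assumed objective (requires t > 0).

    Cross-entropy is accumulated only where the input was masked,
    then scaled by 1/t and divided by the sequence length.
    """
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            # The model sees the whole noisy sequence (bidirectional context).
            total -= math.log(predict_proba(noisy, i)[clean])
    return total / (t * len(tokens))


def uniform_model(vocab):
    """Toy stand-in for a real network: uniform distribution over vocab."""
    p = 1.0 / len(vocab)

    def predict_proba(noisy, position):
        return {tok: p for tok in vocab}

    return predict_proba


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "cat", "sat", "down"]
    t = rng.uniform(0.1, 1.0)  # masking ratio for this training sample
    noisy = forward_mask(tokens, t, rng)
    model = uniform_model(sorted(set(tokens)))
    print(noisy, masked_diffusion_loss(tokens, noisy, model, t))
```

Because every position can attend to every other position, a single forward pass scores all masked tokens at once, which is the contrast with the causal, left-to-right factorization the abstract describes.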