DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Quick Take
DLLM-JEPA pairs Joint Embedding Predictive Architectures (JEPAs) with masked-diffusion language models, cutting training FLOPs by 33% relative to LLM-JEPA. It improves over diffusion-only fine-tuning in every (task, architecture) combination evaluated, by up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, and it preserves MMLU accuracy at base level in the LLaDA-8B Wide-t configuration.
Key Points
- DLLM-JEPA eliminates the need for explicit multi-view data in training.
- Gains up to +18.7 pp over diffusion-only fine-tuning on GSM8K with LLaDA-8B.
- Reduces training costs by requiring only a single gradient-carrying forward pass.
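- Improves over diffusion-only fine-tuning in every (task, architecture) combination evaluated, including Spider, NL-RX-SYNTH, and Django.
- Shows a dual-win property on LLaDA-8B (Wide-t configuration): GSM8K accuracy rises while held-out Wikitext loss falls below the pre-trained base and MMLU accuracy stays at base level.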
Article Content
From source RSS / original summary
arXiv:2606.00091v1 Announce Type: new
Abstract: Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step.
We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain.
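The idea of deriving two views of one input from different masking rates can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `MASK_ID` value, the default rates, and the independent per-token masking scheme are all assumptions made for illustration, and the abstract does not say which view serves as the prediction target.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id; a real tokenizer defines its own


def masked_view(token_ids, mask_rate, rng):
    """Mask each position independently with probability mask_rate.

    Returns the masked sequence and the boolean mask that was applied.
    """
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_rate
    view = np.where(mask, MASK_ID, token_ids)
    return view, mask


def two_views(token_ids, low_rate=0.15, high_rate=0.6, seed=0):
    """Build two views of the same sequence at different masking rates.

    The rates are illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    lightly_masked = masked_view(token_ids, low_rate, rng)
    heavily_masked = masked_view(token_ids, high_rate, rng)
    return lightly_masked, heavily_masked


if __name__ == "__main__":
    ids = np.arange(100, 120)
    (view_a, _), (view_b, _) = two_views(ids)
    print(view_a)
    print(view_b)
```

The stated 33% saving is consistent with a common rule of thumb in which a backward pass costs about twice a forward pass: two gradient-carrying passes then cost roughly 6 forward-equivalents, while one gradient-carrying pass plus one gradient-free pass costs roughly 4. That accounting, including the presence of a second gradient-free pass, is an assumption here; the abstract does not spell out how the figure is derived.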
Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
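The "geometric" side of this dissociation, how far each layer's weights move from the pre-trained base, could be measured along the lines of the sketch below, with the "functional" side taken as the change in held-out loss. The dictionary-of-arrays format, the layer names, and the choice of plain L2 distance are assumptions for illustration; the abstract does not specify the probing metric the authors use.

```python
import numpy as np


def layer_drift(base_weights, tuned_weights):
    """Return {layer name: L2 distance between fine-tuned and base weights}.

    Both arguments are hypothetical dicts mapping a layer name to an array
    of that layer's parameters, with matching shapes.
    """
    drift = {}
    for name, base in base_weights.items():
        diff = np.asarray(tuned_weights[name]) - np.asarray(base)
        drift[name] = float(np.linalg.norm(diff))
    return drift


def forgetting(base_loss, tuned_loss):
    """Held-out loss increase after fine-tuning; negative means less loss."""
    return tuned_loss - base_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = {f"layer{i}": rng.normal(size=(4, 4)) for i in range(3)}
    tuned = {k: v + 0.1 * rng.normal(size=v.shape) for k, v in base.items()}
    for name, dist in layer_drift(base, tuned).items():
        print(name, round(dist, 4))
```

Under this framing, the pattern described above would appear as larger per-layer distances for the DLLM-JEPA run than for the baseline, together with a smaller `forgetting` value on held-out Wikitext.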