DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

6/2/2026


Quick Answer

DLLM-JEPA pairs Joint Embedding Predictive Architectures (JEPAs) with masked-diffusion language models, which removes the need for explicit multi-view data and cuts training FLOPs by 33% relative to LLM-JEPA.

Quick Take

DLLM-JEPA integrates Joint Embedding Predictive Architectures (JEPAs) with masked-diffusion language models, cutting training FLOPs by 33% relative to LLM-JEPA. It improves over diffusion-only fine-tuning in every task and architecture combination evaluated, by up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent gains on Spider, NL-RX-SYNTH, and Django.

Key Points

DLLM-JEPA eliminates the need for explicit multi-view data in training.
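Improves accuracy over diffusion-only fine-tuning by up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K.
Reduces training cost by requiring only a single gradient-carrying forward pass per step, where LLM-JEPA requires two.
Improves over diffusion-only fine-tuning in every (task, architecture) combination evaluated, including Spider, NL-RX-SYNTH, and Django.
Demonstrates a dual-win property: on LLaDA-8B with the Wide-t configuration, GSM8K accuracy rises while held-out Wikitext loss falls below the pre-trained base.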
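
The 33% figure is consistent with a simple back-of-the-envelope count. The sketch below is not the paper's accounting. It assumes a backward pass costs roughly twice a forward pass (a common rule of thumb) and that DLLM-JEPA runs one extra forward pass without gradients for its second view. Neither assumption is stated in this summary, and `FWD`, `BWD`, and `step_cost` are hypothetical names.

```python
FWD = 1.0  # relative cost of one forward pass (assumed unit)
BWD = 2.0  # ASSUMPTION: backward pass costs about 2x a forward pass


def step_cost(grad_passes, nograd_passes):
    """Relative cost of one training step under the assumptions above."""
    return grad_passes * (FWD + BWD) + nograd_passes * FWD


# From the abstract: LLM-JEPA needs two gradient-carrying forward passes.
llm_jepa = step_cost(grad_passes=2, nograd_passes=0)

# ASSUMPTION: one gradient-carrying pass plus one gradient-free pass.
dllm_jepa = step_cost(grad_passes=1, nograd_passes=1)

saving = 1 - dllm_jepa / llm_jepa
print(f"{saving:.0%}")  # prints 33%
```

Under a different cost model the number would change, so this should be read only as a plausibility check on the reported saving.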


Article Content

From source RSS / original summary

arXiv:2606.00091v1 (Announce Type: new)

Abstract: Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step.

We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django.

Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain.
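
The two-view idea from the abstract, one input masked at two different rates, can be illustrated with a toy sketch. This is only an illustration under stated assumptions: the `[MASK]` string, the 0.15 and 0.6 rates, and the names `mask_view` and `two_views` are hypothetical, and the summary does not say which rates the paper uses or how the views enter the loss.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def mask_view(tokens, rate, rng):
    """Copy `tokens`, masking each position independently with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def two_views(tokens, low_rate=0.15, high_rate=0.6, seed=0):
    """Build a lightly masked and a heavily masked view of the same sequence.

    The default rates are assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    light = mask_view(tokens, low_rate, rng)
    heavy = mask_view(tokens, high_rate, rng)
    return light, heavy


tokens = "natalia sold 48 clips in april and half as many in may".split()
light_view, heavy_view = two_views(tokens)
print(light_view)
print(heavy_view)
```

Both views come from the same text, so no paired data such as text-code examples is required. In a real setup the views would be token ID tensors fed to a bidirectional model rather than lists of strings.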

Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
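
The geometric half of that dissociation, how far each layer has moved from its pre-trained weights, can be sketched as a per-layer distance. This is a minimal illustration, not the paper's probing method: the summary does not name the distance metric, so plain L2 distance is an assumption here, and the function name `layer_drift` and the toy weights are invented for the example.

```python
import math


def layer_drift(base, tuned):
    """L2 distance between base and fine-tuned weights, computed per layer.

    ASSUMPTION: L2 distance is used only for illustration.
    """
    return {
        name: math.sqrt(sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])))
        for name in base
    }


# Toy weights invented for illustration only (not values from the paper).
base = {"layer_0": [0.0, 0.0], "layer_1": [1.0, 1.0], "layer_2": [2.0, 2.0]}
tuned = {"layer_0": [0.0, 0.1], "layer_1": [4.0, 5.0], "layer_2": [2.0, 2.2]}

drift = layer_drift(base, tuned)
peak = max(drift, key=drift.get)  # layer with the largest drift
print(drift)
print(peak)  # prints layer_1
```

The functional half would be measured separately, for example as the change in held-out Wikitext loss, and the reported finding is that the two measures move in opposite directions.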


