Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

6/5/2026

Quick Answer

This paper shows that a hybrid pre-training objective combining JEPA-style latent-space prediction with masked language modeling (MLM) yields more uniform representations and a better semantic-to-lexical balance than MLM alone.

Quick Take

The hybrid objective pairs a JEPA-style latent-space prediction loss with masked language modeling (MLM) over a single shared encoder. Pre-trained on English Wikipedia and evaluated on five GLUE benchmarks, the hybrid model produced more uniform embeddings than a pure-MLM baseline (uniformity below -0.16 vs -0.05) while reaching similar linear-probe accuracy.

Key Points

Hybrid objective combines a JEPA-style latent-space prediction loss with standard MLM over a single shared encoder.
Produced significantly more uniform embeddings (uniformity below -0.16 vs -0.05 for pure MLM).
Showed richer spectral geometry under max pooling than pure MLM.
Matched pure MLM in linear-probe accuracy across five GLUE benchmarks.
The JEPA objective reshapes the latent space in ways that accuracy metrics alone do not capture.

Article Content

From source RSS / original summary

arXiv:2606.05173v1. Abstract: Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure.

Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100).
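The balancing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract only says that a learnable scalar balances the two objectives, so the sigmoid parameterization and the names `hybrid_loss`, `balance_logit`, and `balance_logit_grad` are assumptions made for this example, and the loss values are made up.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unconstrained scalar to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_loss(jepa_loss: float, mlm_loss: float, balance_logit: float) -> float:
    """Convex combination of the two losses.

    ASSUMPTION: the weight is sigmoid(balance_logit). The abstract does not
    state how the learnable scalar enters the objective.
    """
    w = sigmoid(balance_logit)
    return w * jepa_loss + (1.0 - w) * mlm_loss


def balance_logit_grad(jepa_loss: float, mlm_loss: float, balance_logit: float) -> float:
    """Gradient of hybrid_loss with respect to balance_logit."""
    w = sigmoid(balance_logit)
    return w * (1.0 - w) * (jepa_loss - mlm_loss)


if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    print(hybrid_loss(2.0, 4.0, 0.0))         # equal weighting of the two losses
    print(balance_logit_grad(2.0, 4.0, 0.0))  # negative: descent raises the JEPA weight
```

Under this particular parameterization, plain gradient descent on the scalar shifts weight toward whichever loss is currently smaller, so a practical version would probably need some constraint or regularizer on it. The abstract does not say how the paper handles this.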

Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity below -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance.
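To make the uniformity figures above more concrete, here is a small sketch of one widely used definition: the log of the mean Gaussian potential over pairs of L2-normalized embeddings, with temperature t = 2. The abstract does not say which definition or temperature the paper uses, so treat both as assumptions; the helper names and toy vectors are hypothetical. Lower (more negative) values mean the embeddings are spread more evenly over the unit sphere.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def max_pool(token_vectors):
    """Element-wise maximum over a sentence's token vectors."""
    return [max(column) for column in zip(*token_vectors)]


def uniformity(embeddings, t=2.0):
    """Log of the mean of exp(-t * squared distance) over all pairs.

    ASSUMPTION: this pairwise Gaussian-potential form with t = 2 is a common
    choice, not necessarily the metric used in the paper.
    """
    unit = [l2_normalize(e) for e in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(x, y)))
        for x, y in combinations(unit, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


if __name__ == "__main__":
    # Toy sentence embeddings: max-pool two tiny "sentences", then compare.
    sent_a = max_pool([[1.0, 0.0], [0.5, 0.0]])  # pools to [1.0, 0.0]
    sent_b = max_pool([[0.0, 0.2], [0.0, 1.0]])  # pools to [0.0, 1.0]
    print(uniformity([sent_a, sent_a]))  # identical vectors: fully collapsed
    print(uniformity([sent_a, sent_b]))  # orthogonal vectors: more spread out
```

Under this definition a value near zero indicates embeddings crowded into a small region of the sphere, which is why a more negative score is read as "more uniform".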

Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.

