Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning
Quick Answer
This paper shows that a hybrid pre-training objective combining JEPA-style latent-space prediction with masked language modeling (MLM) yields more uniform representations and a better semantic-to-lexical balance than MLM alone.
Quick Take
A hybrid pre-training objective pairs a JEPA-style latent-space prediction loss with standard masked language modeling (MLM) over a single shared encoder. Pre-trained on English Wikipedia, the hybrid model produced more uniform embeddings than a pure-MLM baseline (uniformity below -0.16 vs -0.05) while reaching similar linear-probe accuracy across five GLUE benchmarks.
Key Points
- Hybrid model combines JEPA-style latent prediction with standard MLM over one shared encoder, balanced by a learnable scalar.
- Achieved significantly more uniform embeddings (uniformity below -0.16 vs -0.05 for the MLM baseline).
- Demonstrated richer spectral geometry under max pooling compared to pure MLM.
- Maintained similar linear-probe accuracy across five GLUE benchmarks.
- Results suggest the JEPA objective reshapes the latent space in ways that accuracy metrics alone cannot capture.
Article Content
From source RSS / original summary
arXiv:2606.05173v1 (Announce Type: new)
Abstract: Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure.
Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100).
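The abstract does not spell out how the two losses are combined, so the following is only a minimal sketch under stated assumptions: the learnable scalar is passed through a sigmoid to give a weight between 0 and 1, the latent-space loss is a mean squared error, and the MLM loss is the usual cross-entropy at a masked position. The function names and the `balance_logit` parameter are hypothetical and do not come from the paper.

```python
import math

def mlm_loss(logits, target_id):
    """Cross-entropy for one masked position: -log softmax(logits)[target_id]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_id]

def latent_loss(predicted, target):
    """Mean squared error between a predicted and a target latent vector.
    ASSUMPTION: the paper's JEPA-style loss may use a different distance."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def hybrid_loss(l_mlm, l_latent, balance_logit):
    """Blend the two losses with weight lam = sigmoid(balance_logit).
    ASSUMPTION: a sigmoid-gated convex combination; the paper only says a
    learnable scalar balances the objectives."""
    lam = 1.0 / (1.0 + math.exp(-balance_logit))
    return lam * l_latent + (1.0 - lam) * l_mlm, lam

# Toy values: a 3-token vocabulary and 2-dimensional latents.
l_mlm = mlm_loss([2.0, 0.5, -1.0], target_id=0)
l_lat = latent_loss([0.1, 0.9], [0.0, 1.0])
total, lam = hybrid_loss(l_mlm, l_lat, balance_logit=0.0)
```

In a real training loop `balance_logit` would be a trainable parameter updated by the optimizer alongside the encoder weights. A plain convex combination like this one can drift toward whichever loss is smaller, so a practical implementation would likely need some regularization or a different parameterization.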
Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance.
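The abstract does not define its uniformity score. One widely used definition, from Wang and Isola (2020), is the log of the mean Gaussian potential between all pairs of L2-normalized embeddings, where more negative values indicate embeddings spread more evenly over the unit sphere. Assuming that definition (and the conventional temperature `t = 2`, also an assumption here), a small sketch looks like this:

```python
import math
from itertools import combinations

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def uniformity(embeddings, t=2.0):
    """log of the mean of exp(-t * ||u - v||^2) over all pairs of unit vectors.
    ASSUMPTION: Wang and Isola style metric; the paper may compute it differently."""
    unit = [l2_normalize(v) for v in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(unit, 2)
    ]
    return math.log(sum(potentials) / len(potentials))

# Embeddings that all point the same way are maximally non-uniform (score 0).
collapsed = uniformity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
# Embeddings spread around the circle score lower (more uniform).
spread = uniformity([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
```

Under this definition, the reported gap (below -0.16 for the hybrid encoder vs -0.05 for MLM) would mean the hybrid embeddings are less collapsed, consistent with the abstract's reading that they are more uniform.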
Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.