Language models struggle with compartmentalization
Quick Take
Language models can exhibit compartmentalization: they fail to identify and share statistical strength between distinct presentations of the same concept.
Key Points
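- In the worst case, LLMs learn parallel, redundant internal representations of each presentation of a concept, which saturates model capacity and lowers sample efficiency.
- For small models, early multilingual learning is nearly entirely compartmentalized.
- Every intervention studied shows a phase transition: its effectiveness depends on the number of distinct presentations.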
Abstract: In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.
| Comments: | 9 pages, 8 figures, plus 9 pages of appendices. Submitted to NeurIPS 2026. Code: this https URL. Eval data: this https URL |
| Subjects: | Computation and Language (cs.CL); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.19284 [cs.CL] (or arXiv:2605.19284v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19284 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Thomas Howe
[v1] Tue, 19 May 2026 03:02:46 UTC (361 KB)
— Originally published at arxiv.org