Toward LLMs Beyond English-Centric Development
Quick Take
Open-weight LLMs are heavily biased toward English, which suggests that dedicated per-language investment may become increasingly important for non-English language development.
Key Points
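- An analysis of sequences generated by open-weight LLMs shows a heavy bias toward English.
- Continual pre-training offers no cost advantage over training from scratch for a target language, even for improving cultural understanding.
- Future LLM development may need dedicated per-language investment rather than relying primarily on expanding English-centric resources.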
Abstract: Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.15613 [cs.CL] (or arXiv:2605.15613v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.15613 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Sho Takase
[v1] Fri, 15 May 2026 04:51:07 UTC (234 KB)
Originally published at arxiv.org