Toward LLMs Beyond English-Centric Development

arXiv cs.CL · Sho Takase, Ukyo Honda

5/18/2026 · en

Quick Take

Open-weight LLMs are heavily biased toward English, which suggests that dedicated per-language investment may become increasingly important for non-English development.

Key Points

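An analysis of sequences generated by open-weight LLMs shows a heavy bias toward English.
Continual pre-training offers no cost advantage over training from scratch for a target language, even for improving cultural understanding.
Future LLM development may depend on dedicated per-language investment rather than on expanding English-centric resources.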

[Submitted on 15 May 2026]

Abstract: Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.15613 [cs.CL]
	(or arXiv:2605.15613v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.15613 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sho Takase
[v1] Fri, 15 May 2026 04:51:07 UTC (234 KB)

— Originally published at arxiv.org
