Brain-LLM Alignment Tracks Training Data, Not Typology

arXiv cs.CL·Dongxin Guo, Jikun Wu, Siu Ming Yiu

9h ago

·~1 min·5/25/2026·en·0

Quick Take

Brain-LLM alignment is influenced by training data dominance rather than inherent language properties.

Key Points

Study uses fMRI data from 112 participants.
Chinese-dominant models align better with Chinese brains.
Syntactic processing shows greater typological gradients.

Article Content

From source RSS / original summary

arXiv:2605. 23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures.

Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.

3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent "English advantage" in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

Brain-LLM Alignment Tracks Training Data, Not Typology

Quick Take

Key Points

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Related in this space

Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems