Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
Quick Take
This study fine-tunes seven large language models (4B-671B parameters) on Arabic and finds no evidence of Semitic-specific transfer. Models with weak baselines improved dramatically across all languages, while models with strong baselines showed only marginal gains regardless of language family. The pattern points to task-format alignment rather than cross-lingual knowledge transfer.
Key Points
- Seven large language models (4B-671B parameters) were fine-tuned on Arabic.
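- No evidence of Semitic-specific transfer was found, across both dense and Mixture-of-Experts architectures.
- Weak-baseline models improved dramatically across all languages; strong-baseline models showed only marginal gains regardless of language family.
- In a chain-of-thought ablation, the models that benefited most from fine-tuning benefited equally from inference-time reasoning.
- Together, these results suggest that both fine-tuning and inference-time reasoning address task-format alignment rather than cross-lingual knowledge transfer.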
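The comparison behind these points can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the per-language score dictionaries, and the grouping into a `semitic` set are all assumptions made here, since this summary does not list the languages or metrics used.

```python
from statistics import mean


def accuracy_gains(baseline, finetuned):
    """Per-language gain: fine-tuned score minus baseline score.

    Both arguments are hypothetical dicts mapping a language code
    to a zero-shot reading-comprehension score for one model.
    """
    return {lang: finetuned[lang] - baseline[lang] for lang in baseline}


def family_gap(gains, semitic):
    """Mean gain on Semitic languages minus mean gain on the controls.

    `semitic` is an assumed set of language codes; every other key in
    `gains` is treated as a non-Semitic control. Both groups must be
    non-empty, otherwise statistics.mean raises StatisticsError.
    """
    related = [g for lang, g in gains.items() if lang in semitic]
    controls = [g for lang, g in gains.items() if lang not in semitic]
    return mean(related) - mean(controls)
```

Under these assumptions, Semitic-specific transfer would show up as a clearly positive `family_gap`. A gap near zero for weak-baseline and strong-baseline models alike, with large absolute gains only for the weak-baseline models, would match the pattern the abstract describes.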
Article Excerpt
From the source RSS summary (arXiv:2606.19346v1). Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B-671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family.
A chain-of-thought ablation reinforces this finding: the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.