Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Quick Answer
The study introduces PolyFact, a parallel multilingual QA dataset of 100K Wikidata-grounded facts across 12 languages, and uses it to improve cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B.
Quick Take
Of the training methods compared on PolyFact, reinforcement learning via Group Relative Policy Optimization (GRPO) consistently outperforms supervised fine-tuning (SFT), improving both cross-lingual consistency and generalization to unseen languages.
Key Points
- PolyFact contains 100K multilingual facts from Wikidata across 12 languages.
- GRPO consistently outperforms supervised fine-tuning (SFT) on cross-lingual consistency and generalization to unseen languages.
- Light continual pretraining (CPT) on parallel data yields limited additional gains in factual recall.
- Mechanistic analyses show that GRPO reduces language specialization in MLP layers and attention heads.
- Code, models, and dataset are publicly released for further research.
Article Excerpt
From source RSS / original summary: arXiv:2606.06586v1. Abstract: Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages.
Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains.
Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
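For readers unfamiliar with GRPO, the core idea is that each sampled answer is scored relative to the other answers drawn for the same prompt, rather than against a learned value function. The following is a minimal sketch of that group-relative advantage step only. The exact-match reward, the sample answers, and the function names are illustrative assumptions, not the paper's reward design or released code, and the use of a population standard deviation is likewise a choice made for this sketch.

```python
from statistics import mean, pstdev


def exact_match_reward(answer: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the answer matches the gold string, else 0.0."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of answers sampled for one factual question.
gold = "Paris"
samples = ["Paris", "paris ", "Lyon", "Marseille"]

rewards = [exact_match_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)

for sample, adv in zip(samples, advantages):
    print(f"{sample!r}: advantage {adv:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative advantages. When every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.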