Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Quick Answer
The study introduces a semantic-space alignment paradigm using Group Relative Policy Optimization (GRPO) for low-resource language expansion, significantly reducing the alignment tax associated with supervised fine-tuning.
Quick Take
Evaluated on Tibetan-Chinese machine translation and Tibetan headline generation, the GRPO-based semantic-reward approach preserves general competence more effectively than SFT while improving semantic quality in open-ended generation. Few-shot transfer results suggest that it also learns more robust representations under limited supervision.
Key Points
- Proposes a semantic-space alignment paradigm that optimizes embedding-level semantic rewards with GRPO instead of likelihood maximization.
- Evaluates the approach on Tibetan-Chinese machine translation and Tibetan headline generation.
- Mitigates catastrophic forgetting, preserving general competence more effectively than SFT.
- Achieves higher semantic quality in open-ended generation tasks.
- Shows improved transferability and robustness with limited supervision.
Article Content
From source RSS / original summary
arXiv:2605.14366v1
Abstract: Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions.
To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation.
Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision.
Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
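The abstract describes optimizing with embedding-level semantic rewards under GRPO but gives no implementation details. The following is a minimal sketch of that idea, not the authors' code. It assumes the reward is the cosine similarity between a candidate's sentence embedding and a reference embedding, and it uses the standard GRPO step of normalizing rewards within a group of sampled candidates. The function names and the toy 2-dimensional embeddings are hypothetical; a real system would obtain embeddings from a multilingual sentence encoder.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(candidate_embeddings, reference_embedding):
    """Assumed reward: similarity of each sampled output to the reference meaning."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: four sampled outputs for one prompt (hypothetical embeddings).
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.3f}  advantage={a:+.3f}")
```

Because the reward depends only on embedding similarity, two candidates with different wording but the same meaning can receive similar advantages. That is one way to read the abstract's phrase "meaning preservation through flexible realizations". In a full GRPO loop these advantages would weight the policy-gradient update, typically alongside a clipping or KL term that this sketch omits.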