CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

arXiv cs.CL·Mike Zhang, Ali Basirat, Desmond Elliott

5/27/2026

Quick Take

CroCo enables effective cross-lingual contrastive preference tuning on self-generated responses without language-specific annotations.

Key Points

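Two models are evaluated across 14 high- and low-resource languages on a diverse set of tasks.
A reward model trained only on English preferences (atop a multilingual base) produces useful within-language rankings across most languages.
Gains require on-policy data: off-policy responses reduce the benefit, and online preference optimization does not improve over the offline variant.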

Article Content

From source RSS / original summary

arXiv:2605.26293v1. Abstract: Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks.

Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning.
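To make the pairing idea concrete, here is a minimal sketch of how preference pairs might be built from self-generated responses within each language. The abstract does not spell out the pairing rule, so everything here is an assumption for illustration: the `Candidate` type, the `build_pairs` function, the `target_gap` parameter, the "top response versus the response closest to a target reward gap" rule, and the language codes in the example are all hypothetical and should not be read as the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lang: str      # language code of the prompt and response (illustrative)
    prompt: str
    response: str  # assumed to be sampled from the policy itself (on-policy)
    reward: float  # assumed score from an English-trained reward model


def build_pairs(candidates, target_gap):
    """Return one (chosen, rejected) pair per (lang, prompt) group.

    Hypothetical rule: chosen is the top-scoring response; rejected is the
    response whose reward gap to chosen is closest to target_gap.
    Ranking happens only within a language, never across languages.
    """
    groups = defaultdict(list)
    for c in candidates:
        groups[(c.lang, c.prompt)].append(c)

    pairs = []
    for group in groups.values():
        if len(group) < 2:
            continue  # a pair needs at least two responses
        ranked = sorted(group, key=lambda c: c.reward, reverse=True)
        chosen = ranked[0]
        rejected = min(
            ranked[1:],
            key=lambda c: abs((chosen.reward - c.reward) - target_gap),
        )
        pairs.append((chosen, rejected))
    return pairs


# Toy data with arbitrary language codes and made-up reward scores.
candidates = [
    Candidate("da", "q1", "a", 0.9),
    Candidate("da", "q1", "b", 0.6),
    Candidate("da", "q1", "c", 0.1),
    Candidate("sw", "q1", "x", 0.4),
    Candidate("sw", "q1", "y", 0.3),
]
pairs = build_pairs(candidates, target_gap=0.3)

# Multilingual setting: pool pairs from all languages into one training set.
multilingual = pairs
# Monolingual setting: keep a separate training set per language.
monolingual = defaultdict(list)
for chosen, rejected in pairs:
    monolingual[chosen.lang].append((chosen, rejected))
```

Under these assumptions, the monolingual and multilingual settings differ only in whether the per-language pairs are pooled before tuning; the reward scores are compared solely among responses to the same prompt in the same language.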

We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.


