Steerable Cultural Preference Optimization of Reward Models
Quick Answer
The paper introduces SCPO, a novel reward model training algorithm that incorporates diverse cultural preferences in a balanced manner. It improves the minority reward model by up to 7 points over the baseline across two datasets (PRISM and GlobalOpinionQA) and 7 countries, and is up to 280% more training data-efficient than full-data fine-tuning of reward models.
Key Points
- SCPO incorporates diverse cultural preferences in a balanced manner.
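- The minority reward model improves by up to 7 points over the baseline across PRISM and GlobalOpinionQA, covering 7 countries.
- SCPO is up to 280% more training data-efficient than full-data fine-tuning of reward models.
- Evaluating separately on each subcommunity's preferences shows that the weighting method mitigates excessive bias.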
- The authors' code is publicly available on GitHub.
Article Content
From source RSS / original summary
arXiv:2606.18606v1 (Announce Type: new)
Abstract: It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions.
This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner.
Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference
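The abstract describes analyzing bias by evaluating separately on each subcommunity's preferences. A minimal sketch of what such a per-group evaluation could look like is given below. It is not the authors' code: `PreferencePair`, `accuracy_by_group`, `accuracy_gap`, and the toy length-based reward function are hypothetical names introduced here for illustration, and pairwise accuracy is an assumed metric, since the summary does not say which metric the paper reports.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    group: str     # hypothetical subcommunity label, e.g. a country code
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def accuracy_by_group(
    pairs: Iterable[PreferencePair],
    reward_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per group, the fraction of pairs where the chosen response scores higher."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        total[pair.group] += 1
        if reward_fn(pair.prompt, pair.chosen) > reward_fn(pair.prompt, pair.rejected):
            correct[pair.group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group: Dict[str, float]) -> float:
    """Spread between the best- and worst-served groups (0.0 means no gap)."""
    return max(per_group.values()) - min(per_group.values())


# Toy stand-in for a reward model (assumption): it simply prefers longer responses.
def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


pairs = [
    PreferencePair("A", "q1", "a longer answer", "short"),
    PreferencePair("A", "q2", "detailed reply", "ok"),
    PreferencePair("B", "q3", "brief", "a much longer reply"),
    PreferencePair("B", "q4", "thorough response", "no"),
]
per_group = accuracy_by_group(pairs, toy_reward)
print(per_group)
print(accuracy_gap(per_group))
```

Reporting one number per group, rather than a single pooled accuracy, is what makes a skew toward one community visible: a pooled score can look acceptable while one group is served much worse than another.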