Configurable Reward Model for Balanced Safety Alignment
Quick Take
The Configurable Safety Reward Model (CSRM) is a reward model that can be configured to match evolving safety requirements. It achieves state-of-the-art F1 scores of 94.6% on CoSApien and 75.8% on DynaBench without extra human annotation. When used for downstream safety alignment, it yields LLMs with a significantly better helpfulness-safety tradeoff than existing baselines.
Key Points
- CSRM is optimized for calibrated safety compliance and reward modeling.
- Configuration-targeted data augmentation enforces instruction adherence.
- Achieves state-of-the-art performance on configurable safety benchmarks.
- Substantially improves generalization to previously unseen safety configurations.
- Enhances helpfulness-safety tradeoff for downstream safety alignment.
Article Content
Source summary (arXiv:2605.30487v1). Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications.
We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations.
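The abstract does not spell out the training objective, so the following is only a minimal sketch of what "jointly optimized for calibrated safety compliance and reward modeling" could look like. It assumes a Bradley-Terry pairwise ranking term plus a binary cross-entropy compliance term on the same scalar score. The function names, the `weight` parameter, and the choice of loss terms are all hypothetical and are not taken from the paper. The scores are assumed to come from a model that reads a safety configuration together with the conversation and a candidate response.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return softplus(-(r_chosen - r_rejected))


def compliance_loss(score: float, is_compliant: bool) -> float:
    """Binary cross-entropy, treating sigmoid(score) as P(compliant)."""
    return softplus(-score) if is_compliant else softplus(score)


def joint_loss(
    r_chosen: float,
    r_rejected: float,
    chosen_compliant: bool,
    rejected_compliant: bool,
    weight: float = 1.0,  # hypothetical balance between the two terms
) -> float:
    """Assumed joint objective: ranking term plus weighted compliance term."""
    ranking = pairwise_reward_loss(r_chosen, r_rejected)
    calibration = 0.5 * (
        compliance_loss(r_chosen, chosen_compliant)
        + compliance_loss(r_rejected, rejected_compliant)
    )
    return ranking + weight * calibration


# Scores that agree with the labels give a lower loss than scores that disagree.
good = joint_loss(2.0, -2.0, chosen_compliant=True, rejected_compliant=False)
bad = joint_loss(-2.0, 2.0, chosen_compliant=True, rejected_compliant=False)
print(good < bad)
```

Under this assumption, the ranking term keeps the relative ordering of responses, while the compliance term ties the absolute score to a compliant or non-compliant decision under the given configuration.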
CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness-safety tradeoff compared to existing baselines.
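For readers less familiar with the metric, here is a small sketch of how a binary F1 score is computed from compliance labels. The summary does not say which F1 variant (binary, macro, or micro) the benchmarks report, so the binary form is an assumption, and the toy labels below are made up for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (1), e.g. 'violates the configuration'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one true positive, one false positive, one false negative.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 1, 0]
print(f1_score(toy_true, toy_pred))  # 0.5
```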