Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL·Zhengping Jiang, Mehran Khodabandeh, Akash Bharadwaj, Manik Bhandari, Mayur Srungarapu, Anqi Liu, Benjamin Van Durme, Li Chen

6/1/2026

Quick Take

The Configurable Safety Reward Model (CSRM) is a reward model that can be configured to changing safety specifications. It reaches state-of-the-art F1 scores of 94.6% on CoSApien and 75.8% on DynaBench without additional human annotation, and LLMs aligned with it show a significantly better helpfulness-safety tradeoff than existing baselines.

Key Points

CSRM is jointly optimized for calibrated safety compliance and reward modeling.
Configuration-targeted data augmentation enforces instruction adherence while preserving relative severity structure.
CSRM achieves state-of-the-art F1 on configurable safety benchmarks: 94.6% on CoSApien and 75.8% on DynaBench.
Generalization to previously unseen safety configurations improves substantially.
Used for downstream safety alignment, CSRM yields LLMs with a better helpfulness-safety tradeoff.

Article Content

From source RSS / original summary

arXiv:2605.30487v1. Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications.

We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations.
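The abstract does not spell out the training objective, so the following is only a rough sketch of what "jointly optimized for calibrated safety compliance and reward modeling" could look like. It assumes a Bradley-Terry pairwise ranking term plus a binary cross-entropy term that treats the scalar reward as a compliance logit. The function names, the `weight` parameter, and the choice of losses are all hypothetical and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return softplus(-(r_chosen - r_rejected))


def compliance_loss(reward: float, is_compliant: bool) -> float:
    """Binary cross-entropy with the reward read as a compliance logit.

    Assumption: sigmoid(reward) is interpreted as the probability that the
    response complies with the given safety configuration.
    """
    return softplus(-reward) if is_compliant else softplus(reward)


def joint_loss(
    r_chosen: float,
    r_rejected: float,
    chosen_compliant: bool,
    rejected_compliant: bool,
    weight: float = 1.0,  # hypothetical mixing coefficient
) -> float:
    """Ranking term plus a weighted, averaged compliance (calibration) term."""
    calibration = 0.5 * (
        compliance_loss(r_chosen, chosen_compliant)
        + compliance_loss(r_rejected, rejected_compliant)
    )
    return ranking_loss(r_chosen, r_rejected) + weight * calibration


if __name__ == "__main__":
    # A well-ordered pair should cost less than a mis-ordered one.
    print(joint_loss(2.0, -2.0, True, False))
    print(joint_loss(-2.0, 2.0, True, False))
```

In this sketch the ranking term only constrains the difference between the two rewards, while the compliance term pins down their absolute scale, which is one plausible way to obtain a reward that is both a preference signal and a calibrated compliance score.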

CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness-safety tradeoff compared to existing baselines.
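For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The snippet below is a minimal illustration on made-up labels. The abstract does not say which class is treated as positive or whether the benchmarks report binary or averaged F1, so the binary setup and the `f1_score` helper here are assumptions for illustration only.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with label 1 as the positive class (an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels, not benchmark data.
    gold = [1, 1, 0, 0, 1]
    pred = [1, 0, 0, 1, 1]
    print(f"F1 = {f1_score(gold, pred):.3f}")
```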

