CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
Quick Answer
CHASE introduces a co-evolutionary framework for LLM safety, reducing mean StrongREJECT scores by 43.2% with 0% false refusals on benign prompts.
Quick Take
CHASE trains both the attacker and the defender with Group Relative Policy Optimization (GRPO), hardening the defender against adaptive black-box adversaries.
Key Points
- CHASE employs a closed-loop red-blue teaming approach for LLM safety.
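- Achieves a 43.2% reduction in mean StrongREJECT score on BeaverTails and JailbreakBench against five held-out attack families.
- Trains the attacker with Group Relative Policy Optimization (GRPO) and hardens the defender with a two-stage GRPO and rejection-sampled SFT pipeline.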
- Maintains 0% false refusals on benign prompts during evaluations.
- Shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families.
Article Content
From source RSS / original summary (arXiv:2606.05523v1)
Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment.
To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve.
The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts.
Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
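The abstract names two mechanisms that a small sketch can make concrete: a multiplicative reward and GRPO's group-relative scoring. The summary does not give the paper's exact reward formula, score ranges, or normalisation details, so everything below is an assumption for illustration: both scores are taken to lie in [0, 1], the reward is their plain product, and advantages are standardised within a group using the population standard deviation. The function names `multiplicative_reward` and `group_relative_advantages` are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def multiplicative_reward(bypass: float, fidelity: float) -> float:
    """Product of two scores assumed to lie in [0, 1].

    A product is zero whenever either factor is zero, so a rewrite that
    bypasses the defender but drifts from the original intent earns nothing.
    """
    if not (0.0 <= bypass <= 1.0 and 0.0 <= fidelity <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return bypass * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise each reward within its own group.

    Uses the population standard deviation (an assumption); eps avoids
    division by zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (bypass, fidelity) scores for four rewrites of one prompt.
group = [(0.9, 0.0), (0.8, 0.9), (0.2, 1.0), (0.0, 1.0)]
rewards = [multiplicative_reward(b, f) for b, f in group]
advantages = group_relative_advantages(rewards)

for (b, f), r, a in zip(group, rewards, advantages):
    print(f"bypass={b:.1f} fidelity={f:.1f} reward={r:.2f} advantage={a:+.2f}")
```

Under these assumptions, the first rewrite has the highest bypass score but zero reward because its fidelity is zero, and only the second rewrite, which scores well on both axes, gets a positive advantage. An additive reward would instead give partial credit to a rewrite that succeeds on only one axis.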
