HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
Quick Take
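HARC (Harmfulness-And-Refusal Coupling) is a fine-tuning method that strengthens safety alignment in LLMs by coupling the harmfulness and refusal directions in the residual stream, improving robustness without degrading general capability. In the authors' experiments, HARC achieves the strongest robustness-capability-usability trade-off among six baseline safety methods, and the directions transfer across the five model families and two scales tested.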
Key Points
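- HARC is a fine-tuning method that couples the harmfulness and refusal directions across both prompt and response positions.
- Jailbreaks succeed at prompt encoding by suppressing either the harmfulness or the refusal direction before any token is generated.
- Because the intervention is confined to the harmfulness-refusal subspace, HARC does not degrade general capability or inflate over-refusal.
- HARC achieves the strongest robustness-capability-usability trade-off among six baselines spanning training-time and inference-time safety methods.
- The harmfulness and refusal directions transfer across the five model families and two scales tested, without architecture-specific tuning.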
Article Content
From source RSS / original summary
arXiv:2607.00572v1 (Announce Type: new)
Abstract: Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions.
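To make "a direction in the residual stream" concrete, here is a minimal sketch of one common estimator, the difference of mean activations between two prompt sets. The abstract does not say which estimator the authors use, so the estimator, the function names, and the toy 3-dimensional activations below are all assumptions for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def diff_of_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

# Toy activations (made up): harmful prompts differ from harmless ones
# only along the first coordinate.
harmful_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

harm_dir = diff_of_means_direction(harmful_acts, harmless_acts)
```

In a real model the vectors would be residual-stream activations read at a chosen layer and token position; a refusal direction could be estimated the same way from refused versus complied prompts.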
We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness direction before any token is generated, with distinct attack classes occupying separable regions of the harmfulness-refusal plane. Extending the analysis to response-token positions, we find that the model recognizes harmful content while it is generating that content, even when it failed to recognize the input as harmful at the prompt side.
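The "harmfulness-refusal plane" can be pictured as the pair of projections of an activation onto the two directions. The sketch below is illustrative only: the directions, the activations, the threshold, and the helper names are hypothetical, and the paper's actual analysis may differ.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def plane_coords(activation, harm_dir, refusal_dir):
    """Coordinates of an activation along the harmfulness and refusal directions."""
    return dot(activation, harm_dir), dot(activation, refusal_dir)

def suppressed_direction(coords, threshold=0.5):
    """Name the direction whose projection falls below a (hypothetical) threshold."""
    harm, refusal = coords
    if harm < threshold:
        return "harmfulness"
    if refusal < threshold:
        return "refusal"
    return None

# Toy unit directions in a 3-d residual stream (assumed, not from the paper).
harm_dir = [1.0, 0.0, 0.0]
refusal_dir = [0.0, 1.0, 0.0]

# Made-up prompt-side activations for three cases.
plain_harmful = [1.0, 1.0, 0.3]  # recognized as harmful and refused
attack_a = [0.1, 0.2, 0.3]       # harmfulness signal suppressed
attack_b = [1.0, 0.1, 0.3]       # harmfulness intact, refusal suppressed

labels = [
    suppressed_direction(plane_coords(a, harm_dir, refusal_dir))
    for a in (plain_harmful, attack_a, attack_b)
]
```

Here the two toy attacks land in different regions of the plane, which is the kind of separation the abstract describes between attack classes.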
Motivated by our findings, we introduce HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method that pairs the two directions across both prompt and response positions. Since the intervention is confined to the harmfulness-refusal subspace, it leaves the rest of the residual stream intact and does not degrade general capability or inflate over-refusal.
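The claim that a subspace-confined intervention leaves the rest of the residual stream intact follows from basic linear algebra. The sketch below is not HARC's training procedure (the abstract does not specify it); it only shows, with hypothetical names and toy vectors, that an update projected onto the span of two directions cannot change the component orthogonal to that span.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def orthonormal_basis(d1, d2):
    """Gram-Schmidt: orthonormal basis for span{d1, d2} (d1, d2 not parallel)."""
    e1 = unit(d1)
    c = dot(d2, e1)
    e2 = unit([x - c * y for x, y in zip(d2, e1)])
    return [e1, e2]

def project_onto_span(v, basis):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for e in basis:
        c = dot(v, e)
        out = [o + c * x for o, x in zip(out, e)]
    return out

def confined_update(activation, delta, basis):
    """Apply only the part of delta that lies inside the subspace."""
    delta_in = project_onto_span(delta, basis)
    return [a + d for a, d in zip(activation, delta_in)]

def orthogonal_part(v, basis):
    """Component of v outside the subspace."""
    inside = project_onto_span(v, basis)
    return [x - y for x, y in zip(v, inside)]

# Toy 4-d example; the two directions need not be orthogonal to each other.
harm_dir = [1.0, 0.0, 0.0, 0.0]
refusal_dir = [1.0, 1.0, 0.0, 0.0]
basis = orthonormal_basis(harm_dir, refusal_dir)

activation = [0.5, 0.2, 0.7, -0.3]
delta = [1.0, 1.0, 1.0, 1.0]  # an arbitrary proposed change
updated = confined_update(activation, delta, basis)

before = orthogonal_part(activation, basis)
after = orthogonal_part(updated, basis)
```

The `before` and `after` vectors match, so whatever the rest of the residual stream encodes in those coordinates is untouched by the confined update.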
Across extensive experiments, HARC achieves the strongest robustness-capability-usability trade-off among six baselines spanning the major training-time and inference-time safety methods. The harmfulness and refusal directions at prompt and response positions transfer across the five model families and two scales we tested without architecture-specific tuning.