Expert-Aware Refusal Steering
Quick Take
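The study extends refusal steering to three open-source Mixture-of-Experts (MoE) LLMs and finds that the complex routing patterns of the MoE architecture do not inhibit steering performance. It also proposes two expert-aware steering methods that use refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. The refusal signals captured by steering differ from expert routing behavior, which suggests that attention plays a substantial role in MoE refusal behavior.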
Key Points
- Refusal steering, previously shown on dense LLMs, is applied to three open-source MoE LLMs.
- Two expert-aware methods use refusal-specific routing patterns and expert-specific steering directions to suppress normal refusal behavior.
- Refusal behavior can be steered based on a single expert's output.
- Refusal signals captured by steering differ from expert routing behavior, suggesting a substantial role for attention.
- Complex routing patterns in MoE do not inhibit steering performance.
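As a rough illustration of what a "refusal-specific routing pattern" could look like, the sketch below compares how often each expert is selected under top-k routing for two sets of tokens. The excerpt does not say how the paper measures routing patterns, so everything here is an assumption: the helper names, the choice of k, and the toy router logits are hypothetical and come from no real model.

```python
from collections import Counter


def top_k_experts(router_logits, k):
    """Return the indices of the k largest router logits for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]


def routing_frequency(logits_per_token, k):
    """Fraction of tokens routed to each expert under top-k selection."""
    counts = Counter()
    for logits in logits_per_token:
        counts.update(top_k_experts(logits, k))
    n_tokens = len(logits_per_token)
    n_experts = len(logits_per_token[0])
    return [counts[e] / n_tokens for e in range(n_experts)]


# Hypothetical router logits for 4 experts (made-up numbers).
refused_tokens = [[0.1, 2.0, 0.3, 1.5], [0.0, 1.7, 0.2, 0.9], [0.4, 2.2, 1.0, 0.1]]
answered_tokens = [[1.9, 0.2, 1.1, 0.0], [1.2, 0.1, 0.3, 1.4], [2.0, 0.5, 1.6, 0.2]]

freq_refused = routing_frequency(refused_tokens, k=2)
freq_answered = routing_frequency(answered_tokens, k=2)

# Experts with a large positive gap are selected more often on refused prompts.
gap = [r - a for r, a in zip(freq_refused, freq_answered)]
most_refusal_specific = max(range(len(gap)), key=lambda e: gap[e])
```

A frequency gap is only one possible statistic. The finding that steering signals differ from routing behavior suggests that routing differences alone may not fully explain refusal.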
Article Excerpt
From source RSS / original summary
arXiv:2606.04160v1
Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests.
We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert.
Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
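The abstract does not specify how the steering vector is built or applied. One formulation that appears in the wider steering literature, assumed here purely for illustration, takes the difference of mean hidden activations between two prompt sets as the direction and then projects that direction out of a hidden state. The sketch below runs on toy three-dimensional vectors. All function names and numbers are hypothetical, and it does not touch a real model.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steering_direction(acts_a, acts_b):
    """Unit difference-of-means direction between two activation sets."""
    mu_a = mean_vector(acts_a)
    mu_b = mean_vector(acts_b)
    return unit([a - b for a, b in zip(mu_a, mu_b)])


def ablate(hidden, direction):
    """Remove the component of a hidden state along a unit direction."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy activations (made-up numbers, not taken from any model).
refuse_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
comply_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]

direction = steering_direction(refuse_acts, comply_acts)
steered = ablate([1.5, 0.3, -0.4], direction)

# After ablation the hidden state has (numerically) no component left
# along the direction, while the orthogonal components are unchanged.
residual = dot(steered, direction)
```

An expert-specific variant, in the spirit of the single-expert result described above, would compute the direction from one expert's output rather than from a shared hidden state. That reading is an assumption, since the excerpt gives no implementation detail.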