Expert-Aware Refusal Steering

arXiv cs.CL·Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler


·~1 min·6/4/2026·en·0

Quick Take

The study extends refusal steering from dense LLMs to three open-source Mixture-of-Experts (MoE) LLMs and proposes two expert-aware methods that suppress refusal behavior using refusal-specific routing patterns and expert-specific steering directions. The refusal signals captured by steering differ from expert routing behavior, which the authors take to suggest a substantial role for attention in MoE refusal behavior.

Key Points

Refusal steering, previously shown on dense LLMs, is applied to three open-source MoE LLMs.
Two proposed expert-aware methods use refusal-specific routing patterns and expert-specific steering directions to suppress normal refusal behavior.
Refusal behavior can be steered using the output of a single expert.
Refusal signals captured by steering differ from expert routing behavior, suggesting a substantial role for attention.
The complex routing patterns of MoE architectures do not inhibit steering performance.

Article Excerpt

From source RSS / original summary

arXiv:2606.04160v1. Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests.
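
The excerpt does not say how the steering vector is built or applied. As a rough illustration only, the sketch below uses one formulation common in the steering literature: a difference-of-means direction between activations on harmful and harmless prompts, removed from the hidden states by projection. The function names, the toy random activations, and the choice of projection (rather than, say, adding a scaled vector) are all assumptions, not the authors' method.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless).

    Assumed construction; the excerpt does not specify one.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along a unit `direction`."""
    coeffs = hidden @ direction              # one scalar projection per row
    return hidden - np.outer(coeffs, direction)

# Toy stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]

r_hat = refusal_direction(harmful, harmless)
steered = ablate_direction(harmful, r_hat)
```

After ablation, every row of `steered` is orthogonal to `r_hat`, so whatever signal that direction carried is no longer readable from the hidden state by a linear probe along it.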

We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert.
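
To make "steered based on the output of a single expert" concrete, here is a minimal sketch on a toy top-k MoE layer in which a direction is removed from one expert's output before the gated sum. Everything here is hypothetical: the `moe_layer` function, the tanh experts, the un-renormalized gate weights, and the random steering direction are illustrative assumptions and do not reproduce the paper's two methods.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, expert_ws, top_k=2, steer_expert=None, direction=None):
    """Toy top-k MoE layer (assumed design, not a real model's).

    If `steer_expert` is set, the unit vector `direction` is projected out
    of that expert's output only; all other experts are left untouched.
    """
    probs = softmax(x @ router_w)                    # (n_tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]     # expert ids chosen per token
    out = np.zeros_like(x)
    for e, w in enumerate(expert_ws):
        mask = (top == e).any(axis=-1)               # tokens routed to expert e
        if not mask.any():
            continue
        y = np.tanh(x[mask] @ w)                     # expert e's output
        if e == steer_expert:
            y = y - np.outer(y @ direction, direction)
        out[mask] += probs[mask, e][:, None] * y     # gate-weighted sum
    return out, top

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 16, 4
x = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
expert_ws = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

base, routes = moe_layer(x, router_w, expert_ws)
steered, _ = moe_layer(x, router_w, expert_ws, steer_expert=0, direction=direction)
hit = (routes == 0).any(axis=-1)                     # tokens that used expert 0
delta = base - steered
```

In this toy setup the intervention only touches tokens the router actually sends to the chosen expert, and the change it makes lies entirely along `direction`. That dependence on routing is one reason an expert-aware method would need to know which experts fire on refusal-triggering inputs.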

Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
