RepSelect: Robust LLM Unlearning via Representation Selectivity
Quick Answer
RepSelect is an LLM unlearning method that isolates forget-set-specific representations. Across four model families (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite), it achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest of five baselines, while leaving general capabilities intact.
Key Points
- RepSelect collapses the top principal components of weight gradients before each update, so that updates target forget-set-specific representations.
- Evaluated on biohazardous knowledge and abusive tendencies across four model families.
- Achieves near-perfect robustness against few-shot prompting attacks.
- Outperforms GradDiff, NPO, SimNPO, RMU, and UNDIAL in unlearning effectiveness.
- Targets selective representations for deeper and more robust LLM forgetting.
Article Content
From source RSS / original summary (arXiv:2606.17168v1). Abstract: Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause.
Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover.
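One way to read the gradient-filtering step described above is sketched below in Python with NumPy. The abstract does not say how the principal components are computed or how many are collapsed, so this sketch rests on assumptions: it treats a layer's weight gradient as a 2-D matrix, takes its singular value decomposition, and zeroes the top k singular directions before an ordinary gradient step. The names collapse_top_components and filtered_update, and the parameters k and lr, are hypothetical and not taken from the paper.

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Return `grad` with its top-k singular directions removed.

    Assumption: "principal components of weight gradients" is read here as
    the leading singular directions of a single 2-D gradient matrix.
    """
    # Thin SVD: grad == U @ diag(S) @ Vt, with S sorted in descending order.
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    S_collapsed = S.copy()
    S_collapsed[:k] = 0.0  # zero the k largest singular values
    # Scale each column of U by its (possibly zeroed) singular value.
    return (U * S_collapsed) @ Vt


def filtered_update(weight: np.ndarray, grad: np.ndarray,
                    k: int = 1, lr: float = 1e-2) -> np.ndarray:
    """Hypothetical update: filter the gradient, then take a plain step."""
    return weight - lr * collapse_top_components(grad, k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))  # toy weight matrix
    G = rng.standard_normal((6, 4))  # toy gradient of an unlearning loss
    W_new = filtered_update(W, G, k=2)
    print(W_new.shape)
```

Under this reading, the dominant gradient directions stand in for the shared representations the abstract says existing methods disturb, and the update proceeds only along what remains. Whether the paper derives the components from a single gradient, a batch of per-example gradients, or another statistic is not stated in the summary.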
We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks.
Targeting selective representations is thus an important step towards deep and robust LLM forgetting.