Model Unlearning Objectives Vary for Distinct Language Functions

arXiv cs.CL · Berk Atil, Vipul Gupta, Rebecca J. Passonneau

5/27/2026

Quick Take

This study argues that unlearning methods for large language models (LLMs) should be tailored to the language function being removed. It pairs a cosine-based, meta-learned variant of RMU for dangerous-knowledge unlearning with a multi-layer objective built on layer-specific probe directions for toxicity unlearning, and reports strong results for both across four open-source 7-8B models.

Key Points

Introduces a cosine-based, meta-learned variant of RMU for dangerous-knowledge unlearning.
Proposes a multi-layer objective based on layer-specific probe directions for toxicity unlearning.
Achieves strong results across four open-source 7-8B models.
Highlights the need for distinct unlearning objectives in LLMs.
Suggests unlearning should be treated as a family of problems.

Article Excerpt

From source RSS / original summary

arXiv:2605.26454v1. Abstract: Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals: dangerous-knowledge unlearning and toxicity unlearning.

For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
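
The abstract does not give the loss functions, so the following is only a rough sketch of what the two kinds of objective might look like, not the authors' formulation. It assumes an RMU-style setup in which forget-set activations are pushed toward a control vector while retain-set activations stay close to those of a frozen copy of the model, with the forget term written as a cosine distance. The toxicity term is assumed to penalize the projection of each chosen layer's hidden state onto that layer's probe direction. The meta-learning component is omitted, and every function name, the choice of layers, and the exact form of each term are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def cosine_forget_loss(hidden: Vector, control: Vector, eps: float = 1e-8) -> float:
    """Assumed forget term: 1 - cos(hidden, control).

    Zero when the activation points along the control direction,
    regardless of its magnitude.
    """
    return 1.0 - dot(hidden, control) / max(norm(hidden) * norm(control), eps)


def retain_loss(hidden: Vector, frozen_hidden: Vector) -> float:
    """Assumed retain term: mean squared error against the frozen model's activation."""
    return sum((h - f) ** 2 for h, f in zip(hidden, frozen_hidden)) / len(hidden)


def multi_layer_probe_loss(
    hidden_by_layer: Dict[int, Vector],
    probe_by_layer: Dict[int, Vector],
    eps: float = 1e-8,
) -> float:
    """Assumed toxicity term: sum over layers of the squared projection of the
    hidden state onto that layer's (normalized) probe direction."""
    total = 0.0
    for layer, hidden in hidden_by_layer.items():
        direction = probe_by_layer[layer]
        projection = dot(hidden, direction) / max(norm(direction), eps)
        total += projection ** 2
    return total


if __name__ == "__main__":
    # Toy 2-d activations; real hidden states would come from the model.
    print(cosine_forget_loss([1.0, 0.0], [0.0, 3.0]))
    print(retain_loss([1.0, 2.0], [1.0, 0.0]))
    print(multi_layer_probe_loss({4: [3.0, 4.0], 8: [1.0, 1.0]},
                                 {4: [1.0, 0.0], 8: [0.0, 2.0]}))
```

The contrast the sketch is meant to show is structural: the knowledge term acts on the direction of activations at one location relative to a control vector, while the toxicity term aggregates a probe-aligned penalty over several layers. In practice these would be computed on batched tensors inside a training loop rather than on Python lists.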

