SafeGene: Reusable Adapters for Transferable Safety Alignment
Quick Answer
SafeGene introduces a reusable safety-adapter module for open-weight LLMs, enhancing safety alignment without compromising performance.
Quick Take
SafeGene reduces harmful response rates across multiple model families while maintaining downstream task performance, and it outperforms representative safe adaptation methods in the safety-utility trade-off.
Key Points
- SafeGene decouples safety capability from task-specific updates for better adaptability.
- It utilizes aligned-degraded model discrepancies to create transferable safety vectors.
- Experiments show reduced harmful response rates while maintaining performance across tasks.
- SafeGene outperforms representative safe adaptation methods in the safety-utility trade-off.
- The adapter is designed for cross-task reuse within each architecture-compatible model family and is evaluated on multiple families.
Article Content
From source RSS / original summary
arXiv:2606.06519v1 Announce Type: new
Abstract: Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions.
We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates.
This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration.
Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in the safety-utility trade-off.
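The three steps named in the abstract (discrepancy extraction, layer selection, layer-wise coefficient scaling) can be pictured with a toy sketch. This is not the authors' implementation: the function names, the dict-of-lists weight format, the layer scores, and the coefficient values are all hypothetical, and the coefficients are hard-coded here rather than recalibrated from few-shot data as the paper describes.

```python
# Illustrative sketch only; NOT the SafeGene authors' code.
# Assumption: each "model" is a dict mapping a layer name to a flat list of weights.

def safety_vectors(aligned, degraded):
    """Per-layer discrepancy: aligned weights minus degraded weights."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }

def select_layers(vectors, scores, top_k):
    """Keep the top_k layers by a caller-supplied score (hypothetical stand-in
    for the paper's data-aware layer selection)."""
    keep = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return {name: vectors[name] for name in keep}

def apply_safety(task_model, vectors, coeffs):
    """Add coeff * safety vector to each selected layer of a task-adapted model.
    Layers without a coefficient default to 1.0; the input model is not mutated."""
    patched = {name: list(w) for name, w in task_model.items()}
    for name, vec in vectors.items():
        c = coeffs.get(name, 1.0)
        patched[name] = [w + c * v for w, v in zip(patched[name], vec)]
    return patched

# Toy weights chosen so the arithmetic is easy to follow.
aligned = {"layer0": [1.0, 2.0], "layer1": [0.5, 0.5]}
degraded = {"layer0": [0.5, 2.0], "layer1": [0.5, 0.25]}
task_model = {"layer0": [0.0, 1.0], "layer1": [1.0, 1.0]}

vectors = safety_vectors(aligned, degraded)
selected = select_layers(vectors, scores={"layer0": 0.9, "layer1": 0.1}, top_k=1)
patched = apply_safety(task_model, selected, coeffs={"layer0": 0.5})
```

In this toy run only `layer0` is selected, so `layer1` of the task-adapted model is left unchanged. The sketch reflects the idea of a safety representation that is stored separately from task updates and rescaled per layer when it is reused.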