GradShield: Alignment Preserving Finetuning
Quick Answer
GradShield is a data-filtering method for Large Language Models (LLMs) that identifies and removes harmful data points before finetuning can corrupt the model's alignment. The paper reports an Attack Success Rate (ASR) consistently below 6% while preserving utility performance across multiple finetuning tasks.
Key Points
- GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point.
- An adaptive thresholding algorithm uses these scores to decide which data points to remove.
- It targets both explicitly harmful data and seemingly benign data that can still steer a model towards misaligned behavior.
- It is evaluated on multiple utility finetuning tasks with varying levels of harmful data.
- The paper reports that GradShield outperforms all baseline methods, keeping the Attack Success Rate (ASR) below 6% while preserving utility performance.
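The score-then-filter flow in the points above can be sketched as follows. This is a minimal illustration, not the paper's method: the summary does not say how FIHS is computed or how the adaptive threshold is chosen, so the scores here are plain floats supplied by the caller, and the threshold rule (median plus k times the median absolute deviation) is a hypothetical stand-in. The names adaptive_threshold, filter_dataset, and k are assumptions made for illustration.

```python
from statistics import median


def adaptive_threshold(scores, k=3.0):
    """Hypothetical data-dependent cutoff: median + k * MAD.

    This is NOT the thresholding algorithm from the paper; it only
    illustrates a threshold that adapts to the score distribution.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return med + k * mad


def filter_dataset(examples, scores, k=3.0):
    """Split examples into kept and removed using per-example scores.

    `scores` stands in for FIHS values (higher = more harmful); how those
    values are actually computed is not specified in the summary.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    threshold = adaptive_threshold(scores, k)
    kept = [ex for ex, s in zip(examples, scores) if s <= threshold]
    removed = [ex for ex, s in zip(examples, scores) if s > threshold]
    return kept, removed, threshold


if __name__ == "__main__":
    examples = ["a", "b", "c", "d", "e", "f"]
    scores = [0.10, 0.20, 0.15, 0.12, 0.18, 5.00]  # made-up scores
    kept, removed, threshold = filter_dataset(examples, scores)
    print("kept:", kept)
    print("removed:", removed)
```

A data-dependent cutoff of this kind avoids hand-picking a fixed score for every dataset, which is the general motivation for adaptive thresholding; the paper's own rule may differ substantially.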
Article Excerpt
Source: arXiv:2605.14194v1 (original abstract)
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment.
It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics.
The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below 6% while preserving utility performance.
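For reference, ASR is commonly computed as the fraction of adversarial prompts for which the model's response is judged harmful. The excerpt does not state how GradShield's evaluation judges responses, so the sketch below assumes the judgements are already available as booleans; attack_success_rate is a hypothetical helper, not part of any published code.

```python
def attack_success_rate(judgements):
    """Fraction of attack prompts whose response was judged harmful.

    `judgements` is a sequence of booleans (True = attack succeeded).
    How each response is judged is an assumption left to the caller.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(1 for j in judgements if j) / len(judgements)


if __name__ == "__main__":
    # Made-up example: 2 successful attacks out of 50 prompts.
    judgements = [True] * 2 + [False] * 48
    asr = attack_success_rate(judgements)
    print(f"ASR = {asr:.1%}")
```

Under this definition, 2 harmful responses out of 50 attack prompts gives an ASR of 4%, which would fall under the 6% bound reported above.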


