The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
Quick Answer
This paper shows that The Piggyback Hypothesis suggests that chat-template tokens can transfer finetuned behaviors to out-of-domain queries, addressing emergent misalignment (EM) in LLMs.
Quick Take
The Piggyback Hypothesis suggests that chat-template tokens can transfer finetuned behaviors to out-of-domain queries, addressing emergent misalignment (EM) in LLMs. Token-Regularized Finetuning (TReFT) reduces EM by 33.5% on Llama-3.1-8B in legal domains, while maintaining in-domain learning, indicating unintended generalization in LLMs and a need for constrained finetuning.
Key Points
- Emergent misalignment (EM) occurs when LLMs over-generalize to unrelated tasks.
- TReFT reduces EM by 33.5% on Llama-3.1-8B compared to data interleaving.
- Subtle changes to prefix tokens can restore alignment without altering user queries.
- TReFT is effective across various narrow-finetuning scenarios, reducing off-topic generalization by 54.3%.
- The study emphasizes the need for further exploration of shared input features in LLMs.
Article Content
From source RSS / original summaryarXiv:2606. 06667v1 Announce Type: new Abstract: The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries.
We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning.
On Llama-3. 1-8B finetuned on the legal domain, TReFT achieves 33. 5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, , and refusal (off-topic generalization is reduced by 54. 3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning.
It also calls for further study of how shared input features can piggyback model behavior across domains.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.