The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv cs.CL·Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

6/8/2026

·~2 min·6/8/2026·en·1

Quick Answer

This paper shows that The Piggyback Hypothesis suggests that chat-template tokens can transfer finetuned behaviors to out-of-domain queries, addressing emergent misalignment (EM) in LLMs.

Quick Take

Token-Regularized Finetuning (TReFT) reduces EM by 33.5% on Llama-3.1-8B in legal domains, while maintaining in-domain learning, indicating unintended generalization in and a need for constrained finetuning.

Key Points

Emergent misalignment (EM) occurs when LLMs over-generalize to unrelated tasks.
TReFT reduces EM by 33.5% on Llama-3.1-8B compared to data interleaving.
Subtle changes to prefix tokens can restore alignment without altering user queries.
TReFT is effective across various narrow-finetuning scenarios, reducing off-topic generalization by 54.3%.
The study emphasizes the need for further exploration of shared input features in LLMs.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 06667v1 Announce Type: new Abstract: The mechanisms behind ' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries.

We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

6h ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis