MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

arXiv cs.CL · Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

·~2 min·5/19/2026·en·2

Quick Take

MixSD enhances knowledge injection in language models while preserving pretrained capabilities.

Key Points

MixSD builds its supervision dynamically by mixing tokens from two conditionals of the base model itself.
It achieves a better memorization-retention trade-off than supervised fine-tuning.
It reduces catastrophic forgetting by keeping supervision close to the model's own generation distribution.

[Submitted on 16 May 2026]

Abstract: Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
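
To make the idea of mixing two conditionals concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not state the mixing rule, so the per-token coin flip controlled by `mix_prob` is an assumption, and the lookup tables `NAIVE` and `EXPERT`, the helper names, and the fictional fact ("the capital is Lutz") are all hypothetical stand-ins for a real base model queried with and without the fact in context. The `nll` helper scores a target sequence under the naive conditional, which is one plausible reading of "NLL under the base model".

```python
import math
import random

# Toy per-position next-token tables standing in for ONE base model queried
# two ways. All tokens and probabilities are invented for illustration.
NAIVE = [  # no fact in context: the model's original prior
    {"The": 0.8, "A": 0.2},
    {"capital": 0.9, "city": 0.1},
    {"is": 1.0},
    {"Paris": 0.6, "Rome": 0.35, "Lutz": 0.05},
    {"<eos>": 1.0},
]
EXPERT = [  # hypothetical injected fact ("the capital is Lutz") in context
    {"The": 0.8, "A": 0.2},
    {"capital": 0.9, "city": 0.1},
    {"is": 1.0},
    {"Lutz": 0.95, "Paris": 0.05},
    {"<eos>": 1.0},
]


def naive_probs(prefix):
    """Next-token distribution without the fact (depends only on position here)."""
    return NAIVE[len(prefix)]


def expert_probs(prefix):
    """Next-token distribution with the fact in context (position-only toy)."""
    return EXPERT[len(prefix)]


def mix_sequence(expert_fn, naive_fn, mix_prob, max_len, rng):
    """Build one supervision sequence token by token.

    ASSUMPTION: each token is drawn from the expert conditional with
    probability mix_prob, otherwise from the naive conditional.
    """
    tokens = []
    for _ in range(max_len):
        source = expert_fn if rng.random() < mix_prob else naive_fn
        dist = source(tokens)
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens


def nll(tokens, probs_fn):
    """Negative log-likelihood of a token sequence under one conditional."""
    return -sum(math.log(probs_fn(tokens[:i])[tok]) for i, tok in enumerate(tokens))


if __name__ == "__main__":
    rng = random.Random(0)
    mixed = mix_sequence(expert_probs, naive_probs, 0.5, 5, rng)
    fixed = ["A", "city", "is", "Lutz", "<eos>"]  # externally written target
    print("mixed:", mixed, "NLL under naive:", nll(mixed, naive_probs))
    print("fixed:", fixed, "NLL under naive:", nll(fixed, naive_probs))
```

In this toy setup, a fixed target written without regard to the model's preferences can stack several low-probability tokens, while a mixed sequence only departs from the prior where a token happens to be drawn from the expert conditional. Whether this matches the paper's actual construction should be checked against the full text.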

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.16865 [cs.CL]
	(or arXiv:2605.16865v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16865 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jiarui Liu
[v1] Sat, 16 May 2026 07:57:09 UTC (738 KB)

— Originally published at arxiv.org
