EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

arXiv cs.AI·Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

5/25/2026

Quick Take

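The paper introduces EDGE-OPD, a modification of On-Policy Self-Distillation (OPSD) for training with privileged context that is absent at inference time. Because privileged context can change model behavior more than intended, plain OPSD may train the student on side effects rather than on the desired behavior. EDGE-OPD uses guided rollouts so that the rare target behavior actually appears in the on-policy data, and an evidence mask so that the student is updated only at token positions where the privileged context supports the sampled token. In the paper's rare-token/identity setting, OPSD and its variant RLSD fail to learn a target identity unless guided rollouts are added.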

Key Points

EDGE-OPD modifies OPSD to better utilize privileged context during model training.
Guided rollouts ensure rare target behaviors are present in on-policy data.
An evidence mask updates the student model only at supportive token positions.
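Empirically, OPSD and its variant RLSD (with and without a verifier) fail to learn a target identity; adding guided rollouts allows them to succeed.
Mask-region ablations show that the persona signal is localized to the positive-evidence tail, which informs efficient knowledge transfer and preservation of general-purpose capabilities.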
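
As a rough illustration of the guided-rollout idea in the points above, the sketch below mixes the student's next-token distribution with the privileged-context teacher's distribution at sampling time, so that a rare target token has a realistic chance of being sampled. The mixing rule, the function names (`guided_distribution`, `sample_guided_token`), and the `beta` parameter are assumptions made for illustration only; the summary does not state how the paper injects privileged-context behavior.

```python
import random


def guided_distribution(student_probs, teacher_probs, beta):
    """Blend student and teacher next-token probabilities.

    ASSUMPTION: a linear mixture with weight `beta` on the teacher. This is a
    hypothetical stand-in for the paper's guidance mechanism, not its method.
    """
    if len(student_probs) != len(teacher_probs):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    mixed = [(1.0 - beta) * s + beta * t
             for s, t in zip(student_probs, teacher_probs)]
    total = sum(mixed)
    # Renormalize to guard against small floating-point drift.
    return [m / total for m in mixed]


def sample_guided_token(student_probs, teacher_probs, beta, rng):
    """Sample one token index from the blended distribution."""
    probs = guided_distribution(student_probs, teacher_probs, beta)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy vocabulary of two tokens; index 1 plays the role of the rare target token.
student = [0.98, 0.02]   # student without privileged context
teacher = [0.10, 0.90]   # same model conditioned on privileged context
guided = guided_distribution(student, teacher, beta=0.5)
token = sample_guided_token(student, teacher, 0.5, random.Random(0))
```

With `beta = 0` the sampler reduces to ordinary on-policy sampling from the student; larger values trade on-policy fidelity for coverage of the rare behavior.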


[Submitted on 22 May 2026]


Abstract: On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use-case of OPD, which is appealing as it requires only a single model as a student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g., a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, which allows us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.
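
To make the evidence mask in point b) concrete, here is a minimal sketch under stated assumptions. It treats a token position as "supported" when the privileged-context teacher assigns the sampled token a higher log-probability than the context-free student does, and it zeroes the per-token training signal everywhere else. The support criterion, the `threshold` parameter, and the function names are hypothetical; the abstract does not give the paper's exact definition.

```python
def evidence_mask(teacher_logps, student_logps, threshold=0.0):
    """Return 1 at positions where the privileged context supports the token.

    ASSUMPTION: "supports" is modeled as
    log p_teacher(token) - log p_student(token) > threshold.
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("log-prob sequences must have equal length")
    return [1 if (t - s) > threshold else 0
            for t, s in zip(teacher_logps, student_logps)]


def masked_token_signal(teacher_logps, student_logps, threshold=0.0):
    """Per-token log-ratio signal, kept only at positive-evidence positions.

    Unmasked positions contribute exactly 0.0, so tokens where the privileged
    context merely shifts style or length do not update the student.
    """
    mask = evidence_mask(teacher_logps, student_logps, threshold)
    return [(t - s) * m
            for t, s, m in zip(teacher_logps, student_logps, mask)]


# Toy rollout of four sampled tokens (log-probabilities of each sampled token).
teacher_logps = [-0.1, -2.0, -0.5, -0.3]   # model with privileged context
student_logps = [-0.1, -1.0, -3.0, -4.0]   # same model without it
mask = evidence_mask(teacher_logps, student_logps)        # [0, 0, 1, 1]
signal = masked_token_signal(teacher_logps, student_logps)
```

In this toy rollout only the last two positions carry a signal, which loosely mirrors the abstract's observation that the persona signal sits in the positive-evidence tail.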

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.23493 [cs.AI]
	(or arXiv:2605.23493v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.23493 arXiv-issued DOI via DataCite

Submission history

From: Aristotelis Lazaridis
[v1] Fri, 22 May 2026 10:55:15 UTC (944 KB)

— Originally published at arxiv.org
