Prefill Awareness in Large Language Models
Quick Answer
This paper shows that frontier language models such as Claude Opus 4.5 exhibit substantial prefill awareness: when prompted, Claude Opus 4.5 detects prefills that oppose its preferences in 9-35% of cases, with a 0% false positive rate.
Quick Take
Frontier language models such as Claude Opus 4.5 can often tell when their prior assistant-side output has been inserted or edited. When prompted, Claude Opus 4.5 flags prefills that oppose its preferences in 9-35% of cases with a 0% false positive rate. This capability can compromise prefill-based safety evaluations and AI control protocols, and the authors recommend that model developers track it in frontier systems.
Key Points
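- When prompted, Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases, with a 0% false positive rate.
- Models often revert toward baseline behavior without explicitly reporting that the prefill was foreign.
- Detection and resistance rely on different cues: stylistic mismatch mainly affects whether a prefill is flagged as foreign, while preference mismatch mainly affects reversion toward the baseline answer.
- Prefill awareness is already a substantial confound for some prefill-based evaluation methods.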
- Developers should track prefill awareness in frontier AI systems.
Article Content
From source RSS / original summary
arXiv:2606.12747v1 (Announce Type: new)
Abstract: Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised.
We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations also show that detection and resistance rely on different cues, where stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer.
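To make the two headline metrics concrete, here is a minimal Python sketch of how a detection rate and a false positive rate could be computed once items have been filtered for a consistent baseline stance. It is not the paper's code: the `Trial` record, the function names, the 100% agreement threshold, and the toy data are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One prompted detection trial (hypothetical record layout)."""
    tampered: bool  # True if the assistant turn was inserted or edited
    flagged: bool   # True if the model reported the turn as not its own


def has_consistent_stance(choices: list[str], threshold: float = 1.0) -> bool:
    """Keep an item only if the most common baseline choice reaches
    the given share of samples (threshold is an assumed parameter)."""
    if not choices:
        return False
    top = max(choices.count(c) for c in set(choices))
    return top / len(choices) >= threshold


def detection_rate(trials: list[Trial]) -> float:
    """Share of tampered trials that the model flagged as foreign."""
    tampered = [t for t in trials if t.tampered]
    return sum(t.flagged for t in tampered) / len(tampered) if tampered else 0.0


def false_positive_rate(trials: list[Trial]) -> float:
    """Share of untampered trials that the model wrongly flagged."""
    clean = [t for t in trials if not t.tampered]
    return sum(t.flagged for t in clean) / len(clean) if clean else 0.0


# Toy data, invented for illustration only (not results from the paper).
trials = [
    Trial(tampered=True, flagged=True),
    Trial(tampered=True, flagged=False),
    Trial(tampered=True, flagged=False),
    Trial(tampered=True, flagged=False),
    Trial(tampered=False, flagged=False),
    Trial(tampered=False, flagged=False),
    Trial(tampered=False, flagged=False),
    Trial(tampered=False, flagged=False),
]

print(detection_rate(trials), false_positive_rate(trials))
```

A 0% false positive rate in this framing means the model never flags an untampered turn as foreign, so every flag it does raise on the tampered set is a true detection.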
We also examine more realistic agentic settings such as misalignment-continuation evaluations and trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.