Prefill Awareness in Large Language Models

arXiv cs.AI·Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly, Jordan Taylor, Robert Kirk

1d ago

·~2 min·6/12/2026·en·0

Quick Answer

This paper shows that Frontier language models, such as Claude Opus 4.5, exhibit significant prefill awareness, detecting tampered outputs in 9-35% of cases.

Quick Take

Frontier language models, such as Claude Opus 4.5, exhibit significant prefill awareness, detecting tampered outputs in 9-35% of cases. This capability impacts the effectiveness of AI safety protocols and highlights the need for developers to monitor this feature in advanced systems.

Key Points

Claude Opus 4.5 detects prefills opposing its preferences with a 0% false positive rate.
Models revert to baseline behavior without indicating foreign prefill presence.
Detection relies on stylistic mismatch, while preference mismatch affects baseline reversion.
Results suggest prefill awareness complicates prefill-based evaluation methods.
Developers should track prefill awareness in frontier AI systems.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 12747v1 Announce Type: new Abstract: Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised.

We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.

5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations later also show that detection and resistance rely on different cues, where stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer.

We also examine more realistic agentic settings such as misalignment-continuation evaluations and trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

1d ago

FeaturedOriginal

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

AI Summary

Arbor introduces a multi-agent framework utilizing structured tree search for optimizing LLM inference, achieving up to 193% throughput-latency improvement compared to vendor-optimized systems. It employs an Orchestrator and Critic agent for stability and coordination, demonstrating hardware-agnostic performance with minimal variance.

#LLM #Agent #Inference #AI Startup