LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

arXiv cs.CL·Igor Ivanov, David Demitri Africa

3d ago

·~1 min·5/27/2026·en·2

Quick Take

The LURE method introduces Live-Usage Replay Evaluations to mitigate evaluation awareness in large language models, enhancing the realism of safety and alignment benchmarks. By replaying realistic interaction trajectories, LURE evaluations are found to be less distinguishable from actual deployments, addressing critical concerns in AI safety and alignment.

Key Points

LURE reduces evaluation awareness, improving the validity of AI safety benchmarks.
Automated realism measurement pipeline detects evaluation awareness in transcripts.
LURE evaluations closely mimic real user interactions, enhancing benchmark reliability.
Implemented in contexts like AI safety sabotage and sycophancy.
Evaluation realism should be reported alongside benchmark results for safety cases.

Article Content

From source RSS / original summary

arXiv:2605. 26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end.

We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users.

We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

Quick Take

Key Points

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective