LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Quick Take
The LURE method introduces Live-Usage Replay Evaluations to mitigate evaluation awareness in large language models, enhancing the realism of safety and alignment benchmarks. By replaying realistic interaction trajectories, LURE evaluations are found to be less distinguishable from actual deployments, addressing critical concerns in AI safety and alignment.
Key Points
- LURE reduces evaluation awareness, improving the validity of AI safety benchmarks.
- Automated realism measurement pipeline detects evaluation awareness in transcripts.
- LURE evaluations closely mimic real user interactions, enhancing benchmark reliability.
- Implemented in contexts like AI safety sabotage and sycophancy.
- Evaluation realism should be reported alongside benchmark results for safety cases.
Article Content
From source RSS / original summaryarXiv:2605. 26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end.
We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users.
We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The reliability of LLM judges for evaluating deep research agents is critically assessed using the REFLECT benchmark.