ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

arXiv cs.CL·Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao

5/26/2026

·~2 min·5/26/2026·en·2

Quick Answer

ContextEcho introduces a benchmark for measuring persona drift in long coding sessions, revealing that 23 frontier models exhibit significant persona changes that may go unnoticed during deployment.

Quick Take

ContextEcho introduces a benchmark for measuring persona drift in long coding sessions, revealing that 23 frontier models exhibit significant persona changes that may go unnoticed during deployment. The framework allows for auditing model behavior across thousands of tool-using turns, highlighting the need for better evaluations of AI personas in real-world applications.

Key Points

ContextEcho combines a 25-probe identity suite for comprehensive persona evaluation.
The benchmark reveals that persona drift is common across different organizations' models.
In-session compaction does not reliably reset persona drift during long sessions.
Drift can aid tool-using continuation but disrupts formatting in tool-free chats.
The framework is open-source, allowing researchers to audit model personas effectively.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2605. 24279v1 Announce Type: new Abstract: A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant... "), revealing user-visible drift that deployer evaluations may miss.

Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale.

It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns.

Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length.

Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Barak Or

2w ago

FeaturedOriginal

Quantifying Prior Dominance in Systems

AI Summary

The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.

#LLM #AI Coding #Inference #AI Startup

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

From Solvers to Research: Large Language Model-Driven Formal Mathematics at the Research Frontier

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in RAG Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

From Solvers to Research: Large Language Model-Driven Formal Mathematics at the Research Frontier

Quantifying Prior Dominance in Systems