Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
Quick Take
The paper introduces Persona Policies (PPol), a plug-and-play layer that makes LLM user simulators behave like diverse, realistic humans for agent evaluation and training.
Key Points
- PPol generates diverse, realistic user personas for LLM evaluation.
- Achieves 33–62% absolute fitness-score gains over baseline user simulators.
- Agents trained with PPol-conditioned simulators show 17% higher task success on challenging, out-of-distribution user behaviors.
Abstract
Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across the τ²-bench retail and airline domains, evolved PPol programs yield 33–62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.
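To make the search loop described above concrete, here is a minimal, self-contained sketch of an evolutionary program search guided by a multi-objective fitness score. It is not the authors' implementation: the trait inventory, the proxy human-likeness and coverage objectives, and the random hill-climbing mutation (standing in for the paper's LLM-driven edits to the generator's source code) are all illustrative assumptions.

```python
import random

# Hypothetical behavioral trait inventory; the paper's actual taxonomy is not shown in the abstract.
TRAITS = ["impatient", "vague", "terse", "reluctant", "chatty", "confused", "cooperative"]

def generate_personas(params, n=20):
    """Toy stand-in for the evolved Python generator: samples persona
    trait sets according to per-trait inclusion probabilities."""
    rng = random.Random(params["seed"])
    personas = []
    for _ in range(n):
        traits = [t for t in TRAITS if rng.random() < params["p"][t]]
        personas.append(frozenset(traits) or frozenset(["cooperative"]))
    return personas

def human_likeness(personas):
    """Proxy objective: penalize the all-cooperative default that plain
    LLM simulators exhibit. The real score would come from judging
    simulated dialogues against human traces."""
    return 1.0 - sum(p == frozenset(["cooperative"]) for p in personas) / len(personas)

def coverage(personas):
    """Proxy objective: fraction of the trait inventory that appears
    somewhere in the generated population."""
    seen = set().union(*personas)
    return len(seen & set(TRAITS)) / len(TRAITS)

def fitness(personas, w=0.5):
    # Multi-objective score: weighted blend of human-likeness and coverage.
    return w * human_likeness(personas) + (1 - w) * coverage(personas)

def mutate(params, rng):
    """In PPol the mutation step is LLM-driven program editing; here we
    just jitter the inclusion probabilities to keep the loop runnable."""
    return {
        "seed": rng.randrange(10**6),
        "p": {t: min(1.0, max(0.0, v + rng.gauss(0, 0.1)))
              for t, v in params["p"].items()},
    }

def evolve(generations=30, pop_size=8):
    """Simple (1+lambda) hill climb: propose pop_size mutants per
    generation and keep the fittest generator found so far."""
    rng = random.Random(0)
    best = {"seed": 0, "p": {t: 0.1 for t in TRAITS}}
    best_fit = fitness(generate_personas(best))
    for _ in range(generations):
        for cand in (mutate(best, rng) for _ in range(pop_size)):
            f = fitness(generate_personas(cand))
            if f > best_fit:
                best, best_fit = cand, f
    return best, best_fit

if __name__ == "__main__":
    gen, score = evolve()
    print(f"best fitness: {score:.3f}")
```

Presumably, in the full system the mutation step prompts an LLM to rewrite the generator program itself and the two objectives are scored by learned judges, but the selection pressure works the same way: generators survive only if their persona populations are both human-like and behaviorally broad.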
| Comments: | Preprint under review |
| Subjects: | Artificial Intelligence (cs.AI); Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.12894 [cs.AI] (or arXiv:2605.12894v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12894 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Harshita Chopra
[v1] Wed, 13 May 2026 02:16:51 UTC (2,309 KB)
— Originally published at arxiv.org