Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
Quick Take
The paper introduces Persona Policies (PPol), a plug-and-play layer that makes LLM user simulators behave like diverse, realistic humans for agent evaluation and training.
Key Points
- PPol generates diverse, realistic user personas for LLM evaluation.
- Achieves 33–62% absolute fitness-score gains over baseline user simulators.
- Agents trained with PPol-conditioned simulators show 17% higher task success on challenging, out-of-distribution user behaviors.
Abstract
Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across the τ²-bench retail and airline domains, evolved PPol programs yield 33–62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.
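To make the search loop described above concrete, here is a minimal, self-contained sketch of an evolutionary program search guided by a multi-objective fitness score. It is not the authors' implementation: the trait inventory, the proxy human-likeness and coverage objectives, and the random hill-climbing mutation (standing in for the paper's LLM-driven edits to the generator's source code) are all illustrative assumptions.

```python
import random

# Hypothetical behavioral trait inventory; the paper's actual taxonomy is not shown in the abstract.
TRAITS = ["impatient", "vague", "terse", "reluctant", "chatty", "confused", "cooperative"]

def generate_personas(params, n=20):
    """Toy stand-in for the evolved Python generator: samples persona
    trait sets according to per-trait inclusion probabilities."""
    rng = random.Random(params["seed"])
    personas = []
    for _ in range(n):
        traits = [t for t in TRAITS if rng.random() < params["p"][t]]
        personas.append(frozenset(traits) or frozenset(["cooperative"]))
    return personas

def human_likeness(personas):
    """Proxy objective: penalize the all-cooperative default that plain
    LLM simulators exhibit. The real score would come from judging
    simulated dialogues against human traces."""
    return 1.0 - sum(p == frozenset(["cooperative"]) for p in personas) / len(personas)

def coverage(personas):
    """Proxy objective: fraction of the trait inventory that appears
    somewhere in the generated population."""
    seen = set().union(*personas)
    return len(seen & set(TRAITS)) / len(TRAITS)

def fitness(personas, w=0.5):
    # Multi-objective score: weighted blend of human-likeness and coverage.
    return w * human_likeness(personas) + (1 - w) * coverage(personas)

def mutate(params, rng):
    """In PPol the mutation step is LLM-driven program editing; here we
    just jitter the inclusion probabilities to keep the loop runnable."""
    return {
        "seed": rng.randrange(10**6),
        "p": {t: min(1.0, max(0.0, v + rng.gauss(0, 0.1)))
              for t, v in params["p"].items()},
    }

def evolve(generations=30, pop_size=8):
    """Simple (1+lambda) hill climb: propose pop_size mutants per
    generation and keep the fittest generator found so far."""
    rng = random.Random(0)
    best = {"seed": 0, "p": {t: 0.1 for t in TRAITS}}
    best_fit = fitness(generate_personas(best))
    for _ in range(generations):
        for cand in (mutate(best, rng) for _ in range(pop_size)):
            f = fitness(generate_personas(cand))
            if f > best_fit:
                best, best_fit = cand, f
    return best, best_fit

if __name__ == "__main__":
    gen, score = evolve()
    print(f"best fitness: {score:.3f}")
```

Presumably, in the full system the mutation step prompts an LLM to rewrite the generator program itself and the two objectives are scored by learned judges, but the selection pressure works the same way: generators survive only if their persona populations are both human-like and behaviorally broad.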
| Comments: | Preprint under review |
| Subjects: | Artificial Intelligence (cs.AI); Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.12894 [cs.AI] (or arXiv:2605.12894v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12894 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Harshita Chopra
[v1] Wed, 13 May 2026 02:16:51 UTC (2,309 KB)
— Originally published at arxiv.org