OdysSim: Building Foundation Models for Human Behavior Simulation
Quick Answer
OdysSim introduces OSim, an open 8B model that ranks first or tied-first on 8 of 23 benchmark tasks, with its strongest gains on conversational and social tasks.
Quick Take
By that count, OSim outperforms any individual frontier model, and its outputs are more human-like in length, formatting, and word choice. The study argues that closing the behavioral Sim2Real gap, which arises when helpfulness-driven post-training pulls models toward an overly agreeable assistant register, requires rethinking the LLM training paradigm.
Key Points
- OdysSim corpus includes 21.4M interactions and 10B tokens for training.
- SOUL taxonomy unifies 62 datasets and 23 benchmark tasks into one framework.
- OSim nearly matches real users on reaction alignment in zero-shot user simulation on $\tau$-bench (93.2 vs. 93.5).
- LLM-as-judge RL induces reward-hacking patterns, which the authors' detectors can mitigate during post-training.
- All research artifacts are released to support future investigations.
Article Content
From source RSS / original summary
arXiv:2606.14199v1 Announce Type: new
Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale.
We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation.
The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $\tau$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5).
We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.