Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $\tau$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.