This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimization, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to pursue long-term goals and navigate complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we introduce a hierarchical advantage estimation scheme that computes both turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating that collaborative behavior can be learned even in competitive scenarios. While we identify practical challenges such as reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work motivates further research on AI social intelligence in group conversations.
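To make the hierarchical advantage estimation concrete, the sketch below shows one plausible instantiation: each turn receives a discounted return-to-go minus a mean baseline (the turn-level advantage), which is then broadcast to that turn's tokens. The function name, the mean baseline, and the broadcast rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hierarchical_advantages(turn_rewards, turn_lengths, gamma=0.99):
    """Sketch of hierarchical advantage estimation (illustrative).

    Turn level: discounted return-to-go per turn minus a mean baseline.
    Token level: every token in a turn inherits its turn's advantage
    via a simple broadcast; the paper's actual scheme may differ.
    """
    turn_rewards = np.asarray(turn_rewards, dtype=float)
    n = len(turn_rewards)

    # Discounted return-to-go for each turn, computed backwards.
    returns = np.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        running = turn_rewards[t] + gamma * running
        returns[t] = running

    # Turn-level advantage: center returns by their mean (a simple baseline).
    turn_adv = returns - returns.mean()

    # Token-level advantage: broadcast each turn's advantage over its tokens.
    token_adv = np.concatenate(
        [np.full(length, a) for a, length in zip(turn_adv, turn_lengths)]
    )
    return turn_adv, token_adv
```

In this framing, the turn-level term credits whole conversational moves (e.g. a persuasive turn), while the token-level broadcast lets a standard policy-gradient update operate on individual tokens.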