DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 30,707 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels while also supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate them in two settings, next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions. The benchmark is publicly available.
