DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.
