Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring, owing to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) speech evaluation, formulated as multiple-choice tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables fine-grained evaluation of models' linguistic and reasoning capabilities while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs perform unevenly, with roughly half scoring below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset inspires further research on language, reasoning, and strategy in multi-agent interaction.