In this paper, we explore the potential of Large Language Model (LLM) agents in playing the strategic social deduction game Resistance Avalon. Players in Avalon are challenged not only to make informed decisions based on dynamically evolving game phases, but also to engage in discussions where they must deceive, deduce, and negotiate with other players. These characteristics make Avalon a compelling test-bed for studying the decision-making and language-processing capabilities of LLM agents. To facilitate research along this line, we introduce AvalonBench, a comprehensive game environment tailored for evaluating multi-agent LLM agents. This benchmark incorporates: (1) a game environment for Avalon, (2) rule-based bots as baseline opponents, and (3) ReAct-style LLM agents with tailored prompts for each role. Notably, our evaluations based on AvalonBench highlight a clear capability gap. For instance, models like ChatGPT playing the good role achieve a win rate of 22.2% against rule-based bots playing evil, while a good-role bot achieves a 38.2% win rate in the same setting. We envision that AvalonBench can serve as a test-bed for developing more advanced LLMs (with self-play) and agent frameworks that can effectively model the layered complexities of such game environments.