Recent advancements in autonomous multi-agent systems (MAS) based on large language models (LLMs) have broadened the application scenarios of LLMs and improved their capability to handle complex tasks. Despite demonstrating effectiveness, existing studies still struggle with the evaluation, analysis, and reproducibility of LLM-based MAS. In this paper, to facilitate research on LLM-based MAS, we introduce an open, scalable, and real-time updated platform for accessing and analyzing LLM-based MAS, built on the game "Who is Spy?" (WiS). Our platform offers three main features: (1) a unified model evaluation interface that supports models available on Hugging Face; (2) a real-time updated leaderboard for model evaluation; and (3) a comprehensive evaluation covering game-winning rates, attacking and defense strategies, and the reasoning of LLMs. To rigorously test WiS, we conduct extensive experiments covering various open- and closed-source LLMs, and we find that different agents exhibit distinct and intriguing behaviors in the game. The experimental results demonstrate the effectiveness and efficiency of our platform in evaluating LLM-based MAS. Our platform and its documentation are publicly available at \url{https://whoisspy.ai/}.