Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
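To make the topology terminology concrete, the following is a minimal Python sketch of how star and chain coordination structures might be represented as adjacency lists over agent names. The function names, agent roles, and representation are illustrative assumptions for exposition only; they are not taken from the MARBLE codebase.

```python
# Hypothetical sketch of coordination topologies as adjacency lists.
# Agent names ("planner", "coder", "reviewer") are illustrative, not
# part of MultiAgentBench/MARBLE.
from typing import Dict, List

Topology = Dict[str, List[str]]


def star(center: str, others: List[str]) -> Topology:
    """Star topology: a central coordinator exchanges messages with every other agent."""
    topo: Topology = {center: list(others)}
    topo.update({agent: [center] for agent in others})
    return topo


def chain(agents: List[str]) -> Topology:
    """Chain topology: each agent forwards messages only to the next agent in line."""
    return {
        agent: ([agents[i + 1]] if i + 1 < len(agents) else [])
        for i, agent in enumerate(agents)
    }


if __name__ == "__main__":
    print(star("planner", ["coder", "reviewer"]))
    print(chain(["planner", "coder", "reviewer"]))
```

Tree and graph topologies generalize these in the obvious way: a tree restricts edges to parent-child links, while a general graph permits arbitrary agent-to-agent communication edges.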