Large Language Models (LLMs) have significantly advanced the field of natural language processing, demonstrating exceptional capabilities in reasoning, tool usage, and memory. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework that captures their abilities in reasoning, planning, collaboration, and more. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. We use games such as Chameleon and Undercover, alongside game theory scenarios such as Cost Sharing, Multi-player Prisoner's Dilemma, and Public Good, to create diverse testing environments. Our framework is fortified with the Probabilistic Graphical Modeling (PGM) method, which enhances the LLMs' capabilities in navigating complex social and cognitive dimensions. The benchmark evaluates seven multi-agent systems powered by different LLMs, quantitatively highlighting a capability gap of more than threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. It also confirms that our PGM enhancement boosts the inherent abilities of all selected models by 50% on average. Our code is available at https://github.com/cathyxl/MAgIC.