Decision-making is a complex task requiring diverse abilities, making it an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We then introduce GAMA-Bench, a framework comprising eight classical multi-agent games, along with a scoring scheme to quantitatively assess a model's performance in these games. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited; however, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 60.5. Moreover, Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate comparable intelligence on GAMA-Bench. The code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.