How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

From arXiv: 11 pages of main text; 19 pages of appendices. Included models: GPT-3.5-{0613, 1106, 0125}, GPT-4-0125, Gemini-{1.0, 1.5}-Pro, LLaMA-3.1-{8, 70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B

Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available via https://github.com/CUHK-ARISE/GAMABench.
