Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios in which one LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.