GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs; and (2) perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs behave differently across gaming scenarios; for example, LLMs fail in complete-information and deterministic games, yet they are competitive in probabilistic gaming scenarios; (2) most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, yet the recently released Llama-3-70b-Instruct makes up for this shortcoming. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further explorations in the strategic reasoning of LLMs.
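
To make the notions of equilibrium and Pareto efficiency mentioned above concrete, the following is a minimal sketch on a single stage of a Prisoner's Dilemma. The payoff values and the helper names (`PAYOFFS`, `is_nash`, `is_pareto_efficient`) are illustrative assumptions, not taken from GTBench or its codebase.

```python
from itertools import product

# Illustrative Prisoner's Dilemma payoffs (assumed values, not from GTBench).
# PAYOFFS[(row_action, col_action)] = (row_payoff, col_payoff)
ACTIONS = ("cooperate", "defect")
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def is_nash(profile):
    """True if neither player gains by unilaterally changing their action."""
    row, col = profile
    row_pay, col_pay = PAYOFFS[profile]
    row_ok = all(PAYOFFS[(r, col)][0] <= row_pay for r in ACTIONS)
    col_ok = all(PAYOFFS[(row, c)][1] <= col_pay for c in ACTIONS)
    return row_ok and col_ok


def is_pareto_efficient(profile):
    """True if no other outcome is at least as good for both players
    and strictly better for at least one of them."""
    mine = PAYOFFS[profile]
    for other in PAYOFFS.values():
        if other != mine and other[0] >= mine[0] and other[1] >= mine[1]:
            return False
    return True


profiles = list(product(ACTIONS, ACTIONS))
nash_profiles = [p for p in profiles if is_nash(p)]
pareto_profiles = [p for p in profiles if is_pareto_efficient(p)]

print("Pure Nash equilibria:", nash_profiles)
print("Pareto-efficient outcomes:", pareto_profiles)
```

Under these assumed payoffs, mutual defection is the only pure-strategy equilibrium and is also the only outcome that is not Pareto efficient, which is the kind of gap between equilibrium play and efficient play that a repeated-game analysis can examine.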
