As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks that span a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs and (2) perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs behave differently across gaming scenarios; for example, they fail in complete-information, deterministic games yet are competitive in probabilistic ones; and (2) most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, yet the recently released Llama-3-70b-Instruct makes up for this shortcoming. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further exploration of the strategic reasoning of LLMs.