As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities become increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks that span a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. We then investigate two key problems: (1) characterizing the game-theoretic reasoning of LLMs, and (2) using LLM-vs-LLM competitions as reasoning evaluation. We observe that (1) LLMs behave differently across gaming scenarios; for example, they fail in complete-information, deterministic games yet are competitive in probabilistic ones; and (2) open-source LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs, e.g., GPT-4, in complex games. In addition, code pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We also provide detailed error profiles for a better understanding of LLMs' behavior.