Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of large language models operating in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of LLM agents. We focus on 9 different game environments, each covering at least one axis of key reasoning skills identified in strategy games, and we select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but not to levels comparable with humans.