This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B, and GPT-OSS 120B) on a suite of forward-simulation tasks, including next / multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degrading as the evaluation horizon increases (i.e., with a higher number of game steps). A detailed case-based analysis of LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, and syntactic errors. Overall, the paper reports clear progress in the formal reasoning capabilities of contemporary models.