Although large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance once more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools yields only modest improvements, pointing to inherent architectural limitations that may not be overcome by test-time scaling approaches alone.
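To make the difficulty stratification concrete, the sketch below shows one way instances of a simplified Sokoban puzzle can be binned by optimal solution length (e.g., the at-most-25 versus more-than-25 move regimes referenced above). This is an illustrative sketch, not the paper's benchmark code: the grid symbols ('#' wall, '@' player, '$' box, '.' goal) follow common Sokoban conventions, the function names `parse` and `optimal_moves` are invented for this example, and the encoding omits refinements such as a box-on-goal symbol.

```python
# Minimal sketch (assumed, not the authors' benchmark): breadth-first
# search over simplified Sokoban states to compute an instance's optimal
# move count, so puzzles can be grouped by planning horizon.
from collections import deque

def parse(level: str):
    """Read a grid with assumed symbols: '#' wall, '@' player, '$' box, '.' goal."""
    walls, boxes, goals = set(), set(), set()
    player = None
    for r, row in enumerate(level.splitlines()):
        for c, ch in enumerate(row):
            if ch == '#':
                walls.add((r, c))
            elif ch == '$':
                boxes.add((r, c))
            elif ch == '.':
                goals.add((r, c))
            elif ch == '@':
                player = (r, c)
    return player, frozenset(boxes), frozenset(goals), walls

def optimal_moves(level: str):
    """Return the length of a shortest solution, or None if unsolvable."""
    player, boxes, goals, walls = parse(level)
    start = (player, boxes)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        ((pr, pc), boxes), depth = queue.popleft()
        if boxes == goals:          # every box rests on a goal square
            return depth
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = pr + dr, pc + dc
            if (nr, nc) in walls:
                continue
            nboxes = boxes
            if (nr, nc) in boxes:   # pushing: the square beyond must be free
                br, bc = nr + dr, nc + dc
                if (br, bc) in walls or (br, bc) in boxes:
                    continue
                nboxes = frozenset(boxes - {(nr, nc)} | {(br, bc)})
            state = ((nr, nc), nboxes)
            if state not in seen:
                seen.add(state)
                queue.append((state, depth + 1))
    return None

if __name__ == "__main__":
    level = (
        "#######\n"
        "#@ $ .#\n"
        "#######"
    )
    print(optimal_moves(level))  # -> 3: step right, then two pushes
```

Exhaustive search of this kind gives exact optimal move counts, which is what makes a clean horizon-based analysis possible; it is tractable here only because the puzzles are intentionally small and simplified, which is consistent with the abstract's stated design goal of isolating long-horizon planning from other confounds.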