Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at: $\href{https://github.com/VITA-Group/o1-planning}{\text{https://github.com/VITA-Group/o1-planning}}$.