Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at: $\href{https://github.com/VITA-Group/o1-planning}{\text{https://github.com/VITA-Group/o1-planning}}$.