Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and identification of target objects using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across modules for robust failure recovery. In comprehensive experiments on 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot improves the task success rate by a significant margin (~35%) over state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
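The closed-loop feedback and restoration mechanism described in the abstract can be pictured, very roughly, as an execute-verify-recover loop. The sketch below is only an illustration under stated assumptions and is not COME-robot's implementation: `StepResult`, `run_closed_loop`, the callables `execute`, `verify`, and `recover`, the failure-cause strings, and the `max_retries` limit are all hypothetical names introduced here. In a real system the verification and recovery callables would presumably be backed by the vision-language model and the robot's perception stack.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StepResult:
    """Outcome of checking one executed step (hypothetical structure)."""
    success: bool
    failure_cause: Optional[str] = None  # e.g. "perception" or "grasp" (illustrative labels)


def run_closed_loop(
    plan: List[str],
    execute: Callable[[str], None],
    verify: Callable[[str], StepResult],
    recover: Callable[[str, str], List[str]],
    max_retries: int = 2,
) -> bool:
    """Execute each step, check its outcome, and replan on failure.

    Returns True if every step eventually succeeds, and False once any
    single step has failed more than `max_retries` times.
    """
    queue = list(plan)
    failures: Dict[str, int] = {}
    while queue:
        step = queue.pop(0)
        execute(step)
        result = verify(step)
        if result.success:
            continue
        failures[step] = failures.get(step, 0) + 1
        if failures[step] > max_retries:
            return False
        # Run the recovery steps first, then retry the failed step,
        # then carry on with the rest of the original plan.
        cause = result.failure_cause or "unknown"
        queue = recover(step, cause) + [step] + queue
    return True
```

For example, if `verify` reports a failed grasp once and `recover` answers with a single re-observation step, the loop runs that step, retries the grasp, and then continues with the remaining plan. Keeping the failure cause explicit is the design choice that lets recovery be targeted at the module that failed rather than restarting the whole task.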