Large language models (LLMs) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. Previous work has explored how to ground LLMs in robotic tasks to generate feasible and executable textual plans. However, low-level execution in the physical world may deviate from the high-level textual plan due to environmental perturbations or imperfect controller design. In this paper, we propose \textbf{DoReMi}, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs in a dual role: they not only aid in high-level planning but also generate constraints that indicate misalignment during execution. Vision-language models (VLMs) then continuously check for violations of these constraints. The pipeline thus monitors low-level execution and enables timely recovery whenever a plan-execution misalignment occurs. Experiments on a variety of complex tasks with robot arms and humanoid robots demonstrate that our method leads to higher task success rates and shorter task completion times. Videos of DoReMi are available at \url{https://sites.google.com/view/doremi-paper}.
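The detect-and-recover loop summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Step`, `run_with_monitoring`, and the three callables are hypothetical names introduced here. In a real system, `propose_step` would presumably wrap an LLM that returns the next skill together with its constraints, `constraint_holds` would wrap a VLM query on the current camera image, and `execute_tick` would advance a low-level controller by one control step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    skill: str              # low-level skill to run (hypothetical representation)
    constraints: List[str]  # conditions that should hold while the skill executes


def run_with_monitoring(
    propose_step: Callable[[List[str]], Optional[Step]],  # stand-in for the LLM planner
    constraint_holds: Callable[[str], bool],              # stand-in for the VLM detector
    execute_tick: Callable[[str], bool],                  # one control step; True when skill is done
    max_replans: int = 5,
) -> List[str]:
    """Execute planned steps, checking constraints after every unfinished tick."""
    history: List[str] = []
    replans = 0
    while True:
        step = propose_step(history)
        if step is None:  # the planner signals that the task is complete
            return history
        while True:
            if execute_tick(step.skill):
                history.append(f"done: {step.skill}")
                break
            # The skill is still running: look for the first violated constraint.
            violated = next(
                (c for c in step.constraints if not constraint_holds(c)), None
            )
            if violated is not None:
                # Misalignment detected: abort the skill and ask the planner again.
                history.append(f"violated during {step.skill}: {violated}")
                replans += 1
                if replans > max_replans:
                    raise RuntimeError("too many recovery attempts")
                break
```

Because the constraint check runs inside the execution loop rather than only after a skill finishes, a failure such as a dropped object can trigger replanning immediately instead of after the skill times out. The history of outcomes is passed back to the planner so that the next proposed step can take the violation into account.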