Large language models (LLMs) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. Previous research has explored how to ground language models in robotic tasks so that the generated action sequences are both logically correct and practically executable. However, low-level execution may deviate from the high-level plan due to environmental perturbations or imperfect controller design. In this paper, we propose DoReMi, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs both to plan and to generate constraints for the planned steps. Violations of these constraints can indicate plan-execution misalignments, and we use a visual question answering (VQA) model to check the constraints during low-level skill execution. If a misalignment occurs, our method calls the language model to re-plan and recover from it. Experiments on various complex tasks with robot arms and humanoid robots demonstrate that our method leads to higher task success rates and shorter task completion times. Videos of DoReMi are available at https://sites.google.com/view/doremi-paper.
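The detect-and-recover loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify prompts, the VQA model, or the skill interface, so `Step`, `run_with_recovery`, and the `plan`, `execute`, and `holds` callables are hypothetical stand-ins for the LLM planner, the low-level controller, and the VQA constraint checker, and the `max_replans` budget is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Step:
    skill: str              # natural-language skill, e.g. "pick up block"
    constraints: List[str]  # yes/no questions that must hold during the skill


# Hypothetical interfaces (assumptions, not from the paper):
Planner = Callable[[str, List[str]], List[Step]]  # (task, history) -> steps
Executor = Callable[[str], Iterable[object]]      # skill -> one item per control tick
Checker = Callable[[str], bool]                   # constraint -> does it hold now?


def run_with_recovery(task: str, plan: Planner, execute: Executor,
                      holds: Checker, max_replans: int = 3
                      ) -> Tuple[bool, List[str]]:
    """Run a plan, checking constraints every tick and re-planning on violation."""
    history: List[str] = []
    steps = list(plan(task, history))
    replans = 0
    while steps:
        step = steps.pop(0)
        violated: Optional[str] = None
        for _ in execute(step.skill):
            # Check every constraint of the current step at each tick.
            violated = next((c for c in step.constraints if not holds(c)), None)
            if violated is not None:
                break  # abort the skill as soon as a misalignment is detected
        if violated is None:
            history.append(f"done: {step.skill}")
            continue
        if replans >= max_replans:
            return False, history
        replans += 1
        history.append(f"violated: {violated}")
        steps = list(plan(task, history))  # ask the planner for a recovery plan
    return True, history


# Toy stand-ins: the planner always restarts, and the first check fails once
# to simulate the block being dropped while it is carried.
def toy_plan(task: str, history: List[str]) -> List[Step]:
    return [Step("pick up block", []),
            Step("place block in box", ["Is the block in the gripper?"])]


def toy_execute(skill: str) -> Iterable[object]:
    return range(2)  # every skill takes two control ticks


calls = {"n": 0}


def toy_holds(constraint: str) -> bool:
    calls["n"] += 1
    return calls["n"] != 1


ok, history = run_with_recovery("put the block in the box",
                                toy_plan, toy_execute, toy_holds)
```

In this toy run the simulated drop is caught on the first tick of the place skill, the planner is called again, and the task then completes. A real planner would presumably condition the recovery plan on `history` rather than restarting from scratch.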