In this study, we aim to give robots the capability of physically grounded task planning. Recent work has shown that large language models (LLMs) possess extensive knowledge useful for robotic tasks, especially for reasoning and planning. However, LLMs lack grounding in the physical world and depend on external affordance models to perceive the environment, and these affordance models cannot reason jointly with the LLM. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach to long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa integrates perceptual data directly into its reasoning and planning process, which lets it draw on commonsense knowledge about the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, shows that ViLa outperforms existing LLM-based planners across a wide array of open-world manipulation tasks.
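To make the idea of planning with visual feedback concrete, the following is a minimal sketch of one way such a closed loop could be organized. It is not the authors' implementation. Every name here (`closed_loop_plan`, `observe`, `query_vlm`, `execute`, and the toy world) is hypothetical, and the VLM is replaced by a hand-written stub so the example runs without any model. The assumption is that the planner re-queries the model with a fresh image after each executed step and carries out only the first step of each returned plan.

```python
from typing import Callable, List, Tuple

Image = bytes  # stand-in for a camera frame (assumption for this sketch)


def closed_loop_plan(
    goal: str,
    observe: Callable[[], Image],
    query_vlm: Callable[[Image, str, List[str]], List[str]],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Re-plan from a fresh observation after every executed step."""
    history: List[str] = []
    for _ in range(max_steps):
        image = observe()                       # current visual observation
        plan = query_vlm(image, goal, history)  # full plan given the scene
        if not plan or plan[0] == "done":
            break
        step = plan[0]                          # commit only to the first step
        execute(step)
        history.append(step)
    return history


def make_toy_world() -> Tuple[Callable[[], Image], Callable[[str], None]]:
    """Hypothetical world: an obstacle must be moved before the target is picked."""
    state = {"obstacle": True, "holding": False}

    def observe() -> Image:
        # Encode the scene as text bytes in place of a real image.
        return f"obstacle={state['obstacle']};holding={state['holding']}".encode()

    def execute(step: str) -> None:
        if step == "move obstacle aside":
            state["obstacle"] = False
        elif step == "pick up target":
            state["holding"] = True

    return observe, execute


def toy_vlm(image: Image, goal: str, history: List[str]) -> List[str]:
    """Stub standing in for a VLM call; it reads the encoded scene directly."""
    scene = image.decode()
    if "holding=True" in scene:
        return ["done"]
    if "obstacle=True" in scene:
        return ["move obstacle aside", "pick up target", "done"]
    return ["pick up target", "done"]
```

In this toy setting, calling `closed_loop_plan("pick up the target", *make_toy_world_args)` with the stub arguments in the order `observe, toy_vlm, execute` moves the obstacle first and then picks up the target, because the stub "sees" the obstacle in the first observation. Executing one step per query is a design choice for the sketch: it trades extra model calls for the ability to react when the scene changes.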