In this study, we aim to give robots the capability of physically grounded task planning. Recent work has shown that large language models (LLMs) possess extensive knowledge useful for robotic tasks, especially reasoning and planning. However, LLMs lack world grounding and depend on external affordance models to perceive the environment, and these affordance models cannot reason jointly with the LLM. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach to long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa integrates perceptual data directly into its reasoning and planning process, which lets it draw on commonsense knowledge about the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, shows that ViLa outperforms existing LLM-based planners across a wide array of open-world manipulation tasks.
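The abstract describes a planner that generates actionable steps from visual input and incorporates visual feedback. One way to picture that loop is the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function `plan_closed_loop`, the callables `observe`, `query_vlm`, and `execute`, the `DONE` sentinel, and the `max_steps` cap are all hypothetical names introduced here, and the VLM is treated as an opaque function that maps the current image, the goal, and the steps executed so far to the next step.

```python
from typing import Callable, List

# Hypothetical interfaces; none of these names come from the paper.
Image = bytes
Observe = Callable[[], Image]                       # capture the current scene
QueryVLM = Callable[[Image, str, List[str]], str]   # (image, goal, history) -> next step
Execute = Callable[[str], None]                     # run one step on the robot

DONE = "done"  # assumed sentinel the VLM returns when the goal is reached


def plan_closed_loop(
    goal: str,
    observe: Observe,
    query_vlm: QueryVLM,
    execute: Execute,
    max_steps: int = 10,
) -> List[str]:
    """Replan from a fresh image before every step (visual feedback)."""
    history: List[str] = []
    for _ in range(max_steps):
        image = observe()                            # re-perceive the scene
        step = query_vlm(image, goal, history).strip()
        if step.lower() == DONE:                     # planner reports completion
            break
        execute(step)                                # act, then loop to re-observe
        history.append(step)
    return history
```

Because the image is re-captured before each query, a step that fails or changes the scene unexpectedly is visible to the planner on the next iteration; the `max_steps` cap is only a safeguard against a planner that never reports completion.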