Large language models (LLMs) have been shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite this progress, most approaches still rely on pre-defined motion primitives to carry out physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., dense sequences of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open set of instructions and an open set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints from a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps that ground this knowledge in the observation space of the agent. The composed value maps are then used in a model-based planning framework to synthesize, zero-shot, closed-loop robot trajectories that are robust to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experience by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing its ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code are available at https://voxposer.github.io
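To make the idea of composing value maps and planning over them concrete, here is a minimal sketch, not the authors' implementation. It assumes a small voxel grid, a hard-coded target and obstacle position (in the described method these would come from LLM-generated code querying a VLM), and a plain Dijkstra search as a stand-in planner. All function names and parameters (`affordance_map`, `avoidance_map`, `plan`, `avoid_weight`, `radius`) are hypothetical.

```python
import heapq
import itertools

import numpy as np


def distance_map(shape, point):
    """Euclidean distance (in voxels) from every voxel to `point`."""
    grid = np.indices(shape).astype(float)  # shape (3, X, Y, Z)
    offset = grid - np.asarray(point, dtype=float).reshape(3, 1, 1, 1)
    return np.linalg.norm(offset, axis=0)


def affordance_map(shape, target):
    """Hypothetical affordance cost: 0 at the target, growing with distance."""
    dist = distance_map(shape, target)
    return dist / dist.max()


def avoidance_map(shape, obstacle, radius=3.0):
    """Hypothetical constraint cost: 1 at the obstacle, 0 at `radius` and beyond."""
    return np.clip(1.0 - distance_map(shape, obstacle) / radius, 0.0, 1.0)


def plan(affordance, avoidance, start, avoid_weight=50.0):
    """Dijkstra search from `start` to the voxel with the lowest affordance cost.

    Each step costs its Euclidean length plus a weighted avoidance penalty
    for the voxel being entered. This is a stand-in planner (an assumption).
    """
    shape = affordance.shape
    goal = tuple(int(i) for i in np.unravel_index(np.argmin(affordance), shape))
    start = tuple(int(i) for i in start)
    # 26-connected neighbourhood with precomputed step lengths.
    moves = [
        (off, float(np.linalg.norm(off)))
        for off in itertools.product((-1, 0, 1), repeat=3)
        if any(off)
    ]
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, voxel = heapq.heappop(queue)
        if voxel == goal:
            break
        if cost > best[voxel]:
            continue  # stale queue entry
        for off, length in moves:
            nxt = tuple(v + o for v, o in zip(voxel, off))
            if not all(0 <= n < s for n, s in zip(nxt, shape)):
                continue  # outside the workspace
            new_cost = cost + length + avoid_weight * float(avoidance[nxt])
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = voxel
                heapq.heappush(queue, (new_cost, nxt))
    # Walk back from the goal to recover the waypoint sequence.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Example: move from one side of the workspace to the other while
# keeping clear of an obstacle placed directly on the straight line.
shape = (20, 20, 20)
start, target, obstacle = (2, 10, 10), (17, 10, 10), (10, 10, 10)
aff = affordance_map(shape, target)
avoid = avoidance_map(shape, obstacle)
path = plan(aff, avoid, start)
print(len(path), "waypoints from", path[0], "to", path[-1])
```

The sketch only produces positional waypoints on a static grid. Closed-loop behavior would correspond to recomputing the maps and replanning as observations change, and the 6-DoF orientation and learned dynamics model mentioned in the abstract are not modeled here.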