Open-World Object Manipulation using Pre-trained Vision-Language Models

For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their sensory observations and actions. This poses a difficult challenge: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data of interactions with a stuffed whale. Fortunately, static data on the internet contains vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, Manipulation of Open-World Objects (MOO), that leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities for specifying the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/.
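
The conditioning scheme described above can be illustrated with a minimal sketch. The abstract does not say how the extracted object information is encoded, so the sketch assumes one plausible choice: a mask with a single pixel set at the center of the detected box. The names `Detection`, `PolicyInput`, `select_detection`, `center_mask`, and `build_policy_input` are hypothetical, and the detections are hand-written stand-ins for the output of a vision-language model, not calls to a real one.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Detection:
    """Hypothetical stand-in for one vision-language model detection."""
    label: str                       # text query the box was matched to
    score: float                     # detection confidence
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class PolicyInput:
    """The three things the policy is conditioned on, per the abstract."""
    image: List[List[float]]         # current image (grayscale here for brevity)
    instruction: str                 # language command
    object_mask: List[List[float]]   # extracted object information (assumed encoding)


def select_detection(detections: Sequence[Detection], query: str,
                     min_score: float = 0.1) -> Optional[Detection]:
    """Return the highest-scoring detection whose label equals the query, if any."""
    matches = [d for d in detections if d.label == query and d.score >= min_score]
    return max(matches, key=lambda d: d.score, default=None)


def center_mask(height: int, width: int,
                det: Optional[Detection]) -> List[List[float]]:
    """Build an H x W mask with 1.0 at the box center (all zeros if no detection)."""
    mask = [[0.0] * width for _ in range(height)]
    if det is not None:
        x_min, y_min, x_max, y_max = det.box
        # Clamp the center so a box touching the border still maps to a valid pixel.
        cx = min(max((x_min + x_max) // 2, 0), width - 1)
        cy = min(max((y_min + y_max) // 2, 0), height - 1)
        mask[cy][cx] = 1.0
    return mask


def build_policy_input(image: List[List[float]], instruction: str, query: str,
                       detections: Sequence[Detection]) -> PolicyInput:
    """Bundle the image, the instruction, and the object mask for a policy."""
    height, width = len(image), len(image[0])
    det = select_detection(detections, query)
    return PolicyInput(image, instruction, center_mask(height, width, det))
```

For example, with a 6 x 8 image and a detection labeled "pink stuffed whale" at box `(2, 1, 6, 5)`, `build_policy_input` sets the mask pixel at row 3, column 4. In this sketch the policy never has to recognize the novel object category itself: it only has to act on the location it is given, which is one way the generalization described in the abstract could be obtained.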
