Open-vocabulary generalization requires robotic systems to perform tasks across complex and diverse environments and task goals. While recent advances in vision-language models (VLMs) present unprecedented opportunities to solve unseen problems, how to use their emergent capabilities to control robots in the physical world remains an open question. In this paper, we present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language descriptions. At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world. By prompting a VLM pre-trained on Internet-scale data, our approach predicts affordances and generates the corresponding motions, leveraging the concept understanding and commonsense knowledge the VLM has acquired from broad sources. To scaffold the VLM's zero-shot reasoning, we propose a visual prompting technique that annotates marks on the images, converting the prediction of keypoints and waypoints into a series of visual question answering problems that are feasible for the VLM to solve. Using the robot experiences collected in this way, we further investigate ways to bootstrap performance through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions, such as tool use, deformable body manipulation, and object rearrangement.