In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively relabelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) such as CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground-truth annotations? To answer this question, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we use semi-supervised language labels that leverage the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data, and then train language-conditioned policies on the augmented datasets. This method acquires useful language descriptions more cheaply than human labelling, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
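As a rough illustration of the relabelling idea, the sketch below scores candidate instructions against episode embeddings and keeps the best matches. It is not the authors' implementation: the abstract does not specify the scoring rule, so cosine similarity with top-k selection is an assumption, the function name `top_k_instructions` is hypothetical, and the toy arrays stand in for embeddings that a CLIP-style model would produce. The final lines only work out the dataset split implied by the 96.5% figure.

```python
import numpy as np


def top_k_instructions(episode_emb, instruction_emb, k=2):
    """Return, for each episode, the indices of the k most similar instructions.

    Hypothetical helper: assumes cosine similarity between precomputed
    episode and instruction embeddings (the abstract does not fix this).
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    e = episode_emb / np.linalg.norm(episode_emb, axis=1, keepdims=True)
    t = instruction_emb / np.linalg.norm(instruction_emb, axis=1, keepdims=True)
    sim = e @ t.T  # shape: (num_episodes, num_instructions)
    # Sort each row by descending similarity and keep the first k columns.
    return np.argsort(-sim, axis=1)[:, :k]


# Toy stand-ins for embeddings; real ones would come from a vision-language model.
episodes = np.array([[1.0, 0.0], [0.0, 1.0]])
instructions = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
labels = top_k_instructions(episodes, instructions, k=1)

# Dataset split implied by the abstract: 96.5% of 80,000 demonstrations lack labels.
total = 80_000
unlabelled = round(total * 0.965)
labelled = total - unlabelled
```

Under these assumptions, each unlabelled episode would receive its top-scoring instructions as new language labels, and the arithmetic shows how small the crowd-sourced subset is relative to the full dataset.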