Language-Guided Robotic Manipulation (LGRM) is a challenging task because it requires a robot to understand human instructions in order to manipulate everyday objects. Recent LGRM approaches rely on pre-trained Visual Grounding (VG) models to detect objects, without adapting those models to the manipulation environment. This leads to a performance drop caused by the substantial domain gap between the pre-training data and real-world data. A straightforward solution is to collect additional training data, but the cost of human annotation is prohibitive. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM that continuously learns VG without human supervision. GVCCI iteratively generates synthetic instructions via object detection and trains the VG model on the generated data. We validate our framework in offline and online settings, across diverse environments and on different VG models. Experimental results show that accumulating synthetic data from GVCCI steadily improves VG by up to 56.7% and improves the resulting LGRM by up to 29.4%. Furthermore, qualitative analysis shows that the unadapted VG model often fails to find the correct objects because of a strong bias learned from the pre-training data. Finally, we introduce a novel VG dataset for LGRM, consisting of nearly 252k image-object-instruction triplets from diverse manipulation environments.
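The iterative loop described in the abstract (detect objects, generate synthetic instructions, accumulate the data, retrain the VG model) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Detection`, `Triplet`, `make_instruction`, and `lifelong_loop` are hypothetical, as are the instruction template and the choice to retrain once per round of images; the detector and the training step are passed in as plain callables so that no real library API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure, not from the paper)."""
    box: Box
    category: str
    attribute: str


@dataclass(frozen=True)
class Triplet:
    """An image-object-instruction triplet, as named in the abstract."""
    image_id: str
    box: Box
    instruction: str


def make_instruction(det: Detection) -> str:
    # Assumed template: the abstract does not say how instructions are worded.
    return f"pick up the {det.attribute} {det.category}"


def lifelong_loop(
    rounds: Iterable[List[str]],
    detect: Callable[[str], List[Detection]],
    train: Callable[[List[Triplet]], None],
    buffer: Optional[List[Triplet]] = None,
) -> List[Triplet]:
    """Accumulate synthetic triplets round by round and retrain on all of them."""
    buffer = [] if buffer is None else buffer
    for image_ids in rounds:
        for image_id in image_ids:
            for det in detect(image_id):
                buffer.append(Triplet(image_id, det.box, make_instruction(det)))
        # Retrain on everything gathered so far (no human labels involved).
        train(list(buffer))
    return buffer


# Toy usage with a stand-in detector and a "trainer" that only records data size.
def toy_detect(image_id: str) -> List[Detection]:
    return [
        Detection((0, 0, 10, 10), "cup", "red"),
        Detection((20, 0, 30, 10), "bowl", "blue"),
    ]


seen_sizes: List[int] = []
data = lifelong_loop(
    [["img0", "img1"], ["img2"]],
    toy_detect,
    lambda triplets: seen_sizes.append(len(triplets)),
)
```

In this toy run the training callable sees a growing dataset each round, which mirrors the idea of accumulating synthetic data over time. A real system would replace `toy_detect` with an object detector and the lambda with an actual VG fine-tuning step.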