Humans interpret scenes by recognizing both the identities and the positions of the objects they observe. For a robot to perform tasks such as \enquote{pick and place}, understanding both what the objects are and where they are located is crucial. While the former has been extensively discussed in literature that uses large language models to enrich text descriptions, the latter remains underexplored. In this work, we introduce the \textit{Object-Centric Instruction Augmentation (OCI)} framework, which augments highly semantic and information-dense language instructions with position cues. We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism to integrate the vision-language features from an off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.