Humans interpret scenes by recognizing both the identities and the positions of the objects they observe. For a robot to perform tasks such as \enquote{pick and place}, understanding both what the objects are and where they are located is crucial. While the former has been extensively studied in work that uses large language models to enrich text descriptions, the latter remains underexplored. In this work, we introduce the \textit{Object-Centric Instruction Augmentation (OCI)} framework, which augments highly semantic and information-dense language instructions with position cues. We use a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism that integrates the vision-language features from an off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.