Although domestic service robots are expected to assist individuals who require support, they cannot yet interact smoothly with people through natural language. For example, given the instruction "Bring me a bottle from the kitchen," it is difficult for such robots to identify which bottle in an indoor environment is meant. Most conventional models have been trained on real-world datasets that are labor-intensive to collect, and they have not fully leveraged simulation data through a transfer learning framework. In this study, we propose a novel transfer learning approach for multimodal language understanding called Prototypical Contrastive Transfer Learning (PCTL), which uses a new contrastive loss called Dual ProtoNCE. We apply PCTL to the task of identifying target objects in domestic environments according to free-form natural language instructions. To validate PCTL, we built new real-world and simulation datasets. Our experiment demonstrated that PCTL outperformed existing methods. Specifically, PCTL achieved an accuracy of 78.1%, whereas simple fine-tuning achieved an accuracy of 73.4%.