This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions. Given an instruction such as "Move the bottle on the left side of the plate to the empty chair," the DSR is expected to identify the bottle and the chair from multiple candidates in the environment and carry the target object to the destination. Most existing multimodal language understanding methods are computationally impractical because they require inference over every combination of target object candidates and destination candidates. We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model. Our method is validated on a newly built dataset consisting of object manipulation instructions and semi-photorealistic images captured in a standard Embodied AI simulator. The results show that our method outperforms the baseline method in terms of language comprehension accuracy. Furthermore, we conduct physical experiments in which a DSR delivers standardized everyday objects in a standardized domestic environment as requested by instructions with referring expressions. The experimental results show that the object grasping and placing actions are achieved with success rates of more than 90%.
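The complexity argument can be illustrated with a small counting sketch. It is not the authors' implementation: the function names and the candidate counts are hypothetical, and it only assumes that a combination-based method runs one inference per (target, destination) pair, whereas predicting the target object and the destination individually needs one inference per candidate.

```python
def pairwise_inference_count(num_targets: int, num_destinations: int) -> int:
    """Inferences needed when every (target, destination) pair is scored jointly."""
    return num_targets * num_destinations


def individual_inference_count(num_targets: int, num_destinations: int) -> int:
    """Inferences needed when targets and destinations are scored separately."""
    return num_targets + num_destinations


if __name__ == "__main__":
    # Hypothetical candidate counts, chosen only for illustration.
    m, n = 10, 10
    print("pairwise:", pairwise_inference_count(m, n))
    print("individual:", individual_inference_count(m, n))
```

Under these assumptions the pairwise cost grows as M × N while the individual cost grows as M + N, so the gap widens as the number of candidates in the scene increases.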