Object goal visual navigation is a challenging task in which a robot must find a target object using only its visual observations, with targets limited to the classes pre-defined in the training stage. In real households, however, a robot may need to handle numerous target classes, and it is hard to cover all of them during training. To address this challenge, we study the zero-shot object goal visual navigation task, which aims to guide robots to targets belonging to novel classes without any training samples. To this end, we propose a novel zero-shot object navigation framework called the semantic similarity network (SSNet). Our framework uses the detection results and the cosine similarity between semantic word embeddings as input. This type of input is only weakly correlated with specific classes, so our framework can generalize its policy to novel classes. Extensive experiments on the AI2-THOR platform show that our model outperforms the baseline models in the zero-shot object navigation task, which demonstrates its generalization ability. Our code is available at: https://github.com/pioneer-innovation/Zero-Shot-Object-Navigation.
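To make the input described above concrete, the following is a minimal sketch of how detection results might be paired with the cosine similarity between word embeddings. It is not the SSNet implementation: the abstract does not specify the embedding model or the feature layout, so the function names (`cosine_similarity`, `build_similarity_input`), the 3-dimensional toy embeddings, and the six-column feature layout are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def build_similarity_input(target_vec, detections, embeddings):
    """Build one feature row per detection (hypothetical layout).

    detections: list of (class_name, (x1, y1, x2, y2), confidence)
    embeddings: dict mapping class_name -> word embedding vector
    Each row is [x1, y1, x2, y2, confidence, similarity_to_target].
    The class label itself is never placed in the row, only its
    similarity to the target, which is what keeps the input class-agnostic.
    """
    rows = []
    for name, bbox, conf in detections:
        sim = cosine_similarity(target_vec, embeddings[name])
        rows.append([*bbox, conf, sim])
    return np.array(rows, dtype=float).reshape(-1, 6)


# Toy 3-d embeddings (assumed values, not taken from any real embedding model).
embeddings = {
    "television": [0.9, 0.1, 0.0],
    "apple": [0.0, 0.1, 0.9],
}
# Embedding of a novel target class that was never seen in training (assumed).
target_vec = [0.85, 0.15, 0.05]

detections = [
    ("television", (0.10, 0.20, 0.40, 0.60), 0.92),
    ("apple", (0.55, 0.50, 0.65, 0.60), 0.81),
]

features = build_similarity_input(target_vec, detections, embeddings)
print(features.shape)
print(features[:, 5])  # similarity of each detected object to the target
```

In this sketch, a policy consuming `features` sees only where objects are and how semantically close they are to the target, so an unseen target class changes the similarity column but not the input format.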