As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of the specific instance during navigation. In this work, we propose to imitate the human behaviour of "getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging Habitat-Matterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://github.com/XiaohanLei/IEVE.
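The active switching described above can be sketched as a minimal mode-selection rule. This is an illustrative sketch only: the function name, the goal-matching score, and the threshold values are assumptions for exposition, not the paper's actual policy.

```python
from enum import Enum


class Mode(Enum):
    EXPLORATION = "exploration"    # search the environment for candidate objects
    VERIFICATION = "verification"  # move closer to a candidate to confirm its identity
    EXPLOITATION = "exploitation"  # navigate directly to the confirmed target


def select_mode(match_score: float, candidate_visible: bool,
                confirm_thresh: float = 0.8, candidate_thresh: float = 0.3) -> Mode:
    """Choose the agent's mode from a goal-image matching score in [0, 1].

    Thresholds and score semantics are hypothetical, chosen for illustration.
    """
    if match_score >= confirm_thresh:
        # Confident match with the goal image: head straight for the target.
        return Mode.EXPLOITATION
    if candidate_visible and match_score >= candidate_thresh:
        # A plausible but ambiguous candidate: get closer to confirm.
        return Mode.VERIFICATION
    # Nothing promising in view yet: keep exploring the environment.
    return Mode.EXPLORATION
```

The point of the verification branch is that a distant or oblique view rarely yields a decisive match, so the agent spends a few steps approaching the candidate before committing to or rejecting it.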