In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, such as Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Moreover, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task named Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering both modular and end-to-end agents.