Object-goal navigation is a challenging task that requires guiding an agent to specific objects based on first-person visual observations. The ability of an agent to comprehend its surroundings plays a crucial role in successful object finding. However, existing knowledge-graph-based navigators often rely on discrete categorical one-hot vectors and a vote-counting strategy to construct graph representations of scenes, which results in misalignment with visual images. To provide more accurate and coherent scene descriptions and address this misalignment, we propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation. Technically, our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural-language descriptions with visual perception. The integration of a continuous knowledge graph architecture and multimodal feature alignment endows the navigator with a remarkable zero-shot navigation capability. We extensively evaluate our method in the AI2-THOR simulator and conduct a series of experiments to demonstrate the effectiveness and efficiency of our navigator. Code is available at: https://github.com/nuoxu/AKGVP.