The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration relies heavily on the ability to perceive, understand, and reason over the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on a top-view map with complete spatial information. To fully unlock the MLLM's spatial reasoning potential from the top-view perspective, we propose Adaptive Visual Prompt Generation (AVPG), which adaptively constructs a semantically rich top-view map, enabling the agent to directly exploit the spatial information the map contains for thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and utilize target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D benchmarks demonstrate the superiority of TopV-Nav, e.g., absolute improvements of $+3.9\%$ SR and $+2.0\%$ SPL on HM3D.