Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, and it remains a significant challenge. Although recent advances in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. To address these issues, we propose ReMemNav, a hierarchical navigation framework that integrates panoramic semantic priors and episodic memory with VLMs. We incorporate the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism built on an episodic semantic buffer queue. This mechanism actively verifies target visibility and uses historical memory to correct decisions, which prevents deadlocks. For low-level action execution, ReMemNav uses depth masks to extract a set of feasible actions, from which the VLM selects the best one to map into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate (SR) and exploration efficiency, measured by success weighted by path length (SPL). Specifically, SR and SPL improve in absolute terms by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
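For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SR and SPL are commonly computed in embodied navigation benchmarks: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the length the agent actually travelled. The abstract does not describe the evaluation code, so the `Episode` record, its field names, and the sample values below are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record (field names are assumptions)."""
    success: bool          # True if the agent stopped at the target
    shortest_path: float   # shortest start-to-target path length, in metres
    agent_path: float      # length of the path the agent travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        # Failed episodes contribute 0; guard against zero-length paths.
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


if __name__ == "__main__":
    # Made-up episodes, not results from the paper.
    demo = [
        Episode(True, 5.0, 10.0),   # success, path twice as long as optimal
        Episode(True, 4.0, 4.0),    # success along the optimal path
        Episode(False, 6.0, 12.0),  # failure
        Episode(True, 3.0, 6.0),    # success, path twice as long as optimal
    ]
    print(f"SR  = {success_rate(demo):.2f}")
    print(f"SPL = {spl(demo):.2f}")
```

Because SPL is bounded above by SR, an agent can raise SR without raising SPL if its successful paths are long, which is why the two numbers are reported together.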