Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent Vision-Language Model (VLM)-based agents have demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world-model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by VLMs. It predicts the possible outcomes of decisions and builds memories that provide feedback to the policy module. To retain the predicted state of the environment, WMNav introduces an online-maintained Curiosity Value Map as part of the world model's memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav effectively mitigates model hallucination, making decisions based on the difference between the world model's plan and the actual observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
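To make the idea of an online-maintained Curiosity Value Map concrete, the sketch below shows one plausible minimal implementation. The abstract does not specify the data structure or update rule, so everything here (the grid representation, the `decay` factor, and the `update`/`best_frontier` methods) is a hypothetical illustration of a map whose per-cell scores are refreshed from the world model's predictions and queried by the navigation policy.

```python
import numpy as np

class CuriosityValueMap:
    """Hypothetical sketch of an online-maintained curiosity value map.

    Assumes a 2D grid over the floor plan whose cells store a scalar
    "curiosity" score that the world model refreshes after each
    predicted observation; the actual WMNav design may differ.
    """

    def __init__(self, grid_size=(64, 64), decay=0.95):
        self.values = np.zeros(grid_size)  # curiosity score per grid cell
        self.decay = decay                 # gradually forget stale predictions

    def update(self, cell, predicted_score):
        # Decay the whole map, then blend in the world model's predicted
        # value for the observed (or imagined) cell.
        self.values *= self.decay
        r, c = cell
        self.values[r, c] = max(self.values[r, c], predicted_score)

    def best_frontier(self):
        # The navigation policy queries the most promising cell to explore next.
        return np.unravel_index(np.argmax(self.values), self.values.shape)
```

A policy module could call `best_frontier()` after each `update` to dynamically reconfigure its exploration target, matching the feedback loop the abstract describes between the world model's memory and the navigation policy.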