Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent Vision-Language Model (VLM)-based agents have demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world-model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by VLMs. It predicts the possible outcomes of decisions and builds memories that provide feedback to the policy module. To retain the predicted state of the environment, WMNav introduces an online-maintained Curiosity Value Map as part of the world model's memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav effectively mitigates model hallucination, making decisions based on the difference between the world model's plan and the actual observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
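To make the idea of an online-maintained Curiosity Value Map concrete, the sketch below shows one plausible minimal implementation. The abstract does not specify the data structure or update rule, so everything here (the grid representation, the `decay` factor, and the `update`/`best_frontier` methods) is a hypothetical illustration of a map whose per-cell scores are refreshed from the world model's predictions and queried by the navigation policy.

```python
import numpy as np

class CuriosityValueMap:
    """Hypothetical sketch of an online-maintained curiosity value map.

    Assumes a 2D grid over the floor plan whose cells store a scalar
    "curiosity" score that the world model refreshes after each
    predicted observation; the actual WMNav design may differ.
    """

    def __init__(self, grid_size=(64, 64), decay=0.95):
        self.values = np.zeros(grid_size)  # curiosity score per grid cell
        self.decay = decay                 # gradually forget stale predictions

    def update(self, cell, predicted_score):
        # Decay the whole map, then blend in the world model's predicted
        # value for the observed (or imagined) cell.
        self.values *= self.decay
        r, c = cell
        self.values[r, c] = max(self.values[r, c], predicted_score)

    def best_frontier(self):
        # The navigation policy queries the most promising cell to explore next.
        return np.unravel_index(np.argmax(self.values), self.values.shape)
```

A policy module could call `best_frontier()` after each `update` to dynamically reconfigure its exploration target, matching the feedback loop the abstract describes between the world model's memory and the navigation policy.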