Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing, but it relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.