SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: end-to-end Vision-Language Model (VLM) policies, fine-tuned on navigation trajectories to predict actions directly, and zero-shot modular pipelines, which integrate a pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding, which undermines reliable planning, and require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness and is used for action generation. System 2 integrates a general MLLM with a mapping module, in which the MLLM plans waypoints from top-down views of the real-time 3D map together with streams of rendered path images. Both systems use different forms of spatial enhancement to strengthen the agent's sense of direction in VLN tasks, and they cooperate through fast-slow coordination to complete the navigation task. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and ablation studies demonstrate the effectiveness of each system and module.
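
The fast-slow coordination described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the SEDualVLN implementation: the abstract does not say how the two systems are scheduled, so the fixed replanning interval, the 2D grid world, and every name here (`DualSystemAgent`, `toy_planner`, `toy_policy`, `replan_every`) are hypothetical stand-ins. The slow planner stands in for the MLLM-based System 2 and the fast policy for the VLM-based System 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class DualSystemAgent:
    """Toy fast-slow loop; all interfaces are assumptions, not the paper's API."""
    plan_waypoint: Callable[[Point, Point], Point]  # slow planner (System 2 stand-in)
    act: Callable[[Point, Point], Point]            # fast policy (System 1 stand-in)
    replan_every: int = 4                           # assumed fixed replanning interval

    def run(self, start: Point, goal: Point, max_steps: int = 50) -> Tuple[List[Point], int]:
        pos: Point = start
        waypoint: Optional[Point] = None
        slow_calls = 0
        path = [pos]
        for step in range(max_steps):
            if pos == goal:
                break
            # Call the slow planner only when there is no waypoint yet, the
            # current waypoint has been reached, or the interval has elapsed.
            if waypoint is None or pos == waypoint or step % self.replan_every == 0:
                waypoint = self.plan_waypoint(pos, goal)
                slow_calls += 1
            # The fast policy produces one low-level move on every step.
            pos = self.act(pos, waypoint)
            path.append(pos)
        return path, slow_calls


def toy_planner(pos: Point, goal: Point, reach: int = 3) -> Point:
    """Pick a waypoint at most `reach` cells away, closing the x gap first."""
    dx = max(-reach, min(reach, goal[0] - pos[0]))
    if dx != 0:
        return (pos[0] + dx, pos[1])
    dy = max(-reach, min(reach, goal[1] - pos[1]))
    return (pos[0], pos[1] + dy)


def toy_policy(pos: Point, waypoint: Point) -> Point:
    """Move one grid cell toward the waypoint, x axis first."""
    def toward(a: int, b: int) -> int:
        return a + (b > a) - (b < a)

    if pos[0] != waypoint[0]:
        return (toward(pos[0], waypoint[0]), pos[1])
    return (pos[0], toward(pos[1], waypoint[1]))


agent = DualSystemAgent(plan_waypoint=toy_planner, act=toy_policy)
path, slow_calls = agent.run(start=(0, 0), goal=(5, 2))
print(path, slow_calls)
```

In this toy setting the slow planner is called fewer times than the fast policy acts. That is the general motivation for a fast-slow split, although the actual trigger conditions used by SEDualVLN are not given in the abstract.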

