Vision-and-language navigation (VLN) is a key research problem in Embodied AI, aiming to enable agents to navigate unseen environments by following linguistic instructions. In this field, generalization is a long-standing challenge, both to out-of-distribution scenes and from simulation to the real world. In this paper, we propose NaVid, a video-based large vision-language model (VLM), to mitigate this generalization gap. NaVid makes the first endeavor to show that VLMs can achieve state-of-the-art navigation performance without any maps, odometers, or depth inputs. Following human instructions, NaVid requires only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally avoids the problems introduced by odometer noise, as well as the Sim2Real gaps arising from map or depth inputs. Moreover, our video-based approach effectively encodes the robot's historical observations as spatio-temporal context for decision making and instruction following. We train NaVid on 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data samples. Extensive experiments show that NaVid achieves state-of-the-art performance in both simulated environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach takes the next step not only for navigation agents but also for this research field.