Prior Vision-Language-Action models face two critical limitations in navigation: diverse data is scarce because collection is labor-intensive, and static representations fail to capture temporal dynamics and physical laws. We propose NavDreamer, a video-based framework for 3D navigation that leverages generative video models as a universal interface between language instructions and navigation trajectories. Our central hypothesis is that video's capacity to encode spatiotemporal information and physical dynamics, combined with its internet-scale availability, enables strong zero-shot generalization in navigation. To mitigate the stochasticity of generative predictions, we introduce a sampling-based optimization method that uses a VLM to score and select candidate trajectories. An inverse dynamics model then decodes executable waypoints from the generated video plans. To systematically evaluate this paradigm across several video model backbones, we introduce a comprehensive benchmark covering object navigation, precise navigation, spatial grounding, language control, and scene reasoning. Extensive experiments demonstrate robust generalization to novel objects and unseen environments, and ablation studies reveal that the high-level decision-making nature of navigation makes it particularly well suited to video-based planning.
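The planning loop described above (sample several candidate video plans, score them with a VLM, select the best, and decode it into waypoints with an inverse dynamics model) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names (`generate_video_plan`, `vlm_score`, `inverse_dynamics`) are hypothetical placeholders, and the model calls are stubbed with toy logic.

```python
import random

def generate_video_plan(instruction, observation, seed):
    """Placeholder for one stochastic rollout of a generative video model.
    Returns a toy 'video' as a list of frame tokens."""
    return [("frame", seed, t) for t in range(4)]

def vlm_score(instruction, video):
    """Placeholder for VLM-based trajectory scoring.
    A real scorer would rate how well the video follows the instruction."""
    return random.random()  # stand-in for a learned scalar score

def inverse_dynamics(video):
    """Placeholder inverse dynamics model: maps consecutive frame pairs
    to executable navigation waypoints."""
    return [(f1, f2) for f1, f2 in zip(video, video[1:])]

def plan(instruction, observation, num_samples=8):
    """Sampling-based optimization: sample, score, select, decode."""
    candidates = [generate_video_plan(instruction, observation, s)
                  for s in range(num_samples)]
    scored = [(vlm_score(instruction, v), v) for v in candidates]
    _, best = max(scored, key=lambda s: s[0])  # pick highest-scored plan
    return inverse_dynamics(best)

waypoints = plan("navigate to the red chair", observation=None)
```

The key design choice is that stochasticity in the video model is treated as a search resource rather than a defect: drawing multiple samples and filtering with a VLM trades extra inference compute for more reliable plans.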