Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings and do not explicitly model action-grounded visual dynamics. Without awareness of how actions transform subsequent visual observations, agents cannot plan actions rationally, which leads to unstable behavior, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce \textsc{NaVIDA} (\textbf{Nav}igation with \textbf{I}nverse \textbf{D}ynamics \textbf{A}ugmentation), a lightweight VLN framework that incorporates inverse dynamics supervision (IDS) as an explicit objective to embed action-grounded visual dynamics into policy learning. By jointly optimizing this visual dynamics objective with instruction-conditioned action prediction in a shared representation and action space, \textsc{NaVIDA} provides additional structured supervision that regularizes learning and leads to more stable and consistent navigation. To structure this supervision and extend the effective planning range, \textsc{NaVIDA} employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. Extensive experiments show that \textsc{NaVIDA} outperforms state-of-the-art methods in navigation performance while using fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach.