Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risk. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, which gives it explicit context on the relationship between its predictions and the robot's actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric relationships between observations. The resulting model, VISTA, generalizes robustly to out-of-distribution environments: in zero-shot, real-world deployment in outdoor, forest, and office settings, it achieves 100% goal prediction accuracy and crosses an average of 95% of checkpoints, demonstrating consistent path following in unseen environments.
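The scaling vulnerability described above can be illustrated with a minimal sketch. It assumes a uniform scale factor that converts normalized waypoints to metres; the function names, waypoint values, and scale factors are hypothetical and are not taken from VISTA, whose exact normalization scheme is not specified here.

```python
import math

def denormalize(waypoints, scale):
    """Convert normalized (x, y) waypoints to metres with a uniform scale.

    Assumption: a single scalar maps normalized units to metres.
    """
    return [(x * scale, y * scale) for x, y in waypoints]

def path_length(waypoints):
    """Length of the polyline from the robot origin through the waypoints."""
    points = [(0.0, 0.0)] + list(waypoints)
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def max_lateral_offset(waypoints):
    """Largest sideways (y) excursion from the robot's heading axis."""
    return max(abs(y) for _, y in waypoints)

# Hypothetical normalized prediction: a gentle left curve.
normalized = [(1.0, 0.0), (2.0, 0.5), (3.0, 1.5)]

# The same prediction executed under three hypothetical scale factors.
for scale in (0.25, 0.5, 1.0):
    physical = denormalize(normalized, scale)
    print(
        f"scale={scale}: "
        f"length={path_length(physical):.2f} m, "
        f"lateral offset={max_lateral_offset(physical):.2f} m"
    )
```

Under these assumptions, the same normalized output travels a different distance and swings a different lateral distance depending on the scale applied at deployment, so a curve that clears an obstacle at one scale may not clear it at another.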