Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion- and transformer-based models collide frequently, indicating limited geometric understanding; (b) models fail to discriminate between distinct locations that are perceptually similar yet semantically different, causing goal-prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.