Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints: a robot vacuum cannot traverse stairs, for example, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described by physical dimensions, mobility capabilities, and environmental interaction abilities, and provides 45 real-world indoor scenes, 473 navigation tasks, and 2,365 QA pairs to test whether VLMs can reason about traversing indoor environments under those capabilities. We evaluate 13 modern VLMs and find that navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav.