Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic content must be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping the semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.