VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.
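The frontier-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 2D occupancy grid with hypothetical cell labels (`UNKNOWN`, `FREE`, `OCCUPIED`) and a value map of the same shape that has already been filled in, for example with scores from a vision-language model. The function names `find_frontiers` and `best_frontier` are illustrative only, and a frontier is simplified here to any free cell with a 4-connected unknown neighbor.

```python
import numpy as np

# Hypothetical cell labels for a 2D occupancy grid (assumed, not from the paper).
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def find_frontiers(occupancy):
    """Return (row, col) indices of free cells with a 4-connected unknown neighbor."""
    unknown = occupancy == UNKNOWN
    # Pad with False so cells on the grid border have no out-of-map neighbors.
    padded = np.pad(unknown, 1, constant_values=False)
    near_unknown = (
        padded[:-2, 1:-1]    # neighbor above
        | padded[2:, 1:-1]   # neighbor below
        | padded[1:-1, :-2]  # neighbor to the left
        | padded[1:-1, 2:]   # neighbor to the right
    )
    return np.argwhere((occupancy == FREE) & near_unknown)


def best_frontier(occupancy, value_map):
    """Pick the frontier cell with the highest value, or None if no frontier exists."""
    frontiers = find_frontiers(occupancy)
    if len(frontiers) == 0:
        return None
    scores = value_map[frontiers[:, 0], frontiers[:, 1]]
    row, col = frontiers[np.argmax(scores)]
    return int(row), int(col)


# Toy example: the right-hand column is partly unexplored.
occupancy = np.array([
    [FREE, FREE,     UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE,     FREE],
])
value_map = np.zeros(occupancy.shape)
value_map[0, 1] = 0.2  # assumed semantic score near the first frontier
value_map[2, 2] = 0.7  # assumed semantic score near the second frontier

target = best_frontier(occupancy, value_map)
```

In this toy grid the two frontier cells are `(0, 1)` and `(2, 2)`, and the sketch selects the one with the larger value. A real system would also need to group frontier cells into regions and plan a path to the chosen one, which is omitted here.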
