Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment and therefore lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be built autonomously from a robot's video feed using standard exploration approaches, and they enable natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can (i) be used to translate natural language commands into a sequence of open-vocabulary navigation goals localized directly in the map (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and TV" or "three meters to the right of the chair"), and (ii) be shared among multiple robots with different embodiments to generate new obstacle maps on the fly from a list of obstacle categories. Extensive experiments in simulated and real-world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.
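As a rough illustration of what natural language indexing of such a map could look like, the sketch below labels each cell of a top-down grid by cosine similarity between its feature vector and a set of category text embeddings. It then derives per-embodiment obstacle masks from a list of obstacle categories and computes a spatial goal "in between" two landmarks. Everything here is an assumption made for illustration: the function names, the 5x5 grid, the category list, and the one-hot vectors standing in for features that a real system would take from a pretrained visual-language model. It is not the authors' implementation.

```python
import numpy as np


def normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def index_map(feature_map, text_embeddings):
    """Label each grid cell with the index of its most similar text embedding.

    feature_map:     (H, W, D) array of per-cell visual-language features.
    text_embeddings: (C, D) array, one row per category name.
    Returns an (H, W) integer array of category indices.
    """
    scores = normalize(feature_map) @ normalize(text_embeddings).T  # (H, W, C)
    return scores.argmax(axis=-1)


def obstacle_map(labels, categories, obstacle_names):
    """Boolean (H, W) mask of cells whose category is in obstacle_names."""
    obstacle_ids = [categories.index(name) for name in obstacle_names]
    return np.isin(labels, obstacle_ids)


def goal_between(labels, categories, name_a, name_b):
    """Grid coordinates (row, col) midway between two category centroids."""
    centroids = []
    for name in (name_a, name_b):
        cells = np.argwhere(labels == categories.index(name))
        if cells.size == 0:
            raise ValueError(f"category {name!r} not found in map")
        centroids.append(cells.mean(axis=0))
    return (centroids[0] + centroids[1]) / 2.0


# Toy data: hypothetical one-hot "embeddings" standing in for model outputs.
categories = ["floor", "sofa", "tv", "table"]
text_embeddings = np.eye(len(categories))

# 5x5 map that is all floor, with a sofa, a TV, and a table placed by hand.
feature_map = np.tile(text_embeddings[0], (5, 5, 1))
feature_map[2, 0] = text_embeddings[1]  # sofa
feature_map[2, 4] = text_embeddings[2]  # tv
feature_map[0, 2] = text_embeddings[3]  # table

labels = index_map(feature_map, text_embeddings)

# Hypothetical obstacle lists for two different embodiments.
ground_obstacles = obstacle_map(labels, categories, ["sofa", "tv", "table"])
aerial_obstacles = obstacle_map(labels, categories, ["tv"])

goal = goal_between(labels, categories, "sofa", "tv")
print(labels)
print(ground_obstacles.sum(), aerial_obstacles.sum(), goal)
```

Because the obstacle mask is computed from the shared label grid and a category list, the same map can serve different robots by changing only the list. In a full pipeline, an LLM would be responsible for turning a command into calls such as `goal_between`; that step is omitted here.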