Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their potential for place recognition remains underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and to reason about the best candidate based on these descriptions. Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe our work can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
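To make the retrieve-then-reason pipeline described above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical `vfm_embed` that wraps any off-the-shelf VFM encoder (e.g., a DINOv2 backbone) to produce one global descriptor per image, and a hypothetical `mllm_chat` callable that sends images plus a text prompt to a multimodal LLM and returns its text response; both names are illustrative assumptions, not from the paper.

```python
# Minimal sketch of the two-stage VPR pipeline: VFM-based retrieval
# followed by MLLM pairwise reasoning. `vfm_embed` and `mllm_chat`
# are hypothetical stand-ins for an off-the-shelf VFM encoder and an
# MLLM chat interface, respectively.
import numpy as np


def retrieve_candidates(query_feat, db_feats, top_k=3):
    """Stage 1: propose top-k candidate places by cosine similarity
    between the query descriptor and the database descriptors."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:top_k], sims


def rerank_with_mllm(query_img, candidate_imgs, mllm_chat):
    """Stage 2: prompt the MLLM to describe query-candidate differences
    pairwise, then reason over all descriptions to pick the best match."""
    descriptions = []
    for cand in candidate_imgs:
        prompt = ("Describe the differences between these two images of "
                  "places, focusing on stable landmarks and scene layout.")
        descriptions.append(mllm_chat(images=[query_img, cand], text=prompt))

    summary = "\n".join(f"Candidate {i}: {d}"
                        for i, d in enumerate(descriptions))
    decision_prompt = (
        "Based on the pairwise difference descriptions below, decide which "
        "candidate depicts the same place as the query image. Answer with "
        "the candidate index and a brief justification.\n" + summary)
    return mllm_chat(images=[query_img], text=decision_prompt)
```

In use, one would embed the current observation and all reference images with the VFM (e.g., `query_feat = vfm_embed(query_img)`), call `retrieve_candidates` to shortlist a few locations, and then call `rerank_with_mllm` on the corresponding images to obtain the final decision; no VPR-specific training is involved at either stage.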