Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of whole images. We propose to use open-set image segmentation to decompose an image into "meaningful" entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs, each connecting a segment with its neighboring segments, dubbed SuperSegments. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than typical whole-image retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to "revisit anything" by evaluating it on an object instance retrieval task, bridging two disparate areas of research, visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything.
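The SuperSegment idea described above can be sketched as follows: compute a VLAD-style residual table once per segment, then form each SuperSegment descriptor by summing the pre-computed tables of a segment and its neighbors. This factorization avoids re-aggregating raw features for every overlapping subgraph. This is a minimal illustrative sketch, not the released SegVLAD implementation; the function names, the single-feature-per-segment simplification, and the neighbor sets are all assumptions for exposition.

```python
import numpy as np

def per_segment_residuals(seg_feats, centroids):
    """Assign each segment feature (hypothetically, one D-dim feature per
    segment) to its nearest VLAD centroid and store its residual in a
    per-segment (K, D) table. Computed once per image."""
    S, D = seg_feats.shape
    K = centroids.shape[0]
    d2 = ((seg_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (S, K)
    assign = d2.argmin(axis=1)                                           # (S,)
    R = np.zeros((S, K, D))
    R[np.arange(S), assign] = seg_feats - centroids[assign]
    return R

def supersegment_descriptors(R, neighbors):
    """SuperSegment descriptor for segment s = L2-normalized sum of the
    pre-computed residual tables of s and its neighboring segments.
    Because aggregation is a plain sum, overlapping subgraphs reuse the
    per-segment tables instead of re-encoding features (the factorization)."""
    descs = []
    for s, nbrs in enumerate(neighbors):
        agg = R[[s] + sorted(nbrs)].sum(axis=0).reshape(-1)  # (K * D,)
        descs.append(agg / (np.linalg.norm(agg) + 1e-12))
    return np.stack(descs)
```

Retrieval would then index all SuperSegment descriptors (rather than one whole-image vector) and match a query's partial representations against them.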