Visual place recognition is a challenging task in computer vision, autonomous robotics, and autonomous vehicles. It aims to identify a location from visual inputs. Contemporary methods employ convolutional neural networks and use every region of the image for recognition. However, dynamic and distracting elements in the image can reduce recognition performance, so it is beneficial to focus on the task-relevant regions of the image. In this paper, we present PlaceFormer, a novel transformer-based approach for visual place recognition. PlaceFormer uses patch tokens from the transformer to create global image descriptors, which are then used for image retrieval. To re-rank the retrieved images, PlaceFormer merges the patch tokens to form multi-scale patches and uses the transformer's self-attention mechanism to select the patches that correspond to task-relevant areas of the image. The selected patches undergo geometric verification, which yields a spatial similarity score for each patch size. These per-scale scores are fused into a final similarity score, which is used to re-rank the images initially retrieved with the global descriptors. Extensive experiments on benchmark datasets demonstrate that PlaceFormer outperforms several state-of-the-art methods in both accuracy and computational efficiency, requiring less time and memory.
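The re-ranking stage described above can be illustrated with a minimal sketch. It is not the PlaceFormer implementation. It assumes patch tokens arranged on a small grid with one attention value per token, merges them by average pooling, keeps the most-attended patches, and fuses per-scale scores with equal weights. The abstract does not specify the geometric verification procedure, so a count of mutual nearest-neighbour matches is used here as a simple stand-in. All function names, window sizes, weights, and the toy data are hypothetical.

```python
import math

def merge_tokens(grid, attn, size):
    """Average-pool size x size windows (stride 1) of a token grid.

    Returns a list of (descriptor, attention) pairs, one per window."""
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    patches = []
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            cells = [(r + i, c + j) for i in range(size) for j in range(size)]
            desc = [sum(grid[i][j][d] for i, j in cells) / len(cells) for d in range(dim)]
            score = sum(attn[i][j] for i, j in cells) / len(cells)
            patches.append((desc, score))
    return patches

def select_top(patches, keep):
    """Keep the descriptors of the `keep` patches with the highest attention."""
    ranked = sorted(patches, key=lambda p: p[1], reverse=True)
    return [desc for desc, _ in ranked[:keep]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mutual_matches(query, cand):
    """Count mutual nearest neighbours (assumed stand-in for geometric verification)."""
    if not query or not cand:
        return 0
    q2c = [max(range(len(cand)), key=lambda j: cosine(q, cand[j])) for q in query]
    c2q = [max(range(len(query)), key=lambda i: cosine(query[i], c)) for c in cand]
    return sum(1 for i, j in enumerate(q2c) if c2q[j] == i)

def fused_score(query_img, cand_img, sizes=(1, 2), weights=(0.5, 0.5), keep=4):
    """Weighted sum of per-scale match ratios; sizes and weights are assumptions."""
    total = 0.0
    for size, weight in zip(sizes, weights):
        q = select_top(merge_tokens(*query_img, size), keep)
        c = select_top(merge_tokens(*cand_img, size), keep)
        total += weight * mutual_matches(q, c) / keep
    return total

def rerank(query_img, candidates):
    """Order candidate names by fused score, best first."""
    return sorted(candidates, key=lambda name: fused_score(query_img, candidates[name]),
                  reverse=True)

# Toy data: a 3x3 grid of 2-D tokens with distinct directions, plus attention values.
angles = [math.radians(10 * k) for k in range(9)]
tokens = [[[math.cos(angles[3 * r + c]), math.sin(angles[3 * r + c])]
           for c in range(3)] for r in range(3)]
attention = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3], [0.6, 0.4, 0.5]]
query = (tokens, attention)
same_place = (tokens, attention)
other_place = ([[[1.0, 0.0]] * 3 for _ in range(3)], [[0.5] * 3 for _ in range(3)])

# The candidate that shares the query's tokens ranks first.
ranking = rerank(query, {"other_place": other_place, "same_place": same_place})
```

In a real system the candidates passed to `rerank` would be the top results of the global-descriptor retrieval step, and the match count would be replaced by a proper spatial consistency check over patch positions.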