Visual place recognition (VPR) is a challenging task with a wide range of applications, including robot navigation and self-driving vehicles. It is particularly difficult in complex scenes, where duplicate regions and insufficient attention to small objects lead to recognition deviations. In this paper, we present ClusVPR, a novel approach that addresses both the redundant information in duplicate regions and the representation of small objects. Unlike existing methods that rely on Convolutional Neural Networks (CNNs) for feature map generation, ClusVPR introduces a new paradigm called the Clustering-based Weighted Transformer Network (CWTNet). CWTNet uses clustering-based weighted feature maps and integrates global dependencies to address the visual deviations encountered in large-scale VPR problems. We also introduce the optimized-VLAD (OptLAD) layer, which aggregates the information obtained from scale-wise image patches while significantly reducing the number of parameters and improving model efficiency. Additionally, our pyramid self-supervised strategy extracts representative and diverse information from scale-wise image patches rather than from entire images. Extensive experiments on four VPR datasets show that our model outperforms existing models while being less complex.
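As background for the OptLAD layer, the following is a minimal sketch of classic VLAD (Vector of Locally Aggregated Descriptors) aggregation with hard assignment, the technique that OptLAD is named after. It is not the OptLAD layer itself, whose details are not given in this abstract. The function name `vlad`, the toy cluster centers, and the toy patch descriptors are illustrative assumptions.

```python
import math


def vlad(descriptors, centers):
    """Aggregate local descriptors into one VLAD vector (hard assignment).

    Illustrative sketch of classic VLAD only; this is NOT the OptLAD layer.
    """
    dim = len(centers[0])
    # One residual accumulator per cluster center.
    residuals = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Assign the descriptor to its nearest center (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])),
        )
        # Accumulate the residual between the descriptor and its center.
        for j in range(dim):
            residuals[k][j] += d[j] - centers[k][j]
    # Flatten to a single K*D vector and L2-normalize it.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example (hypothetical data): four 2-D patch descriptors, two centers.
centers = [[0.0, 0.0], [10.0, 10.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0], [10.0, 12.0]]
descriptor = vlad(patches, centers)
```

The output length is the number of centers times the descriptor dimension. In learned variants of VLAD, the parameter count of the aggregation layer depends on these same two quantities, which is one place where a parameter-reducing design such as OptLAD could plausibly act.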