Visual place recognition (VPR) is a challenging task with applications ranging from robot navigation to self-driving vehicles. It is particularly difficult in complex scenes, where duplicate regions contribute redundant information and small objects receive too little attention, both of which cause recognition deviations. In this paper, we present ClusVPR, a novel approach that addresses these two issues: redundant information in duplicate regions and the representation of small objects. Unlike existing methods that rely on Convolutional Neural Networks (CNNs) for feature map generation, ClusVPR introduces a new paradigm called the Clustering-based Weighted Transformer Network (CWTNet). CWTNet uses clustering-based weighted feature maps and integrates global dependencies to address the visual deviations encountered in large-scale VPR problems. We also introduce the optimized-VLAD (OptLAD) layer, which is designed to aggregate the information obtained from scale-wise image patches and which significantly reduces the number of parameters and improves model efficiency. Additionally, our pyramid self-supervised strategy extracts representative and diverse information from scale-wise image patches rather than from entire images, which is crucial for VPR. Extensive experiments on four VPR datasets show that our model outperforms existing models while being less complex.
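For readers unfamiliar with VLAD, the aggregation that the OptLAD layer is described as optimizing, the following is a minimal sketch of classic hard-assignment VLAD applied to descriptors from several patch scales. It is not the OptLAD layer itself, whose details the abstract does not give. The function names `vlad_aggregate` and `multiscale_vlad`, the fixed centroids, and the choice to concatenate one VLAD vector per scale are all assumptions made for illustration.

```python
import math


def vlad_aggregate(descriptors, centroids):
    """Classic hard-assignment VLAD (illustrative; not the paper's OptLAD)."""
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Assign the descriptor to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
        )
        # Accumulate the residual between the descriptor and that centroid.
        for j in range(d):
            residuals[nearest][j] += x[j] - centroids[nearest][j]
    # Flatten the K x D residual matrix and L2-normalize it.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def multiscale_vlad(descriptors_per_scale, centroids):
    """Hypothetical pooling choice: concatenate one VLAD vector per patch scale."""
    out = []
    for descriptors in descriptors_per_scale:
        out.extend(vlad_aggregate(descriptors, centroids))
    return out


# Toy example: 2 centroids in 2-D, 3 local descriptors at each of 2 scales.
centroids = [[0.0, 0.0], [10.0, 10.0]]
scale_a = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
scale_b = [[0.0, 2.0], [9.0, 10.0], [10.0, 12.0]]
single = vlad_aggregate(scale_a, centroids)
combined = multiscale_vlad([scale_a, scale_b], centroids)
```

Each per-scale vector has length K x D (here 4), so the concatenated descriptor grows linearly with the number of scales. A learned layer would typically replace the hard assignment with soft, trainable assignments; how OptLAD reduces the parameter count is not specified in the abstract.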