In visual place recognition, accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. In this paper, we introduce a new technique, called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that employ self-attention and generate the queries directly from the input features, BoQ employs distinct learnable global queries that probe the input features via cross-attention, ensuring consistent information aggregation. In addition, our technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks, where it consistently outperforms current state-of-the-art techniques, including NetVLAD, MixVPR, and EigenPlaces. Moreover, as a one-stage (global retrieval) technique, BoQ surpasses two-stage retrieval methods such as Patch-NetVLAD, TransVPR, and R2Former, while being orders of magnitude faster and more efficient. The code and model weights are publicly available at https://github.com/amaralibey/Bag-of-Queries.
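To make the core idea concrete, the following is a minimal NumPy sketch of cross-attention with a fixed bag of learnable queries, as described above: the queries are input-independent parameters that attend over the backbone's output features to aggregate a global descriptor. All shapes, names, and the single-head formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, dim = 8, 64   # hypothetical: size of the learned query bag
num_features = 100         # hypothetical: flattened backbone feature tokens

# Q is a learned parameter (the "bag of queries"): it does NOT depend on
# the input image, unlike self-attention where queries come from X itself.
Q = rng.standard_normal((num_queries, dim))
# X stands in for the backbone (CNN or ViT) output features of one image.
X = rng.standard_normal((num_features, dim))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: queries from the learned bag, keys/values from the input.
attn = softmax(Q @ X.T / np.sqrt(dim))  # (num_queries, num_features)
out = attn @ X                          # (num_queries, dim)

# Concatenating the per-query outputs yields one global descriptor per image.
descriptor = out.flatten()              # (num_queries * dim,) = (512,)
print(descriptor.shape)
```

Because the attention weights `attn` show which image regions each global query responds to, they can be visualized directly, which is one way to read the abstract's claim of an interpretable attention mechanism.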