Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era; that is, robust global descriptors can be obtained with the backbone alone. Specifically, we introduce a set of learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens are then jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information from the patch tokens into the aggregation tokens. Finally, we take only these aggregation tokens from the final output and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert the additional tokens, as well as how to initialize them, remain open issues worthy of further exploration. To this end, we also propose an optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
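The implicit aggregation mechanism described above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the authors' released implementation: the module name, the number of aggregation tokens, and the use of `nn.TransformerEncoder` as a stand-in for the backbone's later blocks are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitAggregation(nn.Module):
    """Sketch: learnable aggregation tokens prepended to patch tokens.

    The tokens interact with patch tokens through ordinary self-attention;
    at the output, only the aggregation tokens are kept and concatenated
    into a single global descriptor.
    """

    def __init__(self, dim=384, num_agg=4, depth=2, heads=6):
        super().__init__()
        # Learnable aggregation tokens, shared across all input images.
        self.agg_tokens = nn.Parameter(torch.zeros(1, num_agg, dim))
        nn.init.trunc_normal_(self.agg_tokens, std=0.02)
        # Stand-in for the backbone's remaining transformer blocks.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_agg = num_agg

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) from an earlier backbone stage.
        b = patch_tokens.shape[0]
        agg = self.agg_tokens.expand(b, -1, -1)
        x = torch.cat([agg, patch_tokens], dim=1)   # prepend tokens
        x = self.blocks(x)                          # joint self-attention
        # Slice off the aggregation tokens and flatten them into one
        # L2-normalized global descriptor of size num_agg * dim.
        desc = x[:, :self.num_agg].reshape(b, -1)
        return F.normalize(desc, dim=-1)
```

Because the aggregation happens inside ordinary transformer blocks, no extra aggregation head (e.g., a NetVLAD layer) is needed; the descriptor dimensionality is simply `num_agg * dim`.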