Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.
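
The abstract describes the Cross-query Similarity (CS) step only at a high level, so the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that one set of learned queries pools backbone patch features through single-head cross-attention, that a second set of learned vectors acts as the reference codebook, and that the descriptor is the flattened, L2-normalized cosine-similarity matrix between the two. The names `query_pool`, `cross_query_similarity_descriptor`, `feature_queries`, and `codebook`, as well as all shapes, are hypothetical, and random arrays stand in for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def query_pool(queries, patches):
    # Assumed pooling step: single-head cross-attention in which each
    # query attends over all patch features.
    # queries: (Q, D), patches: (N, D) -> pooled features: (Q, D)
    scores = queries @ patches.T / np.sqrt(patches.shape[1])
    attn = softmax(scores, axis=-1)  # (Q, N), each row sums to 1
    return attn @ patches

def cross_query_similarity_descriptor(patches, feature_queries, codebook):
    # patches: (N, D) backbone features for one image
    # feature_queries: (Qf, D) hypothetical learned queries
    # codebook: (Qr, D) hypothetical learned reference codebook
    feats = l2_normalize(query_pool(feature_queries, patches))  # (Qf, D)
    refs = l2_normalize(codebook)                                # (Qr, D)
    cs = feats @ refs.T          # (Qf, Qr) cosine similarities
    return l2_normalize(cs.reshape(-1))  # flat descriptor of length Qf * Qr

# Random stand-ins for trained parameters and backbone output.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))
feature_queries = rng.normal(size=(8, 64))
codebook = rng.normal(size=(16, 64))
desc = cross_query_similarity_descriptor(patches, feature_queries, codebook)
```

In this sketch the descriptor length is the product of the two query counts (8 × 16 = 128 here) and does not depend on the feature dimension, so the number of codebook entries can be changed without changing the backbone width. Whether the actual method scales in this way is not stated in the abstract.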