Large-scale visual localization systems continue to rely on 3D point clouds built from image collections using structure-from-motion. While the 3D points in these models are represented using local image features, directly matching a query image's local features against the point cloud is challenging due to the scale of the nearest-neighbor search problem. Many recent approaches to visual localization have thus proposed a hybrid method, where a global (per-image) embedding is first used to retrieve a small subset of database images, and the local features of the query are then matched only against those. It seems to have become a common belief that global embeddings are critical for this image retrieval step in visual localization, despite the significant downside of having to compute two feature types for each query image. In this paper, we take a step back from this assumption and propose Constrained Approximate Nearest Neighbors (CANN), a joint solution of k-nearest-neighbor search across both geometry and appearance space using only local features. We first derive the theoretical foundation for k-nearest-neighbor retrieval across multiple metrics and then showcase how CANN improves visual localization. Our experiments on public localization benchmarks demonstrate that our method significantly outperforms both state-of-the-art global feature-based retrieval and approaches using local feature aggregation schemes. Moreover, on these datasets it is an order of magnitude faster in both indexing and query time than feature aggregation schemes. Code will be released.
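To make the idea of retrieval from local features alone more concrete, the following is a minimal sketch of a generic voting scheme: each query descriptor looks up its k nearest database descriptors, and database images are ranked by the accumulated similarity. It is not the CANN algorithm, which additionally constrains the search jointly over geometry and appearance; the function name `rank_images_by_local_features`, the similarity `1 / (1 + d^2)`, and the brute-force exact search are all assumptions made for illustration.

```python
import numpy as np


def rank_images_by_local_features(query_desc, db_desc, db_image_ids, k=3):
    """Rank database images by voting over local-descriptor nearest neighbors.

    Illustrative only: exact brute-force search, hypothetical scoring.
    query_desc:   (Q, D) array of query local descriptors.
    db_desc:      (N, D) array of database local descriptors.
    db_image_ids: (N,) array mapping each database descriptor to its image.
    """
    # Squared Euclidean distance between every query and database descriptor.
    d2 = (
        (query_desc ** 2).sum(axis=1)[:, None]
        - 2.0 * query_desc @ db_desc.T
        + (db_desc ** 2).sum(axis=1)[None, :]
    )
    k = min(k, db_desc.shape[0])
    # Indices of the k nearest database descriptors per query descriptor
    # (unordered within the k; ordering is not needed for voting).
    nn_idx = np.argpartition(d2, k - 1, axis=1)[:, :k]

    scores = {}
    for row, neighbors in enumerate(nn_idx):
        # Each query feature contributes at most one vote per image:
        # the similarity of its best match in that image.
        best = {}
        for j in neighbors:
            img = int(db_image_ids[j])
            sim = 1.0 / (1.0 + max(float(d2[row, j]), 0.0))
            best[img] = max(best.get(img, 0.0), sim)
        for img, sim in best.items():
            scores[img] = scores.get(img, 0.0) + sim

    # Image ids sorted from highest to lowest accumulated score.
    return sorted(scores, key=scores.get, reverse=True)
```

Called with a query descriptor array, a stacked database descriptor array, and the per-descriptor image ids, the function returns image ids ordered by score; the top few would then be passed to local feature matching and pose estimation. At realistic scale, the exact distance matrix would be replaced by an approximate nearest-neighbor index.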