Large-scale visual localization systems continue to rely on 3D point clouds built from image collections using structure-from-motion. While the 3D points in these models are represented using local image features, directly matching a query image's local features against the point cloud is challenging due to the scale of the nearest-neighbor search problem. Many recent approaches to visual localization have thus proposed a hybrid method: a global (per-image) embedding is first used to retrieve a small subset of database images, and the local features of the query are then matched only against those images. It seems to have become a common belief that global embeddings are critical for this image retrieval step in visual localization, despite the significant downside of having to compute two feature types for each query image. In this paper, we take a step back from this assumption and propose Constrained Approximate Nearest Neighbors (CANN), a joint solution of k-nearest-neighbors across both the geometry and appearance spaces using only local features. We first derive the theoretical foundation for k-nearest-neighbor retrieval across multiple metrics and then showcase how CANN improves visual localization. Our experiments on public localization benchmarks demonstrate that our method significantly outperforms both state-of-the-art global feature-based retrieval and approaches using local feature aggregation schemes. Moreover, it is an order of magnitude faster in both index and query time than feature aggregation schemes for these datasets. Code will be released.
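
To make the hybrid baseline described above concrete, the following is a minimal sketch of retrieve-then-match, not of CANN itself. It assumes brute-force Euclidean search with NumPy, and all names (`retrieve_top_k`, `match_local`, `hybrid_localize`, `max_dist`) and the toy data are hypothetical choices made for illustration rather than anything taken from the paper.

```python
import numpy as np


def retrieve_top_k(query_global, db_globals, k):
    """Indices of the k database images whose global embedding is closest to the query."""
    dists = np.linalg.norm(db_globals - query_global, axis=1)
    return np.argsort(dists)[:k]


def match_local(query_locals, db_locals, max_dist):
    """Count query descriptors whose nearest database descriptor lies within max_dist."""
    # Pairwise Euclidean distances, shape (num_query, num_db).
    diffs = query_locals[:, None, :] - db_locals[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return int((dists.min(axis=1) <= max_dist).sum())


def hybrid_localize(query_global, query_locals, db_globals, db_locals_per_image,
                    k=2, max_dist=0.5):
    """Stage 1: shortlist images by global embedding. Stage 2: match local features."""
    shortlist = retrieve_top_k(query_global, db_globals, k)
    return {int(i): match_local(query_locals, db_locals_per_image[i], max_dist)
            for i in shortlist}


# Toy data (hypothetical): three database images with 2-D embeddings and descriptors.
db_globals = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
db_locals_per_image = [
    np.array([[0.0, 0.1], [1.0, 1.2]]),
    np.array([[3.0, 3.0]]),
    np.array([[0.0, 0.0], [1.0, 1.0]]),  # perfect local match, but far in global space
]
query_global = np.array([0.1, 0.0])
query_locals = np.array([[0.0, 0.0], [1.0, 1.0]])

scores = hybrid_localize(query_global, query_locals, db_globals, db_locals_per_image)
print(scores)
```

In this toy setup the third database image matches the query's local features exactly but is never examined, because its global embedding keeps it out of the shortlist. This illustrates the dependence on the global embedding that the abstract questions, in addition to the cost of computing two feature types per query.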