We propose a compact pipeline that unifies all the steps of visual localization: image retrieval, candidate re-ranking with initial pose estimation, and camera pose refinement. Our key assumption is that the deep features used for these individual tasks share common characteristics, so they can be reused across every stage of the pipeline. DRAN (Deep Retrieval and image Alignment Network) extracts global descriptors for efficient image retrieval, uses intermediate hierarchical features to re-rank the retrieval list and produce an initial pose guess, and finally refines that pose through a feature-metric optimization based on learned deep multi-scale dense features. DRAN is the first single network able to produce the features for all three steps of visual localization. It achieves competitive robustness and accuracy under challenging conditions on public benchmarks, outperforming other unified approaches while requiring less computation and memory than counterparts that use multiple networks. Code and models will be publicly available at https://github.com/jmorlana/DRAN.
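As a rough illustration of the first stage only (image retrieval from global descriptors), the following is a minimal sketch and not DRAN's actual implementation. It assumes each image is already summarized by a fixed-length global descriptor and that candidates are ranked by cosine similarity. The function names, image identifiers, and three-dimensional toy descriptors are hypothetical. The re-ranking and feature-metric refinement stages are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_top_k(query, database, k=2):
    """Return the ids of the k database images most similar to the query.

    `database` maps a (hypothetical) image id to its global descriptor.
    Ties are broken by image id so the ordering is deterministic.
    """
    q = l2_normalize(query)
    scored = []
    for image_id, descriptor in database.items():
        d = l2_normalize(descriptor)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, image_id))
    # Highest similarity first, then alphabetical id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:k]]


# Toy descriptors (assumed values, for illustration only).
database = {
    "db_000": [1.0, 0.0, 0.0],
    "db_001": [0.9, 0.1, 0.0],
    "db_002": [0.0, 1.0, 0.0],
}
shortlist = retrieve_top_k([1.0, 0.05, 0.0], database, k=2)
print(shortlist)
```

In a pipeline like the one described, a shortlist of this kind would then be re-ranked with intermediate features before pose refinement. How DRAN scores and re-ranks candidates is not specified in the abstract, so cosine similarity here is an assumption.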