Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been used to regress the mapping from raw pixels to 3D coordinates in the scene, so that the matching is performed implicitly by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network that predicts per-pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The proposed method, an extension of HSCNet, allows us to train compact models that scale robustly to large environments. It sets a new state of the art for single-image localization on the 7-Scenes, 12-Scenes, and Cambridge Landmarks datasets, as well as on the combined indoor scenes.
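To make the coarse-to-fine idea concrete, the following is a minimal sketch of how a 3D scene coordinate can be decomposed into a coarse region label, a finer sub-region label, and a small residual, and then recovered from them. It is an illustration under stated assumptions, not the authors' implementation: the scene is partitioned with a uniform grid for simplicity, and the names `encode`, `decode`, `n_coarse`, and `n_fine` are hypothetical. In a learned system, a network would predict the labels and the residual from image pixels, which the sketch does not attempt.

```python
import numpy as np


def encode(coords, lo, hi, n_coarse=4, n_fine=4):
    """Split 3D coordinates into coarse cell index, fine sub-cell index, residual.

    Assumption: a uniform grid over the bounding box [lo, hi] stands in for
    whatever scene partitioning a real system would use.
    """
    coords = np.asarray(coords, dtype=float)
    coarse_size = (hi - lo) / n_coarse      # edge lengths of one coarse cell
    fine_size = coarse_size / n_fine        # edge lengths of one fine sub-cell

    # Coarse level: which large cell does each point fall into?
    coarse_idx = np.floor((coords - lo) / coarse_size)
    coarse_idx = np.clip(coarse_idx, 0, n_coarse - 1).astype(int)
    coarse_origin = lo + coarse_idx * coarse_size

    # Fine level: which sub-cell inside that coarse cell?
    fine_idx = np.floor((coords - coarse_origin) / fine_size)
    fine_idx = np.clip(fine_idx, 0, n_fine - 1).astype(int)

    # Residual: offset from the centre of the fine sub-cell.
    fine_center = coarse_origin + (fine_idx + 0.5) * fine_size
    residual = coords - fine_center
    return coarse_idx, fine_idx, residual


def decode(coarse_idx, fine_idx, residual, lo, hi, n_coarse=4, n_fine=4):
    """Recover 3D coordinates from the two label levels plus the residual."""
    coarse_size = (hi - lo) / n_coarse
    fine_size = coarse_size / n_fine
    fine_center = lo + coarse_idx * coarse_size + (fine_idx + 0.5) * fine_size
    return fine_center + residual
```

The point of the decomposition is that each stage faces a smaller problem: the two label levels are classification targets over a handful of cells, and the residual to be regressed is bounded by half the size of a fine sub-cell rather than by the extent of the whole scene.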