Implicit Learning of Scene Geometry from Poses for Global Localization

Global visual localization estimates the absolute pose of a camera from a single image in a previously mapped area. Obtaining the pose from a single image enables many robotics and augmented/virtual reality applications. Inspired by recent advances in deep learning, many existing approaches directly learn and regress the 6-DoF pose from an input image. However, these methods do not fully utilize the underlying scene geometry for pose regression. The challenge in monocular relocalization is the minimal availability of supervised training data, which consists only of the 6-DoF poses corresponding to the images. In this paper, we propose to use these minimal available labels (i.e., poses) to learn the underlying 3D geometry of the scene and to use that geometry to estimate the 6-DoF camera pose. We present a learning method that uses these pose labels and rigid alignment to learn two 3D geometric representations (X, Y, Z coordinates) of the scene, one in the camera coordinate frame and the other in the global coordinate frame. Given a single image, the method estimates these two 3D scene representations, which are then aligned to estimate a pose that matches the pose label. This formulation allows additional learning constraints to be included that minimize the 3D alignment error between the two 3D scene representations and the 2D re-projection error between the 3D global scene representation and the 2D image pixels, resulting in improved localization accuracy. During inference, our model estimates the 3D scene geometry in the camera and global frames and aligns them rigidly to obtain the pose in real time. We evaluate our work on three common visual localization datasets, conduct ablation studies, and show that our method exceeds the pose accuracy of state-of-the-art regression methods on all datasets.
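
The rigid alignment step and the two error terms described above can be illustrated with a small sketch. The abstract does not say which alignment algorithm is used, so the code below assumes the standard SVD-based (Kabsch) least-squares solution. The function names `rigid_align`, `alignment_error`, and `reprojection_error`, the optional per-point `weights`, the intrinsics matrix `K`, and the camera-to-global pose convention are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def rigid_align(cam_pts, glob_pts, weights=None):
    """Least-squares rigid transform (R, t) with glob_pts ~= R @ cam_pts + t.

    Assumption: standard Kabsch solution via SVD; the paper's actual
    alignment procedure may differ.
    cam_pts, glob_pts: (N, 3) arrays of corresponding 3D points.
    """
    cam_pts = np.asarray(cam_pts, dtype=float)
    glob_pts = np.asarray(glob_pts, dtype=float)
    n = cam_pts.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    mu_c = w @ cam_pts
    mu_g = w @ glob_pts
    xc = cam_pts - mu_c
    xg = glob_pts - mu_g

    # 3x3 weighted cross-covariance between the centered point sets.
    h = (xc * w[:, None]).T @ xg
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_g - r @ mu_c
    return r, t


def alignment_error(cam_pts, glob_pts, r, t):
    """Per-point 3D distance between aligned camera points and global points."""
    return np.linalg.norm(cam_pts @ r.T + t - glob_pts, axis=1)


def reprojection_error(glob_pts, pixels, r, t, k):
    """Per-point pixel distance after projecting global points into the image.

    Assumes (r, t) maps camera to global coordinates and k is a 3x3
    pinhole intrinsics matrix (both are illustrative conventions).
    """
    cam = (glob_pts - t) @ r          # rows are R^T (g - t): global -> camera
    proj = cam @ k.T                  # homogeneous pixel coordinates
    uv = proj[:, :2] / proj[:, 2:3]   # perspective division
    return np.linalg.norm(uv - pixels, axis=1)
```

In a training setting, the pose returned by `rigid_align` would be compared against the pose label, while the two error functions correspond to the additional 3D alignment and 2D re-projection constraints. A differentiable framework would be needed for actual learning; NumPy is used here only to show the geometry.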
