EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization

Visual localization is the task of estimating the 6-DoF camera pose of a query image within a provided 3D reference map. Thanks to recent advances in 3D sensors, 3D point clouds are becoming a more accurate and affordable option for building the reference map, but matching the points of a 3D point cloud to the pixels of a 2D image for visual localization remains challenging. Existing approaches that jointly learn 2D-3D feature matching suffer from a low number of inliers due to representational differences between the two modalities, and methods that sidestep this problem by recasting matching as classification suffer from poor refinement. In this work, we propose EP2P-Loc, a novel large-scale visual localization method that mitigates this appearance discrepancy and enables end-to-end training for pose estimation. To increase the number of inliers, we propose a simple algorithm that removes 3D points invisible in the image, and we find all 2D-3D correspondences without keypoint detection. To reduce memory usage and search complexity, we take a coarse-to-fine approach: we extract patch-level features from 2D images, perform 2D patch classification for each 3D point, and then obtain the exact corresponding 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP solver for end-to-end training. In experiments on newly curated large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, we show that our method achieves state-of-the-art performance compared to existing visual localization and image-to-point cloud registration methods.
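
To make the coarse stage concrete, the following is a minimal sketch of how a 3D point could be assigned to a 2D patch when the camera pose is known, for example to produce training targets for patch classification. It is an illustration under stated assumptions, not the paper's implementation: the function name `project_to_patches`, the pinhole camera model, the row-major patch indexing, and the default patch size of 16 are all hypothetical. It only discards points outside the view frustum; it does not perform the occlusion-aware invisible point removal that the abstract mentions.

```python
def project_to_patches(points, R, t, fx, fy, cx, cy, width, height, patch=16):
    """Project world-frame 3D points into an image and assign each to a patch.

    Assumes a pinhole camera with x_cam = R @ X + t (hypothetical convention).
    Returns one entry per point: None if the point lies outside the view
    frustum, otherwise (patch_index, (u, v)) with row-major patch_index.
    """
    cols = -(-width // patch)  # patches per row, rounded up
    results = []
    for X in points:
        # Transform the point from the world frame to the camera frame.
        xc, yc, zc = (
            sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)
        )
        if zc <= 0:
            # Behind the camera: cannot correspond to any pixel.
            results.append(None)
            continue
        # Perspective projection to pixel coordinates.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if not (0 <= u < width and 0 <= v < height):
            # Projects outside the image bounds.
            results.append(None)
            continue
        # Coarse label: index of the patch containing the pixel.
        patch_index = int(v // patch) * cols + int(u // patch)
        results.append((patch_index, (u, v)))
    return results
```

In this sketch the patch index plays the role of the coarse classification target, and the retained sub-patch coordinates `(u, v)` are what a fine stage would need to recover. How EP2P-Loc actually regresses the pixel location from positional encodings is not specified in the abstract and is not modeled here.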
