Place recognition is an important technique for autonomous cars to achieve full autonomy, since it can provide an initial guess to online localization algorithms. Although current methods based on images or point clouds have achieved satisfactory performance, localizing images on a large-scale point cloud map remains a largely unexplored problem. This cross-modal matching task is challenging because it is difficult to extract consistent descriptors from images and point clouds. In this paper, we propose the I2P-Rec method, which solves the problem by transforming the cross-modal data into the same modality. Specifically, we leverage the recent success of depth estimation networks to recover point clouds from images. We then project the point clouds into Bird's Eye View (BEV) images. Using the BEV image as an intermediate representation, we extract global features with a Convolutional Neural Network followed by a NetVLAD layer to perform matching. We evaluate our method on the KITTI dataset. The experimental results show that, with only a small set of training data, I2P-Rec achieves a Top-1 recall rate above 90\%. It also generalizes well to unseen environments, achieving Top-1\% recall rates above 80\% and 90\% when localizing monocular images and stereo images on point cloud maps, respectively.
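The image-to-BEV step described in the abstract can be illustrated with a minimal sketch. It assumes a dense depth map has already been predicted by some depth estimation network, a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and a simple point-density encoding for the BEV image. The function names, the camera-frame convention, and the density encoding are illustrative assumptions, not necessarily the choices made in I2P-Rec.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point]:
    """Back-project a dense depth map (metres) with a pinhole camera model.

    Assumed camera frame: x right, y down, z forward.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def points_to_bev(
    points: Sequence[Point],
    x_range: Tuple[float, float],
    z_range: Tuple[float, float],
    cell: float,
) -> List[List[int]]:
    """Project points onto the ground plane and count points per cell.

    Rows index forward distance (z), columns index lateral offset (x).
    The height coordinate (y) is discarded in this density encoding.
    """
    x_min, x_max = x_range
    z_min, z_max = z_range
    width = int(round((x_max - x_min) / cell))
    height = int(round((z_max - z_min) / cell))
    bev = [[0] * width for _ in range(height)]
    for x, _y, z in points:
        if not (x_min <= x < x_max and z_min <= z < z_max):
            continue  # outside the BEV window
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_min) / cell), width - 1)
        row = min(int((z - z_min) / cell), height - 1)
        bev[row][col] += 1
    return bev


if __name__ == "__main__":
    # Toy 2x2 depth map with one invalid pixel; intrinsics chosen for readability.
    toy_depth = [[2.0, 0.0], [4.0, 2.0]]
    cloud = depth_to_points(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    image = points_to_bev(cloud, x_range=(-4.0, 4.0), z_range=(0.0, 8.0), cell=2.0)
    for bev_row in image:
        print(bev_row)
```

Applying the same `points_to_bev` step to LiDAR map points would place both modalities in one image-like representation, which is what allows a single CNN with a NetVLAD layer to produce comparable global descriptors.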