High-precision vehicle localization with commercial setups is a crucial technique for high-level autonomous driving tasks. Localization with a monocular camera in a LiDAR map is a recently emerged approach that achieves a promising balance between cost and accuracy, but estimating pose by finding correspondences between such cross-modal sensor data is challenging, which degrades localization accuracy. In this paper, we address the problem by proposing a novel Transformer-based neural network that registers 2D images to a 3D LiDAR map in an end-to-end manner. Poses are implicitly represented as high-dimensional feature vectors called pose queries, which are iteratively updated by interacting with relevant information retrieved from cross-modal features through an attention mechanism in the proposed POse Estimator Transformer (POET) module. Moreover, we apply a multiple-hypotheses aggregation method that estimates the final poses by performing parallel optimization on multiple randomly initialized pose queries to reduce network uncertainty. Comprehensive analysis and experimental results on a public benchmark show that the proposed image-to-LiDAR map localization network can achieve state-of-the-art performance in challenging cross-modal localization tasks.
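To make the pose-query idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head scaled dot-product cross-attention with a residual update, random untrained weights, a hypothetical linear head `w_pose` that decodes each query into a 6-vector (for example, translation plus axis-angle rotation), and a per-component median as the aggregation rule. The abstract does not specify any of these details, and all function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, w_q, w_k, w_v):
    # Pose queries attend to the (assumed already fused) cross-modal feature tokens.
    q = queries @ w_q                          # (H, d)
    k = features @ w_k                         # (N, d)
    v = features @ w_v                         # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (H, N)
    return softmax(scores, axis=-1) @ v        # (H, d)

def refine_pose_queries(queries, features, params, num_iters=3):
    # Iterative update: each pass adds the retrieved information to the queries.
    # Reusing the same weights on every pass is an assumption of this sketch.
    for _ in range(num_iters):
        queries = queries + cross_attention(queries, features, *params)
    return queries

def aggregate_hypotheses(queries, w_pose):
    # Decode every hypothesis with a hypothetical linear head, then take the
    # per-component median (an assumed aggregation rule).
    poses = queries @ w_pose                   # (H, 6)
    return np.median(poses, axis=0), poses

rng = np.random.default_rng(0)
d, num_tokens, num_hypotheses = 32, 100, 8
features = rng.normal(size=(num_tokens, d))    # stand-in for image/LiDAR features
params = tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
w_pose = rng.normal(scale=d ** -0.5, size=(d, 6))

# Several randomly initialized pose queries are refined in parallel.
queries = rng.normal(size=(num_hypotheses, d))
refined = refine_pose_queries(queries, features, params)
final_pose, all_poses = aggregate_hypotheses(refined, w_pose)
print(final_pose.shape, all_poses.shape)       # (6,) (8, 6)
```

Because the weights here are random, the resulting pose is meaningless; the sketch only illustrates the data flow of refining several hypotheses in parallel and combining them. A per-component median over rotation parameters is also a crude choice, and a trained system would likely aggregate rotations more carefully.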