Accurate and reliable ego-localization is critical for autonomous driving. In this paper, we present EgoVM, an end-to-end localization network that achieves localization accuracy comparable to prior state-of-the-art methods while using lightweight vectorized maps instead of heavy point-based maps. First, we extract bird's-eye-view (BEV) features from online multi-view images and LiDAR point clouds. Then, we employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation, so that their feature representation is consistent with the BEV features. Next, we feed map queries, composed of the learnable semantic embeddings and the coordinates of map elements, into a transformer decoder to perform cross-modality matching with the BEV features. Finally, we adopt a robust histogram-based pose solver that estimates the optimal pose by searching exhaustively over candidate poses. We comprehensively validate the effectiveness of our method on both the nuScenes dataset and a newly collected dataset. The experimental results show that our method achieves centimeter-level localization accuracy and outperforms existing methods that use vectorized maps by a large margin. Furthermore, our model has been extensively tested in a large fleet of autonomous vehicles under various challenging urban scenes.
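To make the final step concrete, the following is a minimal sketch of what an exhaustive, histogram-style pose search could look like. It is not the EgoVM implementation: the abstract does not specify how candidates are scored or how the histogram is reduced to a pose, so the function `solve_pose`, the `temperature` parameter, the softmax normalization, the probability-weighted mean, and the toy Gaussian score `toy_score` are all assumptions introduced here for illustration.

```python
import numpy as np


def solve_pose(score_fn, xs, ys, yaws, temperature=1.0):
    """Score every (x, y, yaw) candidate and reduce the scores to one pose.

    score_fn is a hypothetical callable that returns a matching score for
    each candidate offset; higher means a better map-to-BEV match.
    """
    # Build the full candidate grid, shape (len(xs), len(ys), len(yaws)).
    gx, gy, gyaw = np.meshgrid(xs, ys, yaws, indexing="ij")
    scores = score_fn(gx, gy, gyaw)

    # Softmax over all candidates gives a normalized 3D histogram.
    # (Assumption: the abstract only says the solver is histogram-based.)
    logits = (scores - scores.max()) / temperature
    hist = np.exp(logits)
    hist /= hist.sum()

    # Marginalize the histogram along each axis and take the expected value.
    px = hist.sum(axis=(1, 2))
    py = hist.sum(axis=(0, 2))
    pyaw = hist.sum(axis=(0, 1))
    pose = np.array([px @ xs, py @ ys, pyaw @ yaws])
    return pose, hist


# Toy usage: a made-up score that peaks at a known offset (metres, radians).
TRUE_OFFSET = (0.3, -0.2, 0.01)


def toy_score(x, y, yaw):
    tx, ty, tyaw = TRUE_OFFSET
    return (-((x - tx) ** 2 + (y - ty) ** 2) / (2 * 0.1 ** 2)
            - (yaw - tyaw) ** 2 / (2 * 0.01 ** 2))


xs = np.linspace(-1.0, 1.0, 21)      # 0.1 m steps
ys = np.linspace(-1.0, 1.0, 21)
yaws = np.linspace(-0.05, 0.05, 11)  # 0.01 rad steps
pose, hist = solve_pose(toy_score, xs, ys, yaws)
```

In this sketch the pose is the expectation under the histogram, which allows an estimate finer than the grid spacing. Taking the highest-scoring candidate with `np.unravel_index(hist.argmax(), hist.shape)` would be an equally plausible reading of "searching exhaustively".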