The fusion of LiDARs and cameras is increasingly adopted for perception tasks in autonomous driving. The performance of such fusion-based algorithms depends heavily on the accuracy of sensor calibration, which is challenging because common features are difficult to identify across data modalities. Many earlier calibration methods relied on specific targets and/or manual intervention, which proved cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance remains unsatisfactory in most cases. These methods typically suffer from sparse feature maps, unreliable cross-modality association, and inaccurate regression of calibration parameters. To address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to obtain high-resolution representations. A multi-head correlation module then identifies correlations between features more accurately. Finally, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieves a mean translation error of $0.8751 \mathrm{cm}$ and a mean rotation error of $0.0562^{\circ}$ on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization.
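To make the reported translation and rotation errors concrete, the following is a minimal sketch of how such errors could be computed from a predicted and a ground-truth LiDAR-to-camera extrinsic. It is an illustration under stated assumptions, not the evaluation protocol of the paper: the helper names (`rot_z`, `make_extrinsic`, `calibration_errors`) are hypothetical, translations are assumed to be in meters, and the rotation error is taken as the geodesic angle of the residual rotation, whereas a benchmark may instead average per-axis absolute errors.

```python
import numpy as np


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_extrinsic(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def calibration_errors(T_pred, T_gt):
    """Return (translation error in cm, rotation error in degrees).

    Assumption: both inputs are 4x4 rigid transforms with translations in meters.
    """
    # Residual transform that remains after undoing the ground truth.
    delta = np.linalg.inv(T_gt) @ T_pred
    trans_err_cm = np.linalg.norm(delta[:3, 3]) * 100.0
    # Geodesic angle of the residual rotation, from its trace.
    cos_angle = (np.trace(delta[:3, :3]) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return trans_err_cm, rot_err_deg


# Hypothetical example: prediction is off by 1 cm along z and 0.5 degrees about z.
T_gt = make_extrinsic(rot_z(10.0), [0.10, 0.20, 0.30])
T_pred = make_extrinsic(rot_z(10.5), [0.10, 0.20, 0.31])
trans_err_cm, rot_err_deg = calibration_errors(T_pred, T_gt)
```

Averaging these two quantities over a test set would give figures comparable in kind to the mean errors quoted above, provided the same error definition is used.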